Date: 2026-03-23
Scope: protoc --decode text output (canonical wire input, schema present)
This document reverse-engineers the exhaustive text output syntax of
protoc --decode, identifies every point where prototext currently
diverges, and classifies each divergence by severity.
protoc --decode invokes TextFormat::Print(*message, &out) using a
default-constructed Printer (protobuf text_format.cc, entry point
command_line_interface.cc ~line 3192). The default printer uses:
DebugStringFieldValuePrinter — strings and bytes escaped with
absl::CEscape(), not UTF-8-safe.use_field_number_ = false — known fields rendered by name, not number.use_short_repeated_primitives_ = false — repeated fields always one
element per line, never bracket notation.protoc --decode emits no header line before the message body.
prototext emits #@ prototext: protoc\n as the first line.
Divergence D1 — prototext-specific header. Acceptable: the header is a prototext extension.
--no-annotationsdoes not suppress it (the header is always emitted). Impact on protoc compatibility: the header line is present in prototext output but absent from protoc output. Not a concern for the protoc-superset goal (which only requires that the body fields match), but worth documenting.
| Situation | protoc output | prototext output |
|---|---|---|
| Known field | field_name: value | field_name: value ✓ |
| Extension field | [pkg.ExtName]: value | (unknown — see below) |
| Unknown field (no schema match) | field_number: ... | field_number: ... ✓ |
Unknown fields without annotations are silently skipped by prototext
(render_len_field returns early when annotations=false).
Divergence D2 — unknown fields suppressed without annotations.
protoc --decodealways renders unknown fields (by field number).prototext --no-annotationssilently drops them. This is a known, documented limitation.
Both use : (colon + space) between field name/number and scalar value.
Message and group fields use { (space + open brace, no colon) in both.
No divergence.
Both use 2 spaces per nesting level.
No divergence.
All integer wire types render as signed or unsigned decimal:
| Type | Sign | protoc | prototext |
|---|---|---|---|
int32, int64 | signed | -42 | -42 ✓ |
uint32, uint64 | unsigned | 42 | 42 ✓ |
sint32, sint64 | signed (zigzag-decoded) | -42 | -42 ✓ |
fixed32, fixed64 | unsigned | 42 | 42 ✓ |
sfixed32, sfixed64 | signed | -42 | -42 ✓ |
bool | — | true / false | true / false ✓ |
No hex notation for any integer type in known fields.
No divergence.
| Wire type | protoc | prototext (annotations=true) |
|---|---|---|
| VARINT (WT=0) | N: <decimal> | N: <decimal> ✓ |
| FIXED32 (WT=5) | N: 0x<8 lowercase hex digits> | N: 0x<8 lowercase hex digits> ✓ |
| FIXED64 (WT=1) | N: 0x<16 lowercase hex digits> | N: 0x<16 lowercase hex digits> ✓ |
protoc renders unknown VARINTs as unsigned decimal
(absl::StrCat(field.varint())). Prototext renders them the same way
(format_wire_varint_protoc).
No divergence for annotations=true. With annotations=false, unknown fields are suppressed (Divergence D2 above).
protoc attempts to parse the payload as a sub-message (recursion limit 10).
If successful, it renders as a nested { } block with no colon.
If not, it renders as a CEscaped quoted string with a colon.
prototext (annotations=true) always renders unknown LEN payloads as a quoted escaped bytes string:
N: "\001\002\003" #@ bytes
It does not attempt sub-message parsing for unknown fields.
Divergence D3 — unknown LEN fields: no sub-message heuristic.
protoc --decodespeculatively parses unknown LEN payloads as nested messages; prototext always renders them as raw bytes. Impact: for canonical input where an unknown LEN field happens to contain a valid sub-message, protoc renders nested structure while prototext renders opaque bytes. This only affects truly unknown fields (no schema match), which are outside the primary use case. Severity: low.
| Value | protoc | prototext (current) |
|---|---|---|
| Canonical quiet NaN | nan | nan ✓ |
| Any non-canonical NaN | nan | nan(0x...) ✗ |
| +Infinity | inf | inf ✓ |
| -Infinity | -inf | -inf ✓ |
Divergence D4 — non-canonical NaN value token. Already identified in spec 0010 (change B). For canonical wire input this is not triggered (canonical NaN is always
0x7FC00000/0x7FF8000000000000). Fix: movenan(0x...)to annotation modifiernan_bits:.
protoc uses:
%g with 6 significant digits, retry at 9 if not round-trip exact.%g with 15 significant digits, retry at 17 if not round-trip exact.prototext uses the same algorithm (common.rs format_float_protoc /
format_double_protoc).
No divergence in precision.
protoc uses absl::CEscape() with use_hex=false, utf8_safe=false.
Named escapes produced: \n, \r, \t, \", \', \\ only.
All other non-printable bytes (including \x00–\x1F, \x7F, \x80–\xFF)
use 3-digit octal (\NNN).
Notable: \x07 (BEL) → \007 (not \a); \x08 (BS) → \010 (not \b);
\x0B (VT) → \013 (not \v); \x0C (FF) → \014 (not \f).
High bytes (\x80–\xFF) → 3-digit octal, never hex.
Multi-byte UTF-8 characters → each byte separately as 3-digit octal.
prototext (common.rs escape_bytes_into / escape_string_into) uses
the same named escapes (\n, \r, \t, \", \', \\) and 3-digit
octal for everything else.
The implementations must be verified to match exactly on the BEL/BS/VT/FF
boundary cases and on high bytes. A quick review of common.rs:31-96
shows the escape table matches protoc's CEscape output.
No divergence (pending byte-level verification of edge cases — see recommended test in §4).
| Case | protoc | prototext |
|---|---|---|
| Known value | Symbolic name (e.g. GREEN) | Symbolic name ✓ |
| Unknown value | Decimal integer (e.g. 10) | Decimal integer ✓ |
No divergence.
protoc always uses { } (never < >). Colon is omitted before {.
prototext uses { } with no colon.
No divergence in delimiter style.
protoc emits each } on its own line at the correct indentation level.
prototext emits each } on its own line. The codebase contains a
CBL_START thread-local and comments describing a close-brace folding
mechanism, but the folding logic itself has been removed. write_close_brace
(helpers.rs:117) simply writes }\n; no patching or folding occurs.
CBL_START is maintained as dead scaffolding.
No divergence.
protoc uses the capitalized message type name (not the field name) as
the identifier, e.g. OptionalGroup {.
prototext uses the same convention (verified in render_group_field).
No divergence.
Both render one element per line with the field name repeated:
int32Rp: 1
int32Rp: 2
int32Rp: 3
No divergence.
protoc renders packed repeated fields identically to non-packed repeated
fields — one element per line, field name repeated. The bracket notation
([1, 2, 3]) is not produced by protoc --decode.
prototext currently renders packed fields using bracket notation:
int32Pk: [1, 2, 3] #@ repeated int32 [packed=true] = 85
Divergence D5 — packed field bracket syntax. Already identified in spec 0010 (change A). This is the most visible format divergence. Fix: per-line rendering with
pack_sizeboundary markers.
protoc renders extension fields with their bracketed fully-qualified name:
[com.example.MyExtension]: value
prototext has no extension field support. Extension fields would appear as unknown fields.
Divergence D6 — extension fields. Out of scope for the current compatibility work.
| ID | Description | Canonical input affected? | Severity | Tracked in |
|---|---|---|---|---|
| D1 | prototext header line | Yes (always present) | Cosmetic | — |
| D2 | Unknown fields suppressed without annotations | Yes (unknown fields) | Known limitation | — |
| D3 | Unknown LEN: no sub-message heuristic | No (requires unknown field) | Low | — |
| D4 | Non-canonical NaN value token nan(0x...) | No (canonical NaN is always nan) | Low | spec 0010 change B |
| D5 | Packed field bracket syntax | Yes | High | spec 0010 change A |
| D6 | Extension fields not supported | Yes (if extensions present) | Out of scope | — |
For canonical wire input with a complete schema, only D1 (header) and D5 (packed syntax) affect the output. D1 is cosmetic and intentional. D5 is the primary fix target (spec 0010).
The following edge cases are not currently covered by the test suite and should be added (separately from or as part of spec 0009's e2e suite):
String escaping edge cases: BEL (\x07), BS (\x08), VT (\x0B),
FF (\x0C), DEL (\x7F), null (\x00), high bytes (\x80–\xFF),
multi-byte UTF-8 sequences — verify byte-for-byte match with protoc output.
Unknown VARINT large values: verify uint64 max
(18446744073709551615) renders as unsigned decimal, not signed.
-inf and inf for float and double: confirm rendering matches
protoc exactly (lowercase, no leading +).
Packed empty record: verify the empty-packed case is handled when spec 0010 change A is implemented.
Repeated non-packed vs packed canonical: verify that with annotations disabled, a canonical packed field renders identically to a non-packed repeated field (after spec 0010 change A).
src/google/protobuf/text_format.cc — protoc text printersrc/google/protobuf/io/strtod.cc — SimpleFtoa / SimpleDtoaabsl/strings/escaping.cc — absl::CEscapesrc/google/protobuf/compiler/command_line_interface.cc — --decode entryprototext-core/src/serialize/render_text/mod.rs — prototext top-level rendererprototext-core/src/serialize/render_text/helpers.rs — field rendering helpersprototext-core/src/serialize/common.rs — float/string formattingdocs/specs/0010-protoc-compatibility.md — planned fixes for D4 and D5