0004 — Enum annotation syntax and #@ delimiter

Status: implemented
App: prototext
Implemented in: 2026-03-11

Problem

Three related issues affect the annotation syntax emitted by the protoc kernel:

  1. Ambiguous delimiter. Annotations use # — the standard proto comment character — with no visual distinction from free-form human comments. This makes it hard for readers (and future parsers) to recognise that annotations are machine-generated and carry semantics.

  2. Numeric enum values. When a field with an enum type is decoded, the emitted value is the raw integer (9), not the symbolic constant (TYPE_STRING). The original intent was to match protoc --decode output, but protoc --decode actually emits the symbolic name. Using the numeric value is therefore both a divergence from the reference and a usability regression.

  3. Latent encoder bug: enum type name collides with primitive name. The encoder (encode_text.rs) dispatches on the type token in the annotation to determine the wire encoding. For unrecognised tokens it falls back to raw varint — which is correct for enums by coincidence. However, proto syntax permits enum type names that collide with primitive keywords (e.g. enum float { … }). In that case the encoder would match the "float" arm and silently emit a fixed32 value instead of a varint, producing wrong wire bytes. The parenthesised format introduced in this spec eliminates this bug (see §5.1).

Current output (erroneous, field number 5, numeric enum value 9):

type: 9  # Type = 5

Target output after this spec:

type: TYPE_STRING  #@ Type(9) = 5

Reading the annotation: Type(9) means enum type Type with raw wire value 9; the trailing = 5 means field number 5.

Goals

Non-goals


Specification

1. Delimiter change: # → #@

Every field-level annotation currently starts with "  # " (two spaces, hash, space). After this change it starts with "  #@ " (two spaces, hash, at-sign, space).

This applies to all annotation tokens — wire-type labels, field declarations, modifiers — not just enum annotations.

Examples:

Before                                After
9: 5  # varint                        9: 5  #@ varint
name: "hello"  # string = 3           name: "hello"  #@ string = 3
9: 5  # varint; TAG_OOR               9: 5  #@ varint; TAG_OOR
items: 3  # repeated int32 = 2        items: 3  #@ repeated int32 = 2

The separator between annotation tokens remains "; " (semicolon-space).

2. Enum field rendering

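When a field has proto type ENUM and the schema contains a value name for the decoded integer, the value is rendered as the symbolic name and the annotation carries the enum type name followed by the raw integer in parentheses.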

Full format for a known enum value:

<field_name>: <SYMBOLIC_NAME>  #@ <EnumTypeName>(<numeric>) = <field_number>

Example (field number 5, enum type Type, numeric value 9 = TYPE_STRING):

type: TYPE_STRING  #@ Type(9) = 5

For a repeated enum field (field number 2, type Label, value 1 = LABEL_OPTIONAL):

label: LABEL_OPTIONAL  #@ repeated Label(1) = 2

For a packed repeated enum field:

label: LABEL_OPTIONAL  #@ repeated Label(1) [packed=true] = 2

The parenthesised numeric suffix — (<numeric>) — is unique to enum annotations. No other annotation token uses parentheses, so the format is unambiguous.

Field declaration structure for enums

For non-enum known fields the annotation contains:

[repeated |required ]<type_or_display_name>[ [packed=true]] = <field_number>

For enum known fields the annotation contains:

[repeated |required ]<EnumTypeName>(<numeric>)[ [packed=true]] = <field_number>

where <EnumTypeName> is the short (unqualified) enum type name (last component of the fully-qualified type name).

optional label

optional continues to be omitted as the default label.
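
As an illustration of the enum declaration grammar above, here is a minimal Rust sketch that builds the declaration part of the annotation. The names Label and format_enum_decl are hypothetical and are not taken from the prototext source; the sketch only assumes the rules stated in this section (short type name, omitted optional label, optional [packed=true] suffix).

```rust
/// Field label as it appears in the annotation. `Optional` is the default
/// and is never printed. (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Label {
    Optional,
    Required,
    Repeated,
}

/// Builds the enum field declaration, e.g. `repeated Label(1) [packed=true] = 2`.
pub fn format_enum_decl(
    label: Label,
    fq_enum_name: &str,
    numeric: i32,
    packed: bool,
    field_number: u32,
) -> String {
    // Short name = last dot-separated component of the fully-qualified name.
    let short = fq_enum_name.rsplit('.').next().unwrap_or(fq_enum_name);
    let prefix = match label {
        Label::Optional => "",
        Label::Required => "required ",
        Label::Repeated => "repeated ",
    };
    let packed_suffix = if packed { " [packed=true]" } else { "" };
    format!("{prefix}{short}({numeric}){packed_suffix} = {field_number}")
}
```

Called with Label::Optional, ".google.protobuf.FieldDescriptorProto.Type", 9, false and 5, this sketch yields the declaration used in the Type(9) = 5 example.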

3. Unknown enum values — ENUM_UNKNOWN modifier

When the decoded integer is not present in the enum's value table in the schema (an unrecognised value), the field is rendered with the raw integer as its value and an ENUM_UNKNOWN modifier appended:

type: 99  #@ Type(99) = 5; ENUM_UNKNOWN

Casing rationale

ENUM_UNKNOWN uses ALL_CAPS to match the existing token convention for anomaly flags (TYPE_MISMATCH, TAG_OOR, TRUNCATED_BYTES): the wire type is correct (varint), the field is schema-known, but the value is outside the declared enum set — a semantic anomaly the user should notice.

Round-trip for unknown enum values

The encoder sees the raw integer 99 as the value token and encodes it directly as a varint. The ENUM_UNKNOWN modifier is ignored by the encoder (comments-are-stripped rule applies). Round-trip is lossless.
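
The known and unknown cases from sections 2 and 3 can be sketched together. The function below is a hypothetical helper, not the actual decoder; it assumes an optional scalar field and a value table already sorted by numeric value.

```rust
/// Renders one optional scalar enum field as
/// `name: VALUE  #@ Type(N) = field_number`, appending `; ENUM_UNKNOWN`
/// when `raw` is absent from the table.
/// `enum_values` must be sorted by numeric value (assumption of this sketch).
pub fn render_enum_field(
    field_name: &str,
    enum_type: &str,
    field_number: u32,
    raw: i32,
    enum_values: &[(i32, &str)],
) -> String {
    match enum_values.binary_search_by_key(&raw, |&(n, _)| n) {
        // Known value: symbolic name on the left, raw integer in the annotation.
        Ok(i) => format!(
            "{field_name}: {}  #@ {enum_type}({raw}) = {field_number}",
            enum_values[i].1
        ),
        // Unknown value: raw integer on both sides plus the anomaly flag.
        Err(_) => format!(
            "{field_name}: {raw}  #@ {enum_type}({raw}) = {field_number}; ENUM_UNKNOWN"
        ),
    }
}
```

Because the raw integer appears in the annotation in both branches, an encoder can recover the wire value without consulting the table.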

4. Schemaless / unknown field rendering

When no schema is available (schemaless mode), enum fields cannot be identified; they are emitted as plain varint with the numeric value:

9: 5  #@ varint

5. Round-trip (encode)

The annotation already carries the numeric value (Type(9)), so the encoder does not need to resolve symbolic names via a schema lookup. The encode path works as follows:

Consequence: the encoder requires no schema access and no name-resolution logic. Lossless round-trip is guaranteed by the annotation carrying the numeric value explicitly.
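
A minimal sketch of a schema-free encode path for a scalar enum field, assuming the numeric value is taken from the parenthesised suffix of the type token. The function names are hypothetical, and the sketch ignores modifiers such as truncated_neg; it emits the canonical protobuf varint only.

```rust
/// Extracts the numeric value from an enum type token such as `Type(9)`.
/// Returns `None` when the token has no parenthesised integer suffix.
pub fn enum_numeric(type_token: &str) -> Option<i64> {
    let open = type_token.find('(')?;
    let inner = type_token[open + 1..].strip_suffix(')')?;
    inner.trim().parse().ok()
}

/// Appends `value` to `out` as a protobuf base-128 varint.
pub fn put_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Encodes one scalar enum field: tag (wire type 0) followed by the value.
/// No schema and no symbolic-name lookup is involved.
pub fn encode_enum_field(field_number: u32, type_token: &str) -> Option<Vec<u8>> {
    let numeric = enum_numeric(type_token)?;
    let mut out = Vec::new();
    put_varint(u64::from(field_number) << 3, &mut out); // wire type 0 = varint
    // Negative values are sign-extended to 64 bits, as protobuf does for enums.
    put_varint(numeric as u64, &mut out);
    Some(out)
}
```

For field number 5 and token Type(9), the sketch produces the tag byte 0x28 followed by 0x09.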

5.1 Enum type name vs primitive name disambiguation

The encoder's encode_num function dispatches on ann.field_type (a &str extracted from the annotation) to select the wire encoding:

field_type token                      Wire encoding
"double", "fixed64", "sfixed64"       fixed 64-bit
"float", "fixed32", "sfixed32"        fixed 32-bit
"sint32", "sint64"                    zigzag varint
"bool"                                varint (masked to 1 bit)
"int32", "enum"                       varint (with truncation flag)
"uint32", "int64", "uint64", …        varint
anything else (_)                     varint fallback

Before this spec, enum fields emit e.g. Label = 4 in the annotation. If an enum is named float, the annotation would be float = 4, and the encoder would match the "float" arm and emit a fixed32 value.

After this spec, enum fields emit Label(1) = 4. The ( character cannot appear in any primitive type name, so parse_field_decl_into detects ( unconditionally and routes it through the varint path. The primitive dispatch table is never consulted for enum fields.
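
The disambiguation can be illustrated with a simplified dispatch. WireEncoding and select_encoding are hypothetical names; the real encode_num has more arms (bool masking, truncation flags) that this sketch folds into the plain varint case.

```rust
/// Simplified set of wire encodings (hypothetical, for illustration).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum WireEncoding {
    Fixed64,
    Fixed32,
    ZigzagVarint,
    Varint,
}

/// Selects the encoding for a field type token taken from the annotation.
pub fn select_encoding(field_type: &str) -> WireEncoding {
    // Enum tokens carry a parenthesised suffix. Checking for it first means
    // a token like `float(1)` never reaches the "float" arm below.
    if field_type.contains('(') {
        return WireEncoding::Varint;
    }
    match field_type {
        "double" | "fixed64" | "sfixed64" => WireEncoding::Fixed64,
        "float" | "fixed32" | "sfixed32" => WireEncoding::Fixed32,
        "sint32" | "sint64" => WireEncoding::ZigzagVarint,
        // Includes the old bare enum form (e.g. `Label`), correct only by coincidence.
        _ => WireEncoding::Varint,
    }
}
```

With this ordering, the old token float still selects fixed 32-bit, while the new token float(1) selects varint.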

6. Schema changes — Rust

FieldInfo gains one new field (in prototext-core/src/schema.rs):

/// Numeric value → symbolic name table for ENUM fields.
/// Populated at schema-parse time; empty for non-ENUM fields.
/// Sorted by numeric value for O(log n) lookup via binary_search_by_key.
pub enum_values: Box<[(i32, Box<str>)]>,

Data structure rationale:

Build procedure

build_message_schema is updated with a two-pass approach:

  1. Collect enums: walk all EnumDescriptorProto entries in all FileDescriptorProto files, building a temporary HashMap<String, Vec<(i32, Box<str>)>> keyed by fully-qualified enum type name (e.g. .google.protobuf.FieldDescriptorProto.Type). Sort each Vec by numeric value.
  2. Resolve per field: for each FieldInfo with proto_type == ENUM, look up enum_type_name in the temporary map, sort the entries by i32 key, and store as Box<[(i32, Box<str>)]>. Fields with an unresolvable enum type get an empty slice.
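
The two passes can be sketched with simplified stand-in types. EnumDesc, the reduced FieldInfo and resolve_enum_values are hypothetical and much smaller than the real descriptor and schema structures; this sketch sorts once, in pass 1.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an enum descriptor (hypothetical).
pub struct EnumDesc {
    pub fq_name: String,
    pub values: Vec<(i32, String)>,
}

/// Simplified stand-in for FieldInfo; `enum_type_name` is None for non-enum fields.
pub struct FieldInfo {
    pub enum_type_name: Option<String>,
    pub enum_values: Box<[(i32, Box<str>)]>,
}

pub fn resolve_enum_values(enums: &[EnumDesc], fields: &mut [FieldInfo]) {
    // Pass 1: collect every enum into a map keyed by fully-qualified name,
    // with entries sorted by numeric value.
    let mut by_name: HashMap<&str, Vec<(i32, Box<str>)>> = HashMap::new();
    for e in enums {
        let mut vals: Vec<(i32, Box<str>)> = e
            .values
            .iter()
            .map(|(n, s)| (*n, s.as_str().into()))
            .collect();
        vals.sort_by_key(|entry| entry.0);
        by_name.insert(e.fq_name.as_str(), vals);
    }
    // Pass 2: attach the table to each enum field; unresolvable or non-enum
    // fields get an empty slice.
    for f in fields.iter_mut() {
        f.enum_values = f
            .enum_type_name
            .as_deref()
            .and_then(|name| by_name.get(name))
            .map(|v| v.clone().into_boxed_slice())
            .unwrap_or_default();
    }
}
```

Cloning the table per field keeps each FieldInfo self-contained; sharing one table behind a reference-counted pointer would be an alternative if many fields use the same enum.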

Additional implementation hazards

ENUM_UNKNOWN silencing in parse_annotation

In encode_text.rs, parse_annotation handles bare tokens (no :, no =) with a match token that explicitly silences "TAG_OOR", "ETAG_OOR", and "TYPE_MISMATCH". Everything else falls to _ => ann.wire_type = token, which would set ann.wire_type = "ENUM_UNKNOWN". The fix is to add "ENUM_UNKNOWN" to the explicit ignore list:

"TAG_OOR" | "ETAG_OOR" | "TYPE_MISMATCH" | "ENUM_UNKNOWN" => {}

Bounds check update in split_at_annotation

The current bounds check after memrchr(b'#') is:

p + 1 < b.len() && b[p + 1] == b' '

After the change:

p + 2 < b.len() && b[p + 1] == b'@' && b[p + 2] == b' '

The bound must increase from p + 1 to p + 2 to avoid an out-of-bounds read when # is the second-to-last byte of the line.
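
A self-contained sketch of the updated check follows. It uses the standard library's rposition as a stand-in for memrchr, and the return type is an assumption of this sketch rather than the real split_at_annotation signature.

```rust
/// Splits a line into (value part, annotation body) at the last `#@ `.
/// The annotation is `None` when the last `#` does not start a `#@ ` delimiter.
pub fn split_at_annotation(line: &str) -> (&str, Option<&str>) {
    let b = line.as_bytes();
    // std-only stand-in for memrchr(b'#', b).
    if let Some(p) = b.iter().rposition(|&c| c == b'#') {
        // p + 2 must be in bounds before b[p + 2] is read.
        if p + 2 < b.len() && b[p + 1] == b'@' && b[p + 2] == b' ' {
            return (line[..p].trim_end(), Some(&line[p + 3..]));
        }
    }
    (line, None)
}
```

A line ending in "#@" (hash as the second-to-last byte) takes the None branch instead of indexing past the end of the buffer.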

Packed enum decoder — structural change

The existing decode_packed_to_str / decode_packed_varints_to_str functions return a single formatted String such as "[1, 2, 3]" for the entire value list. For enum fields, each integer must become a symbolic name; the numeric values must additionally be collected for the annotation.

decode_packed_varints_to_str is extended to carry a parallel Vec<i32> of raw numeric values for enum fields. render_packed then:

For elements not found in fi.enum_values, the raw integer is emitted at that position and ; ENUM_UNKNOWN is appended (one modifier covers all unknown values — no per-element flag).
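
A rough sketch of the packed rendering rule, assuming a sorted value table. PackedEnumText and render_packed_enum are hypothetical; the sketch returns the pieces (value list, numeric list, unknown flag) rather than a full line, since the exact line layout is not fixed by this paragraph.

```rust
/// Pieces needed to render a packed enum field (hypothetical type).
pub struct PackedEnumText {
    /// Left-hand value list, e.g. "[GREEN, 99, BLUE]".
    pub values: String,
    /// Raw numerics for the annotation, e.g. "[1, 99, 2]".
    pub numerics: String,
    /// True if at least one element was not in the table; the caller
    /// appends a single `; ENUM_UNKNOWN` in that case.
    pub any_unknown: bool,
}

/// `enum_values` must be sorted by numeric value (assumption of this sketch).
pub fn render_packed_enum(raw: &[i32], enum_values: &[(i32, &str)]) -> PackedEnumText {
    let mut names = Vec::with_capacity(raw.len());
    let mut any_unknown = false;
    for &n in raw {
        match enum_values.binary_search_by_key(&n, |&(k, _)| k) {
            Ok(i) => names.push(enum_values[i].1.to_string()),
            Err(_) => {
                // Unknown element: keep the raw integer at this position.
                any_unknown = true;
                names.push(n.to_string());
            }
        }
    }
    let numerics: Vec<String> = raw.iter().map(|n| n.to_string()).collect();
    PackedEnumText {
        values: format!("[{}]", names.join(", ")),
        numerics: format!("[{}]", numerics.join(", ")),
        any_unknown,
    }
}
```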

Packed enum encoder — value list ignored

encode_packed_array_line iterates over the comma-separated elements of the [v1, v2, …] LHS list, calling parse_num(elem) for each element. After this spec the LHS list will contain symbolic names (e.g. [LABEL_OPTIONAL, LABEL_REQUIRED]), for which parse_num returns None.

The fix: the encoder ignores the LHS value list for enum fields and instead extracts the numeric values from Label([1, 2]) in the annotation. A new Ann field (enum_packed_values: Vec<i64>) is populated by parse_field_decl_into when it detects the ([…]) form.
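
Extracting the numeric list from the ([…]) form might look like the following. parse_enum_packed_values is a hypothetical free function standing in for the relevant branch of parse_field_decl_into; it operates on the type token alone (for example Label([1, 2])).

```rust
/// Parses the numeric list out of a packed enum type token such as
/// `Label([1, 2])`. Returns `None` if the token is not in the `([...])` form
/// or if any element is not an integer.
pub fn parse_enum_packed_values(type_token: &str) -> Option<Vec<i64>> {
    let open = type_token.find("([")?;
    let inner = type_token[open + 2..].strip_suffix("])")?;
    if inner.trim().is_empty() {
        return Some(Vec::new());
    }
    // Collecting into Option<Vec<_>> fails the whole parse on the first bad element.
    inner
        .split(',')
        .map(|s| s.trim().parse::<i64>().ok())
        .collect()
}
```

The symbolic names on the left-hand side never enter this function, which is why parse_num failing on them no longer matters.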

Truncated-negative enum values

A packed or scalar enum field with a truncated 5-byte negative value is annotated by the decoder with a truncated_neg modifier. This modifier is used by the encoder to select the 5-byte encoding path.

After this spec the value is rendered symbolically (if the decoded i32 is in enum_values) or as the raw i32 (if unknown). The truncated_neg modifier in the annotation continues to carry the encoding information. No change is needed for this case.
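
For context, the difference between the two encodings of a negative value can be sketched as below. This assumes that "truncated" means only the low 32 bits are emitted (5 bytes) instead of the sign-extended 64-bit form (10 bytes); encode_i32_varint is a hypothetical helper, not the encoder's real function.

```rust
/// Encodes an i32 as a varint. With `truncated_neg` set, only the low 32 bits
/// are emitted (5 bytes for a negative value); otherwise the value is
/// sign-extended to 64 bits (10 bytes for a negative value).
pub fn encode_i32_varint(value: i32, truncated_neg: bool) -> Vec<u8> {
    let mut v: u64 = if truncated_neg {
        u64::from(value as u32)
    } else {
        value as i64 as u64
    };
    let mut out = Vec::new();
    while v >= 0x80 {
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
    out
}
```

Both forms decode to the same i32, so the symbolic rendering of the value does not affect which byte sequence is reproduced; only the modifier does.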


New test schema: enum_collision.proto

A new proto schema fixtures/schemas/enum_collision.proto contains:

syntax = "proto2";

// An enum whose name collides with a primitive keyword.
// Under the old annotation format this would be encoded as fixed32 (wrong).
// Under the new format the (N) suffix makes it unambiguously a varint.
enum float {
  FLOAT_ZERO  = 0;
  FLOAT_ONE   = 1;
  FLOAT_TWO   = 2;
}

// A normal enum with a non-colliding name, for the happy-path and
// ENUM_UNKNOWN cases.
enum Color {
  RED   = 0;
  GREEN = 1;
  BLUE  = 2;
}

message EnumCollision {
  optional float  kind      = 1;  // enum named after primitive keyword
  optional Color  color     = 2;  // normal enum, known value
  optional Color  unknown_color = 3;  // populated with value 99
  repeated Color  colors    = 4;
  repeated Color  colors_pk = 5 [packed=true];
  optional EnumCollision nested = 6;   // for nesting tests
  optional group EnumGroup = 7 {       // for group + enum tests
    optional Color group_color = 1;
  }
}

The compiled .pb descriptor lives at fixtures/schemas/enum_collision.pb.


Fixture coverage

Four core fixtures exercise this spec's paths:

Fixture name                      Purpose
enum_collision_float_kind         Enum named float; exercises the primitive-keyword collision path
enum_collision_color_known        Normal enum, value present in schema; happy path
enum_collision_color_unknown      Normal enum, value 99 not in schema; exercises ENUM_UNKNOWN
enum_collision_color_packed       Packed repeated enum

All four must pass the round-trip invariant:

wire → [decode] → text → [encode] → wire'
assert wire' == wire

References