prototext performance

Environment: Linux 6.12.57 x86-64, single core (virtualised). Criterion: 100 samples, 3 s warmup. Workload: descriptor.pb, a FileDescriptorSet (17 KB binary, 118 KB annotated protoc text, 2,946 fields).


Benchmark summary

Path A — binary → protoc text (decode_and_render)

| Benchmark | Time (median) | Throughput |
|---|---|---|
| A1 no schema, no annotations | 37 µs | not reported |
| A2 schema + annotations | 205 µs | 80 MiB/s |

Annotations cost 5.5× the no-annotation baseline. Schema lookup and annotation string writes account for the majority of the gap.

Path B — protoc text → binary (encode_text_to_binary)

| Benchmark | Time (median) | Throughput |
|---|---|---|
| B1 annotated text (118 KB) | 441 µs | 255 MiB/s |

Path B is 3.2× faster per byte than Path A with annotations (255 vs 80 MiB/s), because it scans pre-formatted text rather than formatting values from scratch.

Micro-benchmarks (varint)

| Operation | Time |
|---|---|
| parse_varint 1-byte | 6.4 ns |
| parse_varint 2-byte | 6.9 ns |
| parse_varint 10-byte | 11.6 ns |
| write_varint into reused Vec | 1.7 ns |
| write_varint into fresh Vec | 15 ns |
| encode_varint_bytes(150) (allocates Vec) | 18 ns |

Reusing a write buffer is about 9× faster than allocating a new Vec per varint (1.7 ns vs 15 ns).
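To make the reused-versus-fresh distinction concrete, here is a minimal sketch of the two call shapes. The function names match the benchmark labels, but the signatures and bodies are assumptions, not the project's actual code; encode_all_reused is a hypothetical helper added only to show the reuse pattern.

```rust
/// Append `value` to `out` as a protobuf base-128 varint.
/// No allocation happens as long as `out` already has spare capacity.
pub fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        // Low 7 bits plus the continuation flag.
        out.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Allocating variant: builds a fresh Vec on every call.
pub fn encode_varint_bytes(value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    write_varint(value, &mut out);
    out
}

/// Hypothetical helper: encode many values through one reused buffer.
/// Returns the total number of encoded bytes.
pub fn encode_all_reused(values: &[u64]) -> usize {
    let mut buf = Vec::with_capacity(10); // a u64 varint is at most 10 bytes
    let mut total = 0;
    for &v in values {
        buf.clear(); // resets the length but keeps the capacity
        write_varint(v, &mut buf);
        total += buf.len();
    }
    total
}
```

With this sketch, encode_varint_bytes(150) yields the two bytes 0x96 0x01. The reused variant touches the allocator once, up front, which is the effect the micro-benchmark isolates.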


perf flat profiles

Profiled with perf record -e task-clock. Criterion itself contributes ~25% of samples (rayon statistical analysis, libm exp()); percentages are of all samples.

Path A2 — decode_and_render with schema + annotations

 12.18%  rayon (Criterion overhead)
  8.33%  __ieee754_exp_fma (Criterion overhead)
  6.25%  malloc
  5.54%  _int_free
  4.90%  __memmove_avx_unaligned_erms
  4.24%  render_message
  4.11%  core::fmt::write
  3.28%  String::write_str
  3.06%  RawVecInner::reserve
  3.01%  exp (Criterion overhead)
  2.45%  core::hash::BuildHasher::hash_one
  2.30%  escape_string
  2.14%  alloc::fmt::format
  2.09%  alloc::str::join_generic_copy
  1.84%  render_len_field
  1.55%  core::str::from_utf8
  1.36%  parse_varint

Allocator pressure (malloc + free + memmove + realloc): ~24% of all samples. format! machinery (fmt::write + format + join): ~8%.

Path B1 — encode_text_to_binary

 13.80%  core::slice::memchr::memchr_aligned
 10.89%  rayon (Criterion overhead)
  9.31%  encode_text_to_binary (self)
  8.57%  CharSearcher::next_match
  8.52%  SplitWhitespace::next
  7.43%  __ieee754_exp_fma (Criterion overhead)
  4.14%  parse_annotation
  3.47%  malloc
  3.29%  StrSearcher::new
  2.78%  str::trim_matches
  1.82%  __memmove_avx_unaligned_erms
  1.44%  unescape_bytes
  1.28%  write_varint_ohb

Text scanning (memchr + CharSearcher + SplitWhitespace + StrSearcher + trim): ~37%. Allocator: ~7%.


Optimisation history

All measurements on the descriptor.pb workload.

P1–P4 (−23.8% A2, −29.0% B1)

P1 — direct buffer writes for value formatting (Path A)

Replaced the format!("\"{}\"", escape_string(s)) pattern — which allocates an intermediate String for the escape output and another for the format! wrapper — with direct writes to the output Vec<u8> via escape_bytes_into / escape_string_into. Eliminated 2 heap allocations per LEN-field value.
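As an illustration of the direct-write approach in P1, the sketch below escapes a byte string straight into the output buffer. The escape set (quote, backslash, \n, \r, \t, three-digit octal for other non-printable bytes) and the decision to emit the surrounding quotes inside the function are assumptions made for this example; the real escape_bytes_into may differ on both points.

```rust
/// Append `src` to `out` as a double-quoted, escaped literal.
/// No intermediate String is created; every byte goes directly to `out`.
pub fn escape_bytes_into(src: &[u8], out: &mut Vec<u8>) {
    out.push(b'"');
    for &b in src {
        match b {
            b'"' => out.extend_from_slice(b"\\\""),
            b'\\' => out.extend_from_slice(b"\\\\"),
            b'\n' => out.extend_from_slice(b"\\n"),
            b'\r' => out.extend_from_slice(b"\\r"),
            b'\t' => out.extend_from_slice(b"\\t"),
            // Remaining printable ASCII is copied unchanged.
            0x20..=0x7E => out.push(b),
            _ => {
                // Three-digit octal escape, e.g. 0xFF becomes \377.
                out.push(b'\\');
                out.push(b'0' + (b >> 6));
                out.push(b'0' + ((b >> 3) & 7));
                out.push(b'0' + (b & 7));
            }
        }
    }
    out.push(b'"');
}
```

Compared with format!("\"{}\"", escape_string(s)), the caller passes the shared output buffer and nothing is allocated per value unless the buffer itself has to grow.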

P2 — eliminate the mods: Vec<String> pattern (Path A)

Annotation modifiers were accumulated in a Vec<String> then joined: mods.join(" "). Replaced with AnnWriter, a zero-allocation helper that writes each modifier directly to the output buffer with an inline separator. Eliminated 3–6 heap allocations per annotated field.
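A writer of this kind could be as small as the sketch below. The struct layout, the push method and the render_mods helper are hypothetical; only the name AnnWriter and the idea of an inline separator come from the description above.

```rust
/// Writes space-separated annotation modifiers straight into an output
/// buffer. Hypothetical shape; the real AnnWriter may differ.
pub struct AnnWriter<'a> {
    out: &'a mut Vec<u8>,
    first: bool,
}

impl<'a> AnnWriter<'a> {
    pub fn new(out: &'a mut Vec<u8>) -> Self {
        AnnWriter { out, first: true }
    }

    /// Append one modifier, emitting the separator only between items.
    pub fn push(&mut self, modifier: &str) {
        if !self.first {
            self.out.push(b' ');
        }
        self.first = false;
        self.out.extend_from_slice(modifier.as_bytes());
    }
}

/// Hypothetical helper: render a list of modifiers without building
/// a Vec<String> or calling join.
pub fn render_mods(mods: &[&str], out: &mut Vec<u8>) {
    let mut w = AnnWriter::new(out);
    for m in mods {
        w.push(m);
    }
}
```

The output is byte-for-byte what mods.join(" ") would produce, but each modifier is copied once into the final buffer instead of first into its own String and then into the joined String.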

P3 — zero-alloc field declaration parsing (Path B)

parse_field_decl_into collected tokens with split_whitespace().collect::<Vec<&str>>(), allocating a fresh Vec per annotated line. Replaced with a lazy split_ascii_whitespace() iterator consuming at most 5 tokens — zero allocation.

Isolated benchmark: manual split is 18× faster than a compiled regex for this task (16.9 ns vs 310 ns per 10-token corpus).
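The lazy tokenisation described for P3 can be sketched as follows. first_five_tokens is a hypothetical stand-in for the relevant part of parse_field_decl_into, and the example declaration syntax in the test is assumed; the point is only that the tokens are borrowed slices stored in a fixed-size array, so no Vec is allocated.

```rust
/// Collect at most five whitespace-separated tokens from `line`.
/// The tokens borrow from `line`; unused slots stay as empty strings.
/// Returns the token array and the number of slots actually filled.
pub fn first_five_tokens(line: &str) -> ([&str; 5], usize) {
    let mut tokens = [""; 5];
    let mut len = 0;
    // The iterator is lazy: scanning stops after the fifth token.
    for tok in line.split_ascii_whitespace().take(5) {
        tokens[len] = tok;
        len += 1;
    }
    (tokens, len)
}
```

Because take(5) stops the iterator early, any text after the fifth token is never scanned, in addition to avoiding the per-line Vec.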

P4 — byte scan in split_at_annotation (Path B)

line.rfind(" # ") constructed a StrSearcher (the standard library's Two-Way substring searcher) per line. Replaced with memrchr(b'#') followed by a 5-byte pattern verify, which eliminates the searcher construction overhead on every line.
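A reverse byte scan with verification might look like the sketch below. Two assumptions are worth flagging: the delimiter is taken to be the 3-byte " # " shown in the rfind call (the text mentions a 5-byte verify, so the real pattern may be longer), and Iterator::rposition stands in for memchr::memrchr so that the example needs only the standard library.

```rust
/// Assumed delimiter; the real pattern (and its length) may differ.
const DELIM: &[u8] = b" # ";

/// Split `line` at the last occurrence of DELIM by scanning bytes from
/// the right. Returns the text before and after the delimiter.
pub fn split_at_annotation(line: &str) -> Option<(&str, &str)> {
    let bytes = line.as_bytes();
    let mut end = bytes.len();
    // Find the right-most '#', then check the bytes around it.
    while let Some(hash) = bytes[..end].iter().rposition(|&b| b == b'#') {
        // '#' sits at offset 1 of DELIM, so a match starts one byte earlier.
        if hash >= 1 && bytes[hash - 1..].starts_with(DELIM) {
            let start = hash - 1;
            return Some((&line[..start], &line[start + DELIM.len()..]));
        }
        end = hash; // not a delimiter; keep searching further left
    }
    None
}
```

The split offsets always land next to ASCII bytes of the delimiter, so slicing the &str at those positions cannot fall inside a multi-byte character.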

| Benchmark | Baseline | After P1–P4 | Change |
|---|---|---|---|
| A2 decode_and_render (schema + annotations) | 637 µs · 25.8 MiB/s | 485 µs · 33.9 MiB/s | −23.8% |
| B1 encode_text_to_binary | 642 µs · 175 MiB/s | 456 µs · 247 MiB/s | −29.0% |

P8–P10 (−57.8% A2 vs P1–P4 baseline)

After P1–P4 two String allocations per rendered schema field remained:

  1. field_name() -> String: either field_number.to_string() or field_schema.unwrap().name.clone() — immediately consumed as &str.
  2. field_decl() -> Option<String>: format!("{} {}{} = {};", ...) — one per annotated schema field.

P8 — replaced field_name() -> String with wfl_prefix_n / wob_prefix_n helpers that write the field name directly to the output buffer via out.extend_from_slice(fi.name.as_bytes()) or write_dec_u64.
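For illustration, a decimal writer in the spirit of write_dec_u64 could be written as below. The signature is an assumption, and write_field_name is a hypothetical wrapper showing the name-or-number choice that field_name() used to make with a String.

```rust
/// Append the decimal form of `n` to `out` without allocating a String.
pub fn write_dec_u64(mut n: u64, out: &mut Vec<u8>) {
    let mut buf = [0u8; 20]; // u64::MAX has 20 decimal digits
    let mut i = buf.len();
    // Fill the stack buffer from the right, least significant digit first.
    loop {
        i -= 1;
        buf[i] = b'0' + (n % 10) as u8;
        n /= 10;
        if n == 0 {
            break;
        }
    }
    out.extend_from_slice(&buf[i..]);
}

/// Hypothetical wrapper: write the schema name when one is known,
/// otherwise fall back to the field number.
pub fn write_field_name(name: Option<&str>, number: u64, out: &mut Vec<u8>) {
    match name {
        Some(s) => out.extend_from_slice(s.as_bytes()),
        None => write_dec_u64(number, out),
    }
}
```

The digits are staged in a 20-byte stack array, so the only heap activity is whatever growth the shared output buffer needs.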

P9 — replaced field_decl() -> Option<String> with AnnWriter::push_field_decl, which writes the field declaration directly to the output buffer. type_str(fi) is a &'static str match — zero allocation; fi.type_display_name.as_deref() borrows from the schema with no clone.
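The two borrowing tricks mentioned for P9 can be sketched in isolation. FieldType, FieldInfo and push_type below are hypothetical and cover only three types; type_str and type_display_name are named in the text, but their real definitions are not shown there.

```rust
/// Hypothetical subset of field types, for illustration only.
#[derive(Clone, Copy)]
pub enum FieldType {
    Int32,
    String,
    Message,
}

/// Hypothetical schema entry.
pub struct FieldInfo {
    pub ty: FieldType,
    pub type_display_name: Option<String>,
}

/// Map the type to a string literal: a &'static str, so nothing is allocated.
pub fn type_str(fi: &FieldInfo) -> &'static str {
    match fi.ty {
        FieldType::Int32 => "int32",
        FieldType::String => "string",
        FieldType::Message => "message",
    }
}

/// Write the display name if the schema has one, else the built-in name.
/// `as_deref` turns &Option<String> into Option<&str> without cloning.
pub fn push_type(fi: &FieldInfo, out: &mut Vec<u8>) {
    let name: &str = fi.type_display_name.as_deref().unwrap_or(type_str(fi));
    out.extend_from_slice(name.as_bytes());
}
```

Both branches hand back a borrowed &str, so the only copy is the final extend_from_slice into the output buffer.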

P10 — removed scattered .clone() / .to_string() / .to_owned() in render_len_field and render_group_field at call sites subsumed by P8.

| Benchmark | After P1–P4 | After P8–P10 | Change vs P1–P4 | Cumulative |
|---|---|---|---|---|
| A2 decode_and_render | 485 µs · 33.9 MiB/s | 205 µs · 80.4 MiB/s | −57.8% | −67.9% |
| B1 encode_text_to_binary | 456 µs · 247 MiB/s | 462 µs · 244 MiB/s | +1% (noise) | −28.1% |

P14–P15 (−4.6% B1)

P14 — added #[inline] to 16 cross-module hot-path helpers (write_varint_ohb, parse_varint, escape_bytes_into, etc.). Without #[inline], function bodies are invisible across codegen-unit boundaries, preventing inlining.

P15 — added codegen-units = 1 and lto = "thin" to [profile.release]. Single-CGU gives the optimiser full intra-crate visibility; thin LTO extends it across crate boundaries (into prost, memchr).
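Put together, P14 and P15 amount to a small change in two places. The sketch below shows the profile settings as a comment (quoted from the text above) and one #[inline] helper. The body and signature of parse_varint are assumptions written for this example, not the project's implementation.

```rust
// Cargo.toml, as described for P15:
//
//   [profile.release]
//   codegen-units = 1
//   lto = "thin"

/// Decode a base-128 varint from the front of `buf`.
/// Returns the value and the number of bytes consumed, or None if the
/// input is empty, truncated, or has no terminating byte within 10 bytes.
#[inline] // P14: hint that callers in other modules may inline this
pub fn parse_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in buf.iter().enumerate().take(10) {
        // Each byte contributes 7 payload bits, least significant group first.
        value |= u64::from(b & 0x7F) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

For a helper this small, call overhead is a noticeable share of the total cost, which may be why the largest relative gain in the table below shows up on the varint micro-benchmark.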

| Benchmark | After P8–P10 | After P14–P15 | Change |
|---|---|---|---|
| A2 decode_and_render | 205 µs · 80.4 MiB/s | 205 µs · 80.4 MiB/s | ~0% (escape loops are the bottleneck) |
| B1 encode_text_to_binary | 462 µs · 244 MiB/s | 441 µs · 255 MiB/s | −4.6% |
| write_varint reused Vec | 3.1 ns | 1.7 ns | −44% |

The A2 residual (~200 µs) is dominated by escape_bytes_into / escape_string_into — byte-by-byte scanning of all string/bytes field values. This is inherent to lossless output and cannot be eliminated without algorithmic changes (e.g. SIMD escape scanning).


Cumulative journey

Path A2 (binary → annotated text, 17 KB input)

| Checkpoint | Time | Throughput | Δ |
|---|---|---|---|
| Original | 637 µs | 25.8 MiB/s | |
| After P1–P4 | 485 µs | 33.9 MiB/s | −23.8% |
| After P8–P10 | 205 µs | 80.4 MiB/s | −57.8% |
| After P14–P15 | 205 µs | 80.4 MiB/s | ~0% |
| Total | | | −67.9% |

Path B1 (annotated text → binary, 118 KB input)

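| Checkpoint | Time | Throughput | Δ |
|---|---|---|---|
| Original | 636 µs | 177 MiB/s | |
| After P1–P4 (P3 and P4 affect Path B) | 456 µs | 247 MiB/s | −28.3% |
| After P8–P10 | 462 µs | 244 MiB/s | +1% (noise) |
| After P14–P15 | 441 µs | 255 MiB/s | −4.6% |
| Total | | | −30.7% |

Note: this table uses a 636 µs (177 MiB/s) baseline, whereas the P1–P4 table above lists 642 µs (175 MiB/s) for the same benchmark. Each table's percentages are computed against its own baseline, which is why the first step reads −28.3% here and −29.0% there.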

Remaining bottlenecks

Path A — residual ~200 µs

| Bottleneck | Share | Notes |
|---|---|---|
| escape_bytes_into / escape_string_into | largest | byte-by-byte scan of all string/bytes values; unavoidable for lossless output |
| parse_varint | modest | already tight; inherent per-field cost |
| HashMap lookups (schema.fields.get, all_schemas.get) | < 1% | P11 (FxHashMap) would save < 1% |
| Criterion framework overhead | ~25% of samples | not real codec cost |

Path B — remaining opportunities

| Proposal | Estimated gain | Status |
|---|---|---|
| P12: byte-level trim / ends_with in parse loop | ~4–6% on B1 | Pending |

P12 replaces str::trim_end(), ends_with(char), starts_with(char) (which use CharSearcher, a Unicode-aware state machine) with direct byte comparisons. For ASCII-only annotation content these are equivalent and ~3–5× cheaper.
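The byte-level replacements proposed in P12 could be as simple as the sketch below. The function names are hypothetical and the whitespace set is assumed to be space, tab, CR and LF. Unlike str::trim_end, this does not strip non-ASCII Unicode whitespace, so the two agree only on ASCII-only content, as noted above.

```rust
/// Strip trailing ASCII space, tab, CR and LF by comparing raw bytes.
pub fn trim_end_ascii(line: &[u8]) -> &[u8] {
    let mut end = line.len();
    while end > 0 && matches!(line[end - 1], b' ' | b'\t' | b'\r' | b'\n') {
        end -= 1;
    }
    &line[..end]
}

/// Byte-level equivalent of ends_with(char) for an ASCII character.
pub fn ends_with_byte(line: &[u8], b: u8) -> bool {
    line.last() == Some(&b)
}

/// Byte-level equivalent of starts_with(char) for an ASCII character.
pub fn starts_with_byte(line: &[u8], b: u8) -> bool {
    line.first() == Some(&b)
}
```

A parse loop would call these on line.as_bytes(); each check is a single comparison against the first or last byte rather than a pattern search.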


Earlier analysis: ingest_pb / encode_to_binary paths

These numbers come from an earlier profiling round on the ingest_pb / encode_to_binary inner functions (before the single-pass render path existed). They remain useful context for the binary decode/encode kernel.

Criterion results (ingest_pb)

| Input | Mean | Throughput |
|---|---|---|
| 50-field structured | 4.85 µs | 112 MiB/s |
| 200-field structured | 17.75 µs | 127 MiB/s |
| Random bytes 64 B | 60.9 ns | 1,003 MiB/s |
| Random bytes 1024 B | 70.0 ns | 13.6 GiB/s |

Random bytes parse very quickly because most bytes are consumed as garbage on the first invalid wire tag. The structured path (112–127 MiB/s) is the meaningful real-world figure: ~97 ns per field for a 50-field message (4.85 µs / 50).

Criterion results (encode_to_binary)

| Input | Mean | Throughput |
|---|---|---|
| 50-field structured | 4.25 µs | 128 MiB/s |
| 200-field structured | 15.95 µs | 142 MiB/s |

Nearly symmetric with ingest_pb — both paths have similar per-field allocation cost.

Hot functions (perf, ingest_pb bench)

After removing Criterion overhead (~27%):

| Symbol | Approx. % (codec only) | Notes |
|---|---|---|
| __memmove_avx_unaligned_erms | ~21% | Vec copies from allocate-copy-free pattern |
| parse_wiretag | ~15% | called once per field |
| ingest_pb + parse_message | ~16% | dispatch and field loop |
| parse_varint | ~6% | called twice per field |
| malloc + free + realloc | ~11% | heap allocator overhead |
| drop_in_place<ProtoTextContent> | ~2% | destructors |

The memmove (21%) and allocator (11%) costs were driven by encode_varint_bytes returning a fresh Vec<u8> per call, which was then immediately copied into a parent buffer with extend_from_slice. This was replaced by write_varint_ohb(value, ohb, &mut out), an in-place write with zero allocation.