Environment: Linux 6.12.57 x86-64, single core (virtualised).
Criterion: 100 samples, 3 s warmup.
Workload: descriptor.pb — a FileDescriptorSet with 17 KB binary /
118 KB annotated protoc text / 2,946 fields.
Path A (decode_and_render)

| Benchmark | Time (median) | Throughput |
|---|---|---|
| A1 no schema, no annotations | 37 µs | — |
| A2 schema + annotations | 205 µs | 80 MiB/s |
Schema lookup plus annotations cost 5.5× the no-schema, no-annotation baseline (205 µs vs 37 µs). Schema lookups and annotation string writes account for most of the gap.
Path B (encode_text_to_binary)

| Benchmark | Time (median) | Throughput |
|---|---|---|
| B1 annotated text (118 KB) | 441 µs | 255 MiB/s |
Path B is 3.2× faster per byte than Path A with annotations (255 vs 80 MiB/s), because it scans pre-formatted text rather than formatting values from scratch.

| Operation | Time |
|---|---|
| parse_varint 1-byte | 6.4 ns |
| parse_varint 2-byte | 6.9 ns |
| parse_varint 10-byte | 11.6 ns |
| write_varint into reused Vec | 1.7 ns |
| write_varint into fresh Vec | 15 ns |
| encode_varint_bytes(150), allocates Vec | 18 ns |
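Reusing a write buffer is roughly 9× faster than writing into a fresh Vec (1.7 ns vs 15 ns) and over 10× faster than encode_varint_bytes, which allocates a new Vec per varint (18 ns).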
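The difference between the two patterns can be sketched as follows. This is a minimal illustration, not the crate's code: the names write_varint and encode_varint_bytes come from the table above, but their bodies and signatures here are assumptions, and encode_all is a hypothetical caller.

```rust
/// Append `value` as a base-128 varint (the protobuf wire encoding) to `out`.
/// Writes into the caller's buffer, so nothing is allocated when `out`
/// already has spare capacity.
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        // Low 7 bits plus the continuation bit.
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    // Final byte, continuation bit clear.
    out.push(value as u8);
}

/// Allocating variant: returns a fresh Vec on every call (the slow pattern).
fn encode_varint_bytes(value: u64) -> Vec<u8> {
    let mut v = Vec::new();
    write_varint(value, &mut v);
    v
}

/// Hypothetical caller: encodes many values into one reused buffer.
fn encode_all(values: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 2);
    for &v in values {
        write_varint(v, &mut out);
    }
    out
}
```

With the reused buffer, the allocator is only involved when the buffer grows, rather than once per varint.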
Profiled with perf record -e task-clock. Criterion itself contributes ~25% of
samples (rayon statistical analysis, libm exp()); percentages are of all samples.
decode_and_render with schema + annotations

12.18% rayon (Criterion overhead)
8.33% __ieee754_exp_fma (Criterion overhead)
6.25% malloc
5.54% _int_free
4.90% __memmove_avx_unaligned_erms
4.24% render_message
4.11% core::fmt::write
3.28% String::write_str
3.06% RawVecInner::reserve
3.01% exp (Criterion overhead)
2.45% core::hash::BuildHasher::hash_one
2.30% escape_string
2.14% alloc::fmt::format
2.09% alloc::str::join_generic_copy
1.84% render_len_field
1.55% core::str::from_utf8
1.36% parse_varint
Allocator pressure (malloc + free + memmove + realloc): ~24% of all samples.
format! machinery (fmt::write + format + join): ~8%.
encode_text_to_binary

13.80% core::slice::memchr::memchr_aligned
10.89% rayon (Criterion overhead)
9.31% encode_text_to_binary (self)
8.57% CharSearcher::next_match
8.52% SplitWhitespace::next
7.43% __ieee754_exp_fma (Criterion overhead)
4.14% parse_annotation
3.47% malloc
3.29% StrSearcher::new
2.78% str::trim_matches
1.82% __memmove_avx_unaligned_erms
1.44% unescape_bytes
1.28% write_varint_ohb
Text scanning (memchr + CharSearcher + SplitWhitespace + StrSearcher + trim): ~35%. Allocator: ~7%.
All measurements on the descriptor.pb workload.
P1 — direct buffer writes for value formatting (Path A)
Replaced the format!("\"{}\"", escape_string(s)) pattern — which allocates an
intermediate String for the escape output and another for the format! wrapper
— with direct writes to the output Vec<u8> via escape_bytes_into /
escape_string_into. Eliminated 2 heap allocations per LEN-field value.
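A minimal sketch of the direct-write approach, assuming C-style escapes with three-digit octal for non-printable bytes. The real escape rules, and whether the real escape_bytes_into emits the surrounding quotes itself, are not stated here, so both are assumptions.

```rust
/// Append `input` to `out` as a double-quoted, escaped literal.
/// No intermediate String is built: every byte goes straight into `out`.
/// The escape rules are an assumption (C-style, octal for non-printables).
fn escape_bytes_into(input: &[u8], out: &mut Vec<u8>) {
    out.push(b'"');
    for &b in input {
        match b {
            b'"' => out.extend_from_slice(b"\\\""),
            b'\\' => out.extend_from_slice(b"\\\\"),
            b'\n' => out.extend_from_slice(b"\\n"),
            b'\r' => out.extend_from_slice(b"\\r"),
            b'\t' => out.extend_from_slice(b"\\t"),
            // Remaining printable ASCII is copied unchanged.
            0x20..=0x7e => out.push(b),
            _ => {
                // Three-digit octal escape, e.g. 0xff becomes \377.
                out.push(b'\\');
                out.push(b'0' + (b >> 6));
                out.push(b'0' + ((b >> 3) & 7));
                out.push(b'0' + (b & 7));
            }
        }
    }
    out.push(b'"');
}
```

Compared with wrapping an escaped String in format!, this version performs no allocation of its own; the only possible allocation is growth of the shared output buffer.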
P2 — eliminate the mods: Vec<String> pattern (Path A)
Annotation modifiers were accumulated in a Vec<String> then joined:
mods.join(" "). Replaced with AnnWriter, a zero-allocation helper that
writes each modifier directly to the output buffer with an inline separator.
Eliminated 3–6 heap allocations per annotated field.
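The idea behind such a helper can be sketched like this. Only the name AnnWriter comes from the text above; the fields, the push method, and write_mods are hypothetical, and the modifier strings in the usage note are purely illustrative.

```rust
/// Sketch of an AnnWriter-style helper: writes modifiers straight into the
/// output buffer, inserting a separator before every item except the first.
struct AnnWriter<'a> {
    out: &'a mut Vec<u8>,
    first: bool,
}

impl<'a> AnnWriter<'a> {
    fn new(out: &'a mut Vec<u8>) -> Self {
        AnnWriter { out, first: true }
    }

    /// Append one modifier; no intermediate String or Vec<String> is built.
    fn push(&mut self, modifier: &str) {
        if self.first {
            self.first = false;
        } else {
            self.out.push(b' ');
        }
        self.out.extend_from_slice(modifier.as_bytes());
    }
}

/// Produces the same bytes as `mods.join(" ")`, but written in place.
fn write_mods(mods: &[&str], out: &mut Vec<u8>) {
    let mut w = AnnWriter::new(out);
    for &m in mods {
        w.push(m);
    }
}
```

Calling write_mods(&["optional", "packed"], &mut out) appends "optional packed" to whatever out already holds.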
P3 — zero-alloc field declaration parsing (Path B)
parse_field_decl_into collected tokens with split_whitespace().collect::<Vec<&str>>(),
allocating a fresh Vec per annotated line. Replaced with a lazy
split_ascii_whitespace() iterator consuming at most 5 tokens — zero allocation.
Isolated benchmark: manual split is 18× faster than a compiled regex for this task (16.9 ns vs 310 ns per 10-token corpus).
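A sketch of the lazy-iterator approach, under the assumption that a declaration looks like `optional string name = 1;`. The FieldDecl struct and parse_field_decl function are hypothetical stand-ins for parse_field_decl_into, whose real grammar and signature are not shown here.

```rust
/// Parsed view of a field declaration such as `optional string name = 1;`.
/// Holds borrowed slices only; nothing is allocated.
#[derive(Debug, PartialEq)]
struct FieldDecl<'a> {
    label: &'a str,
    type_name: &'a str,
    name: &'a str,
    number: u32,
}

/// Pulls at most 5 tokens from a lazy iterator instead of collecting
/// them into a Vec<&str>.
fn parse_field_decl(line: &str) -> Option<FieldDecl<'_>> {
    let mut toks = line.split_ascii_whitespace();
    let label = toks.next()?;
    let type_name = toks.next()?;
    let name = toks.next()?;
    if toks.next()? != "=" {
        return None;
    }
    // The last token carries the trailing semicolon, e.g. "1;".
    let number = toks.next()?.trim_end_matches(';').parse().ok()?;
    Some(FieldDecl { label, type_name, name, number })
}
```

Because the iterator is consumed token by token, a malformed line is rejected as soon as a token is missing, without ever scanning the rest of the line.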
P4 — byte scan in split_at_annotation (Path B)
line.rfind(" # ") constructed a StrSearcher (Two-Way algorithm) per line.
Replaced with memrchr(b'#') + 5-byte pattern verify. Eliminates searcher
construction overhead on every line.
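The reverse byte scan might look roughly like the sketch below. It assumes the marker is exactly space, '#', space (as in the rfind call above) rather than reproducing the 5-byte check, and it uses std's rposition as a stand-in for memrchr so the example needs no external crate.

```rust
/// Split `line` at the last " # " marker, returning (content, annotation).
/// Scans backwards for '#' and then checks the neighbouring bytes, instead
/// of building a substring searcher for every line.
fn split_at_annotation(line: &str) -> Option<(&str, &str)> {
    let bytes = line.as_bytes();
    let mut end = bytes.len();
    // `rposition` stands in for memchr::memrchr to keep the sketch std-only.
    while let Some(i) = bytes[..end].iter().rposition(|&b| b == b'#') {
        let space_before = i > 0 && bytes[i - 1] == b' ';
        let space_after = bytes.get(i + 1) == Some(&b' ');
        if space_before && space_after {
            // Both cut points sit next to ASCII bytes, so they are valid
            // UTF-8 char boundaries.
            return Some((&line[..i - 1], &line[i + 2..]));
        }
        // Not a marker: keep scanning to the left of this '#'.
        end = i;
    }
    None
}
```

A '#' that is not surrounded by spaces (for example inside a quoted value) is skipped and the scan continues leftwards, which matches what rfind(" # ") would return.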
| Benchmark | Baseline | After P1–P4 | Change |
|---|---|---|---|
| A2 decode_and_render (schema + annotations) | 637 µs · 25.8 MiB/s | 485 µs · 33.9 MiB/s | −23.8% |
| B1 encode_text_to_binary | 642 µs · 175 MiB/s | 456 µs · 247 MiB/s | −29.0% |
After P1–P4, two String allocations per rendered schema field remained:

- field_name() -> String: either field_number.to_string() or field_schema.unwrap().name.clone(), immediately consumed as &str.
- field_decl() -> Option<String>: format!("{} {}{} = {};", ...), one per annotated schema field.

P8 — replaced field_name() -> String with wfl_prefix_n / wob_prefix_n
helpers that write the field name directly to the output buffer via
out.extend_from_slice(fi.name.as_bytes()) or write_dec_u64.
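For illustration, an allocation-free decimal writer could be implemented along these lines. The name write_dec_u64 comes from the text above, but this body is an assumption, and write_field_name is a hypothetical stand-in for the wfl_prefix_n / wob_prefix_n helpers.

```rust
/// Append the decimal representation of `n` to `out` without creating a
/// temporary String (unlike `n.to_string()`).
fn write_dec_u64(mut n: u64, out: &mut Vec<u8>) {
    let mut buf = [0u8; 20]; // u64::MAX has 20 decimal digits
    let mut pos = buf.len();
    // Fill the stack buffer from the right, least significant digit first.
    loop {
        pos -= 1;
        buf[pos] = b'0' + (n % 10) as u8;
        n /= 10;
        if n == 0 {
            break;
        }
    }
    out.extend_from_slice(&buf[pos..]);
}

/// Write the schema name if one is known, otherwise the field number.
fn write_field_name(name: Option<&str>, number: u64, out: &mut Vec<u8>) {
    match name {
        Some(s) => out.extend_from_slice(s.as_bytes()),
        None => write_dec_u64(number, out),
    }
}
```

Both branches borrow or compute on the stack, so the per-field String that field_name() used to return is no longer needed.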
P9 — replaced field_decl() -> Option<String> with AnnWriter::push_field_decl,
which writes the field declaration directly to the output buffer. type_str(fi)
is a &'static str match — zero allocation; fi.type_display_name.as_deref()
borrows from the schema with no clone.
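The two borrowing tricks in P9 can be sketched as follows. FieldType, FieldInfo, and the exact output layout are hypothetical; the real method lives on AnnWriter, whereas this sketch uses a free function to stay self-contained.

```rust
/// Hypothetical subset of field types.
#[derive(Clone, Copy)]
enum FieldType {
    Int32,
    String,
    Message,
}

/// Hypothetical schema entry; `type_display_name` is set for named types.
struct FieldInfo {
    name: String,
    number: u32,
    ty: FieldType,
    type_display_name: Option<String>,
}

/// Map a type to its keyword. Returns a &'static str, so nothing is allocated.
fn type_str(fi: &FieldInfo) -> &'static str {
    match fi.ty {
        FieldType::Int32 => "int32",
        FieldType::String => "string",
        FieldType::Message => "message",
    }
}

/// Write `<type> <name> = <number>;` directly into `out`.
fn push_field_decl(fi: &FieldInfo, out: &mut Vec<u8>) {
    use std::io::Write;
    // as_deref() turns &Option<String> into Option<&str>: a borrow, no clone.
    let ty = fi.type_display_name.as_deref().unwrap_or(type_str(fi));
    out.extend_from_slice(ty.as_bytes());
    out.push(b' ');
    out.extend_from_slice(fi.name.as_bytes());
    out.extend_from_slice(b" = ");
    // Writing to a Vec<u8> cannot fail; this formats into the buffer in place.
    write!(out, "{}", fi.number).unwrap();
    out.push(b';');
}
```

The number is still formatted through write! here for brevity; a hand-rolled decimal writer would avoid the core::fmt machinery as well.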
P10 — removed scattered .clone() / .to_string() / .to_owned() in
render_len_field and render_group_field at call sites subsumed by P8.
| Benchmark | After P1–P4 | After P8–P10 | Change vs P1–P4 | Cumulative |
|---|---|---|---|---|
| A2 decode_and_render | 485 µs · 33.9 MiB/s | 205 µs · 80.4 MiB/s | −57.8% | −67.9% |
| B1 encode_text_to_binary | 456 µs · 247 MiB/s | 462 µs · 244 MiB/s | +1% (noise) | −28.1% |
P14 — added #[inline] to 16 cross-module hot-path helpers (write_varint_ohb,
parse_varint, escape_bytes_into, etc.). Without #[inline], function bodies
are invisible across codegen-unit boundaries, preventing inlining.
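As a rough illustration of the P14 change, the attribute is placed on the small helper so that callers elsewhere can inline it. The parse_varint body below is an assumption (it does not check the tenth byte for overflow), and sum_varints is a hypothetical hot-path caller that would live in a different module in a real crate.

```rust
/// Decode one base-128 varint from the front of `buf`.
/// Returns (value, bytes consumed), or None if the input is truncated or
/// longer than 10 bytes.
/// `#[inline]` makes the body available for inlining at call sites in
/// other codegen units and crates.
#[inline]
pub fn parse_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in buf.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Hot loop that calls the helper once per value: sums every varint in `buf`.
pub fn sum_varints(mut buf: &[u8]) -> Option<u64> {
    let mut total = 0u64;
    while !buf.is_empty() {
        let (v, n) = parse_varint(buf)?;
        total = total.wrapping_add(v);
        buf = &buf[n..];
    }
    Some(total)
}
```

The attribute is a hint rather than a guarantee; the compiler still decides per call site whether inlining is worthwhile.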
P15 — added codegen-units = 1 and lto = "thin" to [profile.release].
Single-CGU gives the optimiser full intra-crate visibility; thin LTO extends it
across crate boundaries (into prost, memchr).
| Benchmark | After P8–P10 | After P14–P15 | Change |
|---|---|---|---|
| A2 decode_and_render | 205 µs · 80.4 MiB/s | 205 µs · 80.4 MiB/s | ~0% (escape loops are the bottleneck) |
| B1 encode_text_to_binary | 462 µs · 244 MiB/s | 441 µs · 255 MiB/s | −4.6% |
| write_varint reused Vec | 3.1 ns | 1.7 ns | −44% |
The A2 residual (~200 µs) is dominated by escape_bytes_into / escape_string_into
— byte-by-byte scanning of all string/bytes field values. This is inherent to
lossless output and cannot be eliminated without algorithmic changes (e.g. SIMD
escape scanning).
A2 decode_and_render

| Checkpoint | Time | Throughput | Δ |
|---|---|---|---|
| Original | 637 µs | 25.8 MiB/s | — |
| After P1–P4 | 485 µs | 33.9 MiB/s | −23.8% |
| After P8–P10 | 205 µs | 80.4 MiB/s | −57.8% |
| After P14–P15 | 205 µs | 80.4 MiB/s | ~0% |
| Total | | | −67.9% |

B1 encode_text_to_binary

| Checkpoint | Time | Throughput | Δ |
|---|---|---|---|
| Original | 636 µs | 177 MiB/s | — |
| After P3–P4 | 456 µs | 247 MiB/s | −28.3% |
| After P8–P10 | 462 µs | 244 MiB/s | +1% (noise) |
| After P14–P15 | 441 µs | 255 MiB/s | −4.6% |
| Total | | | −30.7% |
| Bottleneck | Share | Notes |
|---|---|---|
| escape_bytes_into / escape_string_into | largest | byte-by-byte scan of all string/bytes values; unavoidable for lossless output |
| parse_varint | modest | already tight; inherent per-field cost |
| HashMap lookups (schema.fields.get, all_schemas.get) | < 1% | P11 (FxHashMap) would save < 1% |
| Criterion framework overhead | ~25% of samples | not real codec cost |
| Proposal | Estimated gain | Status |
|---|---|---|
| P12 — byte-level trim / ends_with in parse loop | ~4–6% on B1 | Pending |
P12 replaces str::trim_end(), ends_with(char), starts_with(char) (which use
CharSearcher, a Unicode-aware state machine) with direct byte comparisons.
For ASCII-only annotation content these are equivalent and ~3–5× cheaper.
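A sketch of what the byte-level replacements could look like. All three function names are hypothetical, and the set of bytes treated as whitespace is an assumption. Unlike str::trim_end, this version leaves non-ASCII whitespace in place, which is why it is only equivalent for ASCII-only content.

```rust
/// Trim trailing ASCII space, tab, CR and LF using byte comparisons only.
fn trim_end_ascii(s: &str) -> &str {
    let bytes = s.as_bytes();
    let mut end = bytes.len();
    while end > 0 && matches!(bytes[end - 1], b' ' | b'\t' | b'\r' | b'\n') {
        end -= 1;
    }
    // Only whole ASCII characters were removed, so `end` is a char boundary.
    &s[..end]
}

/// Byte-level equivalent of `s.ends_with(c)` for an ASCII `c`.
fn ends_with_byte(s: &str, c: u8) -> bool {
    s.as_bytes().last() == Some(&c)
}

/// Byte-level equivalent of `s.starts_with(c)` for an ASCII `c`.
fn starts_with_byte(s: &str, c: u8) -> bool {
    s.as_bytes().first() == Some(&c)
}
```

The single-byte comparisons are safe for ASCII `c` because every byte of a multi-byte UTF-8 sequence is 0x80 or above and so can never equal an ASCII byte.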
ingest_pb / encode_to_binary paths

These numbers come from an earlier profiling round on the ingest_pb /
encode_to_binary inner functions (before the single-pass render path existed).
They remain useful context for the binary decode/encode kernel.
ingest_pb

| Input | Mean | Throughput |
|---|---|---|
| 50-field structured | 4.85 µs | 112 MiB/s |
| 200-field structured | 17.75 µs | 127 MiB/s |
| Random bytes 64 B | 60.9 ns | 1,003 MiB/s |
| Random bytes 1024 B | 70.0 ns | 13.6 GiB/s |
Random bytes parse very quickly because most bytes are consumed as garbage on the first invalid wire tag. The structured path (112–127 MiB/s) is the meaningful real-world figure: ~97 ns per field for a 50-field message (4.85 µs / 50) and ~89 ns per field for a 200-field message (17.75 µs / 200).
encode_to_binary

| Input | Mean | Throughput |
|---|---|---|
| 50-field structured | 4.25 µs | 128 MiB/s |
| 200-field structured | 15.95 µs | 142 MiB/s |
Nearly symmetric with ingest_pb — both paths have similar per-field allocation
cost.
ingest_pb bench

After removing Criterion overhead (~27%):

| Symbol | Approx % (codec only) | Notes |
|---|---|---|
| __memmove_avx_unaligned_erms | ~21% | Vec copies from allocate-copy-free pattern |
| parse_wiretag | ~15% | called once per field |
| ingest_pb + parse_message | ~16% | dispatch and field loop |
| parse_varint | ~6% | called twice per field |
| malloc + free + realloc | ~11% | heap allocator overhead |
| drop_in_place<ProtoTextContent> | ~2% | destructors |
The memmove (21%) and allocator (11%) costs were driven by encode_varint_bytes
returning a fresh Vec<u8> per call, which the caller immediately copied into a
parent buffer with extend_from_slice. This was replaced by
write_varint_ohb(value, ohb, &mut out), an in-place write with zero allocation.