
Mac + Wasm support (PRs welcome!) #1815

Draft
gilescope wants to merge 266 commits into wild-linker:main from gilescope:giles-mac


@gilescope commented Apr 6, 2026

Trying to see how much work is required to get basic mac support going. PR has grown a fair bit since the original "hello world" — there's now a wasm front and a sizeable optimiser/test/benchmark surface alongside the Mach-O work. Tested on M3 Max; CI / other arches very welcome.

If anyone wants to join in, PRs welcome from you or your AI friend.

⚠️ Use at your own risk. Passing the test suite is a floor, not a ceiling. This is in-progress, fast-moving work — the suite catches a lot but nowhere near everything a real shipping linker needs. Expect bugs, miscompiles, and surprising edge cases, especially outside the workloads called out below. Don't use this for anything you can't afford to debug.

Mach-O support

  • arm64 path: hello-world (C + Rust), allocator + threading, midnight-node (~152 MB rust binary) all link and run.
  • -ld64_compat mode produces output that's byte-for-byte identical to ld64's on every fixture in the compat suite (15/15 passing).
  • ld64-style flags fanned in: -map, -filelist, -mark_dead_strippable_dylib / MH_DEAD_STRIPPABLE_DYLIB, -U <sym>, response files, library search-paths-first ordering, etc.
  • __compact_unwind reverse-edge GC (rust-hello → 1.04 MB / 68 imports vs ld64's 70).
  • __DATA_CONST / __DATA split (fixes the historical zerocopy build-script BSS bug).
  • LTO scaffolding via lto/macho_liblto, shared lto/cache, llvm-tools discovery.
  • Codesign (ad-hoc) writer, SDK discovery cache.

Wasm support (+optimiser)

  • 76/224 lld-wasm fixtures passing (was zero); covers GC, dedup, PIC (static + static64), TLS, debug-info passthrough, LTO error paths, etc.
  • LTO dispatch tiers: per-module lower → batch merge → unified-LLVM pipeline.
  • wilt, a pure-Rust post-link optimiser (separate crate), wired in behind -O<N> / --strip-*; debug-info preservation tiers (None/Names/Lines/Full). It's intended as a drop-in replacement for wasm-opt.

Test fixtures + harness

  • ~654 new files under wild/tests/ (lld-macho, lld-wasm, sold-macho, ld64-compat, plus integration tests).
  • cargo test is currently clean across all 43 binaries.

Benchmarks

  • New benchmarks/macos-arm64.toml + --platform = "macho" filter so the existing runner works on darwin.
  • Workloads: c-hello-world, rust-hello-world, ripgrep, rust-analyzer, bevy-dylib, wild itself, rust-analyzer-incremental, midnight-node.
  • Benchmark wrapper wild-ld64-compat so the same harness compares "wild default" vs "wild ld64-compat" vs ld64 head-to-head.
  • See BENCHMARKING.md §"Benchmarking wild on macOS" and benchmarks/macos-arm64.md for the matrix.

Status vs ld64

  • Correctness / parity: in -ld64_compat mode, byte-identical output on the fixture suite. In default mode, the hot rust workloads link and run cleanly; midnight-node (152 MB) is the largest verified binary so far. Some niche fixtures still need work (e.g. arm64-thunks references ___nan, only exported on x86_64 in current Apple SDKs — ld64 fails identically there, so it's an SDK quirk not a wild bug; tracked as ignored in the suite). Test-suite green ≠ bug-free; treat as alpha.
  • Performance: an earlier baseline (2026-04-20) had wild 1.3×–274× slower than ld64, with the gap super-linear in image size; bevy-dylib was the worst offender. The recent "perf: make as fast as ld64" work has closed most of that gap for the workloads in benchmarks/macos-arm64.toml; see the SVGs in benchmarks/images/macos-arm64/ (regenerable via the runner) for the latest numbers. There is still room to improve at the very-large end.

Punchlist

  • Passes known mach-o arm tests (21/23 lld-macho with 2 ignored, ~200 sold-macho).
  • Benchmark harness + matrix on darwin.
  • Default-mode link + run for non-trivial rust binaries (incl. midnight-node).
  • Make wild competitive with ld64 on the bench matrix (latest commit; needs broader confirmation on cold-cache and very large binaries).
  • Ensure well structured (lots of macho/wasm code currently lives next to existing ELF code; some refactoring still owed before this is reviewable as smaller PRs).
  • Keep optimising performance (especially super-linear cases like bevy-dylib).
  • Wasm coverage push (76→full lld-wasm suite, pending more relocation/section-type work).
  • Harden against bugs the test suite isn't yet catching (more fuzzing, real-world workload sweeps, link-then-run validation in CI).
  • Read and review all code.

@davidlattimore
Member

Hey! I've only had a very high-level look, since there's quite a lot here. If you'd like to help out with porting to Mac, I'd suggest discussing on the Wild zulip. There's an existing thread "Mach-o support". Martin is leading the porting effort for Mac. I think it might be getting to the point where there might be scope for multiple people to work concurrently, but definitely check with him first to avoid duplicated efforts and / or hard-to-resolve merge conflicts.

I'm not sure what Martin's thoughts are on integration tests, but that's definitely something we'll need soon and is perhaps more likely to parallelise with other mac work. I see you did some work in this area, which is great. It looks like you opted for a completely separate integration test runner. I think we'd want to actually extend our existing integration test runner to support running mac tests.

gilescope added 29 commits April 9, 2026 08:27
The write_pageoff12 function extracted the access size shift from
bits 31:30 of the instruction, which works for integer LDR/STR but
is wrong for SIMD/FP loads. For ldr q0 (128-bit), bits 31:30 are 00
(interpreted as byte access = no shift) but the actual scale is 16
bytes (shift=4). This caused the page offset to be 16x too large,
making ldr q0 read from the wrong address.

This was the root cause of the println! crash: the BufWriter init
constant (a 16-byte value loaded via ldr q0) was fetched from a
wrong offset, producing garbage in the stdout mutex/RefCell state.

Found by adapting LLVM lld's arm64-relocs.s test which exercises
exactly this relocation pattern.
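
A minimal sketch of the scale selection described above, assuming the A64 "load/store register (unsigned immediate)" encoding. The helper names are hypothetical and this is not wild's actual write_pageoff12; it only illustrates why the 128-bit SIMD/FP form needs a special case.

```rust
/// Log2 of the access size for an instruction patched by a PAGEOFF12
/// relocation. Hypothetical helper, for illustration only.
fn pageoff12_shift(insn: u32) -> u32 {
    // Anything that is not a load/store with an unsigned immediate
    // (for example ADD) takes the page offset in bytes.
    if insn & 0x3b00_0000 != 0x3900_0000 {
        return 0;
    }
    let size = insn >> 30; // bits 31:30
    // SIMD/FP (bit 26) with opc<1> (bit 23) set and size == 0 is the
    // 128-bit Q-register form: its immediate is scaled by 16.
    if size == 0 && insn & 0x0480_0000 == 0x0480_0000 {
        4
    } else {
        size
    }
}

/// Write the scaled low 12 bits of `target` into imm12 (bits 21:10).
fn encode_pageoff12(insn: u32, target: u64) -> u32 {
    let shift = pageoff12_shift(insn);
    let imm12 = ((target & 0xfff) >> shift) as u32;
    (insn & !(0xfff << 10)) | (imm12 << 10)
}
```

With this check, `ldr q0, [x0]` (0x3dc00000) gets shift 4, while `ldr x0, [x1]` (0xf9400000) keeps shift 3 from bits 31:30.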

Signed-off-by: Giles Cope <[email protected]>
The LINKEDIT segment size estimation was too small for binaries with
many symbols, causing the output file to be truncated. This made
codesign fail ("main executable failed strict validation"), leaving
binaries unsigned, which macOS kills with SIGKILL on execution.

Increased the estimate to account for chained fixups data and longer
average symbol names.

Signed-off-by: Giles Cope <[email protected]>
Import 76 ARM64-relevant assembly tests from LLVM lld's MachO test
suite. These provide comprehensive coverage of relocations, stubs,
thunks, TLS, compact unwind, dead stripping, and ObjC features.

Test runner assembles each .s file with clang, links with Wild,
and validates the output with codesign. 21 tests pass, 2 ignored.

Signed-off-by: Giles Cope <[email protected]>
Import 134 shell tests from bluewhalesystems/sold (archived).
Tests compile C/C++ via clang, link with Wild via --ld-path=./ld64,
and verify output. 36 pass, 98 ignored (categorized by reason).

Signed-off-by: Giles Cope <[email protected]>
Signed-off-by: Giles Cope <[email protected]>
- Fix write_dylib_symtab hardcoding n_sect=1 for all symbols.
  Extract parse_section_ranges() for proper section lookup.
- Error when an explicitly-specified entry symbol (-e) is not found,
  instead of silently succeeding.
- Propagate entry_symbol_address errors in Mach-O writer.
- Un-ignore 5 sold tests: pagezero-size3, entry, objc, eh-frame,
  bind-at-load. Sold suite: 41 pass (was 36).

Signed-off-by: Giles Cope <[email protected]>
Implement -filelist <path>[,<dir>] to read input file paths from a
file, with optional directory prefix. Also accept (silently ignore)
many more ld64 flags: -add_ast_path, -macos_version_min,
-dependency_info, -map, -stack_size, -sectcreate, -F, -U,
-hidden-l, -no_fixup_chains, -x, -S, -w, -Z, and others.
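
A rough sketch of the `-filelist <path>[,<dir>]` handling, assuming one input path per line and that empty lines are skipped (the latter is an assumption, not confirmed ld64 behaviour). Both function names are hypothetical.

```rust
use std::path::PathBuf;

/// Split a `-filelist` argument of the form `<path>[,<dir>]`.
fn split_filelist_arg(arg: &str) -> (&str, Option<&str>) {
    match arg.split_once(',') {
        Some((path, dir)) => (path, Some(dir)),
        None => (arg, None),
    }
}

/// Turn the file list's contents into input paths, one per non-empty
/// line, prefixing each with `dir` when one was given.
fn expand_filelist(contents: &str, dir: Option<&str>) -> Vec<PathBuf> {
    contents
        .lines()
        .filter(|line| !line.is_empty())
        .map(|line| match dir {
            // Note: PathBuf::join replaces the base if `line` is absolute.
            Some(d) => PathBuf::from(d).join(line),
            None => PathBuf::from(line),
        })
        .collect()
}
```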

Sold suite: 44 pass (was 41).

Signed-off-by: Giles Cope <[email protected]>
Instead of silently ignoring -force_load, add the specified archive
as an input with whole_archive=true. Archive members are correctly
marked as non-optional in the loading pipeline, though the full
whole-archive layout path needs further work for Mach-O.

Signed-off-by: Giles Cope <[email protected]>
Root cause: whole-archive members had their symbols resolved
but data sections were GC'd because resolve_section didn't know
about the whole_archive modifier. Unreferenced sections stayed as
SectionSlot::Unloaded and were skipped during layout activation.

Three fixes:
- Thread whole_archive through ResolvedCommon so resolve_section
  can set must_load=true for all sections from whole-archive members
- Add load_all_defined_symbols in activate() to set DIRECT on all
  defined symbols from whole-archive members (skipping discarded
  sections like __compact_unwind)
- Fix exe symtab to emit N_EXT for external symbols instead of
  marking everything as local

Sold suite: 46 pass (was 44). Unlocks all-load and force-load.

Signed-off-by: Giles Cope <[email protected]>
Wild now handles -help/--help by printing usage info and exiting.
This unblocks the sold-macho/response-file test which invokes
./ld64 @response_file with -help.

Sold suite: 47 pass (was 46).

Signed-off-by: Giles Cope <[email protected]>
Two bugs prevented common symbols (tentative definitions) from working:

1. is_undefined() returned true for common symbols because they have
   N_UNDF type, so they were skipped during symbol registration and
   appeared undefined. Fixed by excluding symbols where is_common()
   is true.

2. as_common() used raw n_desc as shift count for alignment, causing
   shift overflow panics. Fixed by extracting GET_COMM_ALIGN bits
   (bits 8-11 of n_desc) per the Mach-O spec.
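
Both fixes can be sketched as follows, assuming the usual nlist conventions (a common symbol is an external N_UNDF entry with a non-zero n_value holding its size). The `Nlist` struct here is a simplified stand-in, not wild's type.

```rust
const N_TYPE: u8 = 0x0e; // mask for the type bits of n_type
const N_EXT: u8 = 0x01; // external symbol bit
const N_UNDF: u8 = 0x00; // undefined (or common) type

/// Simplified stand-in for a Mach-O nlist_64 entry.
struct Nlist {
    n_type: u8,
    n_desc: u16,
    n_value: u64,
}

/// Tentative definition: external, N_UNDF type, non-zero size in n_value.
fn is_common(sym: &Nlist) -> bool {
    sym.n_type & N_TYPE == N_UNDF && sym.n_type & N_EXT != 0 && sym.n_value != 0
}

/// Undefined means N_UNDF type but *not* a common symbol.
fn is_undefined(sym: &Nlist) -> bool {
    sym.n_type & N_TYPE == N_UNDF && !is_common(sym)
}

/// Alignment in bytes, taken from the GET_COMM_ALIGN field (bits 8-11),
/// rather than shifting by the whole of n_desc.
fn common_alignment(sym: &Nlist) -> u64 {
    let log2 = (sym.n_desc >> 8) & 0x0f;
    1u64 << log2
}
```

Masking to four bits keeps the shift count at 15 or below, so it can no longer overflow.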

Sold suite: 49 pass (was 47). Unlocks common and common-alignment.

Signed-off-by: Giles Cope <[email protected]>
The TLV descriptor offset field for thread_bss symbols was computed
as align_up(tdata_size, 8) + bss_offset, producing offset 8 for a
4-byte tdata section. The correct formula is tdata_size + bss_offset
(no alignment padding), matching what the system linker produces.

Also remove debug file logging from TLS relocation paths and add
TLS-block-relative offset computation for $tlv$init symbols in the
fallback relocation path.

Signed-off-by: Giles Cope <[email protected]>
- Fix exe symtab to check original symbol's N_EXT bit instead of
  assuming all resolved symbols are external. Local symbols (static
  functions) now correctly get n_type=0x0e instead of 0x0f.
- Implement -x flag to strip local symbols from the output symtab.
- Add is_symbol_external() helper that looks up the original input
  object's symbol to determine binding.

Sold suite: 50 pass (was 49). Unlocks x test.

Signed-off-by: Giles Cope <[email protected]>
LC_UUID was only emitted for dylibs. Now emitted for all output
types, using a deterministic UUID derived from the output path.
Required for dlopen, debuggers, and crash reporters.

Sold suite: 51 pass (was 50). Unlocks uuid2 and x tests.

Signed-off-by: Giles Cope <[email protected]>
- Fix DYSYMTAB to properly split local and external symbol counts
  (ilocalsym/nlocalsym/iextdefsym/nextdefsym).
- Sort exe symtab entries: locals before externals (DYSYMTAB requires
  this ordering).
- Fix LC_DYLD_EXPORTS_TRIE offset to point between fixups and symtab
  instead of overlapping with fixups.
- Add empty LC_FUNCTION_STARTS and LC_DATA_IN_CODE load commands
  (required by macOS strip tool for LINKEDIT ordering validation).
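
A small sketch of the locals-before-externals ordering and the matching DYSYMTAB counts. It assumes every entry is a defined symbol (undefined symbols, which would follow the externals, are left out), and `Sym` / `DysymtabCounts` are illustrative types rather than wild's.

```rust
const N_EXT: u8 = 0x01; // external symbol bit in n_type

#[derive(Debug, Clone, PartialEq)]
struct Sym {
    name: &'static str,
    n_type: u8, // e.g. 0x0e = local N_SECT, 0x0f = external N_SECT
}

#[derive(Debug, PartialEq)]
struct DysymtabCounts {
    ilocalsym: u32,
    nlocalsym: u32,
    iextdefsym: u32,
    nextdefsym: u32,
}

/// Reorder the symbol table so locals come first, then report the
/// index/count pairs that LC_DYSYMTAB records for each group.
fn order_symtab(syms: &mut [Sym]) -> DysymtabCounts {
    // Stable sort on a bool key: `false` (local) sorts before `true`.
    syms.sort_by_key(|s| s.n_type & N_EXT != 0);
    let nlocalsym = syms.iter().filter(|s| s.n_type & N_EXT == 0).count() as u32;
    DysymtabCounts {
        ilocalsym: 0,
        nlocalsym,
        iextdefsym: nlocalsym,
        nextdefsym: syms.len() as u32 - nlocalsym,
    }
}
```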

strip compatibility still needs load command reordering (fixups and
exports trie should come before SYMTAB/DYSYMTAB in command list).

Signed-off-by: Giles Cope <[email protected]>
Signed-off-by: Giles Cope <[email protected]>
…ture

Move Strip enum to shared args.rs for cross-platform reuse.
Wire up Mach-O flags to existing platform trait methods:

- -S: Strip::Debug via should_strip_debug()
- -demangle: common.demangle field
- -export_dynamic: should_export_all_dynamic_symbols()
- -dead_strip: should_gc_sections() (now opt-in, matching ld64)
- -exported_symbols_list: export_list_path() with auto-format detection
- -unexported_symbols_list: unexport_list_path() + SymbolDb plumbing
- -compatibility_version/-current_version: emitted in LC_ID_DYLIB
- -bundle: MH_BUNDLE filetype, no LC_MAIN
- -sectcreate: file data parsed and stored (writer integration pending)

ExportList::parse() auto-detects ELF ({sym;}) vs Mach-O (one-per-line)
format. All 138 Mach-O tests pass with zero regressions.

Signed-off-by: Giles Cope <[email protected]>
Add framework_search_paths field to MachOArgs. -F<path> stores paths,
-framework <name> searches <path>/<name>.framework/<name> for .tbd or
dylib and adds it to extra_dylibs. Also implement -compatibility_version,
-current_version, -bundle, and -sectcreate arg parsing.

Enables the sold-macho/framework test (63 passing, was 62).

Signed-off-by: Giles Cope <[email protected]>
Route -needed-l, -weak-l, -reexport-l, -hidden-l through the same
library search logic as -l. This is sufficient for -needed-l (which
just needs the library found and LC_LOAD_DYLIB emitted).

Enables sold-macho/needed-l test (64 passing, was 63).

Signed-off-by: Giles Cope <[email protected]>
Swap loop order from extension-first to directory-first when searching
for libraries, matching ld64's default -search_paths_first behaviour.
This ensures a static library in an earlier search path is found before
a dylib in a later path.

Enables sold-macho/search-paths-first test (65 passing, was 64).

Signed-off-by: Giles Cope <[email protected]>
Fix export list filtering for Mach-O dylibs: the export_list check in
can_export_symbol was guarded by !export_all_dynamic, which is always
true for dylibs. Remove the guard so -exported_symbols_list always
filters.

Wire up -unexported_symbols_list and -unexported_symbol to filter
symbols OUT of the exports trie. Add -exported_symbol (singular) for
inline symbol specification.

Both use the ExportList/MatchRules infrastructure with wildcard support.

Enables sold-macho/exported-symbols-list and
sold-macho/unexported-symbols-list tests (67 passing, was 65).

Signed-off-by: Giles Cope <[email protected]>
Add search_dylibs_first field. Unify the library search into a single
indexed loop that can iterate either paths-first (default) or
extensions-first (-search_dylibs_first). Also add -exported_symbol
and -unexported_symbol inline flags.
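
The two iteration orders can be sketched like this. The function name and the extension preference list (`tbd`, `dylib`, `a`) are assumptions for illustration; only the paths-first versus extensions-first distinction comes from the commit.

```rust
use std::path::{Path, PathBuf};

/// Candidate files for `-l<name>`, in the order they would be probed.
fn library_candidates(name: &str, search_paths: &[&Path], dylibs_first: bool) -> Vec<PathBuf> {
    // Assumed preference order within one directory.
    let exts = ["tbd", "dylib", "a"];
    let file = |dir: &Path, ext: &str| dir.join(format!("lib{name}.{ext}"));
    let mut out = Vec::new();
    if dylibs_first {
        // Extensions-first: every directory is tried for one extension
        // before moving on to the next extension.
        for ext in exts {
            for dir in search_paths {
                out.push(file(dir, ext));
            }
        }
    } else {
        // Paths-first (default): all extensions are tried in the first
        // directory before the second directory is looked at.
        for dir in search_paths {
            for ext in exts {
                out.push(file(dir, ext));
            }
        }
    }
    out
}
```

In paths-first mode a static archive in an earlier directory is probed before a dylib in a later one; with `dylibs_first` set, that order flips.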

Note: -search_dylibs_first only works when specified before -l flags
due to inline library resolution during arg parsing.

Signed-off-by: Giles Cope <[email protected]>
When a .dylib or MH_BUNDLE file is passed as an input, parse its
Mach-O header to extract LC_ID_DYLIB install name and exported symbols
from the exports trie (LC_DYLD_EXPORTS_TRIE or LC_DYLD_INFO). Register
the install name for LC_LOAD_DYLIB emission and symbols for resolution.

Add MachODylib variant to FileKind for proper file type identification.

Enables sold-macho/dylib, data-reloc, weak-def-dylib tests
(70 passing, was 67).

Signed-off-by: Giles Cope <[email protected]>
When -S is not passed, synthesize N_OSO stab symbols in the executable
symbol table pointing to each input object file (with mtime as n_value).
This enables dsymutil and debuggers to find source-level debug info.

Also copy any existing stab symbols from input objects. The LINKEDIT
size estimate accounts for these extra entries.

When -S IS passed (Strip::Debug), stab entries are suppressed.

Enables sold-macho/S test (71 passing, was 70).

Signed-off-by: Giles Cope <[email protected]>
Add DylibLoadKind enum (Normal/Weak/Reexport) and track it per dylib.
-weak-l and -weak_framework emit LC_LOAD_WEAK_DYLIB (0x80000018).
-reexport-l and -reexport_library emit LC_REEXPORT_DYLIB (0x8000001F).
All others emit LC_LOAD_DYLIB as before.
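
As a sketch, the kind-to-command mapping looks like the following. The constants follow <mach-o/loader.h>, where the weak and re-export commands carry the LC_REQ_DYLD bit. The `add_dylib` body is a guess at the dedup behaviour (first registration wins), not the actual helper.

```rust
const LC_REQ_DYLD: u32 = 0x8000_0000;
const LC_LOAD_DYLIB: u32 = 0x0c;
const LC_LOAD_WEAK_DYLIB: u32 = 0x18 | LC_REQ_DYLD;
const LC_REEXPORT_DYLIB: u32 = 0x1f | LC_REQ_DYLD;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DylibLoadKind {
    Normal,
    Weak,
    Reexport,
}

/// Load command id to emit for a dylib of the given kind.
fn load_command_for(kind: DylibLoadKind) -> u32 {
    match kind {
        DylibLoadKind::Normal => LC_LOAD_DYLIB,
        DylibLoadKind::Weak => LC_LOAD_WEAK_DYLIB,
        DylibLoadKind::Reexport => LC_REEXPORT_DYLIB,
    }
}

/// Register a dylib once. Assumption: the first registration wins.
fn add_dylib(dylibs: &mut Vec<(Vec<u8>, DylibLoadKind)>, name: &[u8], kind: DylibLoadKind) {
    if !dylibs.iter().any(|(n, _)| n.as_slice() == name) {
        dylibs.push((name.to_vec(), kind));
    }
}
```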

Refactor extra_dylibs from Vec<Vec<u8>> to Vec<(Vec<u8>, DylibLoadKind)>
with add_dylib() helper for dedup.

Signed-off-by: Giles Cope <[email protected]>
Detailed plan for enabling S_CSTRING_LITERALS section merging in the
Mach-O writer, reusing the existing string_merging.rs infrastructure.

The generic pipeline (detection, dedup, address mapping) already works.
The gap is in macho_writer.rs: writing merged data to output, resolving
relocations into merged sections, and section header accounting.

Signed-off-by: Giles Cope <[email protected]>
Preparatory work for cstring merging:

- Add write_merged_strings_macho() that writes deduplicated bucket data
  to the output buffer using MergedStringStartAddresses for VM->file
  offset mapping. Called after section data copying.
- Add bucket_addresses() accessor to MergedStringStartAddresses.
- Add get_merged_string_output_address fallbacks in apply_relocations
  for both extern and subtractor reloc paths.

The flag (should_merge_sections) remains false. Enabling it causes
widespread failures because many macho_writer.rs code paths assume all
sections have valid addresses via section_resolutions, but MergeStrings
sections get SectionResolution::none(). A systematic audit of all
section_resolutions.address() call sites is needed before enabling.

Signed-off-by: Giles Cope <[email protected]>
Captures research on DWARF debug info on Rust binaries: rustc still on
DWARF 4, no type units, monomorphisations duplicate full DIE trees per
CU. On midnight-node 86% of binary is .debug_*, 70% of interesting
DIEs are duplicates, dedup prize concentrates in subprogram (71%) and
namespace (29%) DIEs not type DIEs. Lists 11 ranked improvements.

Signed-off-by: Giles Cope <[email protected]>
Adds two sections to the design doc:

  * Sneaky idea: link-time DWARF 4 -> 5 upgrade. Walks through the
    real cost of each piece. Header bump alone is cosmetic.
    .debug_line v5 is the only piece that's tractable in isolation
    (5-10 MB win, 1-2 wk, doesn't touch DIE tree). Everything else
    is a full DWARF rewriter, with the same engineering shape
    as cross-CU DIE dedup. Verdict: park as a sub-item of 'build a
    DWARF rewriter'; standalone-doable piece is line-v5.

  * Hard prerequisite: debugger-based integration tests. DWARF
    corruption is silent (gdb prints '<no type>', binary still runs).
    Any rewrite work lands behind a harness that runs addr2line +
    llvm-symbolizer + optionally lldb/gdb on a fixture before/after
    the pass and asserts file/line/function/type-name output matches.

Signed-off-by: Giles Cope <[email protected]>
Pure-Rust harness (gimli + addr2line crate, no external
llvm-symbolizer / addr2line binary) that gates any future DWARF
rewrite against silent corruption.

Two modes:

  * verify <elf> — picks up to 64 unique-named text symbols, runs
    addr2line on each, reports resolution rate. Sanity check that
    DWARF in a single binary actually decodes.

  * compare <before.elf> <after.elf> — runs the same lookups on both
    binaries (matched by symbol name, addresses may differ across
    rewrite passes) and asserts the resolved (function, file, line)
    tuples match. This is the gate.

Validated end-to-end on midnight-node:

  verify decompressed:  60/64 resolved (4 unresolved are crt stubs
                        without DWARF — deregister_tm_clones,
                        register_tm_clones, __do_global_dtors_aux,
                        frame_dummy. Expected.)
  verify compressed:    60/64 resolved (gimli decompresses
                        SHF_COMPRESSED transparently via
                        section.uncompressed_data())
  compare same-file:    0 mismatches in 256 pairs
  compare zstd vs raw:  0 mismatches — proves wild's
                        --compress-debug-sections=zstd is
                        debugger-correct, not just byte-decompressible

Subtle bug caught during build: COMDAT folding produces multiple
symbol-table entries with the same mangled name pointing at
different addresses. Naïve "first iteration finds A, last-write-wins
map finds B" produces false-positive mismatches even on same-file
compare. Fixed by filtering to symbols whose name is unique in the
symbol table.
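
The unique-name filter can be sketched as below. This is an illustrative reimplementation over plain `(name, address)` pairs, not the harness's code.

```rust
use std::collections::HashMap;

/// Keep only symbols whose name occurs exactly once in the table, so a
/// name-keyed lookup can never pick a different folded copy.
fn unique_named<'a>(symbols: &[(&'a str, u64)]) -> HashMap<&'a str, u64> {
    // First pass: count how many entries share each name.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &(name, _) in symbols {
        *counts.entry(name).or_insert(0) += 1;
    }
    // Second pass: drop every name that appeared more than once.
    symbols
        .iter()
        .copied()
        .filter(|(name, _)| counts[name] == 1)
        .collect()
}
```

Dropping ambiguous names shrinks the sample slightly but makes the before/after comparison deterministic.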

This is the prerequisite called out in dwarf-size-plan.md for any
DWARF rewrite work (line v5 upgrade, abbrev dedup, DIE dedup).
Next: drive this harness from wild's integration test suite as an
ExpectAddr2LineMatchesPrelink directive.

Signed-off-by: Giles Cope <[email protected]>
Adds an integration-test directive that asserts the linked binary's
DWARF actually decodes via addr2line/gimli. Used to gate any DWARF
rewrite (compression — already shipped, line v5 upgrade — next,
abbrev dedup, DIE dedup) against silent corruption: gdb prints
'<no type>' instead of crashing on broken DWARF, so byte-level
section assertions don't catch real regressions.

Wire-up:

  * `addr2line = 0.25.1` workspace dep + dev-dep in wild crate.
  * `//#ExpectDwarfResolves:N` directive — asserts at least N text
    symbols (out of up to 64 unique-named ones sampled) resolve via
    addr2line::Loader to a non-empty (function, file, line) tuple.
  * Loader handles SHF_COMPRESSED transparently, so the directive
    works on both compressed and uncompressed wild output.

Applied to the existing compress-debug-sections fixture with N=1
(tiny fixture only really has main as a resolvable symbol; loose
bound is honest). Catches the case where wild's compress pass would
have produced byte-decompressible-but-debugger-broken output.

addr2line crate pulls in gimli 0.32 (its pin); the workspace pins
gimli 0.33 for libwild. Two-version-in-tree is intentional —
addr2line::Loader hides the gimli interaction so we don't need to
type-thread between versions.

Next step (per dwarf-size-plan.md): use this gate to shepherd a
.debug_line v4 → v5 conversion pass.

Signed-off-by: Giles Cope <[email protected]>
Phase 1 of the .debug_line v4 -> v5 upgrade work. Estimates the
size delta on a real ELF before committing to the rewrite.

For each CU's line program:
  * Read the v4 header (include_directories + file_names).
  * Estimate the v5 header size assuming
    DW_FORM_line_strp for paths (4-byte offsets into a shared
    .debug_line_str pool).
  * Aggregate every distinct path string across all CUs into the
    pool and measure post-dedup size.

Run on midnight-node (decompressed, 21,008 CUs, 129 MB .debug_line):

    distinct paths across all CUs: 5,767
    paths pre-dedup:  23,038,943 bytes
    paths post-dedup:    213,537 bytes  (107.89x compression)

    v4 .debug_line:                    129,029,756 bytes
    v5 estimate (headers+prog+strpool): 107,930,507 bytes
    estimated saving: 21,099,249 bytes (16.35%)

Each of the 21k CUs duplicates the same workspace path strings.
Each path appears in ~3,640 CUs on average. The savings are
cross-CU dedup of paths, not header layout shrinkage.

Validates the line v5 upgrade as a worthwhile target. Phase 2
(next): build the actual rewriter as a wild post-pass, gated by
the ExpectDwarfResolves directive. Tactic: emit .debug_line_str
+ rewrite each CU's line program header as v5 + patch
DW_AT_stmt_list in .debug_info to point at the new offsets.

Signed-off-by: Giles Cope <[email protected]>
Standalone tool that takes an ELF, rewrites every CU's .debug_line
program from DWARF 4 to DWARF 5 using DW_FORM_string for paths
(inline NUL-terminated, no .debug_line_str pool yet), patches every
DW_AT_stmt_list attribute in .debug_info to point at the new offset,
replaces .debug_line, shifts subsequent sections, and writes the
result.

End-to-end on midnight-node:

  .debug_line: 129,029,756 → 133,358,116 bytes (+3.35%)
  file:        1,317,229,712 → 1,321,558,072 bytes (+0.33%)
  debugger-roundtrip compare: 256 symbol pairs, 0 mismatches

Size goes UP because v5 inline-string format adds ~10 bytes of
format-descriptor overhead per CU + 2 bytes (address_size,
segment_selector_size) but only saves ~2 bytes per file (no
mtime/length). Phase 2b switches paths to DW_FORM_line_strp with a
shared .debug_line_str pool — recon says that delivers the
projected 21 MB saving (16% off .debug_line).

Phase 2a's value is the infrastructure: gimli-based CU walking,
abbrev-walking to find DW_AT_stmt_list byte position, v5 header
emission with format descriptors, in-place section replacement +
SHDR/ehdr surgery (mirrors elf_compress.rs), and end-to-end
validation against debugger-roundtrip.

Notable correctness moves:
  * v5 file index 0 = primary CU source (from DW_AT_name +
    DW_AT_comp_dir). v5 file 1..N = same as v4 file 1..N. So the
    line-program opcodes (DW_LNS_set_file with operand) need no
    rewriting — numbering preserved.
  * Same trick for include_directories (v5 dir 0 = comp_dir).
  * Line-program opcode bytes copied verbatim — encoding is
    identical between v4 and v5.

Phase 3: lift into libwild::elf_line_v5 + fold into -O1 alongside
--compress-debug-sections=zstd.

Signed-off-by: Giles Cope <[email protected]>
…phase 2b)

Phase 2b switches the path encoding from DW_FORM_string (inline,
phase 2a) to DW_FORM_line_strp referencing a new .debug_line_str
section. This is where the size win comes from: every CU's path
strings get pooled and deduped across the whole binary.
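
A minimal sketch of such a pool, assuming 32-bit DWARF so that each DW_FORM_line_strp reference is a 4-byte offset. `LineStrPool` is a hypothetical name, not the rewriter's actual type.

```rust
use std::collections::HashMap;

/// Deduplicating pool that would become the .debug_line_str contents.
#[derive(Default)]
struct LineStrPool {
    data: Vec<u8>,
    offsets: HashMap<Vec<u8>, u32>,
}

impl LineStrPool {
    /// Return the pool offset of `path`, appending it (NUL-terminated)
    /// the first time it is seen.
    fn intern(&mut self, path: &[u8]) -> u32 {
        if let Some(&offset) = self.offsets.get(path) {
            return offset;
        }
        let offset = self.data.len() as u32;
        self.data.extend_from_slice(path);
        self.data.push(0);
        self.offsets.insert(path.to_vec(), offset);
        offset
    }
}
```

Every CU that mentions the same path then stores the same 4-byte offset instead of its own copy of the string.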

Adds the .debug_line_str section to the output ELF: extends the
.shstrtab with ".debug_line_str\0", appends a SHDR entry with
SHT_PROGBITS + SHF_MERGE+SHF_STRINGS + sh_addralign=1+sh_entsize=1,
shifts middle sections forward by the size delta, relocates the
SHDR table to the new file end, bumps e_shoff + e_shnum.

Result on midnight-node:

  .debug_line:     129,029,756 → 107,846,424 bytes  (-16.42%)
  .debug_line_str: 0 → 769,252 bytes (NEW section, 10,039 unique paths)
  combined:        129,029,756 → 108,615,676 bytes  (-15.82%)
  file:            1,317,229,712 → 1,296,815,712 bytes  (-20.4 MB)
  debugger-roundtrip compare: 256 symbol pairs, 0 mismatches

The pool came out larger than the recon estimate (769 KB vs 214 KB)
because the rewriter dedupes primary CU source paths too, and most
CUs have a distinct source file. Still close to the 21 MB recon
projection and well within the same order of magnitude.

Phase 2a's inline-string emitter (emit_v5_line_program) is kept
behind #[allow(dead_code)] for reference. The pooled emitter
(emit_v5_line_program_pooled) is the active path.

Phase 3: lift this into libwild::elf_line_v5, fold into -O1
alongside --compress-debug-sections=zstd. The same DW_AT_stmt_list
patching + SHDR-add infrastructure can later be reused for
.debug_addr indirection (DWARF 5 DW_FORM_addrx) which would
shrink .debug_info further.

Signed-off-by: Giles Cope <[email protected]>
Adds the args-side scaffolding for an opt-level mechanism:

  * `opt_level: u8` field on ElfArgs.
  * `-O<N>` parser handler now stores N (clamped 0..3) AND
    activates the underlying flags:
      * `--compress-debug-sections=zstd` (already wired in
        elf_compress.rs) for N >= 1.
      * `--upgrade-debug-line=v5` (new, body next commit) for N >= 1.
    Implicit activations only RAISE — explicit
    `--compress-debug-sections=none` set later overrides -O.
  * `DebugLineUpgrade` enum (None|V5) + `upgrade_debug_line` field.
  * `--upgrade-debug-line=<none|v5>` direct override flag.
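
A sketch of the raise-only behaviour, assuming arguments are applied in command-line order. `opt_level`, `upgrade_debug_line` and `DebugLineUpgrade` are named in the commit; the `Args` struct, `Compression` enum and method names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    None,
    Zstd,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DebugLineUpgrade {
    None,
    V5,
}

#[derive(Debug)]
struct Args {
    opt_level: u8,
    compress_debug: Compression,
    upgrade_debug_line: DebugLineUpgrade,
}

impl Args {
    fn new() -> Self {
        Args {
            opt_level: 0,
            compress_debug: Compression::None,
            upgrade_debug_line: DebugLineUpgrade::None,
        }
    }

    /// Handle `-O<N>`: clamp to 0..=3 and switch features on for N >= 1.
    /// It never switches anything off, so `-O0` leaves earlier flags alone.
    fn apply_opt_level(&mut self, n: u8) {
        self.opt_level = n.min(3);
        if self.opt_level >= 1 {
            self.compress_debug = Compression::Zstd;
            self.upgrade_debug_line = DebugLineUpgrade::V5;
        }
    }

    /// Handle an explicit `--compress-debug-sections=<mode>`; when it
    /// comes after `-O<N>` it overrides the implicit setting.
    fn set_compress_debug(&mut self, mode: Compression) {
        self.compress_debug = mode;
    }
}
```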

Reasoning for the v5 default at -O1:

  experiment/debug-line-rewrite proved that .debug_line v4 -> v5
  with cross-CU path pooling via a new .debug_line_str section
  shrinks .debug_line by 16.42% (20.4 MB on midnight-node), is
  debugger-correct (debugger-roundtrip compare: 0 mismatches), and
  composes cleanly with the existing zstd compression pass.
  Combined -O1 = compress + line v5 should take debug builds to
  ~24% of original size while preserving full debugger fidelity.

Phase 3b (next commit): port the rewriter from experiments/debug-
line-rewrite/src/main.rs into libwild::elf_line_v5 as a SizedOutput
post-pass that runs before elf_compress. Hook into elf_writer::write.
Add an integration test fixture asserting both -O1 effects.

Signed-off-by: Giles Cope <[email protected]>
Lifts the experiments/debug-line-rewrite/ phase 2b algorithm into
libwild::elf_line_v5 as a SizedOutput post-pass. Hooks into
elf_writer::write between build-id emission and the compress pass
so the new .debug_line_str section gets zstd-compressed too when
both passes are active under -O1.

Architecture:
  * upgrade_debug_line(sized_output, mode) is the entry point.
    No-op when mode == None.
  * Internally builds the rewritten ELF in a temp Vec<u8>
    (sub-optimal but correct) then memcpies back into the
    SizedOutput's mmap and calls set_final_size with the new
    (smaller) length.
  * Bails loudly if the rewrite would GROW the file — that
    invariant should always hold on rust binaries because
    cross-CU path-pool dedup outweighs the new-section overhead.

Test coverage: extended compress-debug-sections fixture with a
new `opt1` config that uses `-Wl,-O1` and asserts all of:
  * `.debug_line_str` exists (only line v5 creates it).
  * `.debug_line_str` is SHF_COMPRESSED (zstd ran after line v5).
  * DWARF still resolves (debugger correctness gated).

Notable port differences from the experiment:
  * gimli 0.33 (workspace pin) vs 0.31 (experiment).
    UnitSectionOffset became a tuple struct in 0.33; access via .0.
  * libwild::error error type instead of String.
  * Operates on &mut SizedOutput.

Combined effect on substrate-class debug builds (per the experiment
runs on midnight-node, validated by debugger-roundtrip compare):
  raw input:                 1,317 MB
  + line v5 (alone):         1,297 MB  (-1.5%)
  + zstd compress (alone):     322 MB  (-75.7%)
  + both (-O1):                ~316 MB (~-76%) — slightly better
                                       than zstd alone because
                                       compressed line_str is denser

Signed-off-by: Giles Cope <[email protected]>
…t fixture

  - elf_line_v5 used to attempt the rewrite on any line program;
    on a v5 input, dirs use DW_FORM_line_strp which our string_value
    call can't resolve and we'd bail with 'dir string_value' error.
    Now skip CUs whose lp_header.version() != 4 — they're already
    upgraded, nothing to do.

  - The test fixture compress-debug-sections needs to have v4
    line programs to exercise the upgrade path, so add -gdwarf-4
    to the compiler args. Modern gcc/clang default to v5.

Signed-off-by: Giles Cope <[email protected]>
Wild's own output places the SHDR table early in the file (right
after ehdr), with .debug_line and .shstrtab later. The current
elf_line_v5 layout logic was written assuming the gcc/ld convention
of SHDR-at-end (which is what midnight-node has: it was linked with
wild acting as ld, yet its section ordering ended up gcc-style for
reasons not yet understood).

For wild's own output, the SHDR-early layout makes the existing
'shift everything after .shstrtab including SHDR table' approach
incorrect — SHDR shouldn't shift, but everything between SHDR and
end of file (including .debug_line, .shstrtab) needs careful
sequencing as we insert new bytes.

Until phase 4 generalises the layout-shift to handle both
orderings, gracefully skip the rewrite and emit the original bytes
when SHDR-precedes-shstrtab is detected. Prints a one-line warning
to stderr so users know -O1's line v5 piece was no-op'd.

Test fixture's opt1 config drops the
ExpectCompressedSection:.debug_line_str assertion (the rewrite
was skipped, so .debug_line_str doesn't exist on wild's output yet)
and asserts the compress piece + DWARF resolves instead. Re-add
the line_str assertion once phase 4 lands.

Signed-off-by: Giles Cope <[email protected]>
Generalises apply_rewrite to handle both ELF layouts:
  * gcc/ld-style: SHDR table at end of file (handled by phase 3b).
  * wild-style: SHDR table early in the file (new in this commit).

Implementation: every insert/replace is modelled as an Op
(position, delete, insert_bytes). All four operations are
collected:

  A. Replace .debug_line bytes with new_debug_line.
  B. Insert new_debug_line_str after .debug_line.
  C. Append ".debug_line_str\0" at end of .shstrtab.
  D. Append new SHDR entry at end of the SHDR table.

Ops sorted by old-file position. Splice pass streams old bytes
through to new, applying inserts/replaces in order. Each old
offset maps to new via cumulative delta of ops with lower
position. SHDR entries' sh_offset fields get remapped using the
same map_offset function.
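
The op list and splice pass described above can be sketched as follows. This is a minimal illustration, not wild's actual code: the field names follow the (position, delete, insert_bytes) tuple from the text, while the `splice` function name and the non-overlap assumption are hypothetical.

```rust
/// One edit against the old file: at `position`, drop `delete` bytes and
/// emit `insert_bytes` in their place. A pure insert has `delete == 0`.
#[derive(Debug, Clone)]
pub struct Op {
    pub position: usize,
    pub delete: usize,
    pub insert_bytes: Vec<u8>,
}

/// Stream `old` into a new buffer, applying `ops` in old-file order.
/// Assumes the ops do not overlap and all lie within `old`.
pub fn splice(old: &[u8], mut ops: Vec<Op>) -> Vec<u8> {
    // Stable sort: ops sharing a position keep their input order.
    ops.sort_by_key(|op| op.position);
    let mut out = Vec::with_capacity(old.len());
    let mut cursor = 0;
    for op in &ops {
        // Copy the untouched bytes between the previous op and this one.
        out.extend_from_slice(&old[cursor..op.position]);
        out.extend_from_slice(&op.insert_bytes);
        // Skip whatever this op replaces.
        cursor = op.position + op.delete;
    }
    // Copy through the tail after the last op.
    out.extend_from_slice(&old[cursor..]);
    out
}
```

Because the new buffer is built in a single forward pass, the same sorted list can later be used to translate any old offset by summing the size changes of the ops in front of it.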

Op D's entry bytes include the new .debug_line_str's sh_offset,
which depends on WHERE op D sits relative to .debug_line (op D
may be BEFORE .debug_line in wild's layout). Cycle broken with a
small conditional adjustment.

Re-adds ExpectCompressedSection:.debug_line_str to the opt1
fixture — now that the rewrite applies on wild's own output,
that section should exist and be SHF_COMPRESSED.

Signed-off-by: Giles Cope <[email protected]>
SizedOutput is a fixed-size mmap — we can't grow it post-hoc. On
tiny binaries the v5 format overhead (+ new SHDR entry + new
section name) exceeds the cross-CU path-pool savings, so the
rewrite ends up larger than the input. Was bailing with an error;
now logs a one-line warning and returns the original output
unchanged.

On real workloads (substrate-class) the shrinkage is always
comfortable — midnight-node: 21 MB saved. Tiny fixtures with few
distinct paths are the edge case.

Drops the .debug_line_str assertion from the opt1 fixture config
because the fixture is now correctly small enough to trigger the
skip path. Real-world -O1 link of a multi-CU rust binary will
still activate the full rewrite; the experiment/debug-line-rewrite
benchmark on midnight-node proves the savings case.

Signed-off-by: Giles Cope <[email protected]>
… 4b)

Growing SHDR in place shifted PT_LOAD content forward (wild's own
layout puts SHDR early in the file, with executable sections
after it) but PHDR p_offset fields weren't updated, so the kernel
loaded garbage when trying to exec the resulting binary:

  /path/to/build-script-build: cannot execute binary file

Fix: move the SHDR table to the end of the file (new ehdr.e_shoff
points there) instead of extending it in place. Old SHDR bytes
become unused padding — still in the file at their old location
but not referenced by ehdr.e_shoff or any PHDR. No in-middle byte
shifts, so PT_LOAD segments' file offsets don't change.

Simplifies the code: op D is no longer part of the splice-ops list.
It's applied after the main splice by appending to new_data. This
also removes the gnarly 'compute new_line_str_offset when op D
might be before or after .debug_line_offset' cycle that phase 4a
had to work around.
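
A rough sketch of the "move the SHDR table to the end" idea, assuming a little-endian ELF64 file held in a `Vec<u8>`. The header field offsets come from the ELF64 layout; the function name `relocate_shdr_to_end`, the 8-byte alignment padding, and the error handling are illustrative assumptions rather than what wild does.

```rust
// ELF64 header field offsets.
const E_SHOFF: usize = 0x28; // u64: file offset of the section header table
const E_SHENTSIZE: usize = 0x3A; // u16: size of one section header
const E_SHNUM: usize = 0x3C; // u16: number of section headers

pub fn read_u16(buf: &[u8], at: usize) -> u16 {
    u16::from_le_bytes([buf[at], buf[at + 1]])
}

pub fn read_u64(buf: &[u8], at: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(b)
}

/// Copy the existing section header table to the end of `elf`, append
/// `new_entry`, and repoint e_shoff / e_shnum. No byte before the old end
/// of file moves, so PT_LOAD file offsets stay valid. The old table is
/// left in place as dead padding.
pub fn relocate_shdr_to_end(elf: &mut Vec<u8>, new_entry: &[u8]) -> Option<()> {
    if elf.len() < 0x40 {
        return None; // too short to hold an ELF64 header
    }
    let shoff = read_u64(elf, E_SHOFF) as usize;
    let entsize = read_u16(elf, E_SHENTSIZE) as usize;
    let shnum = read_u16(elf, E_SHNUM) as usize;
    if new_entry.len() != entsize {
        return None;
    }
    // Validate everything before mutating the buffer.
    let new_shnum = u16::try_from(shnum + 1).ok()?;
    let table_end = shoff.checked_add(entsize.checked_mul(shnum)?)?;
    let old_table = elf.get(shoff..table_end)?.to_vec();

    // Keep the relocated table 8-byte aligned (an assumption of this sketch).
    while elf.len() % 8 != 0 {
        elf.push(0);
    }
    let new_shoff = elf.len() as u64;
    elf.extend_from_slice(&old_table);
    elf.extend_from_slice(new_entry);

    elf[E_SHOFF..E_SHOFF + 8].copy_from_slice(&new_shoff.to_le_bytes());
    elf[E_SHNUM..E_SHNUM + 2].copy_from_slice(&new_shnum.to_le_bytes());
    Some(())
}
```

The sketch ignores extended section numbering (e_shnum == 0) and big-endian files.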

Signed-off-by: Giles Cope <[email protected]>
Bug: after elf_line_v5 shrank the output via set_final_size, the
subsequent elf_compress pass saw the full mmap buffer (sized_output.
out.len()) rather than line_v5's logical size. Its trailing 'copy
through' step then read stale original bytes past the new end,
corrupting the final ELF (ehdr.e_shoff carried over from wild's
original output — exec rejected the binary at load time).

Fix:
  * Add SizedOutput::effective_len(), which returns final_size_override
    if set, else the full buffer length. Sibling of set_final_size.
  * elf_compress's entry bounds the buffer slice to effective_len()
    before calling compress_zstd_in_buffer. All reads (including
    the trailing copy) now see only valid bytes from the preceding
    pass.
  * Relax compress_zstd_in_buffer's generic bound to accept
    &mut [u8] directly (was &mut B with B: DerefMut<Target = [u8]>).
    Simpler for the new single caller; unit tests already used
    &mut Vec<u8>, which coerces to &mut [u8].

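
The logical-length idea can be illustrated with a stand-in type. The real SizedOutput wraps an mmap; this sketch uses a `Vec<u8>` to stay self-contained, and `valid_bytes_mut` is a hypothetical helper name for "the slice a later pass should see".

```rust
/// Stand-in for a fixed-size output buffer with an optional logical size.
pub struct SizedOutput {
    pub out: Vec<u8>,
    final_size_override: Option<usize>,
}

impl SizedOutput {
    pub fn new(out: Vec<u8>) -> Self {
        Self { out, final_size_override: None }
    }

    /// Record that only the first `size` bytes are meaningful.
    pub fn set_final_size(&mut self, size: usize) {
        assert!(size <= self.out.len(), "cannot grow a fixed-size buffer");
        self.final_size_override = Some(size);
    }

    /// Logical length: the override if one was set, else the full buffer.
    pub fn effective_len(&self) -> usize {
        self.final_size_override.unwrap_or(self.out.len())
    }

    /// The slice a later pass should operate on, so it never reads the
    /// stale bytes that remain past the logical end.
    pub fn valid_bytes_mut(&mut self) -> &mut [u8] {
        let len = self.effective_len();
        &mut self.out[..len]
    }
}
```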
Signed-off-by: Giles Cope <[email protected]>
When an insert op (delete=0) sits exactly at a section's sh_offset,
the section must shift past the inserted bytes, not collide with
them. Previous map_offset always broke on op.position == p,
producing a colliding offset.

This bug was invisible on midnight-node and most binaries because no
section starts exactly at debug_line_end. Proc-macro .so files
do (some section abuts .debug_line in the file), so cargo's call
to dlopen the .so for proc-macro expansion failed with:

    error[E0786]: found invalid metadata files for crate sqlx_macros
      = note: no '.rustc' section in '...libsqlx_macros-...so'

The .rustc section's sh_offset was wrong by exactly the size of
.debug_line_str (the insert at debug_line_end), so rustc read garbage.

Fix: map_offset now takes MapKind:
  * Before — used when computing OUR new inserts' starting positions
    (.debug_line_str's sh_offset = where op B begins inserting).
  * After — used when remapping existing SHDR entries; sections at
    op.position with delete=0 shift past the insert.

The .debug_line replacement (op A, delete>0) and .shstrtab append
(op C, position past .shstrtab's start) aren't affected by the
distinction.
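
The Before/After distinction can be sketched as below. The names Op, MapKind and map_offset come from the text; the exact signature, the requirement that `ops` be sorted by position, and the loop structure are assumptions of this sketch.

```rust
#[derive(Debug, Clone)]
pub struct Op {
    pub position: usize,
    pub delete: usize,
    pub insert_bytes: Vec<u8>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MapKind {
    /// Where our own insert at this position begins.
    Before,
    /// Where pre-existing content at this position ends up.
    After,
}

/// Map old-file offset `p` to the new file. `ops` must be sorted by position.
pub fn map_offset(ops: &[Op], p: usize, kind: MapKind) -> usize {
    let mut delta: isize = 0;
    for op in ops {
        if op.position > p {
            break;
        }
        if op.position == p {
            // Only a pure insert, queried with After, pushes `p` forward.
            // A replacement at `p` maps to the start of the replaced region.
            let pure_insert = op.delete == 0;
            if !(pure_insert && kind == MapKind::After) {
                break;
            }
        }
        delta += op.insert_bytes.len() as isize - op.delete as isize;
    }
    (p as isize + delta) as usize
}
```

With an insert sitting exactly at a section's sh_offset, `Before` yields the offset of the inserted bytes and `After` yields the offset just past them, which is the value an existing SHDR entry needs.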

Signed-off-by: Giles Cope <[email protected]>
Refactors Op + MapKind + map_offset from apply_rewrite-local to
module-level items so they're unit-testable, then adds pragmatic
regression tests for the phase 4 bugs caught during integration:

  * map_offset Before/After semantics (bug 3, phase 4d): three
    pure-function tests covering
      - insert-at-query-position: Before returns p, After returns
        p + insert.len
      - replacement-at-query-position: both kinds return same value
        (the replaced region's start)
      - cumulative delta across multiple prior ops

  * rewrite_moves_shdr_to_end_on_early_shdr_layout (bug 1,
    phase 4b): hand-rolls a minimal ELF with wild-style layout
    (SHDR early in file, PT_LOAD content after SHDR). Runs
    apply_rewrite; asserts new SHDR lives at end of file AND
    PT_LOAD bytes are byte-identical at their original offset
    (i.e., PHDR p_offsets remain valid).

  * rewrite_remaps_section_at_debug_line_end_past_the_insert
    (bug 3 at the ELF level, phase 4d): builds an ELF with an
    'edge' section whose sh_offset equals debug_line_end — the
    shape of proc-macro .so files that triggered the original
    regression. Runs apply_rewrite; asserts the edge section's
    new sh_offset is past .debug_line_str, not colliding with it.

  * rewrite_grew_skip_is_integration_only (bug 4, phase 4-):
    ignored placeholder documenting that direct unit coverage
    would need a real DWARF 4 line program to drive rewrite_buffer
    end-to-end. The opt1 integration fixture already exercises
    the skip path (tiny C program).

Bug 2 (effective_len respect in compress) is left to the existing
fixture coverage — direct unit coverage would require faking a
SizedOutput with a File + mmap and isn't worth the scaffolding.

Also includes a `synthetic_elf_parses_via_object_crate` sanity
check that the test helper produces ELFs which the object crate
can walk, so any future failure in the deeper tests surfaces as
a rewrite bug rather than a test-fixture bug.

7 tests in elf_line_v5::tests pass (1 intentionally ignored).
Full lib suite: 214 passed, 0 failed.

compress-debug-sections.c reformatted by clang-format (pre-commit
hook required) — no functional change.

Signed-off-by: Giles Cope <[email protected]>
…nker#5)

Flips DebugCompression::default() from None to Zstd. Release builds
emit no .debug_* sections so the post-write pass is a no-op for them;
debug builds get SHF_COMPRESSED zstd automatically without needing
-O1 or an explicit flag. Users opting out pass --compress-debug-sections=none.

Updates the docstrings on compress_debug_sections and opt_level to
reflect that the baseline (no -O) already includes debug compression;
-O1 now adds the line v4→v5 upgrade on top.

Signed-off-by: Giles Cope <[email protected]>
…linker#3)

Walks every .debug_info CU header, hashes the abbrev table each
references, and collapses identical tables into a new .debug_abbrev.
Each CU's 4-byte debug_abbrev_offset is patched to the deduped
location. The section shrinks by the total dedup delta; subsequent
section offsets and the SHDR position shift up accordingly.

Gated on --dedup-debug-abbrev (opt-in) or implicitly via -O1/-O2/-O3
(bundled with the existing line-v5 + zstd passes). Bails gracefully
when:
  - .debug_info or .debug_abbrev is missing
  - a CU uses DWARF 64 (unit_length = 0xffffffff)
  - any PT_LOAD extends past .debug_abbrev (would desync p_offset)
  - the dedup would not shrink the section

Self-contained; no DIE attributes are rewritten and abbrev codes are
preserved verbatim, so debuggers see exactly the same content they
would have seen against the pre-dedup CU.
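
A simplified sketch of the collapse-and-patch step. It assumes the caller has already split .debug_abbrev into per-table byte ranges, keys the map on the table bytes themselves rather than on a separate digest, and handles only 32-bit little-endian DWARF. Both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Collapse identical abbreviation tables. `tables` holds
/// (old_offset, table_bytes) for every table some CU references.
/// Returns the rebuilt section and a map old_offset -> new_offset.
pub fn dedup_abbrev_tables(tables: &[(u32, &[u8])]) -> (Vec<u8>, HashMap<u32, u32>) {
    let mut new_section = Vec::new();
    let mut seen: HashMap<&[u8], u32> = HashMap::new();
    let mut remap = HashMap::new();
    for &(old_off, bytes) in tables {
        // First occurrence is appended; later identical tables reuse it.
        let new_off = *seen.entry(bytes).or_insert_with(|| {
            let off = new_section.len() as u32;
            new_section.extend_from_slice(bytes);
            off
        });
        remap.insert(old_off, new_off);
    }
    (new_section, remap)
}

/// Patch the 4-byte debug_abbrev_offset in one CU header. Returns None
/// (bail, leave the input alone) for DWARF 64 or unknown versions.
pub fn patch_cu_abbrev_offset(cu: &mut [u8], remap: &HashMap<u32, u32>) -> Option<()> {
    let unit_length = u32::from_le_bytes([*cu.get(0)?, *cu.get(1)?, *cu.get(2)?, *cu.get(3)?]);
    if unit_length == 0xffff_ffff {
        return None; // DWARF 64 escape value
    }
    let version = u16::from_le_bytes([*cu.get(4)?, *cu.get(5)?]);
    // v2..v4: offset follows the version; v5 adds unit_type + address_size first.
    let field = match version {
        2..=4 => 6,
        5 => 8,
        _ => return None,
    };
    let slot = cu.get_mut(field..field + 4)?;
    let old = u32::from_le_bytes([slot[0], slot[1], slot[2], slot[3]]);
    let new = *remap.get(&old)?;
    slot.copy_from_slice(&new.to_le_bytes());
    Some(())
}
```

Since only the offset field changes and the table bytes are copied verbatim, abbrev codes inside each CU keep their meaning.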

4 unit tests cover ULEB scanning including DW_FORM_implicit_const.

Signed-off-by: Giles Cope <[email protected]>
…e -O1 doc

Dwarf-size-plan table rows for items wild-linker#3 (.debug_abbrev hash-and-collapse)
and wild-linker#5 (default --compress-debug-sections=zstd) updated from pending
to shipped. The 'What wild does today' status table gets matching ✓
entries and the wild-linker#5 prioritisation paragraph drops the 'cheapest win'
framing.

libwild's opt_level docstring moves .debug_abbrev dedup from the
placeholder -O2 bucket to the -O1 line where it now actually runs.

Signed-off-by: Giles Cope <[email protected]>
Introduces LinkerKind::WildOpt(u8) plus three wrapper scripts at
benchmarks/runner/bin/wild-O{1,2,3} that prepend -O<N> to every
invocation and rewrite the version banner (Wild-O<N> …) so the
runner keeps the columns visually distinct. The reporter gets
progressively darker greens for each level so 'same family, more
work' reads at a glance.

ELF-only — libwild's -O flag parser lives in args::elf. Mach-O
benches filter the wrappers out via supports_platform.

4 new parser/platform unit tests cover the banner round-trip, the
1..=3 level bounds (so -O0 stays plain Wild rather than being
mis-tagged as WildOpt(0)), and the ELF-only platform restriction.

BENCHMARKING.md gets a new subsection documenting the wrappers,
including a sample invocation with four wild columns on the
ryzen-9955hx matrix.

Signed-off-by: Giles Cope <[email protected]>
NixOS (and upstream binutils) print this form without a vendor
infix; the Ubuntu-only prefix rejected it. Factored strip_gnu_ld_banner
so it tries the Ubuntu / Debian / plain forms in order and returns
an optional vendor tag for the variant field.
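
One way such a helper could look. The three prefixes match the banners that Ubuntu, Debian, and unbranded binutils builds print for `ld --version`; the return shape (version text plus optional vendor tag) is an assumption based on the description above, not the runner's actual signature.

```rust
/// Try the known `GNU ld` banner forms in order. On a match, return the
/// text after the prefix (normally the version) and an optional vendor tag.
pub fn strip_gnu_ld_banner(line: &str) -> Option<(&str, Option<&'static str>)> {
    const FORMS: [(&str, Option<&str>); 3] = [
        ("GNU ld (GNU Binutils for Ubuntu) ", Some("Ubuntu")),
        ("GNU ld (GNU Binutils for Debian) ", Some("Debian")),
        // Plain form, as printed by upstream binutils and NixOS.
        ("GNU ld (GNU Binutils) ", None),
    ];
    FORMS.iter().find_map(|(prefix, vendor)| {
        line.strip_prefix(*prefix)
            .map(|rest| (rest.trim(), *vendor))
    })
}
```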

Signed-off-by: Giles Cope <[email protected]>
Same source tree as rust-hello-world but built with
profile.release.debug = true instead of the default debug = 1
(line-tables-only). Exposes wild's -O1 passes (debug_line v5,
debug_abbrev dedup, compress) on a workload small enough to
rebuild in seconds.

Signed-off-by: Giles Cope <[email protected]>
Commit 6d25034 (wip wasm LTO) replaced the original
'Wild was compiled without linker-plugin support, but LTO inputs
were detected' with a wasm-specific message that bled into ELF
paths. The four elf/x86_64/linker-plugin-lto fixtures assert the
generic line as their fallback — they've been red on giles-mac
since that WIP landed.

Leaves the function platform-agnostic (it's called from both
symbol_db and elf.rs for any LTO input), with a comment warning
future edits that the wording is asserted by tests.

Signed-off-by: Giles Cope <[email protected]>
Wild's Mach-O writer excludes the entire __DWARF segment from output
(libwild/src/macho.rs:2234) — rust Mach-O binaries don't carry
any .debug_* sections; debug info stays in the .o files and
dsymutil reads them via N_OSO stabs to build a .dSYM bundle.

There is no __debug_str in wild's Mach-O output to dedup; the
premise 'wild passes per-CU string pools through to dsymutil
un-deduped' isn't accurate. Any merging would have to happen
inside dsymutil (out of wild's scope) or in a hypothetical
embedded-DWARF workflow (rustc doesn't do this on Mach-O).

Strikes the matrix row and rewrites the prioritisation note with
the corrected understanding so future readers don't spend time
on a phantom optimisation.

Signed-off-by: Giles Cope <[email protected]>
text-stub-library 0.9 doesn't model the 'reexported-libraries:'
section that umbrella frameworks (ApplicationServices, Carbon,
CoreServices) use to point at the nested frameworks they re-export.
Linking '-framework ApplicationServices' on current macOS SDKs left
wild unable to resolve every ColorSync symbol (e.g. CGDisplay*UUID*
used by modern winit) because the chain from ApplicationServices to
ColorSync wasn't followed.

Adds a minimal raw-YAML scan for the section (parse_reexported_libraries)
and an origin-path-based SDK-root derivation (resolve_tbd_for_install_name)
that maps install-name to the matching .tbd. collect_tbd_symbols and
collect_tbd_symbols_with_directives now recurse through the chain with
a visited set to avoid loops.
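
The chain-following with a visited set can be sketched on a toy model. `Tbd` here is a hypothetical stand-in for a parsed stub (its real counterpart carries far more), the map key plays the role of a resolved install name, and an explicit stack replaces the recursion the commit describes.

```rust
use std::collections::{HashMap, HashSet};

/// Toy model of a parsed .tbd: exported symbols plus re-exported libraries.
pub struct Tbd {
    pub symbols: Vec<String>,
    pub reexports: Vec<String>,
}

/// Collect symbols from `root` and everything it transitively re-exports.
/// The visited set makes cycles and diamond-shaped chains harmless.
pub fn collect_symbols(root: &str, stubs: &HashMap<String, Tbd>) -> HashSet<String> {
    let mut visited = HashSet::new();
    let mut out = HashSet::new();
    let mut stack = vec![root.to_string()];
    while let Some(name) = stack.pop() {
        if !visited.insert(name.clone()) {
            continue; // already walked this library
        }
        // An unresolvable re-export is skipped rather than treated as fatal.
        let Some(tbd) = stubs.get(&name) else { continue };
        out.extend(tbd.symbols.iter().cloned());
        stack.extend(tbd.reexports.iter().cloned());
    }
    out
}
```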

Result: bevy-dylib links on macOS again. First re-measurement:
- wall-clock: 407 ms wild vs 279 ms ld64 (1.46x) — down from 274x at
  session-start-2026-04-20, corrected from today's earlier claim.
- memory:    45 MiB wild vs 462 MiB ld64 (10x less).
- output:    40.9 MiB wild vs 55.7 MiB ld64 (about 27% smaller).

5 unit tests cover the parser happy path, no-section fallback,
stop-at-next-top-level-key bound, SDK-root derivation, and graceful
failure when the origin path has no recognisable marker.

Signed-off-by: Giles Cope <[email protected]>
samply showed ~12% of wild's bevy-dylib wall-clock burning in
core::hash::sip::Hasher::write, entirely under two call sites:
- link_framework's fresh_symbols std::HashSet<Vec<u8>>
- collect_tbd_symbols_impl's recursion via the new umbrella-chain fix
Both consume keys produced from trusted on-disk TBD content; SipHash's
DoS resistance is wasted cost.

Switch both (and args.dylib_symbols, the final union) to a type-aliased
hashbrown::HashSet<Vec<u8>, foldhash::fast::FixedState>. Same pattern
as the memory logs record for ResolutionByNameCache. Trait default in platform.rs
and sdk_cache I/O sites migrated in lockstep; cache file schema is
hasher-independent so no bump needed.

Re-profile confirms sip::Hasher::write no longer appears in the top
20; new cold-link hotspot is the cache-miss yaml parse path (~16%).
Bench (10×8 samples, M-series host):
  bevy-dylib: 407 → 370 ms (-9%, ratio 1.46× → 1.39×).

Signed-off-by: Giles Cope <[email protected]>
…dation)

Tier-1 of wild's incremental plan needs to skip re-parsing clean
inputs' symbol tables. Since wild already mmaps every input, the
cache should live in the same shape — a fixed-layout repr(C) blob
that can be mmap'd and interpreted in place, zero copy.

This commit lands the format + round-trip plumbing only. It's NOT
yet wired into load_inputs; landing the storage layer first gives
the next session a green foundation to slot a cache-lookup into
the loader without fusing format-churn into a correctness-critical
change.

Format (schema v1):
- 48-byte CacheHeader: magic 'WILDPI01', schema u32, flags u32,
  n_symbols u64, symbols_off u64, names_off u64, names_len u64.
- Symbol region: n_symbols × 24-byte CachedSymbol records
  (name_off, name_len, hash, flags, kind, _pad).
- Names blob: concatenated (deduped) symbol name bytes.
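
The layout above could be declared roughly as follows. The header field types are given in the text; the individual field widths inside CachedSymbol are an assumption chosen to add up to the stated 24 bytes, and little-endian storage is assumed. `parse_header` is a hypothetical copying validator, not the zero-copy CacheView.

```rust
use std::mem::size_of;

pub const MAGIC: [u8; 8] = *b"WILDPI01";
pub const SCHEMA: u32 = 1;

#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct CacheHeader {
    pub magic: [u8; 8],
    pub schema: u32,
    pub flags: u32,
    pub n_symbols: u64,
    pub symbols_off: u64,
    pub names_off: u64,
    pub names_len: u64,
}

#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct CachedSymbol {
    pub name_off: u32, // widths below are assumed, not taken from the source
    pub name_len: u32,
    pub hash: u64,
    pub flags: u32,
    pub kind: u8,
    pub _pad: [u8; 3],
}

// Compile-time checks that the layout matches the documented sizes.
const _: () = assert!(size_of::<CacheHeader>() == 48);
const _: () = assert!(size_of::<CachedSymbol>() == 24);

fn u32_at(b: &[u8], at: usize) -> u32 {
    let mut x = [0u8; 4];
    x.copy_from_slice(&b[at..at + 4]);
    u32::from_le_bytes(x)
}

fn u64_at(b: &[u8], at: usize) -> u64 {
    let mut x = [0u8; 8];
    x.copy_from_slice(&b[at..at + 8]);
    u64::from_le_bytes(x)
}

/// Validate magic, schema, and region bounds before trusting the blob.
pub fn parse_header(buf: &[u8]) -> Option<CacheHeader> {
    if buf.len() < size_of::<CacheHeader>() || buf[..8] != MAGIC {
        return None;
    }
    let h = CacheHeader {
        magic: MAGIC,
        schema: u32_at(buf, 8),
        flags: u32_at(buf, 12),
        n_symbols: u64_at(buf, 16),
        symbols_off: u64_at(buf, 24),
        names_off: u64_at(buf, 32),
        names_len: u64_at(buf, 40),
    };
    if h.schema != SCHEMA {
        return None;
    }
    let sym_bytes = h.n_symbols.checked_mul(size_of::<CachedSymbol>() as u64)?;
    let sym_end = h.symbols_off.checked_add(sym_bytes)?;
    let names_end = h.names_off.checked_add(h.names_len)?;
    let len = buf.len() as u64;
    (sym_end <= len && names_end <= len).then_some(h)
}
```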

CacheView borrows the backing &[u8] and yields CachedEntry
iterators whose name slices point directly into the mmap: no
allocation, no lifetime gymnastics beyond the input buffer's own
borrow. CacheBuilder writes the same layout into a Vec<u8>.

10 unit tests lock down round-trip, name-dedup, zero-copy-ness
(checks name pointers lie inside the buffer), bad-magic / bad-
schema / truncated / misaligned / unknown-kind rejection, and
the empty cache case.

Signed-off-by: Giles Cope <[email protected]>
Two helpers layered on the last commit's format scaffolding, still
without any hot-path wiring:

- CacheBuilder::write_to(&Path) — write-tmp-and-rename so
  concurrent readers never observe a torn cache. Matches the
  pattern wild's other side-cars (.wild-hashes, sdk-*.bin) use.

- cache_path_for_input(&Path) -> Option<PathBuf> — derives the
  on-disk cache location for a given input. Dir is
  $XDG_CACHE_HOME/wild/parsed-inputs or $HOME/.cache/wild/parsed-inputs.
  Filename is blake3(absolute_input_path_bytes ‖ SCHEMA).hex + .wildpi
  so two inputs sharing a basename (cargo's libfoo-<hash>.rlib
  convention puts them in different build dirs) never collide.
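
The write-tmp-and-rename pattern can be sketched with the standard library alone. `write_atomically` is a hypothetical free function standing in for CacheBuilder::write_to; the blake3-based path derivation is not shown because it needs an external crate.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to `<dest>.tmp`, flush it, then rename over `dest`.
/// A rename within one directory replaces the target in a single step on
/// the usual filesystems, so readers see either the old file or the new
/// one, never a partial write.
pub fn write_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // "foo.wildpi" -> "foo.wildpi.tmp", kept in the same directory so the
    // rename does not cross filesystems.
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    if let Some(dir) = dest.parent() {
        fs::create_dir_all(dir)?;
    }
    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // make sure the data is on disk before publishing
    }
    fs::rename(&tmp, dest)
}
```

Two writers racing on the same destination would share the same temporary name in this sketch; a production version would likely add a per-process suffix.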

Two new tests:
- write_to_atomically_persists_and_reloads — round-trip via an
  actual temp file, asserts the .wildpi.tmp sidecar is consumed
  by rename.
- cache_path_derivation_is_collision_free_for_same_basename —
  feeds two identically-named inputs from different dirs through
  cache_path_for_input, asserts distinct outputs (and that the
  same input twice returns the same path).

Still not wired into the loader — the next session will hook
CacheView into load_inputs / read_symbols_for_group behind the
existing WILD_INCREMENTAL_DEBUG gate, with a cold-vs-skip
byte-identical canary before flipping default.

Signed-off-by: Giles Cope <[email protected]>
Hand-off doc for the next session that picks up tier-1 incremental
linking. The storage layer (parsed_input_cache) landed in 01f2236 +
f900907; this is the spec for wiring it into the loader:

- SymbolSink trait extraction + TeeSink that duplicates into a
  CacheBuilder.
- Cache-replay path in load_symbols_from_file (before the object-
  crate dispatch).
- WILD_INCREMENTAL_DEBUG gate; WILD_INCREMENTAL_PARSE_SKIP_CANARY
  runs both paths and asserts they produce structurally identical
  symbol streams before any skip goes live.
- Lifetime contract: cache mmap lives alongside input mmaps in
  FileLoader, same arena.
- Bevy-dylib measurement script with target of at least 100 ms
  off the 370 ms cold link.
- Enumerated risk list the canary has to catch (weak version
  strings, COMDAT, TLS flags, hidden/protected visibility, local
  ordering).
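
A sketch of what the sink extraction and tee could look like. The trait name SymbolSink and the TeeSink idea come from the list above; the method signature, the RecordingSink helper, and the `streams_match` canary check are hypothetical and far simpler than real symbol records.

```rust
/// Anything that can receive parsed symbols one at a time.
/// The (name, flags) shape is a placeholder for the real record.
pub trait SymbolSink {
    fn push(&mut self, name: &[u8], flags: u32);
}

/// Forwards every symbol to two sinks, e.g. the normal symbol table and
/// a cache builder, so one parse feeds both.
pub struct TeeSink<'a, A: SymbolSink, B: SymbolSink> {
    pub primary: &'a mut A,
    pub secondary: &'a mut B,
}

impl<A: SymbolSink, B: SymbolSink> SymbolSink for TeeSink<'_, A, B> {
    fn push(&mut self, name: &[u8], flags: u32) {
        self.primary.push(name, flags);
        self.secondary.push(name, flags);
    }
}

/// Records the stream in order; used here for both sides of the canary.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RecordingSink(pub Vec<(Vec<u8>, u32)>);

impl SymbolSink for RecordingSink {
    fn push(&mut self, name: &[u8], flags: u32) {
        self.0.push((name.to_vec(), flags));
    }
}

/// Canary check: the fresh parse and the cache replay must yield the
/// same symbols in the same order before a skip is allowed.
pub fn streams_match(fresh: &RecordingSink, replayed: &RecordingSink) -> bool {
    fresh == replayed
}
```

Comparing order as well as content matters because the risk list includes local symbol ordering.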

Signed-off-by: Giles Cope <[email protected]>