Mac + Wasm support (PRs welcome!) by gilescope · Pull Request #1815 · wild-linker/wild

gilescope · 2026-04-06T06:07:46Z

Trying to see how much work is required to get basic mac support going. PR has grown a fair bit since the original "hello world" — there's now a wasm front and a sizeable optimiser/test/benchmark surface alongside the Mach-O work. Tested on M3 Max; CI / other arches very welcome.

If anyone wants to join in, PRs welcome from you or your AI friend.

⚠️ Use at your own risk. Passing the test suite is a floor, not a ceiling. This is in-progress, fast-moving work — the suite catches a lot but nowhere near everything a real shipping linker needs. Expect bugs, miscompiles, and surprising edge cases, especially outside the workloads called out below. Don't use this for anything you can't afford to debug.

Mach-o support

arm64 path: hello-world (C + Rust), allocator + threading, midnight-node (~152 MB rust binary) all link and run.
-ld64_compat mode produces output that's byte-for-byte identical to ld64's on every fixture in the compat suite (15/15 passing).
ld64-style flags fanned in: -map, -filelist, -mark_dead_strippable_dylib / MH_DEAD_STRIPPABLE_DYLIB, -U <sym>, response files, library search-paths-first ordering, etc.
__compact_unwind reverse-edge GC (rust-hello → 1.04 MB / 68 imports vs ld64's 70).
__DATA_CONST / __DATA split (fixes the historical zerocopy build-script BSS bug).
LTO scaffolding via lto/macho_liblto, shared lto/cache, llvm-tools discovery.
Codesign (ad-hoc) writer, SDK discovery cache.

Wasm support (+optimiser)

76/224 lld-wasm fixtures passing (was zero); covers GC, dedup, PIC (static + static64), TLS, debug-info passthrough, LTO error paths, etc.
LTO dispatch tiers: per-module lower → batch merge → unified-LLVM pipeline.
wilt, a pure-Rust post-link optimiser (separate crate), wired in behind -O<N> / --strip-*; debug-info preservation tiers (None/Names/Lines/Full). It's a drop-in replacement for wasm-opt.

Test fixtures + harness

~654 new files under wild/tests/ (lld-macho, lld-wasm, sold-macho, ld64-compat, plus integration tests).
cargo test is currently clean across all 43 binaries.

Benchmarks

New benchmarks/macos-arm64.toml + --platform = "macho" filter so the existing runner works on darwin.
Workloads: c-hello-world, rust-hello-world, ripgrep, rust-analyzer, bevy-dylib, wild itself, rust-analyzer-incremental, midnight-node.
Benchmark wrapper wild-ld64-compat so the same harness compares "wild default" vs "wild ld64-compat" vs ld64 head-to-head.
See BENCHMARKING.md §"Benchmarking wild on macOS" and benchmarks/macos-arm64.md for the matrix.

Status vs ld64

Correctness / parity: in -ld64_compat mode, byte-identical output on the fixture suite. In default mode, the hot rust workloads link and run cleanly; midnight-node (152 MB) is the largest verified binary so far. Some niche fixtures still need work (e.g. arm64-thunks references ___nan, only exported on x86_64 in current Apple SDKs — ld64 fails identically there, so it's an SDK quirk not a wild bug; tracked as ignored in the suite). Test-suite green ≠ bug-free; treat as alpha.
Performance: earlier baseline (2026-04-20) had wild 1.3×–274× slower than ld64, with the gap super-linear in image size; bevy-dylib was the worst offender. The recent "perf: make as fast as ld64" work has closed most of that for the workloads in benchmarks/macos-arm64.toml; see the SVGs in benchmarks/images/macos-arm64/ (regenerable via the runner) for the latest numbers. Still room to improve on the very-large end.

Punchlist

Passes known mach-o arm tests (21/23 lld-macho with 2 ignored, ~200 sold-macho).
Benchmark harness + matrix on darwin.
Default-mode link + run for non-trivial rust binaries (incl. midnight-node).
Make wild competitive with ld64 on the bench matrix (latest commit; needs broader confirmation on cold-cache and very large binaries).
Ensure well structured (lots of macho/wasm code currently lives next to existing ELF code; some refactoring still owed before this is reviewable as smaller PRs).
Keep optimising performance (especially super-linear cases like bevy-dylib).
Wasm coverage push (76→full lld-wasm suite, pending more relocation/section-type work).
Harden against bugs the test suite isn't yet catching (more fuzzing, real-world workload sweeps, link-then-run validation in CI).
Read and review all code.

davidlattimore · 2026-04-06T23:39:29Z

Hey! I've only had a very high-level look, since there's quite a lot here. If you'd like to help out with porting to Mac, I'd suggest discussing on the Wild zulip. There's an existing thread "Mach-o support". Martin is leading the porting effort for Mac. I think it might be getting to the point where there might be scope for multiple people to work concurrently, but definitely check with him first to avoid duplicated efforts and / or hard-to-resolve merge conflicts.

I'm not sure what Martin's thoughts are on integration tests, but that's definitely something we'll need soon and is perhaps more likely to parallelise with other mac work. I see you did some work in this area, which is great. It looks like you opted for a completely separate integration test runner. I think we'd want to actually extend our existing integration test runner to support running mac tests.

The write_pageoff12 function extracted the access size shift from bits 31:30 of the instruction, which works for integer LDR/STR but is wrong for SIMD/FP loads. For ldr q0 (128-bit), bits 31:30 are 00 (interpreted as byte access = no shift) but the actual scale is 16 bytes (shift=4). This caused the page offset to be 16x too large, making ldr q0 read from the wrong address. This was the root cause of the println! crash: the BufWriter init constant (a 16-byte value loaded via ldr q0) was fetched from a wrong offset, producing garbage in the stdout mutex/RefCell state. Found by adapting LLVM lld's arm64-relocs.s test which exercises exactly this relocation pattern. Signed-off-by: Giles Cope <[email protected]>
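A minimal sketch of the scale selection this commit describes, written from the AArch64 load/store (unsigned immediate) encoding. The names `pageoff12_shift` and `encode_pageoff12` are hypothetical and are not wild's actual code; the point is only that the 128-bit SIMD/FP form has size bits `00` and must be detected separately.

```rust
/// Returns the right-shift to apply to a page offset before it is written
/// into the imm12 field of `insn`. Hypothetical helper, not wild's code.
pub fn pageoff12_shift(insn: u32) -> u32 {
    // Load/store register (unsigned immediate): bits 29:27 = 111, bits 25:24 = 01.
    let is_load_store = (insn & 0x3b00_0000) == 0x3900_0000;
    if !is_load_store {
        // e.g. ADD (immediate): the page offset is stored unscaled.
        return 0;
    }
    let size = insn >> 30; // bits 31:30
    let is_simd_fp = (insn & (1 << 26)) != 0; // V bit
    let opc_high = (insn & (1 << 23)) != 0; // high bit of opc
    if is_simd_fp && size == 0 && opc_high {
        4 // 128-bit Q register: scale is 16 bytes
    } else {
        size // 1, 2, 4 or 8 byte access
    }
}

/// Patches the low 12 bits of `target` into imm12 (bits 21:10) of `insn`.
pub fn encode_pageoff12(insn: u32, target: u64) -> u32 {
    let shift = pageoff12_shift(insn);
    let imm12 = ((target & 0xfff) >> shift) as u32;
    (insn & !(0xfff << 10)) | (imm12 << 10)
}
```

With this check, `ldr q0, [x0]` (`0x3DC00000`) gets a shift of 4, while `ldr b0, [x0]` (`0x3D400000`), which shares the same size bits, keeps a shift of 0.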

The LINKEDIT segment size estimation was too small for binaries with many symbols, causing the output file to be truncated. This made codesign fail ("main executable failed strict validation"), leaving binaries unsigned, which macOS kills with SIGKILL on execution. Increased the estimate to account for chained fixups data and longer average symbol names. Signed-off-by: Giles Cope <[email protected]>

Import 76 ARM64-relevant assembly tests from LLVM lld's MachO test suite. These provide comprehensive coverage of relocations, stubs, thunks, TLS, compact unwind, dead stripping, and ObjC features. Test runner assembles each .s file with clang, links with Wild, and validates the output with codesign. 21 tests pass, 2 ignored. Signed-off-by: Giles Cope <[email protected]>

Import 134 shell tests from bluewhalesystems/sold (archived). Tests compile C/C++ via clang, link with Wild via --ld-path=./ld64, and verify output. 36 pass, 98 ignored (categorized by reason). Signed-off-by: Giles Cope <[email protected]>

- Fix write_dylib_symtab hardcoding n_sect=1 for all symbols. Extract parse_section_ranges() for proper section lookup.
- Error when an explicitly-specified entry symbol (-e) is not found, instead of silently succeeding.
- Propagate entry_symbol_address errors in Mach-O writer.
- Un-ignore 5 sold tests: pagezero-size3, entry, objc, eh-frame, bind-at-load.

Sold suite: 41 pass (was 36).

Signed-off-by: Giles Cope <[email protected]>

Implement -filelist <path>[,<dir>] to read input file paths from a file, with optional directory prefix. Also accept (silently ignore) many more ld64 flags: -add_ast_path, -macos_version_min, -dependency_info, -map, -stack_size, -sectcreate, -F, -U, -hidden-l, -no_fixup_chains, -x, -S, -w, -Z, and others. Sold suite: 44 pass (was 41). Signed-off-by: Giles Cope <[email protected]>

Instead of silently ignoring -force_load, add the specified archive as an input with whole_archive=true. Archive members are correctly marked as non-optional in the loading pipeline, though the full whole-archive layout path needs further work for Mach-O. Signed-off-by: Giles Cope <[email protected]>

Root cause: whole-archive archive members had their symbols resolved but data sections were GC'd because resolve_section didn't know about the whole_archive modifier. Unreferenced sections stayed as SectionSlot::Unloaded and were skipped during layout activation.

Three fixes:

- Thread whole_archive through ResolvedCommon so resolve_section can set must_load=true for all sections from whole-archive members
- Add load_all_defined_symbols in activate() to set DIRECT on all defined symbols from whole-archive members (skipping discarded sections like __compact_unwind)
- Fix exe symtab to emit N_EXT for external symbols instead of marking everything as local

Sold suite: 46 pass (was 44). Unlocks all-load and force-load.

Signed-off-by: Giles Cope <[email protected]>

Wild now handles -help/--help by printing usage info and exiting. This unblocks the sold-macho/response-file test which invokes ./ld64 @response_file with -help. Sold suite: 47 pass (was 46). Signed-off-by: Giles Cope <[email protected]>

Two bugs prevented common symbols (tentative definitions) from working:

1. is_undefined() returned true for common symbols because they have N_UNDF type. Fixed by excluding symbols where is_common() is true. This caused common symbols to be skipped during symbol registration, making them appear undefined.
2. as_common() used raw n_desc as shift count for alignment, causing shift overflow panics. Fixed by extracting GET_COMM_ALIGN bits (bits 8-11 of n_desc) per the Mach-O spec.

Sold suite: 49 pass (was 47). Unlocks common and common-alignment.

Signed-off-by: Giles Cope <[email protected]>
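The two fixes can be sketched as follows. This is an illustration under stated assumptions, not wild's code: the `Nlist` struct is a simplified stand-in for a 64-bit Mach-O symbol table entry, and the constants follow the conventional `<mach-o/nlist.h>` values.

```rust
// Constants as conventionally defined in <mach-o/nlist.h>.
const N_STAB: u8 = 0xe0; // any of these bits set: debugging (stab) entry
const N_TYPE: u8 = 0x0e; // mask for the type bits
const N_EXT: u8 = 0x01; // external symbol bit
const N_UNDF: u8 = 0x00; // undefined type

/// Simplified stand-in for nlist_64 (illustrative, not wild's type).
pub struct Nlist {
    pub n_type: u8,
    pub n_desc: u16,
    pub n_value: u64,
}

impl Nlist {
    fn has_undf_type(&self) -> bool {
        (self.n_type & N_STAB) == 0 && (self.n_type & N_TYPE) == N_UNDF
    }

    /// A common symbol is external, has N_UNDF type, and carries its size
    /// (non-zero) in n_value.
    pub fn is_common(&self) -> bool {
        self.has_undf_type() && (self.n_type & N_EXT) != 0 && self.n_value != 0
    }

    /// Fix 1: commons share the N_UNDF type, so exclude them explicitly.
    pub fn is_undefined(&self) -> bool {
        self.has_undf_type() && !self.is_common()
    }

    /// Fix 2: the alignment exponent lives in bits 8-11 of n_desc
    /// (GET_COMM_ALIGN), so the shift count is at most 15.
    pub fn common_alignment(&self) -> u64 {
        1u64 << ((self.n_desc >> 8) & 0x0f)
    }
}
```

Masking to four bits bounds the shift count at 15, so a `u64` shift cannot overflow whatever else is stored in `n_desc`.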

The TLV descriptor offset field for thread_bss symbols was computed as align_up(tdata_size, 8) + bss_offset, producing offset 8 for a 4-byte tdata section. The correct formula is tdata_size + bss_offset (no alignment padding), matching what the system linker produces. Also remove debug file logging from TLS relocation paths and add TLS-block-relative offset computation for $tlv$init symbols in the fallback relocation path. Signed-off-by: Giles Cope <[email protected]>
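As a worked instance of the arithmetic above, a small sketch comparing the old and corrected formulas. The function names are hypothetical and chosen only for this illustration.

```rust
/// Rounds `value` up to a multiple of `align` (a power of two).
pub fn align_up(value: u64, align: u64) -> u64 {
    (value + align - 1) & !(align - 1)
}

/// Old behaviour described in the commit: pad tdata to 8 bytes first.
pub fn thread_bss_offset_old(tdata_size: u64, bss_offset: u64) -> u64 {
    align_up(tdata_size, 8) + bss_offset
}

/// Corrected behaviour: no alignment padding between tdata and bss.
pub fn thread_bss_offset_fixed(tdata_size: u64, bss_offset: u64) -> u64 {
    tdata_size + bss_offset
}
```

For a 4-byte tdata section and the first thread_bss symbol (`bss_offset` of 0), the old formula yields 8 and the corrected one yields 4.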

- Fix exe symtab to check original symbol's N_EXT bit instead of assuming all resolved symbols are external. Local symbols (static functions) now correctly get n_type=0x0e instead of 0x0f.
- Implement -x flag to strip local symbols from the output symtab.
- Add is_symbol_external() helper that looks up the original input object's symbol to determine binding.

Sold suite: 50 pass (was 49). Unlocks x test.

Signed-off-by: Giles Cope <[email protected]>

LC_UUID was only emitted for dylibs. Now emitted for all output types, using a deterministic UUID derived from the output path. Required for dlopen, debuggers, and crash reporters. Sold suite: 51 pass (was 50). Unlocks uuid2 and x tests. Signed-off-by: Giles Cope <[email protected]>

- Fix DYSYMTAB to properly split local and external symbol counts (ilocalsym/nlocalsym/iextdefsym/nextdefsym).
- Sort exe symtab entries: locals before externals (DYSYMTAB requires this ordering).
- Fix LC_DYLD_EXPORTS_TRIE offset to point between fixups and symtab instead of overlapping with fixups.
- Add empty LC_FUNCTION_STARTS and LC_DATA_IN_CODE load commands (required by macOS strip tool for LINKEDIT ordering validation).

strip compatibility still needs load command reordering (fixups and exports trie should come before SYMTAB/DYSYMTAB in command list).

Signed-off-by: Giles Cope <[email protected]>

…ture

Move Strip enum to shared args.rs for cross-platform reuse. Wire up Mach-O flags to existing platform trait methods:

- -S: Strip::Debug via should_strip_debug()
- -demangle: common.demangle field
- -export_dynamic: should_export_all_dynamic_symbols()
- -dead_strip: should_gc_sections() (now opt-in, matching ld64)
- -exported_symbols_list: export_list_path() with auto-format detection
- -unexported_symbols_list: unexport_list_path() + SymbolDb plumbing
- -compatibility_version/-current_version: emitted in LC_ID_DYLIB
- -bundle: MH_BUNDLE filetype, no LC_MAIN
- -sectcreate: file data parsed and stored (writer integration pending)

ExportList::parse() auto-detects ELF ({sym;}) vs Mach-O (one-per-line) format.

All 138 Mach-O tests pass with zero regressions.

Signed-off-by: Giles Cope <[email protected]>

Add framework_search_paths field to MachOArgs. -F<path> stores paths, -framework <name> searches <path>/<name>.framework/<name> for .tbd or dylib and adds it to extra_dylibs. Also implement -compatibility_version, -current_version, -bundle, and -sectcreate arg parsing. Enables the sold-macho/framework test (63 passing, was 62). Signed-off-by: Giles Cope <[email protected]>

Route -needed-l, -weak-l, -reexport-l, -hidden-l through the same library search logic as -l. This is sufficient for -needed-l (which just needs the library found and LC_LOAD_DYLIB emitted). Enables sold-macho/needed-l test (64 passing, was 63). Signed-off-by: Giles Cope <[email protected]>

Swap loop order from extension-first to directory-first when searching for libraries, matching ld64's default -search_paths_first behaviour. This ensures a static library in an earlier search path is found before a dylib in a later path. Enables sold-macho/search-paths-first test (65 passing, was 64). Signed-off-by: Giles Cope <[email protected]>

Fix export list filtering for Mach-O dylibs: the export_list check in can_export_symbol was guarded by !export_all_dynamic, but export_all_dynamic is always true for dylibs, so the check never ran for them. Remove the guard so -exported_symbols_list always filters. Wire up -unexported_symbols_list and -unexported_symbol to filter symbols OUT of the exports trie. Add -exported_symbol (singular) for inline symbol specification. Both use the ExportList/MatchRules infrastructure with wildcard support. Enables sold-macho/exported-symbols-list and sold-macho/unexported-symbols-list tests (67 passing, was 65). Signed-off-by: Giles Cope <[email protected]>

Add search_dylibs_first field. Unify the library search into a single indexed loop that can iterate either paths-first (default) or extensions-first (-search_dylibs_first). Also add -exported_symbol and -unexported_symbol inline flags. Note: -search_dylibs_first only works when specified before -l flags due to inline library resolution during arg parsing. Signed-off-by: Giles Cope <[email protected]>
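A rough sketch of how a single indexed loop can produce either search order. Everything here is an assumption made for illustration: the function name `library_candidates`, the `lib<name>.<ext>` naming, and the extension list with its order are not taken from wild's source.

```rust
use std::path::{Path, PathBuf};

/// Extensions tried for `-l<name>`; the list and its order are assumptions.
const EXTENSIONS: [&str; 3] = ["tbd", "dylib", "a"];

/// Lists candidate files in the order they would be probed.
/// paths-first (default): every extension in dir 0, then dir 1, ...
/// extensions-first (-search_dylibs_first): every dir for ext 0, then ext 1, ...
pub fn library_candidates(dirs: &[&Path], name: &str, dylibs_first: bool) -> Vec<PathBuf> {
    let n_dirs = dirs.len();
    let n_exts = EXTENSIONS.len();
    let mut out = Vec::with_capacity(n_dirs * n_exts);
    // One loop; only the index decomposition differs between the two modes.
    for i in 0..n_dirs * n_exts {
        let (dir_idx, ext_idx) = if dylibs_first {
            (i % n_dirs, i / n_dirs)
        } else {
            (i / n_exts, i % n_exts)
        };
        out.push(dirs[dir_idx].join(format!("lib{}.{}", name, EXTENSIONS[ext_idx])));
    }
    out
}
```

In the default mode a static library in an earlier directory is probed before a dylib in a later one; with `dylibs_first` set, every directory is probed for one extension before the next extension is tried.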

When a .dylib or MH_BUNDLE file is passed as an input, parse its Mach-O header to extract LC_ID_DYLIB install name and exported symbols from the exports trie (LC_DYLD_EXPORTS_TRIE or LC_DYLD_INFO). Register the install name for LC_LOAD_DYLIB emission and symbols for resolution. Add MachODylib variant to FileKind for proper file type identification. Enables sold-macho/dylib, data-reloc, weak-def-dylib tests (70 passing, was 67). Signed-off-by: Giles Cope <[email protected]>

When -S is not passed, synthesize N_OSO stab symbols in the executable symbol table pointing to each input object file (with mtime as n_value). This enables dsymutil and debuggers to find source-level debug info. Also copy any existing stab symbols from input objects. The LINKEDIT size estimate accounts for these extra entries. When -S IS passed (Strip::Debug), stab entries are suppressed. Enables sold-macho/S test (71 passing, was 70). Signed-off-by: Giles Cope <[email protected]>

Add DylibLoadKind enum (Normal/Weak/Reexport) and track it per dylib. -weak-l and -weak_framework emit LC_LOAD_WEAK_DYLIB (0x80000018). -reexport-l and -reexport_library emit LC_REEXPORT_DYLIB (0x8000001F). All others emit LC_LOAD_DYLIB as before. Refactor extra_dylibs from Vec<Vec<u8>> to Vec<(Vec<u8>, DylibLoadKind)> with add_dylib() helper for dedup. Signed-off-by: Giles Cope <[email protected]>
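A minimal sketch of the shape described here. The enum and helper names come from the commit message, but the bodies are guesses: in particular, the "first registration wins" dedup policy is an assumption. The command values follow `<mach-o/loader.h>`, where the weak and re-export commands are `0x18` and `0x1f` OR-ed with `LC_REQ_DYLD` (`0x80000000`).

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DylibLoadKind {
    Normal,
    Weak,
    Reexport,
}

/// Flag bit from <mach-o/loader.h>: dyld must understand the command.
const LC_REQ_DYLD: u32 = 0x8000_0000;

impl DylibLoadKind {
    /// Load command number emitted for a dylib of this kind.
    pub fn load_command(self) -> u32 {
        match self {
            DylibLoadKind::Normal => 0x0c,                 // LC_LOAD_DYLIB
            DylibLoadKind::Weak => 0x18 | LC_REQ_DYLD,     // LC_LOAD_WEAK_DYLIB
            DylibLoadKind::Reexport => 0x1f | LC_REQ_DYLD, // LC_REEXPORT_DYLIB
        }
    }
}

/// Registers a dylib once per install name. Assumption: the first
/// registration wins and later duplicates are ignored.
pub fn add_dylib(
    dylibs: &mut Vec<(Vec<u8>, DylibLoadKind)>,
    install_name: &[u8],
    kind: DylibLoadKind,
) {
    if !dylibs.iter().any(|(name, _)| name.as_slice() == install_name) {
        dylibs.push((install_name.to_vec(), kind));
    }
}
```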

Detailed plan for enabling S_CSTRING_LITERALS section merging in the Mach-O writer, reusing the existing string_merging.rs infrastructure. The generic pipeline (detection, dedup, address mapping) already works. The gap is in macho_writer.rs: writing merged data to output, resolving relocations into merged sections, and section header accounting. Signed-off-by: Giles Cope <[email protected]>

Preparatory work for cstring merging:

- Add write_merged_strings_macho() that writes deduplicated bucket data to the output buffer using MergedStringStartAddresses for VM->file offset mapping. Called after section data copying.
- Add bucket_addresses() accessor to MergedStringStartAddresses.
- Add get_merged_string_output_address fallbacks in apply_relocations for both extern and subtractor reloc paths.

The flag (should_merge_sections) remains false. Enabling it causes widespread failures because many macho_writer.rs code paths assume all sections have valid addresses via section_resolutions, but MergeStrings sections get SectionResolution::none(). A systematic audit of all section_resolutions.address() call sites is needed before enabling.

Signed-off-by: Giles Cope <[email protected]>

Captures research on DWARF debug info on Rust binaries: rustc still on DWARF 4, no type units, monomorphisations duplicate full DIE trees per CU. On midnight-node 86% of binary is .debug_*, 70% of interesting DIEs are duplicates, dedup prize concentrates in subprogram (71%) and namespace (29%) DIEs not type DIEs. Lists 11 ranked improvements. Signed-off-by: Giles Cope <[email protected]>

Adds two sections to the design doc:

* Sneaky idea: link-time DWARF 4 -> 5 upgrade. Walks through the real cost of each piece. Header bump alone is cosmetic. .debug_line v5 is the only piece that's tractable in isolation (5-10 MB win, 1-2 wk, doesn't touch DIE tree). Everything else is a full DWARF rewriter, the same engineering shape as cross-CU DIE dedup. Verdict: park as a sub-item of 'build a DWARF rewriter'; standalone-doable piece is line-v5.
* Hard prerequisite: debugger-based integration tests. DWARF corruption is silent (gdb prints '<no type>', binary still runs). Any rewrite work lands behind a harness that runs addr2line + llvm-symbolizer + optionally lldb/gdb on a fixture before/after the pass and asserts file/line/function/type-name output matches.

Signed-off-by: Giles Cope <[email protected]>

Pure-Rust harness (gimli + addr2line crate, no external llvm-symbolizer / addr2line binary) that gates any future DWARF rewrite against silent corruption.

Two modes:

* verify <elf>: picks up to 64 unique-named text symbols, runs addr2line on each, reports resolution rate. Sanity check that DWARF in a single binary actually decodes.
* compare <before.elf> <after.elf>: runs the same lookups on both binaries (matched by symbol name, addresses may differ across rewrite passes) and asserts the resolved (function, file, line) tuples match. This is the gate.

Validated end-to-end on midnight-node:

* verify decompressed: 60/64 resolved (the 4 unresolved are crt stubs without DWARF: deregister_tm_clones, register_tm_clones, __do_global_dtors_aux, frame_dummy. Expected.)
* verify compressed: 60/64 resolved (gimli decompresses SHF_COMPRESSED transparently via section.uncompressed_data())
* compare same-file: 0 mismatches in 256 pairs
* compare zstd vs raw: 0 mismatches, which proves wild's --compress-debug-sections=zstd is debugger-correct, not just byte-decompressible

Subtle bug caught during build: COMDAT folding produces multiple symbol-table entries with the same mangled name pointing at different addresses. Naïve "first iteration finds A, last-write-wins map finds B" produces false-positive mismatches even on same-file compare. Fixed by filtering to symbols whose name is unique in the symbol table.

This is the prerequisite called out in dwarf-size-plan.md for any DWARF rewrite work (line v5 upgrade, abbrev dedup, DIE dedup).

Next: drive this harness from wild's integration test suite as an ExpectAddr2LineMatchesPrelink directive.

Signed-off-by: Giles Cope <[email protected]>
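The unique-name filter mentioned in this commit can be sketched in a few lines. This is an illustrative stand-in, not the harness's code: symbols are modelled as plain `(name, address)` pairs and the function name `unique_named` is hypothetical.

```rust
use std::collections::HashMap;

/// Keeps only symbols whose name occurs exactly once, so a name-keyed
/// comparison between two binaries cannot pair up different folded copies.
pub fn unique_named<'a>(symbols: &[(&'a str, u64)]) -> Vec<(&'a str, u64)> {
    // First pass: count how often each name appears.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for (name, _) in symbols {
        *counts.entry(*name).or_insert(0) += 1;
    }
    // Second pass: keep the entries whose name is unambiguous.
    symbols
        .iter()
        .copied()
        .filter(|(name, _)| counts.get(name).copied() == Some(1))
        .collect()
}
```

Dropping ambiguous names costs some coverage, but it removes the false mismatches that arise when two lookups pick different entries that share a name.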

Adds an integration-test directive that asserts the linked binary's DWARF actually decodes via addr2line/gimli. Used to gate any DWARF rewrite (compression, already shipped; line v5 upgrade, next; abbrev dedup; DIE dedup) against silent corruption: gdb prints '<no type>' instead of crashing on broken DWARF, so byte-level section assertions don't catch real regressions.

Wire-up:

* `addr2line = 0.25.1` workspace dep + dev-dep in wild crate.
* `//#ExpectDwarfResolves:N` directive: asserts at least N text symbols (out of up to 64 unique-named ones sampled) resolve via addr2line::Loader to a non-empty (function, file, line) tuple.
* Loader handles SHF_COMPRESSED transparently, so the directive works on both compressed and uncompressed wild output.

Applied to the existing compress-debug-sections fixture with N=1 (tiny fixture only really has main as a resolvable symbol; loose bound is honest). Catches the case where wild's compress pass would have produced byte-decompressible-but-debugger-broken output.

addr2line crate pulls in gimli 0.32 (its pin); the workspace pins gimli 0.33 for libwild. Two-version-in-tree is intentional: addr2line::Loader hides the gimli interaction so we don't need to type-thread between versions.

Next step (per dwarf-size-plan.md): use this gate to shepherd a .debug_line v4 → v5 conversion pass.

Signed-off-by: Giles Cope <[email protected]>

Phase 1 of the .debug_line v4 -> v5 upgrade work. Estimates the size delta on a real ELF before committing to the rewrite.

For each CU's line program:

* Read the v4 header (include_directories + file_names).
* Estimate the v5 header size assuming DW_FORM_line_strp for paths (4-byte offsets into a shared .debug_line_str pool).
* Aggregate every distinct path string across all CUs into the pool and measure post-dedup size.

Run on midnight-node (decompressed, 21,008 CUs, 129 MB .debug_line):

* distinct paths across all CUs: 5,767
* paths pre-dedup: 23,038,943 bytes
* paths post-dedup: 213,537 bytes (107.89x compression)
* v4 .debug_line: 129,029,756 bytes
* v5 estimate (headers+prog+strpool): 107,930,507 bytes
* estimated saving: 21,099,249 bytes (16.35%)

Each of the 21k CUs duplicates the same workspace path strings. Each path appears in ~3,640 CUs on average. The savings are cross-CU dedup of paths, not header layout shrinkage.

Validates the line v5 upgrade as a worthwhile target.

Phase 2 (next): build the actual rewriter as a wild post-pass, gated by the ExpectDwarfResolves directive. Tactic: emit .debug_line_str + rewrite each CU's line program header as v5 + patch DW_AT_stmt_list in .debug_info to point at the new offsets.

Signed-off-by: Giles Cope <[email protected]>
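The pre-dedup versus post-dedup measurement can be sketched as below. This is not the recon tool itself: it assumes the per-CU path strings have already been extracted (the real tool reads them from v4 line program headers), and `PoolStats` and `pool_stats` are hypothetical names.

```rust
use std::collections::HashSet;

/// Byte totals for path strings, each counted with its trailing NUL.
#[derive(Debug, PartialEq, Eq)]
pub struct PoolStats {
    pub pre_dedup_bytes: usize,
    pub post_dedup_bytes: usize,
    pub distinct_paths: usize,
}

/// `cus` holds, for each compilation unit, the path strings its line
/// program header would carry inline in DWARF 4.
pub fn pool_stats(cus: &[Vec<&str>]) -> PoolStats {
    let mut pool: HashSet<&str> = HashSet::new();
    let mut pre = 0;
    let mut post = 0;
    for cu in cus {
        for path in cu {
            let size = path.len() + 1; // NUL terminator
            pre += size;
            // Only the first occurrence costs space in the shared pool.
            if pool.insert(*path) {
                post += size;
            }
        }
    }
    PoolStats {
        pre_dedup_bytes: pre,
        post_dedup_bytes: post,
        distinct_paths: pool.len(),
    }
}
```

A full estimate would also add the 4-byte `DW_FORM_line_strp` reference each CU still pays per path; this sketch measures only the string pool.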

Standalone tool that takes an ELF, rewrites every CU's .debug_line program from DWARF 4 to DWARF 5 using DW_FORM_string for paths (inline NUL-terminated, no .debug_line_str pool yet), patches every DW_AT_stmt_list attribute in .debug_info to point at the new offset, replaces .debug_line, shifts subsequent sections, and writes the result.

End-to-end on midnight-node:

* .debug_line: 129,029,756 → 133,358,116 bytes (+3.35%)
* file: 1,317,229,712 → 1,321,558,072 bytes (+0.33%)
* debugger-roundtrip compare: 256 symbol pairs, 0 mismatches

Size goes UP because v5 inline-string format adds ~10 bytes of format-descriptor overhead per CU + 2 bytes (address_size, segment_selector_size) but only saves ~2 bytes per file (no mtime/length). Phase 2b switches paths to DW_FORM_line_strp with a shared .debug_line_str pool; recon says that delivers the projected 21 MB saving (16% off .debug_line).

Phase 2a's value is the infrastructure: gimli-based CU walking, abbrev-walking to find DW_AT_stmt_list byte position, v5 header emission with format descriptors, in-place section replacement + SHDR/ehdr surgery (mirrors elf_compress.rs), and end-to-end validation against debugger-roundtrip.

Notable correctness moves:

* v5 file index 0 = primary CU source (from DW_AT_name + DW_AT_comp_dir). v5 file 1..N = same as v4 file 1..N. So the line-program opcodes (DW_LNS_set_file with operand) need no rewriting: numbering preserved.
* Same trick for include_directories (v5 dir 0 = comp_dir).
* Line-program opcode bytes copied verbatim: encoding is identical between v4 and v5.

Phase 3: lift into libwild::elf_line_v5 + fold into -O1 alongside --compress-debug-sections=zstd.

Signed-off-by: Giles Cope <[email protected]>

…phase 2b)

Phase 2b switches the path encoding from DW_FORM_string (inline, phase 2a) to DW_FORM_line_strp referencing a new .debug_line_str section. This is where the size win comes from: every CU's path strings get pooled and deduped across the whole binary.

Adds the .debug_line_str section to the output ELF: extends the .shstrtab with ".debug_line_str\0", appends a SHDR entry with SHT_PROGBITS + SHF_MERGE+SHF_STRINGS + sh_addralign=1+sh_entsize=1, shifts middle sections forward by the size delta, relocates the SHDR table to the new file end, bumps e_shoff + e_shnum.

Result on midnight-node:

* .debug_line: 129,029,756 → 107,846,424 bytes (-16.42%)
* .debug_line_str: 0 → 769,252 bytes (NEW section, 10,039 unique paths)
* combined: 129,029,756 → 108,615,676 bytes (-15.82%)
* file: 1,317,229,712 → 1,296,815,712 bytes (-20.4 MB)
* debugger-roundtrip compare: 256 symbol pairs, 0 mismatches

The pool came out larger than the recon estimate (769 KB vs 214 KB) because the rewriter dedupes primary CU source paths too, and most CUs have a distinct source file. Still close to the 21 MB recon projection and well within the same order of magnitude.

Phase 2a's inline-string emitter (emit_v5_line_program) is kept behind #[allow(dead_code)] for reference. The pooled emitter (emit_v5_line_program_pooled) is the active path.

Phase 3: lift this into libwild::elf_line_v5, fold into -O1 alongside --compress-debug-sections=zstd. The same DW_AT_stmt_list patching + SHDR-add infrastructure can later be reused for .debug_addr indirection (DWARF 5 DW_FORM_addrx) which would shrink .debug_info further.

Signed-off-by: Giles Cope <[email protected]>

Adds the args-side scaffolding for an opt-level mechanism:

* `opt_level: u8` field on ElfArgs.
* `-O<N>` parser handler now stores N (clamped 0..3) AND activates the underlying flags:
  * `--compress-debug-sections=zstd` (already wired in elf_compress.rs) for N >= 1.
  * `--upgrade-debug-line=v5` (new, body next commit) for N >= 1.

  Implicit activations only RAISE: an explicit `--compress-debug-sections=none` set later overrides -O.
* `DebugLineUpgrade` enum (None|V5) + `upgrade_debug_line` field.
* `--upgrade-debug-line=<none|v5>` direct override flag.

Reasoning for the v5 default at -O1: experiment/debug-line-rewrite proved that .debug_line v4 -> v5 with cross-CU path pooling via a new .debug_line_str section saves 16.42% on .debug_line (20.4 MB on midnight-node), is debugger-correct (debugger-roundtrip compare: 0 mismatches), and composes cleanly with the existing zstd compression pass. Combined -O1 = compress + line v5 should take debug builds to ~24% of original size while preserving full debugger fidelity.

Phase 3b (next commit): port the rewriter from experiments/debug-line-rewrite/src/main.rs into libwild::elf_line_v5 as a SizedOutput post-pass that runs before elf_compress. Hook into elf_writer::write. Add an integration test fixture asserting both -O1 effects.

Signed-off-by: Giles Cope <[email protected]>

Lifts the experiments/debug-line-rewrite/ phase 2b algorithm into libwild::elf_line_v5 as a SizedOutput post-pass. Hooks into elf_writer::write between build-id emission and the compress pass so the new .debug_line_str section gets zstd-compressed too when both passes are active under -O1.

Architecture:

* upgrade_debug_line(sized_output, mode) is the entry point. No-op when mode == None.
* Internally builds the rewritten ELF in a temp Vec<u8> (sub-optimal but correct) then memcpies back into the SizedOutput's mmap and calls set_final_size with the new (smaller) length.
* Bails loudly if the rewrite would GROW the file; that invariant should always hold on rust binaries because cross-CU path-pool dedup outweighs the new-section overhead.

Test coverage: extended compress-debug-sections fixture with a new `opt1` config that uses `-Wl,-O1` and asserts all of:

* `.debug_line_str` exists (only line v5 creates it).
* `.debug_line_str` is SHF_COMPRESSED (zstd ran after line v5).
* DWARF still resolves (debugger correctness gated).

Notable port differences from the experiment:

* gimli 0.33 (workspace pin) vs 0.31 (experiment). UnitSectionOffset became a tuple struct in 0.33; access via .0.
* libwild::error error type instead of String.
* Operates on &mut SizedOutput.

Combined effect on substrate-class debug builds (per the experiment runs on midnight-node, validated by debugger-roundtrip compare):

* raw input: 1,317 MB
* + line v5 (alone): 1,297 MB (-1.5%)
* + zstd compress (alone): 322 MB (-75.7%)
* + both (-O1): ~316 MB (~-76%), slightly better than zstd alone because compressed line_str is denser

Signed-off-by: Giles Cope <[email protected]>

…t fixture

- elf_line_v5 used to attempt the rewrite on any line program; on a v5 input, dirs use DW_FORM_line_strp, which our string_value call can't resolve, so we'd bail with a 'dir string_value' error. Now skip CUs whose lp_header.version() != 4: they're already upgraded, nothing to do.
- The compress-debug-sections test fixture needs v4 line programs to exercise the upgrade path, so add -gdwarf-4 to the compiler args. Modern gcc/clang default to v5.

Signed-off-by: Giles Cope <[email protected]>

Wild's own output places the SHDR table early in the file (right after ehdr), with .debug_line and .shstrtab later. The current elf_line_v5 layout logic was written assuming the gcc/ld convention of SHDR-at-end (which is what midnight-node has: it was linked via wild as ld, but the section ordering ended up gcc-style somehow).

For wild's own output, the SHDR-early layout makes the existing 'shift everything after .shstrtab including SHDR table' approach incorrect: SHDR shouldn't shift, but everything between SHDR and end of file (including .debug_line, .shstrtab) needs careful sequencing as we insert new bytes.

Until phase 4 generalises the layout-shift to handle both orderings, gracefully skip the rewrite and emit the original bytes when SHDR-precedes-shstrtab is detected. Prints a one-line warning to stderr so users know -O1's line v5 piece was no-op'd.

Test fixture's opt1 config drops the ExpectCompressedSection:.debug_line_str assertion (the rewrite was skipped, so .debug_line_str doesn't exist on wild's output yet) and asserts the compress piece + DWARF resolves instead. Re-add the line_str assertion once phase 4 lands.

Signed-off-by: Giles Cope <[email protected]>

Generalises apply_rewrite to handle both ELF layouts:

* gcc/ld-style: SHDR table at end of file (handled by phase 3b).
* wild-style: SHDR table early in the file (new in this commit).

Implementation: every insert/replace is modelled as an Op (position, delete, insert_bytes). All four operations are collected:

* A. Replace .debug_line bytes with new_debug_line.
* B. Insert new_debug_line_str after .debug_line.
* C. Append ".debug_line_str\0" at end of .shstrtab.
* D. Append new SHDR entry at end of the SHDR table.

Ops sorted by old-file position. Splice pass streams old bytes through to new, applying inserts/replaces in order. Each old offset maps to new via cumulative delta of ops with lower position. SHDR entries' sh_offset fields get remapped using the same map_offset function.

Op D's entry bytes include the new .debug_line_str's sh_offset, which depends on WHERE op D sits relative to .debug_line (op D may be BEFORE .debug_line in wild's layout). Cycle broken with a small conditional adjustment.

Re-adds ExpectCompressedSection:.debug_line_str to the opt1 fixture: now that the rewrite applies on wild's own output, that section should exist and be SHF_COMPRESSED.

Signed-off-by: Giles Cope <[email protected]>
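The splice pass described in this commit can be sketched as follows. The `Op` field names follow the commit message, but the struct layout and the `splice` function are illustrative rather than wild's actual code, and the sketch assumes ops do not overlap and lie within the old buffer.

```rust
/// One edit against the old file image (illustrative type).
pub struct Op {
    pub position: usize, // offset in the OLD file where the edit applies
    pub delete: usize,   // number of old bytes dropped starting at `position`
    pub insert: Vec<u8>, // bytes emitted in their place
}

/// Streams `old` into a new buffer, applying ops in old-file order.
/// Assumes ops do not overlap; overlapping ops would panic on slicing.
pub fn splice(old: &[u8], mut ops: Vec<Op>) -> Vec<u8> {
    // Stable sort: ops sharing a position keep their submission order.
    ops.sort_by_key(|op| op.position);
    let mut out = Vec::with_capacity(old.len());
    let mut cursor = 0; // next unread byte of `old`
    for op in &ops {
        out.extend_from_slice(&old[cursor..op.position]); // untouched bytes
        out.extend_from_slice(&op.insert); // inserted or replacement bytes
        cursor = op.position + op.delete; // skip the deleted span
    }
    out.extend_from_slice(&old[cursor..]); // trailing copy-through
    out
}
```

A replace is an op with `delete > 0`, and a pure insert has `delete == 0`, so one loop covers all four operations.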

SizedOutput is a fixed-size mmap — we can't grow it post-hoc. On tiny binaries the v5 format overhead (+ new SHDR entry + new section name) exceeds the cross-CU path-pool savings, so the rewrite ends up larger than the input. Was bailing with an error; now logs a one-line warning and returns the original output unchanged. On real workloads (substrate-class) the shrinkage is always comfortable — midnight-node: 21 MB saved. Tiny fixtures with few distinct paths are the edge case. Drops the .debug_line_str assertion from the opt1 fixture config because the fixture is now correctly small enough to trigger the skip path. Real-world -O1 link of a multi-CU rust binary will still activate the full rewrite; the experiment/debug-line-rewrite benchmark on midnight-node proves the savings case. Signed-off-by: Giles Cope <[email protected]>

… 4b) Growing SHDR in place shifted PT_LOAD content forward (wild's own layout puts SHDR early in the file, with executable sections after it) but PHDR p_offset fields weren't updated, so the kernel loaded garbage when trying to exec the resulting binary: /path/to/build-script-build: cannot execute binary file Fix: move the SHDR table to the end of the file (new ehdr.e_shoff points there) instead of extending it in place. Old SHDR bytes become unused padding — still in the file at their old location but not referenced by ehdr.e_shoff or any PHDR. No in-middle byte shifts, so PT_LOAD segments' file offsets don't change. Simplifies the code: op D is no longer part of the splice-ops list. It's applied after the main splice by appending to new_data. This also removes the gnarly 'compute new_line_str_offset when op D might be before or after .debug_line_offset' cycle that phase 4a had to work around. Signed-off-by: Giles Cope <[email protected]>
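A minimal sketch of the "append the SHDR table and repoint e_shoff" idea, assuming a little-endian ELF64 file held in a Vec<u8>. The function name and the 8-byte alignment padding are assumptions for illustration; the header offsets (e_shoff at 0x28, e_shnum at 0x3C) come from the ELF64 header layout.

```rust
const E_SHOFF: usize = 0x28; // ELF64 Ehdr: e_shoff, 8 bytes
const E_SHNUM: usize = 0x3C; // ELF64 Ehdr: e_shnum, 2 bytes

/// Append `shdr_table` at the end of `file` and point the ELF header at it.
/// Nothing in the middle of the file moves, so PT_LOAD file offsets stay valid.
/// Assumes `file` already holds a full 64-byte ELF64 header.
fn relocate_shdr_to_end(file: &mut Vec<u8>, shdr_table: &[u8], shnum: u16) -> u64 {
    // Pad to 8 bytes so the table is naturally aligned (an assumption here).
    while file.len() % 8 != 0 {
        file.push(0);
    }
    let new_shoff = file.len() as u64;
    file.extend_from_slice(shdr_table);
    // The old table bytes stay where they were, now unreferenced padding.
    file[E_SHOFF..E_SHOFF + 8].copy_from_slice(&new_shoff.to_le_bytes());
    file[E_SHNUM..E_SHNUM + 2].copy_from_slice(&shnum.to_le_bytes());
    new_shoff
}
```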

Bug: after elf_line_v5 shrank the output via set_final_size, the subsequent elf_compress pass saw the full mmap buffer (sized_output.out.len()) rather than line_v5's logical size. Its trailing 'copy through' step then read stale original bytes past the new end, corrupting the final ELF (ehdr.e_shoff carried over from wild's original output, so exec rejected the binary at load time).

Fix:
* Add SizedOutput::effective_len(): returns final_size_override if set, else full buffer length. Sibling of set_final_size.
* elf_compress's entry bounds the buffer slice to effective_len() before calling compress_zstd_in_buffer. All reads (including the trailing copy) now see only valid bytes from the preceding pass.
* Relax compress_zstd_in_buffer's generic bound to accept &mut [u8] directly (was &mut B: DerefMut<[u8]>). Simpler for the new single caller; unit tests already used &mut Vec<u8> which coerces to &mut [u8].

Signed-off-by: Giles Cope <[email protected]>
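The effective_len idea can be illustrated with a small stand-in type. This sketch uses a Vec<u8> in place of the real mmap, and `valid_bytes_mut` is a hypothetical helper name; only `set_final_size`, `effective_len` and `final_size_override` are names taken from the commit message.

```rust
/// Stand-in for the fixed-size output buffer (the real one is an mmap).
struct SizedOutput {
    out: Vec<u8>,
    final_size_override: Option<usize>,
}

impl SizedOutput {
    /// Record that only the first `n` bytes are meaningful.
    fn set_final_size(&mut self, n: usize) {
        assert!(n <= self.out.len(), "cannot grow a fixed-size buffer");
        self.final_size_override = Some(n);
    }

    /// Logical length: the override if a pass set one, else the full buffer.
    fn effective_len(&self) -> usize {
        self.final_size_override.unwrap_or(self.out.len())
    }

    /// The slice a later pass should operate on, so it never reads
    /// stale bytes past the logical end. (Hypothetical helper.)
    fn valid_bytes_mut(&mut self) -> &mut [u8] {
        let n = self.effective_len();
        &mut self.out[..n]
    }
}
```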

When an insert op (delete=0) sits exactly at a section's sh_offset, the section must shift past the inserted bytes, not collide with them. Previous map_offset always broke on op.position == p, producing a colliding offset.

This bug was invisible on midnight-node and most binaries because no section starts exactly at debug_line_end. Proc-macro .so files do (some section abuts .debug_line in the file), so cargo's call to dlopen the .so for proc-macro expansion failed with:

error[E0786]: found invalid metadata files for crate sqlx_macros
= note: no '.rustc' section in '...libsqlx_macros-...so'

The .rustc section's sh_offset was wrong by exactly the size of .debug_line_str (the insert at debug_line_end), so rustc read garbage.

Fix: map_offset now takes MapKind:
* Before: used when computing OUR new inserts' starting positions (.debug_line_str's sh_offset = where op B begins inserting).
* After: used when remapping existing SHDR entries; sections at op.position with delete=0 shift past the insert.

The .debug_line replacement (op A, delete>0) and .shstrtab append (op C, position past .shstrtab's start) aren't affected by the distinction.

Signed-off-by: Giles Cope <[email protected]>
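The Before/After distinction is easy to get wrong, so here is a hedged sketch of how such a mapping could look. It is not the PR's implementation: the Op here carries only an `insert_len` (an assumption), and offsets that fall strictly inside a replaced region are not handled.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum MapKind {
    /// Where an insert at this position begins in the new file.
    Before,
    /// Where an existing section that started here ends up.
    After,
}

struct Op {
    position: usize,
    delete: usize,
    insert_len: usize,
}

/// Map an old-file offset `p` to its new-file offset.
fn map_offset(ops: &[Op], p: usize, kind: MapKind) -> usize {
    let mut new_pos = p as i64;
    for op in ops {
        if op.position < p {
            // Ops strictly before p contribute their full size delta.
            new_pos += op.insert_len as i64 - op.delete as i64;
        } else if op.position == p && op.delete == 0 && kind == MapKind::After {
            // A pure insert exactly at p pushes existing content past it.
            new_pos += op.insert_len as i64;
        }
    }
    new_pos as usize
}
```

With a replacement at the query position (`delete > 0`), both kinds return the same value, matching the behaviour the commit describes for op A.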

Refactors Op + MapKind + map_offset from apply_rewrite-local to module-level items so they're unit-testable, then adds pragmatic regression tests for the phase 4 bugs caught during integration:

* map_offset Before/After semantics (bug 3, phase 4d): three pure-function tests covering
  - insert-at-query-position: Before returns p, After returns p + insert.len
  - replacement-at-query-position: both kinds return same value (the replaced region's start)
  - cumulative delta across multiple prior ops
* rewrite_moves_shdr_to_end_on_early_shdr_layout (bug 1, phase 4b): hand-rolls a minimal ELF with wild-style layout (SHDR early in file, PT_LOAD content after SHDR). Runs apply_rewrite; asserts new SHDR lives at end of file AND PT_LOAD bytes are byte-identical at their original offset (i.e., PHDR p_offsets remain valid).
* rewrite_remaps_section_at_debug_line_end_past_the_insert (bug 3 at the ELF level, phase 4d): builds an ELF with an 'edge' section whose sh_offset equals debug_line_end, the shape of proc-macro .so files that triggered the original regression. Runs apply_rewrite; asserts the edge section's new sh_offset is past .debug_line_str, not colliding with it.
* rewrite_grew_skip_is_integration_only (bug 4, phase 4-): ignored placeholder documenting that direct unit coverage would need a real DWARF 4 line program to drive rewrite_buffer end-to-end. The opt1 integration fixture already exercises the skip path (tiny C program).

Bug 2 (effective_len respect in compress) is left to the existing fixture coverage: direct unit coverage would require faking a SizedOutput with a File + mmap and isn't worth the scaffolding.

Also includes a `synthetic_elf_parses_via_object_crate` sanity check that the test helper produces ELFs which the object crate can walk, so any future failure in the deeper tests surfaces as a rewrite bug rather than a test-fixture bug.

7 tests in elf_line_v5::tests pass (1 intentionally ignored). Full lib suite: 214 passed, 0 failed.

compress-debug-sections.c reformatted by clang-format (pre-commit hook required); no functional change.

Signed-off-by: Giles Cope <[email protected]>

…nker#5) Flips DebugCompression::default() from None to Zstd. Release builds emit no .debug_* sections so the post-write pass is a no-op for them; debug builds get SHF_COMPRESSED zstd automatically without needing -O1 or an explicit flag. Users opting out pass --compress-debug-sections=none. Updates the docstrings on compress_debug_sections and opt_level to reflect that the baseline (no -O) already includes debug compression; -O1 now adds the line v4→v5 upgrade on top. Signed-off-by: Giles Cope <[email protected]>

…linker#3) Walks every .debug_info CU header, hashes the abbrev table each references, and collapses identical tables into a new .debug_abbrev. Each CU's 4-byte debug_abbrev_offset is patched to the deduped location. The section shrinks by the total dedup delta; subsequent section offsets and the SHDR position shift up accordingly.

Gated on --dedup-debug-abbrev (opt-in) or implicitly via -O1/-O2/-O3 (bundled with the existing line-v5 + zstd passes).

Bails gracefully when:
- .debug_info or .debug_abbrev is missing
- a CU uses DWARF 64 (unit_length = 0xffffffff)
- any PT_LOAD extends past .debug_abbrev (would desync p_offset)
- the dedup would not shrink the section

Self-contained; no DIE attributes are rewritten and abbrev codes are preserved verbatim, so debuggers see exactly the same content they would have seen against the pre-dedup CU.

4 unit tests cover ULEB scanning including DW_FORM_implicit_const.

Signed-off-by: Giles Cope <[email protected]>
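The collapse-and-patch step might look something like the sketch below. It assumes the per-CU abbrev tables have already been sliced out (the ULEB scanning the commit mentions is not shown), keys the map on the table bytes themselves rather than on a hash, and handles only little-endian 32-bit DWARF. Both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Collapse identical abbrev tables. Input: (old offset, table bytes) per
/// referenced table. Output: the new section and an old -> new offset map.
fn dedup_abbrev<'a>(tables: &[(u32, &'a [u8])]) -> (Vec<u8>, HashMap<u32, u32>) {
    let mut new_section = Vec::new();
    let mut seen: HashMap<&'a [u8], u32> = HashMap::new();
    let mut remap = HashMap::new();
    for &(old_off, bytes) in tables {
        // First occurrence is appended; later identical tables reuse it.
        let new_off = *seen.entry(bytes).or_insert_with(|| {
            let off = new_section.len() as u32;
            new_section.extend_from_slice(bytes);
            off
        });
        remap.insert(old_off, new_off);
    }
    (new_section, remap)
}

/// Patch one CU header's debug_abbrev_offset in place. The field sits at
/// byte 6 of the header in DWARF 2-4 and at byte 8 in DWARF 5.
fn patch_cu_abbrev_offset(
    debug_info: &mut [u8],
    cu_start: usize,
    remap: &HashMap<u32, u32>,
) -> Option<()> {
    let version = u16::from_le_bytes([
        *debug_info.get(cu_start + 4)?,
        *debug_info.get(cu_start + 5)?,
    ]);
    let f = cu_start + if version >= 5 { 8 } else { 6 };
    let field = debug_info.get_mut(f..f + 4)?;
    let old = u32::from_le_bytes([field[0], field[1], field[2], field[3]]);
    let new = *remap.get(&old)?;
    field.copy_from_slice(&new.to_le_bytes());
    Some(())
}
```

Because abbrev codes inside each table are untouched, only the 4-byte offset in each CU header changes, which is what keeps the pass self-contained.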

…e -O1 doc Dwarf-size-plan table rows for items wild-linker#3 (.debug_abbrev hash-and-collapse) and wild-linker#5 (default --compress-debug-sections=zstd) updated from pending to shipped. The 'What wild does today' status table gets matching ✓ entries and the wild-linker#5 prioritisation paragraph drops the 'cheapest win' framing. libwild's opt_level docstring moves .debug_abbrev dedup from the placeholder -O2 bucket to the -O1 line where it now actually runs. Signed-off-by: Giles Cope <[email protected]>

Introduces LinkerKind::WildOpt(u8) plus three wrapper scripts at benchmarks/runner/bin/wild-O{1,2,3} that prepend -O<N> to every invocation and rewrite the version banner (Wild-O<N> …) so the runner keeps the columns visually distinct. The reporter gets progressively darker greens for each level so 'same family, more work' reads at a glance. ELF-only — libwild's -O flag parser lives in args::elf. Mach-O benches filter the wrappers out via supports_platform. 4 new parser/platform unit tests cover the banner round-trip, the 1..=3 level bounds (so -O0 stays plain Wild rather than being mis-tagged as WildOpt(0)), and the ELF-only platform restriction. BENCHMARKING.md gets a new subsection documenting the wrappers, including a sample invocation with four wild columns on the ryzen-9955hx matrix. Signed-off-by: Giles Cope <[email protected]>
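A rough sketch of the banner round-trip with the 1..=3 bound, for illustration. The `parse_banner` name, the exact banner shape beyond the `Wild-O<N>` prefix, and the choice to reject levels above 3 are assumptions; the real enum has more variants.

```rust
#[derive(Debug, PartialEq)]
enum LinkerKind {
    Wild,
    WildOpt(u8),
}

/// Classify a version banner by its first whitespace-separated token.
fn parse_banner(banner: &str) -> Option<LinkerKind> {
    let first = banner.split_whitespace().next()?;
    if first == "Wild" {
        return Some(LinkerKind::Wild);
    }
    let level: u8 = first.strip_prefix("Wild-O")?.parse().ok()?;
    match level {
        // -O0 is plain Wild, never WildOpt(0).
        0 => Some(LinkerKind::Wild),
        1..=3 => Some(LinkerKind::WildOpt(level)),
        // Rejecting higher levels is an assumption of this sketch.
        _ => None,
    }
}
```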

NixOS (and upstream binutils) print this form without a vendor infix; the Ubuntu-only prefix rejected it. Factored strip_gnu_ld_banner so it tries the Ubuntu / Debian / plain forms in order and returns an optional vendor tag for the variant field. Signed-off-by: Giles Cope <[email protected]>
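The try-each-form approach could be sketched as below. The three prefix strings are what I understand `ld --version` to print on Ubuntu, Debian and upstream builds; treat them, the return shape, and the table layout as assumptions rather than the PR's actual code.

```rust
/// Strip a GNU ld banner prefix, returning (vendor tag, rest of line).
/// Forms are tried in order: Ubuntu, Debian, then the plain upstream form.
fn strip_gnu_ld_banner(line: &str) -> Option<(Option<&'static str>, &str)> {
    const FORMS: [(&str, Option<&'static str>); 3] = [
        ("GNU ld (GNU Binutils for Ubuntu) ", Some("Ubuntu")),
        ("GNU ld (GNU Binutils for Debian) ", Some("Debian")),
        ("GNU ld (GNU Binutils) ", None),
    ];
    FORMS.iter().find_map(|(prefix, vendor)| {
        line.strip_prefix(*prefix)
            .map(|rest| (*vendor, rest.trim()))
    })
}
```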

Same source tree as rust-hello-world but built with profile.release.debug = true instead of the default debug = 1 (line-tables-only). Exposes wild's -O1 passes (debug_line v5, debug_abbrev dedup, compress) on a workload small enough to rebuild in seconds. Signed-off-by: Giles Cope <[email protected]>

Commit 6d25034 (wip wasm LTO) replaced the original 'Wild was compiled without linker-plugin support, but LTO inputs were detected' with a wasm-specific message that bled into ELF paths. The four elf/x86_64/linker-plugin-lto fixtures assert the generic line as their fallback — they've been red on giles-mac since that WIP landed. Leaves the function platform-agnostic (it's called from both symbol_db and elf.rs for any LTO input), with a comment warning future edits that the wording is asserted by tests. Signed-off-by: Giles Cope <[email protected]>

Wild's Mach-O writer excludes the entire __DWARF segment from output (libwild/src/macho.rs:2234) — rust Mach-O binaries don't carry any .debug_* sections; debug info stays in the .o files and dsymutil reads them via N_OSO stabs to build a .dSYM bundle. There is no __debug_str in wild's Mach-O output to dedup; the premise 'wild passes per-CU string pools through to dsymutil un-deduped' isn't accurate. Any merging would have to happen inside dsymutil (out of wild's scope) or in a hypothetical embedded-DWARF workflow (rustc doesn't do this on Mach-O). Strikes the matrix row and rewrites the prioritisation note with the corrected understanding so future readers don't spend time on a phantom optimisation. Signed-off-by: Giles Cope <[email protected]>

text-stub-library 0.9 doesn't model the 'reexported-libraries:' section that umbrella frameworks (ApplicationServices, Carbon, CoreServices) use to point at the nested frameworks they re-export. Linking '-framework ApplicationServices' on current macOS SDKs left wild unable to resolve every ColorSync symbol (e.g. CGDisplay*UUID* used by modern winit) because the chain from ApplicationServices to ColorSync wasn't followed.

Adds a minimal raw-YAML scan for the section (parse_reexported_libraries) and an origin-path-based SDK-root derivation (resolve_tbd_for_install_name) that maps install-name to the matching .tbd. collect_tbd_symbols and collect_tbd_symbols_with_directives now recurse through the chain with a visited set to avoid loops.

Result: bevy-dylib links on macOS again. First re-measurement:
- wall-clock: 407 ms wild vs 279 ms ld64 (1.46x), down from 274x at session-start-2026-04-20, corrected from today's earlier claim.
- memory: 45 MiB wild vs 462 MiB ld64 (10x less).
- output: 40.9 MiB wild vs 55.7 MiB ld64 (28% smaller).

5 unit tests cover the parser happy path, no-section fallback, stop-at-next-top-level-key bound, SDK-root derivation, and graceful failure when the origin path has no recognisable marker.

Signed-off-by: Giles Cope <[email protected]>
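A raw-text scan plus visited-set recursion could be sketched as follows. This is a simplified illustration: it assumes each `libraries:` flow list fits on one line and that section contents are indented, and it resolves install names through an in-memory map (`collect_chain` and that map are hypothetical stand-ins for the SDK lookup).

```rust
use std::collections::{HashMap, HashSet};

/// Pull install names out of every 'reexported-libraries:' section.
/// The section ends at the next non-indented line (a new top-level key).
fn parse_reexported_libraries(tbd: &str) -> Vec<String> {
    let mut libs = Vec::new();
    let mut in_section = false;
    for line in tbd.lines() {
        if line.starts_with("reexported-libraries:") {
            in_section = true;
            continue;
        }
        if !in_section {
            continue;
        }
        if !line.is_empty() && !line.starts_with(char::is_whitespace) {
            in_section = false;
            continue;
        }
        if let Some((_, rest)) = line.split_once("libraries:") {
            let inner = rest.trim().trim_start_matches('[').trim_end_matches(']');
            libs.extend(
                inner
                    .split(',')
                    .map(|s| s.trim().trim_matches('\'').trim_matches('"'))
                    .filter(|s| !s.is_empty())
                    .map(String::from),
            );
        }
    }
    libs
}

/// Follow the re-export chain from `root`, visiting each library once.
fn collect_chain(root: &str, tbds: &HashMap<String, String>, visited: &mut HashSet<String>) {
    if !visited.insert(root.to_string()) {
        return; // already seen: breaks cycles
    }
    if let Some(text) = tbds.get(root) {
        for lib in parse_reexported_libraries(text) {
            collect_chain(&lib, tbds, visited);
        }
    }
}
```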

samply showed ~12% of wild's bevy-dylib wall-clock burning in core::hash::sip::Hasher::write, entirely under two call sites:
- link_framework's fresh_symbols std::HashSet<Vec<u8>>
- collect_tbd_symbols_impl's recursion via the new umbrella-chain fix

Both consume keys produced from trusted on-disk TBD content; SipHash's DoS resistance is wasted cost. Switch both (and args.dylib_symbols, the final union) to a type-aliased hashbrown::HashSet<Vec<u8>, foldhash::fast::FixedState>. Same pattern the memory logs for ResolutionByNameCache. Trait default in platform.rs and sdk_cache I/O sites migrated in lockstep; cache file schema is hasher-independent so no bump needed.

Re-profile confirms sip::Hasher::write no longer appears in the top 20; new cold-link hotspot is the cache-miss yaml parse path (~16%).

Bench (10×8 samples, M-series host): bevy-dylib: 407 → 370 ms (-9%, ratio 1.46× → 1.39×).

Signed-off-by: Giles Cope <[email protected]>

…dation) Tier-1 of wild's incremental plan needs to skip re-parsing clean inputs' symbol tables. Since wild already mmaps every input, the cache should live in the same shape: a fixed-layout repr(C) blob that can be mmap'd and interpreted in place, zero copy.

This commit lands the format + round-trip plumbing only. It's NOT yet wired into load_inputs; landing the storage layer first gives the next session a green foundation to slot a cache-lookup into the loader without fusing format-churn into a correctness-critical change.

Format (schema v1):
- 48-byte CacheHeader: magic 'WILDPI01', schema u32, flags u32, n_symbols u64, symbols_off u64, names_off u64, names_len u64.
- Symbol region: n_symbols × 24-byte CachedSymbol records (name_off, name_len, hash, flags, kind, _pad).
- Names blob: concatenated (deduped) symbol name bytes.

CacheView borrows the backing &[u8] and yields CachedEntry iterators whose slices point directly into the mmap: no allocation, no lifetime gymnastics beyond the input buffer's own borrow. CacheBuilder writes the same layout into a Vec<u8>.

10 unit tests lock down round-trip, name-dedup, zero-copy-ness (checks name pointers lie inside the buffer), bad-magic / bad-schema / truncated / misaligned / unknown-kind rejection, and the empty cache case.

Signed-off-by: Giles Cope <[email protected]>
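To make the 48-byte header layout concrete, here is a sketch that encodes and validates it field by field. It deliberately differs from the commit's in-place repr(C) view: fields are copied out and assumed little-endian, and the error strings and bounds checks are illustrative choices, not the PR's.

```rust
const MAGIC: [u8; 8] = *b"WILDPI01";
const HEADER_LEN: usize = 48; // 8 (magic) + 4 + 4 + 4 * 8
const SYMBOL_LEN: u64 = 24; // size of one CachedSymbol record
const SCHEMA: u32 = 1;

#[derive(Debug, PartialEq)]
struct CacheHeader {
    schema: u32,
    flags: u32,
    n_symbols: u64,
    symbols_off: u64,
    names_off: u64,
    names_len: u64,
}

impl CacheHeader {
    fn encode(&self) -> [u8; HEADER_LEN] {
        let mut b = [0u8; HEADER_LEN];
        b[0..8].copy_from_slice(&MAGIC);
        b[8..12].copy_from_slice(&self.schema.to_le_bytes());
        b[12..16].copy_from_slice(&self.flags.to_le_bytes());
        b[16..24].copy_from_slice(&self.n_symbols.to_le_bytes());
        b[24..32].copy_from_slice(&self.symbols_off.to_le_bytes());
        b[32..40].copy_from_slice(&self.names_off.to_le_bytes());
        b[40..48].copy_from_slice(&self.names_len.to_le_bytes());
        b
    }

    /// Validate a whole cache file and return its header.
    fn decode(buf: &[u8]) -> Result<CacheHeader, &'static str> {
        if buf.len() < HEADER_LEN {
            return Err("truncated");
        }
        if buf[0..8] != MAGIC {
            return Err("bad magic");
        }
        let u32_at = |o: usize| u32::from_le_bytes([buf[o], buf[o + 1], buf[o + 2], buf[o + 3]]);
        let u64_at = |o: usize| {
            let mut a = [0u8; 8];
            a.copy_from_slice(&buf[o..o + 8]);
            u64::from_le_bytes(a)
        };
        let h = CacheHeader {
            schema: u32_at(8),
            flags: u32_at(12),
            n_symbols: u64_at(16),
            symbols_off: u64_at(24),
            names_off: u64_at(32),
            names_len: u64_at(40),
        };
        if h.schema != SCHEMA {
            return Err("bad schema");
        }
        // Both regions must lie inside the file; checked math avoids overflow.
        let len = buf.len() as u64;
        let syms_end = h.n_symbols.checked_mul(SYMBOL_LEN).and_then(|n| n.checked_add(h.symbols_off));
        let names_end = h.names_off.checked_add(h.names_len);
        match (syms_end, names_end) {
            (Some(s), Some(n)) if s <= len && n <= len => Ok(h),
            _ => Err("region out of bounds"),
        }
    }
}
```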

Two helpers layered on the last commit's format scaffolding, still without any hot-path wiring:

- CacheBuilder::write_to(&Path): write-tmp-and-rename so concurrent readers never observe a torn cache. Matches the pattern wild's other side-cars (.wild-hashes, sdk-*.bin) use.
- cache_path_for_input(&Path) -> Option<PathBuf>: derives the on-disk cache location for a given input. Dir is $XDG_CACHE_HOME/wild/parsed-inputs or $HOME/.cache/wild/parsed-inputs. Filename is blake3(absolute_input_path_bytes ‖ SCHEMA).hex + .wildpi so two inputs sharing a basename (cargo's libfoo-<hash>.rlib convention puts them in different build dirs) never collide.

Two new tests:
- write_to_atomically_persists_and_reloads: round-trip via an actual temp file, asserts the .wildpi.tmp sidecar is consumed by rename.
- cache_path_derivation_is_collision_free_for_same_basename: feeds two identically-named inputs from different dirs through cache_path_for_input, asserts distinct outputs (and that the same input twice returns the same path).

Still not wired into the loader. The next session will hook CacheView into load_inputs / read_symbols_for_group behind the existing WILD_INCREMENTAL_DEBUG gate, with a cold-vs-skip byte-identical canary before flipping default.

Signed-off-by: Giles Cope <[email protected]>
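The write-tmp-and-rename pattern and the cache directory fallback can be sketched with the standard library alone. All three function names are hypothetical, the environment values are passed in as parameters to keep the sketch testable, and the blake3 filename derivation is left out because it needs an external crate.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Sidecar path used while writing: the destination with ".tmp" appended.
fn tmp_path_for(dest: &Path) -> PathBuf {
    let mut s = dest.as_os_str().to_owned();
    s.push(".tmp");
    PathBuf::from(s)
}

/// Write to a temp sidecar, then rename over the destination, so readers
/// see either the old file or the complete new one, never a partial write.
fn write_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = tmp_path_for(dest);
    {
        let mut f = fs::File::create(&tmp)?;
        f.write_all(bytes)?;
        f.sync_all()?; // flush to disk before the rename publishes it
    }
    fs::rename(&tmp, dest)
}

/// $XDG_CACHE_HOME/wild/parsed-inputs, else $HOME/.cache/wild/parsed-inputs.
/// An empty XDG value is treated as unset (an assumption of this sketch).
fn cache_dir(xdg_cache_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    let base = match xdg_cache_home.filter(|s| !s.is_empty()) {
        Some(x) => PathBuf::from(x),
        None => PathBuf::from(home?).join(".cache"),
    };
    Some(base.join("wild").join("parsed-inputs"))
}
```

The rename is only atomic when the sidecar and the destination live on the same filesystem, which placing the sidecar next to the destination ensures.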

Hand-off doc for the next session that picks up tier-1 incremental linking. The storage layer (parsed_input_cache) landed in 01f2236 + f900907; this is the spec for wiring it into the loader:

- SymbolSink trait extraction + TeeSink that duplicates into a CacheBuilder.
- Cache-replay path in load_symbols_from_file (before the object-crate dispatch).
- WILD_INCREMENTAL_DEBUG gate; WILD_INCREMENTAL_PARSE_SKIP_CANARY runs both paths and asserts they produce structurally identical symbol streams before any skip goes live.
- Lifetime contract: cache mmap lives alongside input mmaps in FileLoader, same arena.
- Bevy-dylib measurement script with target of at least 100 ms off the 370 ms cold link.
- Enumerated risk list the canary has to catch (weak version strings, COMDAT, TLS flags, hidden/protected visibility, local ordering).

Signed-off-by: Giles Cope <[email protected]>
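Since the hand-off doc is a spec rather than code, the sink/tee/canary shape it describes might be sketched like this. Everything here is hypothetical: the trait's single method, the (name, flags) record, and the Recorder type are placeholders for whatever the real symbol stream carries.

```rust
/// Receives symbols as the parser (or the cache replay) produces them.
trait SymbolSink {
    fn symbol(&mut self, name: &[u8], flags: u32);
}

/// Forwards every symbol to the live consumer and to a cache recorder.
struct TeeSink<'a, A, B> {
    primary: &'a mut A,
    cache: &'a mut B,
}

impl<A: SymbolSink, B: SymbolSink> SymbolSink for TeeSink<'_, A, B> {
    fn symbol(&mut self, name: &[u8], flags: u32) {
        self.primary.symbol(name, flags);
        self.cache.symbol(name, flags);
    }
}

/// Simple sink that records the stream, standing in for a cache builder.
#[derive(Default, Debug, PartialEq)]
struct Recorder(Vec<(Vec<u8>, u32)>);

impl SymbolSink for Recorder {
    fn symbol(&mut self, name: &[u8], flags: u32) {
        self.0.push((name.to_vec(), flags));
    }
}

/// Replay a recorded stream into any sink, as the skip path would.
fn replay(cached: &[(Vec<u8>, u32)], sink: &mut impl SymbolSink) {
    for (name, flags) in cached {
        sink.symbol(name, *flags);
    }
}
```

A canary in this shape runs the cold parse through a TeeSink, replays the recorded stream into a fresh sink, and asserts that the two results compare equal before the skip path is trusted.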

gilescope added 29 commits April 9, 2026 08:27

fix(macho): increase LINKEDIT buffer estimate for large Rust binaries

7ee1572

Signed-off-by: Giles Cope <[email protected]>

feat: add sold Mach-O test suite (MIT)

932de9a

Import 134 shell tests from bluewhalesystems/sold (archived). Tests compile C/C++ via clang, link with Wild via --ld-path=./ld64, and verify output. 36 pass, 98 ignored (categorized by reason). Signed-off-by: Giles Cope <[email protected]>

chore: fmt

907bd7a

Signed-off-by: Giles Cope <[email protected]>

feat: implement additional args

95b13dc

Signed-off-by: Giles Cope <[email protected]>

chore: fmt

bd99707

Signed-off-by: Giles Cope <[email protected]>

gilescope added 30 commits April 23, 2026 10:27

gilescope wants to merge 266 commits into wild-linker:main from gilescope:giles-mac
