consider order .init_array .fini_array#1408
consider order .init_array .fini_array#1408davidlattimore merged 23 commits intowild-linker:mainfrom
Conversation
|
@davidlattimore |
| return Some(u16::MAX); | ||
| } | ||
|
|
||
| if name.starts_with(b".init_array.") || name.starts_with(b".fini_array.") { |
There was a problem hiding this comment.
I think the parsing code might be simplified slightly if you used strip_prefix. e.g.:
if let Some(rest) = name.strip_prefix(b".init_array.") {
return parse_priority_suffix(rest);
}|
|
||
| for secondary_id in self.secondary.get(section_id) { | ||
| self.events.push(OrderEvent::Section(*secondary_id)); | ||
| let secondaries: Vec<OutputSectionId> = self.secondary.get(section_id).clone(); |
There was a problem hiding this comment.
It shouldn't be necessary to clone here if you change the into_iter below to just iter then change sid to &sid in the map arguments.
| (key_pri, idx, sid) | ||
| }) | ||
| .collect(); | ||
| keyed.sort_by(|a, b| a.0.cmp(&b.0).then_with(|| a.1.cmp(&b.1))); |
There was a problem hiding this comment.
sort_by_key(|a| (a.0, a.1)) is a bit simpler and should be equivalent.
| @@ -0,0 +1,35 @@ | |||
| //#Object:runtime.c | |||
There was a problem hiding this comment.
It might help to add //#Object:init.c, then #include "init.h" and in _start call call_init_functions();. See init_test.c for an example. This will cause GNU ld to see that __start_init_array and __end_init_array are referenced so it will emit those symbols. It'll also mean that the init functions will get run when the program runs.
There was a problem hiding this comment.
When I run the test, I get a failure that looks like this:
┌──────────────────┬─────────────┐
│ wild │ ld │
├──────────────────┼─────────────┤
│ 0x40191600000000 │ init_1000a │
│ 0x40192100000000 │ init_1000b │
│ 0x401a9600000000 │ init_1000c │
│ 0x401aa100000000 │ init_1000d │
│ 0x40192c00000000 │ init_2000a │
│ 0x40193700000000 │ init_2000b │
│ 0x401a8000000000 │ init_2000c │
│ 0x401a8b00000000 │ init_2000d │
│ 0x40190000000000 │ init_a │
│ 0x40190b00000000 │ init_b │
│ 0x40194200000000 │ init_65535a │
│ 0x401aac00000000 │ init_c │
│ 0x401ab700000000 │ init_d │
└──────────────────┴─────────────┘
If I search for the address of init_1000a in Wild's output...
readelf -Ws /home/d/work/wild/wild/tests/build/init-order.c/init-order.c-default-host.wild | grep init_1000a
14: 0000000000401916 11 FUNC GLOBAL DEFAULT 3 init_1000a
So the address in .init_array is 0x40191600000000 when it should be 0x0000000000401916. i.e. there's an extra 4 bytes of zeros at the start, then all the pointers are offset by 4 bytes.
My guess as to what's happening is that the primary section (output_section_id::INIT_ARRAY) is empty (as it should be) and has low alignment requirements (probably alignment=1). When we get to the first secondary section, we have an alignment requirement of 8, so add some padding (presumably 4 bytes of padding).
Looking at the .init_array section:
readelf -WS /home/d/work/wild/wild/tests/build/init-order.c/init-order.c-default-host.wild | grep -E '(Flg|init_array)'
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 4] .init_array INIT_ARRAY 0000000000402b04 000b04 000068 08 WA 0 0 8
[ 6] .preinit_array PREINIT_ARRAY 0000000000402bd8 000bd8 000000 00 WA 0 0 1
So it's listed as having alignment 8 (correct), but has an address ending in 0x4, which isn't 8-byte aligned.
Another problem is that __init_array_end is being placed at the end of the primary section, when it needs to be at the end of all the sections that go into that primary.
readelf -Ws /home/d/work/wild/wild/tests/build/init-order.c/init-order.c-default-host.wild | grep __init_array_
6: 0000000000402b04 0 NOTYPE GLOBAL DEFAULT 4 __init_array_start
7: 0000000000402b04 0 NOTYPE GLOBAL DEFAULT 4 __init_array_end
Note that the addresses of both symbols are equal, indicating that the range is empty.
I'm not sure that we want to change the behaviour of end_symbol_name, since I think for some sections we do want the end symbol to point to the end of a secondary section. But perhaps we can add a new field with the new behaviour. Naming things is hard, but perhaps end_group_symbol_name. Note that I don't think we need an equivalent for start_symbol_name.
|
With the current changes, almost all tests pass locally. Right now |
lapla-cogito
left a comment
There was a problem hiding this comment.
Wild runs mold’s test suite in CI. With this change, the mold test init-array-priorities.sh, which previously failed when run under Wild, now passes💯. You can confirm that this test is actually passing in CI by checking, for example, the logs of the “Check tests that should fail still fail” step in this run.
In that step, CI executes the tests listed in mold_skip_tests.toml, which records tests that are expected to fail on the current implementation of Wild, and verifies that they still fail. Ideally, it would be good to remove init-array-priorities.sh from this file as part of this PR. Doing so would also reduce the risk of accidentally breaking the feature implemented in this PR in your follow-up work.
davidlattimore
left a comment
There was a problem hiding this comment.
Nice work! Looks like this is getting pretty close to being ready to merge!
| { | ||
| self.events.push(OrderEvent::SetLocation(location)); | ||
| } | ||
|
|
There was a problem hiding this comment.
Could you restore some of the blank lines that you've removed? I find that they make it easier to quickly see where a multi-line if-statement or for-loop ends.
| __attribute__((destructor(1000))) void fini_1000c() {} | ||
| __attribute__((destructor(1000))) void fini_1000d() {} | ||
| __attribute__((destructor)) void fini_c() {} | ||
| __attribute__((destructor)) void fini_d() {} No newline at end of file |
There was a problem hiding this comment.
Could you add a newline at the end of files that don't have one? The red circle that github is showing means that there's no newline at the end of the last line.
| let mut mem_end = 0; | ||
| let mut alignment = info.min_alignment; | ||
|
|
||
| if section_id == crate::output_section_id::INIT_ARRAY { |
There was a problem hiding this comment.
Is this needed? For me, at least locally, none of the tests fail if I comment this out.
| || section_id == crate::output_section_id::FINI_ARRAY) | ||
| && offset == 0 | ||
| { | ||
| let a8 = Alignment::new(8).unwrap(); |
There was a problem hiding this comment.
This is the alignment of a pointer. Probably nicer to use alignment::USIZE instead
There was a problem hiding this comment.
@davidlattimore
Is it okay to have INIT_ARRAY in this if condition?
Or should we handle Secondary nicely here too?
There was a problem hiding this comment.
Rather than having this if-statement, you could set min_alignment on the built-in sections. i.e. add min_alignment: alignment::USIZE, to the relevant BuiltInSectionDetails {...} in output_section_id.rs.
We might at some point need a more general way to have properties like alignment from merged sections affect the primary section, but just setting the alignment on the built-in section details will probably do for now.
| (key_pri, idx, sid) | ||
| }) | ||
| .collect(); | ||
| keyed.sort_by_key(|a| (a.0, a.1)); |
There was a problem hiding this comment.
sort_by_key is stable, so a.1 shouldn't be needed - the sort will preserve the original order for any elements that have an equal key. That means that it should be possible to get rid of the enumerate() call above.
| section_id: OutputSectionId, | ||
| ) -> SectionRule<'data> { | ||
| Self::prefix( | ||
| SectionRule { |
| } else { | ||
| output_info.section_id.base_part_id() | ||
| }; | ||
| if let Some(priority) = init_fini_priority(section_name) { |
There was a problem hiding this comment.
When I benchmark linking wild, I see about a 1% slowdown in link time.
Benchmark 1 (1775 runs): /home/david/save/wild/run-with /home/d/wild-builds/2026-01-05 --strip-debug --no-fork
measurement mean ± σ min … max outliers delta
wall_time 33.7ms ± 840us 31.9ms … 37.6ms 41 ( 2%) 0%
peak_rss 161MB ± 858KB 158MB … 164MB 17 ( 1%) 0%
cpu_cycles 708M ± 14.4M 651M … 806M 38 ( 2%) 0%
instructions 699M ± 8.69M 674M … 767M 32 ( 2%) 0%
cache_references 20.9M ± 355K 19.8M … 22.5M 5 ( 0%) 0%
cache_misses 4.54M ± 67.4K 4.28M … 5.04M 14 ( 1%) 0%
branch_misses 1.89M ± 14.8K 1.84M … 2.00M 27 ( 2%) 0%
Benchmark 2 (1755 runs): /home/david/save/wild/run-with target/release/wild --strip-debug --no-fork
measurement mean ± σ min … max outliers delta
wall_time 34.1ms ± 862us 32.2ms … 37.8ms 46 ( 3%) 💩+ 1.2% ± 0.2%
peak_rss 162MB ± 856KB 159MB … 165MB 6 ( 0%) + 0.6% ± 0.0%
cpu_cycles 707M ± 13.8M 668M … 808M 36 ( 2%) - 0.1% ± 0.1%
instructions 700M ± 8.41M 678M … 778M 33 ( 2%) + 0.2% ± 0.1%
cache_references 21.0M ± 356K 20.0M … 23.3M 4 ( 0%) + 0.8% ± 0.1%
cache_misses 4.53M ± 65.3K 4.29M … 5.07M 13 ( 1%) - 0.2% ± 0.1%
branch_misses 1.90M ± 16.0K 1.85M … 2.05M 27 ( 2%) + 0.6% ± 0.1%
I think it's due to this call to init_fini_priority. If I comment it out, the slowdown more or less goes away. We end up checking every section name to see if it's equal to ".init_array", equal to ".fini_array", starts with ".init_array." or starts with ".fini_array". Those checks take time. Can we make use of the matched SectionRuleOutcome so that we only call it when the section is an .init_array or .fini_array section?
Edit: I suspect you'll need to either add a new variant to SectionRuleOutcome (e.g. SortedSection) or add an additional field to SectionOutputInfo.
There was a problem hiding this comment.
I tried making some adjustments, but I still haven't gotten the Benchmark to run properly locally.
There was a problem hiding this comment.
If you'd like help running benchmarks, let us know where you're stuck and we'll see what we can figure out.
There was a problem hiding this comment.
@YamasouA ➜ /workspaces/wild/ripgrep (master) $ hyperfine --warmup 2 \
'/tmp/wild/ripgrep/7/run-with ld' \
'/tmp/wild/ripgrep/7/run-with wild --no-fork'
Benchmark 1: /tmp/wild/ripgrep/7/run-with ld
Time (mean ± σ): 1.161 s ± 0.025 s [User: 0.669 s, System: 0.478 s]
Range (min … max): 1.134 s … 1.210 s 10 runs
Benchmark 2: /tmp/wild/ripgrep/7/run-with wild --no-fork
Time (mean ± σ): 212.4 ms ± 10.7 ms [User: 254.2 ms, System: 87.2 ms]
Range (min … max): 196.5 ms … 230.1 ms 13 runs
Summary
/tmp/wild/ripgrep/7/run-with wild --no-fork ran
5.47 ± 0.30 times faster than /tmp/wild/ripgrep/7/run-with ld
There was a problem hiding this comment.
@davidlattimore
I tried to reproduce your benchmark using poop, but I'm running this in Gitpod.
It looks like Gitpod does not allow perf_event_open, so poop fails with PermissionDenied when trying to collect HW counters (cycles, instructions, cache misses, etc.).
$ ./zig-out/bin/poop
'sh -lc "/tmp/wild/ripgrep/7/run-with ld"'
'sh -lc "/tmp/wild/ripgrep/7/run-with wild --no-fork"'
thread ... panic: unable to open perf event: PermissionDenied
Because of this restriction, I can only measure wall-time (e.g. via hyperfine) in Gitpod, not perf-based metrics.
If you have a recommendation for a Gitpod-friendly way to collect similar metrics, please let me know.
There was a problem hiding this comment.
I'm not familiar with gitpod, but if perf events aren't available, then I agree that it sounds like poop won't work. Hyperfine is sufficient though, since the main thing that we care about is walltime. I expect that gitpod or any other cloud-based platform, being shared, is likely to have pretty noisy benchmarks - i.e. the standard deviation of the timing results is likely to be quite large. e.g. in your results above, 212.4 ms ± 10.7 ms. i.e. the standard deviation is about 5%. This means that any change in performance that's of a similar or smaller magnitude is going to be lost in the noise.
When checking a PR for performance changes, I do a release build without the change, copy the binary somewhere, then do a release build with the change. I then compare the performance of the two binaries.
I'm happy to run benchmarks for you as-needed. Here's one for wild linking itself:
B=wild F="--strip-debug" R=/home/david/save/$B/run-with OUT=/run/user/1000/ttt bench poop -d 60000 "$R /home/d/wild-builds/2026-01-09 $F" "$R target/release/wild $F"
Temperature: 51.0 C
Benchmark 1 (2005 runs): /home/david/save/wild/run-with /home/d/wild-builds/2026-01-09 --strip-debug
measurement mean ± σ min … max outliers delta
wall_time 29.3ms ± 836us 27.5ms … 34.7ms 83 ( 4%) 0%
peak_rss 7.45MB ± 99.6KB 6.91MB … 7.50MB 316 (16%) 0%
cpu_cycles 709M ± 13.0M 652M … 777M 31 ( 2%) 0%
instructions 703M ± 8.37M 679M … 747M 31 ( 2%) 0%
cache_references 21.8M ± 330K 20.6M … 23.5M 57 ( 3%) 0%
cache_misses 4.58M ± 60.9K 4.36M … 4.94M 38 ( 2%) 0%
branch_misses 1.86M ± 15.6K 1.82M … 1.95M 25 ( 1%) 0%
Benchmark 2 (1974 runs): /home/david/save/wild/run-with target/release/wild --strip-debug
measurement mean ± σ min … max outliers delta
wall_time 29.8ms ± 839us 27.7ms … 34.9ms 84 ( 4%) 💩+ 1.6% ± 0.2%
peak_rss 7.44MB ± 108KB 6.91MB … 7.50MB 386 (20%) - 0.1% ± 0.1%
cpu_cycles 710M ± 13.0M 666M … 797M 30 ( 2%) + 0.1% ± 0.1%
instructions 704M ± 8.20M 679M … 767M 27 ( 1%) + 0.2% ± 0.1%
cache_references 21.8M ± 297K 20.6M … 23.9M 41 ( 2%) - 0.0% ± 0.1%
cache_misses 4.58M ± 57.0K 4.34M … 5.02M 35 ( 2%) + 0.1% ± 0.1%
branch_misses 1.91M ± 20.5K 1.86M … 2.06M 19 ( 1%) 💩+ 2.9% ± 0.1%
Temperature: 66.4 C
I did more investigation into the performance change. It seems that it's due (or at least mostly due) to the additional output sections. For linking wild, the number of output sections (including secondary sections) went from 46 to 49. It seems that wild is slightly more sensitive to that than I had realised. I think we should not worry about the performance change for merging of this PR. 1.6% isn't too bad given that this is an important feature being implemented. I've got some ideas that I'll probably experiment with at a later stage that might be able to reduce the per-section overhead.
| sec.mem_offset + sec.mem_size | ||
| }; | ||
|
|
||
| if section_id == output_section_id::INIT_ARRAY |
There was a problem hiding this comment.
I feel like there's probably a different way to do this, but I can't think of anything concrete at the moment, so this will do for now.
There was a problem hiding this comment.
Thought about this some more. I think perhaps what would work well would be to add a new variant, similar to SymbolPlacement::SectionEnd, but which triggers the relevant logic. It's fine to leave it as-is though if you'd rather not tackle that. Perhaps just add something like // TODO: Make this logic more generic, possibly by adding a new kind of SymbolPlacement
| } else { | ||
| output_info.section_id.base_part_id() | ||
| }; | ||
| if let Some(priority) = init_fini_priority(section_name) { |
There was a problem hiding this comment.
I'm not familiar with gitpod, but if perf events aren't available, then I agree that it sounds like poop won't work. Hyperfine is sufficient though, since the main thing that we care about is walltime. I expect that gitpod or any other cloud-based platform, being shared, is likely to have pretty noisy benchmarks - i.e. the standard deviation of the timing results is likely to be quite large. e.g. in your results above, 212.4 ms ± 10.7 ms. i.e. the standard deviation is about 5%. This means that any change in performance that's of a similar or smaller magnitude is going to be lost in the noise.
When checking a PR for performance changes, I do a release build without the change, copy the binary somewhere, then do a release build with the change. I then compare the performance of the two binaries.
I'm happy to run benchmarks for you as-needed. Here's one for wild linking itself:
B=wild F="--strip-debug" R=/home/david/save/$B/run-with OUT=/run/user/1000/ttt bench poop -d 60000 "$R /home/d/wild-builds/2026-01-09 $F" "$R target/release/wild $F"
Temperature: 51.0 C
Benchmark 1 (2005 runs): /home/david/save/wild/run-with /home/d/wild-builds/2026-01-09 --strip-debug
measurement mean ± σ min … max outliers delta
wall_time 29.3ms ± 836us 27.5ms … 34.7ms 83 ( 4%) 0%
peak_rss 7.45MB ± 99.6KB 6.91MB … 7.50MB 316 (16%) 0%
cpu_cycles 709M ± 13.0M 652M … 777M 31 ( 2%) 0%
instructions 703M ± 8.37M 679M … 747M 31 ( 2%) 0%
cache_references 21.8M ± 330K 20.6M … 23.5M 57 ( 3%) 0%
cache_misses 4.58M ± 60.9K 4.36M … 4.94M 38 ( 2%) 0%
branch_misses 1.86M ± 15.6K 1.82M … 1.95M 25 ( 1%) 0%
Benchmark 2 (1974 runs): /home/david/save/wild/run-with target/release/wild --strip-debug
measurement mean ± σ min … max outliers delta
wall_time 29.8ms ± 839us 27.7ms … 34.9ms 84 ( 4%) 💩+ 1.6% ± 0.2%
peak_rss 7.44MB ± 108KB 6.91MB … 7.50MB 386 (20%) - 0.1% ± 0.1%
cpu_cycles 710M ± 13.0M 666M … 797M 30 ( 2%) + 0.1% ± 0.1%
instructions 704M ± 8.20M 679M … 767M 27 ( 1%) + 0.2% ± 0.1%
cache_references 21.8M ± 297K 20.6M … 23.9M 41 ( 2%) - 0.0% ± 0.1%
cache_misses 4.58M ± 57.0K 4.34M … 5.02M 35 ( 2%) + 0.1% ± 0.1%
branch_misses 1.91M ± 20.5K 1.86M … 2.06M 19 ( 1%) 💩+ 2.9% ± 0.1%
Temperature: 66.4 C
I did more investigation into the performance change. It seems that it's due (or at least mostly due) to the additional output sections. For linking wild, the number of output sections (including secondary sections) went from 46 to 49. It seems that wild is slightly more sensitive to that than I had realised. I think we should not worry about the performance change for merging of this PR. 1.6% isn't too bad given that this is an important feature being implemented. I've got some ideas that I'll probably experiment with at a later stage that might be able to reduce the per-section overhead.
| must_keep: matcher.must_keep, | ||
| }; | ||
|
|
||
| let outcome = if section_id |
There was a problem hiding this comment.
This change seems to be done in order to make init-order apply when a linker script is used, but there isn't a test for this. At least when I tried reverting this bit of the change, no tests failed. As far as I know the ordering should probably only be done when the script specifies SORT_BY_INIT_PRIORITY such as in the following snippet of GNU ld's built-in script (which you can see by running ld --verbose.
KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
I'd probably suggest reverting this and dealing with init ordering for linker scripts in a separate PR. That said, I don't feel strongly, so it'd be OK to keep this if you've got reason to do so. In that case, it could perhaps do with some comments saying in what ways the behaviour here differs from GNU ld's behaviour of requiring SORT_BY_INIT_PRIORITY in order to sort the sections.
|
@davidlattimore I’ve also updated the linker script-related changes, so I’d appreciate it if you could take a look at those as well. |
davidlattimore
left a comment
There was a problem hiding this comment.
Thanks for all your work on this!
Sorry—I accidentally made commits while this PR was closed, and I’m no longer able to reopen it, so I’ve created a new PR instead.
#1253
Issue: #588