@ptesarik@infosec.exchange I'm afraid not. Decoding of which instructions? If the lone 'ret' needed to be decoded, so would GCC's 'jmp __x86_return_thunk'. The BPU doesn't handle never-before-seen jumps without decoding.
(moreover, doesn't the return thunk desync the return address prediction stack, so its 'ret' is mispredicted and actually runs much slower?)
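A toy model of the return-address stack (RAS) mechanics behind that worry: return prediction works only while calls and rets stay paired. This is a generic illustration of RAS pairing, not a claim about how the kernel's __x86_return_thunk actually interacts with any specific CPU's predictor.

```python
# Toy return-address-stack (RAS) predictor: each call pushes its return
# address, each ret pops the top entry as the prediction. A ret that has
# no matching call consumes a stale entry and mispredicts.

class RAS:
    def __init__(self):
        self.stack = []

    def on_call(self, return_addr):
        self.stack.append(return_addr)

    def predict_ret(self, actual_return_addr):
        # Pop the predicted target; compare against where the ret really goes.
        predicted = self.stack.pop() if self.stack else None
        return predicted == actual_return_addr

ras = RAS()
ras.on_call(0x1004)              # outer call
ras.on_call(0x2008)              # nested call
print(ras.predict_ret(0x2008))   # paired ret -> True
print(ras.predict_ret(0x1004))   # still paired -> True
print(ras.predict_ret(0x3000))   # unpaired ret -> False (mispredict)
```

Note that a plain 'jmp' neither pushes nor pops this stack, which is exactly why the question of what the thunk's final 'ret' consumes is interesting.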
@amonakov@mastodon.gamedev.place Interesting. Indeed, on second thought, my explanation doesn't make much sense. But the observed reality is that I can reliably get netperf throughput >= 1850 Mbps with CONFIG_RETHUNK=y but <= 1840 Mbps without (with no other changes to the setup). All with a tiny (64-byte) buffer size, so an extremely syscall-heavy workload.
EDIT: Obviously, the memory layout also changes, but I have checked that L1I cache misses are comparable (and approx. 0.1%) this time.
@ptesarik@infosec.exchange have you checked how much of the microbenchmark runs out of the DSB? I'm actually curious how much repeated decoding happens there.
I'm very, very surprised that you see no slowdown from rethunk's forced return mispredictions. Unless the thunks are somehow not active in your case? Do you see them if you do 'perf record'/'perf report'?
@amonakov@mastodon.gamedev.place But are you 100% certain that BPU cannot predict never-before-seen unconditional jumps?
@ptesarik@infosec.exchange no, not 100% at all. Branch prediction needs to work in several stages: the initial stage has to guess the next fetch block purely from the current fetch address, without inspecting the instruction bytes. In the following cycles a more educated guess can resteer the prediction, but return-address prediction also happens fairly early in the frontend.
(also consider that address calculation for an uncond jump is "cheap" only if it doesn't cross a page; otherwise it needs a virt->phys address translation)
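The staged behavior described above can be sketched as a toy model (illustration only, not any specific CPU): stage 1 predicts the next fetch block from a BTB keyed by fetch address alone, so a never-seen unconditional jump cannot redirect fetch until decode resteers and trains the BTB.

```python
# Toy two-stage frontend: stage 1 guesses the next fetch block from the
# current fetch address via a BTB; decode later learns the real target,
# resteers, and trains the BTB for next time.

BLOCK = 16  # hypothetical fetch-block size in bytes

class Frontend:
    def __init__(self):
        self.btb = {}  # fetch address -> predicted next fetch address

    def stage1_predict(self, fetch_addr):
        # No instruction bytes inspected here: a never-before-seen
        # unconditional jump falls through to the next sequential block.
        return self.btb.get(fetch_addr, fetch_addr + BLOCK)

    def decode_resteer(self, fetch_addr, actual_target):
        # Decode reveals the real target; train the BTB and report
        # whether stage 1 had it right.
        predicted = self.stage1_predict(fetch_addr)
        self.btb[fetch_addr] = actual_target
        return predicted == actual_target

fe = Frontend()
print(fe.decode_resteer(0x1000, 0x4000))  # cold jump: stage 1 guessed fall-through -> False
print(fe.decode_resteer(0x1000, 0x4000))  # warm: BTB supplies the target -> True
```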
@amonakov@mastodon.gamedev.place FWIW Intel documentation says:
Branches that do not have a history in the BTB are predicted using a static prediction algorithm.
Unconditional jumps are given as an example of a branch that is always predicted as taken.
As for TLB considerations, that's moot. Intel L1 cache is VIPT, so if a TLB entry is missing, this is detected later and reported as an L1 cache stall. I can check L1 stalls if needed. Maybe I should.
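That documented fallback can be sketched as the classic static rule (unconditional and backward branches predicted taken, forward conditional branches not taken) — a simplification, since the exact behavior varies by microarchitecture generation:

```python
# Sketch of the classic static-prediction fallback used for branches
# with no BTB history: unconditional -> taken; conditional -> taken
# only if the branch goes backward (e.g. a loop).

def static_predict_taken(branch_addr, target_addr, unconditional):
    if unconditional:
        return True                    # uncond jumps: always predicted taken
    return target_addr < branch_addr   # backward taken, forward not taken

print(static_predict_taken(0x1000, 0x4000, True))   # uncond jump -> True
print(static_predict_taken(0x1000, 0x0f00, False))  # backward loop branch -> True
print(static_predict_taken(0x1000, 0x1040, False))  # forward branch -> False
```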