@ptesarik@infosec.exchange
@amonakov@mastodon.gamedev.place Interesting. Indeed, on second thought, my explanation doesn't make much sense. But the observed reality is I can reliably get netperf throughput >= 1850 Mbps with CONFIG_RETHUNK=y but <= 1840 Mbps without (and with no other changes to the setup). All with a tiny (64-byte) buffer size, so an extremely syscall-heavy workload.
EDIT: Obviously, the memory layout also changes, but I have checked that L1I cache misses are comparable (and approx. 0.1%) this time.
@amonakov@mastodon.gamedev.place
@ptesarik@infosec.exchange have you checked how much of the microbenchmark runs out of the DSB? I'm actually curious how much repeated decoding happens there.
I'm very, very surprised that you see no slowdown from rethunk's forced return mispredictions. Unless the thunks are somehow not active in your case? Do you see them if you do 'perf record'/'perf report'?