Brutkey

Petr Tesařík
@ptesarik@infosec.exchange

@amonakov@mastodon.gamedev.place Interesting. Indeed, on second thought, my explanation doesn't make much sense. But the observed reality is that I can reliably get netperf throughput >= 1850 Mbps with CONFIG_RETHUNK=y but <= 1840 Mbps without (and with no other changes to the setup). All with a tiny (64-byte) buffer size, so an extremely syscall-heavy workload.
EDIT: Obviously, the memory layout also changes, but I have checked that L1I cache misses are comparable (and approx. 0.1%) this time.

Alexander Monakov
@amonakov@mastodon.gamedev.place

@ptesarik@infosec.exchange have you checked how much of the microbenchmark runs out of the DSB? I'm actually curious how much repeated decoding happens there.

I'm very, very surprised that you see no slowdown from rethunk's forced return mispredictions. Unless the thunks are somehow not active in your case? Do you see them if you do 'perf record'/'perf report'?


Petr Tesařík
@ptesarik@infosec.exchange

@amonakov@mastodon.gamedev.place
ALL_IDQ_UOPS = 198633974709
%UOPS.DSB = 62.3%
%UOPS.MITE = 27.6%
%UOPS.MS = 10.1%

The high proportion of micro-ops from the microcode sequencer is due to the
rep movsb in raw_copy_from_user().
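
For readers who want to reproduce the percentages above, here is a minimal sketch of the arithmetic. It assumes that ALL_IDQ_UOPS is the sum of the three delivery sources (DSB, MITE, MS); the post does not say how it was computed. The helper name `uop_shares` is hypothetical, and the per-source counts are back-computed from the posted percentages, not measured.

```python
# Sketch only: assumes ALL_IDQ_UOPS = DSB + MITE + MS uops (not stated in the post).
# uop_shares is a hypothetical helper, not part of perf or any library.

def uop_shares(dsb_uops, mite_uops, ms_uops):
    """Return each front-end source's share of delivered uops, in percent."""
    total = dsb_uops + mite_uops + ms_uops
    if total == 0:
        raise ValueError("no uops counted")
    return {
        "DSB": 100.0 * dsb_uops / total,
        "MITE": 100.0 * mite_uops / total,
        "MS": 100.0 * ms_uops / total,
    }

# Values quoted in the post.
ALL_IDQ_UOPS = 198_633_974_709
POSTED = {"DSB": 62.3, "MITE": 27.6, "MS": 10.1}

# Approximate per-source counts, reconstructed from the rounded percentages.
approx_counts = {k: round(ALL_IDQ_UOPS * pct / 100.0) for k, pct in POSTED.items()}

shares = uop_shares(approx_counts["DSB"], approx_counts["MITE"], approx_counts["MS"])
for name, pct in shares.items():
    print(f"%UOPS.{name} = {pct:.1f}%")
```

With real measurements, the three raw counter values would be passed to `uop_shares` directly instead of the reconstructed ones.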

Petr Tesařík
@ptesarik@infosec.exchange

@amonakov@mastodon.gamedev.place Oh, and yes, I do see a lot of hits in its_return_thunk:

Samples │        ffffffff81d940e0 :
        │        .skip 32, 0xcc
        │        SYM_CODE_START(its_return_thunk)
        │        UNWIND_HINT_FUNC
        │        ANNOTATE_NOENDBR
        │        ANNOTATE_UNRET_SAFE
        │        ret
   6088 │ffffffff81d940e0: ← ret
        │        int3
        │ffffffff81d940e1:   int3

Alexander Monakov
@amonakov@mastodon.gamedev.place

@ptesarik@infosec.exchange ah, this its_return_thunk is new, it doesn't desync the return address prediction stack!

Petr Tesařík
@ptesarik@infosec.exchange

@amonakov@mastodon.gamedev.place Oh, right, I thought I made it clear that this is a jmp to a ret, nothing more.
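
The point about "a jmp to a ret" can be illustrated with a toy model of a return address predictor. This is a sketch under strong simplifying assumptions: `ToyCore` and its methods are hypothetical, it models no real microarchitecture, and the second scenario is a generic example of a desynchronized return, not a description of any specific kernel thunk.

```python
# Toy model only: a predictor stack kept next to the architectural stack.
# All names here are hypothetical and exist purely for illustration.

class ToyCore:
    def __init__(self):
        self.stack = []        # architectural return addresses
        self.rsb = []          # predicted return addresses
        self.mispredicts = 0

    def call(self, return_addr):
        # A call pushes the return address on both stacks.
        self.stack.append(return_addr)
        self.rsb.append(return_addr)

    def jmp(self):
        # A direct jump changes neither stack.
        pass

    def drop_return(self):
        # Software discards the top return address (stack pointer adjustment);
        # the predictor is not informed.
        self.stack.pop()

    def ret(self):
        actual = self.stack.pop()
        predicted = self.rsb.pop() if self.rsb else None
        if predicted != actual:
            self.mispredicts += 1
        return actual


def jmp_to_ret():
    """Function body ends with a jmp to a thunk that is a bare ret."""
    core = ToyCore()
    core.call("caller+5")  # caller invokes the function
    core.jmp()             # function jumps to the thunk
    core.ret()             # thunk returns; prediction matches
    return core.mispredicts


def desynced_return():
    """Generic desync: an extra call whose return address is then discarded."""
    core = ToyCore()
    core.call("caller+5")
    core.call("trap")      # pushes "trap" on both stacks
    core.drop_return()     # only the architectural stack forgets "trap"
    core.ret()             # returns to "caller+5", predictor said "trap"
    return core.mispredicts


print("jmp to ret mispredicts:", jmp_to_ret())
print("desynced return mispredicts:", desynced_return())
```

In this model the jmp-to-ret sequence leaves the predictor and the real stack in agreement, which matches the observation in the thread that such a thunk does not force a return misprediction.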