
Petr TesaΕ™Γ­k
@ptesarik@infosec.exchange

@ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org @amonakov@mastodon.gamedev.place
I'm done, but the result is a bit disappointing.

There is a 12,000% increase in ICACHE_DATA.STALLS between the GCC7 build and the GCC13 build, but AFAICT the GCC7 memory layout was simply extremely lucky to hit no L1I aliasing, and the GCC13 layout is extremely unlucky to hit a lot of L1I aliasing on this specific Ice Lake CPU with this kernel version and configuration.

In short, if the performance of your code sucks, try re-ordering compile units and/or functions within a compile unit, and it'll get better. Or worse. But that's something you all knew already, isn't it?
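The L1I aliasing blamed above is easy to sketch numerically. This is a toy model, not the kernel's actual layout: it assumes the usual 32 KiB / 8-way / 64-byte-line L1I geometry (64 sets), and the hot-function addresses are invented:

```python
from collections import Counter

# Assumed geometry: 32 KiB L1I, 8-way, 64-byte lines -> 64 sets.
# Two code addresses compete for the same set iff bits [11:6] match;
# more than 8 hot lines per set means fetches evict each other.
LINE = 64
SETS = 64          # 32 KiB / (8 ways * 64 B)
WAYS = 8

def cache_set(addr: int) -> int:
    """Index of the L1I set a code address maps to."""
    return (addr // LINE) % SETS

def overcommitted_sets(hot_line_addrs):
    """Sets holding more hot lines than there are ways (aliasing candidates)."""
    per_set = Counter(cache_set(a) for a in hot_line_addrs)
    return {s: n for s, n in per_set.items() if n > WAYS}

# Example: 16 hot functions that a layout change packed 4 KiB apart.
# They all collapse onto one set: 16 lines fighting over 8 ways.
bad_layout = [0x1000 * i for i in range(16)]
print(overcommitted_sets(bad_layout))   # {0: 16}
```

With that (hypothetical) layout, 16 hot lines land on one 8-way set, so steady-state fetches keep evicting each other, which is exactly the kind of luck a re-ordered build can create or destroy.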

There's one lesson learned, though:
With a little bit of luck, all of netperf fits into the L1 I-cache on modern CPUs.
With a little bit of bloomin' luck.

Vlastimil Babka πŸ‡¨πŸ‡ΏπŸ‡¨πŸ‡ΏπŸ‡ͺπŸ‡ΊπŸ‡ͺπŸ‡ΊπŸ‡ΊπŸ‡¦πŸ‡ΊπŸ‡¦
@vbabka@mastodon.social

@ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org @amonakov@mastodon.gamedev.place 20 years ago (ugh) my master's thesis work started with interleaving the execution of a few different kinds of code on a single core and measuring how much they slowed down (compared to repeated execution with no interleaving) by evicting each other from the cache. In some cases the performance actually increased. So I'm not surprised, yeah.


Lorenzo Stoakes
@ljs@mastodonapp.uk

@vbabka@mastodon.social @ptesarik@infosec.exchange @mpdesouza@floss.social @gnutools@fosstodon.org @amonakov@mastodon.gamedev.place lol bro you're so old

Alexander Monakov
@amonakov@mastodon.gamedev.place

@vbabka@mastodon.social @ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org was that on a pentium4? I hear those could be extra spicy in that regard

Vlastimil Babka πŸ‡¨πŸ‡ΏπŸ‡¨πŸ‡ΏπŸ‡ͺπŸ‡ΊπŸ‡ͺπŸ‡ΊπŸ‡ΊπŸ‡¦πŸ‡ΊπŸ‡¦
@vbabka@mastodon.social

@amonakov@mastodon.gamedev.place @ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org AMD Athlon64 IIRC

Vlastimil Babka πŸ‡¨πŸ‡ΏπŸ‡¨πŸ‡ΏπŸ‡ͺπŸ‡ΊπŸ‡ͺπŸ‡ΊπŸ‡ΊπŸ‡¦πŸ‡ΊπŸ‡¦
@vbabka@mastodon.social

@amonakov@mastodon.gamedev.place @ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org in the end it turned out to be quite logical. The code whose performance improved was fftw with certain buffer sizes, which happened to leave dirty output data in the cache after execution. On an immediate re-execution it would read cold input data, forcing a write-back of the dirty lines and slowing itself down. With interleaved execution, the other code paid the price of the flush and possibly even left clean data in the cache...
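The dirty-cache accounting described above can be reproduced with a toy write-back cache model. Everything here is invented for illustration (a tiny direct-mapped cache, made-up block numbers, an "fftw-like" access pattern); only the cost attribution matters:

```python
# Toy model: in a write-back cache, the *next* access to a set pays
# for flushing the previous owner's dirty line.
class WriteBackCache:
    def __init__(self, lines=8):
        self.lines = lines
        self.tags = [None] * lines    # which block occupies each set
        self.dirty = [False] * lines

    def access(self, block, write=False):
        """Return 1 if this access had to flush someone else's dirty line."""
        s = block % self.lines
        flushed = 0
        if self.tags[s] != block:
            if self.dirty[s]:
                flushed = 1            # victim line written back to memory
            self.tags[s] = block
            self.dirty[s] = False
        if write:
            self.dirty[s] = True
        return flushed

def run_fft(cache):
    """Reads cold input blocks 0..7, writes output blocks 8..15 (same sets)."""
    cost = sum(cache.access(b) for b in range(8))                    # read input
    cost += sum(cache.access(b, write=True) for b in range(8, 16))   # write output
    return cost

def run_other(cache):
    """Read-only code touching blocks 16..23 (again the same sets)."""
    return sum(cache.access(b) for b in range(16, 24))

c = WriteBackCache()
run_fft(c)                       # warm-up: leaves dirty output in the cache
solo = run_fft(c)                # back-to-back: fft flushes its own dirty lines
c = WriteBackCache()
run_fft(c)
other = run_other(c)             # interleaved: the other code absorbs the flushes
interleaved = run_fft(c)
print(solo, interleaved, other)  # prints: 8 0 8
```

Back-to-back, the fftw-like code pays all 8 write-backs itself; interleaved, the other code absorbs them and the fftw-like code's own flush cost drops to zero.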

Petr TesaΕ™Γ­k
@ptesarik@infosec.exchange

@vbabka@mastodon.social @amonakov@mastodon.gamedev.place @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org Those were easy times…
Now imagine an Ice Lake with a micro-op queue fed by the IDQ (Instruction Decode Queue), which offers three paths: DSB (decoded Icache), MITE (legacy decode pipeline) and MS (microcode sequencer).

Alexander Monakov
@amonakov@mastodon.gamedev.place

@ptesarik@infosec.exchange @vbabka@mastodon.social @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org I think I'd rather work with these than Pentium4's "trace cache"

Petr TesaΕ™Γ­k
@ptesarik@infosec.exchange

@amonakov@mastodon.gamedev.place @vbabka@mastodon.social @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org I can feel the pain. Then again, I was still missing an important bit in my Ice Lake case. Most likely it's not L1I aliasing but BPU (Branch Prediction Unit) aliasing. Although the performance counters don't show any relevant difference between the fast and slow builds, it seems that a single mispredicted branch sends the instruction decoder down the wrong path, incurs a penalty for switching from DSB to MITE, and evicts useful lines from the L1 I-cache. Unfortunately, I'm unable to confirm this hypothesis, because more tracing also ruins the equilibrium, of course. But if true, it's insane that a single case of bad speculation can cost over 4% in a microbenchmark.
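The BPU-aliasing hypothesis can be illustrated with a toy bimodal predictor. The table size, PC index function, and branch addresses here are all made up, and real predictors are far more elaborate, but the aliasing mechanism is the same:

```python
# Toy branch predictor: a small table of 2-bit saturating counters indexed
# by low PC bits. Two branches whose addresses differ only above the index
# bits share one counter, and an always-taken branch trained against an
# always-not-taken one keeps mispredicting.
TABLE = 16  # entries; index = (pc >> 2) % TABLE (made-up geometry)

class Bimodal:
    def __init__(self):
        self.ctr = [1] * TABLE      # weakly not-taken

    def predict_and_update(self, pc, taken):
        """Return True on a correct prediction, then train the counter."""
        i = (pc >> 2) % TABLE
        pred = self.ctr[i] >= 2
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)
        return pred == taken

bp = Bimodal()
# pc 0x100 and 0x140 alias: ((0x100 >> 2) % 16) == ((0x140 >> 2) % 16) == 0
hits = sum(bp.predict_and_update(pc, taken)
           for _ in range(100)
           for pc, taken in [(0x100, True), (0x140, False)])
print(hits)   # 0: the aliased pair mispredicts every single time
```

Move one of the branches to a non-aliasing address (say 0x140 to 0x104) and both train up almost immediately; the point is that address placement, not branch behaviour, decides the outcome, which matches the "lucky vs. unlucky layout" observation.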
