
Petr TesaΕ™Γ­k
@ptesarik@infosec.exchange

@ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org @amonakov@mastodon.gamedev.place
I'm done, but the result is a bit disappointing.

There is a 12,000% increase in ICACHE_DATA.STALLS between the GCC7 build and the GCC13 build, but AFAICT the GCC7 memory layout was simply extremely lucky to hit no L1I aliasing, and the GCC13 layout is extremely unlucky to hit a lot of L1I aliasing on this specific Ice Lake CPU with this kernel version and configuration.

In short, if the performance of your code sucks, try re-ordering compile units and/or functions within a compile unit, and it'll get better. Or worse. But that's something you all knew already, isn't it?
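The L1I aliasing blamed above is easy to sketch numerically. This is a toy model, not the kernel's actual layout: it assumes the usual 32 KiB / 8-way / 64-byte-line L1I geometry (64 sets), and the hot-function addresses are invented:

```python
from collections import Counter

# Assumed geometry: 32 KiB L1I, 8-way, 64-byte lines -> 64 sets.
# Two code addresses compete for the same set iff bits [11:6] match;
# more than 8 hot lines per set means fetches evict each other.
LINE = 64
SETS = 64          # 32 KiB / (8 ways * 64 B)
WAYS = 8

def cache_set(addr: int) -> int:
    """Index of the L1I set a code address maps to."""
    return (addr // LINE) % SETS

def overcommitted_sets(hot_line_addrs):
    """Sets holding more hot lines than there are ways (aliasing candidates)."""
    per_set = Counter(cache_set(a) for a in hot_line_addrs)
    return {s: n for s, n in per_set.items() if n > WAYS}

# Example: 16 hot functions that a layout change packed 4 KiB apart.
# They all collapse onto one set: 16 lines fighting over 8 ways.
bad_layout = [0x1000 * i for i in range(16)]
print(overcommitted_sets(bad_layout))   # {0: 16}
```

With that (hypothetical) layout, 16 hot lines land on one 8-way set, so steady-state fetches keep evicting each other, which is exactly the kind of luck a re-ordered build can create or destroy.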

There's one lesson learned, though:
With a little bit of luck, all of netperf fits into the L1 I-cache on modern CPUs.
With a little bit of bloomin' luck.

Vlastimil Babka πŸ‡¨πŸ‡ΏπŸ‡¨πŸ‡ΏπŸ‡ͺπŸ‡ΊπŸ‡ͺπŸ‡ΊπŸ‡ΊπŸ‡¦πŸ‡ΊπŸ‡¦
@vbabka@mastodon.social

@ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org @amonakov@mastodon.gamedev.place 20 years ago (ugh) my master's thesis work started with interleaving the execution of a few different kinds of code on a single core and measuring how much they slowed down (compared to repeated execution with no interleaving) by evicting each other from the cache. In some cases the performance actually increased. So I'm not surprised, yeah.


Lorenzo Stoakes
@ljs@mastodonapp.uk

@vbabka@mastodon.social @ptesarik@infosec.exchange @mpdesouza@floss.social @gnutools@fosstodon.org @amonakov@mastodon.gamedev.place lol bro you're so old

Alexander Monakov
@amonakov@mastodon.gamedev.place

@vbabka@mastodon.social @ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org was that on a pentium4? I hear those could be extra spicy in that regard

Vlastimil Babka πŸ‡¨πŸ‡ΏπŸ‡¨πŸ‡ΏπŸ‡ͺπŸ‡ΊπŸ‡ͺπŸ‡ΊπŸ‡ΊπŸ‡¦πŸ‡ΊπŸ‡¦
@vbabka@mastodon.social

@amonakov@mastodon.gamedev.place @ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org AMD Athlon64 IIRC

Vlastimil Babka πŸ‡¨πŸ‡ΏπŸ‡¨πŸ‡ΏπŸ‡ͺπŸ‡ΊπŸ‡ͺπŸ‡ΊπŸ‡ΊπŸ‡¦πŸ‡ΊπŸ‡¦
@vbabka@mastodon.social

@amonakov@mastodon.gamedev.place @ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org in the end it turned out to be quite logical. The code whose performance improved was fftw with certain buffer sizes, which happened to leave dirty output data in the cache after execution. On an immediate re-execution it would read cold input data, forcing a write-back of the dirty lines and slowing itself down. With interleaved execution, the other code paid the price of the flush and possibly even left clean data in the cache...
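The dirty-cache accounting described above can be reproduced with a toy write-back cache model. Everything here is invented for illustration (a tiny direct-mapped cache, made-up block numbers, an "fftw-like" access pattern); only the cost attribution matters:

```python
# Toy model: in a write-back cache, the *next* access to a set pays
# for flushing the previous owner's dirty line.
class WriteBackCache:
    def __init__(self, lines=8):
        self.lines = lines
        self.tags = [None] * lines    # which block occupies each set
        self.dirty = [False] * lines

    def access(self, block, write=False):
        """Return 1 if this access had to flush someone else's dirty line."""
        s = block % self.lines
        flushed = 0
        if self.tags[s] != block:
            if self.dirty[s]:
                flushed = 1            # victim line written back to memory
            self.tags[s] = block
            self.dirty[s] = False
        if write:
            self.dirty[s] = True
        return flushed

def run_fft(cache):
    """Reads cold input blocks 0..7, writes output blocks 8..15 (same sets)."""
    cost = sum(cache.access(b) for b in range(8))                    # read input
    cost += sum(cache.access(b, write=True) for b in range(8, 16))   # write output
    return cost

def run_other(cache):
    """Read-only code touching blocks 16..23 (again the same sets)."""
    return sum(cache.access(b) for b in range(16, 24))

c = WriteBackCache()
run_fft(c)                       # warm-up: leaves dirty output in the cache
solo = run_fft(c)                # back-to-back: fft flushes its own dirty lines
c = WriteBackCache()
run_fft(c)
other = run_other(c)             # interleaved: the other code absorbs the flushes
interleaved = run_fft(c)
print(solo, interleaved, other)  # prints: 8 0 8
```

Back-to-back, the fftw-like code pays all 8 write-backs itself; interleaved, the other code absorbs them and the fftw-like code's own flush cost drops to zero.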

Petr TesaΕ™Γ­k
@ptesarik@infosec.exchange

@vbabka@mastodon.social @amonakov@mastodon.gamedev.place @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org Those were easy times…
Now imagine an Ice Lake with a micro-op queue fed by the IDQ (Instruction Decode Queue), which offers three paths: DSB (decoded Icache), MITE (legacy decode pipeline) and MS (microcode sequencer).

Alexander Monakov
@amonakov@mastodon.gamedev.place

@ptesarik@infosec.exchange @vbabka@mastodon.social @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org I think I'd rather work with these than Pentium4's "trace cache"

Petr TesaΕ™Γ­k
@ptesarik@infosec.exchange

@amonakov@mastodon.gamedev.place @vbabka@mastodon.social @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org I can feel the pain. Then again, I was still missing an important bit in my Ice Lake case. Most likely it's not L1I aliasing but BPU (Branch Prediction Unit) aliasing. Although the performance counters don't show any relevant difference between the fast and slow builds, it seems that a single mispredicted branch sends the instruction decoder down the wrong path, incurs a penalty for switching from DSB to MITE, and evicts useful lines from the L1 I-cache. Unfortunately, I'm unable to confirm this hypothesis, because more tracing also ruins the equilibrium, of course. But if true, it's insane that a single case of bad speculation can cost over 4% in a microbenchmark.
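The BPU-aliasing hypothesis can be illustrated with a toy bimodal predictor. The table size, PC index function, and branch addresses here are all made up, and real predictors are far more elaborate, but the aliasing mechanism is the same:

```python
# Toy branch predictor: a small table of 2-bit saturating counters indexed
# by low PC bits. Two branches whose addresses differ only above the index
# bits share one counter, and an always-taken branch trained against an
# always-not-taken one keeps mispredicting.
TABLE = 16  # entries; index = (pc >> 2) % TABLE (made-up geometry)

class Bimodal:
    def __init__(self):
        self.ctr = [1] * TABLE      # weakly not-taken

    def predict_and_update(self, pc, taken):
        """Return True on a correct prediction, then train the counter."""
        i = (pc >> 2) % TABLE
        pred = self.ctr[i] >= 2
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)
        return pred == taken

bp = Bimodal()
# pc 0x100 and 0x140 alias: ((0x100 >> 2) % 16) == ((0x140 >> 2) % 16) == 0
hits = sum(bp.predict_and_update(pc, taken)
           for _ in range(100)
           for pc, taken in [(0x100, True), (0x140, False)])
print(hits)   # 0: the aliased pair mispredicts every single time
```

Move one of the branches to a non-aliasing address (say 0x140 to 0x104) and both train up almost immediately; the point is that address placement, not branch behaviour, decides the outcome, which matches the "lucky vs. unlucky layout" observation.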
