
Vlastimil Babka πŸ‡¨πŸ‡ΏπŸ‡¨πŸ‡ΏπŸ‡ͺπŸ‡ΊπŸ‡ͺπŸ‡ΊπŸ‡ΊπŸ‡¦πŸ‡ΊπŸ‡¦
@vbabka@mastodon.social

@amonakov@mastodon.gamedev.place @ptesarik@infosec.exchange @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org in the end it turned out to be quite logical. The code whose performance improved was fftw with certain buffer sizes, which happened to leave dirty output data in the cache after execution. On an immediately following execution it would read cold input data, forcing a writeback of the dirty cache lines and slowing itself down. Interleaved execution means the other code paid the price of the writeback, possibly even leaving clean data in the cache...

Petr TesaΕ™Γ­k
@ptesarik@infosec.exchange

@vbabka@mastodon.social @amonakov@mastodon.gamedev.place @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org Those were easy times…
Now imagine an Ice Lake with a micro-op queue, the IDQ (Instruction Decode Queue), fed by three paths: the DSB (Decoded Stream Buffer, i.e. the decoded I-cache), MITE (the legacy decode pipeline) and the MS (microcode sequencer).
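The split between those three delivery paths can be observed with Linux perf. The event names below are the ones perf exposes on recent Intel cores; exact availability depends on the CPU model and kernel, and `./bench` is a placeholder for the workload under test.

```shell
# Count micro-ops delivered to the IDQ from each front-end path,
# plus the cycles lost to DSB-to-MITE switches.
perf stat -e idq.dsb_uops,idq.mite_uops,idq.ms_uops \
          -e dsb2mite_switches.penalty_cycles \
          ./bench
```

A workload that mostly runs out of the DSB will show `idq.dsb_uops` dominating; a rising `dsb2mite_switches.penalty_cycles` count points at the path-switching penalty discussed later in the thread.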


Alexander Monakov
@amonakov@mastodon.gamedev.place

@ptesarik@infosec.exchange @vbabka@mastodon.social @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org I think I'd rather work with these than the Pentium 4's "trace cache"

Petr TesaΕ™Γ­k
@ptesarik@infosec.exchange

@amonakov@mastodon.gamedev.place @vbabka@mastodon.social @ljs@mastodonapp.uk @mpdesouza@floss.social @gnutools@fosstodon.org I can feel the pain. Then again, I still missed an important bit in my Ice Lake case. Most likely, it's not L1I aliasing but BPU (Branch Prediction Unit) aliasing. Although the statistics counters don't show any relevant difference between the fast and slow builds, it seems that a single mispredicted branch sends the instruction decoder down the wrong path, incurs a penalty for switching from the DSB to MITE, and evicts useful information from the L1 I-cache. Unfortunately, I'm unable to confirm this hypothesis, because more tracing also ruins the equilibrium, of course. But if true, it's insane that a single instance of bad speculation can cost over 4% in a microbenchmark.