Thread | Brutkey

If you feel insufficiently frustrated developing stuff, try root-causing performance regressions. Plenty of frustration for everyone!

For example, I am now chasing a netperf regression between 6.4 and 6.12 where perf suggests that raw_copy_from_user() is now slower for some reason. Lacking any better idea, I wrote a small misc driver to stress raw_copy_from_user(). Of course, this raw_copy_from_user() runs equally fast in both kernels. Only the other raw_copy_from_user() which is embedded in copy_from_user_iter() is slower in the newer kernel.

Marcos Paulo de Souza
@mpdesouza@floss.social

@ptesarik@infosec.exchange /me is grabbing popcorn and waiting for more info

Petr Tesařík
@ptesarik@infosec.exchange

@mpdesouza@floss.social How's the popcorn? Here's more info:
The issue is not really a difference between the two kernel versions, but the compiler used to build them. SLE15 SP6 kernels (6.4) are built with GCC7, but SLE16 kernels (6.12) are built with GCC13. I rebuilt the SLE15 SP6 kernel with GCC13, and performance drops by more than 4%!

GCC7 produces this:

Samples │
        │    ffffffff81610ee0 :
        │      mov  %rdx,%rax
        │      xor  %ecx,%ecx
        │      add  %rsi,%rax
    156 │      setb %cl
        │      test %rax,%rax
        │    ↓ js   26
        │      test %rcx,%rcx
        │    ↓ jne  26
    932 │      stac
        │      mov  %rdx,%rcx
   2092 │      rep  movsb %ds:(%rsi),%es:(%rdi)
        │      NOP3
      2 │      mov  %rcx,%rdx
   1049 │      clac
        │26:   mov  %edx,%eax
    524 │    → ret

GCC13 produces this:

Samples │
        │    ffffffff81606120 :
        │      mov  %rdx,%rax
        │      mov  %rdx,%rcx
        │      xor  %edx,%edx
     78 │      add  %rsi,%rax
        │      setb %dl
        │      test %rax,%rax
        │    ↓ js   23
        │      test %rdx,%rdx
        │    ↓ jne  23
   1059 │      stac
   2651 │      rep  movsb %ds:(%rsi),%es:(%rdi)
        │      NOP3
   1509 │      clac
        │23:   mov  %ecx,%eax
    779 │    → ret

However, it's not quite clear to me why the latter code is slower.
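
Read as C, both listings appear to do the same work and differ only in register allocation: add the source address and the length, bail out if the sum carries or has its top bit set, otherwise run `rep movsb` between `stac` and `clac` and return the number of bytes left uncopied. Below is a rough userspace sketch of that range check, based only on a reading of the disassembly above. The names `user_range_ok` and `uncopied_on_check` are hypothetical, and the sketch assumes the copy itself never faults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the check at the top of both listings:
 *   add  %rsi,%rax   -> end = addr + len
 *   setb %cl / %dl   -> carry out of the 64-bit addition
 *   test %rax,%rax
 *   js               -> top bit of the end address is set
 */
bool user_range_ok(uint64_t addr, uint64_t len)
{
    uint64_t end = addr + len;       /* wraps modulo 2^64 */
    bool carry = end < addr;         /* what setb records */
    bool top_bit = (end >> 63) != 0; /* what js tests */

    return !carry && !top_bit;
}

/*
 * Bytes reported as NOT copied, assuming (as a simplification) that
 * rep movsb runs to completion whenever the check passes.
 */
uint64_t uncopied_on_check(uint64_t addr, uint64_t len)
{
    return user_range_ok(addr, len) ? 0 : len;
}

/* Small usage example. */
void demo_range_check(void)
{
    assert(user_range_ok(0x1000, 16));                   /* plain low address */
    assert(!user_range_ok(0xffff800000000000ULL, 8));    /* top bit set */
    assert(!user_range_ok(0xffffffffffffff00ULL, 0x200)); /* sum wraps around */
}
```

Since the logic is identical in both builds, a model like this cannot explain the slowdown; it only shows that the difference has to come from something other than what the function computes.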

Cc: @gnutools@fosstodon.org

DougMerritt (log😅 = 💧log😄)
@dougmerritt@mathstodon.xyz

@mpdesouza@floss.social @ptesarik@infosec.exchange
Page tables; it's always the damn page tables. They're evil, I'm telling you. Haunted.

Petr Tesařík
@ptesarik@infosec.exchange

@dougmerritt@mathstodon.xyz @mpdesouza@floss.social @ljs@mastodonapp.uk I thought you were joking, Doug, but I have found out it depends on where the function is located in memory, so it might actually be the damn page tables. Ugh.
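
One way to make "where the function is located" concrete is to compare the offsets of the two entry addresses from the perf listings within a cache line and within a page. This is a minimal sketch, assuming 64-byte cache lines and 4 KiB base pages; both sizes and the helper names are assumptions, not something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumption: typical x86-64 cache line */
#define PAGE_SIZE_4K    4096u /* assumption: 4 KiB base pages */

/* Offset of an address within its (assumed 64-byte) cache line. */
uint64_t cacheline_offset(uint64_t addr)
{
    return addr & (CACHE_LINE_SIZE - 1);
}

/* Offset of an address within its (assumed 4 KiB) page. */
uint64_t page_offset(uint64_t addr)
{
    return addr & (PAGE_SIZE_4K - 1);
}

/* Usage example with the two entry addresses from the listings. */
void demo_offsets(void)
{
    uint64_t gcc7_entry  = 0xffffffff81610ee0ULL;
    uint64_t gcc13_entry = 0xffffffff81606120ULL;

    /* Both entry points sit at the same offset within a cache line. */
    assert(cacheline_offset(gcc7_entry) == cacheline_offset(gcc13_entry));

    /* Their offsets within a page differ. */
    assert(page_offset(gcc7_entry) != page_offset(gcc13_entry));
}
```

With these assumptions both entry points land 32 bytes into a cache line, so entry alignment alone does not separate the two builds; the page offsets (0xee0 versus 0x120) do differ. That narrows what "location" might mean here but does not by itself identify the cause.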
