Petr Tesařík
@ptesarik@infosec.exchange

If you feel insufficiently frustrated developing stuff, try root-causing performance regressions. Plenty of frustration for everyone!

For example, I am now chasing a netperf regression between 6.4 and 6.12 where perf suggests that raw_copy_from_user() is now slower for some reason. Lacking any better idea, I wrote a small misc driver to stress raw_copy_from_user(). Of course, this raw_copy_from_user() runs equally fast in both kernels. Only the other raw_copy_from_user(), which is embedded in copy_from_user_iter(), is slower in the newer kernel.

Marcos Paulo de Souza
@mpdesouza@floss.social

@ptesarik@infosec.exchange /me is grabbing popcorn and waiting for more info


Petr Tesařík
@ptesarik@infosec.exchange

@mpdesouza@floss.social How's popcorn? Here's more info:
The issue is not really a difference between the two kernel versions, but the compiler used to build them. SLE15 SP6 kernels (6.4) are built with GCC7, but SLE16 kernels (6.12) are built with GCC13. I rebuilt the SLE15 SP6 kernel with GCC13, and performance drops by more than 4%!

GCC7 produces this:

Samples │
        │    ffffffff81610ee0 :
        │      mov  %rdx,%rax
        │      xor  %ecx,%ecx
        │      add  %rsi,%rax
    156 │      setb %cl
        │      test %rax,%rax
        │    ↓ js   26
        │      test %rcx,%rcx
        │    ↓ jne  26
    932 │      stac
        │      mov  %rdx,%rcx
   2092 │      rep  movsb %ds:(%rsi),%es:(%rdi)
        │      NOP3
      2 │      mov  %rcx,%rdx
   1049 │      clac
        │26:   mov  %edx,%eax
    524 │    → ret


GCC13 produces this:
Samples │
        │    ffffffff81606120 :
        │      mov  %rdx,%rax
        │      mov  %rdx,%rcx
        │      xor  %edx,%edx
     78 │      add  %rsi,%rax
        │      setb %dl
        │      test %rax,%rax
        │    ↓ js   23
        │      test %rdx,%rdx
        │    ↓ jne  23
   1059 │      stac
   2651 │      rep  movsb %ds:(%rsi),%es:(%rdi)
        │      NOP3
   1509 │      clac
        │23:   mov  %ecx,%eax
    779 │    → ret


However, it's not quite clear to me why the latter code is slower.
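
For anyone reading along, here is a rough user-space C model of what both listings appear to compute: a range check on the source address (add + setb for wraparound, test + js for the sign bit), then the copy, returning the number of bytes not copied. The names range_ok() and model_copy_from_user() are made up for illustration, and memcpy() merely stands in for the stac / rep movsb / clac sequence. This is a sketch of the logic, not the kernel source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the check at the top of both listings. */
static int range_ok(uintptr_t addr, size_t len)
{
    uintptr_t end;

    /* add + setb: did addr + len wrap around? */
    if (__builtin_add_overflow(addr, (uintptr_t)len, &end))
        return 0;

    /* test + js: reject an end address with the sign bit set */
    return (intptr_t)end >= 0;
}

/* Returns the number of bytes NOT copied, like the listings do. */
static size_t model_copy_from_user(void *dst, const void *src, size_t len)
{
    if (!range_ok((uintptr_t)src, len))
        return len;           /* check failed, nothing copied */

    memcpy(dst, src, len);    /* stands in for stac; rep movsb; clac */
    return 0;                 /* the count register is 0 after the copy */
}
```

Seen this way, the two compilers emit the same operations and differ only in register allocation and in where the mov instructions end up.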

Cc:
@gnutools@fosstodon.org

DougMerritt (log😅😅 = 💧💧log😄😄)
@dougmerritt@mathstodon.xyz

@mpdesouza@floss.social @ptesarik@infosec.exchange
Page tables; it's always the damn page tables. They're evil, I'm telling you. Haunted.

Petr Tesařík
@ptesarik@infosec.exchange

@dougmerritt@mathstodon.xyz @mpdesouza@floss.social @ljs@mastodonapp.uk I thought you were joking, Doug, but I have found out it depends on where the function is located in memory, so it might actually be the damn page tables. Ugh.
