Tracing System CPU on Debian Stretch



Tracing System CPU on Debian Stretch

This is a heavily truncated version of an internal blog post from August 2017. For more recent updates on Kafka, check out another blog post on compression, where we optimized throughput 4.5x for both disks and network.

Tracing System CPU on Debian Stretch
Photo by Alex Povolyashko / Unsplash

Upgrading our systems to Debian Stretch

For quite some time we’ve been rolling out Debian Stretch, to the point where we have reached ~10% adoption in our core datacenters. As part of upgarding the underlying OS, we also evaluate the higher level software stack, e.g. taking a look at our ClickHouse and Kafka clusters.

During our upgrade of Kafka, we sucessfully migrated two smaller clusters, logs and dns, but ran into issues when attempting to upgrade one of our larger clusters, http.

Thankfully, we were able to roll back the http cluster upgrade relatively easily, due to heavy versioning of both the OS and the higher level software stack. If there’s one takeaway from this blog post, it’s to take advantage of consistent versioning.

High level differences

We upgraded one Kafka http node, and it did not go as planned:

Having 5x CPU usage was definitely an unexpected outcome. For control datapoints, we compared to a node where no upgrade happened, and an intermediary node that received a software stack upgrade, but not an OS upgrade. Neither of these two nodes experienced the same CPU saturation issues, even though their setups were practically identical.

For debugging CPU saturation issues, we call on perf to fish out details:

Tracing System CPU on Debian Stretch

The command used was: perf top -F 99.

RCU stalls

In addition to higher system CPU usage, we found secondary slowdowns, including read-copy update (RCU) stalls:

[ 4909.110009] logfwdr (26887) used greatest stack depth: 11544 bytes left
[ 4909.392659] oom_reaper: reaped process 26861 (logfwdr), now anon-rss:8kB, file-rss:0kB, shmem-rss:0kB
[ 4923.462841] INFO: rcu_sched self-detected stall on CPU
[ 4923.462843]  13-...: (2 GPs behind) idle=ea7/140000000000001/0 softirq=1/2 fqs=4198
[ 4923.462845]   (t=8403 jiffies g=110722 c=110721 q=6440)

We’ve seen RCU stalls before, and our (suboptimal) solution was to reboot the machine.

However, one can only handle so many reboots before the problem becomes severe enough to warrant a deep dive. During our deep dive, we noticed in dmesg that we had issues allocating memory, while trying to write errors:

Aug 15 21:51:35 myhost kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 15 21:51:35 myhost kernel:         26-...: (1881 ticks this GP) idle=76f/140000000000000/0 softirq=8/8 fqs=365
Aug 15 21:51:35 myhost kernel:         (detected by 0, t=2102 jiffies, g=1837293, c=1837292, q=262)
Aug 15 21:51:35 myhost kernel: Task dump for CPU 26:
Aug 15 21:51:35 myhost kernel: java            R  running task    13488  1714   1513 0x00080188
Aug 15 21:51:35 myhost kernel:  ffffc9000d1f7898 ffffffff814ee977 ffff88103f410400 000000000000000a
Aug 15 21:51:35 myhost kernel:  0000000000000041 ffffffff82203142 ffffc9000d1f78c0 ffffffff814eea10
Aug 15 21:51:35 myhost kernel:  0000000000000041 ffffffff82203142 ffff88103f410400 ffffc9000d1f7920
Aug 15 21:51:35 myhost kernel: Call Trace:
Aug 15 21:51:35 myhost kernel:  [<ffffffff814ee977>] ? scrup+0x147/0x160
Aug 15 21:51:35 myhost kernel:  [<ffffffff814eea10>] ? lf+0x80/0x90
Aug 15 21:51:35 myhost kernel:  [<ffffffff814eecb5>] ? vt_console_print+0x295/0x3c0
Aug 15 21:51:35 myhost kernel:  [<ffffffff810b1193>] ? call_console_drivers.isra.22.constprop.30+0xf3/0x100
Aug 15 21:51:35 myhost kernel:  [<ffffffff810b1f51>] ? console_unlock+0x281/0x550
Aug 15 21:51:35 myhost kernel:  [<ffffffff810b2498>] ? vprintk_emit+0x278/0x430
Aug 15 21:51:35 myhost kernel:  [<ffffffff810b27ef>] ? vprintk_default+0x1f/0x30
Aug 15 21:51:35 myhost kernel:  [<ffffffff811588df>] ? printk+0x48/0x50
Aug 15 21:51:35 myhost kernel:  [<ffffffff810b30ee>] ? dump_stack_print_info+0x7e/0xc0
Aug 15 21:51:35 myhost kernel:  [<ffffffff8142d41f>] ? dump_stack+0x44/0x65
Aug 15 21:51:35 myhost kernel:  [<ffffffff81162e64>] ? warn_alloc+0x124/0x150
Aug 15 21:51:35 myhost kernel:  [<ffffffff81163842>] ? __alloc_pages_slowpath+0x932/0xb80
Aug 15 21:51:35 myhost kernel:  [<ffffffff81163c92>] ? __alloc_pages_nodemask+0x202/0x250
Aug 15 21:51:35 myhost kernel:  [<ffffffff811ae9c2>] ? alloc_pages_current+0x92/0x120
Aug 15 21:51:35 myhost kernel:  [<ffffffff81159d2f>] ? __page_cache_alloc+0xbf/0xd0
Aug 15 21:51:35 myhost kernel:  [<ffffffff8115cdfa>] ? filemap_fault+0x2ea/0x4d0
Aug 15 21:51:35 myhost kernel:  [<ffffffff8136dc95>] ? xfs_filemap_fault+0x45/0xa0
Aug 15 21:51:35 myhost kernel:  [<ffffffff8118b3eb>] ? __do_fault+0x6b/0xd0
Aug 15 21:51:35 myhost kernel:  [<ffffffff81190028>] ? handle_mm_fault+0xe98/0x12b0
Aug 15 21:51:35 myhost kernel:  [<ffffffff8110756b>] ? __seccomp_filter+0x1db/0x290
Aug 15 21:51:35 myhost kernel:  [<ffffffff8104fa5c>] ? __do_page_fault+0x22c/0x4c0
Aug 15 21:51:35 myhost kernel:  [<ffffffff8104fd10>] ? do_page_fault+0x20/0x70
Aug 15 21:51:35 myhost kernel:  [<ffffffff819bea02>] ? page_fault+0x22/0x30

This suggested that we were logging too many errors, and the actual failure may be earlier in the process. Armed with this hypothesis, we looked at the very beginning of the error chain:

Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel:  [<ffffffff81171754>] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel:  [<ffffffff8117234b>] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel:  [<ffffffff811725e2>] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel:  [<ffffffff811728ef>] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel:  [<ffffffff81172be5>] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel:  [<ffffffff811632db>] __alloc_pages_slowpath+0x31b/0xb80

As much as shrink_node may scream “NUMA issues”, you’re looking primarily at:

Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages

In addition, we also found memory allocation issues:

[78972.506644] Mem-Info:
[78972.506653] active_anon:3936889 inactive_anon:371971 isolated_anon:0
[78972.506653]  active_file:25778474 inactive_file:1214478 isolated_file:2208
[78972.506653]  unevictable:0 dirty:1760643 writeback:0 unstable:0
[78972.506653]  slab_reclaimable:1059804 slab_unreclaimable:141694
[78972.506653]  mapped:47285 shmem:535917 pagetables:10298 bounce:0
[78972.506653]  free:202928 free_pcp:3085 free_cma:0
[78972.506660] Node 0 active_anon:8333016kB inactive_anon:989808kB active_file:50622384kB inactive_file:2401416kB unevictable:0kB isolated(anon):0kB isolated(file):3072kB mapped:96624kB dirty:3422168kB writeback:0kB shmem:1261156kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:15744 all_unreclaimable? no
[78972.506666] Node 1 active_anon:7414540kB inactive_anon:498076kB active_file:52491512kB inactive_file:2456496kB unevictable:0kB isolated(anon):0kB isolated(file):5760kB mapped:92516kB dirty:3620404kB writeback:0kB shmem:882512kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:9080974 all_unreclaimable? no
[78972.506671] Node 0 DMA free:15900kB min:100kB low:124kB high:148kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15900kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
** 9 printk messages dropped ** [78972.506716] Node 0 Normal: 15336*4kB (UMEH) 4584*8kB (MEH) 2119*16kB (UME) 775*32kB (MEH) 106*64kB (UM) 81*128kB (MH) 29*256kB (UM) 25*512kB (M) 19*1024kB (M) 7*2048kB (M) 2*4096kB (M) = 236080kB
[78972.506725] Node 1 Normal: 31740*4kB (UMEH) 3879*8kB (UMEH) 873*16kB (UME) 353*32kB (UM) 286*64kB (UMH) 62*128kB (UMH) 28*256kB (MH) 20*512kB (UMH) 15*1024kB (UM) 7*2048kB (UM) 12*4096kB (M) = 305752kB
[78972.506726] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506727] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506728] 27531091 total pagecache pages
[78972.506729] 0 pages in swap cache
[78972.506730] Swap cache stats: add 0, delete 0, find 0/0
[78972.506730] Free swap  = 0kB
[78972.506731] Total swap = 0kB
[78972.506731] 33524975 pages RAM
[78972.506732] 0 pages HighMem/MovableOnly
[78972.506732] 546255 pages reserved
[78972.620129] ntpd: page allocation stalls for 272380ms, order:0, mode:0x24000c0(GFP_KERNEL)
[78972.620132] CPU: 16 PID: 13099 Comm: ntpd Tainted: G           O    4.9.43-cloudflare-2017.8.4 #1
[78972.620133] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A21 10/01/2015
[78972.620136]  ffffc90022f9b6f8 ffffffff8142d668 ffffffff81ca31b8 0000000000000001
[78972.620138]  ffffc90022f9b778 ffffffff81162f14 024000c022f9b740 ffffffff81ca31b8
[78972.620140]  ffffc90022f9b720 0000000000000010 ffffc90022f9b788 ffffc90022f9b738
[78972.620140] Call Trace:
[78972.620148]  [<ffffffff8142d668>] dump_stack+0x4d/0x65
[78972.620152]  [<ffffffff81162f14>] warn_alloc+0x124/0x150
[78972.620154]  [<ffffffff811638f2>] __alloc_pages_slowpath+0x932/0xb80
[78972.620157]  [<ffffffff81163d42>] __alloc_pages_nodemask+0x202/0x250
[78972.620160]  [<ffffffff811aeae2>] alloc_pages_current+0x92/0x120
[78972.620162]  [<ffffffff8115f6ee>] __get_free_pages+0xe/0x40
[78972.620165]  [<ffffffff811e747a>] __pollwait+0x9a/0xe0
[78972.620168]  [<ffffffff817c9ec9>] datagram_poll+0x29/0x100
[78972.620170]  [<ffffffff817b9d48>] sock_poll+0x48/0xa0
[78972.620172]  [<ffffffff811e7c35>] do_select+0x335/0x7b0

This specific error message did seem fun:

[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)

You don’t want your page allocations to stall for 5 minutes, especially when it’s order zero allocation (smallest allocation of one 4 KiB page).

Comparing to our control nodes, the only two possible explanations were: a kernel upgrade, and the switch from Debian Jessie to Debian Stretch. We suspected the former, since CPU usage implies a kernel issue. However, just to be safe, we rolled both the kernel back to 4.4.55, and downgraded the affected nodes back to Debian Jessie. This was a reasonable compromise, since we needed to minimize downtime on production nodes.

Digging a bit deeper

Keeping servers running on older kernel and distribution is not a viable long term solution. Through bisection, we found the issue lay in the Jessie to Stretch upgrade, contrary to our initial hypothesis.

Now that we knew what the problem was, we proceeded to investigate why. With the help from existing automation around perf and Java, we generated the following flamegraphs:

  • Jessie

Tracing System CPU on Debian Stretch

  • Stretch

Tracing System CPU on Debian Stretch

At first it looked like Jessie was doing writev instead of sendfile, but the full flamegraphs revealed that Strech was executing sendfile a lot slower.

If you highlight sendfile:

  • Jessie

Tracing System CPU on Debian Stretch

  • Stretch

Tracing System CPU on Debian Stretch

And zoomed in:

  • Jessie

Tracing System CPU on Debian Stretch

  • Stretch

Tracing System CPU on Debian Stretch

These two look very different.

Some colleagues suggested that the differences in the graphs may be due to TCP offload being disabled, but upon checking our NIC settings, we found that the feature flags were identical.

We’ll dive into the differences in the next section.

And deeper

To trace latency distributions of sendfile syscalls between Jessie and Stretch, we used funclatency from bcc-tools:

  • Jessie
$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:25
     usecs               : count     distribution
         0 -> 1          : 9        |                                        |
         2 -> 3          : 47       |****                                    |
         4 -> 7          : 53       |*****                                   |
         8 -> 15         : 379      |****************************************|
        16 -> 31         : 329      |**********************************      |
        32 -> 63         : 101      |**********                              |
        64 -> 127        : 23       |**                                      |
       128 -> 255        : 50       |*****                                   |
       256 -> 511        : 7        |                                        |
  • Stretch
$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:28
     usecs               : count     distribution
         0 -> 1          : 1        |                                        |
         2 -> 3          : 20       |***                                     |
         4 -> 7          : 46       |*******                                 |
         8 -> 15         : 56       |********                                |
        16 -> 31         : 65       |**********                              |
        32 -> 63         : 75       |***********                             |
        64 -> 127        : 75       |***********                             |
       128 -> 255        : 258      |****************************************|
       256 -> 511        : 144      |**********************                  |
       512 -> 1023       : 24       |***                                     |
      1024 -> 2047       : 27       |****                                    |
      2048 -> 4095       : 28       |****                                    |
      4096 -> 8191       : 35       |*****                                   |
      8192 -> 16383      : 1        |                                        |

In the flamegraphs, you can see timers being set at the tip (mod_timer function), with these timers taking locks. On Stretch we installed 3x more timers, resulting in 10x the amount of contention:

  • Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:36
FUNC                                    COUNT
mod_timer                               60482
00:33:37
FUNC                                    COUNT
mod_timer                               58263
00:33:38
FUNC                                    COUNT
mod_timer                               54626
$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:36
FUNC                                    COUNT
lock_timer_base                         15962
00:32:37
FUNC                                    COUNT
lock_timer_base                         16261
00:32:38
FUNC                                    COUNT
lock_timer_base                         15806
  • Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:28
FUNC                                    COUNT
mod_timer                              149068
00:33:29
FUNC                                    COUNT
mod_timer                              155994
00:33:30
FUNC                                    COUNT
mod_timer                              160688
$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:32
FUNC                                    COUNT
lock_timer_base                        119189
00:32:33
FUNC                                    COUNT
lock_timer_base                        196895
00:32:34
FUNC                                    COUNT
lock_timer_base                        140085

The Linux kernel includes debugging facilities for timers, which call the timer:timer_start tracepoint on every timer start. This allowed us to pull up timer names:

  • Jessie
$ sudo perf record -e timer:timer_start -p 23485 -- sleep 10 && sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 54 times to write data ]
[ perf record: Captured and wrote 17.778 MB perf.data (173520 samples) ]
      6 blk_rq_timed_out_timer
      2 clocksource_watchdog
      5 commit_timeout
      5 cursor_timer_handler
      2 dev_watchdog
     10 garp_join_timer
      2 ixgbe_service_timer
     36 reqsk_timer_handler
   4769 tcp_delack_timer
    171 tcp_keepalive_timer
 168512 tcp_write_timer
  • Stretch
$ sudo perf record -e timer:timer_start -p 3416 -- sleep 10 && sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 671 times to write data ]
[ perf record: Captured and wrote 198.273 MB perf.data (1988650 samples) ]
      6 clocksource_watchdog
      4 commit_timeout
     12 cursor_timer_handler
      2 dev_watchdog
     18 garp_join_timer
      4 ixgbe_service_timer
      1 neigh_timer_handler
      1 reqsk_timer_handler
   4622 tcp_delack_timer
      1 tcp_keepalive_timer
1983978 tcp_write_timer
      1 writeout_period

So basically we install 12x more tcp_write_timer timers, resulting in higher kernel CPU usage.

Taking specific flamegraphs of the timers revealed the differences in their operation:

  • Jessie

Tracing System CPU on Debian Stretch

  • Stretch

Tracing System CPU on Debian Stretch

We then traced the functions that were different:

  • Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:33
FUNC                                    COUNT
tcp_sendmsg                             21166
03:33:34
FUNC                                    COUNT
tcp_sendmsg                             21768
03:33:35
FUNC                                    COUNT
tcp_sendmsg                             21712
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:14
FUNC                                    COUNT
tcp_push_one                              496
03:37:15
FUNC                                    COUNT
tcp_push_one                              432
03:37:16
FUNC                                    COUNT
tcp_push_one                              495
$ sudo /usr/share/bcc/tools/trace -p 23485 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
   1583 4
   2043 54
   3546 18
   4016 59
   4423 50
   5349 8
   6154 40
   6620 38
  17121 51
  39528 44
  • Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:30
FUNC                                    COUNT
tcp_sendmsg                             53834
03:33:31
FUNC                                    COUNT
tcp_sendmsg                             49472
03:33:32
FUNC                                    COUNT
tcp_sendmsg                             51221
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:10
FUNC                                    COUNT
tcp_push_one                            64483
03:37:11
FUNC                                    COUNT
tcp_push_one                            65058
03:37:12
FUNC                                    COUNT
tcp_push_one                            72394
$ sudo /usr/share/bcc/tools/trace -p 3416 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
    396 46
    409 4
   1124 50
   1305 18
   1547 40
   1672 59
   1729 8
   2181 38
  19052 44
  64504 4096

The traces showed huge variations of tcp_sendmsg and tcp_push_one within sendfile.

To further introspect, we leveraged a kernel feature available since 4.9: the ability to count stacks. This led us to measuring what hits tcp_push_one:

  • Jessie
$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  do_iter_readv_writev
  do_readv_writev
  vfs_writev
  do_writev
  SyS_writev
  do_syscall_64
  return_from_SYSCALL_64
    1
  tcp_push_one
  inet_sendpage
  kernel_sendpage
  sock_sendpage
  pipe_to_sendpage
  __splice_from_pipe
  splice_from_pipe
  generic_splice_sendpage
  direct_splice_actor
  splice_direct_to_actor
  do_splice_direct
  do_sendfile
  sys_sendfile64
  do_syscall_64
  return_from_SYSCALL_64
    4950
  • Stretch
$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  do_iter_readv_writev
  do_readv_writev
  vfs_writev
  do_writev
  SyS_writev
  do_syscall_64
  return_from_SYSCALL_64
    123
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  __vfs_write
  vfs_write
  SyS_write
  do_syscall_64
  return_from_SYSCALL_64
    172
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  kernel_sendmsg
  sock_no_sendpage
  tcp_sendpage
  inet_sendpage
  kernel_sendpage
  sock_sendpage
  pipe_to_sendpage
  __splice_from_pipe
  splice_from_pipe
  generic_splice_sendpage
  direct_splice_actor
  splice_direct_to_actor
  do_splice_direct
  do_sendfile
  sys_sendfile64
  do_syscall_64
  return_from_SYSCALL_64
    735110

If you diff the most popular stacks, you’ll get:

--- jessie.txt  2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
 tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
 inet_sendpage
 kernel_sendpage
 sock_sendpage

Let’s look closer at tcp_sendpage:

int tcp_sendpage(struct sock *sk, struct page *page, int offset,
         size_t size, int flags)
{
    ssize_t res;

    if (!(sk->sk_route_caps & NETIF_F_SG) ||
        !sk_check_csum_caps(sk))
        return sock_no_sendpage(sk->sk_socket, page, offset, size,
                    flags);

    lock_sock(sk);

    tcp_rate_check_app_limited(sk);  /* is sending application-limited? */

    res = do_tcp_sendpages(sk, page, offset, size, flags);
    release_sock(sk);
    return res;
}

It looks like we don’t enter the if body. We looked up what NET_F_SG does: segmentation offload. This difference is peculiar, since both OS’es should have this enabled.

Even deeper, to the crux

It turned out that we had segmentation offload enabled for only a few of our NICs: eth2, eth3, and bond0. Our network setup can be described as follows:

eth2 -->|              |--> vlan10
        |---> bond0 -->|
eth3 -->|              |--> vlan100

The missing piece was that we were missing segmentation offload on VLAN interfaces, where the actual IPs live.

Here’s the diff from ethtook -k vlan10:

$ diff -rup <(ssh jessie sudo ethtool -k vlan10) <(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63  2017-08-16 21:21:12.000000000 -0700
+++ /dev/fd/62  2017-08-16 21:21:12.000000000 -0700
@@ -1,21 +1,21 @@
 Features for vlan10:
 rx-checksumming: off [fixed]
-tx-checksumming: off
+tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
-       tx-checksum-ip-generic: off
+       tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off
        tx-checksum-sctp: off
-scatter-gather: off
-       tx-scatter-gather: off
+scatter-gather: on
+       tx-scatter-gather: on
        tx-scatter-gather-fraglist: off
-tcp-segmentation-offload: off
-       tx-tcp-segmentation: off [requested on]
-       tx-tcp-ecn-segmentation: off [requested on]
-       tx-tcp-mangleid-segmentation: off [requested on]
-       tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+       tx-tcp-segmentation: on
+       tx-tcp-ecn-segmentation: on
+       tx-tcp-mangleid-segmentation: on
+       tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
 generic-receive-offload: on
 large-receive-offload: off [fixed]
 rx-vlan-offload: off [fixed]

So we entusiastically enabled segmentation offload:

$ sudo ethtool -K vlan10 sg on

And it didn’t help! Will the suffering ever end? Let’s also enable TCP transmission checksumming offload:

$ sudo ethtool -K vlan10 tx on
Actual changes:
tx-checksumming: on
        tx-checksum-ip-generic: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: on

Nothing. The diff is essentially empty now:

$ diff -rup <(ssh jessie sudo ethtool -k vlan10) <(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63  2017-08-16 21:31:27.000000000 -0700
+++ /dev/fd/62  2017-08-16 21:31:27.000000000 -0700
@@ -4,11 +4,11 @@ tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
-       tx-checksum-fcoe-crc: off [requested on]
-       tx-checksum-sctp: off [requested on]
+       tx-checksum-fcoe-crc: off
+       tx-checksum-sctp: off
 scatter-gather: on
        tx-scatter-gather: on
-       tx-scatter-gather-fraglist: off [requested on]
+       tx-scatter-gather-fraglist: off
 tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on

The last missing piece we found was that offload changes are applied only during connection initiation, so we restarted Kafka, and we immediately saw a performance improvement (green line):

Tracing System CPU on Debian Stretch

Not enabling offload features when possible seems like a pretty bad regression, so we filed a ticket for systemd:

In the meantime, we work around our upstream issue by enabling offload features automatically on boot if they are disabled on VLAN interfaces.

Having a fix enabled, we rebooted our logs Kafka cluster to upgrade to the latest kernel, and our 5 day CPU usage history yielded positive results:

Tracing System CPU on Debian Stretch

The DNS cluster also yielded positive results, with just 2 nodes rebooted (purple line going down):

Tracing System CPU on Debian Stretch

Conclusion

It was an error on our part to hit performance degradation without a good regression framework in place to catch the issue. Luckily, due to our heavy use of version control, we managed to bisect the issue rather quickly, and had a temp rollback in place while root causing the problem.

In the end, enabling offload also removed RCU stalls. It’s not really clear whether it was the cause or just a catalyst, but the end result speaks for itself.

On the bright side, we dug pretty deep into Linux kernel internals, and although there were fleeting moments of giving up, moving to the woods to become a park ranger, we persevered and came out of the forest successful.


If deep diving from high level symptoms to kernel/OS issues makes you excited, drop us a line.


Read more here:: CloudFlare