Searching for the cause of hung tasks in the Linux kernel

Depending on your configuration, the Linux kernel can produce a hung task warning message in its log. Searching the Internet and the kernel documentation, you can find a brief explanation that the process is stuck in the uninterruptible state and hasn’t been scheduled on the CPU for an unexpectedly long period of time. That explains what the warning means, but not why it happened. In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or in the application itself, and whether it is worth monitoring at all.

INFO: task XXX:1495882 blocked for more than YYY seconds.

The hung task message in the kernel log looks like this:

INFO: task XXX:1495882 blocked for more than YYY seconds.
Tainted: G O 6.6.39-cloudflare-2024.7.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:XXX state:D stack:0 pid:1495882 ppid:1 flags:0x00004002
. . .

Processes in Linux can be in different states. Some of them are running or ready to run on the CPU — they are in the TASK_RUNNING state. Others are waiting for a signal or an event to happen, e.g. network packets to arrive or terminal input from a user. They are in the TASK_INTERRUPTIBLE state and can spend an arbitrary length of time in it until being woken up by a signal. The most important thing about this state is that the process can still receive signals and be terminated by one. In contrast, a process in the TASK_UNINTERRUPTIBLE state waits only for certain special classes of events to wake it up and can’t be interrupted by a signal. Signals are not delivered until the process emerges from this state, and if it never does, only a system reboot can clear the process. It’s marked with the letter D in the log shown above.

What if this wake-up event doesn’t happen, or happens with a significant delay? (A “significant delay” may be on the order of seconds or minutes, depending on the system.) Then our dependent process is hung in this state. What if this process also holds a lock and prevents other processes from acquiring it? Or what if we see many processes in the D state? Then it might tell us that some system resource is overwhelmed or not working correctly. At the same time, this state is very valuable, especially if we want to preserve the process memory. It is useful if part of the data has been written to disk and another part is still in the process memory — we don’t want inconsistent data on disk. Or maybe we want a snapshot of the process memory at the moment a bug is hit. To preserve this behaviour, but make it more controlled, a new state was introduced in the kernel: TASK_KILLABLE. It still protects a process, but allows termination with a fatal signal.
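
As a quick way to see this state in practice, you can read the third field of /proc/<pid>/stat, which holds the state letter. Below is a minimal Go sketch (our own illustration, not part of any kernel interface) that scans /proc and prints every process whose main thread is currently in the D state:

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    entries, err := os.ReadDir("/proc")
    if err != nil {
        panic(err)
    }
    for _, e := range entries {
        pid := e.Name()
        // Only numeric directory names in /proc are processes.
        if !e.IsDir() || strings.Trim(pid, "0123456789") != "" {
            continue
        }
        data, err := os.ReadFile("/proc/" + pid + "/stat")
        if err != nil {
            continue // the process may have exited already
        }
        // /proc/<pid>/stat looks like: "pid (comm) state ..." and the
        // state letter follows the closing parenthesis of the comm field.
        s := string(data)
        i := strings.LastIndex(s, ") ")
        if i < 0 || len(s) < i+3 {
            continue
        }
        if s[i+2] == 'D' {
            fmt.Printf("pid=%s comm=%s state=D\n", pid, s[strings.Index(s, "(")+1:i])
        }
    }
}

Note that this sketch only checks the main thread of each process; per-thread state lives under /proc/<pid>/task/<tid>/stat.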

How Linux identifies the hung process

The Linux kernel has a special thread called khungtaskd. It runs periodically, depending on the settings, iterating over all processes in the D state. If a process has been in this state for more than YYY seconds, a message appears in the kernel log. There are settings for this daemon that you can tune to your needs:

$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 10
kernel.hung_task_warnings = 200

At Cloudflare, we changed the notification threshold kernel.hung_task_timeout_secs from the default 120 seconds to 10 seconds. You can adjust the value for your system depending on your configuration and how critical this delay is for you. If a process spends more than hung_task_timeout_secs seconds in the D state, a log entry is written, and our internal monitoring system emits an alert based on this log. Another important setting here is kernel.hung_task_warnings — the total number of messages that will be written to the log. We limit it to 200 messages and reset the counter every 15 minutes. This keeps us from being overwhelmed by a single recurring issue, while not silencing our monitoring for too long. You can make it unlimited by setting the value to -1.
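
In our case the periodic reset is handled by configuration management, but the idea itself fits in a few lines. Here is a minimal Go sketch, assuming root privileges and the 200-message budget mentioned above; the file path is just the standard sysctl interface under /proc/sys:

package main

import (
    "os"
    "time"
)

// Restore the hung task warning budget every 15 minutes by writing to the
// sysctl file, which has the same effect as
// `sysctl -w kernel.hung_task_warnings=200`. Requires root.
func main() {
    for {
        err := os.WriteFile("/proc/sys/kernel/hung_task_warnings", []byte("200\n"), 0o644)
        if err != nil {
            panic(err)
        }
        time.Sleep(15 * time.Minute)
    }
}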

To better understand the root causes of the hung tasks and how a system can be affected, we’re going to review more detailed examples. 

Example #1 or XFS

Typically, there is a meaningful process or application name in the log, but sometimes you might see something like this:

INFO: task kworker/13:0:834409 blocked for more than 11 seconds.
Tainted: G O 6.6.39-cloudflare-2024.7.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/13:0 state:D stack:0 pid:834409 ppid:2 flags:0x00004000
Workqueue: xfs-sync/dm-6 xfs_log_worker

In this log, kworker is a kernel worker thread. It’s used as a deferring mechanism, meaning a piece of work is scheduled to be executed in the future. Work from many different tasks is aggregated under kworker threads, which makes it difficult to tell which application is experiencing a delay. Luckily, the kworker entry is accompanied by the Workqueue line. A workqueue is essentially a list of work items, usually predefined in the kernel, to which these pieces of work are added and from which the kworker executes them in the order they were queued. The workqueue name, xfs-sync, and the work function it points to, xfs_log_worker, give a good clue where to look. Here we can assume that XFS is under pressure and check the relevant metrics. This helped us discover that, due to some configuration changes, we had forgotten to set the no_read_workqueue / no_write_workqueue flags that were introduced some time ago to speed up Linux disk encryption.
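
When the Workqueue line alone isn’t enough, you can look at where a stuck kworker is blocked by reading its kernel stack from /proc/<pid>/stack (root required, and only if the kernel exposes it; for a task sleeping in the D state it shows the wait path). A minimal Go sketch of this, taking the PID from the log as an argument:

package main

import (
    "fmt"
    "os"
)

// Print the command name and kernel stack of a task, e.g. the kworker PID
// taken from the hung task report. Run as root: /proc/<pid>/stack is not
// readable by regular users.
func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: kstack <pid>")
        os.Exit(1)
    }
    pid := os.Args[1]
    comm, err := os.ReadFile("/proc/" + pid + "/comm")
    if err != nil {
        panic(err)
    }
    stack, err := os.ReadFile("/proc/" + pid + "/stack")
    if err != nil {
        panic(err)
    }
    fmt.Printf("task: %s%s", comm, stack)
}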

Summary: In this case, nothing critical happened to the system, but the hung task warnings alerted us that our file system had slowed down.

Example #2 or Coredump

Let’s take a look at the next hung task log and its decoded stack trace:

INFO: task test:964 blocked for more than 5 seconds.
Not tainted 6.6.72-cloudflare-2025.1.7 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test state:D stack:0 pid:964 ppid:916 flags:0x00004000
Call Trace:
<TASK>
__schedule (linux/kernel/sched/core.c:5378 linux/kernel/sched/core.c:6697)
schedule (linux/arch/x86/include/asm/preempt.h:85 (discriminator 13) linux/kernel/sched/core.c:6772 (discriminator 13))
do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4))
? finish_task_switch.isra.0 (linux/arch/x86/include/asm/irqflags.h:42 linux/arch/x86/include/asm/irqflags.h:77 linux/kernel/sched/sched.h:1385 linux/kernel/sched/core.c:5132 linux/kernel/sched/core.c:5250)
do_group_exit (linux/kernel/exit.c:1005)
get_signal (linux/kernel/signal.c:2869)
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217)
? hrtimer_try_to_cancel.part.0 (linux/kernel/time/hrtimer.c:1347)
arch_do_signal_or_restart (linux/arch/x86/kernel/signal.c:310)
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217)
? hrtimer_nanosleep (linux/kernel/time/hrtimer.c:2105)
exit_to_user_mode_prepare (linux/kernel/entry/common.c:176 linux/kernel/entry/common.c:210)
syscall_exit_to_user_mode (linux/arch/x86/include/asm/entry-common.h:91 linux/kernel/entry/common.c:141 linux/kernel/entry/common.c:304)
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217)
do_syscall_64 (linux/arch/x86/entry/common.c:88)
entry_SYSCALL_64_after_hwframe (linux/arch/x86/entry/entry_64.S:121)
</TASK>

The stack trace says that the process or application test was blocked for more than 5 seconds. We might recognise this user space application by its name, but why is it blocked? It’s always helpful to check the stack trace when looking for a cause. The most interesting line here is do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4)). The source code points to the coredump_task_exit function. Additionally, checking the process metrics revealed that the application crashed at the time the warning message appeared in the log. When a process is terminated abnormally by one of a certain set of signals, the Linux kernel can provide a core dump file, if enabled. The mechanism works like this: when the process terminates, the kernel makes a snapshot of the process memory before exiting and either writes it to a file or hands it to another handler, which can be systemd-coredump or your custom one. While this happens, the kernel keeps the process in the D state to preserve its memory and protect it from early termination. The higher the process memory usage, the longer it takes to produce the core dump file, and the higher the chance of getting a hung task warning.
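
For reference, a custom handler of this kind is just a program referenced from kernel.core_pattern with a leading pipe character, for example |/usr/local/bin/corehandler %p (the path here is purely illustrative); the kernel runs it with the core image on its standard input. A minimal sketch in Go:

package main

import (
    "fmt"
    "io"
    "os"
)

// A minimal core_pattern pipe handler sketch. The kernel invokes it with the
// core image on stdin and the substituted specifiers (here %p, the crashing
// PID) as arguments. While this program reads stdin, the crashing task stays
// in the D state, which is exactly the window the hung task detector can flag.
func main() {
    pid := "unknown"
    if len(os.Args) > 1 {
        pid = os.Args[1]
    }
    // The destination path is illustrative; a real handler would add size
    // limits, compression, and metadata.
    f, err := os.Create("/var/tmp/core." + pid)
    if err != nil {
        os.Exit(1)
    }
    defer f.Close()
    n, err := io.Copy(f, os.Stdin)
    if err != nil {
        os.Exit(1)
    }
    fmt.Fprintf(os.Stderr, "wrote %d bytes for pid %s\n", n, pid)
}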

Let’s check our hypothesis by triggering it with a small Go program. We’ll use the default Linux coredump handler and will decrease the hung task threshold to 1 second.

Coredump settings:

$ sudo sysctl -a --pattern kernel.core
kernel.core_pattern = core
kernel.core_pipe_limit = 16
kernel.core_uses_pid = 1

You can make changes with sysctl:

$ sudo sysctl -w kernel.core_uses_pid=1

Hung task settings:

$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 1
kernel.hung_task_warnings = -1

Go program:

$ cat main.go
package main

import (
    "os"
    "time"
)

func main() {
    _, err := os.ReadFile("test.file")
    if err != nil {
        panic(err)
    }
    time.Sleep(8 * time.Minute)
}

This program reads a 10 GB file into process memory. Let’s create the file:

$ yes this is 10GB file | head -c 10GB > test.file

The last step is to build the Go program, crash it, and watch our kernel log:

$ go mod init test
$ go build .
$ GOTRACEBACK=crash ./test
$ (Ctrl+\ to send SIGQUIT; with GOTRACEBACK=crash the Go runtime crashes and dumps core)

Hooray! We can see our hung task warning:

$ sudo dmesg -T | tail -n 31
INFO: task test:8734 blocked for more than 22 seconds.
Not tainted 6.6.72-cloudflare-2025.1.7 #1
Blocked by coredump.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test state:D stack:0 pid:8734 ppid:8406 task_flags:0x400448 flags:0x00004000

By the way, have you noticed the Blocked by coredump. line in the log? It was recently added to the upstream code to improve visibility and remove the blame from the process itself. The patch also added the task_flags information, as Blocked by coredump is detected via the flag PF_POSTCOREDUMP, and knowing all the task flags is useful for further root-cause analysis.
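
Knowing the task flags is handy even without that hint. Here is a small Go sketch that decodes a few PF_* bits from the task_flags value in the report; the flag values are taken from include/linux/sched.h in recent kernels and may differ between versions, so treat them as an assumption to verify against your sources:

package main

import (
    "fmt"
    "os"
    "strconv"
)

// Flag values from include/linux/sched.h (recent kernels); double-check
// them against the sources of the kernel version you are debugging.
const (
    pfExiting      = 0x00000004
    pfPostCoredump = 0x00000008
    pfSignaled     = 0x00000400
)

func main() {
    // Pass the flags value from the hung task report, e.g. 0x400448.
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: flags <hex value>")
        os.Exit(1)
    }
    v, err := strconv.ParseUint(os.Args[1], 0, 64)
    if err != nil {
        panic(err)
    }
    fmt.Printf("PF_EXITING=%v PF_POSTCOREDUMP=%v PF_SIGNALED=%v\n",
        v&pfExiting != 0, v&pfPostCoredump != 0, v&pfSignaled != 0)
}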

Summary: This example showed that even if everything suggests the application is at fault, the real root cause can be something else — in this case, the coredump mechanism.

Example #3 or rtnl_mutex

This one was tricky to debug. Usually, the alerts are limited to one or two different processes, meaning only a certain application or subsystem is experiencing an issue. In this case, we saw dozens of unrelated tasks hanging for minutes with no improvement over time. Nothing else was in the log, most of the system metrics were fine, and existing traffic was being served, but it was not possible to ssh to the server. New Kubernetes container creation was also stalling. Analyzing the stack traces of the different tasks initially revealed that all the traces were limited to just three functions:

rtnetlink_rcv_msg+0x9/0x3c0
dev_ethtool+0xc6/0x2db0
bonding_show_bonds+0x20/0xb0

Further investigation showed that all of these functions were waiting for rtnl_lock to be acquired. It looked like some application had acquired the rtnl_mutex and wasn’t releasing it. All the other processes were stuck in the D state waiting for this lock.

The RTNL lock is used primarily by the kernel networking subsystem for any network-related configuration, both reads and writes. RTNL is a global mutex, although upstream efforts are under way to split it up per network namespace (netns).

From the hung task reports, we can observe the “victims” that are stalled waiting for the lock, but how do we identify the task that is holding it for too long? For this we leveraged BPF via a bpftrace script, as it allows us to inspect the running kernel state. The kernel’s mutex implementation has a struct member called owner. It contains a pointer to the task_struct of the mutex-owning process, encoded as type atomic_long_t, because the mutex implementation stores some state information in the lower 3 bits (mask 0x7) of this pointer. Thus, to read and dereference the task_struct pointer, we must first mask off those lower bits (0x7).

Our bpftrace script to determine who holds the mutex is as follows:

#!/usr/bin/env bpftrace
interval:s:10 {
    $rtnl_mutex = (struct mutex *) kaddr("rtnl_mutex");
    $owner = (struct task_struct *) ($rtnl_mutex->owner.counter & ~0x07);
    if ($owner != 0) {
        printf("rtnl_mutex->owner = %u %s\n", $owner->pid, $owner->comm);
    }
}

In this script, rtnl_mutex is a global lock whose address is exposed via /proc/kallsyms. Using the bpftrace helper function kaddr(), we can get a pointer to this struct mutex by its symbol name, and then periodically (via interval:s:10) check whether someone is holding the lock.

In the output we had this:

rtnl_mutex->owner = 3895365 calico-node

This allowed us to quickly identify calico-node as the process holding the RTNL lock for too long. To quickly observe where this process itself was stalled, we read its call stack from /proc/3895365/stack. It showed that the root cause was a WireGuard config change: the function wg_set_device() was holding the RTNL lock while peer_remove_after_dead() waited too long for a napi_disable() call. We continued debugging with a tool called drgn, a programmable debugger that can inspect a running kernel from a Python-like interactive shell. We still haven’t found the root cause of the WireGuard issue and have asked upstream for help, but that is another story.

Summary: The hung task messages were the only clue we had in the kernel log. Each of their stack traces was unique, but by carefully analyzing them we could spot similarities and continue debugging with other instruments.

Epilogue

Your system might produce different hung task warnings, and we have many others not mentioned here. Each case is unique, and there is no standard approach to debugging them. But hopefully this blog post helps you better understand why it’s good to have these warnings enabled, how they work, and what they mean. We also tried to provide some guidance for the debugging process:

analyzing the stack traces is a good starting point, even if all the messages look unrelated, as we saw in example #3

keep in mind that the alert might be misleading, pointing to the victim and not the offender, as we saw in example #2 and example #3

even when it is the kernel that stops scheduling your application on the CPU, puts it into the D state, and emits the warning, the real problem might still be in the application code

Good luck with your debugging, and hopefully this material will help you on this journey!
