When Netlink Process Connectors Don’t Process
If you want to track which processes are running on a Linux machine, the Netlink Process Connectors API is a convenient solution, despite certain limitations. The API provides an easy way to receive notifications whenever processes are created (forked/cloned) and whenever they undergo lifecycle events (exec, exit, etc.). One possible use of the API is implementing a tool like htop.1
There is only one problem with the API: Sometimes it doesn’t work. You use the API according to the documentation, yet your program receives no process notifications. You scratch your head and run your program a few more times and it still doesn’t work. Then something even stranger happens: the final time you run your program, the API “wakes up” and starts working. You shrug, deploy your code to production, and get the occasional customer complaint that relevant features are broken. What is going on?
Background on the process connector API
Before looking at this issue in detail, some brief background about the netlink process connector API is appropriate. Like all netlink APIs, you access it through sockets.2 First you open a netlink socket with socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR). Then you bind the socket with a sockaddr_nl struct that specifies which netlink API you want - in our case, the process connector API.3 Lastly, you send a packet with a PROC_CN_MCAST_LISTEN message to notify the process connector that you are ready to receive notifications.
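For concreteness, here is a minimal sketch of that setup in C. Error handling is trimmed, and the send_mcast_op and open_proc_connector helper names are mine, not part of any API:

```c
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

/* Send a PROC_CN_MCAST_LISTEN or PROC_CN_MCAST_IGNORE control message. */
static int send_mcast_op(int sock, enum proc_cn_mcast_op op)
{
    char buf[NLMSG_SPACE(sizeof(struct cn_msg) + sizeof(op))];
    memset(buf, 0, sizeof(buf));

    struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
    nlh->nlmsg_len  = sizeof(buf);
    nlh->nlmsg_type = NLMSG_DONE;
    nlh->nlmsg_pid  = getpid();

    struct cn_msg *cn = (struct cn_msg *)NLMSG_DATA(nlh);
    cn->id.idx = CN_IDX_PROC;   /* address the process connector */
    cn->id.val = CN_VAL_PROC;
    cn->len    = sizeof(op);
    memcpy(cn->data, &op, sizeof(op));

    return send(sock, nlh, nlh->nlmsg_len, 0) < 0 ? -1 : 0;
}

static int open_proc_connector(void)
{
    int sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    if (sock < 0)
        return -1;

    struct sockaddr_nl addr = {
        .nl_family = AF_NETLINK,
        .nl_groups = CN_IDX_PROC,   /* join the process connector multicast group */
        .nl_pid    = getpid(),
    };
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;

    /* Ask the kernel to start sending process event notifications. */
    if (send_mcast_op(sock, PROC_CN_MCAST_LISTEN) < 0)
        return -1;
    return sock;
}
```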
Now it is the kernel's turn. The process connector in the kernel fills your socket with notifications about processes. You read those notifications using recv as if they were regular network packets. Because the API uses sockets, you can filter the notifications using BPFs - just be aware of the usual trouble when applying BPFs to newly created sockets.
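The read loop, sketched below, builds on the snippet above (it assumes the same headers plus <stdio.h>). Each datagram carries a cn_msg whose payload is a proc_event describing a fork, exec, exit, and so on:

```c
#include <stdio.h>

static void read_events(int sock)
{
    char buf[4096];

    for (;;) {
        ssize_t len = recv(sock, buf, sizeof(buf), 0);
        if (len <= 0)
            break;

        /* For brevity this handles one netlink message per datagram. */
        struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
        struct cn_msg *cn = (struct cn_msg *)NLMSG_DATA(nlh);
        struct proc_event *ev = (struct proc_event *)cn->data;

        switch (ev->what) {
        case PROC_EVENT_FORK:
            printf("fork: parent %d -> child %d\n",
                   ev->event_data.fork.parent_pid,
                   ev->event_data.fork.child_pid);
            break;
        case PROC_EVENT_EXEC:
            printf("exec: pid %d\n", ev->event_data.exec.process_pid);
            break;
        case PROC_EVENT_EXIT:
            printf("exit: pid %d (code %u)\n",
                   ev->event_data.exit.process_pid,
                   ev->event_data.exit.exit_code);
            break;
        default:
            break;
        }
    }
}
```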
When your application is done working with the process connector API, you're supposed to send a PROC_CN_MCAST_IGNORE message to be a good citizen and let the kernel know that you're done using the API. You might think that general operating system principles apply here and the OS will clean up for your process when it exits no matter what. As we'll see later on, that isn't true here.
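With the hypothetical send_mcast_op helper from the sketch above, the teardown is just:

```c
/* Tell the kernel we no longer want notifications; the kernel will not
 * do this on our behalf when the process exits. */
send_mcast_op(sock, PROC_CN_MCAST_IGNORE);
close(sock);
```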
Alright, what can possibly go wrong with what we have described?
Searching for the bug in kernel sources
When our bug occurred, we didn't receive any messages from the kernel about process events. It was as if the kernel never received our PROC_CN_MCAST_LISTEN message, so let's look at the kernel code which handles such messages. As always, we'll start with the Elixir source viewer and search for PROC_CN_MCAST_LISTEN. Aside from headers, the constant appears only once in the kernel:
static void cn_proc_mcast_ctl(struct cn_msg *msg, struct netlink_skb_parms *nsp)
{
    // ... snip! code removed for conciseness
    mc_op = (enum proc_cn_mcast_op *)msg->data;
    switch (*mc_op) {
    case PROC_CN_MCAST_LISTEN:
        atomic_inc(&proc_event_num_listeners);
        break;
    case PROC_CN_MCAST_IGNORE:
        atomic_dec(&proc_event_num_listeners);
        break;
    default:
        err = EINVAL;
        break;
    }
out:
    cn_proc_ack(err, msg->seq, msg->ack);
}
Basically, when the kernel receives a PROC_CN_MCAST_LISTEN message it atomically increments proc_event_num_listeners and sends an acknowledgment message back to usermode.

Let's look at the definition of proc_event_num_listeners:
static atomic_t proc_event_num_listeners = ATOMIC_INIT(0);
Yikes. proc_event_num_listeners is a global variable. It is updated on PROC_CN_MCAST_LISTEN and PROC_CN_MCAST_IGNORE regardless of whether the process sending the message was already listening or not. Furthermore, atomic_t is equivalent to an int, so if a process sends multiple PROC_CN_MCAST_IGNORE messages then proc_event_num_listeners can actually hold negative values. This impacts not just the buggy application but all applications using the process connector API. This looks like the cause of our bug. Let's fill in one last piece of the puzzle by looking at how proc_event_num_listeners is used:
void proc_exec_connector(struct task_struct *task)
{
    struct cn_msg *msg;
    struct proc_event *ev;
    __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8);

    if (atomic_read(&proc_event_num_listeners) < 1)
        return;

    // ... snip!
    // send notification of exec to usermode
}
This is the handler that runs whenever an exec event happens. The kernel uses proc_event_num_listeners to quickly determine whether there are any active users of the process connector API. When proc_event_num_listeners < 1, the kernel bypasses the entire process connector.
The bug
Let's put the pieces together. Something like this happened:
- Some usermode program sent too many PROC_CN_MCAST_IGNORE messages and proc_event_num_listeners ended up at a negative value - let's say negative three.
- We ran our application three times and the process connector didn't work. Unbeknownst to us, each run had a hidden side effect: the application sent a PROC_CN_MCAST_LISTEN message which incremented proc_event_num_listeners by one each time.
- We ran our application a fourth time. This time proc_event_num_listeners started out at zero, so our PROC_CN_MCAST_LISTEN message worked like it should and the process connector started sending data.
What caused the initial problem where PROC_CN_MCAST_IGNORE was sent too many times? It turns out there was a running application which enabled/disabled a process monitoring module based on a configuration file. Whenever the configuration file was reloaded with the process monitoring module disabled, a PROC_CN_MCAST_IGNORE message was sent - even though the application never sent a PROC_CN_MCAST_LISTEN message to begin with.
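Here is a minimal sketch of that failure mode, again using the hypothetical send_mcast_op helper from earlier. Sending PROC_CN_MCAST_IGNORE without a matching PROC_CN_MCAST_LISTEN drives the kernel's global counter negative and silences every other listener on the machine:

```c
/* Reproduction sketch: IGNORE without a matching LISTEN.
 * Each message decrements the global proc_event_num_listeners in the kernel,
 * so other processes' subscriptions stop delivering events until enough
 * LISTEN messages bring the counter back above zero. */
int sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
struct sockaddr_nl addr = {
    .nl_family = AF_NETLINK,
    .nl_groups = CN_IDX_PROC,
    .nl_pid    = getpid(),
};
bind(sock, (struct sockaddr *)&addr, sizeof(addr));

for (int i = 0; i < 3; i++)
    send_mcast_op(sock, PROC_CN_MCAST_IGNORE);   /* never sent LISTEN */
```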
One last question: should this be considered a kernel bug? I think so, so I’ve opened a ticket.
If you found this post interesting, read my post about the difficulties of tracking running processes on Linux which compares the process connector API with alternative solutions.
Work with me
Does this sort of thing interest you?
I started as a low-level engineer, but today I'm the co-founder and CEO of Robusta.dev and we're hiring! I still do the occasional deep technical dive, alongside building a world-class team of excellent engineers and a product used by hundreds of companies.
If you join our team, you’ll work closely with me and be a core part of the founding team. Email [email protected] and mention you came from this post. We’re hiring in Israel as well as remote.
1. In practice, (h)top implementations typically don't use the process connector API - they just do a full scan of /proc every few seconds. This is easier to implement because they have to scan /proc anyway (even with the process connector API) to read the initial list of processes on startup. Furthermore, the overhead of polling /proc doesn't matter for a tool like top which is run infrequently by the user, on demand. [return]
2. See this Linux Journal article for the advantages of implementing kernel APIs with netlink sockets as opposed to system calls, ioctls, and virtual filesystems. [return]
3. Other popular options include the Linux audit API, which also uses netlink for kernel-usermode communication. [return]