When Netlink Process Connectors Don’t Process
If you want to track which processes are running on a Linux machine, the Netlink Process Connectors API is a convenient solution, despite certain limitations. The API provides an easy way to receive notifications whenever processes are created (forked/cloned) and whenever they undergo lifecycle events (exec, exit, etc.). One possible use of the API is implementing a tool like htop.1
There is only one problem with the API: Sometimes it doesn’t work. You use the API according to the documentation, yet your program receives no process notifications. You scratch your head and run your program a few more times and it still doesn’t work. Then something even stranger happens: the final time you run your program, the API “wakes up” and starts working. You shrug, deploy your code to production, and get the occasional customer complaint that relevant features are broken. What is going on?
Background on the process connector API
Before looking at this issue in detail, some brief background about the netlink process connector API is appropriate. Like all netlink APIs, you access it through sockets.2 First you open a netlink socket with socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR). Then you bind the socket with a sockaddr_nl struct that specifies which netlink API you want - in our case, the process connector API.3 Lastly, you send a packet with a PROC_CN_MCAST_LISTEN message to notify the process connector that you are ready to receive notifications.
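For concreteness, here is a minimal sketch of that setup in C. Error handling is trimmed, and the send_mcast_op and open_proc_connector helper names are mine, not part of any API:

```c
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

/* Send a PROC_CN_MCAST_LISTEN or PROC_CN_MCAST_IGNORE control message. */
static int send_mcast_op(int sock, enum proc_cn_mcast_op op)
{
    char buf[NLMSG_SPACE(sizeof(struct cn_msg) + sizeof(op))];
    memset(buf, 0, sizeof(buf));

    struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
    nlh->nlmsg_len  = sizeof(buf);
    nlh->nlmsg_type = NLMSG_DONE;
    nlh->nlmsg_pid  = getpid();

    struct cn_msg *cn = (struct cn_msg *)NLMSG_DATA(nlh);
    cn->id.idx = CN_IDX_PROC;   /* address the process connector */
    cn->id.val = CN_VAL_PROC;
    cn->len    = sizeof(op);
    memcpy(cn->data, &op, sizeof(op));

    return send(sock, nlh, nlh->nlmsg_len, 0) < 0 ? -1 : 0;
}

static int open_proc_connector(void)
{
    int sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    if (sock < 0)
        return -1;

    struct sockaddr_nl addr = {
        .nl_family = AF_NETLINK,
        .nl_groups = CN_IDX_PROC,   /* join the process connector multicast group */
        .nl_pid    = getpid(),
    };
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;

    /* Ask the kernel to start sending process event notifications. */
    if (send_mcast_op(sock, PROC_CN_MCAST_LISTEN) < 0)
        return -1;
    return sock;
}
```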
Now it is the kernel's turn. The process connector in the kernel fills your socket with notifications about processes. You read those notifications using recv as if they were regular network packets. Because the API uses sockets, you can filter the notifications using BPFs - just be aware of the usual trouble when applying BPFs to newly created sockets.
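The read loop, sketched below, builds on the snippet above (it assumes the same headers plus <stdio.h>). Each datagram carries a cn_msg whose payload is a proc_event describing a fork, exec, exit, and so on:

```c
#include <stdio.h>

static void read_events(int sock)
{
    char buf[4096];

    for (;;) {
        ssize_t len = recv(sock, buf, sizeof(buf), 0);
        if (len <= 0)
            break;

        /* For brevity this handles one netlink message per datagram. */
        struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
        struct cn_msg *cn = (struct cn_msg *)NLMSG_DATA(nlh);
        struct proc_event *ev = (struct proc_event *)cn->data;

        switch (ev->what) {
        case PROC_EVENT_FORK:
            printf("fork: parent %d -> child %d\n",
                   ev->event_data.fork.parent_pid,
                   ev->event_data.fork.child_pid);
            break;
        case PROC_EVENT_EXEC:
            printf("exec: pid %d\n", ev->event_data.exec.process_pid);
            break;
        case PROC_EVENT_EXIT:
            printf("exit: pid %d (code %u)\n",
                   ev->event_data.exit.process_pid,
                   ev->event_data.exit.exit_code);
            break;
        default:
            break;
        }
    }
}
```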
When your application is done working with the process connector API, you're supposed to send a PROC_CN_MCAST_IGNORE message to be a good citizen and let the kernel know that you're done using the API. You might think that general operating system principles apply here and the OS will clean up for your process when it exits no matter what. As we'll see later on, that isn't true here.
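With the hypothetical send_mcast_op helper from the sketch above, the teardown is just:

```c
/* Tell the kernel we no longer want notifications; the kernel will not
 * do this on our behalf when the process exits. */
send_mcast_op(sock, PROC_CN_MCAST_IGNORE);
close(sock);
```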
Alright, what can possibly go wrong with what we have described?
Searching for the bug in kernel sources
When our bug occurred, we didn't receive any messages from the kernel about process events. It was as if the kernel never received our PROC_CN_MCAST_LISTEN message, so let's look at the kernel code which handles such messages. As always, we'll start with the Elixir source viewer and search for PROC_CN_MCAST_LISTEN. Aside from headers, the constant appears only once in the kernel:
static void cn_proc_mcast_ctl(struct cn_msg *msg, struct netlink_skb_parms *nsp)
{
    // ... snip! code removed for conciseness
    mc_op = (enum proc_cn_mcast_op *)msg->data;
    switch (*mc_op) {
    case PROC_CN_MCAST_LISTEN:
        atomic_inc(&proc_event_num_listeners);
        break;
    case PROC_CN_MCAST_IGNORE:
        atomic_dec(&proc_event_num_listeners);
        break;
    default:
        err = EINVAL;
        break;
    }
out:
    cn_proc_ack(err, msg->seq, msg->ack);
}
Basically, when the kernel receives a PROC_CN_MCAST_LISTEN message it atomically increments proc_event_num_listeners and sends an acknowledgment message back to usermode.

Let's look at the definition of proc_event_num_listeners:
static atomic_t proc_event_num_listeners = ATOMIC_INIT(0);
Yikes. proc_event_num_listeners is a global variable. It is updated on PROC_CN_MCAST_LISTEN and PROC_CN_MCAST_IGNORE regardless of whether the process sending the message was already listening or not. Furthermore, atomic_t is equivalent to an int, so if a process sends multiple PROC_CN_MCAST_IGNORE messages then proc_event_num_listeners can actually hold negative values. This impacts not just the buggy application but all applications using the process connector API. This looks like the cause of our bug. Let's fill in one last piece of the puzzle by looking at how proc_event_num_listeners is used:
void proc_exec_connector(struct task_struct *task)
{
    struct cn_msg *msg;
    struct proc_event *ev;
    __u8 buffer[CN_PROC_MSG_SIZE] __aligned(8);

    if (atomic_read(&proc_event_num_listeners) < 1)
        return;

    // ... snip!
    // send notification of exec to usermode
}
This is the handler that runs whenever an exec event happens. The kernel uses proc_event_num_listeners to quickly determine whether there are any active users of the process connector API. When proc_event_num_listeners < 1, the kernel bypasses the entire process connector.
The bug
Let's put the pieces together. Something like this happened:
- Some usermode program sent too many PROC_CN_MCAST_IGNORE messages and proc_event_num_listeners ended up at a negative value - let's say negative three.
- We ran our application three times and the process connector didn't work. Unbeknownst to us, each run had a hidden side effect: the application sent a PROC_CN_MCAST_LISTEN message which incremented proc_event_num_listeners by one each time.
- We ran our application a fourth time. This time proc_event_num_listeners started out at zero, so our PROC_CN_MCAST_LISTEN message worked like it should and the process connector started sending data.
What caused the initial problem where PROC_CN_MCAST_IGNORE was sent too many times? It turns out there was a running application which enabled/disabled a process monitoring module based on a configuration file. Whenever the configuration file was reloaded with the process monitoring module disabled, a PROC_CN_MCAST_IGNORE message was sent - even though the application never sent a PROC_CN_MCAST_LISTEN message to begin with.
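Here is a minimal sketch of that failure mode, again using the hypothetical send_mcast_op helper from earlier. Sending PROC_CN_MCAST_IGNORE without a matching PROC_CN_MCAST_LISTEN drives the kernel's global counter negative and silences every other listener on the machine:

```c
/* Reproduction sketch: IGNORE without a matching LISTEN.
 * Each message decrements the global proc_event_num_listeners in the kernel,
 * so other processes' subscriptions stop delivering events until enough
 * LISTEN messages bring the counter back above zero. */
int sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
struct sockaddr_nl addr = {
    .nl_family = AF_NETLINK,
    .nl_groups = CN_IDX_PROC,
    .nl_pid    = getpid(),
};
bind(sock, (struct sockaddr *)&addr, sizeof(addr));

for (int i = 0; i < 3; i++)
    send_mcast_op(sock, PROC_CN_MCAST_IGNORE);   /* never sent LISTEN */
```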
One last question: should this be considered a kernel bug? I think so, so I’ve opened a ticket.
If you found this post interesting, read my post about the difficulties of tracking running processes on Linux which compares the process connector API with alternative solutions.
Work with me
Does this sort of thing interest you?
I started as a low-level engineer, but today I'm the co-founder and CEO of Robusta.dev and we're hiring! I still do the occasional deep technical dive, alongside building a world-class team of excellent engineers and a product used by hundreds of companies.
If you join our team, you’ll work closely with me and be a core part of the founding team. Email [email protected] and mention you came from this post. We’re hiring in Israel as well as remote.
1. In practice, (h)top implementations typically don't use the process connector API - they just do a full scan of /proc every few seconds. This is easier to implement because they have to scan /proc anyway (even with the process connector API) to read the initial list of processes on startup. Furthermore, the overhead of polling /proc doesn't matter for a tool like top which is run infrequently by the user, on demand. [return]
2. See this Linux Journal article for the advantages of implementing kernel APIs with netlink sockets as opposed to system calls, ioctls, and virtual filesystems. [return]
3. Other popular options include the Linux audit API, which also uses netlink for kernel-usermode communication. [return]