The Difficulties of Tracking Running Processes on Linux

Introduction

Everyone knows how to track which processes run on Linux, but almost no one tracks them accurately. In fact, every method listed in this post has some deficiency or another. Let’s define the requirements:

All processes should be logged including short-lived processes
We should know the full executable path of every process that runs
Within reason, we shouldn’t need to modify or recompile our code for different kernel versions
Bonus: If the host is a Kubernetes node or runs docker, then we should be able to determine which pod/container a process belongs to. To do so, it is often sufficient to know a process’ cgroup ID. ¹
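
To make the bonus requirement concrete, here is a minimal Python sketch of one common usermode heuristic: read /proc/<pid>/cgroup and look for a 64-character hex container ID in the cgroup path. Note that this works on the cgroup path rather than the kernel’s numeric cgroup ID, and the path layouts it expects (for example /docker/<id> or docker-<id>.scope) are assumptions that vary by container runtime and cgroup driver. The function names are hypothetical.

```python
import re

# Assumption: the container runtime embeds a 64-hex-digit container ID
# somewhere in the cgroup path. This is a heuristic, not a kernel guarantee.
_CONTAINER_ID = re.compile(r"(?<![0-9a-f])[0-9a-f]{64}(?![0-9a-f])")


def container_id_from_cgroup(cgroup_text):
    """Return the first container-like ID found in /proc/<pid>/cgroup text, or None."""
    for line in cgroup_text.splitlines():
        # Each line has the form hierarchy-ID:controller-list:cgroup-path
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None


def container_id_for_pid(pid):
    """Hypothetical helper: apply the heuristic to a live process."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            return container_id_from_cgroup(f.read())
    except OSError:
        return None  # the process already exited: the usual /proc race
```

Because this reads from /proc, it has the same race with short-lived processes as any other /proc-based lookup.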

Let’s look at common Linux APIs that can solve this problem. For simplicity’s sake, we’ll focus on detecting execve syscalls. A full solution will also need to monitor fork/clone syscalls and their variants, as well as execveat.

Simple Usermode Solutions

Poll /proc. This is no good because it will miss short-lived processes.
Use the netlink process connector. The connector will deliver notifications for short-lived processes, but the notifications only include numerical data like the process’ pid, without data like the executable path. Therefore, you’re back to reading data from /proc and have the same race condition for short-lived processes. If you use the netlink process connector, you should be aware of a bug that causes events to disappear and of how to interpret strange-looking ppids in the output.
Use the Linux audit API. This is the best solution out there. The audit API exists in all modern kernels, provides full executable paths, and won’t miss short-lived processes. There are only two major disadvantages. First of all, only one usermode program can add rules to the kernel audit API at a time. This is a pain if you are developing an enterprise security solution and have customers who use the audit API themselves via auditd or osquery.² Second of all, the audit API isn’t container aware, despite years of kernel mailing list discussions on fixing the issue. I’ve written more about the difficulties I encountered using the Linux audit API here.
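
The race in the /proc approach is easy to see in code. Below is a minimal Python sketch of a poller that relies only on the /proc/<pid>/exe symlink. The names snapshot_exes and new_processes are made up for illustration, and the proc_root parameter exists only so the function can be pointed at a test directory.

```python
import os


def snapshot_exes(proc_root="/proc"):
    """Map each visible PID to the target of its <proc_root>/<pid>/exe symlink."""
    exes = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            exes[int(entry)] = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            # The process exited between listdir() and readlink(), has no
            # exe link, or we lack permission to read it.
            continue
    return exes


def new_processes(before, after):
    """Return the PIDs (and exe paths) present in `after` but not in `before`."""
    return {pid: exe for pid, exe in after.items() if pid not in before}
```

Calling snapshot_exes() in a loop and diffing consecutive results with new_processes() only reports processes that were alive at the moment of a snapshot. Anything that starts and exits between two polls never shows up, and a reused PID is mistaken for an old process.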

Simple Kernel Debugging Solutions

These solutions all involve a single kernel probe of various types.

Use tracepoints³. The kernel contains several relevant tracepoints which execute at different points in the execve syscall. They are: sched_process_exec, open_exec, sys_enter_execve, sys_exit_execve.⁴ These tracepoints are better than the previous solutions because they will track short-lived processes, but none of them provides an executable’s full path when the parameter to exec is a relative path. In other words, if the user runs cd /bin && ./ls then the path will be reported as ./ls and not /bin/ls. Here is a simple demonstration:

# enable the sched_process_exec tracepoint
sudo -s
cd /sys/kernel/debug/tracing
echo 1 > events/sched/sched_process_exec/enable

# run ls via a relative path
cd /bin && ./ls

# fetch data from the sched_process_exec tracepoint
# note that we don't see the full path
cd -
cat trace | grep ls

# disable the tracepoint
echo 0 > events/sched/sched_process_exec/enable
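
If you want to consume this output programmatically, a small parser is enough. The Python sketch below assumes the event’s fields are printed as filename=... pid=... old_pid=... (check events/sched/sched_process_exec/format on your kernel to confirm). The leading columns of each trace line vary between kernel versions, so the regex ignores them. The sample line is hand-written for illustration, not captured output, and the function names are hypothetical.

```python
import re

# Assumption: the event text ends with "filename=<path> pid=<n> old_pid=<n>".
_EXEC_EVENT = re.compile(
    r"sched_process_exec: filename=(?P<filename>.*) "
    r"pid=(?P<pid>\d+) old_pid=(?P<old_pid>\d+)\s*$"
)


def parse_exec_event(line):
    """Return (pid, filename) for a sched_process_exec trace line, else None."""
    match = _EXEC_EVENT.search(line)
    if match is None:
        return None
    return int(match.group("pid")), match.group("filename")


def has_full_path(filename):
    """True when the tracepoint reported an absolute path."""
    return filename.startswith("/")


# Hand-written sample line, for illustration only.
sample = "ls-4242 [001] .... 100.000000: sched_process_exec: filename=./ls pid=4242 old_pid=4242"
pid, filename = parse_exec_event(sample)
print(pid, filename, has_full_path(filename))  # prints: 4242 ./ls False
```

The parser cannot recover /bin/ls from ./ls, which is exactly the limitation described above.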

Use kprobes/kretprobes⁵. Unlike tracepoints, there are many, many possible functions where you can insert a kprobe which will be hit during an execve syscall. However, I can’t find a single function in execve’s callgraph which has as function parameters both the process’ PID and the full path of the executable. Therefore we have the same issue with relative paths as the tracepoint solution. There are some clever hacks you can do here - after all, kprobes can read data from the kernel’s callstack - but these solutions won’t be stable across kernel versions so I’m ruling them out.
Use eBPF programs with tracepoints/kprobes/kretprobes⁶. This opens up some new options. Now we can run arbitrary code in the kernel every time that the execve syscall runs. In theory, this should let us extract any information we want from the kernel and send it to usermode. There are two ways of obtaining such data and neither meets our requirements:
1. Read data from kernel structs like task_struct or linux_binprm. We can indeed fetch the executable’s full path this way⁷ but reading from kernel structs will make us dependent on kernel versions. Our eBPF program needs to know the offsets of struct members so it has to be compiled with kernel headers for each kernel version. This is typically solved by compiling the eBPF program at runtime, but that brings its own issues, like a requirement that you have kernel headers available on every machine.
2. Use eBPF helper functions to fetch data from the kernel. This is compatible across all kernel versions that contain the helper you use. In this method you never access kernel structs directly - rather you use helper APIs to fetch data. There is only one problem: there is no eBPF helper function which can obtain the executable’s full path. (However, in recent kernel versions, there is an eBPF helper function to get the cgroup ID which is useful for mapping processes to containers.)

Hackish Solutions:

Use LD_PRELOAD on every running executable and hook exec calls in libc. Seriously, don’t do this. It won’t work for statically compiled executables, is easy for malicious code to bypass, and is fairly intrusive.
Use tracepoints on execve, fork/clone, and chdir to track not only the creation of all processes but also their current working directory. For each execve, look up the process’ working directory and combine that with execve’s parameter to obtain a full path. If you do this, make sure you use eBPF maps and put all the logic into your eBPF programs to avoid race conditions where events arrive in usermode in the wrong order.
Use ptrace-based solutions. These are too intrusive for production code. However, if you do go this route then use ptrace + seccomp and the SECCOMP_RET_TRACE flag. Then seccomp can intercept all execve syscalls in the kernel and pass them to a usermode debugger which can log the execve call before telling seccomp to continue with the execve as usual.
Use AppArmor. You can write an AppArmor profile which forbids a process from executing any other executables. If you put that profile in complain mode then AppArmor won’t actually prevent process execution - it will only issue alerts when the profile is violated. If we attach our profile to every running process then we will have a working but very ugly and hackish solution. You probably shouldn’t do this.
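
The working-directory trick from the list above (tracepoints on execve, fork/clone, and chdir) boils down to a small piece of bookkeeping. Here is a Python model of that logic, purely as a sketch: in a real implementation the dictionary would be an eBPF map and the handlers would run in the kernel, as recommended above. The class and method names are hypothetical. It also assumes paths can be joined lexically, which is wrong when a path component is a symlink followed by "..", and it ignores fchdir and chroot.

```python
import posixpath


class CwdTracker:
    """Tracks each PID's working directory from fork/chdir/exit events."""

    def __init__(self):
        self.cwd = {}  # pid -> absolute working directory

    def _resolve(self, pid, path):
        # Absolute paths need no context; relative ones need the tracked cwd.
        if posixpath.isabs(path):
            return posixpath.normpath(path)
        base = self.cwd.get(pid)
        if base is None:
            return None  # process predates tracking, cwd unknown
        return posixpath.normpath(posixpath.join(base, path))

    def on_fork(self, parent_pid, child_pid):
        # A child inherits its parent's working directory.
        if parent_pid in self.cwd:
            self.cwd[child_pid] = self.cwd[parent_pid]

    def on_chdir(self, pid, path):
        resolved = self._resolve(pid, path)
        if resolved is not None:
            self.cwd[pid] = resolved

    def on_exit(self, pid):
        self.cwd.pop(pid, None)

    def on_execve(self, pid, filename):
        """Return the full executable path, or None if it cannot be derived."""
        return self._resolve(pid, filename)


tracker = CwdTracker()
tracker.on_chdir(100, "/bin")
print(tracker.on_execve(100, "./ls"))  # prints /bin/ls
```

Keeping this state in usermode instead would reintroduce the ordering problem: if the chdir event arrives after the execve event, the path is resolved against the wrong directory.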

3rd Party Tools

None of these solutions satisfy our requirements, but here they are:

Use ps - this just polls /proc and therefore has the usual race conditions.
Use the eBPF-based execsnoop - this is just a kprobe/kretprobe based solution so it has the same dependency on kernel versions discussed above. Besides, execsnoop doesn’t even expand relative paths so we have gained nothing.
Use the old non-eBPF version of execsnoop - this won’t work either. It is just a simple kprobe.

Future Solutions:

Use the eBPF helper function get_fd_path - this doesn’t yet exist, but once it is added to the kernel it will somewhat help. You’ll still have to get the executable’s FD in a way that doesn’t involve reading from kernel structs.

Closing Notes

None of the APIs covered here are perfect. Here are my recommendations for which solution you should use and when:

If you can, use the audit API via auditd or go-audit. This will log all processes, including short-lived processes, and you’ll get full executable paths without any effort. This solution won’t work if someone is already using the audit API via a different usermode tool than you. In that case, read on.
If you don’t care about full paths and you want a quick, ready-made solution that doesn’t involve writing any code then use execsnoop. This has the disadvantage of requiring kernel headers at runtime.
If you don’t care about full paths and you’re willing to go the extra mile to avoid requiring kernel headers then use one of the tracepoints mentioned above. There are multiple ways that you can connect to those tracepoints and transfer their data to usermode - whether it is via the filesystem interface shown above, via an eBPF program with eBPF maps, or via perf tools. I’ll cover these options in another post. The main thing to remember is this: if you use an eBPF program, make sure it can be statically compiled so that you don’t have the same dependency on kernel headers that you’re trying to avoid. This means you can’t access kernel structs and you can’t use frameworks like BCC which compile eBPF programs at runtime.
If you don’t care about short-lived processes and the previous solutions don’t fit your use-case then use the netlink process connector in conjunction with /proc.

Have I forgotten a solution? Message me on twitter!

¹ There is no such thing as a container or a container ID from the kernel’s perspective. The kernel only knows about cgroups, network namespaces, process namespaces, and other independent kernel APIs which container runtimes like docker happen to implement containerization with. When trying to identify containers via kernel IDs you need a kernel identifier which every container has exactly one of. For docker, cgroup IDs satisfy that requirement.
² In theory user-mode multiplexers like auditd and go-audit can mitigate this issue, but for enterprise solutions you still don’t know if the customer is using a multiplexer, if so which one, and if there are other security solutions present which connect to the audit API directly.
³ Tracepoints are probes that are statically compiled into the kernel at set locations. Each probe can be individually enabled so that it emits notifications when the kernel reaches that probe’s location.
⁴ To obtain this list I ran cat /sys/kernel/tracing/available_events | grep exec and then filtered the output based on a glance at kernel sources.
⁵ kprobes let you extract debug information from almost any kernel location. You can think of them as kernel breakpoints which emit information but don’t stop execution.
⁶ In other words, use tracepoints/kprobes/kretprobes as the hooking mechanism but set an eBPF program to run on the hook instead of the old-school handlers.
⁷ e.g. put a tracepoint on sched_process_exec and use a bounded eBPF loop to walk the dentry chain in bprm->file->f_path.dentry, sending it to usermode one piece at a time via a perf ring buffer.