Everyone knows how to track which processes run on Linux, but almost no one tracks them accurately. In fact, all of the methods listed in this post have some deficiency or another. Let’s define the requirements:
- All processes should be logged including short-lived processes
- We should know the full executable path of every process that runs
- Within reason, we shouldn’t need to modify or recompile our code for different kernel versions
- Bonus: If the host is a Kubernetes node or runs docker, then we should be able to determine which pod/container a process belongs to. To do so, it is often sufficient to know a process’ cgroup ID.1
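As a rough illustration of the bonus requirement, here is a usermode sketch that pulls a container ID out of cgroup membership lines in the format of /proc/<pid>/cgroup. The function name container_id_from_cgroup is made up for this example, and the sketch assumes the container runtime embeds the full 64-hex-digit container ID in the cgroup path, which is common for docker but not guaranteed. Since it reads from /proc, it has the same short-lived-process race discussed later in this post.

```shell
# Print the first 64-hex-digit container ID found in cgroup membership
# lines read from stdin (the format of /proc/<pid>/cgroup).
# Prints nothing and returns non-zero if no such ID is present.
# ASSUMPTION: the runtime puts the full container ID in the cgroup path,
# e.g. ".../docker/<id>" or ".../docker-<id>.scope".
container_id_from_cgroup() {
    grep -oE '[0-9a-f]{64}' | head -n 1 | grep .
}

# Example (pid is a placeholder for the process you care about):
#   container_id_from_cgroup < "/proc/$pid/cgroup"
```

A process that is not in a container will typically produce no output here, which the caller can detect from the exit status.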
Let’s look at common Linux APIs that can solve this problem. For simplicity’s sake, we’ll focus on detecting execve syscalls. A full solution will also need to monitor clone syscalls and their variants.
Simple Usermode Solutions
- Poll /proc. This is no good because it will miss short-lived processes.
- Use the netlink process connector. The connector will deliver notifications for short-lived processes, but the notifications only include numerical data like the process’ pid, without data like the executable path. Therefore, you’re back to reading data from /proc and have the same race condition for short-lived processes. If you use the netlink process connector, you should be aware of a bug that causes events to disappear and of how to interpret strange-looking ppids in the output.
- Use the Linux audit API. This is the best solution out there. The audit API exists in all modern kernels, provides full executable paths, and won’t miss short-lived processes. There are only two major disadvantages. First, only one usermode program can add rules to the kernel audit API at a time. This is a pain if you are developing an enterprise security solution and have customers who use the audit API themselves via osquery.2 Second, the audit API isn’t container aware, despite years of kernel mailing list discussions on fixing the issue. I’ve written more about the difficulties I encountered using the Linux audit API here.
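To make the /proc race concrete, here is a minimal sketch of what a poller does on each pass. The function name snapshot_procs is made up for illustration. Anything that starts and exits between two calls never shows up in either snapshot, and a process can also exit between the directory listing and the readlink, which is why failures are silently skipped.

```shell
# Print "<pid> <exe path>" for every process visible in /proc right now.
snapshot_procs() {
    for dir in /proc/[0-9]*; do
        # readlink fails for kernel threads, for processes we lack
        # permission to inspect, and for processes that exited after
        # the glob was expanded; skip all of those.
        exe=$(readlink "$dir/exe" 2>/dev/null) || continue
        printf '%s %s\n' "${dir#/proc/}" "$exe"
    done
}

# Example: take one snapshot.
#   snapshot_procs
```

A real poller would call this in a loop and diff consecutive snapshots, but no polling interval is short enough to catch every short-lived process.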
Simple Kernel Debugging Solutions
These solutions all involve a single kernel probe of various types.
- Use tracepoints.3 The kernel contains several relevant tracepoints which execute at different points in the execve syscall. They are: sched_process_exec, open_exec, sys_enter_execve, sys_exit_execve.4 These tracepoints are better than the previous solutions because they will track short-lived processes, but none of these tracepoints provide an executable’s full path when the parameter to exec is a relative path. In other words, if the user runs cd /bin && ./ls then the path will be reported as ./ls rather than /bin/ls. Here is a simple demonstration:
    # enable the sched_process_exec tracepoint
    sudo -s
    cd /sys/kernel/debug/tracing
    echo 1 > events/sched/sched_process_exec/enable

    # run ls via a relative path
    cd /bin && ./ls

    # fetch data from the sched_process_exec tracepoint
    # note that we don't see the full path
    cd -
    cat trace | grep ls

    # disable the tracepoint
    echo 0 > events/sched/sched_process_exec/enable
- Use kprobes/kretprobes.5 Unlike tracepoints, there are many, many possible functions where you can insert a kprobe which will be hit during an execve syscall. However, I can’t find a single function in execve’s callgraph which has as function parameters both the process’ PID and the full path of the executable. Therefore we have the same issue with relative paths as the tracepoint solution. There are some clever hacks you can do here (after all, kprobes can read data from the kernel’s callstack), but these solutions won’t be stable across kernel versions, so I’m ruling them out.
- Use eBPF programs with tracepoints/kprobes/kretprobes.6 This opens up some new options. Now we can run arbitrary code in the kernel every time that the execve syscall runs. In theory, this should let us extract any information we want from the kernel and send it to usermode. There are two ways of obtaining such data and neither meets our requirements:
  - Read data from kernel structs like linux_binprm. We can indeed fetch the executable’s full path this way,7 but reading from kernel structs will make us dependent on kernel versions. Our eBPF program needs to know the offsets of struct members, so it has to be compiled with kernel headers for each kernel version. This is typically solved by compiling the eBPF program at runtime, but that brings its own issues, like a requirement that you have kernel headers available on every machine.
  - Use eBPF helper functions to fetch data from the kernel. This is compatible across all kernel versions that contain the helper you use. In this method you never access kernel structs directly; rather, you use helper APIs to fetch data. There is only one problem: there is no eBPF helper function which can obtain the executable’s full path. (However, in recent kernel versions, there is an eBPF helper function to get the cgroup ID, which is useful for mapping processes to containers.)
- Use LD_PRELOAD on every running executable and hook exec calls in libc. Seriously, don’t do this. It won’t work for statically compiled executables, is easy for malicious code to bypass, and is fairly intrusive.
- Use tracepoints on chdir to track not only the creation of all processes but also their current working directory. For each execve, look up the process’ working directory and combine that with execve’s parameter to obtain a full path. If you do this, make sure you use eBPF maps and put all the logic into your eBPF programs to avoid race conditions where events arrive in usermode in the wrong order.
- Use ptrace-based solutions. These are too intrusive for production code. However, if you do go this route, then use ptrace + seccomp and the SECCOMP_RET_TRACE flag. Then seccomp can intercept all execve syscalls in the kernel and pass them to a usermode debugger which can log the execve call before telling seccomp to continue with the
- Use AppArmor. You can write an AppArmor profile which forbids a process from executing any other executables. If you put that profile in complain mode, then AppArmor won’t actually prevent process execution; it will only issue alerts when the profile is violated. If we attach our profile to every running process, then we will have a working but very ugly and hackish solution. You probably shouldn’t do this.
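The chdir-tracking hack above ultimately has to join a tracked working directory with the path that was passed to execve. Here is a minimal sketch of that join, written as a plain shell function rather than eBPF so the logic is easy to read. The name resolve_exec_path is hypothetical, and the normalization is purely lexical: it collapses "." and ".." components but does not resolve symlinks the way the kernel would, so treat the result as an approximation.

```shell
# Combine a process' working directory with the path given to execve
# and print a normalized absolute path.
# ASSUMPTION: lexical normalization only; symlinks are NOT resolved.
# The body runs in a subshell so IFS and globbing changes do not leak.
resolve_exec_path() (
    cwd=$1
    arg=$2
    case $arg in
        /*) path=$arg ;;        # already absolute: cwd is irrelevant
        *)  path=$cwd/$arg ;;   # relative: interpret against cwd
    esac
    out=
    IFS=/
    set -f                      # no glob expansion while splitting
    for part in $path; do
        case $part in
            ''|.) ;;                    # empty or "." adds nothing
            ..)   out=${out%/*} ;;      # ".." drops the last component
            *)    out=$out/$part ;;
        esac
    done
    printf '%s\n' "${out:-/}"
)

# Example: the relative-path case from the tracepoint demonstration.
#   resolve_exec_path /bin ./ls
```

Doing this join in usermode reintroduces the ordering problem mentioned above, which is why the bullet recommends keeping the working directory in an eBPF map and doing the lookup in the kernel.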
3rd Party Tools
None of these solutions satisfy our requirements, but here they are:
- ps - this just polls /proc and therefore has the usual race conditions
- Use the eBPF-based execsnoop - this is just a kprobe/kretprobe based solution so it has the same dependency on kernel versions discussed above. Besides, execsnoop doesn’t even expand relative paths so we have gained nothing.
- Use the old non-eBPF version of execsnoop - this won’t work either. It is just a simple kprobe.
- Use the eBPF helper function get_fd_path - this doesn’t yet exist, but once it is added to the kernel it will somewhat help. You’ll still have to get the executable’s FD in a way that doesn’t involve reading from kernel structs.
None of the APIs covered here are perfect. Here are my recommendations for which solution you should use and when:
- If you can, use the audit API via auditd or go-audit. This will log all processes, including short-lived processes, and you’ll get full executable paths without any effort. This solution won’t work if someone is already using the audit API via a different usermode tool than you. In that case, read on.
- If you don’t care about full paths and you want a quick, ready-made solution that doesn’t involve writing any code, then use execsnoop. This has the disadvantage of requiring kernel headers at runtime.
- If you don’t care about full paths and you’re willing to go the extra mile to avoid requiring kernel headers, then use one of the tracepoints mentioned above. There are multiple ways that you can connect to those tracepoints and transfer their data to usermode, whether it is via the filesystem interface shown above, via an eBPF program with eBPF maps, or via perf tools. I’ll cover these options in another post. The main thing to remember is this: if you use an eBPF program, make sure it can be statically compiled so that you don’t have the same dependency on kernel headers that you’re trying to avoid. This means you can’t access kernel structs and you can’t use frameworks like BCC which compile eBPF programs at runtime.
- If you don’t care about short-lived processes and the previous solutions don’t fit your use-case then use the netlink process connector in conjunction with
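For the tracepoint route via the filesystem interface, the raw trace output still has to be turned into something usable. Here is a minimal sketch that extracts the pid and filename from sched_process_exec lines. The function name parse_exec_events is made up, and the sketch assumes the event text has the form "filename=<path> pid=<n> old_pid=<n>", so check the format on your own kernel before relying on it.

```shell
# Read ftrace lines on stdin and print "<pid> <filename>" for each
# sched_process_exec event; all other lines are ignored.
# ASSUMPTION: the event text is "filename=<path> pid=<n> old_pid=<n>".
parse_exec_events() {
    sed -n 's/^.*sched_process_exec: filename=\(.*\) pid=\([0-9][0-9]*\) old_pid=[0-9][0-9]*.*$/\2 \1/p'
}

# Example (as root, with the tracepoint enabled as shown earlier):
#   parse_exec_events < /sys/kernel/debug/tracing/trace
```

The filename printed here is still whatever was passed to exec, so relative paths stay relative.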
Have I forgotten a solution? Message me on twitter!
1. There is no such thing as a container or a container ID from the kernel’s perspective. The kernel only knows about cgroups, network namespaces, process namespaces, and other independent kernel APIs which container runtimes like docker happen to implement containerization with. When trying to identify containers via kernel IDs, you need a kernel identifier which every container has exactly one of. For docker, cgroup IDs satisfy that requirement.
2. In theory, user-mode multiplexers like auditd and go-audit can mitigate this issue, but for enterprise solutions you still don’t know if the customer is using a multiplexer, if so which one, and if there are other security solutions present which connect to the audit API directly.
3. Tracepoints are probes that are statically compiled into the kernel at set locations. Each probe can be individually enabled so that it emits notifications when the kernel reaches that probe’s location.
4. To obtain this list I ran cat /sys/kernel/tracing/available_events | grep exec and then filtered the output based on a glance at kernel sources.
5. kprobes let you extract debug information from almost any kernel location. You can think of them as kernel breakpoints which emit information but don’t stop execution.
6. In other words, use tracepoints/kprobes/kretprobes as the hooking mechanism but set an eBPF program to run on the hook instead of the old-school handlers.
7. e.g. put a tracepoint on sched_process_exec and use a bounded eBPF loop to walk the dentry chain in bprm->file->f_path.dentry, sending it to usermode one piece at a time via a perf ring buffer.
2020-07-02 18:24 +0000