I recently wrote that the audit API is the best way to track process lifecycle on Linux for security purposes. It turns out there are several difficulties I underestimated:
- Containers: the audit framework doesn’t (yet) track containers. If you could easily track process hierarchies this wouldn’t be a big deal because you could keep track of which processes are in which containers yourself. However, as I’ll show, process hierarchies are difficult to track.
- Family relationships are complicated: It is tricky to track parent/child relationships [1] and probably impossible to know which fork/clone in the parent led to the creation of which child process. The issue is that on forks/clones the parent’s pid is reported in the root pid namespace, but the child’s pid is reported as the parent sees it, in the parent’s own pid namespace [2]. You can work around this by ignoring the child’s pid in fork/clone syscalls and by looking at the ppid field in subsequent syscalls, but that is imperfect and tricky. [3]
- Events can be re-ordered: When using `-a exit` rules, forks can be received by usermode in a different order than they occurred - this is annoying but surmountable with better usermode logic. The reason this happens is that `-a exit` rules send audit events to usermode when a syscall finishes running. However, if process A forks into B and B forks into C, then sometimes the Linux scheduler will run B and even C before A returns from the fork/clone syscall back to usermode - and therefore before the first fork can be reported.
- Threads make life complicated: Audit tracks only processes, not threads - this means you can’t tell when a process dies based on the audit API alone, because you can’t differentiate between thread-death and process-death. [4]
- One for all and all for many: There can only be one process controlling the audit API at a time - although I was pleasantly surprised that you can open a little-documented multicast socket for the audit API and receive audit events in multiple usermode processes. All rules have to be added via the “primary” audit controller process, and the events generated by any rules you add via that process will be received by all listening processes.
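As a rough illustration of the multicast socket mentioned in the last item, here is a minimal Python sketch of a read-only listener. The constants `NETLINK_AUDIT = 9` and `AUDIT_NLGRP_READLOG = 1` come from the kernel's netlink and audit headers; the function names `parse_netlink` and `listen` are made up for this example. It assumes a Linux host, a process holding `CAP_AUDIT_READ`, and that each multicast message carries a standard netlink header followed by the record text, so treat it as a starting point rather than a finished client.

```python
import socket
import struct

NETLINK_AUDIT = 9          # netlink protocol number for the audit subsystem
AUDIT_NLGRP_READLOG = 1    # multicast group that mirrors the audit log
NLMSG_HDR = struct.Struct("=IHHII")  # nlmsghdr: len, type, flags, seq, pid


def parse_netlink(buf):
    """Split a receive buffer into (message_type, payload_text) pairs."""
    messages = []
    offset = 0
    while offset + NLMSG_HDR.size <= len(buf):
        length, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, offset)
        if length < NLMSG_HDR.size or offset + length > len(buf):
            break  # truncated or malformed message, stop parsing
        payload = buf[offset + NLMSG_HDR.size:offset + length]
        text = payload.rstrip(b"\x00\n").decode("utf-8", "replace")
        messages.append((msg_type, text))
        offset += (length + 3) & ~3  # netlink messages are 4-byte aligned
    return messages


def listen():
    """Print audit records as a read-only multicast listener (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_AUDIT)
    # A netlink address is (port id, group bitmask); group N maps to bit N-1.
    sock.bind((0, 1 << (AUDIT_NLGRP_READLOG - 1)))
    while True:
        for msg_type, text in parse_netlink(sock.recv(65536)):
            print(msg_type, text)
```

Calling `listen()` from several processes at once should give each of them its own copy of the event stream. None of them can add or remove rules this way, which still has to go through the primary controller.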
The audit API is still great, and despite these limitations I’ve used it to build process monitoring systems that are now in production. However, once BPF CO-RE arrives I’m switching over to BPF. As for audit, all the problems above could be solved by adding container ids (this is eventually going to happen) and/or by accompanying the SYSCALL records from fork/clone/exit with additional FORK/EXIT records that carry the missing information. [5]
Work with me
Does this sort of thing interest you?
I started as a low-level engineer, but today I’m the co-founder and CEO of Robusta.dev and we’re hiring! I still do the occasional deep technical dive, as well as building a world-class team of excellent engineers and a product used by hundreds of companies.
If you join our team, you’ll work closely with me and be a core part of the founding team. Email [email protected] and mention you came from this post. We’re hiring in Israel as well as remote.
[1] In the real world too, as demonstrated by the famous song that is surprisingly relevant to Linux process family trees.
[2] The real issue here is that the audit framework includes a special event for execs but it doesn’t have a similar event for forks/clones. Such an event is seemingly unnecessary because you can add a rule for the fork/clone syscall itself like `-a exit,always -F arch=b64 -S fork -k fork_rule`. With such a rule, the child’s pid is seemingly available via the generic exit field (the syscall’s return code) and the parent’s pid is available via the generic pid field. However, this doesn’t work in Kubernetes clusters (or other systems which use pid namespaces) because the exit field shows the same return value that the parent process sees, which is a pid in the parent’s pid namespace. On the other hand, the generic pid field is in the root pid namespace, as is the pid field in all future syscalls by the child.
[3] For starters, the ppid field in a subsequent syscall is equal to the parent at the time of that later syscall, which isn’t necessarily the process that called fork/clone (e.g. due to reparenting after the parent’s death, clone calls which create threads, and other odd cases). Furthermore, if process A forks into B which immediately forks into C, then you need to look at the pid and ppid fields on the intermediate fork in order to properly attach C to the hierarchy, but you should ignore the exit code of that intermediate fork. (There are three pids reported on fork/clone: the pid field, which is the parent who called fork; the ppid field, which is usually the grandparent; and the exit field, which is the child’s pid in the parent’s pid namespace.)
[4] The only way to track process death is via exit/exit_group syscalls and the audit event for abnormal process termination (e.g. kill -9). However, every time that a thread ends, exit is called and there is no trivial way to determine whether that was the last thread in the process or not. Tracking process death is important because pids are eventually recycled and, as discussed above, you can’t properly figure out when specific pids are created based on fork/clone syscalls.
[5] There is a precedent for this - when execve is audited, a special EXECVE record is issued in addition to the SYSCALL record.
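To make the ppid-based workaround described above more concrete, here is a small Python sketch that rebuilds a process hierarchy from SYSCALL records. It assumes records arrive as auditd-style text lines (`type=SYSCALL msg=audit(...): ... ppid=... pid=...`). `parse_record` and `ProcessTree` are hypothetical names invented for this example, not part of any audit library. The sketch only reads the `pid` and `ppid` fields and deliberately never trusts `exit`, so it does not care in which order the fork events reach usermode.

```python
import re

# Matches key=value pairs; values are either quoted strings or bare tokens.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_record(line):
    """Turn one audit log line into a dict of its key=value fields."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


class ProcessTree:
    """Tracks parents by root-namespace pid, using only the pid/ppid fields."""

    def __init__(self):
        self.parent_of = {}

    def observe(self, record):
        if record.get("type") != "SYSCALL":
            return
        pid, ppid = int(record["pid"]), int(record["ppid"])
        # Deliberately ignore record["exit"]: on fork/clone it holds the child's
        # pid as seen from the parent's pid namespace, not the root namespace.
        # Keep the first parent seen, since ppid can change after reparenting.
        self.parent_of.setdefault(pid, ppid)

    def ancestors(self, pid):
        """Return the known chain of parents, nearest first."""
        chain = []
        while pid in self.parent_of and self.parent_of[pid] not in chain:
            pid = self.parent_of[pid]
            chain.append(pid)
        return chain
```

Feeding every SYSCALL record through `observe` and then calling `ancestors(pid)` gives the known lineage of a process. The sketch inherits the limitations already discussed: a process that is reparented before its first audited syscall is attached to the wrong parent, threads are not distinguished from processes, and recycled pids are not handled.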