Here is a bug that everyone who uses BPF to filter packets on Linux eventually encounters: you create a socket, attach a BPF filter with setsockopt, and then read from the socket with recv. Occasionally recv hands you a packet that does not match the filter. The bug only shows up when there is a lot of traffic, and even then only right after the application starts. What is happening?

Here is some code which suffers from this issue. Error-handling has been removed for conciseness:

int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

struct sock_filter bpf_bytecode[] = { ... }; // bytecode generated by hand or using "tcpdump -dd"
struct sock_fprog bpf_program = { sizeof(bpf_bytecode) / sizeof(bpf_bytecode[0]), bpf_bytecode};
int err = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf_program, sizeof(bpf_program));

char buffer[MAX_PACKET_SIZE];
int n = recv(sock, buffer, sizeof(buffer), 0); // buffer will sometimes contain a packet that doesn't match the BPF

The bug here is deceptively simple: packets are filtered when the kernel receives them, not when user-mode code reads them with recv. Packets that don’t match the BPF can therefore arrive after the socket is created and before setsockopt is called. Those packets remain in the socket’s buffer even after the BPF is applied, and are later handed to the application by recv.

Here are two wrong ways to fix this:

  1. After applying the BPF, use recv in a loop to discard packets from the socket until it is empty. This typically works, but it breaks down if packets arrive faster than you can discard them (e.g. if you’re sniffing on a very high-traffic machine where your app routinely loses packets when the socket’s buffer fills, and that loss doesn’t matter for your use case). In that case your attempt to empty the socket turns into an infinite loop.
  2. Add user-mode checks that duplicate the BPF’s logic and check all packets after receiving them. Duplicate logic leads to bugs. Enough said. [1]

Now let’s look at the textbook solution for this bug, the same one that libpcap uses:

struct sock_filter zero_bytecode = BPF_STMT(BPF_RET | BPF_K, 0);
struct sock_fprog zero_program = { 1, &zero_bytecode};

if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &zero_program, sizeof(zero_program)) < 0) {
	printf("error attaching zero bpf: %d\n", errno);
	return 1;
}

char drain[1];
while (1) {
	int bytes = recv(sock, drain, sizeof(drain), MSG_DONTWAIT);
	if (bytes == -1) {
		if (errno != EAGAIN && errno != EWOULDBLOCK) {
			printf("error draining socket: %d\n", errno);
			return 1;
		}
		// EAGAIN/EWOULDBLOCK means there is nothing left to read from the socket, which is exactly what we want
		break;
	}
}

// bpf_program is the actual bpf program we want to apply - just like in the previous example
int err = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf_program, sizeof(bpf_program));
char buffer[MAX_PACKET_SIZE];
int n = recv(sock, buffer, sizeof(buffer), 0); // buffer will now always match the BPF

What we’re doing here is taking advantage of the fact that swapping one BPF out for another is an atomic operation: if you swap one BPF for another then at every moment either the first is in place or the second but never neither. We therefore start by applying the so-called “zero-BPF” which is a BPF that matches no packets. Then we empty out any packets that arrived before the “zero-BPF” filter was applied. At this point the socket is definitely empty and it can’t fill up with junk because the zero-BPF is in place. Then we replace the zero-BPF with the real BPF we want. Because the swap is atomic we know that any packet in the socket after that must match the real BPF.

Simple.


  1. This seems like a theoretical reason, but I’ve seen it happen again and again in production code. Let’s say you have a bug such that your user-mode checks are too loose and allow mismatching packets, but the BPF is appropriately restrictive. You could test your app with lots of mismatching packets and never notice that the user-mode checks are too loose, because the BPF filters out all the mismatching packets anyway. That is, you would never notice until you deploy to a high-traffic machine in production.