Sep 8, 2020

Using the BPF ring buffer

Update March 30th, 2021: This article is still relevant if you are looking for a practical example of how to use the BPF ring buffer. If you want a deep explanation of how it works, I suggest visiting the blog of Andrii, the main author of this feature, here. Enjoy the learning! :)

Introduction

The 5.8 release of the Linux Kernel came out with lots of interesting elements. Yes, as always.

A couple of weeks ago, while still processing all the news in there, I came across a patch proposing a new BPF map type called BPF_MAP_TYPE_RINGBUF. By using this new map type we finally have an MPSC (multi producer single consumer) data structure optimized for data buffering and streaming.

Some exciting things about it:

Unlike BPF_MAP_TYPE_PERF_EVENT_ARRAY, this type of map is not tied to a specific CPU when dealing with the output: a single buffer can be shared by all CPUs. This is very important for me and I’m already experimenting with this in the Falco BPF driver.
It’s very flexible in letting the user decide what kind of memory allocation model they want to use, by reserving beforehand or not.
It is observable using bpf_ringbuf_query, which answers various queries about its state. This could be useful to feed a Prometheus exporter to monitor the health of the buffer.
Producers do not block each other, even on different CPUs.
A spinlock is used internally to serialize reservations, which are also ordered, while commits are completely lock free. This is very cool, because locking comes for free: there is no need to use bpf_spin_lock or to manage it yourself.
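
To make the "ordered reservations, lock-free commits" point more concrete, here is a toy model in plain Python. It is only a sketch of the semantics as I understand them, not the kernel implementation: the class name ToyRingBuf and its methods are made up for illustration. The idea is that records are handed to the consumer in reservation order, and the consumer stops at the first record that has not been committed yet.

```python
from collections import deque


class ToyRingBuf:
    """Toy model: reservations are ordered, commits can happen in any order."""

    def __init__(self):
        self._records = deque()  # records kept in reservation order

    def reserve(self, producer):
        # In the kernel this is the step serialized by the internal spinlock.
        rec = {"producer": producer, "data": None, "busy": True}
        self._records.append(rec)
        return rec

    def submit(self, rec, data):
        # Committing only clears the busy flag of an already reserved record.
        rec["data"] = data
        rec["busy"] = False

    def consume(self):
        # The consumer stops at the first record that is still busy.
        out = []
        while self._records and not self._records[0]["busy"]:
            out.append(self._records.popleft()["data"])
        return out


rb = ToyRingBuf()
a = rb.reserve("cpu0")
b = rb.reserve("cpu1")
rb.submit(b, "second")  # cpu1 commits first
print(rb.consume())     # [] because the earlier reservation is still pending
rb.submit(a, "first")
print(rb.consume())     # ['first', 'second']
```

The second producer never waits for the first one to commit; only the consumer has to wait for the earlier record.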

The patch author did a very good job at explaining all the reasons why the change was needed, so I will not go that way with this post. Instead, I want to write about how to actually make use of this new feature.

Motivation for this post

Finding good resources on new BPF features is very hard. The subsystem maintainers are doing a ginormous amount of work on it, and documenting every single bit is very difficult.

Moreover, this new feature is just another map interface, so essentially it can be used like the others. However, I felt that others could benefit from my research on this new feature, so I put together this writeup while I was experimenting with it.

Note on helpers

For every functionality it exposes, the BPF subsystem exposes a helper.

The helper lets you interact with the part of the subsystem that implements the feature you are invoking.

The purpose of the Linux Kernel is not to give you the helper definitions or a library, so your system will normally not ship with a header that you can include to get your hands on the function definitions for the helpers.

The idea is that you will write the definitions yourself when you want to use a specific helper, e.g.:

static void *(*bpf_ringbuf_reserve)(void *ringbuf, __u64 size, __u64 flags) =
  (void *)BPF_FUNC_ringbuf_reserve;

The patch adds 5 new BPF helpers:

long bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags);
void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags);
void bpf_ringbuf_submit(void *data, u64 flags);
void bpf_ringbuf_discard(void *data, u64 flags);
u64 bpf_ringbuf_query(void *ringbuf, u64 flags);

You can look at a complete list of all the BPF helpers at bpf-helpers(7).
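
Since bpf_ringbuf_query is the helper that makes the buffer observable, here is a small sketch of what could be done with the values it returns. The flag constants mirror the enum in linux/bpf.h; the function ringbuf_fill_ratio and the sample numbers are hypothetical and only there for illustration. Keep in mind that the helper returns a momentary snapshot, so this is good for reporting and heuristics, not for exact accounting.

```python
# Flag values accepted by bpf_ringbuf_query, mirroring the enum in linux/bpf.h.
BPF_RB_AVAIL_DATA = 0  # bytes of data not yet consumed
BPF_RB_RING_SIZE = 1   # size of the ring buffer in bytes
BPF_RB_CONS_POS = 2    # consumer position
BPF_RB_PROD_POS = 3    # producer position


def ringbuf_fill_ratio(avail_data: int, ring_size: int) -> float:
    """Fraction of the buffer holding data that was not consumed yet."""
    if ring_size <= 0:
        raise ValueError("ring_size must be positive")
    return avail_data / ring_size


# Illustrative numbers only: pretend these came from two query calls,
# one with BPF_RB_AVAIL_DATA and one with BPF_RB_RING_SIZE.
ring_size = 4096 * 64
avail_data = 65536
print(f"{ringbuf_fill_ratio(avail_data, ring_size):.2f}")  # 0.25
```

A ratio that stays close to 1.0 would suggest the consumer is not keeping up with the producers.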

With these premises, and to keep things simple, I decided to show two different usage examples of the new feature using libbpf and BCC.

It would be impractical for me to show you how to use the functionality in a raw way, writing ourselves all the helper definitions needed for the BPF features we use.

A very good explanation of BPF helpers can be found at ebpf.io.

Using libbpf

Fortunately, the kernel provides a complete API that does all the work of exporting the helpers for us.

If you look around for libbpf, it has two homes:

The original copy resides in the Linux kernel under tools/lib/bpf.
The out-of-tree mirror at github.com/libbpf/libbpf.

To follow the example here, first go to the libbpf repository and follow the instructions to install it. The ring buffer support was added in v0.0.9. Also, make sure to have a kernel >= 5.8.

Here is how the BPF program works:

The program itself is very simple: we attach to the tracepoint that gets hit every time an execve syscall is made.

The interesting part here for BPF_MAP_TYPE_RINGBUF is the initialization of the map with bpf_map_def. This type of map does not want the .key_size and .value_size fields, and for the .max_entries value the patch says it wants a power of two. That is not the whole story: the value also needs to be page aligned with the current page shift size. In the current asm_generic/page.h here it’s defined as 1 << 12, so any power of two that is a multiple of 4096 will be ok.
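
To double check a candidate size before loading the program, the two conditions can be sketched in a few lines of Python. This is only my reading of the rule (a power of two that is also page aligned), assuming 4 KiB pages; the function name is_valid_ringbuf_size is made up for this example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, i.e. 1 << 12


def is_valid_ringbuf_size(max_entries: int, page_size: int = PAGE_SIZE) -> bool:
    """Check the two constraints on .max_entries for a ring buffer map."""
    is_power_of_two = max_entries > 0 and (max_entries & (max_entries - 1)) == 0
    is_page_aligned = max_entries % page_size == 0
    return is_power_of_two and is_page_aligned


for size in (4096 * 64, 4096 * 3, 2048):
    print(size, is_valid_ringbuf_size(size))
# 262144 True   (power of two and page aligned)
# 12288 False   (page aligned, but not a power of two)
# 2048 False    (power of two, but smaller than a page)
```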

Once the map is initialized, look at what we do in our tracepoint. There are two ringbuf-specific calls:

bpf_ringbuf_reserve reserves space in the buffer for the record; this is the only time locking is done
bpf_ringbuf_submit commits the reserved record, making it visible to the consumer; this is lock free

#include <linux/types.h>

#include <bpf/bpf_helpers.h>
#include <linux/bpf.h>

struct event {
  __u32 pid;
  char filename[16];
};

struct bpf_map_def SEC("maps") buffer = {
    .type = BPF_MAP_TYPE_RINGBUF,
    .max_entries = 4096 * 64,
};

struct trace_entry {
  short unsigned int type;
  unsigned char flags;
  unsigned char preempt_count;
  int pid;
};

struct trace_event_raw_sys_enter {
  struct trace_entry ent;
  long int id;
  long unsigned int args[6];
  char __data[0];
};


SEC("tracepoint/syscalls/sys_enter_execve")
int sys_enter_execve(struct trace_event_raw_sys_enter *ctx) {
  __u32 pid = bpf_get_current_pid_tgid() >> 32;
  struct event *event = bpf_ringbuf_reserve(&buffer, sizeof(struct event), 0);
  if (!event) {
    return 1;
  }
  event->pid = pid;
  bpf_probe_read_user_str(event->filename, sizeof(event->filename),
                          (const char *)ctx->args[0]);

  bpf_ringbuf_submit(event, 0);

  return 0;
}

char _license[] SEC("license") = "GPL";

Now save this source in a file called program.c if you want to try it later.

Loading the program would be impossible without a loader.

Besides all the boilerplate needed to load the program and attach the tracepoint, there are some interesting things for the ringbuf use case here too:

The buf_process_sample callback gets called every time a new element is read from the ring buffer
The ring buffer is read using ring_buffer__consume

#include <bpf/libbpf.h>
#include <stdio.h>
#include <unistd.h>

struct event {
  __u32 pid;
  char filename[16];
};

static int buf_process_sample(void *ctx, void *data, size_t len) {
  struct event *evt = (struct event *)data;
  printf("%u %s\n", evt->pid, evt->filename);

  return 0;
}

int main(int argc, char *argv[]) {
  const char *file = "program.o";
  struct bpf_object *obj;
  int prog_fd = -1;
  int buffer_map_fd = -1;
  struct  bpf_program *prog;

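  /* Bail out early: obj is not usable if loading fails. */
  if (bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd)) {
    fprintf(stderr, "failed to load %s\n", file);
    return 1;
  }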

  buffer_map_fd = bpf_object__find_map_fd_by_name(obj, "buffer");

  struct ring_buffer *ring_buffer;
 
  ring_buffer = ring_buffer__new(buffer_map_fd, buf_process_sample, NULL, NULL);

  if(!ring_buffer) {
    fprintf(stderr, "failed to create ring buffer\n");
    return 1;
  }

  prog = bpf_object__find_program_by_title(obj, "tracepoint/syscalls/sys_enter_execve");
  if (!prog) {
    fprintf(stderr, "failed to find tracepoint\n");
    return 1;
  }

  bpf_program__attach_tracepoint(prog, "syscalls", "sys_enter_execve");

  while(1) {
    ring_buffer__consume(ring_buffer);
    sleep(1);
  }

  return 0;
}

Now save this source in a file called loader.c if you want to try it later.

It required quite some code to just showcase the ringbuf related functions. Sorry for the big wall of code!

Now we can proceed, compile and run it.

In the folder where you saved program.c and loader.c:

Compile the program:

clang -O2 -target bpf -g -c program.c # -g generates BTF type information

Compile the loader:

gcc -g loader.c -lbpf

You can now run it via:

sudo ./a.out

It will produce something similar to this:

393811 /bin/zsh
393812 /usr/bin/env
393812 /usr/local/bin/
393812 /usr/local/sbin
393812 /usr/bin/zsh
393816 /usr/bin/ls
393818 /usr/bin/git
393819 /usr/bin/awk
393824 /usr/bin/git
393825 /usr/bin/git
393826 /usr/bin/git
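
Notice that some paths in the output look cut off. That is expected: the filename field is only 16 bytes long, and bpf_probe_read_user_str copies at most size - 1 bytes and always terminates the result with a NUL byte. Here is a tiny Python imitation of that rule; the function read_user_str and the sample path are hypothetical and only mimic the truncation, not the helper itself.

```python
def read_user_str(src: bytes, size: int) -> bytes:
    """Mimic the truncation rule: at most size - 1 bytes plus a trailing NUL."""
    string = src.split(b"\0", 1)[0]  # stop at the first NUL, like a C string
    return string[: size - 1] + b"\0"


# Hypothetical path, longer than the 16 byte filename field.
raw = read_user_str(b"/usr/local/bin/some-tool\0", 16)
print(len(raw))                       # 16
print(raw.rstrip(b"\0").decode())     # /usr/local/bin/
```

If you need full paths, make the filename field in struct event bigger, on both the BPF side and the loader side.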

If you followed my suggestion and left the -g flag in the clang command while compiling the program, congrats, the object file now carries BTF type information, the foundation of BPF CO-RE (Compile Once, Run Everywhere) programs.

Yes, you can move it to another machine with a 5.8 or newer kernel and it will work. The next step is to compile the loader statically so it can be moved together with the program. This is left to the reader :)

Using BCC

This paragraph is about doing the same thing we did with libbpf but with BCC.

BCC added the support for the BPF ring buffer almost immediately by adding the helper definitions and by implementing the Python API support.

To make this work you will need to be on a kernel >= 5.8 and have at least BCC 0.16.0. If you need to learn how to install BCC they have a very good resource here.

Here’s the Python code, comments below:

#!/usr/bin/python3

import sys
import time

from bcc import BPF

src = r"""
BPF_RINGBUF_OUTPUT(buffer, 1 << 4);

struct event {
    u32 pid;
    char filename[16];
};

TRACEPOINT_PROBE(syscalls, sys_enter_execve) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    struct event *event = buffer.ringbuf_reserve(sizeof(struct event));
    if (!event) {
        return 1;
    }
    event->pid = pid;
    bpf_probe_read_user_str(event->filename, sizeof(event->filename), args->filename);

    buffer.ringbuf_submit(event, 0);

    return 0;
}
"""

b = BPF(text=src)

def callback(ctx, data, size):
    event = b['buffer'].event(data)
    print("%-8s %-16s" % (event.pid, event.filename.decode('utf-8')))


my_rb = b['buffer']
my_rb.open_ring_buffer(callback)

print("%-8s %-16s" % ("PID", "FILENAME"))

try:
    while 1:
        b.ring_buffer_poll()
        time.sleep(0.5)
except KeyboardInterrupt:
    sys.exit()

As you can see, we are making use of the BCC macro BPF_RINGBUF_OUTPUT to create a ring buffer named buffer. On the BPF side we call ringbuf_reserve and ringbuf_submit on it to write the events, while on the Python side open_ring_buffer registers the callback and ring_buffer_poll reads the events.
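
A note on the callback: it receives a raw pointer to the sample, and b['buffer'].event(data) turns it into a Python object that matches struct event. Roughly the same decoding can be done by hand with ctypes. The sketch below does not need BCC or root; the Event class and the sample bytes are written by me for illustration, assuming the same layout as struct event above.

```python
import ctypes
import sys


class Event(ctypes.Structure):
    # Hand-written mirror of `struct event` from the BPF program.
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("filename", ctypes.c_char * 16),
    ]


# Made-up sample: a pid followed by a NUL padded, 16 byte filename.
raw = (43679).to_bytes(4, sys.byteorder) + b"/usr/bin/ls".ljust(16, b"\0")

event = Event.from_buffer_copy(raw)
print(ctypes.sizeof(Event))  # 20
print("%-8s %-16s" % (event.pid, event.filename.decode("utf-8")))
```

This also shows why the struct in the BPF program and the one used by the reader must stay in sync: the bytes in the buffer carry no type information.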

If you want to try it, copy the BCC script to a program.py file. You will need to execute it with root permissions:

sudo python3 program.py

The output should be something like:

PID      FILENAME
43674    /bin/zsh
43675    /usr/bin/env
43675    /usr/local/bin/
43675    /usr/local/sbin
43675    /usr/bin/zsh
43678    /usr/bin/dircol
43679    /usr/bin/ls
43681    /usr/bin/git
43682    /usr/bin/awk
43687    /usr/bin/git
43688    /usr/bin/git
43689    /usr/bin/git
43701    /usr/bin/sh
43701    /usr/bin/git

Conclusions

Once again, as with every release, the BPF subsystem is becoming more and more feature complete. This specific feature addresses a keenly felt use case for those (like me) who move a lot of data around using maps.

Thanks to the maintainers and the many contributors for their hard work!