LWN on Linux Kernel and the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit
-
LWN ☛ Custom out-of-memory killers in BPF
The out-of-memory (OOM) killer has long been a scary and controversial part of the Linux kernel. It is summoned from some dark place when the system as a whole (or, more recently, any given control group) is running so low on memory that further allocations are not possible; its job is to kill off processes until a sufficient amount of memory has been freed. Roman Gushchin has found a way to make the OOM killer even scarier: adding the ability to load custom OOM killers in BPF.
The kernel, in its default configuration, will overcommit the memory available on the system; it will allow processes to allocate more memory than can be provided (that is, more than the sum of physical memory and swap space). Applications routinely allocate more memory than they use; limiting allocations to the available memory would, as a result, cause some of that memory to be unused. Overcommitting memory avoids that waste, and it almost always works out in the end.
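The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, not kernel code: the `Process` class, the capacity figure, and the per-process numbers are all hypothetical values chosen for illustration. It compares admitting allocations only while they fit in physical memory plus swap against admitting every request.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    allocated_mb: int  # memory the process has requested
    used_mb: int       # memory the process actually touches


def admit_strict(procs, capacity_mb):
    """Admit processes in order while the sum of allocations fits in capacity."""
    admitted, committed = [], 0
    for p in procs:
        if committed + p.allocated_mb <= capacity_mb:
            admitted.append(p)
            committed += p.allocated_mb
    return admitted


def admit_overcommit(procs):
    """Admit every allocation request, regardless of capacity."""
    return list(procs)


def used(procs):
    """Total memory actually in use by the given processes."""
    return sum(p.used_mb for p in procs)


CAPACITY_MB = 1000  # hypothetical: physical memory plus swap
PROCS = [
    Process("a", allocated_mb=600, used_mb=200),
    Process("b", allocated_mb=500, used_mb=250),
    Process("c", allocated_mb=400, used_mb=300),
]

strict = admit_strict(PROCS, CAPACITY_MB)
over = admit_overcommit(PROCS)

strict_idle = CAPACITY_MB - used(strict)   # memory left unused without overcommit
over_idle = CAPACITY_MB - used(over)       # memory left unused with overcommit
needs_oom_killer = used(over) > CAPACITY_MB

print([p.name for p in strict], strict_idle, over_idle, needs_oom_killer)
```

With these made-up numbers, strict accounting turns away one process and leaves more memory idle, while overcommitting runs all three within capacity. The OOM killer only becomes relevant in the case this example does not hit, where actual usage grows past what the system can supply.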
-
LWN ☛ Injecting speculation barriers into BPF programs
The disclosure of the Spectre class of hardware vulnerabilities created a lot of pain for kernel developers (and many others). That pain was especially acutely felt in the BPF community. While an attacker might have to painfully search the kernel code base for exploitable code, an attacker using BPF can simply write and load their own speculation gadgets, which is a much more efficient way of operating. The BPF community reacted by, among other things, disallowing the loading of programs that may include speculation gadgets. Luis Gerhorst would like to change that situation with this patch series that takes a more direct approach to the problem.
While the potential to enable speculative-execution attacks may be a concern for any BPF program, the problem is especially severe for unprivileged programs — those that can be loaded by ordinary users. Most program types require privilege but there are a couple of packet-filter program types that do not (though the unprivileged_bpf_disabled sysctl knob can disable those types too). Among the many defenses added to the BPF subsystem is this patch by Daniel Borkmann, which was merged for the 5.13 release in 2021. It causes the verifier to treat possible speculative paths (for Spectre variant 1 in particular) as real alternatives when simulating the execution of a program, even though the verifier can demonstrate that such paths will not be taken in non-speculative execution. If the program does something untoward on one of those speculative paths, it will be rejected by the verifier.
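The idea of treating speculative paths as real alternatives can be illustrated with a toy model. This is a sketch under heavy assumptions and bears no relation to the actual verifier code: it models a single program of the form "if idx < size, load array[idx]", tracks only a numeric range for `idx`, and uses the hypothetical helper names `in_bounds` and `verify`.

```python
def in_bounds(lo, hi, size):
    """True if every index in [lo, hi] is a valid index into an array of `size`."""
    return 0 <= lo and hi < size


def verify(idx_range, size, simulate_speculation):
    """Toy check for the program: `if idx < size: load array[idx]`.

    idx_range is the (lo, hi) range the toy verifier knows for idx
    before the bounds check.
    """
    lo, hi = idx_range

    # Architectural path: the load only runs when idx < size, so the
    # upper bound can be refined before the access is checked.
    arch_hi = min(hi, size - 1)
    if lo <= arch_hi and not in_bounds(lo, arch_hi, size):
        return False

    if simulate_speculation:
        # Speculative path: a mispredicted branch may run the load even
        # when idx >= size, so the access is checked with the unrefined range.
        if not in_bounds(lo, hi, size):
            return False

    return True


# idx may be anything in 0..255, the array has 16 entries.
accepted_without = verify((0, 255), 16, simulate_speculation=False)
accepted_with = verify((0, 255), 16, simulate_speculation=True)

# If idx is already known to be in 0..15 (for example, after masking),
# the speculative path is harmless too.
accepted_masked = verify((0, 15), 16, simulate_speculation=True)

print(accepted_without, accepted_with, accepted_masked)
```

In this model, the bounds-checked load is accepted when only the architectural path is simulated and rejected once the mispredicted path is considered as well, which mirrors the behavior the paragraph describes.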
-
LWN ☛ Flexible data placement
At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Kanchan Joshi and Keith Busch led a combined storage and filesystem session on data placement, which concerns how the data on a storage device is actually written. In a discussion that hearkened back to previous summits, the idea is to give hints to enterprise-class SSDs to help them make better choices on where the data should go; hinting was most recently discussed at the summit in 2023. If SSDs can group data with similar lifetimes together, it can lead to longer life for the devices, but there is a need to work out the details.
Joshi began by noting that the logical placement of data provided by the host system is not the same as the physical placement of it on the device. There is a question of where the placement decision is made; if there is a data creator and multiple layers between it and the device (e.g. filesystem, device mapper), it is the piece that is closest to the device that ultimately decides where the data goes, he said. Currently, data is generally written sequentially because there is a single append point in a single open erase block on the device.
-
LWN ☛ Improving FUSE writeback performance
In a combined filesystem and memory-management session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Joanne Koong led a discussion on improving the writeback performance for the Filesystem in Userspace (FUSE) layer. Writeback is how data that is written to the filesystem is actually flushed to the disk; it is the process of writing dirty pages from the page cache to storage. The current FUSE implementation allocates unmovable memory, then copies the dirty data to it before initiating writeback, which is slow; Koong wanted to change that behavior. Since the session, she has posted a patch set that has been applied by FUSE maintainer Miklos Szeredi.
Koong started the session with a description of the current FUSE writeback operation. A temporary page is allocated in the unmovable memory zone for each dirty page and the data is copied to the temporary page. After that, writeback is initiated on the temporary pages and the original pages can immediately have their writeback state cleared. That extra allocation and copying work is expensive, but is needed so that the pages do not move while the writeback operation is underway.
-
LWN ☛ Filtering fanotify events with BPF
Linux systems can have large filesystems; trying to keep up with the stream of fanotify filesystem-monitoring notifications for them can be a struggle. Fanotify is one of a few ways to monitor accesses to filesystems provided by the kernel. Song Liu led a discussion on how to improve in-kernel filtering of fanotify events in a joint session of the filesystem and BPF tracks at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit. He wants to combine the best parts of a few different approaches to efficiently filter filesystem events.
There are two ways to monitor and restrict filesystem actions on Linux, Liu said: fanotify and Linux security modules (LSMs). They both have benefits and drawbacks. The main problem with using LSM hooks to respond to filesystem events is that LSM hooks are global — the LSM must respond to accesses for all files, even if it's only interested in a subset of files. The main problem with fanotify is that notifications are handled in user space, incurring a lot of context switches. The best of both worlds would be to have efficient mask-based filtering for relevant files (like fanotify) and fast in-kernel handling for the more complicated cases (like LSMs).
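The combination described above, a cheap mask test followed by a programmable check for the harder cases, can be sketched in a few lines. This is a user-space toy model only, not the fanotify or BPF API: the `EV_*` bit values, the `Event` class, and the `under_srv` predicate are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical event bits for this sketch.
EV_ACCESS = 0x01
EV_MODIFY = 0x02
EV_OPEN = 0x20


@dataclass(frozen=True)
class Event:
    mask: int   # which kind of event occurred
    path: str   # file the event refers to


def filter_events(
    events: Iterable[Event],
    interest_mask: int,
    predicate: Callable[[Event], bool],
) -> List[Event]:
    """Return only the events that would be delivered to the listener."""
    delivered = []
    for ev in events:
        if not (ev.mask & interest_mask):
            continue  # cheap mask test: event type is not of interest
        if not predicate(ev):
            continue  # programmable filter, the role a BPF program would play
        delivered.append(ev)
    return delivered


def under_srv(ev: Event) -> bool:
    """Example of a more complicated rule: only files below /srv/."""
    return ev.path.startswith("/srv/")


events = [
    Event(EV_OPEN, "/srv/data/a"),
    Event(EV_ACCESS, "/srv/data/a"),
    Event(EV_MODIFY, "/tmp/x"),
    Event(EV_MODIFY, "/srv/b"),
]

delivered = filter_events(events, EV_OPEN | EV_MODIFY, under_srv)
print([ev.path for ev in delivered])
```

Only events that pass both stages reach the listener, so every event dropped by either check is one less notification that has to cross into user space.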
-
LWN ☛ Hash table memory usage and a BPF interpreter bug
Anton Protopopov led a short discussion at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit about the amount of memory used by hash tables in BPF programs. He thinks that the current memory layout is inefficient, and wants to split the structure that holds table entries into two variants for different kinds of maps. When that proposal proved uncontroversial, he also took the chance to talk about a bug in BPF's call instruction.