LWN on Linux Kernel Space
-
One more pidfs surprise
The "pidfs" virtual filesystem was added to the 6.9 kernel release as a way to export better information about running processes to user space. It replaced a previous implementation in a way that was, on its surface, fully compatible while adding a number of new capabilities. This transition, which was intended to be entirely invisible to existing applications, already ran into trouble in March, when a misunderstanding with SELinux caused systems with pidfs to fail to boot properly. That problem was quickly fixed, but it turns out that there was one more surprise in store, showing just how hard ABI compatibility can be at times.
A pidfd is a file descriptor that identifies a running process. Within the kernel, it must have all of the data structures that normally go along with file descriptors so that kernel subsystems know what to do with it. The kernel has, since the 2.6.22 release in 2007, had a small helper mechanism providing anonymous inodes to back file descriptors that have no real file behind them. When the pidfd abstraction was added to the 5.1 kernel, it was naturally implemented using anonymous inodes, and all worked as intended.
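From user space, a pidfd behaves like any other file descriptor: it can be obtained with the pidfd_open() system call, polled to learn when the target process exits, and used to send signals without the risk of process-ID reuse. A minimal sketch of that usage, invoking the system calls directly via syscall() (the process ID below is, of course, just a placeholder):

    #include <poll.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = 1234;	/* hypothetical target process */

        /* Obtain a pidfd referring to the target process. */
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        if (pidfd < 0) {
            perror("pidfd_open");
            return 1;
        }

        /* Signal the process via the pidfd; immune to PID reuse. */
        if (syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0) < 0)
            perror("pidfd_send_signal");

        /* A pidfd becomes readable when the process exits. */
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        poll(&pfd, 1, -1);
        printf("process %d has exited\n", (int)pid);

        close(pidfd);
        return 0;
    }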
-
New APIs for filesystems
A discussion of extensions to the statx() system call comes up frequently at the Linux Storage, Filesystem, Memory Management, and BPF Summit; this year's edition was no exception. Kent Overstreet led the first filesystem-only session at the summit on querying information about filesystems that have subvolumes and snapshots. While it was billed as a discussion on statx() additions, it ranged more widely over new APIs needed for modern filesystems.
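For reference, statx() already lets callers request specific attributes with a mask and reports back which of them the filesystem was actually able to supply; any new per-filesystem information would presumably be exposed through additional mask bits and fields. A basic call looks something like this:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct statx stx;

        /* Ask only for the fields we care about: size and birth time. */
        if (statx(AT_FDCWD, "/etc/passwd", 0,
                  STATX_SIZE | STATX_BTIME, &stx) < 0) {
            perror("statx");
            return 1;
        }

        printf("size: %llu bytes\n", (unsigned long long)stx.stx_size);
        if (stx.stx_mask & STATX_BTIME)	/* not all filesystems provide it */
            printf("created: %lld\n", (long long)stx.stx_btime.tv_sec);
        return 0;
    }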
-
Handling the NFS change attribute
The saga of the i_version field for inodes, which tracks changes to the data or metadata of a file, continued in a discussion at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. Jeff Layton, who has done much of the work on changing the semantics and functioning of i_version over the years, led a session updating attendees on the status of the effort since a session at last year's summit. His summary was that things are "pretty much where we started last year", but the discussion this time pointed to some possible ways forward.
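The reason the field matters is easiest to see from the NFS client's point of view: the client caches file data and revalidates that cache by comparing the change attribute it recorded earlier against the server's current value; if the two differ, something changed and the cache must be discarded. A conceptual sketch of that check (illustrative only, not actual NFS client code):

    /* Conceptual illustration; the real NFS client logic is more involved. */
    struct cached_inode {
        unsigned long long change_attr;	/* value seen at last revalidation */
        /* ... cached pages, attributes, etc. ... */
    };

    static int cache_still_valid(struct cached_inode *ci,
                                 unsigned long long server_change_attr)
    {
        /*
         * If the server's change attribute differs from what was recorded,
         * the file's data or metadata changed and the cache must be dropped.
         */
        return ci->change_attr == server_change_attr;
    }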
-
Removing GFP_NOFS
The GFP_NOFS flag is meant for kernel memory allocations that should not cause a call back into the filesystem to reclaim memory, because locks are already held that could lead to a deadlock. The "scoped allocation" API is a better way for filesystems to indicate that they are holding such a lock, so GFP_NOFS has long been on the chopping block, though progress has been slow. In a filesystem-track session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, Matthew Wilcox wanted to discuss how to move kernel filesystems away from the flag, with the eventual goal of removing it completely.
He began the session by saying that there are several changes that people would like to see with regard to the GFP flags, but that the scoped-allocation API (i.e. memalloc_nofs_save() and memalloc_nofs_restore(), as mentioned in the LSFMM+BPF topic discussion) for GFP_NOFS went in long ago, while the conversion to it is far from complete. He also wanted to talk a bit about Rust. There is a desire to bring in Rust code from outside the kernel for, say, a hash table, but that requires the ability to allocate memory, which means making GFP flags available. "Why the hell would we want to add GFP flags to every Rust thing that we bring into the kernel? That's crazy."
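For those unfamiliar with the scoped API, it marks an entire region of code, rather than each individual allocation, as one in which reclaim must not call back into the filesystem; allocations inside the region can then simply use GFP_KERNEL. A rough illustration of the pattern (the surrounding function is made up):

    #include <linux/sched/mm.h>
    #include <linux/slab.h>

    /* Illustrative only: "fs_do_something" is not a real kernel function. */
    static int fs_do_something(size_t size)
    {
        unsigned int nofs_flags;
        void *buf;

        /* Locks are held beyond this point; reclaim must not re-enter the FS. */
        nofs_flags = memalloc_nofs_save();

        /* A plain GFP_KERNEL allocation now implicitly behaves as GFP_NOFS. */
        buf = kmalloc(size, GFP_KERNEL);

        memalloc_nofs_restore(nofs_flags);

        if (!buf)
            return -ENOMEM;
        kfree(buf);
        return 0;
    }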
-
Measuring and improving buffered I/O
There are two types of file I/O on Linux: buffered I/O, which goes through the page cache, and direct I/O, which goes directly to the storage device. The performance of buffered I/O was reported to be a lot worse than that of direct I/O, especially for one specific test, in Luis Chamberlain's topic proposal for a session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. The proposal resulted in a lengthy mailing-list discussion, which also came up in Paul McKenney's RCU session the next day; Chamberlain led a combined storage and filesystem session to discuss those results with an eye toward improving buffered I/O performance.
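The distinction between the two paths shows up when a file is opened: buffered I/O is the default, while direct I/O is requested with the O_DIRECT flag and brings alignment requirements with it. A simplified illustration (the file name is just a placeholder):

    #define _GNU_SOURCE		/* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char small[4096];
        void *aligned;

        /* Buffered I/O (the default): data passes through the page cache. */
        int bfd = open("testfile", O_RDONLY);
        if (bfd >= 0 && read(bfd, small, sizeof(small)) < 0)
            perror("buffered read");

        /*
         * Direct I/O: O_DIRECT bypasses the page cache, but requires that
         * buffers, offsets, and lengths be suitably aligned (often 512B or 4KB).
         */
        int dfd = open("testfile", O_RDONLY | O_DIRECT);
        if (dfd >= 0 && posix_memalign(&aligned, 4096, 4096) == 0) {
            if (read(dfd, aligned, 4096) < 0)
                perror("direct read");
            free(aligned);
        }

        if (bfd >= 0)
            close(bfd);
        if (dfd >= 0)
            close(dfd);
        return 0;
    }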
-
Standardizing the BPF ISA
While BPF may be most famous for its use in the Linux kernel, there is actually a growing effort to standardize BPF for use on other systems. These include eBPF for Windows, but also uBPF, rBPF, hBPF, bpftime, and others. Some hardware manufacturers are even considering integrating BPF directly into networking hardware. Dave Thaler led two sessions about all of the problems that cross-platform use inevitably brings and the current status of the standardization work at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit.
Thaler opened the first session (on the first day of the summit) by discussing the many platforms that are now capable of running BPF. With multiple compilers and runtimes, there are inevitable compatibility problems. He defined the goal of the ongoing IETF BPF standardization work as trying to ensure that any compiler can be used with any compliant runtime. He then went into a bit more detail about what "compliant" means in this specific context, which required first explaining a bit of background about the structure of the standardization documents.
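For context, the common ground among all of those implementations is the instruction encoding itself: each BPF instruction is a single 64-bit value (with one exception, the two-slot 64-bit immediate load). Its layout, which matches the kernel's struct bpf_insn definition, looks like:

    #include <stdint.h>

    /* The 64-bit basic BPF instruction format, as in the kernel's UAPI headers. */
    struct bpf_insn {
        uint8_t code;        /* opcode: instruction class plus operation */
        uint8_t dst_reg:4;   /* destination register (r0-r10) */
        uint8_t src_reg:4;   /* source register (r0-r10) */
        int16_t off;         /* signed offset, used by jumps and memory accesses */
        int32_t imm;         /* signed 32-bit immediate operand */
    };

Any compiler that emits only instructions defined in this encoding, and any runtime that executes them as specified, should in principle interoperate; that is the compatibility the standardization work is trying to guarantee.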
-
An instruction-level BPF memory model
There are few topics as arcane as memory models, so it was a pleasant surprise when the double-length session on the BPF memory model at the Linux Storage, Filesystem, Memory Management, and BPF Summit turned out to be understandable. Paul McKenney led the session, although he was clear that the work he was presenting was also due to Puranjay Mohan, who unfortunately could not attend the summit. BPF does not actually have a formalized memory model yet; instead it has relied on a history of talks like this one and a general informal understanding. Unfortunately, ignoring memory models does not make them go away, and this has already caused at least one BPF-related bug on weakly-ordered architectures. Figuring out what a formal memory model for BPF should define was the focus of McKenney's talk.
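The sort of question a memory model must answer is the classic message-passing one: if one thread stores a value and then sets a flag, can another thread that observes the flag still see the old value? On weakly-ordered architectures the answer is yes unless barriers or acquire/release operations are used, and pinning down when BPF programs are protected from such reordering is exactly what a formal model would do. A conceptual illustration in plain C (not BPF code):

    /* Shared between two threads; illustrative only. */
    int data = 0;
    int flag = 0;

    /* Thread 1 */
    void producer(void)
    {
        data = 42;          /* store the payload ...           */
        flag = 1;           /* ... then publish it with a flag */
    }

    /* Thread 2 */
    void consumer(void)
    {
        if (flag == 1) {
            /*
             * Without ordering (e.g. release/acquire on flag), a
             * weakly-ordered CPU may legally observe data == 0 here.
             */
            int r = data;
            (void)r;
        }
    }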
-
Comparing BPF performance between implementations
Alan Jowett returned for a second remote presentation at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit to compare the performance of different BPF runtimes. He showed the results of the MIT-licensed BPF microbenchmark suite he has been working on. The benchmark suite does not yet provide a good direct comparison between all platforms, so the results should be taken with a grain of salt. They do seem to indicate that there is some significant variation between implementations, especially for different types of BPF maps.
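The flavor of the measurements can be pictured as a small BPF program that performs a burst of map operations and reports how long they took; the sketch below is purely illustrative and is not taken from Jowett's suite (the map, program, and attach point are all made up):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* A hypothetical hash map to exercise; array maps could be timed the same way. */
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u32);
        __type(value, __u64);
    } test_map SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_getpid")
    int bench_prog(void *ctx)
    {
        __u64 start = bpf_ktime_get_ns();

        /* Time a burst of lookups against the map. */
        for (__u32 i = 0; i < 100; i++) {
            __u32 key = i;
            __u64 *val = bpf_map_lookup_elem(&test_map, &key);
            if (val)
                (*val)++;
        }

        __u64 elapsed = bpf_ktime_get_ns() - start;
        bpf_printk("100 lookups took %llu ns", elapsed);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

Running the same logic on different runtimes, with different map types substituted in, is roughly how the kind of per-map-type variation Jowett described would show up.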