LWN Articles on Linux (Outside Paywall Now)
-
Concurrent page-fault handling with per-VMA locks [LWN.net]
The kernel is, in many ways, a marvel of scalability, but there is a longstanding pain point in the memory-management subsystem that has resisted all attempts at elimination: the mmap_lock. This lock was inevitably a topic at the 2022 Linux Storage, Filesystem, Memory-Management and BPF Summit (LSFMM), where the idea of using per-VMA locks was raised. Suren Baghdasaryan has posted an implementation of that idea — but with an interesting twist on how those locks are implemented.
The mmap_lock (formerly called mmap_sem) is a reader/writer lock that controls access to a process's address space; before making changes there (mapping in a new range, for example), the kernel must acquire that lock. Page-fault handling must also acquire mmap_lock (in reader mode) to ensure that the address space doesn't change in surprising ways while a fault is being resolved. A process can have a large address space and many threads running (and incurring page faults) concurrently, turning mmap_lock into a significant bottleneck. Even if the lock itself is not contended, the constant cache-line bouncing hurts performance.
Many attempts at solving the mmap_lock scalability problem have taken the form of speculative page-fault handling, where the work to resolve a fault is done without taking mmap_lock in the hope that the address space doesn't change in the meantime. Should concurrent access occur, the speculative page-fault code drops the work it has done and retries after taking mmap_lock. Various implementations have been shown over the years and they have demonstrated performance benefits, but the solutions are complex and none have managed to convince enough developers to be merged into the mainline kernel.
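The retry pattern described above can be sketched in a few lines of C. This is a single-threaded model, not kernel code: struct demo_mm, its sequence counter, and every demo_* function are hypothetical names invented for the illustration, and the "interfere" flag merely simulates a concurrent change arriving in the middle of a fault.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an address space. */
struct demo_mm {
    unsigned long seq;      /* bumped on every address-space change */
    unsigned long mapping;  /* the state a fault handler needs to read */
    int locked_faults;      /* faults that had to fall back to the lock */
};

/* Simulate a writer changing the address space. */
static void demo_change_mapping(struct demo_mm *mm, unsigned long new_mapping)
{
    mm->mapping = new_mapping;
    mm->seq++;
}

/*
 * Try to resolve a fault without the lock. Returns false when the
 * address space changed underneath us and the work must be dropped.
 */
static bool demo_fault_speculative(struct demo_mm *mm, bool interfere,
                                   unsigned long *out)
{
    unsigned long start = mm->seq;
    unsigned long result = mm->mapping * 2;   /* the "work" of the fault */

    if (interfere)                            /* simulated concurrent change */
        demo_change_mapping(mm, mm->mapping + 1);

    if (mm->seq != start)
        return false;                         /* stale: discard the result */
    *out = result;
    return true;
}

/* Speculate first; on failure, redo the work on the locked slow path. */
static unsigned long demo_handle_fault(struct demo_mm *mm, bool interfere)
{
    unsigned long result;

    if (demo_fault_speculative(mm, interfere, &result))
        return result;

    /* In the kernel, this is where mmap_lock would be taken for reading. */
    mm->locked_faults++;
    return mm->mapping * 2;
}
```

The fast path costs one extra comparison; the complexity in real implementations comes from making every structure touched during the speculative phase safe to read while it may be changing.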
An alternative approach that has often been considered is range locking. Rather than locking the entire address space to make a change to a small part of it, range locking ensures exclusive access to the address range of interest while allowing accesses to other parts of the address space to proceed concurrently. Range locking turns out to be tricky as well, though, and no implementation has gotten close to being considered for merging.
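The core idea of range locking, that only overlapping ranges conflict, fits in a short sketch. What follows is just the bookkeeping, with no blocking, no reader/writer distinction, and no thread safety; the demo_* names and the fixed-size table are assumptions made for illustration and do not come from any posted patch set.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_RANGES 8

/* A half-open address range: [start, end). */
struct demo_range {
    unsigned long start, end;
};

/* Hypothetical range lock: simply the set of ranges currently held. */
struct demo_range_lock {
    struct demo_range held[DEMO_MAX_RANGES];
    size_t count;
};

static bool demo_overlaps(struct demo_range a, struct demo_range b)
{
    return a.start < b.end && b.start < a.end;
}

/* Take [start, end) exclusively; fail if it overlaps a held range. */
static bool demo_range_trylock(struct demo_range_lock *l,
                               unsigned long start, unsigned long end)
{
    struct demo_range want = { start, end };

    for (size_t i = 0; i < l->count; i++)
        if (demo_overlaps(l->held[i], want))
            return false;
    if (l->count == DEMO_MAX_RANGES)
        return false;               /* table full */
    l->held[l->count++] = want;
    return true;
}

/* Release a range previously taken with demo_range_trylock(). */
static void demo_range_unlock(struct demo_range_lock *l,
                              unsigned long start, unsigned long end)
{
    for (size_t i = 0; i < l->count; i++) {
        if (l->held[i].start == start && l->held[i].end == end) {
            l->held[i] = l->held[--l->count];   /* swap in the last entry */
            return;
        }
    }
}
```

Two adjacent ranges such as [0x1000, 0x3000) and [0x3000, 0x4000) can be held at once, while a request for [0x2000, 0x5000) must wait for both. The linear scan is fine for a sketch; a real implementation would need a scalable lookup structure and a way to wait for conflicting holders, which is part of what makes the problem tricky.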
-
What's in a (type) name? [LWN.net]
The kernel's manual pages are in a bit of an interesting position. They are managed as a separate project, distinct from the kernel's documentation, and have the task of documenting both the kernel's system-call interface and the wrappers for that interface provided by the C library. Sometimes the two objectives come into conflict, as can be seen in a discussion that has been playing out over the course of the last year on whether to use C standard type names to describe kernel-defined structures.
The C <stdint.h> header file defines a number of types for developers who need to specify exactly how they need an integer variable to be represented. For example, int16_t is a 16-bit, signed type, while uint64_t is a 64-bit, unsigned type. This level of control is needed when defining data structures that are implemented by hardware, exchanged through communications protocols, or passed between user and kernel space. The kernel, though, does not use these types to define its system-call interface. Instead, the kernel has its own types defined internally. Rather than use uint64_t, for example, the kernel's API definitions use __u64. That has been the situation for a long time (since before the standard C types existed) and is simply part of how the kernel project does things.
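As a small illustration of why exact-width types matter, here is a structure written with the standard names. The structure and its fields are hypothetical, made up for this example; a kernel header describing the same layout would spell the first field's type __u64 rather than uint64_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record passed across an interface boundary. */
struct demo_record {
    uint64_t offset;   /* exactly 64 bits, unsigned */
    uint32_t length;   /* exactly 32 bits, unsigned */
    int16_t  delta;    /* exactly 16 bits, signed */
    uint16_t flags;    /* exactly 16 bits, unsigned */
};

/* Both sides of the interface can check the layout at compile time. */
_Static_assert(sizeof(int16_t) == 2, "int16_t is two bytes");
_Static_assert(sizeof(uint64_t) == 8, "uint64_t is eight bytes");
_Static_assert(offsetof(struct demo_record, length) == 8, "length follows offset");
_Static_assert(sizeof(struct demo_record) == 16, "no hidden padding");
```

Had the fields been declared as long or int, their sizes could differ between, say, a 32-bit process and a 64-bit kernel, and the two sides would disagree about where each field lives.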
-
A framework for code tagging [LWN.net]
Kernel code can, at times, be quite inward looking; it often refers to itself. To enable this introspection, the kernel has evolved several mechanisms for identifying specific locations in the code and carrying out actions related to those locations. The code-tagging framework patch set, posted by Suren Baghdasaryan and Kent Overstreet, is an attempt to replace various ad hoc implementations with a single framework, and to add some new applications as well.
There are a number of reasons for the kernel to need to identify specific locations within the code. For example, kernel code is not normally allowed to incur page faults, but the functions that access user-space memory will often do just that. To do the right thing in that situation, the kernel build process makes a note of the location of every user-space access operation; when a page fault happens, that list is checked and, if the fault happened in an expected location, it is handled normally. The kernel's dynamic debugging mechanism is another example; each debugging print statement is tracked and can be enabled independently.
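The check made at fault time amounts to a lookup in a table built ahead of time. The sketch below models it with a sorted array of made-up instruction addresses and a binary search; the addresses, the demo_* names, and the choice of bsearch() are all assumptions for illustration and say nothing about how the kernel's own table is laid out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical addresses of instructions that may fault, sorted ascending. */
static const uintptr_t demo_expected_faults[] = {
    0x1000, 0x1040, 0x2010, 0x3008,
};

/* Three-way comparison of two addresses for bsearch(). */
static int demo_cmp_addr(const void *key, const void *elem)
{
    uintptr_t a = *(const uintptr_t *)key;
    uintptr_t b = *(const uintptr_t *)elem;

    return (a > b) - (a < b);
}

/* Did the fault happen at a location recorded at build time? */
static bool demo_fault_is_expected(uintptr_t fault_ip)
{
    size_t n = sizeof(demo_expected_faults) / sizeof(demo_expected_faults[0]);

    return bsearch(&fault_ip, demo_expected_faults, n,
                   sizeof(demo_expected_faults[0]), demo_cmp_addr) != NULL;
}
```

A fault at 0x2010 would be handled normally in this model, while one at 0x2011 would be treated as a bug.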
The usual trick for implementing this kind of mechanism is to create a special ELF section in the kernel binary; that section is then populated with structures recording the points of interest within the kernel. At run time, the kernel can locate that section, where it will find an array of structures with the needed information. At its core, the tagging framework is a set of functions and macros that make the creation of and access to this special section easier.
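The special-section trick also works in user space, which makes it easy to try out. This sketch assumes GCC or Clang targeting ELF and a linker (GNU ld or compatible) that provides __start_ and __stop_ symbols for sections whose names are valid C identifiers. The demo_tags section, struct demo_tag, and the DEMO_TAG() macro are hypothetical names; they are not the API of the code-tagging patch set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record describing one tagged location in the code. */
struct demo_tag {
    const char *file;
    int line;
    const char *label;
};

/*
 * Emit one struct demo_tag into the "demo_tags" section. The explicit
 * alignment keeps the compiler from raising the alignment of individual
 * entries, which would leave gaps and break the array walk below.
 */
#define DEMO_TAG(lbl)                                                   \
    do {                                                                \
        static struct demo_tag demo_tag_entry                           \
            __attribute__((used, section("demo_tags"),                  \
                           aligned(__alignof__(struct demo_tag)))) =    \
            { __FILE__, __LINE__, (lbl) };                              \
        (void)demo_tag_entry;                                           \
    } while (0)

/* Section boundaries, supplied by the linker. */
extern const struct demo_tag __start_demo_tags[];
extern const struct demo_tag __stop_demo_tags[];

/* Two tagged locations; non-static so they are always compiled in. */
void demo_first(void)  { DEMO_TAG("first"); }
void demo_second(void) { DEMO_TAG("second"); }

/* Number of tags the linker collected into the section. */
static size_t demo_tag_count(void)
{
    return (size_t)(__stop_demo_tags - __start_demo_tags);
}

/* Walk the section as an array, looking for a tag by label. */
static const struct demo_tag *demo_find_tag(const char *label)
{
    for (const struct demo_tag *t = __start_demo_tags;
         t < __stop_demo_tags; t++)
        if (strcmp(t->label, label) == 0)
            return t;
    return NULL;
}
```

Neither demo_first() nor demo_second() needs to run for its tag to be found: the entries are placed in the section at build time, so the array is complete before the program executes a single instruction.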