LWN Articles About Linux Kernel
-
Memory-management documentation and development process
As the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit neared its conclusion, two sessions were held in the memory-management track on process-oriented topics. Mike Rapoport ran a session on memory-management documentation (or the lack thereof), while Andrew Morton talked about the state of the subsystem's development process in general. Both sessions were relatively brief and did not foreshadow substantial changes to come.
-
An introduction to EROFS
Gao Xiang gave an overview of the Extended Read-Only File System (EROFS) in a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. EROFS was added to Linux 5.4 in 2019 and has been increasingly used in places beyond its roots as a filesystem for Android and embedded devices. Container images based on EROFS are being used in many places these days, for example.
Unfortunately, this session was quite difficult for me to follow, so the report below is fragmentary and incomplete. There is a YouTube video of the session, but its audio is nearly inaudible, though perhaps that will be addressed before long. The slides from the session are also available.
EROFS is a block-based, read-only filesystem with a "very simple" format, Xiang began. The earlier read-only filesystems had many limitations, such as not supporting compression, which is part of why EROFS was developed. EROFS stores its data in a block-aligned fashion, which is page-cache friendly; that alignment also allows direct I/O and DAX filesystem access.
-
A decision on composefs
At the end of our February article about the debate around the composefs read-only, integrity-protected filesystem, it was predicted that the topic would come up at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. That happened on the second day of the summit when Alexander Larsson led a session on composefs. While the mailing-list discussion was somewhat contentious, the session was less so, since overlayfs can be made to fit the needs of the composefs use cases. It turns out that an entirely new filesystem is not really needed.
Larsson began by looking at the use case that spurred the creation of composefs. At Red Hat, image-based Linux systems are created using OSTree/libostree; they are not the typical physical block-device images, however, as they are more like "virtual images". There is a content-addressed store (CAS) that contains all of the file content for all of the images. In order to build a directory hierarchy for the virtual image, a branch gets checked out from the OSTree repository, which contains the metadata and directory information for the image; OSTree then builds the directory structure using hard links to the CAS entities.
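The hard-link scheme described above can be illustrated with a small sketch. This is not OSTree's actual on-disk layout or API; the function names (`store_object`, `checkout`), the flat store directory, and the use of SHA-256 as the content key are all assumptions made for illustration. It only shows the general idea: identical content is stored once, and each "image" is a directory tree whose files are hard links into the store.

```python
import hashlib
import os
import tempfile


def store_object(cas_dir, content):
    """Store content under its SHA-256 digest (hypothetical layout); return the path."""
    digest = hashlib.sha256(content).hexdigest()
    path = os.path.join(cas_dir, digest)
    if not os.path.exists(path):
        # Only the first copy of any given content is ever written.
        with open(path, "wb") as f:
            f.write(content)
    return path


def checkout(cas_dir, tree, dest):
    """Build a directory tree at dest, hard-linking each file to its store object."""
    for rel_path, content in tree.items():
        target = os.path.join(dest, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(store_object(cas_dir, content), target)


def demo():
    """Check out the same tree twice; return (objects in store, links to one object)."""
    with tempfile.TemporaryDirectory() as root:
        cas = os.path.join(root, "cas")
        os.makedirs(cas)
        tree = {
            "etc/motd": b"hello\n",
            "usr/share/motd": b"hello\n",  # same content as etc/motd
            "bin/tool": b"\x7fELF",
        }
        checkout(cas, tree, os.path.join(root, "image-a"))
        checkout(cas, tree, os.path.join(root, "image-b"))
        objects = len(os.listdir(cas))
        links = os.stat(os.path.join(root, "image-a", "etc/motd")).st_nlink
        return objects, links
```

In this sketch, two images containing three files each end up sharing just two stored objects, and the link count on a shared file reflects every place it appears plus the store entry itself.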
-
Supporting large block sizes
At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Luis Chamberlain led a plenary session on kernel support for block sizes larger than 4KB. There are assumptions in the current kernel that the block size used by a block-layer device is less than or equal to the system's page size—both are usually 4KB today. But there have been efforts over the years to remove that restriction; that work may be heading toward fruition, in part because of the folio efforts of late, though there are still lots of areas that need attention.
Originally, storage devices used 512-byte blocks, but over time that has grown to 4KB and beyond, Chamberlain said. Supporting block sizes greater than the page size has been desired for years; the first related patches were posted 16 years ago and the topic comes up at every LSFMM, he said. There is a wiki page about the project as well.
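As a rough illustration of the relationship between block size and page size, the following sketch compares the two from user space. It is only an approximation of what the session discussed: `os.statvfs()` reports a filesystem's block size rather than the block size of the underlying block-layer device, and the helper names (`block_exceeds_page`, `report`) are hypothetical.

```python
import mmap
import os


def block_exceeds_page(block_size, page_size=mmap.PAGESIZE):
    """Return True if a block of block_size bytes is larger than one page."""
    return block_size > page_size


def report(path="/"):
    """Return (filesystem block size, page size, whether the block exceeds a page)."""
    # f_bsize is the filesystem's block size, used here as a stand-in
    # for the device block size (an assumption made for illustration).
    fs_block = os.statvfs(path).f_bsize
    return fs_block, mmap.PAGESIZE, block_exceeds_page(fs_block)
```

On a system with 4KB pages, a 16KB block would fall into the case the kernel has historically not supported, while a 4KB block would not.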
-
Special file descriptors in BPF
Developers learning the Unix (or POSIX in general) system-call set will quickly encounter file descriptors, which are used to represent open files and more. Developers also tend to learn early on that the first three file descriptors are special, with file descriptor zero being the standard input stream, one being standard output, and two being standard error. The kernel, though, does not normally attach any specific meaning to a given descriptor number, so it was somewhat surprising when a recent BPF patch series attempted to attach a special meaning to zero when used as a file descriptor.
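The point that descriptor numbers carry no inherent meaning can be seen from user space. POSIX requires that a newly opened file receive the lowest-numbered descriptor not currently in use, so a number is reused as soon as it is freed; by the same rule, a process that has closed its standard input would get zero back from its next open. The sketch below assumes a single-threaded process, and the function name is made up for illustration.

```python
import os


def lowest_free_descriptor_demo():
    """Open two descriptors, close the first, and show that its number is reused."""
    first = os.open(os.devnull, os.O_RDONLY)
    second = os.open(os.devnull, os.O_RDONLY)
    os.close(first)
    # The freed number is now the lowest one available, so it is handed out again.
    reused = os.open(os.devnull, os.O_RDONLY)
    os.close(second)
    os.close(reused)
    return first, second, reused
```

The returned `reused` value matches `first`, which shows that the number identifies a slot in the process's descriptor table and nothing more.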
BPF objects (maps and such) normally go away when they are closed, usually when the creating process exits. They may be "pinned", though, which gives them a name in the BPF filesystem (usually under /sys/fs/bpf) and allows them to outlive the creating process. The existing API for the pinning of BPF objects is path-based, meaning that the caller provides a string containing the full path name to be created for an object.