LWN Articles on Kernel
-
LWN ☛ The hierarchical constant bandwidth server scheduler
The POSIX realtime model, which is implemented in the Linux kernel, can ensure that a realtime process obtains the CPU time it needs to get its job done. It can be less effective, though, when there are multiple realtime processes competing for the available CPU resources. The hierarchical constant bandwidth server patch series, posted by Yuri Andriaccio with work by Luca Abeni, Alessio Balsini, and Andrea Parri, is a modification to the Linux scheduler intended to make it possible to configure systems with multiple realtime tasks in a deterministic and correct manner.
The core concept behind POSIX realtime is priority — the highest-priority task always runs. If there are multiple processes at the same priority, the result depends on whether they are configured as SCHED_FIFO tasks (in which case the running task gets the CPU until it voluntarily gives it up) or as SCHED_RR (causing the equal-priority tasks to share the CPU in time slices). This model allows a single realtime task to monopolize a CPU indefinitely, perhaps at the expense of other realtime tasks that also need to run.
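As a rough illustration of that model (not taken from the article), the sketch below shows how a thread might request one of the POSIX realtime policies with sched_setscheduler(); the priority value chosen here is arbitrary, and the call needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO limit:

    /*
     * Illustrative sketch, not from the article: request a POSIX realtime
     * policy for the calling process.  A SCHED_FIFO task at this priority
     * keeps the CPU until it blocks or yields; equal-priority SCHED_RR
     * tasks would instead share the CPU in time slices.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        struct sched_param sp = {
            /* Any value in the policy's valid priority range will do. */
            .sched_priority = sched_get_priority_min(SCHED_FIFO) + 1,
        };

        /* A pid of 0 means the calling process; this requires
           CAP_SYS_NICE or an adequate RLIMIT_RTPRIO limit. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
            perror("sched_setscheduler");
            return EXIT_FAILURE;
        }

        printf("running as SCHED_FIFO, priority %d\n", sp.sched_priority);
        /* Realtime work goes here; an unbounded loop would monopolize the CPU. */
        return 0;
    }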
In an attempt to improve support for systems with multiple realtime tasks, the realtime group scheduling feature was added to the 2.6.25 kernel in 2008 by Peter Zijlstra. It allows a system administrator to put realtime tasks into control groups, then to limit the amount of CPU time available to each group. This feature works and is used, but it has never been seen as an optimal solution. It is easy to misconfigure (the documentation warns that "fiddling with these settings can result in an unstable system"), complicates the scheduler in a number of ways, and lacks a solid theoretical underpinning.
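A minimal sketch of how that configuration might look follows, assuming a cgroup-v1 "cpu" controller mounted at /sys/fs/cgroup/cpu, a kernel built with CONFIG_RT_GROUP_SCHED, and a hypothetical, already-created "rtgroup" group; the cpu.rt_period_us and cpu.rt_runtime_us files are the knobs the documentation warns about:

    /*
     * Minimal sketch under the assumptions above: grant the hypothetical
     * "rtgroup" control group 200ms of realtime runtime in every 1s period.
     * Both files take values in microseconds.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static void write_value(const char *path, long value)
    {
        FILE *f = fopen(path, "w");

        if (!f || fprintf(f, "%ld\n", value) < 0 || fclose(f) != 0) {
            perror(path);
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        write_value("/sys/fs/cgroup/cpu/rtgroup/cpu.rt_period_us", 1000000);
        write_value("/sys/fs/cgroup/cpu/rtgroup/cpu.rt_runtime_us", 200000);
        return 0;
    }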
-
LWN ☛ A parallel path for GPU restore in CRIU
The fundamental concept of checkpoint/restore is elegant: capture a process's state and resurrect it later, perhaps elsewhere. Checkpointing meticulously records a process's memory, open files, CPU state, and more into a snapshot. Restoration then reconstructs the process from this state. This established technique faces new challenges with GPU-accelerated applications, where low-latency restoration is crucial for fault tolerance, live migration, and fast startups. Recently, the restore process for AMD GPUs has been redesigned to eliminate substantial bottlenecks.
-
LWN ☛ Parallelizing filesystem writeback
Writeback for filesystems is the process of flushing the "dirty" (written) data in the page cache to storage. At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Anuj Gupta led a combined storage and filesystem session on some work that has been done to parallelize the writeback process. Some of the performance problems that have been seen with the existing single-threaded writeback came up in a session at last year's summit, where the idea of doing writeback in parallel was discussed.
Gupta began by noting that Kundan Kumar, who posted the topic proposal, was supposed to be leading the session, but was unable to attend. Kumar and Gupta have both been working on a prototype for parallelizing writeback; the session was meant to gather feedback on it.
Currently, writeback for buffered I/O is single-threaded, though applications are issuing multithreaded writes, which can lead to contention. The backing storage device is represented in the kernel by a BDI (struct backing_dev_info), and each BDI has a single writeback thread that processes the struct bdi_writeback embedded in it. Each bdi_writeback has a single list of inodes that need to be written back and a single delayed_work instance, which makes the process single-threaded.
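A greatly simplified sketch of that relationship might look like the following; the real definitions live in the kernel's backing-dev headers and carry many more fields:

    /*
     * Simplified sketch, not the actual kernel definitions: one
     * backing_dev_info embeds one bdi_writeback, which owns a single list
     * of dirty inodes and a single delayed_work item, so only one worker
     * flushes a given device at a time.
     */
    #include <linux/list.h>
    #include <linux/workqueue.h>

    struct bdi_writeback_sketch {
        struct list_head    b_dirty;    /* inodes waiting to be written back */
        struct delayed_work dwork;      /* the one writeback work item */
    };

    struct backing_dev_info_sketch {
        struct bdi_writeback_sketch wb; /* the device's root writeback state */
    };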
-
LWN ☛ Supporting NFS v4.2 WRITE_SAME
At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Anna Schumaker led a discussion about implementing the NFS v4.2 WRITE_SAME command in both the NFS client and server. WRITE_SAME is meant to write large amounts of identical data (e.g. zeroes) to the server without actually needing to transfer all of it over the wire. In her topic proposal, Schumaker wondered whether other filesystems needed the functionality, in which case it should be implemented at the virtual filesystem (VFS) layer, or whether it should simply be handled as an NFS-specific ioctl().
The NFS WRITE_SAME operation was partly inspired by the SCSI WRITE SAME command, she began; it is "intended for databases to be able to initialize a bulk of records all at once". It offloads much of the work to the server side. So far, Schumaker has been implementing WRITE_SAME with an ioctl() using a structure that looks similar to the application data block structure defined in the NFS v4.2 RFC for use by WRITE_SAME.
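The exact interface was not settled, but a hypothetical sketch of what such an ioctl() argument could look like, loosely modeled on the application data block (app_data_block4) in RFC 7862, is shown below; the names are illustrative, not the proposed ABI:

    /*
     * Hypothetical sketch only: an ioctl() argument loosely modeled on the
     * application data block that RFC 7862 defines for WRITE_SAME; this is
     * not the interface under discussion.  The pattern would be written
     * repeatedly, server-side, to fill block_count blocks of block_size
     * bytes starting at offset.
     */
    #include <stdint.h>

    struct nfs_write_same_args {
        uint64_t    offset;        /* byte offset where the region begins */
        uint64_t    block_size;    /* size of each block to fill */
        uint64_t    block_count;   /* number of blocks to fill */
        uint64_t    pattern_len;   /* length of the repeating pattern */
        const void *pattern;       /* pattern data, e.g. a block of zeroes */
    };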
-
LWN ☛ Getting Lustre upstream
The Lustre filesystem has a long history, some of which intersects with Linux. It was added to the staging tree in 2013, but was bounced out of staging in 2018, due to a lack of progress and a development model that was incompatible with the kernel's. Lustre may be working its way back into the kernel, though. In a filesystem-track session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Timothy Day and James Simmons led a discussion on how to get Lustre into the mainline.
Day began with an overview of Lustre, which is a "high-performance parallel filesystem". It is typically used by systems with lots of GPUs that need to be constantly fed with data (e.g. AI workloads) and for checkpointing high-performance-computing (HPC) workloads. A file is split up into multiple chunks that are stored on different servers. Both the client and server implementations run in the kernel, similar to NFS. For the past ten or more years, the wire and disk formats have been "pretty stable" with "very little change"; Lustre has good interoperability between different versions, unlike in the distant past, when both server and client needed to be on the same version.