Mind the Patch Gap: Exploiting an io_uring Vulnerability in Ubuntu

By Oriol Castejón


This post discusses a use-after-free vulnerability, CVE-2024-0582, in io_uring in the Linux kernel. Despite the vulnerability being patched in the stable kernel in December 2023, it wasn’t ported to Ubuntu kernels for over two months, making it an easy 0day vector in Ubuntu during that time.

In early January 2024, a Project Zero issue for a recently fixed io_uring use-after-free (UAF) vulnerability (CVE-2024-0582) was made public. It was apparent that the vulnerability allowed an attacker to obtain read and write access to a number of previously freed pages. This seemed to be a very powerful primitive: usually a UAF gets you access to a freed kernel object, not a whole page – or even better, multiple pages. As the Project Zero issue also described, it was clear that this vulnerability should be easily exploitable: if an attacker has total access to free pages, once these pages are returned to a slab cache to be reused, they will be able to modify any contents of any object allocated within these pages. In the more common situation, the attacker can modify only a certain type of object, and possibly only at certain offsets or with certain values.

Moreover, this fact also suggests that a data-only exploit should be possible. In general terms, such an exploit does not rely on modifying the code execution flow, by building for instance a ROP chain or using similar techniques. Instead, it focuses on modifying certain data that ultimately grants the attacker root privileges, such as making read-only files writable by the attacker. This approach makes exploitation more reliable, stable, and allows bypassing some exploit mitigations such as Control-Flow Integrity (CFI), as the instructions executed by the kernel are not altered in any way.

Finally, according to the Project Zero issue, this vulnerability was present in the Linux kernel from versions starting at 6.4 and prior to 6.7. At that moment, Ubuntu 23.10 was running a vulnerable verison of 6.5 (and somewhat later so was Ubuntu 22.04 LTS), so it was a good opportunity to exploit the patch gap, understand how easy it would be for an attacker to do that, and how long they might possess an 0day exploit based on an Nday.

More precisely:

This post describes the data-only exploit strategy that we implemented, allowing a non-privileged user (and without the need of unprivileged user namespaces) to achieve root privileges on affected systems. First, a general overview of the io_uring interface is given, as well as some more specific details of the interface relevant to this vulnerability. Next, an analysis of the vulnerability is provided. Finally, a strategy for a data-only exploit is presented.


The io_uring interface is an asynchronous I/O API for Linux created by Jens Axboe and introduced in the Linux kernel version 5.1. Its goal is to improve performance of applications with a high number of I/O operations. It provides interfaces similar to functions like read()  and write(), for example, but requests are satisfied in an asynchronous manner to avoid the context switching overhead caused by blocking system calls.

The io_uring interface has been a bountiful target for a lot of vulnerability research; it was disabled in ChromeOS, production Google servers, and restricted in Android. As such, there are many blog posts that explain it with a lot of detail. Some relevant references are the following:

In the next subsections we give an overview of the io_uring interface. We pay special attention to the Provided Buffer Ring functionality, which is relevant to the vulnerability discussed in this post. The reader can also check “What is io_uring?”, as well as the above references for alternative overviews of this subsystem.

The io_uring Interface

The basis of io_uring is a set of two ring buffers used for communication between user and kernel space. These are:

  • The submission queue (SQ), which contains submission queue entries (SQEs) describing a request for an I/O operation, such as reading or writing to a file, etc.
  • The completion queue (CQ), which contains completion queue entries (CQEs) that correspond to SQEs that have been processed and completed.

This model allows executing a number of I/O requests to be performed asynchronously using a single system call, while in a synchronous manner each request would have typically corresponded to a single system call. This reduces the overhead caused by blocking system calls, thus improving performance. Moreover, the use of shared buffers also reduces the overhead as no data between user and kernelspace has to be transferred.

The io_uring API consists of three system calls:

  • io_uring_setup()
  • io_uring_register()
  • io_uring_enter()

The io_uring_setup() System Call

The io_uring_setup() system call sets up a context for an io_uring instance, that is, a submission and a completion queue with the indicated number of entries each one. Its prototype is the following:

					int io_uring_setup(u32 entries, struct io_uring_params *p);

Its arguments are:

  • entries: It determines how many elements the SQ and CQ must have at the minimum.
  • params: It can be used by the application to pass options to the kernel, and by the kernel to pass information to the application about the ring buffers.

On success, the return value of this system call is a file descriptor that can be later used to perform operation on the io_uring instance.

The io_uring_register() System Call

The io_uring_register() system call allows registering resources, such as user buffers, files, etc., for use in an io_uring instance. Registering such resources makes the kernel map them, avoiding future copies to and from userspace, thus improving performance. Its prototype is the following:

					int io_uring_register(unsigned int fd, unsigned int opcode, void *arg, unsigned int nr_args);

Its arguments are:

  • fd: The file io_uring file descriptor returned by the io_uring_setup() system call.
  • opcode: The specific operation to be executed. It can have certain values such as IORING_REGISTER_BUFFERS, to register user buffers, or IORING_UNREGISTER_BUFFERS, to release the previously registered buffers.
  • arg: Arguments passed to the operation being executed. Their type depends on the specific opcode being passed.
  • nr_args: Number of arguments in arg being passed.

On success, the return value of this system call is either zero or a positive value, depending on the opcode used.

Provided Buffer Rings

An application might need to have different types of registered buffers for different I/O requests. Since kernel version 5.7, to facilitate managing these different sets of buffers, io_uring allows the application to register a pool of buffers that are identified by a group ID. This is done using the IORING_REGISTER_PBUF_RING opcode in the io_uring_register() system call.

More precisely, the application starts by allocating a set of buffers that it wants to register. Then, it makes the io_uring_register() system call with opcode IORING_REGISTER_PBUF_RING, specifying a group ID with which these buffers should be associated, a start address of the buffers, the length of each buffer, the number of buffers, and a starting buffer ID. This can be done for multiple sets of buffers, each one having a different group ID.

Finally, when submitting a request, the application can use the IOSQE_BUFFER_SELECT flag and provide the desired group ID to indicate that a provided buffer ring from the corresponding set should be used. When the operation has been completed, the buffer ID of the buffer used for the operation is passed to the application via the corresponding CQE.

Provided buffer rings can be unregistered via the io_uring_register() system call using the IORING_UNREGISTER_PBUF_RING opcode.

User-mapped Provided Buffer Rings

In addition to the buffers allocated by the application, since kernel version 6.4, io_uring allows a user to delegate the allocation of provided buffer rings to the kernel. This is done using the IOU_PBUF_RING_MMAP flag passed as an argument to io_uring_register(). In this case, the application does not need to previously allocate these buffers, and therefore the start address of the buffers does not have to be passed to the system call. Then, after io_uring_register() returns, the application can mmap() the buffers into userspace with the offset set as:


where bgid is the corresponding group ID. These offsets, as well as others used to mmap() the io_uring data, are defined in include/uapi/linux/io_uring.h:

 * Magic offsets for the application to mmap the data it needs
#define IORING_OFF_CQ_RING			0x8000000ULL
#define IORING_OFF_SQES				0x10000000ULL
#define IORING_OFF_PBUF_RING		0x80000000ULL
#define IORING_OFF_MMAP_MASK		0xf8000000ULL

The function that handles such an mmap() call is io_uring_mmap():

					// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/io_uring.c#L3439

static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
	size_t sz = vma->vm_end - vma->vm_start;
	unsigned long pfn;
	void *ptr;

	ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz);
	if (IS_ERR(ptr))
		return PTR_ERR(ptr);

	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);

Note that remap_pfn_range() ultimately creates a mapping with the VM_PFNMAP flag set, which means that the MM subsystem will treat the base pages as raw page frame number mappings wihout an associated page structure. In particular, the core kernel will not keep reference counts of these pages, and keeping track of it is the responsability of the calling code (in this case, the io_uring subsystem).

The io_uring_enter() System Call

The io_uring_enter() system call is used to initiate and complete I/O using the SQ and CQ that have been previously set up via the io_uring_setup() system call. Its prototype is the following:

					int io_uring_enter(unsigned int fd, unsigned int to_submit, unsigned int min_complete, unsigned int flags, sigset_t *sig);

Its arguments are:

  • fd: The io_uring file descriptor returned by the io_uring_setup() system call.
  • to_submit: Specifies the number of I/Os to submit from the SQ.
  • flags: A bitmask value that allows specifying certain options, such as IORING_ENTER_GETEVENTS, IORING_ENTER_SQ_WAKEUP, IORING_ENTER_SQ_WAIT, etc.
  • sig: A pointer to a signal mask. If it is not NULL, the system call replaces the current signal mask by the one pointed to by sig, and when events become available in the CQ restores the original signal mask.


The vulnerability can be triggered when an application registers a provided buffer ring with the IOU_PBUF_RING_MMAP flag. In this case, the kernel allocates the memory for the provided buffer ring, instead of it being done by the application. To access the buffers, the application has to mmap() them to get a virtual mapping. If the application later unregisters the provided buffer ring using the IORING_UNREGISTER_PBUF_RING opcode, the kernel frees this memory and returns it to the page allocator. However, it does not have any mechanism to check whether the memory has been previously unmapped in userspace. If this has not been done, the application has a valid memory mapping to freed pages that can be reallocated by the kernel for other purposes. From this point, reading or writing to these pages will trigger a use-after-free.

The following code blocks show the affected parts of functions relevant to this vulnerability. Code snippets are demarcated by reference markers denoted by [N]. Lines not relevant to this vulnerability are replaced by a [Truncated] marker. The code corresponds to the Linux kernel version 6.5.3, which corresponds to the version used in the Ubuntu kernel 6.5.0-15-generic.

Registering User-mapped Provided Buffer Rings

The handler of the IORING_REGISTER_PBUF_RING opcode for the io_uring_register() system call is the io_register_pbuf_ring() function, shown in the next listing.

					// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L537

int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
	struct io_uring_buf_reg reg;
	struct io_buffer_list *bl, *free_bl = NULL;
	int ret;


	if (copy_from_user(&reg, arg, sizeof(reg)))
		return -EFAULT;


	if (!is_power_of_2(reg.ring_entries))
		return -EINVAL;


	/* cannot disambiguate full vs empty due to head/tail size */
	if (reg.ring_entries >= 65536)
		return -EINVAL;

	if (unlikely(reg.bgid io_bl)) {
		int ret = io_init_bl_list(ctx);
		if (ret)
			return ret;

	bl = io_buffer_get_list(ctx, reg.bgid);
	if (bl) {
		/* if mapped buffer ring OR classic exists, don't allow */
		if (bl->is_mapped || !list_empty(&bl->buf_list))
			return -EEXIST;
	} else {


		free_bl = bl = kzalloc(sizeof(*bl), GFP_KERNEL);
		if (!bl)
			return -ENOMEM;


	if (!(reg.flags & IOU_PBUF_RING_MMAP))
		ret = io_pin_pbuf_ring(&reg, bl);
		ret = io_alloc_pbuf_ring(&reg, bl);


	return ret;

The function starts by copying the provided arguments into an io_uring_buf_reg structure reg [1]. Then, it checks that the desired number of entries is a power of two and is strictly less than 65536 [2]. Note that this implies that the maximum number of allowed entries is 32768.

Next, it checks whether a provided buffer list with the specified group ID reg.bgid exists and, in case it does not, an io_buffer_list structure is allocated and its address is stored in the variable bl [3]. Finally, if the provided arguments have the flag IOU_PBUF_RING_MMAP set, the io_alloc_pbuf_ring() function is called [4], passing in the address of the structure reg, which contains the arguments passed to the system call, and the pointer to the allocated buffer list structure bl.

					// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L519

static int io_alloc_pbuf_ring(struct io_uring_buf_reg *reg,
			      struct io_buffer_list *bl)
	size_t ring_size;
	void *ptr;


	ring_size = reg->ring_entries * sizeof(struct io_uring_buf_ring);


	ptr = (void *) __get_free_pages(gfp, get_order(ring_size));
	if (!ptr)
		return -ENOMEM;


	bl->buf_ring = ptr;
	bl->is_mapped = 1;
	bl->is_mmap = 1;
	return 0;

The io_alloc_pbuf_ring() function takes the number of ring entries specified in reg->ring_entries and computes the resulting size ring_size by multiplying it by the size of the io_uring_buf_ring structure [5], which is 16 bytes. Then, it requests a number of pages from the page allocator that can fit this size via a call to __get_free_pages() [6]. Note that for the maximum number of allowed ring entries, 32768, ring_size is 524288 and thus the maximum number of 4096-byte pages that can be retrieved is 128. The address of the first page is then stored in the io_buffer_list structure, more precisely in bl->buf_ring [7]. Also, bl->is_mapped and bl->is_mmap are set to 1.

Unregistering Provided Buffer Rings

The handler of the IORING_UNREGISTER_PBUF_RING opcode for the io_uring_register() system call is the io_unregister_pbuf_ring() function, shown in the next listing.

					// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L601

int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
	struct io_uring_buf_reg reg;
	struct io_buffer_list *bl;


    if (copy_from_user(&reg, arg, sizeof(reg)))
		return -EFAULT;
	if (reg.resv[0] || reg.resv[1] || reg.resv[2])
		return -EINVAL;
	if (reg.flags)
		return -EINVAL;


	bl = io_buffer_get_list(ctx, reg.bgid);
	if (!bl)
		return -ENOENT;
	if (!bl->is_mapped)
		return -EINVAL;


	__io_remove_buffers(ctx, bl, -1U);
	if (bl->bgid >= BGID_ARRAY) {
		xa_erase(&ctx->io_bl_xa, bl->bgid);
	return 0;

Again, the function starts by copying the provided arguments into a io_uring_buf_reg structure reg [8]. Then, it retrieves the provided buffer list corresponding to the group ID specified in reg.bgid and stores its address in the variable bl [9]. Finally, it passes bl to the function __io_remove_buffers() [10].

					// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L209

static int __io_remove_buffers(struct io_ring_ctx *ctx,
			       struct io_buffer_list *bl, unsigned nbufs)
	unsigned i = 0;

	/* shouldn't happen */
	if (!nbufs)
		return 0;

	if (bl->is_mapped) {
		i = bl->buf_ring->tail - bl->head;
		if (bl->is_mmap) {
			struct page *page;


			page = virt_to_head_page(bl->buf_ring);

			if (put_page_testzero(page))
			bl->buf_ring = NULL;
			bl->is_mmap = 0;
		} else if (bl->buf_nr_pages) {


In case the buffer list structure has the is_mapped and is_mmap flags set, which is the case when the buffer ring was registered with the IOU_PBUF_RING_MMAP flag [7], the function reaches [11]. Then, the page structure of the head page corresponding to the virtual address of the buffer ring bl->buf_ring is obtained. Finally, all the pages forming the compound page with head page are freed at [12], thus returning them to the page allocator.

Note that if the provided buffer ring is set up with IOU_PBUF_RING_MMAP, that is, it has been allocated by the kernel and not the application, the userspace application is expected to have previously mmap()ed this memory. Moreover, recall that since the memory mapping was created with the VM_PFNMAP flag, the reference count of the page structure was not modified during this operation. In other words, in the code above there is no way for the kernel to know whether the application has unmapped the memory before freeing it via the call to free_compound_page(). If this has not happened, a use-after-free can be triggered by the application by just reading or writing to this memory.


The exploitation mechanism presented in this post relies on how memory allocation works on Linux, so the reader is expected to have some familiarity with it. As a refresher, we highlight the following facts:

  • The page allocator is in charge of managing memory pages, which are usually 4096 bytes. It keeps lists of free pages of order n, that is, memory chunks of page size multiplied by 2^n. These pages are served in a first-in-first-out basis.
  • The slab allocator sits on top of the buddy allocator and keeps caches of commonly used objects (dedicated caches) or fixed-size objects (generic caches), called slab caches, available for allocation in the kernel. There are several implementations of slab allocators, but for the purpose of this post only the SLUB allocator, the default in modern versions of the kernel, is relevant.
  • Slab caches are formed by multiple slabs, which are sets of one or more contiguous pages of memory. When a slab cache runs out of free slabs, which can happen if a large number of objects of the same type or size are allocated and not freed during a period of time, the operating system allocates a new slab by requesting free pages to the page allocator.

One of such cache slabs is the filp, which contains file structures. A filestructure, shown in the next listing, represents an open file.

					// Source: https://elixir.bootlin.com/linux/v6.5.3/source/include/linux/fs.h#L961

struct file {
	union {
		struct llist_node	f_llist;
		struct rcu_head 	f_rcuhead;
		unsigned int 		f_iocb_flags;

	 * Protects f_ep, f_flags.
	 * Must not be taken from IRQ context.
	spinlock_t		f_lock;
	fmode_t			f_mode;
	atomic_long_t		f_count;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	unsigned int		f_flags;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	u64			f_version;
	void			*f_security;
	/* needed for tty driver, and maybe others */
	void			*private_data;

	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct hlist_head	*f_ep;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
	errseq_t		f_wb_err;
	errseq_t		f_sb_err; /* for syncfs */
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

The most relevant fields for this exploit are the following:

  • f_mode: Determines whether the file is readable or writable.
  • f_pos: Determines the current reading or writing position.
  • f_op: The operations associated with the file. It determines the functions to be executed when certain system calls such as read(), write(), etc., are issued on the file. For files in ext4 filesystems, this is equal to the ext4_file_operations variable.

Strategy for a Data-Only Exploit

The exploit primitive provides an attacker with read and write access to a certain number of free pages that have been returned to the page allocator. By opening a file a large number of times, the attacker can force the exhaustion of all the slabs in the filp cache, so that free pages are requested to the page allocator to create a new slab in this cache. In this case, further allocations of file structures will happen in the pages on which the attacker has read and write access, thus being able to modify them. In particular, for example, by modifying the f_mode field, the attacker can make a file that has been opened with read-only permissions to be writable.

This strategy was implemented to successfully exploit the following versions of Ubuntu:

  • Ubuntu 22.04 Jammy Jellyfish LTS with kernel 6.5.0-15-generic.
  • Ubuntu 22.04 Jammy Jellyfish LTS with kernel 6.5.0-17-generic.
  • Ubuntu 23.10 Mantic Minotaur with kernel 6.5.0-15-generic.
  • Ubuntu 23.10 Mantic Minotaur with kernel 6.5.0-17-generic.

The next subsections give more details on how this strategy can be carried out.

Triggering the Vulnerability

The strategy begins by triggering the vulnerability to obtain read and write access to freed pages. This can be done by executing the following steps:

  • Making an io_uring_setup() system call to set up the io_uring instance.
  • Making an io_uring_register() system call with opcode IORING_REGISTER_PBUF_RING and the IOU_PBUF_RING_MMAP flag, so that the kernel itself allocates the memory for the provided buffer ring.
Registering a provided buffer ring
  • mmap()ing the memory of the provided buffer ring with read and write permissions, using the io_uring file descriptor and the offset IORING_OFF_PBUF_RING.
MMap the buffer ring
  • Unregistering the provided buffer ring by making an io_uring_register()system call with opcode IORING_UNREGISTER_PBUF_RING
Unregistering the buffer ring

At this point, the pages corresponding to the provided buffer ring have been returned to the page allocator, while the attacker still has a valid reference to them.

Spraying File Structures

The next step is spawning a large number of child processes, each one opening the file /etc/passwd many times with read-only permissions. This forces the allocation of corresponding file structures in the kernel.

Spraying file structures

By opening a large number of files, the attacker can force the exhaustion of the slabs in the filp cache. After that, new slabs will be allocated by requesting free pages from the page allocator. At some point, the pages that previously corresponded to the provided buffer ring, and to which the attacker still has read and write access, will be returned by the page allocator.

Requesting free pages from the page allocator

Hence, all of the file structures created after this point will be allocated in the attacker-controlled memory region, giving them the possibility to modify the structures.

Allocating file structures within a controlled page

Note that these child processes have to wait until indicated to proceed in the last stage of the exploit, so that the files are kept open and their corresponding structures are not freed.

Locating a File Structure in Memory

Although the attacker may have access to some slabs belonging to the filp cache, they don’t know where they are within the memory region. To identify these slabs, however, the attacker can search for the ext4_file_operations address at the offset of the file.f_op field within the file structure. When one is found, it can be safely assumed that it corresponds to the file structure of one instance of the previously opened /etc/passwd file.

Note that even when Kernel Address Space Layout Randomization (KASLR) is enabled, to identify the ext4_file_operations address in memory it is only necessary to know the offset of this symbol with respect to the _text symbol, so there is no need for a KASLR bypass. Indeed, given a value val of an unsigned integer found in memory at the corresponding offset, one can safely assume that it is the address of ext4_file_operations if:

  • (val >> 32 & 0xffffffff) == 0xffffffff, i.e. the 32 most significant bits are all 1.
  • (val & 0xfffff) == (ext4_fops_offset & 0xfffff), i.e. the 20 least significant bits of val and ext4_fops_offset, the offset of ext4_file_operations with respect to _text, are the same.

Changing File Permissions and Adding a Backdoor Account

Once a file structure corresponding to the /etc/passwd file is located in the memory region accessible by the attacker, it can be modified at will. In particular, setting the FMODE_WRITE and FMODE_CAN_WRITE flags in the file.f_mode field of the found structure will make the /etc/passwd file writable when using the corresponding file descriptor.

Moreover, setting the file.f_pos field of the found file structure to the current size of the /etc/passwd/ file, the attacker can ensure that any data written to it is appended at the end of the file.

To finish, the attacker can signal all the child processes spawned in the second stage to try to write to the opened /etc/passwd file. While most of all of such attempts will fail, as the file was opened with read-only permissions, the one corresponding to the modified file structure, which has write permissions enabled due to the modification of the file->f_mode field, will succeed.


To sum up, in this post we described a use-after-free vulnerability that was recently disclosed in the io_uring subsystem of the Linux kernel, and a data-only exploit strategy was presented. This strategy proved to be realitvely simple to implement. During our tests it proved to be very reliable and, when it failed, it did not affect the stability of the system. This strategy allowed us to exploit up-to-date versions of Ubuntu during the patch gap window of about two months.

About Exodus Intelligence

Our world class team of vulnerability researchers discover hundreds of exclusive Zero-Day vulnerabilities, providing our clients with proprietary knowledge before the adversaries find them. We also conduct N-Day research, where we select critical N-Day vulnerabilities and complete research to prove whether these vulnerabilities are truly exploitable in the wild.

For more information on our products and how we can help your vulnerability efforts, visit www.exodusintel.com or contact info@exodusintel.com for further discussion.