the cost of a system call

A system call looks like a function call. It is not. A function call stays within a single address space, a single privilege level, and a single execution context; a system call crosses all three. Understanding what it actually does — at the hardware level — helps explain why modern kernels have spent two decades trying to avoid them on hot paths.

what happens on the way into the kernel

On x86-64, the syscall instruction performs a tightly defined sequence of operations. It saves the user-mode instruction pointer in RCX, saves the flags register in R11, loads a new instruction pointer from a model-specific register (IA32_LSTAR), switches the privilege level from ring 3 to ring 0, and jumps into the kernel's syscall entry point. The instruction does not switch stacks or save any other registers; that is left to software. The entry path then switches to a kernel stack, saves the state it needs, resolves the syscall number, and dispatches to the handler.
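As a rough illustration of that sequence from the user side, the sketch below issues a system call with the raw instruction instead of going through a libc wrapper. It assumes Linux on x86-64 with GCC or Clang inline assembly; raw_syscall0 is a hypothetical helper name, and other platforms fall back to libc's syscall(). The clobber list is the interesting part: it names the two registers the instruction overwrites with the saved instruction pointer and the saved flags.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_* syscall numbers */
#include <unistd.h>        /* syscall(), used as the portable fallback */

/* Hypothetical helper: issue a zero-argument system call directly. */
static long raw_syscall0(long nr)
{
    assert(nr >= 0);   /* syscall numbers are non-negative */
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* rax carries the syscall number in and the result out.
       The instruction overwrites rcx (return address) and r11 (saved
       flags), so both are declared as clobbered. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr)
                      : "rcx", "r11", "memory");
    return ret;
#else
    return syscall(nr);   /* other platforms: let libc issue the call */
#endif
}
```

Calling raw_syscall0(SYS_getpid) should return the same value as getpid().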

On the way out, the reverse happens: the handler returns, the kernel restores the saved state, switches back to ring 3, and resumes user execution. A null syscall round trip on a modern system is often on the order of hundreds of nanoseconds, depending on the CPU and on which mitigations are enabled. For comparison, a direct function call typically costs a nanosecond or two. A syscall can therefore be up to two orders of magnitude more expensive than the thing it looks like.
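One way to check these figures on a particular machine is a small timing loop. The sketch below assumes Linux and uses getppid as a stand-in for a null syscall, since it does almost no work inside the kernel; ns_per_syscall, ns_per_call, and plain_call are hypothetical names. The numbers it produces vary with the CPU, the kernel version, and the enabled mitigations, so treat them as rough estimates.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* A real, non-inlined function call to compare against. */
__attribute__((noinline)) static long plain_call(long x)
{
    __asm__ volatile ("" ::: "memory");   /* keep the call from being optimized away */
    return x + 1;
}

/* Average cost of one near-empty system call, in nanoseconds. */
static double ns_per_syscall(long iters)
{
    assert(iters > 0);
    uint64_t t0 = now_ns();
    for (long i = 0; i < iters; i++)
        syscall(SYS_getppid);             /* enters the kernel every iteration */
    return (double)(now_ns() - t0) / (double)iters;
}

/* Average cost of one ordinary function call, in nanoseconds. */
static double ns_per_call(long iters)
{
    assert(iters > 0);
    volatile long sink = 0;
    uint64_t t0 = now_ns();
    for (long i = 0; i < iters; i++)
        sink = plain_call(sink);
    return (double)(now_ns() - t0) / (double)iters;
}
```

Running both with a few million iterations and dividing one result by the other gives the ratio for that machine.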

where the cost actually lives

The instruction itself is cheap. The cost lives in the consequences.

Privilege transition overhead. Entering the kernel disrupts speculative and out-of-order execution in ways a normal function call does not. This alone can cost tens of nanoseconds that a normal call never pays.

Register save/restore. The kernel must preserve enough architectural state to enter, run, and return safely. Depending on the path, this may include the general-purpose registers and, on systems with wide vector extensions like AVX-512, a couple of kilobytes of additional state whenever the vector registers have to be saved (the 32 ZMM registers alone hold 64 bytes each, 2,048 bytes in total).

Cache and TLB effects. The kernel's code and data live in different pages from the user's. Each crossing can disturb the user-side working set and reduce locality. After the syscall returns, the user code may run cold for a while — not because of the syscall's own cost, but because the translation and cache state it depended on has been disturbed.

Post-Meltdown overhead. Since Meltdown was disclosed in 2018, most kernels enable Kernel Page Table Isolation (KPTI) on affected CPUs. With KPTI, every syscall additionally switches the page table root register (CR3) twice, once on entry and once on exit, and each switch incurs a full TLB flush on CPUs without PCID support. On older hardware, this roughly doubled the cost of many common syscalls overnight.
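On Linux it is possible to check whether a given machine is paying this cost. The sketch below assumes the sysfs file /sys/devices/system/cpu/vulnerabilities/meltdown and the flags line in /proc/cpuinfo are present, which is typical on x86 Linux but not guaranteed; file_has_token is a hypothetical helper that looks for a whole whitespace-separated word in a text file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `token` appears as a whole
   whitespace-separated word in the file, 0 if it does not, and -1 if
   the file cannot be opened. Lines longer than the buffer are read in
   pieces, so a word that straddles a piece boundary would be missed. */
static int file_has_token(const char *path, const char *token)
{
    assert(path != NULL && token != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char line[16384];
    int found = 0;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        /* Split on whitespace so "pcid" does not match "invpcid". */
        for (char *t = strtok(line, " \t\n"); t != NULL; t = strtok(NULL, " \t\n")) {
            if (strcmp(t, token) == 0) {
                found = 1;
                break;
            }
        }
    }
    fclose(f);
    return found;
}
```

A result of 1 from file_has_token("/sys/devices/system/cpu/vulnerabilities/meltdown", "PTI") suggests page table isolation is active (the file typically reads "Mitigation: PTI" in that case), and a result of 1 from file_has_token("/proc/cpuinfo", "pcid") suggests the CPU advertises PCID.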

the architectural response

Most of this cost comes from crossing the boundary itself rather than from anything the kernel does once inside, so the response has been to avoid syscalls rather than optimize them. Three patterns recur.

Batching. If an application is going to make a thousand small syscalls, the kernel can offer interfaces that amortize the boundary crossing. io_uring is a good example: user and kernel communicate through shared submission and completion ring buffers, so work can be submitted and completed in batches. With submission-queue polling (IORING_SETUP_SQPOLL), a kernel thread watches the submission ring, so the submitting thread rarely needs to enter the kernel at all.
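The same amortization idea can be shown with a much older and simpler interface than io_uring. The sketch below is not io_uring; it swaps in POSIX writev, which hands the kernel several buffers in one crossing. write_each, write_batched, and MAX_IOV are hypothetical names, and short writes are treated as errors to keep the example small.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>   /* writev, struct iovec */
#include <unistd.h>    /* write */

#define MAX_IOV 16     /* illustrative limit for this sketch */

/* One write() per chunk: n boundary crossings.
   Returns the number of syscalls issued, or -1 on error. */
static int write_each(int fd, const char *const chunks[], int n)
{
    int calls = 0;
    for (int i = 0; i < n; i++) {
        size_t len = strlen(chunks[i]);
        if (write(fd, chunks[i], len) != (ssize_t)len)
            return -1;
        calls++;
    }
    return calls;
}

/* The same bytes in one writev(): a single boundary crossing.
   Returns the number of syscalls issued (1), or -1 on error. */
static int write_batched(int fd, const char *const chunks[], int n)
{
    assert(n >= 0);
    if (n > MAX_IOV)
        return -1;

    struct iovec iov[MAX_IOV];
    size_t total = 0;
    for (int i = 0; i < n; i++) {
        iov[i].iov_base = (void *)chunks[i];   /* writev only reads the buffer */
        iov[i].iov_len = strlen(chunks[i]);
        total += iov[i].iov_len;
    }
    if (writev(fd, iov, n) != (ssize_t)total)
        return -1;
    return 1;
}
```

Both functions put identical bytes on the descriptor; the only difference is how many times the process enters the kernel to do it.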

Mapping read-only kernel state into userspace. Some "syscalls" don't need to enter the kernel because they don't need anything only the kernel can provide. Asking for the current time is a good example. The vDSO, a small shared library the kernel maps into every user process, contains user-mode implementations of gettimeofday, clock_gettime, and a few others. They read a kernel-maintained data structure that the kernel publishes under a sequence counter, so a reader can detect a concurrent update and retry, and they return without ever touching ring 0.
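The read side of such a structure can be modeled with a sequence counter. The sketch below is a simplified model in C11 atomics, not the kernel's actual data layout or code: struct time_page, writer_update, and reader_snapshot are hypothetical names. The writer makes the counter odd while it updates the fields, and the reader retries whenever it sees an odd counter or a counter that changed during the read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared page holding the current time. */
struct time_page {
    atomic_uint seq;          /* odd while an update is in progress */
    _Atomic uint64_t sec;
    _Atomic uint64_t nsec;
};

/* Writer side (the kernel's role in this model). */
static void writer_update(struct time_page *p, uint64_t sec, uint64_t nsec)
{
    assert(p != NULL);
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);   /* now odd */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&p->nsec, nsec, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);   /* even again */
}

/* Reader side (the user-mode role): no lock, no kernel entry. */
static void reader_snapshot(struct time_page *p, uint64_t *sec, uint64_t *nsec)
{
    assert(p != NULL && sec != NULL && nsec != NULL);
    unsigned before, after;
    do {
        before = atomic_load_explicit(&p->seq, memory_order_acquire);
        *sec = atomic_load_explicit(&p->sec, memory_order_relaxed);
        *nsec = atomic_load_explicit(&p->nsec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((before & 1u) != 0 || before != after);   /* torn read: retry */
}
```

The reader never writes to the shared page, which is what allows the mapping to be read-only from the user side.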

Shared memory as a bypass. When a user process and a kernel subsystem need to exchange a lot of data, setting up a shared memory region lets them do it without a syscall per transfer. High-performance networking stacks, storage engines, and virtualization drivers often use variations of this approach. The syscall happens once to set up the mapping; after that, it's just memory.
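The shape of such a channel can be sketched with a single-producer, single-consumer ring in a shared mapping. This is a minimal model under stated assumptions: it uses an anonymous MAP_SHARED region (Linux and most Unix systems), it shares between user-side parties rather than with a real kernel subsystem, and struct ring, ring_create, ring_push, ring_pop, and RING_SLOTS are hypothetical names. The one system call is the mmap in ring_create; push and pop are plain memory operations.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define RING_SLOTS 64u   /* power of two so indices wrap with a mask */
static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0, "RING_SLOTS must be a power of two");

struct ring {
    atomic_uint head;            /* next slot the consumer reads  */
    atomic_uint tail;            /* next slot the producer writes */
    uint64_t slot[RING_SLOTS];
};

/* The only system call: map a region both sides can see. */
static struct ring *ring_create(void)
{
    void *mem = mmap(NULL, sizeof(struct ring), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;   /* anonymous mappings start zeroed */
}

/* Producer: returns 1 on success, 0 if the ring is full. */
static int ring_push(struct ring *r, uint64_t value)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return 0;
    r->slot[tail & (RING_SLOTS - 1u)] = value;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);   /* publish */
    return 1;
}

/* Consumer: returns 1 and stores the value, or 0 if the ring is empty. */
static int ring_pop(struct ring *r, uint64_t *out)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return 0;
    *out = r->slot[head & (RING_SLOTS - 1u)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);   /* free the slot */
    return 1;
}
```

A process that forks after ring_create shares the mapping with its child, so one side can push while the other pops with no further kernel involvement, apart from whatever mechanism is chosen to wait when the ring is empty or full.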

what the pattern is

These approaches are variations on the same idea: the boundary is expensive, so move the work to one side or the other and cross the boundary as rarely as possible. The syscall as an abstraction is fine. The syscall as a frequent operation is the problem.

When you are writing code that talks to the kernel thousands of times a second (a database, a networking stack, an audio engine, a game engine), the first optimization to consider is not "make each syscall faster." It is "make fewer syscalls." The first is bounded by the hardware. The second is bounded by how well you understand the boundary.