Mapping PagedAttention to Virtual Memory

LLMs generate tokens one at a time. To avoid recomputing attention over the full history at each step, serving systems cache intermediate results in the KV cache, one entry per token, per layer. The KV cache is large and grows with sequence length.

Before vLLM, state-of-the-art serving systems like Orca pre-allocated a contiguous chunk of GPU memory per request, sized for the maximum possible sequence length, because the attention kernel assumed each layer's K/V tensors for a sequence sat in a single contiguous range. Most requests are far shorter than the max, so the majority of KV cache memory was wasted to fragmentation.

vLLM's PagedAttention fixes this by applying a technique that OS designers settled on decades ago. What I find most interesting is how the problem was shown to exist in the first place.

Discovering the problem

The starting point is a rough calculation. Model parameters already consume the majority of GPU memory. What’s left goes to KV cache, and it fills fast. A single request at max sequence length can take over a gigabyte. Only a handful of concurrent requests fit before memory runs out. KV cache is the scarce resource, and every byte wasted directly limits batch size and throughput.
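That rough calculation can be sketched in a few lines. The shape below (40 layers, hidden size 5120, fp16) is an assumption for illustration, roughly a 13B-parameter model, not a figure from this article:

```python
# Back-of-envelope KV cache size for an assumed 13B-class model:
# 40 layers, hidden size 5120, fp16 (2 bytes per value).
layers, hidden, dtype_bytes = 40, 5120, 2

# Per token, each layer stores one K and one V vector of size `hidden`.
kv_bytes_per_token = 2 * layers * hidden * dtype_bytes

max_seq_len = 2048
kv_bytes_per_request = kv_bytes_per_token * max_seq_len

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")           # 800 KiB
print(f"{kv_bytes_per_request / 1024**3:.2f} GiB at max length")  # 1.56 GiB
```

At these sizes, a 40 GB GPU with, say, 14 GB left after weights fits fewer than ten max-length requests, which is why every wasted byte matters.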

With the bottleneck identified, the authors designed an experiment to isolate exactly where the waste came from. They reimplemented Orca in three variants: one that reserves max sequence length per request (what real systems do), one with a smarter power-of-two heuristic, and an oracle that magically knows the exact output length in advance, representing the best pre-allocation could ever do.

By profiling all three under real workloads, they broke KV cache usage into categories: actual token storage, space reserved for future tokens, internal fragmentation (allocated for max length but never used), and external fragmentation (gaps from the buddy allocator). The result is that at most 40% of KV cache memory stored actual tokens.

Even the variant that cheats by knowing the future still wastes significant memory to the buddy allocator's external fragmentation. Better length prediction alone therefore cannot fix the problem: the waste is structural, baked into the pre-allocation model itself.
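A toy version of that accounting, with made-up request lengths (this is an illustration, not the paper's methodology), shows how fast internal fragmentation alone adds up:

```python
# Toy illustration of internal fragmentation when every request
# reserves KV slots for the maximum sequence length.
max_len = 2048
actual_lens = [130, 512, 45, 990, 300, 1700, 64, 220]  # hypothetical outputs

reserved = max_len * len(actual_lens)    # slots pre-allocated
used = sum(actual_lens)                  # slots that ever hold a token
internal_waste = 1 - used / reserved
print(f"{internal_waste:.0%} of reserved slots never hold a token")
```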

The mapping to Virtual Memory

OS concept              vLLM equivalent
---------------------   -----------------------------------------------------
Virtual address space   Logical KV cache (contiguous in the model's view)
Physical page frames    KV blocks: fixed-size GPU memory chunks
Page table              Block table: maps logical block index to physical block
Demand paging           Blocks allocated only when needed
Copy-on-write           Shared prefixes share physical blocks until divergence
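One way to make the first three rows concrete is a minimal block-table sketch. `BlockTable`, `BLOCK_SIZE`, and the free pool are invented names for illustration, not vLLM's actual data structures:

```python
BLOCK_SIZE = 16  # tokens per KV block (a common vLLM default shape)

class BlockTable:
    """Maps a sequence's logical block index to a physical block id."""
    def __init__(self, free_pool):
        self.free_pool = free_pool  # pool of free physical block ids
        self.blocks = []            # logical index -> physical block id

    def append_token(self, num_tokens_so_far):
        # Demand paging: grab a new physical block only when the
        # current one is full (or when the sequence has no block yet).
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.blocks.append(self.free_pool.pop())

free_pool = list(range(100))   # toy pool of physical block ids
table = BlockTable(free_pool)
for t in range(40):            # generate 40 tokens one by one
    table.append_token(t)
print(len(table.blocks))       # 40 tokens span 3 blocks of 16
```

The physical blocks need not be contiguous or ordered; the attention kernel follows the table, exactly as the MMU follows a page table.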

The analogy between PagedAttention and virtual memory holds in a few places.

Copy-on-Write. In an OS, two processes can share the same physical pages, marked read-only. When one process writes, a page fault triggers a private copy of just that page. In vLLM, multiple requests sharing a system prompt point to the same KV cache blocks. Only when a request generates its own tokens — diverging from the shared prefix — does vLLM copy the block.
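That copy-on-write logic can be sketched with a simple reference count. This is an illustrative sketch, not vLLM's implementation:

```python
ref_count = {}   # physical block id -> number of sequences referencing it

def share(block_id):
    # A new sequence starts out pointing at an existing (prompt) block.
    ref_count[block_id] = ref_count.get(block_id, 0) + 1

def write_token(block_id, free_pool):
    # Copy-on-write: a sole owner writes in place; a sharer copies first.
    if ref_count[block_id] > 1:
        new_block = free_pool.pop()   # private copy of just this block
        ref_count[block_id] -= 1
        ref_count[new_block] = 1
        return new_block              # caller repoints its block table here
    return block_id

share(7); share(7)                    # two requests share prompt block 7
mine = write_token(7, free_pool=[42])
print(mine, ref_count)                # the writer got private block 42
```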

Paging out. When GPU memory is full, vLLM picks a victim request and moves its KV blocks to CPU DRAM — the same structure as an OS paging to disk. On re-access, the blocks get fetched back. The tier hierarchy is the same: scarce fast memory, abundant slow memory, eviction policy in between.

Scheduling. The OS scheduler decides which processes get CPU time, preempting when needed. vLLM’s scheduler decides which requests get GPU time and memory, preempting when GPU memory is full.

Here is where it gets interesting. When an OS needs a page that was evicted, it has exactly one option: read it back from disk. The page contents are opaque bytes — there’s no way to reconstruct them.

However, vLLM has two options:

  1. Swap the KV blocks back from CPU DRAM over PCIe
  2. Recompute them from the original input tokens on the GPU

This second option doesn’t exist in OS virtual memory. It exists in vLLM because KV cache is deterministically reproducible. Given the same input tokens and model weights, the corresponding KV cache entry is always the same. So the system gets to choose: is it cheaper to transfer or to recalculate? The answer depends on the size. Short sequences are cheap to recompute. Long sequences have large KV caches where the PCIe transfer is cheaper than redoing all that attention math.
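That choice can be framed as a tiny cost model. Every constant below is an invented assumption for the sketch (a real system would measure these), but the shape of the comparison holds: swap cost is linear in sequence length, recompute cost has a quadratic attention term:

```python
# Illustrative cost model for recovering a preempted sequence's KV cache.
PCIE_BYTES_PER_S = 16e9        # assumed effective PCIe bandwidth
KV_BYTES_PER_TOKEN = 819_200   # fp16, 13B-class model (assumption)

def recovery_choice(n_tokens):
    # Swap-in cost grows linearly with KV cache size...
    swap_s = n_tokens * KV_BYTES_PER_TOKEN / PCIE_BYTES_PER_S
    # ...while recompute pays a quadratic attention term over the prefix
    # (coefficients invented for illustration).
    recompute_s = n_tokens * 1e-5 + n_tokens**2 * 5e-8
    return "recompute" if recompute_s < swap_s else "swap"

print(recovery_choice(200))    # short sequence: cheap to redo the math
print(recovery_choice(2000))   # long sequence: the transfer wins
```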

This two-recovery-path pattern shows up elsewhere. Consider a distributed replicated storage system in which a node needs data that isn't in its local RAM. It has two options:

  1. Fetch from another node over the network: the data was replicated there
  2. Read from local disk: reconstruct from persistent storage

Same shape of decision: two ways to recover, with different cost profiles depending on data size and available bandwidth. In our example, the choice is network latency vs disk I/O. In vLLM, it’s PCIe bandwidth vs GPU compute.

The analogy isn't exact. In distributed storage, two recovery paths exist because of redundancy: the data was copied to multiple nodes. In vLLM, they exist because the data can be recomputed from inputs. Different reasons for recoverability, but the same decision structure: when the fast tier is full, which recovery path is cheaper for this particular piece of data?

Memory efficiency and throughput

A GPU has thousands of cores. One request doesn’t saturate them. Running multiple requests in a batch uses the hardware fully. But each request needs KV cache in GPU memory. Memory waste limits batch size. Batch size limits GPU utilization.
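The chain from utilization to batch size is simple arithmetic. The budget and per-request numbers below are illustrative assumptions, not measurements:

```python
# Rough arithmetic: batch size as a function of KV cache utilization.
kv_budget_gib = 12.0          # GPU memory left over after model weights
avg_kv_per_request_gib = 0.5  # average live KV cache per request

def max_batch(utilization):
    # Only the usefully-utilized fraction of the budget holds real tokens.
    return int(kv_budget_gib * utilization / avg_kv_per_request_gib)

print(max_batch(0.35))  # fragmentation-heavy pre-allocation
print(max_batch(0.90))  # paged allocation
```

More than doubling utilization more than doubles the batch, and with it the tokens the GPU produces per second.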

By eliminating most of this waste, vLLM fits more requests into the same GPU memory, runs bigger batches, and gets higher throughput. It applies memory management ideas that operating systems have used for decades, adapted for hardware where the trade-offs are different.

Reference: Woosuk Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” arXiv (2023), arXiv:2309.06180.