vLLM: LLM Inference That Doesn’t Waste Your GPU
Preface
vLLM is an open-source library for running LLMs on GPUs, designed for high-throughput, memory-efficient inference. Before we even start talking about vLLM and why it’s so fast, we need to understand a few things about LLM inference. First and foremost: the KV cache.
KV cache, in one picture
What it is: during prefill, a decoder-only transformer computes attention over the whole prompt and stores each layer’s K/V tensors for every processed token. During decode, each new token reuses those cached K/Vs so it doesn’t recompute the whole prompt again. That turns the per-step cost from “full prompt again” into “look up K/Vs + one step,” which is why decode is much cheaper than prefill.
Why memory blows up: KV memory scales as roughly 2 (K and V) × context length × layers × kv-heads × head_dim × dtype bytes per request, so long contexts and many concurrent requests eat GPU memory fast.
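To make that concrete, here is a quick back-of-the-envelope estimate. The model shape below (32 layers, 32 KV heads, head_dim 128, bf16) is an assumption for illustration, roughly a 7B-class model with full multi-head attention; models that use grouped-query attention have far fewer KV heads and correspondingly smaller caches.

# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x dtype bytes per token.
# The model shape here is an assumption for illustration, not tied to any benchmark below.
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")                      # ~512 KiB
context_len = 4096
print(f"KV cache for one 4096-token request: {context_len * bytes_per_token / 2**30:.1f} GiB")  # ~2.0 GiB

Half a megabyte per token sounds harmless until you multiply it by long contexts and dozens of concurrent requests.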
Why KV cache isn’t enough (by itself)
Even with KV cache, you still have:
- Padding waste in prefill when lengths vary (see the quick estimate after this list).
- Stragglers: one long request stalls a whole static batch.
- Fragmented KV memory if you try to reuse/evict caches across many concurrent requests.
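To see how much a static batch can waste on padding alone, here is a toy calculation; the request lengths are made up for illustration.

# Static batching pads every prompt to the longest one in the batch.
# Toy example: token counts for four requests (assumed values).
prompt_lens = [32, 64, 512, 48]
padded = len(prompt_lens) * max(prompt_lens)   # tokens actually allocated and computed
useful = sum(prompt_lens)                      # tokens that carry real work
print(f"padding waste: {1 - useful / padded:.0%}")  # ~68% of the batch is padding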
You need engine-level tricks for memory & scheduling — that’s where vLLM comes in.
vLLM's tricks
PagedAttention: virtual memory for KV
- What: Store KV cache in fixed-size pages/blocks and add an indirection table, just like OS virtual memory.
- Why it helps: near-zero KV waste, cheap compaction/eviction, and fast sharing of prefixes across requests — no giant memcopies.
- Effect: lets the scheduler keep the GPU full even as requests of different lengths come and go.
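Here is a minimal sketch of the page-table idea, not vLLM’s internal implementation: a shared pool of fixed-size blocks plus a per-request block table mapping logical blocks to physical ones. The block size and names are assumptions for illustration.

# Toy paged-KV allocator: shared block pool + per-request block table (the "page table").
BLOCK_SIZE = 16                          # tokens per KV block (illustrative value)
free_blocks = list(range(1024))          # physical block ids in the shared GPU pool
block_tables: dict[str, list[int]] = {}  # request id -> physical blocks, in logical order
num_tokens: dict[str, int] = {}          # request id -> tokens written so far

def append_token(req_id: str) -> None:
    """Reserve KV space for one new token; allocate a block only when the last one fills up."""
    n = num_tokens.get(req_id, 0)
    if n % BLOCK_SIZE == 0:              # last block is full (or this is the first token)
        block_tables.setdefault(req_id, []).append(free_blocks.pop())
    num_tokens[req_id] = n + 1

def release(req_id: str) -> None:
    """Return a finished request's blocks to the pool; nothing contiguous to compact."""
    free_blocks.extend(block_tables.pop(req_id, []))
    num_tokens.pop(req_id, None)

Because allocation happens in small fixed-size chunks, the waste is at most one partially filled block per sequence, and sharing a prefix is just two block tables pointing at the same physical block ids.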
Continuous batching
- What: New requests join mid-flight, finished ones leave; the scheduler treats prompt and output tokens uniformly and mixes prefill with decode in the same step.
- Why it helps: higher throughput and lower tail latency; stragglers don’t freeze the batch.
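Conceptually, the scheduling loop looks like the toy simulation below: fake tokens, and MAX_RUNNING standing in for the real KV-memory budget. It is a sketch of the idea, not vLLM’s scheduler.

import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    budget: int                        # tokens this request wants (toy stand-in for hitting EOS)
    outputs: list = field(default_factory=list)

MAX_RUNNING = 8                        # toy stand-in for the real KV-memory / block budget
waiting = deque(Request(i, budget=random.randint(4, 64)) for i in range(32))
running: list[Request] = []

while waiting or running:
    # Admit new requests mid-flight as long as there is room in the batch.
    while waiting and len(running) < MAX_RUNNING:
        running.append(waiting.popleft())
    # One engine step; in the real engine this is one forward pass mixing prefill and decode.
    for req in running:
        req.outputs.append(f"tok{len(req.outputs)}")
    # Finished requests leave immediately; their slot goes to the next waiting request.
    running = [r for r in running if len(r.outputs) < r.budget]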
Prefix caching
- What: Detect shared prefixes (e.g., system prompt / RAG boilerplate), reuse KV pages instead of recomputing.
- Why it helps: slashes time-to-first-token for repetitive structures in chats, agents, and RAG. (vLLM does this automatically at the block level.)
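Under the hood the idea is block-level: a full KV block can be reused when the tokens it covers, and everything before them, match a block that is already cached. Here is a toy sketch of that lookup; the hashing scheme and names are mine, not vLLM’s.

# Toy block-level prefix cache: full blocks are keyed by the tokens they cover
# plus everything before them, so identical prefixes map to identical block ids.
BLOCK_SIZE = 16
cache: dict[int, int] = {}      # prefix hash -> physical block id
next_block_id = 0

def blocks_for(prompt_tokens: list[int]) -> list[int]:
    """Return physical block ids for the prompt, reusing cached blocks for shared prefixes."""
    global next_block_id
    blocks, prefix_hash = [], 0
    for start in range(0, len(prompt_tokens) - len(prompt_tokens) % BLOCK_SIZE, BLOCK_SIZE):
        chunk = tuple(prompt_tokens[start:start + BLOCK_SIZE])
        prefix_hash = hash((prefix_hash, chunk))   # chain hashes so position matters
        if prefix_hash not in cache:               # cache miss: "compute" and store the block
            cache[prefix_hash] = next_block_id
            next_block_id += 1
        blocks.append(cache[prefix_hash])
    return blocks

# Two prompts sharing a long system prompt reuse the same leading blocks.
system = list(range(64))                           # 4 full blocks of boilerplate
a = blocks_for(system + [1001, 1002])
b = blocks_for(system + [2001, 2002])
assert a[:4] == b[:4]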
Some benchmarks
First round: A10G, 16 GB RAM, 4 vCPUs
- vLLM
- Pure HF (torch)
We start with a relatively simple setup (MODEL_ID and prompts are placeholders; plug in your own model and workload):
from vllm import LLM, SamplingParams
MODEL_ID = "..."   # placeholder: any Hugging Face causal LM id
prompts = ["..."]  # placeholder: the benchmark prompts
llm = LLM(MODEL_ID, dtype="bfloat16", enable_prefix_caching=True)
outs = llm.generate(prompts, SamplingParams(max_tokens=128, temperature=0.0))
for out in outs:
    print(out.outputs[0].text)  # each result carries the generated completions
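For the “pure HF” side, the baseline is a plain transformers generate() call over the same prompts, with static padding and no paged KV. A sketch of that setup, reusing MODEL_ID and prompts from above (the actual benchmark script may differ in details):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token    # many decoder-only models ship without a pad token
tokenizer.padding_side = "left"                  # left-pad so generation continues from real text
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.batch_decode(out, skip_special_tokens=True))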
Takeaways
- vLLM changes the way we do inference by bringing OS-style abstractions back to LLM serving: PagedAttention treats the KV cache like virtual memory, and continuous batching keeps the GPU busy as requests come and go.
What's next
We've seen how to accelerate LLM inference using vLLM. In the next post, we'll see how to leverage Ray for distributed inference. And then we'll see how to use vLLM for distributed inference.