vLLM: LLM Inference That Doesn’t Waste Your GPU
Preface
vLLM is a library for running LLMs on GPUs, built to keep the hardware busy instead of idle. Before we even start talking about vLLM and why it's so fast, we need to understand a few things about LLM inference. First and foremost: the KV cache.
KV cache, in one picture
What it is: during prefill, a decoder-only transformer computes attention over the whole prompt and stores each layer's K/V tensors for every processed token. During decode, each new token reuses those cached K/Vs instead of recomputing the whole prompt. That turns the per-step cost from "full prompt again" into "look up K/Vs + one step," which is why decode is much cheaper than prefill.
Why memory blows up: K/V memory scales with 2 (K and V) × context length × layers × kv-heads × head_dim × dtype bytes, per concurrent sequence.
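To make that scaling concrete, here is a minimal back-of-the-envelope sketch in Python. The model numbers (28 layers, 4 KV heads, head_dim 128) are the approximate GQA configuration of Qwen2.5-7B-Instruct; treat them as illustrative and verify against the model's config.json.
```python
def kv_cache_bytes(context_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache size for ONE sequence: 2 (K and V) x layers x kv_heads x head_dim x dtype x tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * context_len

# Illustrative numbers, roughly Qwen2.5-7B-Instruct (GQA); check config.json for your checkpoint.
per_token = kv_cache_bytes(1, num_layers=28, num_kv_heads=4, head_dim=128)
per_8k_seq = kv_cache_bytes(8000, num_layers=28, num_kv_heads=4, head_dim=128)

print(f"{per_token / 1024:.0f} KiB per token, {per_8k_seq / 2**30:.2f} GiB per 8k sequence")
# ~56 KiB per token, ~0.43 GiB per full 8k sequence -- and that's per concurrent request.
```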
Why KV cache isn’t enough
Even with KV cache, you still have:
- Padding waste in prefill when lengths vary.
 - Stragglers: one long request stalls a whole static batch.
 - Fragmented KV memory if you try to reuse/evict caches across many concurrent requests.
 
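A quick illustration of the first two points, assuming a naive static batch that pads every prompt to the longest one and holds the whole batch until the slowest generation finishes (the lengths below are made up):
```python
# Hypothetical prompt/output lengths (in tokens) for one static batch.
prompt_lens = [120, 95, 2048, 60, 300, 150, 80, 512]
output_lens = [40, 30, 900, 20, 60, 50, 25, 200]

# Padding waste: every prompt is padded to the longest prompt in the batch.
padded_slots = max(prompt_lens) * len(prompt_lens)
useful_slots = sum(prompt_lens)
print(f"prefill slots wasted on padding: {1 - useful_slots / padded_slots:.0%}")

# Straggler waste: the batch is held until the longest generation finishes.
held_steps = max(output_lens) * len(output_lens)
used_steps = sum(output_lens)
print(f"decode steps spent idle waiting for the straggler: {1 - used_steps / held_steps:.0%}")
```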
You need engine-level tricks for memory & scheduling — that’s where vLLM comes in.
vLLM's tricks
Continuous batching
- What: New requests join mid-flight, finished ones leave; the scheduler treats prompt and output tokens uniformly and mixes prefill with decode in the same step.
 - Why it helps: higher throughput and lower tail latency; stragglers don’t freeze the batch.
 
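To see what "join mid-flight, leave when done" means mechanically, here is a deliberately simplified toy loop: no attention, no KV, just admission and retirement per step. It is a sketch of the idea, not vLLM's actual scheduler.
```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    remaining_tokens: int  # tokens still to generate

def continuous_batching(waiting: deque, max_running: int = 4) -> None:
    """Toy loop: each step, admit waiting requests, 'generate' one token per running request,
    and retire finished ones immediately so their slots are reused next step."""
    running: list[Request] = []
    step = 0
    while waiting or running:
        # New requests join mid-flight, up to capacity.
        while waiting and len(running) < max_running:
            running.append(waiting.popleft())
        # One decode step for every running request.
        for req in running:
            req.remaining_tokens -= 1
        # Finished requests leave; they do not block anyone else.
        done = [r.rid for r in running if r.remaining_tokens == 0]
        running = [r for r in running if r.remaining_tokens > 0]
        step += 1
        if done:
            print(f"step {step}: finished {done}, {len(waiting)} still waiting")

continuous_batching(deque(Request(i, n) for i, n in enumerate([3, 10, 2, 4, 6, 1])))
```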
PagedAttention: virtual memory for KV
- What: Store KV cache in fixed-size pages/blocks and add an indirection table, just like OS virtual memory.
 - Why it helps: near-zero KV waste, cheap compaction/eviction, and fast sharing of prefixes across requests — no giant memcopies.
 - Effect: lets the scheduler keep the GPU full even as requests come/go with different lengths.
 
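A minimal sketch of the indirection idea (a toy page table, not vLLM's internals): each sequence keeps a list of logical blocks mapped to arbitrary physical blocks, so KV memory never has to be contiguous, and two sequences can point at the same physical blocks for a shared prefix.
```python
BLOCK_SIZE = 16  # tokens per KV block

free_blocks = list(range(64))            # pool of physical KV blocks on the GPU
block_tables: dict[str, list[int]] = {}  # sequence id -> physical block ids, in logical order

def append_token(seq_id: str, token_index: int) -> int:
    """Return the physical block holding this token, allocating a new block on each boundary."""
    table = block_tables.setdefault(seq_id, [])
    if token_index % BLOCK_SIZE == 0:      # crossed a block boundary
        table.append(free_blocks.pop())    # grab any free physical block, contiguity not required
    return table[token_index // BLOCK_SIZE]

def share_prefix(src: str, dst: str, prefix_tokens: int) -> None:
    """Prefix reuse at block granularity: dst points at src's full blocks instead of recomputing."""
    full_blocks = prefix_tokens // BLOCK_SIZE
    block_tables[dst] = block_tables[src][:full_blocks].copy()

for t in range(40):                        # request A processes a 40-token prompt
    append_token("A", t)
share_prefix("A", "B", prefix_tokens=40)   # request B reuses A's block-aligned 32-token prefix
print(block_tables["A"], block_tables["B"])
```
Real implementations additionally reference-count shared blocks and copy-on-write when a sequence diverges; the sketch skips that bookkeeping.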
Prefix caching
- What: Detect shared prefixes (e.g., system prompt / RAG boilerplate), reuse KV pages instead of recomputing.
 - Why it helps: slashes time-to-first-token for repetitive structures in chats, agents, and RAG. (Once enabled, vLLM does this automatically at the block level.)
 
⚠️ Security note: --enable-prefix-caching means requests share cached KV blocks. In multi-tenant deployments that sharing can act as a side channel (for example, a cache hit's lower latency can reveal that someone else already sent the same prefix). Keep it off unless you know what you are doing.
Serving 101
vLLM enables online serving for LLMs. Naïve batching is a no-go. Just think about it: a loop that feeds the model one massive prompt at a time will:
- Starve the scheduler (no concurrency).
 - Inflate memory peaks (one huge prefill).
 - Tank tail latency (everyone waits for the whale).
 - and honestly ruin the user experience.
 
❗️ Rule: Always run with concurrency. Let shorter requests finish while a long one keeps going. Continuous batching is what makes this possible, and PagedAttention is what keeps KV memory usable while requests come and go.
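Concurrency matters on the client side too: fire requests in parallel and let the server's continuous batching interleave them. A minimal sketch against the OpenAI-compatible endpoint; the port and model name assume the serving sample further down.
```python
import asyncio
from openai import AsyncOpenAI

# Points at the vLLM OpenAI-compatible server from the serving sample below.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Explain aspect #{i} of LLM serving in one paragraph." for i in range(32)]
    # All 32 requests are in flight at once; short answers return while long ones keep decoding.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(answers), "responses")

asyncio.run(main())
```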
Parameters that Matter
- --max-model-len - the maximum length of a served request (prompt + generation). Set it by your p95/p99 effective length, not the worst case. Too high wastes KV capacity and kills concurrency.
 - --gpu-memory-utilization - the fraction of GPU memory vLLM is allowed to use. 0.9 is a good default; the remaining ~10% is headroom for the CUDA context and, if multi-GPU, NCCL comm buffers.
 - --max-num-seqs - the maximum number of sequences the engine runs concurrently.
 - --enable-prefix-caching - whether to enable prefix caching (see the security note above).
 - --max-num-batched-tokens - the maximum number of tokens batched in one scheduler step. Tuning: raise until near-OOM, then back off ~10–15%.
 - --dtype - the precision to use: float16 or bfloat16. fp32 is infeasible beyond tiny models.
 
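The same knobs exist on the offline Python API. A minimal sketch; the keyword arguments mirror the CLI flags above, but double-check the exact names against your vLLM version.
```python
from vllm import LLM, SamplingParams

# Engine arguments assumed to mirror the CLI flags listed above.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    dtype="float16",
    max_model_len=8000,
    max_num_seqs=160,
    max_num_batched_tokens=8192,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=False,  # see the security note above
)

outputs = llm.generate(
    ["Explain the KV cache in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```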
Serving Sample
📋 Configuration:
- Qwen/Qwen2.5-7B-Instruct, H100, FP16
 - Weights FP16: 7B × 2 bytes ≈ 14 GB.
 - H100 has 80 GB VRAM, leaving ~65 GB for KV + buffers.
 - Plenty of headroom for 8k contexts and high concurrency.
 
💻 Code:
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --max-model-len 8000 \
  --max-num-seqs 160 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9 \
  --port 8000
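A quick sanity check on those numbers, reusing the per-token KV estimate from earlier (again, the 28-layer / 4-KV-head / head_dim-128 figures are the approximate Qwen2.5-7B config; the 4 GiB allowance for activations and buffers is an assumption):
```python
GiB = 2**30
kv_per_token = 2 * 28 * 4 * 128 * 2        # K and V x layers x kv_heads x head_dim x fp16 bytes (~56 KiB)
budget = 0.9 * 80 * GiB - 14 * GiB         # 0.9 of 80 GB VRAM, minus ~14 GB of weights
kv_budget = budget - 4 * GiB               # rough allowance for activations + CUDA/NCCL buffers (assumed)

tokens_in_cache = kv_budget / kv_per_token
print(f"~{tokens_in_cache / 1e6:.1f}M cached tokens, "
      f"~{tokens_in_cache / 8000:.0f} concurrent sequences at the full 8k context")
# Roughly 1M cached tokens, i.e. ~120+ full-length 8k sequences -- which is why
# --max-num-seqs 160 is comfortable in practice (most requests never hit 8k).
```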
Tensor Parallelism
This is a whole topic that deserves a post of its own. For now, just know that you can use it to shard a model across multiple GPUs. You'll need it when the model is too large to fit on a single GPU. Qwen/Qwen2.5-72B-Instruct is a good example: it just doesn't fit on a single GPU, but it does fit on 2× 80 GB H100s in fp16 (weights + KV with reasonable concurrency).
Tensor Parallelism Sample
📋 Configuration:
- Qwen/Qwen2.5-72B-Instruct, H100, FP16
 - Weights FP16: 72B × 2 bytes ≈ 144 GB.
 - On 2× H100 80 GB (TP=2, same host), each GPU holds ~72 GB of weights. That leaves ~8 GB per GPU for KV + buffers. Tight, but possible.
 - With this little headroom, you must run conservative capacity: lower max_model_len and smaller batches.
 
💻 Code:
vllm serve Qwen/Qwen2.5-72B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --max-model-len 8000 \
  --max-num-seqs 48 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9 \
  --port 8000
⚠️ If you want breathing room (longer contexts, more concurrency), use TP=4 on 4× H100s or apply weight-only quantization (INT8/INT4).
Takeaways
- vLLM = continuous batching + PagedAttention → no wasted GPU.
 - KV cache is the limiting factor, not just FLOPs.
 - Don’t inflate --max-model-len “just in case.” Set it to what your users actually need.
 - Avoid multi-node TP; keep sharding inside one box. Scale out by replication.
 
What's next
There are several inference libraries out there — TensorRT-LLM, FasterTransformer, DeepSpeed-Inference — but vLLM has effectively become the de facto standard for self-hosted LLMs in the open-source ecosystem. Its continuous batching and PagedAttention make it hard to ignore, and it’s the serving stack I’m choosing to build on. From here, I’m moving beyond serving into end-to-end work:
- Training my own model,
 - Running a tight RL loop to improve it,
 - And finally serving it with vLLM to close the full cycle.
 
That’s the point where the loop is complete — not just running other people’s models, but training, evaluating, and deploying my own under real constraints.