Benchmarking LLM Inference: The Metrics That Actually Matter
Preface
People love to brag about “low latency” or “optimized inference,” but unless you’re clear about what you’re measuring, those numbers are basically meaningless.
LLM inference isn’t like a normal REST API. If you’re streaming, tokens arrive as they’re generated — which means a user can start reading long before the model is done. If you’re non-streaming, you only see the full output at the end, so latency is measured differently.
And in practice, inference is almost never run on raw PyTorch or Hugging Face. Engines like vLLM, TensorRT-LLM, or FasterTransformer are used to maximize throughput and minimize latency by batching requests, optimizing scheduling, and managing memory. That makes understanding benchmarks even more important, because the serving stack matters as much as the model.
Here are the metrics that actually matter for LLM inference, how to measure them, and what good looks like on a 7B model running FP16 on an A100.
Percentiles (P50, P90, P99)
Before diving into specific latency metrics, it’s important to understand how to read them.
All latency numbers should be reported with percentiles (P50, P90, P99), not just averages. Averages hide tail latency — and tail latency is what users actually feel.
❓ What it is:
- Latency distributions instead of averages.
 
🚀 Why it matters:
- Averages lie. Tail latency defines the worst-case UX.
 - Your users will feel P99 way more than P50.
 
🔬 How to measure:
- Run thousands of requests.
 - Report percentiles on per-request metrics (TTFT, TPOT, total latency) and per-token gaps (ITL).
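A minimal sketch of the computation, assuming you've already collected per-request latencies (the numbers here are made up):

```python
import numpy as np

# Hypothetical per-request latencies (ms) collected from a benchmark run.
latencies_ms = [182, 195, 210, 240, 260, 310, 415, 430, 620, 880]

# Percentiles, not averages: P99 is what your unluckiest users experience.
p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
print(f"P50={p50:.0f} ms  P90={p90:.0f} ms  P99={p99:.0f} ms")
```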
 
Time to First Token (TTFT)
❓ What it is: The time from request submission to the first output token. TTFT is dominated by the prefill phase (the model has to process the entire prompt before generation starts). It also includes inference engine overhead (e.g. vLLM batching/scheduling), and — if measured externally — network latency.
🚀 Why it matters:
- In streaming, TTFT defines perceived responsiveness. Users don’t care if the full output takes 3 seconds, as long as the first token shows up fast.
 - In non-streaming, TTFT collapses into total latency, since the client only sees the final output.
 - TTFT scales with prompt length and model size: longer prompts and bigger models = slower TTFT.
 
🔬 How to measure:
- It's important to distinguish what labs usually report from what you should track for your own application:
- Most lab benchmarks: TTFT = pure model prefill + inference engine overhead. No networking.
 - For applications: TTFT = what the user feels. That includes network calls, API routing, serialization, etc.
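Here's a minimal client-side sketch of the "what the user feels" variant, assuming a vLLM (or any OpenAI-compatible) server at http://localhost:8000/v1; the model name and prompt are just placeholders:

```python
import time
from openai import OpenAI

# Assumes a vLLM (or other OpenAI-compatible) server running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
    max_tokens=200,
)

ttft = None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # Time from request submission to the first content-bearing chunk.
        ttft = time.perf_counter() - start
        break

print(f"TTFT (client-side, includes network): {ttft * 1000:.0f} ms")
```

Measured this way, TTFT includes network and routing overhead; run the client on the same host as the server if you want something closer to the lab-style number.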
 
 
📋 Example:
- 7B, FP16 on A100
 - P50: ~150-250 ms
 - P90: ~400 ms
 - P99: ~600+ ms
 
Time per Output Token (TPOT)
❓ What it is:
- The average time between tokens once generation begins.
 
🚀 Why it matters:
- Defines the “speed of thought” for the model.
 - Low TPOT (~10-20 ms/token) feels conversational; high TPOT (100+ ms/token) feels like subtitles lagging behind.
 
🔬 How to measure:
- Take the decode phase of a request (from first to last token). Divide that time by the number of tokens generated after the first (i.e., the number of token gaps).
 
 - Report P50, P90, P99 across all requests (see the sketch below).
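A minimal sketch of the math, using made-up token timestamps; in practice you'd record time.perf_counter() for every streamed chunk, as in the TTFT sketch above:

```python
import numpy as np

# Hypothetical arrival times (seconds) for each token of one request,
# recorded as tokens stream in.
token_times = [0.180, 0.195, 0.212, 0.230, 0.251, 0.270, 0.295]

decode_time = token_times[-1] - token_times[0]   # first token -> last token
num_gaps = len(token_times) - 1                  # gaps, not tokens
tpot_ms = (decode_time / num_gaps) * 1000
print(f"TPOT for this request: {tpot_ms:.1f} ms/token")

# Across many requests, collect one TPOT per request and report percentiles.
per_request_tpot = [14.2, 15.8, 16.1, 19.5, 27.3, 41.0]  # hypothetical
print(np.percentile(per_request_tpot, [50, 90, 99]))
```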
 
📋 Example:
- 7B, FP16 on A100
 - P50: ~10-20 ms/token
 - P90: ~25-30 ms/token
 - P99: ~40+ ms/token
 
Inter-Token Latency (ITL)
❓ What it is:
- The distribution of token gaps, not just the average.
 
🚀 Why it matters:
- TPOT gives you an average speed of thought; ITL shows whether that speed is steady or jittery.
 - A user notices a 200 ms pause in the middle of a response way more than a steady 20 ms.
 
🔬 How to measure:
- For each request, record the gap between every consecutive pair of tokens.
 
- Report P50, P90, P99 on the entire stream.
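A minimal sketch, again with made-up timestamps; the gaps come from the same streamed-token timestamps used for TPOT:

```python
import numpy as np

# Hypothetical per-token arrival times (seconds) for one streamed response.
token_times = [0.180, 0.195, 0.212, 0.230, 0.430, 0.449, 0.465]

# Gap between every consecutive pair of tokens, in ms.
itl_ms = [(b - a) * 1000 for a, b in zip(token_times, token_times[1:])]

# Pool the gaps from all requests, then report percentiles on the whole stream.
p50, p90, p99 = np.percentile(itl_ms, [50, 90, 99])
print(f"ITL  P50={p50:.0f} ms  P90={p90:.0f} ms  P99={p99:.0f} ms")
```

Note the 200 ms jump in the sample data: that single stall is exactly what ITL percentiles surface and TPOT averages hide.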
 
📋 Example:
- 7B, FP16 on A100
 - Steady ~10-20 ms gaps.
 - Occasional spikes at 50-100 ms under load.
 
Total Latency
❓ What it is:
- Time from request submission to last token generated.
 
🚀 Why it matters:
- For non-streaming, this is the only metric that matters.
 - For streaming, less important than TTFT + ITL, but still worth reporting.
 
🔬 How to measure:
- Fix prompt and output length.
 - Measure total wall-clock time.
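A minimal non-streaming sketch against an assumed OpenAI-compatible endpoint (model name and output cap are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",   # placeholder
    messages=[{"role": "user", "content": "Summarize the history of GPUs."}],
    max_tokens=200,    # fix (cap) the output length so runs are comparable
    stream=False,      # non-streaming: the client sees one response at the end
)
total_latency = time.perf_counter() - start
print(f"Total latency: {total_latency:.2f} s "
      f"({resp.usage.completion_tokens} output tokens)")
```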
 
📋 Example:
- 7B, 200-token output, FP16 on A100
 - P50: ~2.5-3 s
 - P90: ~3.5 s
 - P99: ~5+ s
 
This image summarizes the latency metrics:

Throughput vs Goodput
❓ What it is:
- Throughput: tokens/sec (TPS) or requests/sec (RPS).
 - Goodput: throughput of successful requests only.
 
🚀 Why it matters:
- Raw throughput is easy to inflate. If your system serves 100 RPS but 30% of those fail or timeout, your real capacity is much lower.
 - Goodput forces you to report what’s actually usable in production.
 
🔬 How to measure:
- Fix prompt length and output length.
 - Ramp up concurrency until TPOT degrades or requests fail.
 - Track TPS/RPS and % successful requests.
 - Report both throughput and goodput.
 - Most modern inference engines (vLLM, TensorRT-LLM) rely on paged KV-cache memory (PagedAttention) to pack more concurrent requests onto the GPU, which is what lets goodput keep scaling with concurrency.
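A minimal asyncio load-test sketch against an assumed OpenAI-compatible endpoint; the concurrency level, timeout, and success criterion are assumptions you'd tune to your own SLOs:

```python
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
CONCURRENCY, TIMEOUT_S = 64, 30   # assumptions: tune to your SLOs

async def one_request():
    try:
        resp = await asyncio.wait_for(
            client.chat.completions.create(
                model="meta-llama/Llama-2-7b-chat-hf",   # placeholder
                messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
                max_tokens=128,
            ),
            timeout=TIMEOUT_S,
        )
        return resp.usage.completion_tokens   # success: counts toward goodput
    except Exception:
        return None                           # failure/timeout: excluded from goodput

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*[one_request() for _ in range(CONCURRENCY)])
    elapsed = time.perf_counter() - start

    ok = [r for r in results if r is not None]
    print(f"Throughput: {len(results) / elapsed:.1f} req/s, "
          f"{sum(ok) / elapsed:.0f} tok/s")
    print(f"Goodput:    {len(ok) / elapsed:.1f} req/s "
          f"({100 * len(ok) / len(results):.0f}% successful)")

asyncio.run(main())
```

Ramp CONCURRENCY up across runs and watch where TPOT degrades or the success rate drops: that knee is your usable capacity.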
 
📋 Example:
- 7B, FP16 on A100
 - Throughput: ~25-35k tokens/sec with batching.
 - Goodput: ~70-80% of that at high concurrency.
 
Memory Footprint
❓ What it is:
- GPU VRAM and CPU memory usage during inference.
 
🚀 Why it matters:
- Dictates hardware choice.
 - Constrains concurrency.
 - Precision (FP32 vs FP16 vs INT8) dramatically changes footprint.
 
🔬 How to measure:
- GPU: nvidia-smi, torch.cuda.max_memory_allocated().
 - CPU: htop, ps.
 - Record idle (model loaded) vs peak (max concurrency).
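A minimal sketch of the two GPU-side readings (the PyTorch counter only covers memory allocated through PyTorch's caching allocator, so treat it as a lower bound for engines that allocate elsewhere):

```python
import subprocess
import torch

# Process-wide view: what nvidia-smi reports for the GPU.
smi = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print("nvidia-smi:", smi.stdout.strip())

# Allocator view: peak bytes allocated through PyTorch since startup (or last reset).
if torch.cuda.is_available():
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak torch-allocated: {peak_gb:.1f} GB")
```

Take one reading right after the model loads (idle) and another at max concurrency (peak); the difference is mostly KV cache and batching overhead.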
 
📋 Example:
- Need to account for model weights, KV cache, batching overhead, activations, etc.
 - For dense models like a 7B in FP16, VRAM usage is predictable (~14-16 GB).
 - Quantization lowers this (INT8 → ~8-10 GB).
 - MoE models are a different story: only a few experts are active per token (less compute), but all expert weights still have to be resident, so their memory footprint is typically larger than a dense model with the same active parameter count.
 
Cost Efficiency
❓ What it is:
- Cost per 1M tokens generated.
 
🚀 Why it matters:
- This is what infra and ops teams care about.
 - Lets you compare across hardware and hosting providers.
 
🔬 How to measure:
- Combine throughput with hourly GPU pricing.
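A minimal sketch of the arithmetic, with hypothetical inputs matching the example below:

```python
# Hypothetical inputs: $3/hr A100, ~30k tokens/sec sustained.
gpu_cost_per_hour = 3.00       # USD
tokens_per_second = 30_000     # use sustained goodput, not peak throughput

tokens_per_hour = tokens_per_second * 3600                      # ~108M tokens/hr
cost_per_million = gpu_cost_per_hour / (tokens_per_hour / 1e6)
print(f"${cost_per_million:.3f} per 1M tokens")                 # ~$0.028
```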
 
📋 Example:
- 7B model, FP16 on A100
 - A100: $3/hr.
 - Sustained 30k tokens/sec = ~108M tokens/hr.
 - Cost: $3 ÷ 108M tokens ≈ $0.028 per 1M tokens.
 
💡 Cost per 1M tokens is the universal currency of inference: it's what lets you compare running your own infra against calling the OpenAI/Anthropic APIs.
👉 The quantitative way to make that build-vs-buy call is to benchmark your cost efficiency on goodput (not raw throughput) and compare your $/token to the published prices of frontier labs (OpenAI, Anthropic, etc.).
Takeaways
A proper LLM inference benchmark isn’t one number — it’s a profile:
- TTFT → responsiveness (streaming only)
 - TPOT → token speed
 - ITL → jitter between tokens
 - Total Latency → end-to-end time
 - Throughput + Goodput → raw vs usable capacity
 - Memory Footprint → hardware requirements
 - Cost Efficiency → $ per 1M tokens
 - P50/P90/P99 → applied across all latency metrics (TTFT, TPOT, ITL, Total Latency)
 - Bottom Line → benchmarking this way tells you not just how your model runs, but whether you should host it yourself or stick with a lab-hosted frontier model.
 
Run your tests with an optimized inference engine like vLLM, report results with percentiles, and you’ll know not just how your model runs, but how it feels to a user, how it scales under load, and how much it will cost you to serve in production.
What's next
As we wrap up benchmarking, in the next posts we'll go straight back to agents!