Benchmarking LLM Inference: The Metrics That Actually Matter
People love to brag about “low latency” or “optimized inference,” but unless you’re clear about what you’re measuring, those numbers are basically mea...
From RAG to Agentic Retrieval
Large language models are great at reasoning, but terrible at remembering everything.
From DDP to ZeRO-2 and ZeRO-3
If you’ve ever trained or fine-tuned a large language model in PyTorch, you’ve probably started with Distributed Data Parallel (DDP).
Distributed Data Parallel (DDP) for Training Models
Training large models on a single GPU can be painfully slow. PyTorch's Distributed Data Parallel (DDP) is the standard way to scale training across mu...
Training LLMs 101
Large Language Models (LLMs) don’t start out as friendly assistants. They begin as vast, raw systems trained on enormous datasets—powerful but unpolis...
Ray for LLM Inference
Ray is a distributed execution engine. Its job is to take a messy cluster of machines and make it feel like one giant computer.
vLLM: LLM Inference That Doesn't Waste Your GPU
vLLM is a library for serving LLMs on GPUs, designed for high-throughput, memory-efficient inference.
Three Practical Ways to Detect Sensitive Data
Agents don’t just think — they move data between systems.
Evals: How to Evaluate Agents
Evaluating agents is messy. Traditional software is deterministic — same input, same output. Agents don’t work that way. They reason in loops, call to...
Why Multi-Agent Systems Matter
Multi-agent systems (MAS) are emerging as a serious pattern for tackling the limits of single agents.
From Zero to Agent: ReAct, Reflection, and Planning
We've covered a lot of topics in the past few posts, but one concept is still missing: agents.
How Agents Remember: On Memory and the Art of Context Engineering
When we talk about memory in LLM agents, we’re not talking about neurons or synapses — we’re talking about tokens, context windows, and clever hacks t...
Structured Outputs in Practice: Instructor vs PydanticAI vs BAML
In part one, I wrote about why structured outputs matter and why just asking an LLM to “return JSON” doesn’t cut it.
Structured Output
When you build with LLMs, you quickly run into a recurring issue:
Engineering Books
A list of technical books that I highly recommend (and have actually read).
