Training LLMs 101

Preface

Large Language Models (LLMs) don’t start out as friendly assistants. They begin as vast, raw systems trained on enormous datasets—powerful but unpolished. In AI circles, this pre-trained model is often compared to the Shoggoth, a Lovecraftian creature: immense, alien, and only partially understood. The work of fine-tuning, alignment, and optimization is essentially putting a human-friendly mask on that creature so it can serve useful purposes.

Pre-Training: Building the Foundation

  • Definition: Pre-training exposes the model to huge datasets—web text, books, code—so it can learn patterns of language.
  • Objective: Next-token prediction. The training loop tasks the transformer with predicting each token in a sequence from the tokens that came before it.
  • Scale: Billions of tokens, thousands of GPUs, training that can run for months.
  • Precision: Early models trained in FP32 (32-bit floats); modern runs typically use mixed precision (FP16/BF16).

This stage is done almost exclusively by frontier labs, since few organizations can afford the compute.
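To make the objective concrete, here is a minimal sketch of how a causal language model computes the next-token prediction loss with Hugging Face Transformers. This is not a real pre-training loop; the small GPT-2 model is just a stand-in so the snippet runs anywhere.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small stand-in model; real pre-training uses far larger models and corpora.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model compute the cross-entropy loss of
# predicting each token from the tokens before it (the shift happens internally).
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # average next-token prediction loss for this sequence

Pre-training is essentially this loss, minimized over billions of sequences.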

Post-Training: Specialization and Alignment

Once pre-trained, the model needs refinement to follow instructions and align with human preferences.

Supervised Fine-Tuning (SFT)

  • Train on curated input → output pairs.
  • Teaches the model to follow instructions reliably.
  • Example: "User: What is 2+2?" → "Assistant: 4" (instead of a rambling essay).
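In practice, SFT data is usually rendered with the model's chat template before tokenization, so the model sees conversations in the same format at training and inference time. A minimal sketch, assuming a tokenizer that ships a chat template (the Qwen1.5 chat variant is one example; the question/answer pair is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B-Chat")

# One curated input -> output pair (illustrative).
messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]

# Render the conversation the way the model expects to see it during training.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)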

Reinforcement Learning from Human Feedback (RLHF)

Even with SFT, responses can be clumsy. RLHF provides refinement:

  • Collect human preference data by asking people to rank model outputs.
  • Train a reward model to predict which responses humans prefer.
  • Use reinforcement learning (often PPO) to optimize the model toward outputs that score highly with the reward model.

This step makes the model more helpful, harmless, honest, and structured.
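The reward model at the heart of RLHF is typically trained with a pairwise preference loss: the response humans preferred should score higher than the rejected one. A minimal PyTorch sketch of that loss (the reward values are placeholders, not real model outputs):

import torch
import torch.nn.functional as F

# Scalar rewards assigned to a batch of (chosen, rejected) response pairs.
# Placeholder values; in practice these come from a reward model head.
reward_chosen = torch.tensor([1.2, 0.4, 2.0])
reward_rejected = torch.tensor([0.3, 0.9, 1.1])

# Bradley-Terry style pairwise loss: push chosen rewards above rejected ones.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)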

LoRA and QLoRA: Efficient Fine-Tuning

Training an entire LLM from scratch is impractical for most. Instead, parameter-efficient fine-tuning (PEFT) methods make adaptation accessible.

  • LoRA (Low-Rank Adaptation): Instead of updating all model weights, LoRA freezes them and inserts small, trainable low-rank matrices into key layers. This drastically reduces the number of trainable parameters (and the memory needed for their gradients and optimizer state) while preserving most of the quality of full fine-tuning. This technique was introduced in the paper LoRA: Low-Rank Adaptation of Large Language Models by Hu et al.
  • QLoRA: Extends LoRA by quantizing the model down to 4-bit precision before fine-tuning. This allows very large models (tens of billions of parameters) to be fine-tuned on a single GPU without massive infrastructure. This fine-tuning technique was introduced in the paper QLoRA: Efficient Finetuning of Quantized LLMs by Dettmers et al.

These methods are like short, focused tune-ups: most of the time they are enough to get the job done without retraining the whole model.
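The core idea behind LoRA is simple: keep the original weight matrix W frozen and learn a low-rank update B @ A, so the effective weight becomes W + (alpha / r) * B @ A. Here is a minimal PyTorch sketch of a LoRA-style linear layer. This is the idea, not the PEFT library's implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=32):
        super().__init__()
        # Frozen pre-trained weight.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False
        # Small trainable low-rank factors: only these get updated.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Output of the frozen layer plus the scaled low-rank update.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(1024, 1024)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable params only

For a 1024x1024 layer, the frozen weight has over a million parameters, while the rank-8 adapter adds only about 16 thousand trainable ones.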

Precision, Quantization, and Scaling

You'll see these labels a lot; they simply describe how many bits each parameter in the model takes up. As you might have guessed, fp32 is 4 bytes per parameter, fp16 is 2 bytes, and so on.

  • fp32: high precision, but expensive.
  • fp16/bf16: faster, less memory.
  • int8/int4: shrink weights to small integers; commonly used for inference, including on-device deployment.

Pros: smaller models that fit on more hardware and run inference faster (less data to move, higher compute throughput, and room for larger batch sizes). Cons: a slight accuracy drop, though clever quantization schemes minimize it.

Quantization is usually applied after training, especially for inference. The exception is QLoRA, which uses quantized weights during fine-tuning.
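To see why precision matters, just multiply parameter count by bytes per parameter. A quick back-of-the-envelope sketch for a hypothetical 7-billion-parameter model (weights only, ignoring activations and optimizer state):

# Rough weight-memory footprint for a hypothetical 7B-parameter model.
params = 7e9
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1024**3
    print(f"{precision}: ~{gb:.1f} GB")  # e.g. fp32 ~26 GB, int4 ~3.3 GB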

The Pipeline

Let’s stick with one analogy: a chef's journey.

  • Pre-training: The chef is mostly self-taught, binge-watching cooking shows, YouTube, and reading every cookbook they can find. They learn a huge amount about ingredients and techniques, but their dishes are hit-or-miss.
  • SFT: They finally go to culinary school. Now they get structured lessons and real recipes, so their food starts looking consistent.
  • RLHF: They open a restaurant and serve actual customers. Customers give feedback—"too salty," "amazing pancakes"—and the chef adapts based on real taste tests.
  • LoRA/QLoRA: Instead of going back to school, they take weekend workshops—like a pastry class or pancake masterclass—adding skills without redoing all of school.
  • Quantization: To run the restaurant out of a food truck, the chef downsizes their tools and cookbooks so everything fits.

Code Snippet: Fine-Tuning Qwen with LoRA

Here’s a simple Hugging Face Transformers + PEFT example that fine-tunes Qwen1.5-1.8B with LoRA on top of 4-bit quantized weights (QLoRA-style), sized to fit on a single NVIDIA A10G.

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training


# Load model & tokenizer (4-bit quantization keeps the memory footprint small)
model_name = "Qwen/Qwen1.5-1.8B"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token


# Load dataset (replace "*" with an instruction dataset that has instruction/input/output columns)
dataset = load_dataset("*")

def format_example(example):
    # For causal LM training, labels must line up with input_ids token-for-token,
    # so concatenate prompt and answer into one sequence and reuse its ids as labels.
    text = example["instruction"] + "\n" + example.get("input", "") + "\n" + example["output"]
    tokens = tokenizer(text, truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()  # for brevity, padding is not masked out of the loss
    return tokens

train_dataset = dataset["train"].map(format_example)

# LoRA config: prepare the quantized model for training, then attach the adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)


# Training setup
args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    warmup_steps=100,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    output_dir="./qwen-lora-finetuned",
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
)

trainer.train()
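Once training finishes, you typically save just the LoRA adapter (a few megabytes) and generate with it attached to the base model. A short sketch continuing the script above; the output path matches the one in TrainingArguments, and the prompt is just an illustration.

# Save only the small LoRA adapter weights (the base model is unchanged).
model.save_pretrained("./qwen-lora-finetuned")

# Quick generation check with the fine-tuned adapter attached.
prompt = "Explain LoRA in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))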

Takeaways

  • Pre-training builds raw capability.
  • SFT teaches structured instruction-following.
  • RLHF aligns responses with human values.
  • LoRA/QLoRA enable efficient, affordable adaptation.
  • Quantization makes models practical for deployment on limited hardware.

Next Steps

Okay, this was fun, but I want to actually work on production workloads! So next we'll scale fine-tuning across multiple GPUs with Distributed Data Parallel (DDP).