⚠️ Experimental Project — This is a personal research and learning experiment. The model was trained under non-ideal conditions with suboptimal dataset choices, and the results reflect that. It is not a production-ready language model. Read the Honest Assessment section before using the weights.
- Overview
- Model Architecture
- Training Infrastructure
- Dataset Strategy (and the mistakes)
- Training Timeline
- Training Metrics
- What the Model Actually Learned
- Honest Assessment: What Went Wrong and Why
- Project Structure
- Setup & Usage
- Reproducing Training
- Log Analysis Tools
- Lessons Learned
This project trains a GPT-2-scale (~124M parameter) autoregressive language model entirely from scratch — no pretrained weights, no transfer learning. The model uses the GPT-2 tokenizer (tiktoken, gpt2 encoding, vocab size 50,257) and is trained on Google Colab using mixed-precision training on a single GPU.
The training run covered roughly 1,680+ logged steps (continuing from a checkpoint at step 600, so the actual total is higher), and suffered from a significant midcourse dataset strategy change that impacted learning continuity.
Bottom line: The model learned to produce fluent-looking English text. It does a passable job of story continuation and free-form rambling. It fails at instruction following, factual recall, reasoning, and coding — largely due to dataset choices, not architecture.
A clean GPT-2-style decoder-only transformer, implemented from scratch in PyTorch.
| Component | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-2 style) |
| Parameters | ~124M |
| Vocabulary | 50,257 (GPT-2 / tiktoken gpt2) |
| Embedding dimension (d_model) | 768 |
| Transformer layers | 12 |
| Attention heads | 12 |
| Head dimension (d_k) | 64 |
| Feed-forward dimension (d_ff) | 3,072 (4× d_model) |
| Max sequence length | 1,024 tokens |
| Dropout | 0.1 |
| Attention type | Causal multi-head (manual tril mask) |
| Positional encoding | Learned absolute positional embeddings |
| Normalization | Pre-norm LayerNorm (applied before attention and FFN) |
| Activation | GELU |
| Weight tying | Token embedding and output projection share weights |
| Weight initialization | Normal(0, 0.02) — same as GPT-2 paper |
- Pre-norm layout: LayerNorm is applied before the attention and FFN sublayers (not after, as in the original "Attention Is All You Need" paper). This is the GPT-2 convention and improves training stability.
- Weight tying: The input embedding matrix and the output linear projection share the same weights. This reduces the parameter count by ~38.6M and is standard practice for language models.
- No Flash Attention / SDPA: The attention is computed manually (QKV projection → softmax → weighted sum), without using torch.nn.functional.scaled_dot_product_attention. This is fine for research but slower than optimized kernels.
- No RoPE / ALiBi: Positional information comes from learned absolute embeddings capped at 1,024. The model cannot generalize to sequences longer than this.
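The ~124M total and the ~38.6M saved by weight tying can be checked from the table values alone. The following is a back-of-the-envelope sketch, assuming the usual GPT-2 layout (biases on every linear layer, two LayerNorms per block plus a final one). The notebook's GPTModel may differ in small details, and the function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(vocab=50_257, d_model=768, n_layers=12, d_ff=3_072, max_len=1_024):
    """Count parameters for a GPT-2-style decoder with tied input/output embeddings."""
    tok_emb = vocab * d_model        # token embedding, shared with the output head
    pos_emb = max_len * d_model      # learned absolute positional embeddings
    layer_norm = 2 * d_model         # scale + bias
    # Attention: fused QKV projection plus output projection (weights + biases)
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: up projection plus down projection (weights + biases)
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = 2 * layer_norm + attn + ffn
    total = tok_emb + pos_emb + n_layers * block + layer_norm  # last term: final LayerNorm
    return total, tok_emb


total, tied_savings = gpt2_param_count()
print(f"total parameters:      {total:,}")
print(f"saved by weight tying: {tied_savings:,}")
```

Without tying, the output projection would need its own vocab × d_model matrix, which is exactly the second number printed.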
Training ran on Google Colab Pro with a single NVIDIA T4 GPU (the log consistently shows ~14.8 GB GPU memory usage).
| Setting | Value |
|---|---|
| Platform | Google Colab (T4 GPU) |
| GPU memory used | ~14.8 GB |
| Mixed precision | Yes — torch.cuda.amp (FP16) |
| Optimizer | bitsandbytes AdamW 8-bit (bnb.optim.AdamW8bit) |
| Learning rate | 1e-4 (peak) |
| LR schedule | Linear warmup (1,750 steps) → cosine decay to max step |
| Warmup steps | 1,750 |
| Max planned steps | 25,000 |
| Batch size | 6 sequences per forward pass |
| Gradient accumulation | 21 steps |
| Effective batch size | 6 × 21 × 1,024 = 129,024 tokens per update |
| Gradient clipping | Max norm 1.0 |
| Weight decay | 0.1 |
| Adam betas | (0.9, 0.95) |
| Checkpoint frequency | Every 100 steps |
| Checkpoint retention | Last 3 only |
| Sample generation | Every 100 steps |
| Data storage | Google Drive (tokenized .npy files) |
| Local cache | Colab /content/ SSD (~30 GB chunks) |
| Throughput | ~8,360–8,400 tokens/second (very stable) |
The 8-bit AdamW optimizer (bitsandbytes) was chosen to reduce optimizer memory overhead, allowing a larger effective batch with the 14.8 GB GPU budget.
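The warmup-then-cosine schedule from the table can be written in a few lines. This is a minimal sketch, not the notebook's Cell 7 code: the peak value follows the table above, while decaying all the way to zero (`min_lr=0.0`) and the function name `lr_at` are assumptions.

```python
import math


def lr_at(step, peak_lr=1e-4, warmup_steps=1_750, max_steps=25_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step >= max_steps:
        return min_lr
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 875, 1_750, 13_375, 25_000):
    print(step, f"{lr_at(step):.2e}")
```

Keeping the schedule a pure function of the step number means a resumed run only needs the step counter from the checkpoint to recover the correct learning rate.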
This is where things went wrong, and it's worth being honest about.
The first ~15,000 training steps used only C4 (Common Crawl, cleaned English) — a large web-crawled corpus of general English text. C4 is a reasonable pre-training corpus for a model this size, but training on it alone for that long created a model biased toward:
- Web-style prose (news articles, blog posts, product descriptions)
- Completing sentences in a journalistic/descriptive tone
- No instruction following whatsoever (it's raw completion data)
The C4 phase was not logged in the available log files (training resumed from step 600 in the captured log), but based on the loss values at step 610 (~6.27) and the model's behavior in samples, a significant amount of pre-training had already happened before this log begins.
Partway through training, the dataset was switched to a mixed corpus without the C4 data:
| Dataset | Mix Ratio | Purpose |
|---|---|---|
| Cosmopedia (web_samples_v1) | 50% | Educational/synthetic web content |
| Stanford Alpaca | 30% | Instruction-following pairs |
| Python (CodeAlpaca-20k / The Stack) | 20% | Code reasoning |
The mixing was handled by HybridMixedDatasetLoader with a pre-generated random schedule (seeded at 42) and a DatasetStateManager that persisted file-level bookmarks to Google Drive to survive Colab restarts.
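A pre-generated, seeded schedule of this kind can be sketched with the standard library. This is an illustration of the idea only, not the HybridMixedDatasetLoader implementation: `build_schedule` and the `MIX` dict are hypothetical names, and the real loader may draw per file or per chunk rather than per batch.

```python
import random
from collections import Counter

MIX = {"cosmopedia": 0.5, "alpaca": 0.3, "python": 0.2}


def build_schedule(n_batches, mix=MIX, seed=42):
    """Pre-generate which dataset each batch is drawn from, reproducibly."""
    rng = random.Random(seed)  # private RNG, unaffected by global random state
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n_batches)


schedule = build_schedule(10_000)
print(schedule[:8])
print(Counter(schedule))
```

Because the schedule depends only on the seed, a run that restarts at step N can rebuild it and continue from `schedule[N]` without storing the whole list.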
Why this was a problem:
- C4-only training. A 124M-parameter model trained only on web text was bound to end up here. Web text is extremely heterogeneous: it gave the model English grammar and surface fluency, but not the broader semantic understanding we expect from a language model.
- Abrupt distribution shift. The model had spent ~15,000 steps learning C4's statistical patterns. Suddenly switching to a completely different data distribution (educational text + instructions + code) caused the model to partially "unlearn" C4 patterns without fully learning the new ones. This is visible in the loss curves: the loss at step 610 is 6.27 and only drops below 2.0 at its best point.
- Alpaca formatting mismatch. The Alpaca data was formatted with <|user|> and <|assistant|> delimiters, but the GPT-2 tokenizer doesn't have these as special tokens, so they get split into subword pieces. The model never learned to treat them as control signals, making instruction following structurally impossible.
- Cosmopedia at 50% dominance. Cosmopedia's educational/synthetic content has a different style from both C4 and Alpaca. Rather than helping, it created a three-way tug-of-war that prevented clean convergence on any style.
Steps 1–~15,000 → C4 only (not in this log)
Large web text corpus, pure completion pretraining
Step 10 → First logged step. Loss: 10.0179
Steps >15,000 → Mixed dataset (Cosmopedia 50% / Alpaca 30% / Python 20%)
Loss slowly improving with high variance
Step 1000 → First sample generation checkpoint
The training was designed to run to 25,000 steps. It never got there.
All metrics are extracted from the training log via llm_postprocess.py.
Loss is noisy and high-variance throughout — characteristic of a model in distribution conflict between its prior training (C4) and the new mixed dataset.
The training curves PNG (llm_training_curves.png) shows loss, learning rate schedule, tokens/sec throughput, and GPU memory across all logged steps.
Sample generations were recorded every 100 steps using two fixed prompts. Here's an honest look at progression:
Step 700 (early):
"...also not only be certain old looking to be affected by experts that I's most of my image. I' I usually a lot of my error in that I am a child're making me to go. It'm"
Very incoherent. Token-level fluency exists but grammatical structure is broken. The model is essentially sampling from a confused prior.
Step 1,000 (mid-run):
"...able to sell a company. The tool is a little bit of strong, and the source of the company has worked on social media and can be a successful Google on a smartphone..."
Noticeably better. Complete sentences, real nouns, some semantic coherence. Still drifts into unrelated territory (company → Google → smartphone).
Step 1,600 (late):
"...a Man of the Star Wars Cup – which is a good news. Totally, but he will be able to push for another 2018 Championship, therefore it is little shame to have been a decisive choice for the winning season..."
Grammatically fluent but topically nonsensical — it shifted from AI to sports. The model is generating plausible-looking English without any real concept of the prompt's topic.
Step 700:
"...which was not dropped it really long-preometric hurdles. The pre-3-shaped standard of the printer details of the screen/X_AIDS-commercial Spring..."
Completely broken.
Step 1,000:
"...consistently in a few days. Nitthan, who set up a minimum. Are you feel teachers there. You need to make a business, just "moving". Our goal is also a lot of where we start the"
Structure is better (short sentences, discourse markers), still hallucinating and drifting.
Step 1,600:
"...I believe that it's not only thought to be the same thing about me and I'd do a good job. I've both been working on a three-year weekend away. I must go on the other side so"
First-person narrative rambling. This is actually the model's strongest mode — it can produce something that reads like a personal blog post or inner monologue. Not coherent, but fluent.
| Task | Performance |
|---|---|
| Story rambling / continuation | ⚠️ Passable: fluent, but drifts off topic |
| Free-form prose generation | ⚠️ Passable: fluent, but not coherent |
| Factual statements | ❌ Mostly wrong or confabulated |
| Instruction following | ❌ Does not follow instructions at all |
| Question answering | ❌ Treats questions as prompts to continue generating prose |
| Code generation | ❌ Never reliably produces valid code |
| Logical reasoning | ❌ No structured reasoning observed |
Switching from C4 to a mixed corpus halfway through training without any curriculum strategy (like gradually blending in new data) created a distribution conflict. The model was forced to simultaneously "remember" C4 patterns and "learn" new ones, doing neither well.
What to do instead: Either commit to one corpus from step 0, or use a curriculum where the new data is gradually introduced — e.g., 90% C4 + 10% new at the switch point, then linearly increase the new data's share over thousands of steps.
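The gradual blend described above can be expressed as a function of the step number. This is a sketch under stated assumptions: the 5,000-step ramp, the switch at step 15,000, and the name `curriculum_mix` are illustrative choices, not values from the run.

```python
def curriculum_mix(step, switch_step, ramp_steps, start_new=0.10, end_new=1.00):
    """Return (c4_share, new_share) for a linear ramp beginning at switch_step."""
    if step < switch_step:
        return 1.0, 0.0
    # Fraction of the ramp completed, clamped to 1.0 once the ramp is over
    progress = min(1.0, (step - switch_step) / ramp_steps)
    new_share = start_new + (end_new - start_new) * progress
    return 1.0 - new_share, new_share


for step in (14_999, 15_000, 17_500, 20_000):
    c4, new = curriculum_mix(step, switch_step=15_000, ramp_steps=5_000)
    print(step, f"C4={c4:.2f}", f"new={new:.2f}")
```

The new-data share would then be split 50/30/20 across Cosmopedia, Alpaca and Python, so the model keeps seeing familiar C4 text while the new distributions are phased in.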
Using <|user|> and <|assistant|> as delimiter tokens when the GPT-2 tokenizer has no concept of them means the model sees <, |, user, |, > as separate tokens. There is no signal to learn that this boundary means "switch speaker." The GPT-2 special token for end-of-text (<|endoftext|>) is a real special token (ID 50256) — but the chat delimiters are not.
What to do instead: Use a tokenizer that natively supports chat special tokens (e.g., <|im_start|> and <|im_end|> in the ChatML format used by GPT-4/Qwen, or Llama 3's <|start_header_id|> and <|eot_id|>), or add custom special tokens to the tokenizer and resize the embedding matrix.
~1,000 effective steps (logged portion after the switch) with 50/30/20 mixing gives each domain very little exposure. The Chinchilla scaling laws suggest a 124M parameter model needs roughly 2.5 billion tokens for optimal training. This run covered approximately:
1,688 steps × 129,024 tokens/step ≈ 217.8 million tokens (logged portion only)
Even including the ~15,000 C4 steps:
15,000 steps × ~130,000 tokens/step ≈ 1.95 billion tokens (estimated)
The total is in the right ballpark for Chinchilla-optimal, but the dataset quality and distribution problems dilute the effective learning signal significantly.
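The token arithmetic above is easy to reproduce. The sketch below uses the common "about 20 tokens per parameter" reading of the Chinchilla result as a rule of thumb, and the C4 figure is the same rough estimate as in the text rather than a measured value.

```python
PARAMS = 124_000_000
TOKENS_PER_STEP = 6 * 21 * 1_024         # batch size × accumulation steps × sequence length

chinchilla_target = 20 * PARAMS          # rule of thumb: ~20 tokens per parameter
logged_tokens = 1_688 * TOKENS_PER_STEP  # logged portion only
c4_tokens = 15_000 * 130_000             # estimated C4 phase (not in the log)
total_tokens = logged_tokens + c4_tokens

print(f"tokens per step:   {TOKENS_PER_STEP:,}")
print(f"Chinchilla target: {chinchilla_target:,}")
print(f"logged tokens:     {logged_tokens:,}")
print(f"estimated total:   {total_tokens:,}")
print(f"share of target:   {total_tokens / chinchilla_target:.0%}")
```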
Training loss was tracked but no held-out validation set was used. Without a validation loss, it's impossible to know whether the model is overfitting to the training data or actually generalizing. The loss spike at step 760 might be overfitting beginning, or just a bad batch — there's no way to tell.
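One low-effort way to get a held-out set is to reserve a few tokenized shards before training starts and never feed them to the training loader. This is a minimal sketch: the shard names follow the tokens_XXXX.npy pattern used for the C4 files, but the shard count, the 1% fraction and the helper name `split_shards` are assumptions.

```python
import random


def split_shards(shard_names, val_fraction=0.01, seed=42):
    """Hold out a fixed, reproducible subset of shards for validation."""
    shards = sorted(shard_names)
    n_val = max(1, round(len(shards) * val_fraction))  # always keep at least one shard
    val = set(random.Random(seed).sample(shards, n_val))
    train = [s for s in shards if s not in val]
    return train, sorted(val)


shards = [f"tokens_{i:04d}.npy" for i in range(200)]  # hypothetical shard list
train_shards, val_shards = split_shards(shards)
print(len(train_shards), "train shards,", len(val_shards), "validation shards")
```

Evaluating the loss on the held-out shards at each checkpoint would show whether a spike in training loss is also a spike in validation loss.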
The training was interrupted multiple times by Colab disconnections. While the checkpoint system worked correctly (the DatasetStateManager saves progress to Drive), each restart loses the optimizer's momentum state (since it's tied to the current session), which can temporarily destabilize training.
.
├── llm_model_from_scratch.ipynb # Main training notebook (Colab)
│ ├── Cell 1 — Package installation (bitsandbytes)
│ ├── Cell 2 — Library imports, Drive mount
│ ├── Cell 3 — Config class (all hyperparameters)
│ ├── Cell 4 — SmartResumeTokenDataset (single-file dataset)
│ ├── Cell 5B — HybridMixedDatasetLoader (multi-dataset with state tracking)
│ ├── Cell 6 — Model architecture (MultiHeadAttention, FeedForward, GPTModel)
│ ├── Cell 7 — Training utilities (logger, LR schedule, checkpoint save/load, sampler)
│ ├── Cell 8 — train_complete() — main training loop
│ ├── Cell 9 — Entry point (kicks off training)
│ ├── Cell 9b — C4 state removal utility (for dataset switching)
│ ├── Cell 10 — Interactive text generation (streaming token-by-token)
│ ├── Cell 11 — Training visualization (matplotlib curves)
│ ├── Cell 12 — C4 downloader/tokenizer (one-time, ~2–6 hours)
│ ├── Cell 13 — Cosmopedia + Alpaca + Python downloader (one-time, ~3–5 hours)
│ └── Cell 14 — Python-only downloader (CodeAlpaca / The Stack)
│
└── llm_postprocess.py                # Log parser, JSON exporter, plotter
    ├── Parses training_*.log files
    ├── Outputs: llm_training_metrics.json, llm_training_samples.json
    └── Outputs: llm_training_curves.png (4-panel plot)

    cd backend
    pip install -r requirements.txt

Python 3.10+ recommended. A GPU with at least 16 GB of VRAM is needed to run the same config. The notebook is designed for Google Colab with Google Drive mounted.

    import torch
    import tiktoken
    from pathlib import Path

    # Define the same Config class as in the notebook
    # (copy Cell 3 from the notebook), along with GPTModel (Cell 6)
    model = GPTModel(Config).cuda()
    checkpoint = torch.load("checkpoint_step_1688.pt", map_location="cuda", weights_only=False)

    # Handle checkpoints from torch.compile() if needed
    state_dict = checkpoint["model_state_dict"]
    if any(k.startswith("_orig_mod.") for k in state_dict):
        state_dict = {k.replace("_orig_mod.", ""): v for k, v in state_dict.items()}
    model.load_state_dict(state_dict)
    model.eval()
    print(f"Loaded from step {checkpoint['step']}, loss {checkpoint['loss']:.4f}")

    tokenizer = tiktoken.get_encoding("gpt2")

    def generate(model, prompt, max_tokens=150, temperature=0.8):
        tokens = tokenizer.encode(prompt)
        tokens = torch.tensor(tokens, dtype=torch.long, device="cuda").unsqueeze(0)
        with torch.no_grad():
            for _ in range(max_tokens):
                # Keep only the last 1,024 tokens (the model's context window)
                if tokens.size(1) >= 1024:
                    tokens = tokens[:, -1024:]
                logits = model(tokens)
                logits = logits[:, -1, :] / temperature
                probs = torch.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
                tokens = torch.cat([tokens, next_token], dim=1)
                # Stop at the end-of-text token
                if next_token.item() == tokenizer.eot_token:
                    break
        # tiktoken expects a plain list of ints
        return tokenizer.decode(tokens[0].tolist())

    # Best use case: story/prose continuation
    print(generate(model, "Once upon a time, in a forest far from the city,"))
    print(generate(model, "The old scientist walked into the lab and noticed"))

Recommended prompts: Open-ended narrative starters work best. Avoid questions, instructions, or anything requiring factual accuracy.
Temperature guide:
- 0.7: More focused, less creative. Tends to loop or repeat phrases.
- 0.8: Sweet spot for this model.
- 1.0: More varied but more incoherent.
- 1.2+: Mostly nonsense at this model's capability level.
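The effect of the temperature setting is easiest to see on a toy distribution. The sketch below applies the same "divide the logits by the temperature, then softmax" step used in generate(), but in pure Python on made-up logits, so the numbers are illustrative and not model output.

```python
import math


def softmax_with_temperature(logits, temperature=0.8):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy values, not taken from the model
for t in (0.7, 0.8, 1.0, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, which is why they tend toward repetition, while higher temperatures flatten the distribution and let unlikely tokens through.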
Run Cell 12 (C4 preprocessor) for the initial C4 corpus:
- Target: ~50–100 GB of C4 English text
- Output: tokens_0000.npy, tokens_0001.npy, ... (~10M tokens each, uint16)
- Expected time: 2–6 hours on Colab
Run Cell 13 for the mixed corpus:
- Cosmopedia: ~20 GB (cosmopedia_tokens_*.npy)
- Alpaca: ~100 MB (alpaca_tokens_*.npy), Stanford Alpaca 52K
- Python: ~5 GB (python_tokens_*.npy), CodeAlpaca or The Stack (Python)
All token files are stored on Google Drive at:
/content/drive/MyDrive/llm_training/data/
├── tokens_0000.npy ... tokens_XXXX.npy (C4)
├── cosmopedia/cosmopedia_tokens_*.npy
├── alpaca/alpaca_tokens_*.npy
└── python/python_tokens_*.npy
Edit Config in Cell 3 to change hyperparameters. Key fields:

    class Config:
        DRIVE_DATA_DIR = "/content/drive/MyDrive/llm_training/data"
        DRIVE_CHECKPOINT_DIR = "/content/drive/MyDrive/llm_training/checkpoints"
        D_MODEL = 768               # Model width
        N_LAYERS = 12               # Depth
        N_HEADS = 12                # Attention heads
        MAX_SEQ_LEN = 1024          # Context window
        BATCH_SIZE = 6              # Per-GPU batch
        GRADIENT_ACCUM_STEPS = 21   # Effective batch = 6×21×1024 = 129,024 tokens
        LEARNING_RATE = 5e-5
        WARMUP_STEPS = 1750
        MAX_STEPS = 25000

Run cells in order: 2 → 3 → 5B → 6 → 7 → 8 → 9.
Cell 9 is the entry point:

    mixed_dataloader = setup_mixed_dataloader(current_step=0)
    trained_model = train_complete()

The training loop automatically:
- Detects and resumes from the latest checkpoint
- Persists dataset file position to Drive (survives Colab restarts)
- Generates text samples every 100 steps
- Saves checkpoints every 100 steps, retaining only the last 3
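The "keep only the last 3" retention rule can be sketched as follows. This is not the notebook's Cell 7 code: it assumes checkpoints are named like checkpoint_step_1688.pt, and `prune_checkpoints` is a hypothetical helper demonstrated on empty files in a temporary directory.

```python
import re
import tempfile
from pathlib import Path

STEP_PATTERN = re.compile(r"checkpoint_step_(\d+)\.pt$")


def prune_checkpoints(checkpoint_dir, keep=3):
    """Delete all but the `keep` highest-step checkpoints; return the survivors."""
    found = []
    for path in Path(checkpoint_dir).glob("checkpoint_step_*.pt"):
        match = STEP_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort()  # numeric order, so step 1000 sorts after step 900
    to_delete = found[:-keep] if keep > 0 else found
    for _, path in to_delete:
        path.unlink()
    return [path.name for _, path in found[len(to_delete):]]


with tempfile.TemporaryDirectory() as tmp:
    for step in (800, 900, 1000, 1100, 1200):
        (Path(tmp) / f"checkpoint_step_{step}.pt").touch()
    kept = prune_checkpoints(tmp)
    print(kept)
```

Sorting on the parsed integer rather than the file name matters here, since a plain string sort would place step 1000 before step 900.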
To remove C4 from the dataloader state without resetting everything, run Cell 9b:

    remove_c4_from_state()

This removes the "c4" key from dataset_state.json on Drive, so the next restart will not try to resume the C4 file position. Then update the probs dict in setup_mixed_dataloader() to set c4: 0.0.
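Conceptually, the state edit is a small JSON rewrite. The sketch below shows one way such a helper could work. It is not the Cell 9b implementation, the bookmark fields (`file_index`, `position`) are invented for the demo, and it runs against a throwaway file rather than the real dataset_state.json.

```python
import json
import tempfile
from pathlib import Path


def remove_dataset_from_state(state_path, name="c4"):
    """Drop one dataset's bookmark from the state file, leaving the others untouched."""
    path = Path(state_path)
    state = json.loads(path.read_text())
    removed = state.pop(name, None) is not None
    path.write_text(json.dumps(state, indent=2))
    return removed


# Demo on a temporary file; the bookmark layout here is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "dataset_state.json"
    state_file.write_text(json.dumps({
        "c4": {"file_index": 12, "position": 4096},
        "cosmopedia": {"file_index": 3, "position": 0},
    }))
    removed = remove_dataset_from_state(state_file)
    remaining = sorted(json.loads(state_file.read_text()))
    print(removed, remaining)
```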
Parses one or more training log files and produces structured outputs.

    # Basic usage (current directory)
    python llm_postprocess.py

    # Custom paths
    python llm_postprocess.py --log_dir ./logs --out_dir ./results

    # Override glob pattern
    python llm_postprocess.py --pattern "training_*.log"

Outputs:
| File | Contents |
|---|---|
| llm_training_metrics.json | One JSON object per step: timestamp, step, loss, lr, tokens_per_s, gpu_gb, elapsed_s, checkpoint_saved, checkpoint_removed, interrupted |
| llm_training_samples.json | One block per sample event: step, timestamp, list of {prompt, response, timestamp} pairs |
| llm_training_curves.png | 4-panel matplotlib figure: loss (raw + smoothed + checkpoint markers), LR schedule, throughput, GPU memory |
Multi-file support: If you have multiple log files from different sessions (common with Colab), place them all in the same directory. The script merges them and resolves step-level duplicates by keeping the entry with the most recent timestamp.
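The duplicate-resolution rule can be illustrated with a short sketch. This is not the code in llm_postprocess.py: the records below are made-up toy data, `merge_step_records` is a hypothetical name, and it assumes timestamps are ISO-8601 strings, which compare correctly as plain text.

```python
def merge_step_records(*sessions):
    """Merge per-session records, keeping the newest entry for each step."""
    latest = {}
    for records in sessions:
        for record in records:
            current = latest.get(record["step"])
            if current is None or record["timestamp"] > current["timestamp"]:
                latest[record["step"]] = record
    return [latest[step] for step in sorted(latest)]


# Toy records for illustration only; none of these values come from the real logs.
session_a = [
    {"step": 700, "timestamp": "2024-01-01T10:00:00", "loss": 5.9},
    {"step": 710, "timestamp": "2024-01-01T10:05:00", "loss": 5.8},
]
session_b = [
    {"step": 710, "timestamp": "2024-01-01T12:00:00", "loss": 5.7},
    {"step": 720, "timestamp": "2024-01-01T12:05:00", "loss": 5.6},
]
merged = merge_step_records(session_a, session_b)
print([(r["step"], r["loss"]) for r in merged])
```

Keeping the most recent entry reflects the fact that a resumed session re-runs the steps after the last checkpoint, so the later record is the one the surviving weights actually came from.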
Distilled from this experiment for anyone attempting something similar:
- Plan your dataset before you start. Switching datasets mid-training without a curriculum is the single most damaging thing you can do. Decide upfront: pretrain on clean web text or go straight to the curated mix, not both in sequence without a transition plan.
- Chinchilla scaling matters. For a 124M parameter model, you need ~2.5B tokens of high-quality data for optimal compute efficiency. Splitting those tokens poorly across incompatible domains doesn't count as 2.5B useful tokens.
- Tokenizer and data format must be aligned. If your instruction data uses special tokens, your tokenizer needs to natively support them. Adding ad-hoc <|user|> strings to a base GPT-2 tokenizer does nothing.
- Validation loss is non-optional. Without a held-out split, you're flying blind. Even a small 1% split would have shown whether the loss improvements were real generalization or just fitting to training data.
- 8-bit AdamW works well. bnb.optim.AdamW8bit saved significant memory with no visible degradation in training stability. The throughput stayed rock-steady at ~8,374 tokens/sec throughout.
- The checkpoint/resume system was solid. DatasetStateManager (Drive-backed JSON with file-level bookmarks) survived multiple Colab disconnections without losing significant progress. This was the most reliable part of the whole project.
- Story rambling ≠ understanding. The model learns to mimic the style of text far before it learns any semantic content. Fluent-looking output at step 1,600 is the model pattern-matching surface statistics, not reasoning.
- Architecture based on the GPT-2 paper: Language Models are Unsupervised Multitask Learners (Radford et al., 2019)
- Tokenizer: tiktoken by OpenAI (GPT-2 BPE encoding)
- Datasets: AllenAI C4, HuggingFaceTB Cosmopedia, Stanford Alpaca, BigCode The Stack / CodeAlpaca
- Optimizer: bitsandbytes 8-bit AdamW by Tim Dettmers
This README was written after the training run completed. All metrics, sample outputs, and observations are based on actual training logs and generated samples from the run.