This repository contains a high-performance C implementation of the GPT-2 transformer model. The implementation focuses on efficiency through multi-threading and AVX SIMD instructions while preserving the core architecture of the original model.
- Complete implementation of the GPT-2 transformer architecture
- Multi-threaded attention mechanism for improved performance
- AVX instruction set utilization for accelerated matrix operations
- Memory-efficient design for handling large models
- Support for the 124M-parameter GPT-2 model (12 layers, 12 heads, embedding dimension 768)
- Embedding size: 768
- Number of transformer blocks: 12
- Number of attention heads: 12
- Head dimension: 64 (768/12)
- Vocabulary size: 50,257
- Maximum position embeddings: 1,024
- Maximum threads for parallelization: 8 (configurable)
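The configuration above maps naturally onto compile-time constants. A sketch (these names are illustrative, not necessarily the ones used in gpt2.c):

```c
/* Model hyperparameters from the list above.
   Names are illustrative; gpt2.c may define different ones. */
#define N_LAYER     12                  /* transformer blocks */
#define N_HEAD      12                  /* attention heads */
#define N_EMBD      768                 /* embedding size */
#define HEAD_DIM    (N_EMBD / N_HEAD)   /* 64 */
#define VOCAB_SIZE  50257
#define MAX_POS     1024                /* maximum position embeddings */
#define MAX_THREADS 8                   /* configurable */
```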
- GCC or Clang compiler with C11 support
- POSIX threads library (pthread)
- AVX instruction set support (Intel processors since Sandy Bridge or AMD processors since Bulldozer)
- Math library (-lm)
To compile the program:

```sh
gcc -O3 -mavx -pthread -o gpt2 gpt2.c -lm
```

For maximum performance:

```sh
gcc -O3 -march=native -mavx2 -ffast-math -pthread -o gpt2 gpt2.c -lm
```

Then run the binary:

```sh
./gpt2
```

The default implementation initializes random weights and processes a sample input. For practical use, you'll need to:
- Load pre-trained weights from a file
- Implement tokenization for input text
- Add temperature-based sampling for text generation
- Linear Layer Implementation: Matrix multiplication with bias addition
- Attention Mechanism:
- Multi-headed scaled dot-product attention
- Parallelized implementation using pthreads
- AVX-accelerated dot products
- Layer Normalization: For stabilizing network activations
- Activation Functions: GELU (Gaussian Error Linear Unit)
- Memory Management: Comprehensive cleanup functions to prevent leaks
To load actual GPT-2 weights, implement a function to read weights from a file:

```c
GPT2Weights load_weights_from_file(const char* filename) {
    GPT2Weights weights;
    FILE* file = fopen(filename, "rb");
    if (!file) {
        fprintf(stderr, "failed to open %s\n", filename);
        exit(1);
    }
    // Read weights from file
    // ...
    fclose(file);
    return weights;
}
```

Implement sampling from logits for text generation:
```c
// Requires <math.h> and <stdlib.h>; VOCAB_SIZE is the vocabulary size (50,257).
// Note: scales the logits buffer in place.
int sample_token(float* logits, float temperature) {
    // Softmax with temperature, numerically stabilized by the max logit
    float max = logits[0], sum = 0.0f;
    for (int i = 1; i < VOCAB_SIZE; i++)
        if (logits[i] > max) max = logits[i];
    for (int i = 0; i < VOCAB_SIZE; i++)
        sum += (logits[i] = expf((logits[i] - max) / temperature));
    // Sample from the resulting distribution
    float r = ((float)rand() / (float)RAND_MAX) * sum;
    for (int i = 0; i < VOCAB_SIZE; i++)
        if ((r -= logits[i]) <= 0.0f) return i;
    return VOCAB_SIZE - 1;
}
```

Add functions to convert between text and tokens:
```c
int* tokenize(const char* text, int* length) {
    // Implement GPT-2's byte-pair encoding (BPE) tokenization
    // ...
    return tokens;
}

char* detokenize(int* tokens, int length) {
    // Convert tokens back to text
    // ...
    return text;
}
```

The code includes several optimizations:
- Multi-threading: The attention mechanism is parallelized across multiple threads
- SIMD Instructions: AVX instructions are used for fast vector operations
- Memory Efficiency: Careful memory management to minimize allocations
For the 124M parameter model:
- Word token embeddings: ~154MB (50,257 * 768 * 4 bytes)
- Position embeddings: ~3MB (1,024 * 768 * 4 bytes)
- Transformer blocks: ~340MB (12 blocks * ~7.1M parameters per block * 4 bytes)
- Total: ~497MB for model parameters (~124M parameters * 4 bytes, fp32)
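As a sanity check, the per-block figure can be derived from the standard GPT-2 block shapes:

```c
/* Parameter count for one transformer block of the 124M model
   (standard GPT-2 shapes: 768-dim embeddings, 3072-dim MLP). */
long gpt2_block_params(void) {
    long attn = 768L * 3 * 768 + 3 * 768;  /* QKV projection + bias */
    attn     += 768L * 768 + 768;          /* output projection + bias */
    long mlp  = 768L * 3072 + 3072         /* up-projection + bias */
              + 3072L * 768 + 768;         /* down-projection + bias */
    long ln   = 2 * 2 * 768;               /* two LayerNorms (gamma, beta) */
    return attn + mlp + ln;                /* ~7.1M parameters */
}
```

That works out to roughly 28MB per block in fp32, or about 340MB for all 12 blocks; adding the embeddings gives a total near 500MB.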