An implementation of a GPT-style large language model built from scratch in PyTorch. This repository contains code covering the fundamental concepts of building, training, and fine-tuning language models, following Sebastian Raschka's book "Build a Large Language Model (From Scratch)".
Classification Task (Spam Detection)
Instruction Tuning (Alpaca Dataset)
Data Preprocessing & Tokenization
- SimpleTokenizerV1 & V2: Custom tokenizers that convert text to numerical tokens
- GPTDatasetV1: PyTorch Dataset class for sliding window data loading
- create_dataloader_v1: Creates efficient data loaders with configurable batch size and stride
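The sliding-window loading above can be sketched in plain Python; `sliding_window_pairs` is an illustrative name, not the repository's API, and the real GPTDatasetV1 wraps the same idea in a PyTorch Dataset:

```python
# Hypothetical sketch of sliding-window chunking: each sample is an
# (input, target) pair where the target is the input shifted one token right.
def sliding_window_pairs(token_ids, max_length, stride):
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[start : start + max_length]
        target_chunk = token_ids[start + 1 : start + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

tokens = list(range(10))  # stand-in for tokenizer output
pairs = sliding_window_pairs(tokens, max_length=4, stride=2)
# first pair: inputs [0, 1, 2, 3] predict targets [1, 2, 3, 4]
```

A larger stride reduces overlap between windows; a stride equal to `max_length` yields non-overlapping chunks.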
Attention Mechanisms
- Self-Attention variants (v1 & v2): Basic self-attention implementations
- CausalAttention: Implements causal masking for autoregressive generation
- MultiHeadAttention: Efficient multi-head attention with parallel processing
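The core of causal attention is masking out future positions before the softmax. A minimal single-head sketch (weight names are illustrative, not the repository's class interface):

```python
import torch

# Single-head causal self-attention sketch: scores for future positions are
# set to -inf so the softmax assigns them zero weight.
def causal_attention(x, W_q, W_k, W_v):
    T, d = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / d ** 0.5  # scaled dot-product
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # block future tokens
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

torch.manual_seed(0)
x = torch.randn(6, 8)  # T=6 tokens, embedding dim d=8
W = [torch.randn(8, 8) for _ in range(3)]
out = causal_attention(x, *W)  # shape (6, 8)
```

The multi-head version runs several such heads in parallel by reshaping the projections, rather than looping over separate head modules.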
GPT Architecture
- GPTModel: Full implementation of GPT with configurable layers
- TransformerBlock: Combines attention and feed-forward layers
- LayerNorm: Layer normalization for training stability
- GELU: Gaussian Error Linear Unit activation function
- FeedForward: Position-wise feed-forward network
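The way these pieces fit together can be sketched as a pre-norm transformer block. This uses PyTorch's built-in `nn.MultiheadAttention` and `nn.LayerNorm` for brevity, so it is a simplified stand-in for the repository's TransformerBlock, not its exact code:

```python
import torch
import torch.nn as nn

# Illustrative pre-norm transformer block: LayerNorm -> attention -> residual,
# then LayerNorm -> feed-forward (with GELU) -> residual.
class MiniBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(  # position-wise feed-forward network
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out              # residual connection
        x = x + self.ff(self.norm2(x))  # residual connection
        return x

block = MiniBlock(d_model=32, n_heads=4)
y = block(torch.randn(2, 10, 32))  # shape preserved: (2, 10, 32)
```

Stacking such blocks, plus token and positional embeddings and a final output head, gives the full GPTModel.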
Training Pipeline
- train_model_simple: Complete training loop with evaluation
- calc_loss_batch/loader: Loss calculation utilities
- generate: Text generation with temperature and top-k sampling
- plot_losses: Visualization of training progress
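Temperature and top-k sampling in the generate step can be sketched on a single logits vector; `sample_next_token` is an illustrative helper, not the repository's function signature:

```python
import torch

# Sketch of top-k filtering followed by temperature-scaled sampling: logits
# below the k-th largest are masked to -inf, then the softmax is sharpened or
# flattened by dividing by the temperature.
def sample_next_token(logits, temperature=1.0, top_k=None):
    if top_k is not None:
        top_vals, _ = torch.topk(logits, top_k)
        logits = logits.masked_fill(logits < top_vals[-1], float("-inf"))
    if temperature > 0:
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()
    return torch.argmax(logits).item()  # greedy decoding

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])
token = sample_next_token(logits, temperature=0.8, top_k=2)
# only the two highest-logit tokens (ids 0 and 1) can be sampled
```

Lower temperatures make the distribution more peaked (more deterministic); higher temperatures increase diversity.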
Classification Fine-tuning
Fine-tune GPT for text classification tasks:
- SMS spam detection implementation using the SMS Spam Collection dataset
- Dataset balancing and preprocessing for classification
- Converts generative model to classification model
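Converting the generative model to a classifier amounts to swapping the vocabulary-sized output head for a small class head and reading the last token's logits. A toy sketch, where `TinyLM` is a stand-in for the pretrained GPTModel:

```python
import torch
import torch.nn as nn

vocab_size, d_model, num_classes = 100, 16, 2

# Toy language model standing in for the pretrained GPT.
class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.out_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        return self.out_head(self.emb(ids))

model = TinyLM()
model.out_head = nn.Linear(d_model, num_classes)  # swap in a 2-class head
logits = model(torch.randint(0, vocab_size, (4, 12)))
cls_logits = logits[:, -1, :]  # last-token logits used for classification
```

Only the last token attends to the full sequence under causal masking, which is why its logits are used for the spam/not-spam prediction.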
Instruction Fine-tuning
Fine-tune GPT to follow instructions:
- Implements Alpaca-style prompt formatting
- Loads instruction-following datasets (instruction-input-output format)
- Fine-tunes model for instruction-following capabilities
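Alpaca-style formatting wraps each instruction-input-output record in a fixed prompt template. A sketch of the formatting step (`format_prompt` is an illustrative name; field names follow the common instruction-data JSON layout):

```python
# Sketch of Alpaca-style prompt formatting: the "### Input:" section is
# included only when the record has a non-empty input field.
def format_prompt(entry):
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    if entry.get("input"):  # optional input field
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt + "\n\n### Response:\n"

entry = {"instruction": "Translate to French.", "input": "Good morning."}
text = format_prompt(entry)
```

During training, the record's output field is appended after "### Response:"; at inference time, the model generates the response itself.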
src/data.py
src/attention.py
src/model.py
src/train.py
src/finetune_classification.py
src/finetune_instructions.py
data/instruction-data.json
- Install dependencies:
pip install -r requirements.txt
- Run the training scripts:
# Pretrain the model
python src/train.py
# Fine-tune for classification
python src/finetune_classification.py
# Fine-tune for instructions
python src/finetune_instructions.py
