Accept this assignment: GitHub Classroom Link
Due: January 26, 2026 at 11:59 PM EST
Click the link above to create your private repository for this assignment. Complete your work in Google Colab, then push your notebook to the repository before the deadline.
Spam detection is one of the most successful and widely deployed applications of machine learning. In this assignment, you will build a comprehensive spam classification system that not only achieves high performance but also provides deep insight into what makes spam detection work and what makes it fail.
Unlike a simple "build a classifier" task, this assignment requires you to:
- Implement multiple classification approaches (traditional ML and neural methods)
- Conduct rigorous comparative analysis across methods
- Perform extensive error analysis to understand failure modes
- Test adversarial robustness by trying to fool your own classifier
- Consider real-world deployment constraints (speed, memory, class imbalance)
This assignment mirrors real-world ML engineering: you'll make architecture decisions, justify trade-offs, and demonstrate that you understand not just how to build models, but why they work.
Timeline: This assignment is designed to be completed in 1 week (7 days) while remaining comprehensive in scope. By using GenAI tools to accelerate implementation, you can focus your time on the deeper analytical work—error analysis, robustness testing, and deriving insights—that separates excellent work from good work.
By completing this assignment, you will:
- Understand the full pipeline of text classification from feature engineering to deployment
- Compare traditional ML vs. neural approaches on the same task
- Learn to properly evaluate classifiers using multiple metrics
- Develop skills in systematic error analysis and debugging ML models
- Think adversarially about model robustness
- Make informed decisions about model selection based on performance/efficiency trade-offs
A sample dataset is provided (training.zip), consisting of two folders:
- spam/: Contains spam emails in plain text format.
- ham/: Contains ham emails in plain text format.
You can use this dataset to develop, train, and test your classifiers. When evaluating your solution, a new dataset—structured in the same way (with "spam" and "ham" folders)—will be used. Your models should generalize well to this unseen data.
Important Notes:
- You may split the provided data however you like (train/val/test)
- You may augment the dataset with external spam datasets (must cite sources)
- Be aware of potential class imbalance and handle it appropriately
- Consider using stratified splitting to maintain class distributions
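The stratified split suggested above could be set up along these lines. This is a minimal sketch, not a required layout: the helper names `load_emails` and `stratified_split`, the 70/15/15 ratio, and the toy stand-in data are all illustrative assumptions.

```python
from pathlib import Path

from sklearn.model_selection import train_test_split


def load_emails(root):
    """Read every file under root/spam and root/ham; spam is 1, ham is 0."""
    texts, labels = [], []
    for name, label in (("spam", 1), ("ham", 0)):
        for path in sorted(Path(root, name).iterdir()):
            if path.is_file():
                texts.append(path.read_text(encoding="utf-8", errors="ignore"))
                labels.append(label)
    return texts, labels


def stratified_split(texts, labels, seed=42):
    """Split into roughly 70% train, 15% validation, 15% test, keeping class ratios."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.30, stratify=labels, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)


# Toy stand-in for the real corpus: 20 spam and 80 ham messages.
toy_texts = [f"message {i}" for i in range(100)]
toy_labels = [1] * 20 + [0] * 80
train, val, test = stratified_split(toy_texts, toy_labels)
```

With the real data, `load_emails` would be pointed at the folder extracted from training.zip, and the same three splits would then be reused for every model.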
You must implement and train at least three different classifiers:
Implement two of the following with proper feature engineering:
- Naive Bayes with TF-IDF features
- Support Vector Machine (SVM) with TF-IDF features
- Logistic Regression with engineered features
- Random Forest with custom feature extraction
For traditional methods, you must:
- Document your feature engineering choices (TF-IDF parameters, n-grams, custom features)
- Explain preprocessing decisions (lowercasing, stemming, stop words, etc.)
- Justify hyperparameter selections
Example features to consider:
- TF-IDF vectors (unigrams, bigrams, trigrams)
- Custom features: email length, number of URLs, exclamation marks, ALL CAPS ratio
- Domain-specific features: sender patterns, header information
- Character-level features for obfuscated spam
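As one possible way to combine TF-IDF vectors with the custom features listed above, the sketch below joins both in a scikit-learn `FeatureUnion` and feeds them to logistic regression. The function name `handcrafted_features`, the particular four features, the TF-IDF settings, and the four-email corpus are assumptions chosen for illustration, not recommended values.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

URL_PATTERN = re.compile(r"https?://\S+")


def handcrafted_features(texts):
    """One row per email: [log length, URL count, '!' count, ALL CAPS ratio]."""
    rows = []
    for text in texts:
        letters = [c for c in text if c.isalpha()]
        caps_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
        rows.append([
            np.log1p(len(text)),
            len(URL_PATTERN.findall(text)),
            text.count("!"),
            caps_ratio,
        ])
    return np.array(rows)


features = FeatureUnion([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), sublinear_tf=True)),
    ("custom", FunctionTransformer(handcrafted_features)),
])
model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tiny made-up corpus, only to show that the pipeline fits and predicts.
texts = [
    "WIN FREE MONEY NOW!!! Click http://example.com/prize",
    "Cheap pills!!! Visit http://example.com/buy today",
    "Meeting moved to 3pm, see agenda attached",
    "Can you review my pull request before lunch?",
]
labels = [1, 1, 0, 0]
model.fit(texts, labels)
spam_probability = model.predict_proba(["FREE prize!!! http://example.com"])[:, 1]
```

Note that lowercasing inside the TF-IDF branch discards capitalization, which is why the ALL CAPS ratio is computed separately on the raw text.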
Implement at least one neural approach:
- Fine-tuned BERT (bert-base-uncased)
- Fine-tuned DistilBERT (recommended for faster training)
- RoBERTa or other transformer variant
- Custom neural architecture (LSTM/GRU with embeddings)
For neural methods, you must:
- Use appropriate pre-trained models and fine-tune on spam data
- Document training procedures (learning rate, epochs, batch size)
- Handle sequence length appropriately (truncation/padding)
- Monitor for overfitting (training vs. validation curves)
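Monitoring for overfitting usually comes down to comparing training and validation loss after each epoch. The sketch below is framework-independent and uses made-up loss values; the `EarlyStopping` class and its `patience` parameter are hypothetical helpers, not part of any library named in this assignment.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up curves: training loss keeps falling while validation loss turns upward.
train_losses = [0.60, 0.35, 0.20, 0.10, 0.05]
val_losses = [0.55, 0.40, 0.38, 0.42, 0.47]

stopper = EarlyStopping(patience=2)
gaps = []
stopped_at = None
for epoch, (train_loss, val_loss) in enumerate(zip(train_losses, val_losses)):
    gaps.append(val_loss - train_loss)  # a widening gap suggests overfitting
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break
```

In a real fine-tuning run the two loss lists would come from the training loop, and the checkpoint from `stopper.best_epoch` would be the one to keep.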
Create an ensemble that combines predictions from your best models:
- Voting ensemble (majority vote or weighted average)
- Stacking (meta-classifier on top of base models)
- Boosting (if using multiple traditional classifiers)
Document your ensemble strategy and show whether it improves over individual models.
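A weighted soft-voting ensemble over two traditional models might look like the following sketch. The choice of base models, the weights `[1, 2]`, and the toy emails are assumptions for illustration only.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each base model carries its own vectorizer, so the ensemble accepts raw text.
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
lr = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))

ensemble = VotingClassifier(
    estimators=[("nb", nb), ("lr", lr)],
    voting="soft",      # average predicted probabilities rather than hard votes
    weights=[1, 2],     # illustrative weights; tune them on the validation set
)

# Made-up training data, only to show the API.
texts = [
    "free prize click now", "cheap pills online offer", "win money fast",
    "meeting at noon tomorrow", "please review the attached report", "lunch on friday?",
]
labels = [1, 1, 1, 0, 0, 0]
ensemble.fit(texts, labels)
proba = ensemble.predict_proba(["free prize click now"])[:, 1]
```

A fine-tuned transformer does not plug into `VotingClassifier` directly; in that case the same idea can be applied by averaging each model's predicted spam probabilities by hand.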
For each classifier, you must report:
- Accuracy: Overall correctness
- Precision: Of emails classified as spam, how many are actually spam?
- Recall: Of actual spam emails, how many did you catch?
- F1 Score: Harmonic mean of precision and recall
- AUC-ROC: Area under the receiver operating characteristic curve
- Confusion Matrix: Visualize true positives, false positives, true negatives, false negatives
Create a comparison table showing all metrics for all classifiers.
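The required metrics can all be computed with scikit-learn. The sketch below builds one table row per model from true labels and predicted spam probabilities; the helper name `metrics_row`, the 0.5 threshold, and the example scores are assumptions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)


def metrics_row(y_true, y_score, threshold=0.5):
    """Compute all required metrics for one model (spam is the positive class)."""
    y_pred = [int(score >= threshold) for score in y_score]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "auc_roc": roc_auc_score(y_true, y_score),  # uses scores, not hard labels
        "tp": int(tp), "fp": int(fp), "tn": int(tn), "fn": int(fn),
    }


# Made-up labels and scores for a single model.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
row = metrics_row(y_true, y_score)
```

Collecting one such dictionary per classifier gives the rows of the comparison table, for example via `pandas.DataFrame`.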
For each model, measure and report:
- Training time: How long to train on the full dataset?
- Inference time: Average time to classify a single email
- Model size: Memory footprint of saved model
- Throughput: Emails classified per second
This helps understand deployment trade-offs.
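One way to take these measurements is sketched below. It assumes a scikit-learn style model with `fit` and `predict`, and it uses the size of the pickled model as a rough proxy for memory footprint; the helper name `measure_efficiency` and the toy data are illustrative.

```python
import pickle
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def measure_efficiency(model, train_texts, train_labels, test_texts):
    """Fit `model`, then time single-email inference and measure pickled size."""
    start = time.perf_counter()
    model.fit(train_texts, train_labels)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for text in test_texts:  # classify one email at a time
        model.predict([text])
    infer_seconds = time.perf_counter() - start

    return {
        "train_seconds": train_seconds,
        "ms_per_email": 1000 * infer_seconds / len(test_texts),
        "emails_per_second": len(test_texts) / infer_seconds,
        "model_kib": len(pickle.dumps(model)) / 1024,
    }


# Made-up data, only to exercise the helper.
train_texts = ["free prize click now", "cheap pills online", "meeting at noon", "see you at lunch"]
train_labels = [1, 1, 0, 0]
test_texts = ["free pills now", "lunch meeting moved"] * 50
report = measure_efficiency(
    make_pipeline(TfidfVectorizer(), MultinomialNB()), train_texts, train_labels, test_texts)
```

Batched prediction is usually much faster than the one-at-a-time loop shown here, so it is worth reporting which of the two a throughput figure refers to.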
- Use cross-validation (at least 5-fold) to get confidence intervals
- Perform statistical tests to determine if one model significantly outperforms another
- Report means and standard deviations across folds
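The following sketch shows one way to obtain per-fold scores, a confidence interval, and a paired significance test. Synthetic features stand in for a vectorized email corpus, and the two models are arbitrary examples. The paired t-test across folds is a common but approximate choice, since fold scores are not fully independent; other tests (such as McNemar's test on a shared test set) are equally defensible.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced data standing in for real email features (assumption).
X, y = make_classification(n_samples=400, n_features=20, weights=[0.7, 0.3], random_state=42)

# A fixed random_state gives both models the same five folds, so scores are paired.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=folds, scoring="roc_auc")
scores_b = cross_val_score(GaussianNB(), X, y, cv=folds, scoring="roc_auc")


def mean_ci(scores, confidence=0.95):
    """Mean and t-based confidence interval across folds."""
    scores = np.asarray(scores)
    mean = scores.mean()
    half_width = stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1) * stats.sem(scores)
    return mean, mean - half_width, mean + half_width


mean_a, low_a, high_a = mean_ci(scores_a)
std_a = scores_a.std(ddof=1)                 # standard deviation across folds
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
```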
This is where you demonstrate deep understanding:
- Identify failure cases: Find at least 20 misclassified emails (10 false positives, 10 false negatives)
- Categorize errors: Group them into patterns:
- False Positives: Legitimate emails classified as spam (e.g., promotional emails, newsletters)
- False Negatives: Spam that slipped through (e.g., sophisticated phishing, image-based spam)
- Analyze patterns: What do misclassified emails have in common?
- Vocabulary overlap between spam and ham
- Short emails with little context
- Emails with unusual formatting
- Multilingual content
- Do different models make different mistakes?
- Are neural models better at certain types of spam vs. traditional models?
- Where does the ensemble help most?
- For traditional models: What features are most predictive? (Use feature coefficients or SHAP values)
- For neural models: Use attention visualization or probing classifiers
- What words/patterns are most strongly associated with spam vs. ham?
Test the robustness of your classifiers:
Manually craft at least 5 emails that:
- Should be classified as spam but try to evade detection
- Use techniques spammers actually use: character substitution (V1agra), adding legitimate text, etc.
Test these on all your classifiers. Which models are most robust?
Test your classifier against:
- Typos and misspellings: Add random character swaps
- Case variations: Change capitalization randomly
- Synonym replacement: Replace words with synonyms
- Content injection: Add benign text to spam emails
How much does performance degrade? Which models are most robust?
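Several of the perturbations above need only the standard library. The sketch below is one possible implementation: the function names, perturbation rates, substitution table, and the deliberately naive keyword filter used for the demonstration are all assumptions. Synonym replacement is left out because it needs an external lexical resource.

```python
import random


def swap_typos(text, rate=0.1, rng=None):
    """Swap adjacent letter pairs with probability `rate`."""
    rng = rng or random.Random(0)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each letter moves at most once
        else:
            i += 1
    return "".join(chars)


def random_case(text, rate=0.3, rng=None):
    """Flip the case of each character with probability `rate`."""
    rng = rng or random.Random(0)
    return "".join(c.swapcase() if rng.random() < rate else c for c in text)


def inject_benign(text, benign="Hi team, the meeting notes are attached. Thanks!"):
    """Append legitimate-looking text to an email."""
    return f"{text}\n\n{benign}"


LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0"})


def leetspeak(text):
    """Character substitution in the style of 'V1agra'."""
    return text.translate(LEET)


def accuracy(predict_fn, texts, labels):
    """Fraction of emails for which predict_fn returns the true label."""
    return sum(predict_fn(t) == y for t, y in zip(texts, labels)) / len(labels)


# A deliberately naive keyword filter, used only to show degradation.
def keyword_filter(text):
    return int("free" in text.lower())


texts = ["free money now", "lunch at noon?"]
labels = [1, 0]
clean_accuracy = accuracy(keyword_filter, texts, labels)
attacked_accuracy = accuracy(keyword_filter, [leetspeak(t) for t in texts], labels)
```

Running the same comparison for each perturbation and each trained classifier gives a robustness table that can be discussed alongside the main metrics.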
Discuss the following:
- How did you handle class imbalance in training?
- What happens if spam/ham ratio changes in production?
- Should you use sampling techniques (SMOTE, undersampling)?
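Two small calculations can ground this discussion. The sketch below first computes scikit-learn's "balanced" class weights for an assumed 90/10 ham/spam split, then uses Bayes' rule to show how precision moves when the spam ratio changes while the classifier's true and false positive rates stay fixed. The numbers and the helper name `precision_at_prevalence` are illustrative assumptions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed training set: 90 ham (0) and 10 spam (1).
labels = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# "balanced" weight for a class = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weight = dict(zip(classes.tolist(), weights.tolist()))


def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed TPR and FPR at a given spam prevalence."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# The same classifier (TPR 0.95, FPR 0.02) under two spam ratios.
precision_balanced = precision_at_prevalence(0.95, 0.02, prevalence=0.50)
precision_rare_spam = precision_at_prevalence(0.95, 0.02, prevalence=0.05)
```

Many scikit-learn classifiers accept such weights through a `class_weight` argument, which is one alternative to resampling techniques such as SMOTE or undersampling.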
Given different constraints, which model would you choose?
- Mobile email app: Needs fast inference, small model size
- Email server: Can use larger models, needs high throughput
- Maximum accuracy: No constraints, best possible performance
Justify your recommendations with evidence from your experiments.
Submit a single Jupyter notebook that includes:
- All classifier implementations with clear documentation
- Training and evaluation code
- Utility functions for metrics, visualization, etc.
- Must run in a clean Google Colab instance without errors
Your notebook must include well-written markdown sections:
- Overview of your approach
- High-level architecture decisions
For each classifier:
- Architecture description
- Feature engineering choices
- Hyperparameter selection process
- Training procedure
- Comparison table of all metrics
- Visualizations (confusion matrices, ROC curves, training curves)
- Statistical significance tests
- Categorization of failure cases
- Specific examples with explanations
- Insights about what makes spam detection difficult
- Description of adversarial examples you created
- Results of robustness tests
- Analysis of model vulnerabilities
- What did you learn about spam classification?
- Which approach worked best and why?
- Trade-offs between different models
- Recommendations for deployment scenarios
- Limitations of your approach
- What was challenging about this assignment?
- What would you do differently with more time/resources?
- What surprised you about the results?
- Clean, readable code with meaningful variable names
- Comments explaining non-obvious logic
- Modular functions (not one giant cell)
- Reproducible results (set random seeds)
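For reproducibility, a single helper that seeds every random number generator in use keeps results stable across runs. The sketch below is one possible version; the name `set_seed` is an assumption, and PyTorch is seeded only if it happens to be installed.

```python
import random

import numpy as np


def set_seed(seed=42):
    """Seed Python, NumPy, and (if available) PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch is optional; traditional models do not need it
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


set_seed(42)
first_draw = (random.random(), float(np.random.rand()))
set_seed(42)
second_draw = (random.random(), float(np.random.rand()))
```

scikit-learn estimators and splitters take their own `random_state` argument, so that should be set explicitly as well.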
Your assignment will be graded according to the following breakdown:
- Traditional ML Models (15 points)
  - Two different traditional ML classifiers properly implemented (8 pts)
  - Proper feature engineering documented and justified (4 pts)
  - Appropriate hyperparameter tuning demonstrated (3 pts)
- Neural/Transformer Model (15 points)
  - Correct implementation of fine-tuned transformer or neural model (8 pts)
  - Proper training procedure with validation monitoring (4 pts)
  - Handling of sequence length and preprocessing (3 pts)
- Ensemble Method (10 points)
  - Ensemble combines multiple models appropriately (5 pts)
  - Improvement over individual models demonstrated (5 pts)
- Quantitative Metrics (15 points)
  - All required metrics computed correctly (5 pts)
  - Clear comparison table across all models (5 pts)
  - Proper visualizations (confusion matrices, ROC curves) (5 pts)
- Computational Efficiency (5 points)
  - Training time, inference time, and model size measured (3 pts)
  - Trade-offs discussed appropriately (2 pts)
- Statistical Rigor (5 points)
  - Cross-validation performed correctly (3 pts)
  - Statistical significance testing (2 pts)
- Error Analysis (20 points)
  - Systematic categorization of at least 20 failure cases (8 pts)
  - Insightful patterns identified in errors (6 pts)
  - Comparative analysis across models (3 pts)
  - Feature importance analysis (3 pts)
- Creation of meaningful adversarial examples (5 pts)
- Robustness testing across perturbations (3 pts)
- Thoughtful analysis of vulnerabilities (2 pts)
- Discussion of class imbalance handling (2 pts)
- Deployment scenario recommendations with justification (3 pts)
- Code quality: clean, readable, well-commented (5 pts)
- Markdown documentation: clear, thorough, well-organized (5 pts)
- Visualizations: informative and professional (3 pts)
- Reflection and insights: thoughtful and substantive (2 pts)
- Exceptionally thorough analysis that goes beyond requirements (+5 pts)
- Novel techniques or insights not covered in class (+5 pts)
- Additional robustness tests (e.g., multilingual spam, zero-day attacks) (+3 pts)
- Deployment-ready code with API endpoint or containerization (+5 pts)
- Interactive demo or visualization tool (+3 pts)
Note: Maximum score is capped at 110/100.
While your grade is based on the rubric above, your model's performance will also be tested on a held-out dataset. This serves as a sanity check—if your models perform poorly (e.g., below 0.85 AUC), you may lose points even if other components are complete.
Expected Performance Benchmarks:
- Minimum acceptable: AUC > 0.85 on held-out data
- Good performance: AUC > 0.92 on held-out data
- Excellent performance: AUC > 0.96 on held-out data
The following code can be used to evaluate your classifiers during development:
import os
import zipfile
import shutil
from pathlib import Path
from sklearn.metrics import roc_auc_score

def evaluate_classifier(zip_path: str, classify_email_fn) -> float:
    """
    Evaluate a classifier's performance on a dataset contained in a zip archive.

    Parameters:
        zip_path (str): Path to the zip archive containing "spam" and "ham" folders.
        classify_email_fn (function): A function handle to classify_email(email_text: str) -> int.

    Returns:
        float: The AUC (Area Under the Curve) score of the classifier.
    """
    # Step 1: Set up paths and directories
    dataset_dir = Path(zip_path).with_suffix('')  # Create a directory name based on the zip name (without .zip)
    temp_extracted = False  # Track if we extracted the zip (for cleanup)

    # Step 2: Check if the dataset is already extracted
    if not dataset_dir.exists():
        print(f"Extracting {zip_path}...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(dataset_dir)
        temp_extracted = True  # Mark that we extracted files

    # Step 3: Prepare to collect the data
    emails = []
    labels = []

    # Traverse the spam folder
    spam_folder = dataset_dir / "spam"
    for file_path in spam_folder.iterdir():
        if file_path.is_file():
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                email_text = file.read()
                emails.append(email_text)
                labels.append(1)  # Spam is labeled as 1

    # Traverse the ham folder
    ham_folder = dataset_dir / "ham"
    for file_path in ham_folder.iterdir():
        if file_path.is_file():
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                email_text = file.read()
                emails.append(email_text)
                labels.append(0)  # Ham is labeled as 0

    # Step 4: Classify all emails
    predictions = [classify_email_fn(email) for email in emails]

    # Step 5: Calculate AUC score
    auc_score = roc_auc_score(labels, predictions)

    # Step 6: Clean up if necessary
    if temp_extracted:
        print(f"Cleaning up extracted files from {dataset_dir}...")
        shutil.rmtree(dataset_dir)

    return auc_score
You can call this function in your notebook to evaluate individual models during development:
auc_score = evaluate_classifier('training.zip', classify_email)
print(f"Model AUC Score: {auc_score:.4f}")
While this assignment is comprehensive in scope, it's designed to be completable in 7 days. Here's a suggested timeline (students can use GenAI to accelerate implementation):
- Day 1: Setup & Data Exploration - Extract data, perform EDA, create train/val/test splits
- Day 2: Traditional ML Models - Implement two traditional ML classifiers with feature engineering (use GenAI to accelerate TF-IDF/feature pipeline code)
- Day 3: Neural Model - Fine-tune transformer (DistilBERT recommended for speed), monitor training
- Day 4: Evaluation & Metrics - Compute all metrics, generate comparison tables and visualizations
- Day 5: Error Analysis - Identify and categorize 20+ failure cases, analyze patterns and feature importance
- Day 6: Adversarial Testing & Real-World Considerations - Create adversarial examples, test robustness, discuss class imbalance and deployment scenarios
- Day 7: Documentation & Polish - Write markdown sections, verify code runs cleanly, final review
Key to Success: Use GenAI coding assistants to accelerate boilerplate code and feature engineering, but invest your time in understanding results, analyzing errors, and writing insightful analysis.
- Save different model versions as you experiment.
- Track what worked and what didn't in your notebook.
- Use meaningful names for models and experiments.
Since this assignment must be completed in 7 days, using AI coding assistants is essential to accelerate implementation while you focus on the analytical components:
DO use GenAI for:
- Boilerplate code: Data loading, train/test splits, metric computation
- Feature engineering pipelines: TF-IDF setup, feature extraction utilities
- Model scaffolding: Training loops, validation monitoring, hyperparameter grids
- Visualization code: Confusion matrices, ROC curves, comparison tables
- Debugging: Finding issues in data processing or model output shape mismatches
DON'T use GenAI as a shortcut for:
- Understanding results: Always interpret what your models are doing
- Error analysis: Manually examine misclassified examples and identify patterns
- Design decisions: Think critically about which models/features to try and why
- Documentation: Write your own explanations of methodology and findings
Workflow: Generate code scaffolds with GenAI, then spend your time on data exploration, result interpretation, error categorization, and insightful analysis. The best submissions show deep understanding of the why, not just the how.
The goal is to learn ML concepts deeply while shipping a complete, well-analyzed project in 7 days.
For traditional ML models:
- Don't just use default TF-IDF—experiment with n-gram ranges, max features, min/max document frequency.
- Create custom features based on spam characteristics (URLs, special characters, etc.).
- Consider domain-specific patterns (e.g., "Click here", "Free money", etc.).
- Always use a validation set separate from your training data.
- Plot training vs. validation curves for neural models.
- Use cross-validation for traditional models.
- If training accuracy >> validation accuracy, you're overfitting.
If your models aren't performing well:
- Check data quality: Are there mislabeled examples?
- Verify preprocessing: Are you handling special characters, URLs correctly?
- Inspect predictions: Look at specific examples where the model fails.
- Try simpler models first: Debug Naive Bayes before attempting BERT.
- Check class balance: Are you predicting only the majority class?
- Use the same train/validation/test split for all models.
- Report all metrics (not just the best one).
- Don't cherry-pick results.
- Future you (and the grader) will thank you for clear documentation.
- Explain why you made each decision, not just what you did.
- Include negative results—what didn't work and why?
- A Plan for Spam - Paul Graham's foundational essay on Bayesian spam filtering
- A Bayesian Approach to Filtering Junk E-Mail - Sahami et al., seminal academic paper (AAAI 1998)
- Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends - Comprehensive survey
- Hugging Face Transformers Documentation - Essential for BERT/DistilBERT
- Fine-tuning BERT for Text Classification
- DistilBERT Paper - Lighter, faster BERT variant
- Precision and Recall - Understanding the trade-off
- ROC Curves and AUC
- Confusion Matrix Guide
- TextFooler: Is BERT Really Robust? - Strong baseline for NLP adversarial attacks (AAAI 2020)
- TextAttack: Framework for Adversarial Attacks
- scikit-learn: Traditional ML, metrics, preprocessing
- transformers: BERT, DistilBERT, RoBERTa
- pandas: Data manipulation
- matplotlib/seaborn: Visualization
- nltk/spaCy: NLP preprocessing
- imbalanced-learn: Handling class imbalance
If you use external datasets, you must cite them clearly in your notebook.
This assignment is submitted via GitHub Classroom. Follow these steps:
- Accept the assignment: Click the assignment link to create your private fork of the repository.
  - Assignment repository: github.com/ContextLab/spam-classifier-llm-course
- Clone your repository:
  git clone https://github.com/ContextLab/spam-classifier-llm-course-YOUR_USERNAME.git
- Complete your work:
  - Work in Google Colab, Jupyter, or your preferred environment
  - Save your notebook to the repository
- Commit and push your changes (if working locally):
  git add .
  git commit -m "Complete SPAM classifier assignment"
  git push
- Verify submission: Check that your latest commit appears in your GitHub repository before the deadline
Deadline: January 26, 2026 at 11:59 PM EST
Submit one Jupyter notebook (.ipynb file) in your GitHub Classroom repository.
Your notebook must:
- Run from top to bottom without errors in a clean Google Colab environment
- Include all necessary code for training, evaluation, and analysis
- Download any required data/models within the notebook (don't assume files are present)
- Set random seeds for reproducibility (e.g., np.random.seed(42))
- Have a reasonable runtime: Full execution should complete in under 60 minutes on Colab (use DistilBERT instead of BERT-base to stay within this constraint)
Structure your notebook with clear sections:
1. Introduction and Setup
- Import libraries
- Load data
- Exploratory data analysis
2. Data Preprocessing
- Train/val/test split
- Text cleaning functions
- Feature engineering utilities
3. Model Implementations
- Traditional ML models (separate subsections for each)
- Neural model (BERT/DistilBERT)
- Ensemble method
4. Evaluation
- Metrics computation
- Comparison tables
- Visualizations
- Statistical tests
5. Error Analysis
- Failure case examination
- Pattern identification
- Feature importance
6. Adversarial Testing
- Adversarial examples
- Robustness tests
7. Discussion and Conclusions
- Model comparison
- Real-world considerations
- Reflection
8. References
- Papers cited
- Datasets used
- Resources consulted
- Use descriptive markdown headers for each section
- Include explanatory text before code cells
- Add inline comments for complex code
- Create clear visualizations with titles and labels
- Use tables for metric comparisons
Name your file: LastName_FirstName_Assignment2.ipynb
Example: Smith_Jane_Assignment2.ipynb
Before submitting, verify:
- Notebook runs completely in a fresh Colab instance
- All required components are implemented (3+ models, evaluation, error analysis, adversarial testing)
- All metrics are reported for all models
- At least 20 error cases are analyzed
- Markdown documentation is thorough and well-written
- Code is clean and readable
- Visualizations are clear and professional
- Random seeds are set for reproducibility
- Citations are included for external resources
- File is named correctly
One week from assignment release (7 calendar days)
Late submissions will be penalized according to the course late policy.
While you are encouraged to use AI coding assistants and discuss concepts with peers:
- Your submission must be your own work
- Understand every line of code you submit
- Do not copy code from other students or online sources without attribution
- Cite all external resources (papers, datasets, significant code snippets)
You may be asked to explain your implementation decisions in office hours or during grading. Make sure you can justify your choices.
If you're stuck:
- Review the tips and resources in this document
- Ask specific questions in office hours or on the course forum
- Debug systematically: Isolate the problem, test components individually
- Start simple: Get a basic version working before adding complexity
Remember: The goal is to learn about text classification, evaluation, and error analysis. Don't get lost in trying to achieve the highest possible score—focus on understanding the concepts deeply.
This assignment is designed to be challenging but achievable within 1 week. You're expected to:
- Think critically about model selection and evaluation
- Go beyond "does it work?" to "why does it work?"
- Consider real-world deployment constraints
- Demonstrate both technical skills and analytical thinking
- Use GenAI tools strategically to manage time constraints without sacrificing rigor
The best submissions will show:
- Deep understanding of classification fundamentals
- Thoughtful comparison across methods
- Insightful error analysis
- Professional code and documentation
- Strategic use of GenAI to accelerate implementation while maintaining analytical depth
Remember: The 7-day timeline is realistic because:
- GenAI can generate boilerplate code (training loops, metrics, visualizations)
- DistilBERT trains faster than BERT-base
- You can run many experiments in parallel on Colab's GPUs
- The most valuable insights come from analysis, not implementation time
Good luck, and enjoy building your spam classifier in a week!