This project presents an end-to-end pipeline for detecting fraudulent credit card transactions using machine learning. It combines extensive feature engineering, data visualization, model selection, and optimization techniques to build an effective fraud detection system. The final model—an XGBoost classifier—achieved an F1 score of 0.90 on Kaggle, showcasing strong performance on an imbalanced dataset.
Key highlights include:
- Custom temporal, behavioral, and category-based features.
- Exploratory data analysis to uncover correlations.
- Use of SHAP values and feature importances for explainability.
- Model tuning using performance curves and recursive feature elimination.
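The custom temporal, behavioral, and category-based features listed above might be derived along these lines. This is a minimal sketch; the column names (`trans_date_trans_time`, `amt`, `category`) are assumptions about the dataset, not confirmed by the repository:

```python
import pandas as pd

# Toy transactions; the real data comes from the downloaded CSV files
df = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime(
        ["2020-06-21 12:14:00", "2020-06-21 23:50:00", "2020-06-22 03:05:00"]
    ),
    "amt": [4.97, 281.06, 41.28],
    "category": ["grocery_pos", "shopping_net", "gas_transport"],
})

# Temporal indicators: hour of day, plus a late-night flag
df["hour"] = df["trans_date_trans_time"].dt.hour
df["is_night"] = df["hour"].isin(range(0, 6)).astype(int)

# Behavioral feature: each amount relative to the mean spend
df["amt_ratio"] = df["amt"] / df["amt"].mean()

# Category-based features: one-hot encode the merchant category
df = pd.get_dummies(df, columns=["category"], prefix="cat")
```

The actual notebook builds a richer set of features (spending profiles, odds ratios), but the pattern is the same: derive columns from raw timestamps, amounts, and categories before training.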
Ensure you have the following installed:
- Python 3.8+
- pip
- Jupyter Notebook or JupyterLab
Clone the repository:

    git clone https://github.com/cros-nash/CreditCardFraud.git
    cd CreditCardFraud
Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
Install the dependencies:

    pip install -r requirements.txt
Launch the Jupyter Notebook environment:

    jupyter notebook
Download the data files from the following URL:
- https://drive.google.com/drive/folders/1qhCGDZV32bMrMT-lu8gCL1lu23MNFV0R?usp=share_link
- Rename the folder from `CreditCardData` to `data` and place it in the same directory as `CreditCardFraud`.
Open and run `CreditCardFraud.ipynb`. The notebook is structured as follows:
- Load and clean data
- Perform exploratory data analysis
- Generate custom features (e.g., transaction time, spending ratios, odds ratios)
- Train initial DecisionTree model
- Switch to and optimize XGBoost model
- Use SHAP values and `feature_importances_` to guide final feature selection
- Evaluate performance using F1 score, PR curve, and optimal thresholding
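The final evaluation step, picking the decision threshold that maximizes F1 on the precision-recall curve, can be sketched as follows. The scores below are synthetic stand-ins for the XGBoost predicted probabilities used in the notebook:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels and scores standing in for model probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.55, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# F1 at each candidate threshold (the last precision/recall pair
# corresponds to no threshold, so drop it)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# Final predictions at the F1-optimal decision boundary
y_pred = (y_score >= best_threshold).astype(int)
```

Sweeping the threshold this way often beats the default 0.5 cutoff on heavily imbalanced data, where the optimal boundary tends to sit well below 0.5.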
- 📊 EDA Visualizations: Fraud distribution by gender, age, time of day, and geography.
- ⚙️ Custom Feature Engineering: Temporal indicators, spending profiles, category volatility.
- 🧠 Model Selection: Transition from DecisionTree to XGBoost for robustness.
- 🔍 Interpretability: SHAP analysis and odds ratio calculations.
- 🔄 Recursive Feature Elimination: Remove redundant features to avoid overfitting.
- 📈 Threshold Tuning: Optimize decision boundary using F1/precision-recall trade-offs.
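The recursive feature elimination step described above can be sketched with scikit-learn's `RFE`. For a self-contained example this uses a `DecisionTreeClassifier` (the project's initial model) on synthetic imbalanced data rather than the tuned XGBoost model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the engineered feature matrix
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=42,
)

# Recursively drop the weakest features until 4 remain
selector = RFE(DecisionTreeClassifier(random_state=42), n_features_to_select=4)
selector.fit(X, y)

kept = np.flatnonzero(selector.support_)  # indices of surviving features
```

Pruning redundant features this way reduces the risk of overfitting and keeps the SHAP explanations focused on the signals that actually drive predictions.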
- `ProjectPaper.pdf`: Describes the research, methodology, and technical decisions in detail.
- `CreditCardFraud.ipynb`: Fully executable notebook with data, models, and results.
- Initial release
- Complete pipeline implemented
- Achieved 0.90 F1 score on Kaggle
A well-crafted fraud detection pipeline demonstrates both technical rigor and responsible feature design. This project highlights the importance of thoughtful preprocessing, explainable AI techniques, and real-world applicability. We hope this serves as a strong foundation for others looking to build high-performance fraud detection models.