Machine Learning 2023

Credit Card Fraud Detection & Prediction

Building an ensemble ML pipeline to detect fraudulent transactions in highly imbalanced financial data — achieving optimal precision-recall balance through threshold optimization.

The Challenge

Credit card fraud is rare (typically less than 1% of all transactions) but enormously costly. Traditional rule-based systems miss sophisticated fraud patterns. The challenge was to build a model that catches fraud without drowning legitimate transactions in false positives.

The dataset exhibited severe class imbalance: a naive classifier achieves 99.8% accuracy by simply predicting "not fraud" for every transaction. Accuracy alone is therefore uninformative here; what matters is precision and recall on the rare fraud class.
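The arithmetic behind that accuracy figure can be sketched in a few lines. The counts below are hypothetical, chosen only so that the fraud rate matches the 99.8% figure; they are not the actual class counts of the dataset.

```python
# Hypothetical counts (assumption): 0.2% fraud, which reproduces the 99.8% figure.
total = 284_000
fraud = 568

# A "classifier" that labels every transaction as legitimate.
true_positives = 0
false_negatives = fraud          # every fraud case is missed
true_negatives = total - fraud   # every legitimate case is "correct"

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.3f}")
print(f"recall   = {recall:.3f}")
```

The accuracy looks excellent while the recall on fraud is zero, which is why the headline numbers for this project are precision and recall.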

<1% Fraud Rate

284K+ Transactions

94% Precision

87% Recall

Approach & Methodology

I tackled this in three phases: data understanding, rebalancing strategy, and model ensemble.

Exploratory Analysis: Identified temporal patterns — fraud peaks at specific hours. Found that fraudulent transactions have distinct amount distributions.
SMOTE Oversampling: Generated synthetic fraud samples to address class imbalance, avoiding the loss of legitimate-transaction data that undersampling would cause.
Isolation Forest: Deployed as an unsupervised anomaly detector for initial flagging of suspicious patterns.
Ensemble Model: Combined Logistic Regression (interpretability) + Random Forest (non-linear patterns) + Gradient Boosting (residual learning) with soft voting.
Threshold Optimization: Instead of using the default 0.5 cutoff, swept thresholds to find the optimal precision-recall tradeoff for the business context.
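The ensemble and threshold steps above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic (make_classification), the hyperparameters are library defaults, the SMOTE and Isolation Forest steps are omitted for brevity, and F1 is used only as a placeholder selection criterion where the real choice depends on the business context.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the transaction data (assumption): about 2% positives.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of the three models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# Probability of the positive ("fraud") class for each held-out sample.
fraud_proba = ensemble.predict_proba(X_test)[:, 1]

# Sweep every candidate threshold instead of fixing the cutoff at 0.5.
precision, recall, thresholds = precision_recall_curve(y_test, fraud_proba)

# precision/recall have one more entry than thresholds; drop the final point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero
best = int(np.argmax(f1))
best_threshold = float(thresholds[best])

print(f"threshold={best_threshold:.3f} precision={p[best]:.3f} recall={r[best]:.3f}")
```

Swapping the F1 line for a cost-weighted score is the natural way to encode a specific business tradeoff.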

Key Insights

The most valuable finding was that fraud detection isn't just a modeling problem — it's a decision problem. The "best" model depends on the cost ratio between false positives (blocking legitimate customers) and false negatives (missing fraud).

By presenting results across multiple thresholds rather than a single "accuracy" number, stakeholders can make informed business decisions about their risk tolerance.
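To make the cost-ratio argument concrete, here is a small sketch that reports precision, recall, and total cost at several thresholds. The scores, labels, thresholds, and cost values are all invented for illustration; the helper names confusion_at and sweep are hypothetical, not part of any library.

```python
def confusion_at(threshold, scores, labels):
    """Count TP, FP, FN, TN when flagging every score >= threshold as fraud."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sweep(scores, labels, thresholds, cost_fp, cost_fn):
    """Return one row per threshold with precision, recall, and total cost."""
    rows = []
    for t in thresholds:
        tp, fp, fn, _ = confusion_at(t, scores, labels)
        rows.append({
            "threshold": t,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "cost": fp * cost_fp + fn * cost_fn,
        })
    return rows


# Toy model scores and true labels (1 = fraud), invented for this example.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
candidates = [0.3, 0.5, 0.7]

# Case A: a missed fraud costs 10x a blocked legitimate customer.
fraud_heavy = sweep(scores, labels, candidates, cost_fp=1.0, cost_fn=10.0)
# Case B: the reverse, where blocking a legitimate customer is the costly error.
friction_heavy = sweep(scores, labels, candidates, cost_fp=10.0, cost_fn=1.0)

best_a = min(fraud_heavy, key=lambda row: row["cost"])
best_b = min(friction_heavy, key=lambda row: row["cost"])

for row in fraud_heavy:
    print(row)
print("cheapest when misses are costly:", best_a["threshold"])
print("cheapest when false alarms are costly:", best_b["threshold"])
```

On this toy data the cheapest threshold is 0.3 when missed fraud dominates the cost and 0.7 when false alarms dominate, with the same model and the same scores in both cases.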

Technology Stack

Python, scikit-learn, SMOTE (imbalanced-learn), Isolation Forest, Logistic Regression, Random Forest, Gradient Boosting, matplotlib, seaborn