Portfolio Case Study

Telco Customer Churn Prediction

Predict which customers are likely to leave, understand why, and act before they churn. Clean pipeline: data import → EDA → preprocessing → Logistic Regression & Random Forest → evaluation → insights.

Accuracy (LR / RF): ~80%

ROC-AUC: ~0.84

Focus: Interpretability


Quick Facts

Type: Supervised binary classification
Target: Churn (Yes/No)
Key features: Contract type, tenure, monthly charges, total charges, internet service, payment method
Why these models? Logistic Regression for clarity; Random Forest for non-linear patterns

📌 Description

I built models to predict which telecom customers are likely to churn and to explain why they leave. The project runs end-to-end: import data, clean it, engineer features, train and compare models, and explain key drivers.

1. Load and audit dataset

2. EDA for trends and issues

3. Clean & encode (binary mapping + one-hot encoding)

4. Train LR & RF

5. Evaluate (accuracy, ROC-AUC, F1, confusion matrix)

6. Explain drivers (feature importance)

Understanding the Problem

Main problem: The company is losing customers and needs to identify who is at risk and why.

Why it matters: Churn hits revenue and raises acquisition costs. Early signals allow retention actions like discounts, plan recommendations, or better support.

Business Questions I Considered

What exactly is “churn” for the business, and when is it recorded?
How will predictions change retention strategy and revenue?
Which actions will be triggered for high-risk customers?
How will we measure success after deploying the model?

Constraints & Assumptions

Use only fields available before churn happens (avoid leakage)
Stratified train/test split so both sets keep the same churn ratio (stratifying preserves the class ratio; it does not balance the classes)
Keep a clear baseline first, then add complexity

Data Understanding & Preprocessing

Checked for missing/null/NaN
Cleaned TotalCharges (non-numeric entries)
Binary columns (Yes/No) → 1/0
Nominal categoricals → One-Hot Encoding
Dropped IDs and obvious non-signals (customerID; optionally gender)
Stratified train/test split to keep churn ratio stable
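
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a made-up eight-row frame, not the notebook's actual code: the toy values, the `preprocess` helper name, and the choice to fill unparseable TotalCharges entries with 0 are all assumptions (dropping those rows would be an equally reasonable choice).

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps listed above."""
    out = df.drop(columns=["customerID"])  # IDs carry no signal

    # Non-numeric entries (e.g. blank strings) become NaN, then 0.
    # Filling with 0 is an assumption made for this sketch.
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    out["TotalCharges"] = out["TotalCharges"].fillna(0.0)

    # Binary Yes/No columns -> 1/0
    for col in ["Partner", "Churn"]:
        out[col] = out[col].map({"Yes": 1, "No": 0})

    # Nominal categoricals -> one-hot columns
    return pd.get_dummies(out, columns=["Contract"], dtype=int)


# Toy data standing in for the real Telco file
raw = pd.DataFrame({
    "customerID": [f"id{i}" for i in range(8)],
    "tenure": [1, 2, 40, 60, 0, 5, 24, 70],
    "TotalCharges": ["29.85", "108.15", "1840.75", "5000.5",
                     " ", "300", "1500", "7000"],
    "Partner": ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year",
                 "Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"],
})

clean = preprocess(raw)
X = clean.drop(columns=["Churn"])
y = clean["Churn"]

# stratify=y keeps the churn ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```

On the real data the same pattern would apply to every Yes/No column and every nominal column, not just the two shown here.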

Visual Checks

Churn rate distribution (overall baseline)
Churn by contract type, internet service, payment method, tenure
Boxplots for numeric features (monthly/total charges)

These charts surfaced business-level patterns, such as which contract types and services churn most, before any model was trained.
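
The numbers behind such charts are simple group-wise churn rates. A small sketch with pandas, using invented rows rather than the real dataset:

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No"],
})

churned = df["Churn"].eq("Yes")

# Overall baseline: share of customers who churned
overall_rate = churned.mean()

# Churn rate per contract type, highest first
rate_by_contract = (
    churned.groupby(df["Contract"]).mean().sort_values(ascending=False)
)

print(f"overall: {overall_rate:.2f}")
print(rate_by_contract)
```

Calling `rate_by_contract.plot.bar()` would turn the same series into a bar chart; the same grouping works for internet service or payment method.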

Logistic Regression

Simple, explainable baseline
~80% accuracy on the test set
ROC-AUC around 0.84
Reported: precision, recall, F1, confusion matrix
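
A baseline of this kind might be trained and scored as in the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the encoded Telco features, so its scores will not match the figures above; the scaling step, the 75/25 class weights, and the 80/20 split are assumptions of the example, not details confirmed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the preprocessed churn data
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale features, then fit the logistic regression baseline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # hard labels at the 0.5 threshold
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)     # uses probabilities, not labels
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
cm = confusion_matrix(y_test, y_pred)       # rows: actual, columns: predicted

print(f"accuracy={accuracy:.3f} roc_auc={roc_auc:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(cm)
```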

Random Forest Classifier

Captures non-linear patterns
Also ~80% accuracy; ROC-AUC ~0.84
Feature importance to highlight drivers of churn
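
Reading importances off a fitted forest looks roughly like this. The data here is a toy set in which churn is driven almost entirely by short tenure, so the ranking is known in advance; the column names, the 12-month rule, and the hyperparameters are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Toy features: only tenure is related to the label
X = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "MonthlyCharges": rng.uniform(20, 120, size=n),
    "noise": rng.normal(size=n),
})

# Toy rule: customers with under 12 months of tenure churn,
# with about 5% of labels flipped at random
y = (X["tenure"] < 12).astype(int).to_numpy()
flip = rng.random(n) < 0.05
y = np.where(flip, 1 - y, y)

rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
rf.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = pd.Series(
    rf.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor high-cardinality features, so cross-checking the ranking with `sklearn.inspection.permutation_importance` on held-out data is a sensible extra step.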

Feature Importance — What Drives Churn

Tenure: Shorter tenure → higher churn risk
Total Charges: Lower totals often mean newer accounts → higher risk
Contract Type: Month-to-month sees more churn than yearly contracts
Internet Service: Fiber optic plan users churned more than DSL in this dataset

Evaluation Metrics

Accuracy
Precision / Recall / F1
ROC-AUC
Confusion Matrix

Choose thresholds based on business cost of false positives vs false negatives.
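
One way to make that choice concrete is to assign a cost to each error type and pick the threshold with the lowest total cost on a validation set. Here a false positive is a retention offer sent to someone who would have stayed, and a false negative is a churner the model missed. The sketch below assumes made-up costs and probabilities; `pick_threshold` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np


def pick_threshold(y_true, y_prob, cost_fp, cost_fn, grid=None):
    """Return (threshold, total_cost) minimizing cost over a grid.

    Ties go to the lowest threshold in the grid.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)

    best_t, best_cost = None, float("inf")
    for t in grid:
        pred = y_prob >= t                        # flag as "will churn"
        fp = np.sum(pred & (y_true == 0))         # flagged but stayed
        fn = np.sum(~pred & (y_true == 1))        # missed churners
        cost = fp * cost_fp + fn * cost_fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return float(best_t), float(best_cost)


# Made-up validation labels and predicted churn probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.60, 0.55]
grid = [0.2, 0.3, 0.5, 0.7]

# Missing a churner costs 10x a wasted offer: a low threshold wins
print(pick_threshold(y_true, y_prob, cost_fp=1, cost_fn=10, grid=grid))

# Wasted offers cost 10x a missed churner: a high threshold wins
print(pick_threshold(y_true, y_prob, cost_fp=10, cost_fn=1, grid=grid))
```

The two calls show how the same model yields different operating points once the cost ratio changes, which is why the threshold is a business decision rather than a fixed 0.5.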

Results & Next Steps

End-to-end cycle: business framing → wrangling → modeling → evaluation → interpretation
Strong baseline with LR/RF at ~80% accuracy and ~0.84 ROC-AUC

Planned Improvements

XGBoost with hyperparameter tuning
SHAP for customer-level explanations
Deploy a dashboard for retention teams

Completed: August 2025 • Author: Haseeb Sagheer • Stack: Python, pandas, scikit-learn, seaborn, matplotlib

