Portfolio Case Study

Telco Customer Churn Prediction

Predict which customers are likely to leave, understand why, and act before they churn. Clean pipeline: data import → EDA → preprocessing → Logistic Regression & Random Forest → evaluation → insights.

Accuracy (LR / RF): ~80%
ROC-AUC: ~0.84
Focus: Interpretability

Quick Facts

  • Type: Supervised binary classification
  • Target: Churn (Yes/No)
  • Key features: Contract type, tenure, monthly charges, total charges, internet service, payment method
  • Why these models? Logistic Regression for clarity; Random Forest for non-linear patterns

📌 Description

I built models to predict which customers of a telecom company are likely to churn, and to explain why. The project runs end-to-end: import the data, clean it, engineer features, train and compare models, and explain the key drivers.

1. Load and audit dataset
2. EDA for trends and issues
3. Clean & encode (binary mapping + one-hot encoding)
4. Train Logistic Regression (LR) & Random Forest (RF)
5. Evaluate (accuracy, ROC-AUC, F1, confusion matrix)
6. Explain drivers (feature importance)

Data Understanding & Preprocessing

  • Checked every column for missing values (null/NaN)
  • Cleaned TotalCharges, which contained non-numeric entries
  • Binary columns (Yes/No) → 1/0
  • Nominal categoricals → One-Hot Encoding
  • Dropped IDs and obvious non-signals (customerID; optionally gender)
  • Stratified train/test split to keep churn ratio stable
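
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on an invented toy frame, not the notebook's actual code: the column names (`Partner`, `Contract`), the choice to drop rows with unparseable TotalCharges rather than impute them, and the 25% test size are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and encode a Telco-style frame. Column names are assumptions."""
    out = df.drop(columns=["customerID"])
    # Non-numeric entries (e.g. blank strings) become NaN ...
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    # ... and are dropped here; imputing them would be an equally valid choice.
    out = out.dropna(subset=["TotalCharges"])
    # Binary Yes/No columns -> 1/0
    for col in ["Partner", "Churn"]:
        out[col] = out[col].map({"Yes": 1, "No": 0})
    # Nominal categoricals -> one-hot columns
    return pd.get_dummies(out, columns=["Contract"], dtype=int)


# Invented toy data: 20 customers, one with a blank TotalCharges entry.
raw = pd.DataFrame({
    "customerID": [f"id{i}" for i in range(20)],
    "Partner": ["Yes", "No"] * 10,
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"] * 5,
    "TotalCharges": [" "] + [str(50.0 * i) for i in range(1, 20)],
    "Churn": ["Yes", "No", "No", "No"] * 5,
})

clean = preprocess(raw)
X = clean.drop(columns="Churn")
y = clean["Churn"]

# stratify=y keeps the churn ratio roughly equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(clean.dtypes)
print("train churn rate:", y_train.mean(), "test churn rate:", y_test.mean())
```

Without `stratify=y`, a small or imbalanced sample could leave the test set with very few churners, which makes recall and F1 unstable.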

Visual Checks

  • Churn rate distribution (overall baseline)
  • Churn by contract type, internet service, payment method, tenure
  • Boxplots for numeric features (monthly/total charges)

These plots gave business-level insight even before modeling.
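
The numbers behind such charts are simple group-wise churn rates. A small sketch, assuming a `Churn` column holding "Yes"/"No" and a hypothetical helper `churn_rate_by`; the toy rows are invented for illustration and are not results from the dataset.

```python
import pandas as pd


def churn_rate_by(df: pd.DataFrame, column: str, target: str = "Churn") -> pd.Series:
    """Share of churned customers within each level of `column`."""
    churned = df[target].eq("Yes")
    return churned.groupby(df[column]).mean().sort_values(ascending=False)


# Invented toy rows, only to show the shape of the output.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

baseline = toy["Churn"].eq("Yes").mean()  # overall churn rate
rates = churn_rate_by(toy, "Contract")
print("baseline churn rate:", baseline)
print(rates)
```

Comparing each segment's rate against the overall baseline is what turns a bar chart into a statement such as "this contract type churns above average".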

Logistic Regression

  • Simple, explainable baseline
  • ~80% accuracy on the test set
  • ROC-AUC around 0.84
  • Reported: precision, recall, F1, confusion matrix
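
A baseline of this kind might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the encoded churn features, so the scores it prints say nothing about the results quoted above. The scaling step and `max_iter=1000` are my assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 27% positives); not the Telco dataset.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling helps the solver converge and makes coefficients comparable.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

pred = model.predict(X_test)               # hard labels at the default 0.5 threshold
proba = model.predict_proba(X_test)[:, 1]  # churn probabilities, needed for ROC-AUC

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, pred, average="binary"
)
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
    "precision": precision,
    "recall": recall,
    "f1": f1,
}
cm = confusion_matrix(y_test, pred)  # rows: actual, columns: predicted

print(metrics)
print(cm)
```

ROC-AUC is computed from probabilities rather than hard labels, which is why it can stay informative even when accuracy is dominated by the majority class.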

Random Forest Classifier

  • Captures non-linear patterns
  • Also ~80% accuracy; ROC-AUC ~0.84
  • Feature importance to highlight drivers of churn
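
As a rough sketch of how such a ranking is read off a fitted forest: the data below is synthetic, generated under the assumption that churn probability falls with tenure, and the hyperparameters (`n_estimators=200`, `max_depth=4`) are illustrative rather than the ones used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic features; only the names echo the Telco-style columns.
X = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),
    "MonthlyCharges": rng.uniform(20, 120, size=n),
    "noise": rng.normal(size=n),
})

# Assumed rule for the demo: churn probability decreases as tenure grows.
p_churn = 1 / (1 + np.exp(-(2.0 - 0.12 * X["tenure"])))
y = (rng.uniform(size=n) < p_churn).astype(int)

forest = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
forest.fit(X, y)

# Impurity-based importances: non-negative and summing to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances rank features but carry no sign, so a direction such as "shorter tenure, higher risk" has to come from elsewhere, for example the EDA plots or the Logistic Regression coefficients.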

Feature Importance — What Drives Churn

  • Tenure: Shorter tenure → higher churn risk
  • Total Charges: Lower totals often mean newer accounts → higher risk
  • Contract Type: Month-to-month sees more churn than yearly contracts
  • Internet Service: Fiber optic plan users churned more than DSL in this dataset

Evaluation Metrics

  • Accuracy
  • Precision / Recall / F1
  • ROC-AUC
  • Confusion Matrix

Choose the classification threshold based on the business cost of false positives (customers flagged who would have stayed) versus false negatives (churners the model misses).
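
One way to make this concrete is to scan candidate thresholds and keep the one with the lowest total cost. The helper `best_threshold`, the cost values, and the toy labels and scores below are all hypothetical; real costs would come from the retention team.

```python
import numpy as np


def best_threshold(y_true, proba, cost_fp, cost_fn, grid):
    """Return (threshold, total_cost) minimising cost_fp * FP + cost_fn * FN."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    costs = []
    for t in grid:
        pred = proba >= t                        # flag as churner at this threshold
        fp = np.sum(pred & (y_true == 0))        # flagged, but stayed
        fn = np.sum(~pred & (y_true == 1))       # missed churners
        costs.append(fp * cost_fp + fn * cost_fn)
    i = int(np.argmin(costs))
    return float(grid[i]), float(costs[i])


# Invented labels and predicted churn probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
proba = [0.1, 0.2, 0.3, 0.4, 0.35, 0.55, 0.6, 0.7, 0.8, 0.9]
grid = [0.3, 0.5, 0.7]

# Missing a churner is 5x worse than a wasted offer: prefer a low threshold.
recall_first = best_threshold(y_true, proba, cost_fp=1, cost_fn=5, grid=grid)
# A wasted offer is 5x worse than a missed churner: prefer a high threshold.
precision_first = best_threshold(y_true, proba, cost_fp=5, cost_fn=1, grid=grid)

print(recall_first)     # (0.3, 3.0)
print(precision_first)  # (0.7, 2.0)
```

The same model yields different operating points depending on the cost ratio, which is why accuracy at the default 0.5 threshold is only part of the picture.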

Results & Next Steps

  • End-to-end cycle: business framing → wrangling → modeling → evaluation → interpretation
  • Strong baseline with LR/RF at ~80% accuracy and ~0.84 ROC-AUC

Planned Improvements

  • XGBoost with hyperparameter tuning
  • SHAP for customer-level explanations
  • Deploy a dashboard for retention teams

Completed: August 2025 • Author: Haseeb Sagheer • Stack: Python, pandas, scikit-learn, seaborn, matplotlib

Notebook (HTML Preview)

customer_churn.html