Portfolio Case Study

Titanic Survival Prediction

Predict whether a passenger survived the Titanic disaster using a simple pipeline: data loading → EDA → preprocessing → Logistic Regression → evaluation → Kaggle submission.

  • Validation accuracy: ~81%
  • Kaggle public score: 0.76
  • Model: Logistic Regression

Quick Facts

  • Type: Supervised binary classification
  • Target: Survived (0 = no, 1 = yes)
  • Main features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
  • Why Logistic Regression? A simple, interpretable baseline that is hard to beat without feature work

📌 Project Overview

This project predicts passenger survival on the Titanic using the classic Kaggle dataset. The flow is straightforward: load the data, explore it, clean it, train a logistic regression model, and evaluate.

1. Load train.csv and test.csv
2. Explore with summary stats and missing values
3. Clean & encode features
4. Train Logistic Regression
5. Validate and tune basics
6. Predict on test and submit to Kaggle

Loading & Exploring Data

Basic checks used:

  • data.info(), data.describe(), data.head()
  • data.isnull().sum() to inspect missing values
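The checks above can be sketched as follows; a tiny inline sample stands in for the real train.csv here, so the column set is abbreviated:

```python
import io
import pandas as pd

# Inline stand-in for train.csv; in the project this is pd.read_csv("train.csv").
csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked\n"
    "1,0,3,male,22,7.25,S\n"
    "2,1,1,female,38,71.28,C\n"
    "3,1,3,female,,7.92,S\n"
)
data = pd.read_csv(csv)

print(data.head())          # first rows
print(data.describe())      # summary stats for numeric columns
print(data.isnull().sum())  # missing-value count per column
```

The `isnull().sum()` output is what surfaces the missing-value profile summarized in the table below.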

Missing Values (typical)

Column      Missing
Age         ~20%
Cabin       High (dropped)
Embarked    Few (imputed)

Note: This mirrors the common profile of the Titanic dataset.

Data Cleaning & Preprocessing

  • Dropped Cabin due to too many missing values
  • Encoded Sex as binary (male→1, female→0)
  • Imputed missing values for Age and Embarked
  • Assembled a reusable cleaning function for consistency

Result: A tidy numeric dataset ready for model training.
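A minimal sketch of such a reusable cleaning function. The imputation choices here (median for Age, mode for Embarked) and the integer encoding of Embarked are assumptions for illustration; the write-up only states that those columns were imputed:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """One cleaning path, applied identically to train and test."""
    df = df.copy()
    df = df.drop(columns=["Cabin"], errors="ignore")          # mostly missing
    df["Sex"] = df["Sex"].map({"male": 1, "female": 0})       # binary encode
    df["Age"] = df["Age"].fillna(df["Age"].median())          # assumed: median impute
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # assumed: mode impute
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})     # assumed encoding
    return df

# Hypothetical sample to show the effect:
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 38.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "C"],
})
cleaned = clean(sample)
print(cleaned)
```

Keeping all cleaning in one function is what guarantees train and test receive identical preprocessing.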

Workflow Highlights

  • Clear train/validation split
  • Simple baseline with Logistic Regression
  • Clean feature set to avoid leakage
  • Consistent preprocessing for train and test

What I’d Improve Next

  • Feature engineering (titles, family size, ticket groups)
  • Regularization tuning
  • Model comparison (tree-based ensembles)

Model Training

Estimator: sklearn.linear_model.LogisticRegression

  • Baseline with sensible defaults
  • Focus on clean inputs and straightforward validation
  • Quick interpretability of feature effects
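The split-and-fit step can be sketched like this. Synthetic data stands in for the cleaned Titanic features (the toy target is driven entirely by Sex), so the printed accuracy is illustrative, not the project's ~81%:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned training frame.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, 200),
    "Sex": rng.integers(0, 2, 200),
    "Age": rng.uniform(1, 80, 200),
    "Fare": rng.uniform(5, 100, 200),
})
y = (X["Sex"] == 0).astype(int)  # toy target, perfectly separable on Sex

# Clear train/validation split, stratified on the target.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)  # sensible defaults, more iterations
model.fit(X_train, y_train)
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")
```

The fitted `model.coef_` gives the quick interpretability mentioned above: one signed weight per feature.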

Validation Snapshot

Metric              Value
Accuracy            ~81%
Confusion matrix    Balanced baseline
Notes               Good foundation; more features can improve

Tip: Feature engineering and model ensembles can push this further.
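For reference, computing these validation metrics takes two scikit-learn calls; the label arrays below are hypothetical, not the project's actual validation set:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_val  = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical validation labels
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]   # hypothetical model predictions

acc = accuracy_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)  # rows = actual (0, 1), cols = predicted (0, 1)
print(acc)
print(cm)
```

A roughly symmetric off-diagonal in the confusion matrix is what "balanced baseline" refers to: neither class is systematically favored.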

Testing with Unseen Data

Predictions were generated for test.csv and submitted to Kaggle.

Submission                      Public Score    Status
Titanic_LogReg_Baseline.csv     0.76            On leaderboard
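Building the submission file is a two-column CSV in Kaggle's expected format. The ids and predictions below are placeholders; in the project they come from test.csv and `model.predict` on the cleaned test features:

```python
import pandas as pd

# Placeholder values; real ones come from test.csv and the trained model.
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})
submission.to_csv("Titanic_LogReg_Baseline.csv", index=False)
print(submission.to_csv(index=False))
```

`index=False` matters: Kaggle rejects files with an extra unnamed index column.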

✅ Conclusion & Learnings

  • Solid end-to-end workflow from data to predictions
  • Clean preprocessing drives stable results
  • Ready to extend with new features and stacks

Next: Reuse this structure on other datasets and iterate on features.

Notebook (HTML Preview)

titanic_eda.html
Open in new tab
If the preview doesn’t load on your host, the direct link above will open the notebook in a new tab.