Portfolio Case Study

Titanic Survival Prediction

Predict whether a passenger survived the Titanic disaster using a simple pipeline: data loading → EDA → preprocessing → Logistic Regression → evaluation → Kaggle submission.

  • Validation accuracy: ~81%
  • Kaggle public score: 0.76
  • Model: Logistic Regression

Quick Facts

  • Type: Supervised binary classification
  • Target: Survived (0 = no, 1 = yes)
  • Main features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
  • Why Logistic Regression? A simple, interpretable baseline that is hard to beat without feature work

📌 Project Overview

This project predicts passenger survival on the Titanic using the classic Kaggle dataset. The flow is straightforward: load the data, explore it, clean it, train a logistic regression model, and evaluate.

1. Load train.csv and test.csv
2. Explore with summary stats and missing values
3. Clean & encode features
4. Train Logistic Regression
5. Validate and tune basics
6. Predict on test and submit to Kaggle

Loading & Exploring Data

Basic checks used:

  • data.info(), data.describe(), data.head()
  • data.isnull().sum() to inspect missing values
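The checks above can be sketched as follows; a tiny inline sample stands in for the real train.csv here, so the column set is abbreviated:

```python
import io
import pandas as pd

# Inline stand-in for train.csv; in the project this is pd.read_csv("train.csv").
csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked\n"
    "1,0,3,male,22,7.25,S\n"
    "2,1,1,female,38,71.28,C\n"
    "3,1,3,female,,7.92,S\n"
)
data = pd.read_csv(csv)

print(data.head())          # first rows
print(data.describe())      # summary stats for numeric columns
print(data.isnull().sum())  # missing-value count per column
```

The `isnull().sum()` output is what surfaces the missing-value profile summarized in the table below.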

Missing Values (typical)

Column      Missing
Age         ~20%
Cabin       High (dropped)
Embarked    Few (imputed)

Note: This mirrors the common profile of the Titanic dataset.

Data Cleaning & Preprocessing

  • Dropped Cabin due to too many missing values
  • Encoded Sex as binary (male→1, female→0)
  • Imputed missing values for Age and Embarked
  • Assembled a reusable cleaning function for consistency

Result: A tidy numeric dataset ready for model training.
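A minimal sketch of such a reusable cleaning function. The imputation choices here (median for Age, mode for Embarked) and the integer encoding of Embarked are assumptions for illustration; the write-up only states that those columns were imputed:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """One cleaning path, applied identically to train and test."""
    df = df.copy()
    df = df.drop(columns=["Cabin"], errors="ignore")          # mostly missing
    df["Sex"] = df["Sex"].map({"male": 1, "female": 0})       # binary encode
    df["Age"] = df["Age"].fillna(df["Age"].median())          # assumed: median impute
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # assumed: mode impute
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})     # assumed encoding
    return df

# Hypothetical sample to show the effect:
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 38.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "C"],
})
cleaned = clean(sample)
print(cleaned)
```

Keeping all cleaning in one function is what guarantees train and test receive identical preprocessing.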

Workflow Highlights

  • Clear train/validation split
  • Simple baseline with Logistic Regression
  • Clean feature set to avoid leakage
  • Consistent preprocessing for train and test

What I’d Improve Next

  • Feature engineering (titles, family size, ticket groups)
  • Regularization tuning
  • Model comparison (tree-based ensembles)

Model Training

Estimator: sklearn.linear_model.LogisticRegression

  • Baseline with sensible defaults
  • Focus on clean inputs and straightforward validation
  • Quick interpretability of feature effects
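The split-and-fit step can be sketched like this. Synthetic data stands in for the cleaned Titanic features (the toy target is driven entirely by Sex), so the printed accuracy is illustrative, not the project's ~81%:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned training frame.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, 200),
    "Sex": rng.integers(0, 2, 200),
    "Age": rng.uniform(1, 80, 200),
    "Fare": rng.uniform(5, 100, 200),
})
y = (X["Sex"] == 0).astype(int)  # toy target, perfectly separable on Sex

# Clear train/validation split, stratified on the target.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)  # sensible defaults, more iterations
model.fit(X_train, y_train)
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")
```

The fitted `model.coef_` gives the quick interpretability mentioned above: one signed weight per feature.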

Validation Snapshot

Metric              Value
Accuracy            ~81%
Confusion matrix    Balanced baseline
Notes               Good foundation; more features can improve

Tip: Feature engineering and model ensembles can push this further.
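For reference, computing these validation metrics takes two scikit-learn calls; the label arrays below are hypothetical, not the project's actual validation set:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_val  = [0, 0, 1, 1, 1, 0, 1, 0]   # hypothetical validation labels
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]   # hypothetical model predictions

acc = accuracy_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)  # rows = actual (0, 1), cols = predicted (0, 1)
print(acc)
print(cm)
```

A roughly symmetric off-diagonal in the confusion matrix is what "balanced baseline" refers to: neither class is systematically favored.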

Testing with Unseen Data

Predictions were generated for test.csv and submitted to Kaggle.

Submission                      Public Score    Status
Titanic_LogReg_Baseline.csv     0.76            On leaderboard
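Building the submission file is a two-column CSV in Kaggle's expected format. The ids and predictions below are placeholders; in the project they come from test.csv and `model.predict` on the cleaned test features:

```python
import pandas as pd

# Placeholder values; real ones come from test.csv and the trained model.
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})
submission.to_csv("Titanic_LogReg_Baseline.csv", index=False)
print(submission.to_csv(index=False))
```

`index=False` matters: Kaggle rejects files with an extra unnamed index column.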

✅ Conclusion & Learnings

  • Solid end-to-end workflow from data to predictions
  • Clean preprocessing drives stable results
  • Ready to extend with new features and stacks

Next: Reuse this structure on other datasets and iterate on features.

Notebook (HTML Preview)

titanic_eda.html
Open in new tab
If the preview doesn’t load on your host, the direct link above will open the notebook in a new tab.