Main features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
Why Logistic Regression? A simple, solid, and interpretable baseline
📌 Project Overview
This project predicts passenger survival on the Titanic using the classic Kaggle dataset.
The flow is straightforward: load the data, explore it, clean it, train a logistic regression model, and evaluate.
1. Load train.csv and test.csv
2. Explore with summary stats and missing values
3. Clean & encode features
4. Train Logistic Regression
5. Validate and tune basics
6. Predict on test and submit to Kaggle
Loading & Exploring Data
Basic checks used:
data.info(), data.describe(), data.head()
data.isnull().sum() to inspect missing values
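A minimal sketch of these checks, using a tiny hand-made DataFrame as a stand-in for train.csv (the rows and values here are hypothetical; the real data has the same columns):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv: same columns, made-up rows
data = pd.DataFrame({
    "Pclass":   [3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, 38.0, np.nan, 35.0],
    "SibSp":    [1, 1, 0, 0],
    "Parch":    [0, 0, 0, 0],
    "Fare":     [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", np.nan, "S"],
})

print(data.head())        # first rows
print(data.describe())    # summary stats for numeric columns
missing = data.isnull().sum()  # per-column missing-value counts
print(missing)
```

On the real dataset, `data.isnull().sum()` is what surfaces the Age/Cabin/Embarked gaps summarized in the table below.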
Missing Values (typical)

| Column   | Missing        |
|----------|----------------|
| Age      | ~20%           |
| Cabin    | High (dropped) |
| Embarked | Few (imputed)  |
Note: This mirrors the common profile of the Titanic dataset.
Data Cleaning & Preprocessing
Dropped Cabin due to too many missing values
Encoded Sex as binary (male→1, female→0)
Imputed missing values for Age and Embarked
Assembled a reusable cleaning function for consistency
Result: A tidy numeric dataset ready for model training.
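The reusable cleaning function could look like the sketch below. The helper name `clean` and the numeric encoding for Embarked (S→0, C→1, Q→2) are assumptions for illustration, not the project's exact code:

```python
import numpy as np
import pandas as pd

def clean(df):
    """Hypothetical cleaning helper mirroring the steps above."""
    df = df.copy()
    df = df.drop(columns=["Cabin"], errors="ignore")      # too many missing values
    df["Sex"] = df["Sex"].map({"male": 1, "female": 0})   # binary encode
    df["Age"] = df["Age"].fillna(df["Age"].median())      # impute Age with median
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})  # assumed encoding
    return df

# Toy example with made-up rows
raw = pd.DataFrame({
    "Sex":      ["male", "female", "male"],
    "Age":      [22.0, np.nan, 35.0],
    "Embarked": ["S", "C", np.nan],
    "Cabin":    [np.nan, "C85", np.nan],
})
tidy = clean(raw)
```

Applying the same function to both train.csv and test.csv is what keeps preprocessing consistent across the two splits.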
Workflow Highlights
Clear train/validation split
Simple baseline with Logistic Regression
Clean feature set to avoid leakage
Consistent preprocessing for train and test
What I’d Improve Next
Feature engineering (titles, family size, ticket groups)
Regularization tuning
Model comparison (tree-based ensembles)
Model Training
Estimator: sklearn.linear_model.LogisticRegression
Baseline with sensible defaults
Focus on clean inputs and straightforward validation
Quick interpretability of feature effects
Validation Snapshot

| Metric           | Value                                       |
|------------------|---------------------------------------------|
| Accuracy         | ~81%                                        |
| Confusion Matrix | Balanced baseline                           |
| Notes            | Good foundation; more features can improve  |
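The snapshot metrics above can be computed as in this small sketch; the label and prediction arrays here are hypothetical, not the project's actual validation output:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical validation labels and model predictions
y_val  = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])

acc = accuracy_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)  # rows: true class, cols: predicted class
print(acc)
print(cm)
```

A roughly symmetric confusion matrix, as here, is what "balanced baseline" refers to: errors are spread similarly across both classes rather than concentrated in one.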
Tip: Feature engineering and model ensembles can push this further.
Testing with Unseen Data
Predictions were generated for test.csv and submitted to Kaggle.
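Building the submission file follows Kaggle's required two-column format. A sketch, with hypothetical PassengerId values and predictions standing in for `model.predict(X_test)`:

```python
import numpy as np
import pandas as pd

# Hypothetical test IDs and predictions; in the real flow,
# preds = model.predict(X_test) on the cleaned test.csv features
passenger_ids = np.array([892, 893, 894])
preds = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": preds})
submission.to_csv("Titanic_LogReg_Baseline.csv", index=False)
```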
| Submission                  | Public Score | Status         |
|-----------------------------|--------------|----------------|
| Titanic_LogReg_Baseline.csv | 0.76         | On Leaderboard |
✅ Conclusion & Learnings
Solid end-to-end workflow from data to predictions
Clean preprocessing drives stable results
Ready to extend with new features and stronger models (e.g. ensembles or stacking)
Next: Reuse this structure on other datasets and iterate on features.