The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew aboard; hence the nickname "DieTanic" for this unforgettable disaster.
Building the Titanic cost about $7.5 million, yet she sank after a single collision. The Titanic dataset is an excellent starting point for beginners to begin their data science journey and to enter Kaggle competitions.
The objective of this notebook is to give an idea of the workflow of a typical predictive modeling problem: how to analyze features, how to engineer new ones, how to make existing features useful to the model, and some machine learning concepts along the way. I have tried to keep the notebook as basic as possible so that even newcomers can follow every phase of it.
We will predict whether a passenger on the Titanic survived, based on the passenger's data (name, age, ticket fare, etc.).
- Exploratory Data Analysis
- Feature Engineering
- Define and train the ML models
- Evaluate the performance of our trained models on the validation set
- Select the best model and predict the targets for the test set
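The steps above can be sketched end to end in a few lines. This is a minimal illustration, not the notebook's actual pipeline: a tiny hypothetical DataFrame stands in for Kaggle's `train.csv`, and the feature set is a simplified, pre-encoded subset.

```python
# Minimal sketch of the workflow: load data, split off a validation set,
# train a model, and score it. The DataFrame below is synthetic stand-in
# data, not the real Titanic training file.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1],
    "Sex":    [0, 1, 0, 1, 0, 1, 1, 0],   # already encoded: 0 = female, 1 = male
    "Age":    [29, 22, 35, 4, 58, 30, 19, 41],
    "Fare":   [71.3, 7.9, 26.0, 16.7, 51.9, 13.0, 8.0, 83.5],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 1],
})

features = ["Pclass", "Sex", "Age", "Fare"]
X_train, X_val, y_train, y_val = train_test_split(
    train[features], train["Survived"], test_size=0.25, random_state=0
)

# Fit on the training split, then evaluate on the held-out validation split.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_acc:.2f}")
```

In the real notebook, the same split-train-evaluate loop is repeated for each candidate model before choosing one for the test-set predictions.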
The Titanic dataset comes from Kaggle's "Titanic: Machine Learning from Disaster" competition.
Random Forest, XGBoost and Logistic Regression.
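The three candidate models can be set up as shown below. This is a sketch with assumed default-ish hyperparameters; XGBoost is an external package, so if it is not installed the sketch falls back to scikit-learn's `GradientBoostingClassifier` as a similar gradient-boosting stand-in.

```python
# Define the three candidate models named above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

try:
    # xgboost is a separate package and may not be available everywhere.
    from xgboost import XGBClassifier
    boosted_model = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    # Fallback: a comparable gradient-boosting model from scikit-learn.
    from sklearn.ensemble import GradientBoostingClassifier
    boosted_model = GradientBoostingClassifier(random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "XGBoost": boosted_model,
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
```

Keeping the models in a dict makes it easy to loop over them and compare their validation scores side by side.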
The best validation score among the models is achieved by the Random Forest model, at 83%. On submitting its test-set predictions to the Kaggle competition, the model achieved a public score of 77.033.
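For reference, a Kaggle submission for this competition is a two-column CSV of `PassengerId` and the predicted `Survived` label. The sketch below uses hypothetical placeholder IDs and predictions; in practice they come from `test.csv` and `model.predict(X_test)`.

```python
# Sketch of building a submission file. PassengerIds and predictions
# here are hypothetical placeholders standing in for the real test set.
import pandas as pd

passenger_ids = [892, 893, 894, 895]   # in practice: test["PassengerId"]
predictions = [0, 1, 0, 1]             # in practice: best_model.predict(X_test)

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})
submission.to_csv("submission.csv", index=False)
```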