Titanic Kaggle Dataset

April 5, 2017
The Titanic dataset analysed with a multiclass decision forest algorithm, working on the training and test datasets.
While working on the Titanic dataset I had a great learning experience: how to deal with non-categorical features, how to handle missing values in a particular feature, and, most importantly, how to play around with the parameters of a particular algorithm or try different algorithms to best fit the data.

First of all, I started with the training set of the Titanic Kaggle dataset, which contains 891 rows and 12 columns representing 12 different features such as PassengerId, Sex, Parch, Pclass, etc. I selected the features to work with and dropped some of little value, such as PassengerId, Name, and Ticket. Then, in the Edit Metadata step, some features like Pclass, Sex, Embarked, and Cabin were cast to categorical types. Missing numeric values were replaced with the median, and missing categorical values with the mode.

Next, the data was split in a 70-30 ratio (training-testing) and the model was trained using the Multiclass Decision Forest algorithm, after which the model was scored and evaluated. The Edit Metadata and Clean Missing Data steps were then repeated for the test dataset, and the model was finally trained on 100% of the training data. Cross-validation was run to check the range of the model's accuracy, with each fold holding out about 10% of the dataset.

After scoring the final model, I selected the columns wanted in the output file, which in our case are "PassengerId" and "Scored Labels". For the final submission, the "Scored Labels" column was renamed to "Survived". Finally, the result was converted to .csv format so that it could be downloaded for the final submission on Kaggle.
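The preprocessing steps above (dropping low-value columns, casting to categorical, imputing median/mode) were done with Azure ML Studio's drag-and-drop modules; a minimal pandas sketch of the same idea, using a tiny made-up sample in place of the real 891-row file, might look like this:

```python
import pandas as pd

# Toy stand-in for the Titanic training data (the real file has 891 rows, 12 columns).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", None],
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Drop features of little predictive value.
df = df.drop(columns=["PassengerId", "Name", "Ticket"])

# Cast selected features to categorical (the Edit Metadata step).
for col in ["Pclass", "Sex", "Embarked"]:
    df[col] = df[col].astype("category")

# Replace missing numeric values with the median...
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

# ...and missing categorical values with the mode (Clean Missing Data step).
for col in df.select_dtypes(include="category").columns:
    df[col] = df[col].fillna(df[col].mode()[0])
```

The same transformations would be applied unchanged to the test file before scoring, exactly as the Studio pipeline repeats Edit Metadata and Clean Missing Data for the test dataset.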
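The 70-30 split, training, scoring, and cross-validation steps can be sketched with scikit-learn; here `RandomForestClassifier` is only a rough stand-in for Azure ML's Multiclass Decision Forest module, and the feature matrix is synthetic rather than the actual Titanic features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic features/labels standing in for the preprocessed Titanic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split the data 70-30 (training-testing).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train the decision-forest model and score it on the held-out 30%.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# 10-fold cross-validation: each fold holds out ~10% of the data,
# giving a range of accuracies rather than a single number.
cv_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
```

Looking at the spread of `cv_scores` (min to max) is what gives the "range of accuracy" the post mentions, and is more informative than the single test-set score.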
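The final submission steps (keep only "PassengerId" and the scored labels, rename to "Survived", export .csv) reduce to a few lines of pandas; the scored values below are hypothetical placeholders, not real model output:

```python
import pandas as pd

# Hypothetical scored test rows: PassengerId plus the model's "Scored Labels".
scored = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Scored Labels": [0, 1, 0],
})

# Select the two output columns, rename for Kaggle, and write the .csv.
submission = scored[["PassengerId", "Scored Labels"]].rename(
    columns={"Scored Labels": "Survived"})
submission.to_csv("submission.csv", index=False)
```

Kaggle's Titanic competition expects exactly these two columns, so `index=False` matters: otherwise pandas adds a third, unnamed index column that fails submission validation.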