Kaggle - Titanic: Machine Learning from Disaster - Training experiment
This experiment trains models to predict accurately who survived the Titanic disaster.
<h1>Kaggle: Machine Learning from Disaster</h1>
<h2>Objective</h2>
<p>The objective of the competition is to predict accurately who survived the Titanic disaster.</p>
<h2>Data available</h2>
<p>A set of labelled data is available to train models and a set of unlabelled data is available to make and submit predictions.</p>
<p> The training data contains about 900 observations and 12 columns: 10 features, 1 label column, and 1 identifier column used to identify each row.</p>
<h2> Reshaping data</h2>
<p> Some columns, such as Survived and Pclass, needed to be reinterpreted as categorical data. Others, such as Age, Fare and Embarked, had missing values. Missing values were replaced by the mean for numeric columns and by the mode for categorical columns. No outliers were found.</p>
<p> Name, PassengerId and Ticket were removed because it is hard to see how they would affect survival. The Cabin feature was also removed because it contained too many missing values and there was no way to reconstruct the missing information.</p>
<p> Finally, the numerical features Age and Fare were grouped into bins because their distributions were heavily skewed. </p>
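<p> The cleaning steps above can be sketched outside Azure Machine Learning Studio with pandas. This is only an illustration on a tiny made-up sample that reuses the Kaggle column names. The helper name reshape is hypothetical, and the experiment itself used Studio modules rather than this code.</p>

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that reuses the Kaggle column names (not the real data).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, np.nan, 30.0],
    "Fare": [7.25, 71.28, np.nan, 13.0],
    "Embarked": ["S", "C", None, "S"],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Cabin": [None, "C85", None, None],
})


def reshape(frame):
    """Hypothetical helper: drop unused columns, impute, cast to categorical."""
    # Columns judged uninformative or too sparse to repair.
    out = frame.drop(columns=["Name", "PassengerId", "Ticket", "Cabin"])

    # Numeric columns: replace missing values with the column mean.
    for col in ["Age", "Fare"]:
        out[col] = out[col].fillna(out[col].mean())

    # Categorical column: replace missing values with the most frequent value.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # Reinterpret coded columns as categorical data.
    for col in ["Survived", "Pclass", "Embarked"]:
        out[col] = out[col].astype("category")
    return out


clean = reshape(df)
```

<p> On real data, the mean and mode should be computed on the training set only and then reused for the unlabelled set, so that both sets are imputed consistently.</p>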
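<p> As an illustration of binning, the sketch below groups made-up Age and Fare values with pandas. The bin edges, labels and number of quantiles are assumptions chosen for the example; the description does not state which bins the experiment used.</p>

```python
import pandas as pd

# Made-up values, not the real data.
ages = pd.Series([2, 15, 22, 38, 64, 80])
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])

# Fixed-width style binning with hand-picked (hypothetical) edges.
age_edges = [0, 12, 18, 40, 60, 120]
age_labels = ["child", "teen", "adult", "middle_aged", "senior"]
age_group = pd.cut(ages, bins=age_edges, labels=age_labels)

# Quantile binning: each bin receives roughly the same number of rows,
# which keeps a long right tail (such as very high fares) from dominating.
fare_group = pd.qcut(fares, q=4, labels=False)
```

<p> Quantile bins are a common choice for skewed features because the few extreme values end up sharing the top bin instead of stretching the scale.</p>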
<h2> Feature engineering </h2>
<p> Two columns describe the composition of each passenger's family aboard. They were used to build two derived features: one that indicates whether the passenger had any family aboard, and another that gives the number of family members aboard. </p>
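<p> A minimal sketch of these two derived features, assuming the two family columns are SibSp and Parch as in the Kaggle data set. The names FamilySize and HasFamily are hypothetical, chosen for the example.</p>

```python
import pandas as pd

# Made-up rows; SibSp = siblings/spouses aboard, Parch = parents/children aboard.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Number of family members aboard (the passenger is not counted).
df["FamilySize"] = df["SibSp"] + df["Parch"]

# Whether the passenger travelled with at least one family member.
df["HasFamily"] = df["FamilySize"] > 0
```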
<h2> Feature selection </h2>
<p> Features were selected by running a statistical test (ANOVA or chi-square) of the association between each feature and the label. </p>
<p>For example, the tests indicated that having family aboard the Titanic is associated with survival, whereas the size of the family is not. A chi-square test showed that the Embarked feature is associated with the Survived label.</p>
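<p> The two kinds of test can be sketched with SciPy as follows. The data are made up, so the resulting p-values say nothing about the real Titanic data, and the FamilySize column name is hypothetical.</p>

```python
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

# Made-up sample, not the real data.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1],
    "Embarked": list("SSSSCCCCSCQQ"),
    "FamilySize": [0, 1, 0, 4, 1, 2, 0, 1, 5, 1, 0, 2],
})

# Chi-square test of independence: categorical feature vs. categorical label.
table = pd.crosstab(df["Embarked"], df["Survived"])
chi2, chi2_p, dof, expected = chi2_contingency(table)

# One-way ANOVA: does a numeric feature differ between the label groups?
died = df.loc[df["Survived"] == 0, "FamilySize"]
lived = df.loc[df["Survived"] == 1, "FamilySize"]
f_stat, anova_p = f_oneway(died, lived)

# A small p-value (for example below 0.05) suggests an association
# between the feature and the label; it is evidence, not proof.
```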
<h2> Model training </h2>
<p>Several models based on various classification methods (Neural Network, Support Vector Machine, Naive Bayes, Logistic Regression, Boosted Decision Tree and Random Forest) were trained and tuned using cross-validation and a parameter sweep. The boosted decision tree appeared to perform best among these models.</p>
<h2> Making predictions </h2>
<p> A section of the experiment is dedicated to producing a .csv file that contains predictions and can be submitted directly to Kaggle. </p>
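<p> The experiment did this with Azure Machine Learning Studio modules. As a rough equivalent, the sketch below uses scikit-learn (a swapped-in library, not what the experiment ran) to combine cross-validation with a grid-style parameter sweep for three of the model families. The synthetic data and the parameter grids are assumptions made for the example.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared Titanic features and label.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Each candidate: (estimator, hypothetical grid of parameters to sweep).
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
    "boosted_trees": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validated accuracy for every parameter combination.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Model family with the highest mean cross-validated accuracy.
best_name = max(results, key=lambda n: results[n][0])
```

<p> Comparing models on cross-validated scores rather than on training accuracy reduces the risk of picking a model that merely overfits.</p>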
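<p> For reference, the Titanic competition expects a file with a PassengerId column and a Survived column. A minimal pandas sketch of building such a file is shown below; the helper name make_submission, the file name submission.csv and the sample values are all hypothetical.</p>

```python
import pandas as pd


def make_submission(passenger_ids, predictions):
    """Hypothetical helper: pair each test-set id with its predicted label."""
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })


# Made-up ids and predictions, only to show the shape of the file.
submission = make_submission([892, 893, 894], [0, 1, 0])

# Without a path, to_csv returns the CSV text; index=False omits the row index.
csv_text = submission.to_csv(index=False)

# To write the file instead:
# submission.to_csv("submission.csv", index=False)
```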
<h2> Rating </h2>
<p> Predictions were 77.033% accurate on the public leaderboard, which is computed from half of the predictions. The other half will be used for the final ranking at the end of the competition. </p>
<p> Public leaderboard result : 6834 / 11239 </p>
<h2> Conclusion </h2>
<p> This was a good first machine learning experience with Azure Machine Learning Studio. The purpose was to apply the knowledge gained from the Microsoft MOOC on data science (edX). </p>