Titanic Kaggle Competition

June 7, 2016
This is a template experiment for building a model and submitting prediction results to the Titanic Kaggle competition.
# Data

This version of the Titanic dataset can be retrieved from the [Kaggle](https://www.kaggle.com/c/titanic-gettingStarted/data) website, specifically their “train” data (59.76 KB). The train Titanic data ships with 891 rows, each one pertaining to an occupant of the RMS Titanic on the night of its sinking. The dataset also has 12 columns, each recording an attribute of that occupant’s circumstances and demographics. For this particular experiment we will build a classification model that predicts whether or not someone would survive the Titanic disaster given the same circumstances and demographics.

# Data cleansing

First, some preprocessing. It is highly recommended that you read the [detailed tutorial](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/) to understand the rationale behind each step:

* Drop the columns that do not add immediate value for data mining or hold too many missing categorical values to be a reliable predictor attribute. The following columns were dropped using the **select columns in dataset** module:
  * PassengerId, Name, Ticket, Cabin
* Identify categorical attributes and cast them into categorical features using the **edit metadata** module. The following attributes were cast into categorical values:
  * Survived, Pclass, Sex, Embarked
* Scrub the missing values using the **clean missing data** module:
  * All missing values in numeric columns were replaced with the median value of the column
  * All missing values in categorical columns were replaced with the mode value of the column

# Algorithm Selection

We chose a **two-class decision forest** as the learning algorithm.

# Experimentation

* Randomly split and partition the data into 70% training and 30% scoring using the **split** module.
* Train the model using the **train model** module.
* Use the **score model** module to get predictions from our model on the 30% test set from the **split data** module.
* Evaluation metrics are given in the **evaluate model** module.

# Cross Validation

* Perform a 10-fold cross validation to evaluate the mean and variance of the classifier's performance.

# Submission

* Once the experimentation and evaluation are done, retrain the model on 100% of the data.
* Feed Kaggle's test set into the experiment as a parallel workflow.
* Follow the same cleaning methods except:
  * Keep PassengerId
  * Remove all references to Survived, since it does not exist in this dataset
* Get predictions from the **score model** module using the 100% trained model.
* Select only the PassengerId and Scored Label columns from the **score model** module using the **select columns in dataset** module.
* Rename the "Scored Label" column to "Survived".
* Convert the dataset to CSV using the **convert to csv** module.
* Run the model, then right-click the **convert to csv** module to download the CSV to submit to Kaggle.
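The steps above are drag-and-drop modules in Azure Machine Learning Studio rather than code. As a rough equivalent, here is a minimal Python sketch of the cleansing, training, evaluation, and cross-validation steps, assuming the Kaggle training file is saved locally as `train.csv` and using scikit-learn's `RandomForestClassifier` as a stand-in for the **two-class decision forest** module; the hyperparameters shown are illustrative assumptions, not values from the experiment.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

def clean(df):
    """Mirror the cleansing modules: drop low-value columns, cast categoricals,
    impute numeric columns with the median and categorical columns with the mode."""
    df = df.drop(columns=["Name", "Ticket", "Cabin"], errors="ignore")
    for col in ["Survived", "Pclass", "Sex", "Embarked"]:
        if col in df.columns:
            df[col] = df[col].astype("category")
    for col in df.columns:
        if str(df[col].dtype) in ("category", "object"):
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            df[col] = df[col].fillna(df[col].median())
    return df

# PassengerId is dropped for training, as in the select columns in dataset module.
train = clean(pd.read_csv("train.csv").drop(columns=["PassengerId"]))

# One-hot encode the categorical features so the forest can consume them.
X = pd.get_dummies(train.drop(columns=["Survived"]))
y = train["Survived"].astype(int)

# 70/30 random split (split module), train (train model module),
# score the 30% hold-out (score model module), report accuracy (evaluate model module).
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_holdout, model.predict(X_holdout)))

# 10-fold cross validation for the mean and variance of the classifier's performance.
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=10)
print("10-fold CV accuracy: mean=%.3f, std=%.3f" % (scores.mean(), scores.std()))
```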
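Continuing the sketch above, the submission steps might look like the following; the file names `test.csv` and `submission.csv` are assumptions, and the column re-alignment stands in for feeding Kaggle's test set through the same cleaning workflow in parallel.

```python
# Retrain on 100% of the training data once experimentation is finished.
final_model = RandomForestClassifier(n_estimators=100, random_state=0)
final_model.fit(X, y)

# Kaggle's test set: same cleaning, but keep PassengerId; there is no Survived column.
test = clean(pd.read_csv("test.csv"))
passenger_id = test["PassengerId"]

# Align the one-hot encoded test features with the training feature columns.
X_kaggle = pd.get_dummies(test.drop(columns=["PassengerId"])).reindex(columns=X.columns, fill_value=0)

# Score, keep only PassengerId and the predictions, name the prediction column Survived,
# and write the CSV to submit to Kaggle.
submission = pd.DataFrame({"PassengerId": passenger_id, "Survived": final_model.predict(X_kaggle)})
submission.to_csv("submission.csv", index=False)
```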
# Related

1. [Detailed Tutorial: Building and deploying a classification model in Azure Machine Learning Studio](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/)
2. [Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)
3. [Tutorial: Creating a random forest regression model in R and using it for scoring](https://gallery.azureml.net/Details/b729c21014a34955b20fa94dc13390e5)
4. [Tutorial: Obtaining feature importance using variable importance plots](https://gallery.azureml.net/Details/964dfc4151e24511aa5f78159cab0485)