Titanic Kaggle Competition
This is a template experiment for building a classification model and submitting its predictions to the Titanic Kaggle competition.
# Data
This version of the Titanic dataset can be retrieved from the [Kaggle](https://www.kaggle.com/c/titanic-gettingStarted/data) website, specifically their “train” data (59.76 KB). The training data contains 891 rows, each pertaining to a passenger aboard the RMS Titanic on the night of its sinking.
The dataset also has 12 columns, each recording an attribute of a passenger’s circumstances and demographics. For this experiment we will build a classification model that predicts whether or not someone would survive the Titanic disaster given the same circumstances and demographics.
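For readers following along outside Azure ML Studio, here is a minimal pandas sketch of loading the data (it assumes `train.csv` has been downloaded from the competition page into the working directory; the file name and path are assumptions):

```python
import pandas as pd

# Load the Kaggle training data (assumes train.csv was downloaded locally).
train = pd.read_csv("train.csv")

print(train.shape)             # expected: (891, 12)
print(train.columns.tolist())  # the 12 attribute columns
```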
# Data cleansing
First, some preprocessing. It is highly recommended that you read the [detailed tutorial](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/) to understand the rationale behind each step:
* Drop the columns that do not add immediate value for data mining or that hold too many missing values to be reliable predictors. The following columns were dropped using the **select columns in dataset** module:
    * PassengerId, Name, Ticket, Cabin
* Identify categorical attributes and cast them into categorical features using the **edit metadata** module. The following attributes were cast into categorical values:
    * Survived, Pclass, Sex, Embarked
* Scrub the missing values from the following columns using the **clean missing data** module (see the sketch after this list):
    * All missing values in numeric columns were replaced with the median value of the entire column
    * All missing values in categorical columns were replaced with the mode value of the entire column
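As an illustration, here is a rough pandas equivalent of the three cleansing steps above. This is a sketch, not the Studio modules themselves; it again assumes `train.csv` sits in the working directory:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Drop columns that add little value for mining or hold too many
# missing values to be reliable predictors.
train = train.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# Cast categorical attributes (the "edit metadata" step).
for col in ["Survived", "Pclass", "Sex", "Embarked"]:
    train[col] = train[col].astype("category")

# Impute: median for numeric columns, mode for categorical columns.
for col in train.columns:
    if pd.api.types.is_numeric_dtype(train[col]):
        train[col] = train[col].fillna(train[col].median())
    else:
        train[col] = train[col].fillna(train[col].mode()[0])
```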
# Algorithm Selection
We chose a **two-class decision forest** as the learning algorithm.
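Outside Studio, scikit-learn's `RandomForestClassifier` is a reasonable stand-in for a two-class decision forest, since both are ensembles of decision trees; the hyperparameters below are illustrative assumptions, not the Studio defaults:

```python
from sklearn.ensemble import RandomForestClassifier

# An ensemble of decision trees, analogous to the two-class decision forest.
model = RandomForestClassifier(n_estimators=100, random_state=42)
```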
# Experimentation
* Randomly split the data into 70% training and 30% scoring using the **split data** module.
* Train the model using the **train model** module.
* Use the **score model** module to get predictions from our model on the 30% test set from the **split data** module.
* Evaluation metrics are given in the **evaluate model** module (a scikit-learn sketch of this flow follows the list).
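Continuing from the sketches above (the cleansed `train` frame and the `model` object), a rough scikit-learn equivalent of the split/train/score/evaluate flow might look like this. The one-hot encoding step is an added assumption, needed because scikit-learn requires numeric features, whereas Studio handles categoricals internally:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Separate the label from the features; one-hot encode the categoricals.
X = pd.get_dummies(train.drop(columns=["Survived"]),
                   columns=["Pclass", "Sex", "Embarked"])
y = train["Survived"].astype(int)

# 70% training / 30% scoring, mirroring the split data module.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model.fit(X_train, y_train)          # "train model"
predictions = model.predict(X_test)  # "score model"

# "evaluate model": basic classification metrics.
print(accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
```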
# Cross Validation
* Perform a 10-fold cross validation to evaluate the mean and variance of the classifier's performance.
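Using the `model`, `X`, and `y` from the sketches above, the 10-fold evaluation might look like:

```python
from sklearn.model_selection import cross_val_score

# 10-fold cross validation on the full cleansed training set;
# report the mean and spread of accuracy across the folds.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```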
# Submission
* Once experimentation and evaluation are done, retrain the model on 100% of the data.
* Feed Kaggle's test set into the experiment as a parallel workflow.
* Follow the same cleansing methods except:
    * Keep PassengerId
    * Remove all references to Survived, since it does not exist in this dataset
* Get predictions from the **score model** module using the model trained on 100% of the data.
* Select only the PassengerId and Scored Labels columns from the **score model** module using the **select columns in dataset** module.
* Rename the "Scored Labels" column to "Survived".
* Convert the dataset to CSV using the **convert to csv** module.
* Run the experiment, then right-click the **convert to csv** module to download the CSV to submit to Kaggle (a sketch of this workflow follows the list).
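Under the same assumptions as the earlier sketches (`model`, `X`, and `y` already defined; Kaggle's `test.csv` downloaded locally), the parallel submission workflow might look like:

```python
import pandas as pd

# Retrain on 100% of the training data.
model.fit(X, y)

# Apply the same cleansing to Kaggle's test set, keeping PassengerId.
test = pd.read_csv("test.csv")
passenger_ids = test["PassengerId"]
test = test.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        test[col] = test[col].fillna(test[col].median())
    else:
        test[col] = test[col].fillna(test[col].mode()[0])

# One-hot encode and align columns with the training features,
# in case a category level is absent from the test set.
X_kaggle = pd.get_dummies(test, columns=["Pclass", "Sex", "Embarked"])
X_kaggle = X_kaggle.reindex(columns=X.columns, fill_value=0)

# Score, rename the predictions to "Survived", and write the CSV.
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": model.predict(X_kaggle),
})
submission.to_csv("submission.csv", index=False)
```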
# Related
1. [Detailed Tutorial: Building and deploying a classification model in Azure Machine Learning Studio](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/)
2. [Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)
3. [Tutorial: Creating a random forest regression model in R and using it for scoring](https://gallery.azureml.net/Details/b729c21014a34955b20fa94dc13390e5)
4. [Tutorial: Obtaining feature importance using variable importance plots](https://gallery.azureml.net/Details/964dfc4151e24511aa5f78159cab0485)