Building an Ensemble Classifier in Azure ML

June 7, 2016
Use machine learning to predict whether a passenger would survive the Titanic disaster, given their demographics and circumstances.
# Data

This version of the Titanic dataset can be retrieved from the [Kaggle](https://www.kaggle.com/c/titanic-gettingStarted/data) website, specifically their “train” data (59.76 KB). The train data ships with 891 rows, each one pertaining to an occupant of the RMS Titanic on the night of its sinking. The dataset also has 12 columns, each recording an attribute of an occupant’s circumstances and demographics. For this experiment we will build a classification model that predicts whether or not someone would survive the Titanic disaster given the same circumstances and demographics.

[Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)

# Model

First, some preprocessing. It is highly recommended that you read the [detailed tutorial](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/) to understand the rationale behind each step:

* Drop the columns that do not add immediate value for data mining or hold too many missing categorical values to be a reliable predictor attribute. The following columns were dropped using the **select columns in dataset** module:
  * PassengerId, Name, Ticket, Cabin
* Identify categorical attributes and cast them into categorical features using the **edit metadata** module. The following attributes were cast into categorical values:
  * Survived, Pclass, Sex, Embarked
* Scrub the missing values using the **clean missing data** module:
  * All missing values in numeric columns were replaced with the median value of the entire column
  * All missing values in categorical columns were replaced with the mode value of the entire column
* Randomly split and partition the data into 70% training and 30% scoring using the **split data** module.

# Algorithm Selection

We chose to go with a **two-class decision forest** as the learning algorithm.
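
For readers who want to follow along outside Azure ML Studio, the preprocessing and split steps described above can be approximated in pandas. This is only a minimal sketch, not the experiment's actual implementation: the function names `preprocess` and `split_70_30`, the random seed, and the handling of columns not named in the list above are all assumptions made for illustration. It covers data preparation only and does not train the decision forest.

```python
import pandas as pd

# Column names follow the Kaggle "train" file.
DROP_COLS = ["PassengerId", "Name", "Ticket", "Cabin"]
CATEGORICAL_COLS = ["Survived", "Pclass", "Sex", "Embarked"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate the drop / cast / clean steps (hypothetical helper)."""
    # Drop columns that are not used as predictors.
    out = df.drop(columns=DROP_COLS, errors="ignore").copy()

    for col in out.columns:
        if col in CATEGORICAL_COLS:
            # Categorical columns: fill gaps with the mode, then cast.
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
            out[col] = out[col].astype("category")
        elif pd.api.types.is_numeric_dtype(out[col]):
            # Numeric columns: fill gaps with the median of the whole column.
            out[col] = out[col].fillna(out[col].median())
    return out


def split_70_30(df: pd.DataFrame, seed: int = 42):
    """Randomly split rows into 70% training and 30% scoring (assumed seed)."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled.iloc[:cut], shuffled.iloc[cut:]
```

Assuming the Kaggle file is saved locally as `train.csv` (a hypothetical path), `train, test = split_70_30(preprocess(pd.read_csv("train.csv")))` would yield the two partitions. On the 891-row file, this sketch puts 624 rows in the training set and 267 in the scoring set; the exact rows differ from the Azure ML split because the random seeds are unrelated.
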
Then we train the model using the **train model** module. We use the **score model** module to get predictions from our model on the 30% test set from the **split data** module. Evaluation metrics are given in the **evaluate model** module.

# Related

1. [Detailed Tutorial: Building and deploying a classification model in Azure Machine Learning Studio](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/)
2. [Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)
3. [Tutorial: Creating a random forest regression model in R and using it for scoring](https://gallery.azureml.net/Details/b729c21014a34955b20fa94dc13390e5)
4. [Tutorial: Obtaining feature importance using variable importance plots](https://gallery.azureml.net/Details/964dfc4151e24511aa5f78159cab0485)