Tutorial: Building a classification model in Azure ML

February 14, 2015
This experiment serves as a tutorial on building a classification model using Azure ML. We will use the Titanic passenger dataset to build a model that predicts whether a given passenger survived.
# Data

This version of the Titanic dataset can be retrieved from the [Kaggle](https://www.kaggle.com/c/titanic-gettingStarted/data) website, specifically their "train" data (59.76 KB). The train data has 891 rows, each one pertaining to an occupant of the RMS Titanic on the night of its sinking. It also has 12 columns, each recording an attribute of the occupant's circumstances and demographics. For this particular experiment we will build a classification model that predicts whether or not someone would survive the Titanic disaster given the same circumstances and demographics.

[Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)

# Model

First, some preprocessing. It is highly recommended that you read the [detailed tutorial](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/) to understand the rationale behind each step:

* Drop the columns that do not add immediate value for data mining, or that have too many missing categorical values to be reliable predictor attributes. The following columns were dropped using the **project columns** module:
    * PassengerId, Name, Ticket, Cabin
* Identify categorical attributes and cast them into categorical features using the **metadata editor** module. The following attributes were cast into categorical values:
    * Survived, Pclass, Sex, Embarked
* Scrub the missing values from the following columns using the **missing values scrubber** module:
    * Age: missing values were replaced with the median value of 28.
    * Embarked: the 2 rows with a missing Embarked value were dropped.
* Tell Azure ML what it is trying to predict by casting the response class (Survived) into a label using the **metadata editor** module.
* Randomly split the data into 70% for training and 30% for scoring using the **split** module.
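
For readers who want to see the preprocessing steps above as code, here is a minimal sketch in plain Python. It is not how the Azure ML modules are implemented. It assumes the rows have already been read into a list of dicts (for example with `csv.DictReader`) and that the column names match the header of Kaggle's `train.csv`. The helper names `is_missing`, `preprocess`, and `split_rows` are hypothetical, chosen for illustration only. Casting columns to categorical and marking the label are metadata steps inside Azure ML and have no counterpart here.

```python
import random
from statistics import median

# Assumption: column names match the header of Kaggle's train.csv.
DROPPED_COLUMNS = {"PassengerId", "Name", "Ticket", "Cabin"}


def is_missing(value):
    """Treat None and empty strings (blank CSV fields) as missing."""
    return value is None or value == ""


def preprocess(rows):
    """Return cleaned copies of `rows`, a list of dicts keyed by column name."""
    # Step 1: drop the columns the tutorial removes.
    rows = [
        {key: value for key, value in row.items() if key not in DROPPED_COLUMNS}
        for row in rows
    ]

    # Step 2: replace missing ages with the median of the observed ages.
    observed_ages = [float(row["Age"]) for row in rows if not is_missing(row.get("Age"))]
    age_median = median(observed_ages)
    for row in rows:
        row["Age"] = age_median if is_missing(row.get("Age")) else float(row["Age"])

    # Step 3: drop rows whose Embarked value is missing.
    return [row for row in rows if not is_missing(row.get("Embarked"))]


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of `rows` and split it into (training, scoring) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_rows(preprocess(rows))` on the loaded rows would then give a 70/30 partition comparable to the one produced by the **split** module, although the exact rows in each partition depend on the random seed.
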
# Algorithm Selection

We chose a **two-class boosted decision tree** and a **two-class decision forest**. Each algorithm has its own **train model** and **score model** modules, so the two are trained separately on the same dataset. Both algorithms were trained with their default settings. The performance of both models was evaluated and compared using a single **evaluate model** module.

# Results

Both models performed reasonably well (ROC AUC of about 0.81 each). The boosted decision tree achieved a slightly higher ROC AUC overall, but it had lower accuracy, precision, and recall than the two-class decision forest. Both models as they stand are adequate for deployment, and that is where this experiment concludes. Users can take this experiment and tweak the parameters of either algorithm to achieve higher performance.

# Related

1. [Detailed Tutorial: Building and deploying a classification model in Azure Machine Learning Studio](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/)
2. [Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)
3. [Tutorial: Creating a random forest regression model in R and using it for scoring](https://gallery.azureml.net/Details/b729c21014a34955b20fa94dc13390e5)
4. [Tutorial: Obtaining feature importance using variable importance plots](https://gallery.azureml.net/Details/964dfc4151e24511aa5f78159cab0485)