Building a Decision Tree Classifier Model

November 15, 2016
Ever wonder how a model gets to its conclusions? A decision tree is often the most transparent algorithm in terms of internal mechanics.

# Data

This version of the Titanic dataset can be retrieved from the [Kaggle]( website, specifically their “train” data (59.76 KB). The train data ships with 891 rows, each one pertaining to an occupant of the RMS Titanic on the night of its sinking. The dataset also has 12 columns, each recording an attribute of an occupant’s circumstances and demographics. In this experiment we build a classification model that predicts whether someone would have survived the Titanic disaster given the same circumstances and demographics.

# Model

First, some preprocessing. It is highly recommended that you read the [detailed tutorial]( to understand the rationale behind each step:

* Drop the columns that do not add immediate value for data mining or that hold too many missing categorical values to be reliable predictors. The following columns were dropped using the **select columns in dataset** module:
  * PassengerID, Name, Ticket, Cabin
* Identify categorical attributes and cast them into categorical features using the **edit metadata** module. The following attributes were cast into categorical values:
  * Survived, Pclass, Sex, Embarked
* Scrub the missing values using the **clean missing data** module:
  * All missing values in numeric columns were replaced with the median of the entire column
  * All missing values in categorical columns were replaced with the mode of the entire column
* Randomly split the data into 70% training and 30% scoring using the **split data** module.

# Algorithm Selection

In this gallery experiment we show how to build a single decision tree in Azure ML, much like the one produced by the rpart package in R. We take the **two-class decision forest** as the learning algorithm and set the number of trees to one. Then we train the model using the **train model** module.

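For readers who want to see the preprocessing steps outside the Azure ML canvas, here is a minimal sketch in plain Python. It is not the code behind the Azure modules; it only mirrors the steps listed under Model (drop columns, impute with median or mode, split 70/30). The function names `preprocess` and `split`, the seed, and the toy records are all assumptions made for illustration; only the column names come from the steps above.

```python
import random
from statistics import median, mode

# Column names as listed in the preprocessing steps above.
DROP = {"PassengerID", "Name", "Ticket", "Cabin"}
CATEGORICAL = {"Survived", "Pclass", "Sex", "Embarked"}


def preprocess(rows):
    """Drop unused columns, then fill gaps with median (numeric) or mode (categorical)."""
    kept = [{k: v for k, v in row.items() if k not in DROP} for row in rows]
    for col in list(kept[0]):
        present = [r[col] for r in kept if r[col] is not None]
        if len(present) == len(kept):
            continue  # nothing missing in this column
        fill = mode(present) if col in CATEGORICAL else median(present)
        for r in kept:
            if r[col] is None:
                r[col] = fill
    return kept


def split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and scoring sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up records (not real passengers); None marks a missing value.
toy = [
    {"PassengerID": i, "Name": "n/a", "Ticket": "n/a", "Cabin": None,
     "Survived": i % 2, "Pclass": 1 + i % 3,
     "Sex": "male" if i % 2 else "female",
     "Age": None if i == 3 else 20.0 + i,
     "Embarked": None if i == 5 else "S"}
    for i in range(10)
]

clean = preprocess(toy)
train, score = split(clean)
print(len(train), "training rows,", len(score), "scoring rows")
```

The fill value is computed over the entire column before the split, which matches the order of the steps in this experiment.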
We use the **score model** module to get predictions from our model on the 30% test set from the **split data** module. Evaluation metrics are given in the **evaluate model** module.

# Related

1. [Detailed Tutorial: Building and deploying a classification model in Azure Machine Learning Studio](
2. [Demo: Interact with the user interface of a model deployed as service](
3. [Tutorial: Creating a random forest regression model in R and using it for scoring](
4. [Tutorial: Obtaining feature importance using variable importance plots](