Cross Validating a Classification Model

November 15, 2016
One of the safeguards against over-fitting is to build multiple models over different partitions of the same data, a technique called cross validation.
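As a rough illustration of the idea, the sketch below partitions a dataset into k folds, trains on k-1 of them, scores on the held-out fold, and reports the mean and standard deviation of the fold accuracies. It is a minimal plain-Python sketch, not the Azure ML implementation: the function names, the toy data, and the majority-class "model" are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import random
import statistics
from collections import Counter


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, fit, predict, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    folds = k_fold_indices(len(rows), k, seed)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(len(rows)) if i not in held_out_set]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    # Sample standard deviation across folds; a population measure would also be defensible.
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical stand-in model: always predict the most common training label.
def fit_majority(train_rows, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Toy data (an assumption for illustration, not the Titanic dataset).
toy_rows = list(range(100))
toy_labels = [1 if i % 3 == 0 else 0 for i in toy_rows]
mean_acc, std_acc = cross_validate(toy_rows, toy_labels, fit_majority, predict_majority)
print(f"mean accuracy {mean_acc:.3f}, standard deviation {std_acc:.3f}")
```

Any pair of `fit` and `predict` callables can be swapped in; the mean tells you what accuracy to expect, while the standard deviation indicates how much that accuracy depends on which rows happened to land in the training partition.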
## Title
# Evaluating and Parameter Tuning a Decision Tree Model
## Tag: data cleansing, preprocessing, decision tree, evaluation, parameter tuning

# Summary
Predict, using machine learning, whether or not a passenger would survive the Titanic disaster given their demographic or circumstance.

# Description

# Data
This version of the Titanic dataset can be retrieved from the [Kaggle](https://www.kaggle.com/c/titanic-gettingStarted/data) website, specifically their “train” data (59.76 kb). The train Titanic data ships with 891 rows, each one pertaining to an occupant of the RMS Titanic on the night of its sinking.

[Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)

The dataset also has 12 columns that each record an attribute about each occupant’s circumstances and demographic. For this particular experiment we will build a classification model that can predict whether or not someone would survive the Titanic disaster given the same circumstances and demographic.

# Model
First, some preprocessing. It is highly recommended that you read the [detailed tutorial](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/) to understand the rationale behind each step:

* Drop the columns that do not add immediate value for data mining or hold too many missing categorical values to be a reliable predictor attribute. The following columns were dropped using the **select columns in dataset** module:
  * PassengerID, Name, Ticket, Cabin
* Identify categorical attributes and cast them into categorical features using the **edit metadata** module.
  The following attributes were cast into categorical values:
  * Survived, Pclass, Sex, Embarked
* Scrub the missing values from the following columns using the **clean missing data** module:
  * All missing values in numeric columns were replaced with the median value of the entire column
  * All missing values in categorical columns were replaced with the mode value of the entire column
* Randomly split and partition the data into 70% training and 30% scoring using the **split** module.

# Algorithm Selection
In this gallery experiment we show how to build a single decision tree in Azure ML, much like the one built by the rpart package in R. We take the **two-class decision forest** as the learning algorithm and set the number of trees to one. We use 10-fold cross validation to estimate the mean accuracy the model can be expected to achieve, and its stability as measured by the standard deviation across folds.

# Parameter Tuning
* We begin by running the model on default parameters to get a baseline. The model starts off with 76% mean accuracy and a standard deviation of 4.6%.
* The minimum number of samples per leaf node is 1 by default, which naturally lets the tree over-fit and learn from all the data points, including outliers. We increase it to roughly 1% of the data points to stop the tree from prematurely classifying these outliers. Mean accuracy improved but standard deviation shot up, illustrating the bias-variance trade-off.
* A decision tree depth of 32 is too large for a data set with only 7 predictors. We want almost every feature to have a chance to become a decision node, but we do not want the tree so deep that it starts splitting on arbitrary cut-offs in numeric columns. Reducing the maximum tree depth to 6 improved both accuracy and standard deviation.
* The number of random splits per node matters much more for a decision forest than for a single decision tree, because it controls how similar the trees look to one another. Reducing it has only a marginal impact on the performance of the model but substantially shortens model build times, since fewer candidate splits are evaluated at each node. The value should be small enough that learning does not become a fully greedy search, but large enough that good features are not routinely excluded.

# Related
1. [Detailed Tutorial: Building and deploying a classification model in Azure Machine Learning Studio](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/)
2. [Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)
3. [Tutorial: Creating a random forest regression model in R and using it for scoring](https://gallery.azureml.net/Details/b729c21014a34955b20fa94dc13390e5)
4. [Tutorial: Obtaining feature importance using variable importance plots](https://gallery.azureml.net/Details/964dfc4151e24511aa5f78159cab0485)
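
For readers working outside Azure ML Studio, the preprocessing steps listed under Model (drop columns, fill missing values with the median or mode, split 70/30) can be approximated in plain Python. This is a rough sketch only, not what the Studio modules do internally: the function names and toy rows are hypothetical, rows are assumed to be dictionaries with `None` marking a missing value, and the `PassengerId` spelling follows the Kaggle file header.

```python
import random
import statistics

# Columns dropped in the experiment (spelled as in the Kaggle CSV header).
DROPPED = {"PassengerId", "Name", "Ticket", "Cabin"}


def drop_columns(rows, dropped=DROPPED):
    """Remove columns that are not used as predictors."""
    return [{k: v for k, v in row.items() if k not in dropped} for row in rows]


def fill_missing(rows, numeric_cols, categorical_cols):
    """Replace None with the column median (numeric) or mode (categorical)."""
    fills = {}
    for col in numeric_cols:
        fills[col] = statistics.median([r[col] for r in rows if r[col] is not None])
    for col in categorical_cols:
        fills[col] = statistics.mode([r[col] for r in rows if r[col] is not None])
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in row.items()}
        for row in rows
    ]


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and cut them into training and scoring partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy rows for illustration only; these are not real passenger records.
toy = [
    {"PassengerId": 1, "Name": "A", "Ticket": "T1", "Cabin": None, "Age": 22.0, "Embarked": "S"},
    {"PassengerId": 2, "Name": "B", "Ticket": "T2", "Cabin": None, "Age": None, "Embarked": "C"},
    {"PassengerId": 3, "Name": "C", "Ticket": "T3", "Cabin": None, "Age": 30.0, "Embarked": None},
    {"PassengerId": 4, "Name": "D", "Ticket": "T4", "Cabin": None, "Age": 40.0, "Embarked": "S"},
]
cleaned = fill_missing(drop_columns(toy), numeric_cols=["Age"], categorical_cols=["Embarked"])
train_rows, score_rows = split_rows(cleaned)
```

Computing the fill values from the whole column mirrors the description above; a stricter workflow would compute them on the training partition only, so that nothing from the scoring rows leaks into training.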