Sample 3: Cross Validation for Binary Classification: Adult Dataset

January 6, 2015
This experiment demonstrates the use of cross validation in binary classification.
# Cross Validation: Binary Classification

This experiment demonstrates how to use the **Cross Validate Model** module. We used the [Adult dataset] and trained multiple classification models using cross validation to predict whether an individual's income is greater or less than $50,000.

## Data

The dataset contains 14 features and one label column. The features are of multiple types, including numerical and categorical. The following diagram shows an excerpt from the dataset:

![][image_dataset]

## Creating the Experiment

The following diagram shows the overall workflow of the experiment:

![][image_experiment]

### Sampling the Data

After adding the dataset to the experiment, we used the **Partition and Sample** module to draw a 20% sample from the dataset, using the following settings:

![][image_sample]

### Missing Data Handling

We used the **Clean Missing Data** module to replace missing values with zeros, using the settings below:

![][image_missing]

### Cross Validation

After cleaning up missing entries in the dataset, we divided the experiment into four branches. In each branch, a different algorithm is used as the input to the **Cross Validate Model** module.

Cross validation is a technique for reducing the bias that can result from training a model on a single training set. Instead of splitting the dataset into two parts, a training set and a test set, cross validation partitions the dataset into multiple subsets and trains multiple models, each tested on the data held out from its training. The module then creates a report showing the error measurements for each subset of data. This information reveals how sensitive the model is to the choice of training set, and it gives you a better indication of the model's ability to generalize to new data.

By default, **Cross Validate Model** divides the input dataset into ten subsets, called _folds_. If you want a different number of folds, you can use the **Partition and Sample** module and select the **Assign to Folds** option.
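As a point of comparison, the same k-fold idea can be sketched in scikit-learn. This code is not part of the Studio experiment; the synthetic data, feature count, and model choice are merely illustrative stand-ins for the Adult dataset and the experiment's branches.

```python
# Illustrative k-fold cross validation sketch in scikit-learn.
# Synthetic data stands in for the Adult dataset; all names are hypothetical.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 14 features, binary label (like income >/<= $50K).
X, y = make_classification(n_samples=1000, n_features=14, random_state=0)

# Ten folds is the Cross Validate Model default; changing n_splits plays
# the same role as assigning a different number of folds in Studio.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One model per branch; logistic regression shown here as an example.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

print(scores)         # accuracy on each of the 10 folds
print(scores.mean())  # average accuracy across folds
```

The spread of the per-fold scores is what indicates sensitivity to the training set: a wide spread suggests the model's performance depends heavily on which data it was trained on.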
Use the **Specify number of folds to split evenly into** option to set the number of folds. The following diagram shows where to find these settings:

![][image_partition]

**Cross Validate Model** takes two inputs: a machine learning model and a dataset. For this binary classification problem, we used the following four binary classification methods: **Two-Class Averaged Perceptron**, **Two-Class Boosted Decision Tree**, **Two-Class Logistic Regression**, and **Two-Class Support Vector Machine**.

To configure **Cross Validate Model**, you must also specify the label column, or classification target. In this case, select the `income` column and leave the **Random seed** option at its default value of 0 to randomize the distribution of instances into the folds.

![][image_parameters]

### Results

**Cross Validate Model** has two outputs. The left-hand output contains the scored results on the training data; the right-hand output contains accuracy metrics for each model. For example, the following diagram shows the first output from the instance of **Cross Validate Model** trained with **Two-Class Averaged Perceptron**. The **Scored Labels** column shows the predicted class label; the **Scored Probabilities** column shows the probability of the predicted class.

![][image_output1]

The following figure shows the second output, with performance for each fold in terms of accuracy, precision, recall, F-score, AUC, average log loss, and training log loss.

![][image_output2]

Based on the cross validation results, you can tune the model parameters or decide which model to use in the scoring experiment.

<!-- Images -->
[image_dataset]:
[image_sample]:
[image_partition]:
[image_missing]:
[image_parameters]:
[image_experiment]:
[image_output1]:
[image_output2]:
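For readers who want to approximate the per-fold metrics report outside Studio, the same set of metrics can be collected with scikit-learn's `cross_validate`. This is a hedged sketch, not the experiment itself: the synthetic data stands in for the Adult dataset, and logistic regression stands in for any of the four branches.

```python
# Illustrative sketch: per-fold metrics similar to the Cross Validate Model
# report. Synthetic data stands in for the Adult dataset; names are hypothetical.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=1000, n_features=14, random_state=0)

# Metrics analogous to the Studio report columns.
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "auc": "roc_auc",
    "log_loss": "neg_log_loss",  # scikit-learn negates losses so higher is better
}

results = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring=scoring,
)

for fold in range(10):
    print(f"fold {fold}: "
          f"acc={results['test_accuracy'][fold]:.3f} "
          f"auc={results['test_auc'][fold]:.3f} "
          f"log_loss={-results['test_log_loss'][fold]:.3f}")
```

As in the Studio report, comparing these per-fold numbers across models is what supports the final choice of which model to use in the scoring experiment.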