Model Selection for Binary Classification

August 18, 2016
Concurrent Model and Hyperparameters Selection for Binary Classification
# Intro

In Azure ML Studio you sometimes want to compare the performance of a large number of algorithms at once. This experiment lets you benchmark all the available binary classifiers against each other while optimising their hyperparameters.

# Model Selection

A common problem in Machine Learning is **model selection**: determining which model performs best on your data. You need to select the best model across different classes of algorithms, such as *Decision Trees* or *Neural Networks*, and across different sets of hyperparameters, such as the *Learning Rate* or the *Number of Iterations*. Picking the best algorithm is not enough: each algorithm exposes parameters that need to be adjusted to produce the best performing model.

This experiment uses the `Tune Model Hyperparameters` module to test different combinations of parameters for a given algorithm. It performs a grid search over the set of hyperparameters (an analogous search written in code is sketched at the end of this page).

# Performance

Stating that a model is *the* best performing model for a given classification task can be misleading, because the performance of a model can be evaluated with several metrics, for example *Precision*, *Accuracy*, *Recall*, or the *F-Score*. Depending on your use case, you might want to maximise one or the other. This experiment gives you the best model and parameters for each of the available metrics.

# Instructions

## Using your own data

- Replace the Adult Census dataset with an `Import Data` module, or with one of your datasets, to access your own data.

## Picking the label column

- The `Edit Metadata` module lets you pick which column should be treated as the label. Specifying it is necessary because that is how the `Tune Model Hyperparameters` module knows which column is the label.

## Running the experiment

- Press Run to run the experiment.

## Visualizing the results

- Visualize the output of the last `Execute Python Script` module to find out which models performed best (a sketch of this kind of aggregation is given at the end of this page).
- The *modelName* column tells you which classifier was used.
- The *modelParameters* column gives the parameters used for that model, in JSON format.
- The *max Metric* column indicates the metric for which that model scored best.

## Using a Random Sweep

- The experiment currently performs a grid search over a fixed number of parameters, using the **Entire Grid** option of the `Tune Model Hyperparameters` module. I recommend trying the **Random Sweep** option with ~100 sweeps per model for a richer variety of sweeps.

## Creating the best model

- Use the information contained in the output dataset to pick which classifier to use and which parameters to set.

### Notes

If you are using a free workspace, only one module runs at a time, so you won't benefit from the inherent parallelization of the experiment.
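
# Code Sketches

The Azure ML Studio modules do all of the work above through the graphical interface, but the ideas translate directly to code. The following sketch mimics what `Tune Model Hyperparameters` does conceptually, using scikit-learn rather than the module itself: an exhaustive **Entire Grid** search and a **Random Sweep** over the same hyperparameter space. The dataset, classifier, and parameter ranges are illustrative assumptions, not the ones used inside the module.

```python
# Minimal scikit-learn analogue of "Tune Model Hyperparameters":
# an Entire Grid search and a Random Sweep over learning rate,
# number of iterations (trees) and tree depth.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic binary classification data, standing in for the Adult Census dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Entire Grid: every combination is evaluated (3 * 3 * 3 = 27 candidates,
# each fitted once per cross-validation fold).
param_grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    param_grid, scoring="f1", cv=3)
grid.fit(X, y)
print("Entire Grid best:", grid.best_params_, grid.best_score_)

# Random Sweep: ~100 random draws, which can also cover continuous ranges.
param_dist = {
    "learning_rate": uniform(0.01, 0.3),
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 6),
}
sweep = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                           param_dist, n_iter=100, scoring="f1",
                           cv=3, random_state=0)
sweep.fit(X, y)
print("Random Sweep best:", sweep.best_params_, sweep.best_score_)
```

With the Entire Grid, every combination in the discrete grid is evaluated, so the cost grows multiplicatively with each added parameter; the Random Sweep samples a fixed number of combinations, which is why it can explore continuous ranges and a richer variety of values for the same budget.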
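
The claim that the "best" model depends on the metric is easy to see on a toy example. The sketch below computes the four metrics mentioned above for one made-up set of predictions; the numbers are invented purely for illustration.

```python
# Why the best model depends on the metric: four standard metrics
# computed for one made-up set of binary predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # 2 true positives, 1 false positive, 2 false negatives

print("Accuracy :", accuracy_score(y_true, y_pred))   # (2 + 5) / 10 = 0.7
print("Precision:", precision_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67
print("Recall   :", recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.5
print("F-Score  :", f1_score(y_true, y_pred))         # harmonic mean ≈ 0.57
```

A classifier tuned to maximise Recall on this kind of data would not be the same one you would pick to maximise Precision, which is why the experiment reports the best model separately for each metric.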
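
Finally, here is a minimal pandas sketch of the kind of per-metric aggregation the last `Execute Python Script` module performs: for each metric, keep the model that maximises it. The table contents and column layout below (*modelName*, *modelParameters*, *max Metric*) are assumptions based on the description above; the actual script in the experiment may be shaped differently.

```python
import json
import pandas as pd

# Hypothetical results table: one row per tuned model, one column per metric.
results = pd.DataFrame([
    {"modelName": "Two-Class Boosted Decision Tree",
     "modelParameters": json.dumps({"Learning rate": 0.2, "Number of leaves": 20}),
     "Accuracy": 0.86, "Precision": 0.78, "Recall": 0.65, "F-Score": 0.71},
    {"modelName": "Two-Class Logistic Regression",
     "modelParameters": json.dumps({"L2 regularization weight": 1.0}),
     "Accuracy": 0.84, "Precision": 0.74, "Recall": 0.70, "F-Score": 0.72},
])

metrics = ["Accuracy", "Precision", "Recall", "F-Score"]

# For each metric, keep the row of the model that maximises it and record
# which metric it was best at ("max Metric").
best_rows = []
for metric in metrics:
    row = results.loc[results[metric].idxmax()].copy()
    row["max Metric"] = metric
    best_rows.append(row)

best_per_metric = pd.DataFrame(best_rows).reset_index(drop=True)
print(best_per_metric[["max Metric", "modelName", "modelParameters"]])
```

The *modelParameters* JSON in the winning row is what you would then type back into the corresponding classifier module when creating the best model by hand.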