Learning with Counts: Binary Classification

September 2, 2014
In this experiment, we demonstrate how to use the Learning with Counts modules to generate features from columns of categorical values for a binary classification model.
#Learning with Counts: Binary Classification

For data columns of categorical values, especially those with a large number of unique values, it is often not feasible or efficient to use the columns directly as input features for classification models. In this experiment, we demonstrate how to use the **Learning with Counts** modules (**Build Counting Transform**, **Modify Count Table Parameters**) and the **Apply Transformation** module to generate more efficient features from columns of categorical values. These derived features can be conveniently consumed by a binary classification model.

- For each unique value of a selected column, the **Build Counting Transform** module counts the number of examples belonging to each class. The module then outputs a transform that can be used to featurize the categorical values with default parameters.
- The **Modify Count Table Parameters** module can be used to change the parameters used to featurize the categorical values.
- The **Apply Transformation** module applies the transform to a dataset with the same schema as the input of the **Build Counting Transform** module and replaces the original categorical values with features (such as logodds, counts of both classes, and garbage bin threshold).

For more information about using counts in machine learning, see the online [help](https://msdn.microsoft.com/library/azure/81c457af-f5c0-4b2d-922c-fdef2274413c).

##Data

We used the Flight Delays sample dataset in this experiment, and processed the data using the procedure described in the [Flight Delay Prediction](http://go.microsoft.com/fwlink/?LinkId=525725) experiment. In addition, we generate some new count-based features using the modules described above.

##Feature Engineering

The following diagram shows how the modules are used to generate count-based features.
We perform a three-way data split using the [**Split**](https://msdn.microsoft.com/library/azure/70530644-c97a-4ab6-85f7-88bf30a8be5f) module, and then build the counting transform (count table and features) using the training set. The counting transform is then applied to the training, validation, and test sets.

![][countGraph]

###Build Counting Transform

Using the **Build Counting Transform** module, we create the counting transform as follows:

1. First, we use the column selector to choose the columns on which we want to build count tables. Here we select three columns: _OriginAirportID_, _DestAirportID_, and _DayofMonth_.
2. We select _ArrDel15_ as the label column.
3. We set the number of classes parameter to _2_.
4. We specify "Dataset" as the input data for the Module type parameter. You can also read the input from a blob or from an HDInsight Hadoop cluster.
5. For the rest of the parameters, we use default values.

![][buildTableParam]

This module outputs a transform that can be used to featurize the selected data columns.

###Modify Count Table Parameters

We further modify the parameters of the transform using the **Modify Count Table Parameters** module. For example, we can select logodds only, counts only, or both logodds and counts. For this experiment, we select the **LogOddsOnly** option for output features and select the **Ignore back off column** option. We use default values for the other parameters.

![][featurizerParam]

To examine the count table and parameters, you can use the **Export Count Table** module, which takes the counting transform as input. There are two outputs. The left-hand output contains the count parameters:

![][countParam]

The right-hand output contains the generated count table. An excerpt is shown here.

![][countTable]

###Apply Transformation

The **Apply Transformation** module takes as input the output of the **Build Counting Transform** module, plus the raw data, and generates the count-based features.
For each selected column, the raw values are replaced with count-based features, as shown in the following diagram.

![][countFeature]

If we select both counts and logodds in the **Modify Count Table Parameters** module, and include the back-off column, we would have the following output for each of the selected columns. For simplicity, here we show the results only for the _DestAirportID_ column.

![][countFeature_all]

For our binary classification problem, we need the logodds of just one class. Because we split the original data three ways for this experiment, we create the count table using only the training dataset. We then apply the count table to the training, validation, and test datasets, which replaces the original columns with the featurized logodds columns.

##Model Performance

After the new features were generated and used to replace the raw data columns, we trained a model using the [**Sweep Parameters**](https://msdn.microsoft.com/library/azure/038d91b6-c2f2-42a1-9215-1f2c20ed1b40) module to find the best model parameters. The following chart shows the performance of the model built on the featurized data. The AUC is 0.703 for the model using count-based features, which is slightly better than the AUC (0.697) when using the original data columns as features.

Computational efficiency is also greatly improved. With the new count-based features, the entire experiment ran in around 21 minutes, compared to 34 minutes for the original experiment using categorical columns. When you use categorical columns for analysis, most algorithms transform each categorical column into a series of binary variable columns, one for each unique value of the categorical column, which can be computationally expensive.
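To make the idea concrete, here is a minimal Python sketch of count-based featurization for one categorical column. It is an illustration only, not the modules' implementation: the function names, the smoothing `prior`, the logodds formula, and the toy airport IDs and labels are all assumptions made for this example. The count table is built from the training rows only and then applied to another split, where a value that was never seen in training falls back to a logodds of 0.

```python
import math
from collections import defaultdict


def build_count_table(values, labels):
    """Count, for each unique categorical value, the examples in class 0 and class 1."""
    table = defaultdict(lambda: [0, 0])
    for value, label in zip(values, labels):
        table[value][label] += 1
    return dict(table)


def log_odds(table, value, prior=1.0):
    """Smoothed logodds of class 1 (assumed formula); unseen values give 0.0."""
    neg, pos = table.get(value, (0, 0))
    return math.log((pos + prior) / (neg + prior))


# Hypothetical toy data: airport IDs and a binary "delayed" label.
train_values = [10, 10, 10, 20, 20, 30]
train_labels = [1, 1, 0, 0, 0, 1]
validation_values = [10, 20, 99]  # 99 never appears in the training set

# Build the count table on the training split only, then apply it everywhere.
table = build_count_table(train_values, train_labels)
train_features = [log_odds(table, v) for v in train_values]
validation_features = [log_odds(table, v) for v in validation_values]

# Feature width: one binary column per unique value versus a single logodds column.
one_hot_width = len(set(train_values))
count_width = 1
```

In this toy case the column has only three unique values, so the difference in width is small. For a column with thousands of unique values, the one-hot width grows with the number of values while the logodds representation stays at one column.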
![][countPerf]

In the series of **Learning with Counts** experiments, we demonstrate the use of the **Build Counting Transform**, **Modify Count Table Parameters**, and **Apply Transformation** modules to generate new count-based features for binary classification, multiclass classification, and regression problems.

Although this experiment demonstrates count-based featurization on datasets already in Azure ML Studio, you can also use these modules to work with data from Windows Azure Blobs. The **Build Counting Transform** module can also read data using Hive queries against an HDInsight cluster, and compute the count tables using the underlying HDFS. This lets you compute features for large datasets (larger than 10 GB) that cannot easily be consumed by Azure ML Studio.

<!-- Images -->
[countGraph]:https://az712634.vo.msecnd.net/samplesimg/v2/16/countGraph.PNG
[buildTableParam]:https://az712634.vo.msecnd.net/samplesimg/v2/16/buildTableParam.PNG
[countParam]:https://az712634.vo.msecnd.net/samplesimg/v1/16/countParam.PNG
[countTable]:https://az712634.vo.msecnd.net/samplesimg/v1/16/countTable.PNG
[featurizerParam]:https://az712634.vo.msecnd.net/samplesimg/v2/16/featurizerParam.PNG
[countFeature]:https://az712634.vo.msecnd.net/samplesimg/v1/16/countFeature.PNG
[countFeature_all]:https://az712634.vo.msecnd.net/samplesimg/v1/16/countFeature_all.PNG
[countPerf]:https://az712634.vo.msecnd.net/samplesimg/v1/16/countPerf.PNG
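
One reason count tables suit large datasets is that counts are additive: a table computed separately on each partition of the data can be summed into the same table you would get from a single pass over everything. The following plain-Python sketch illustrates only that property. It does not show how the **Build Counting Transform** module or an HDInsight cluster computes the tables, and the function names and toy rows are hypothetical.

```python
from collections import Counter


def count_partition(rows):
    """Count (value, label) pairs within one partition of the data."""
    return Counter(rows)


def merge_counts(partials):
    """Sum per-partition counters into a single count table."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return total


# Hypothetical rows of (airport ID, delay label), split across two partitions.
partitions = [
    [(10, 1), (10, 0), (20, 0)],
    [(10, 1), (20, 0), (30, 1)],
]

merged = merge_counts(count_partition(p) for p in partitions)
whole = count_partition([row for p in partitions for row in p])
```

Here `merged` and `whole` hold identical counts, so the partitions can be processed independently and combined afterward.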