Learning with Counts: Binary classification with NYC taxi data

By for July 16, 2015
This sample demonstrates how to use the learning with counts modules for performing binary classification on the publicly available NYC taxi dataset. We use a two-class logistic regression learner to model this problem.
# Learning with Counts: Binary Classification with NYC taxi data

Learning with counts is a useful technique for efficiently encoding high-dimensional (also called "high-cardinality") categorical variables. In this experiment, we demonstrate how to use the **Learning with Counts** modules (**Build Counting Transform** and **Modify Count Table Parameters**) together with the **Apply Transformation** module to generate compact representations of high-dimensional categorical variables. These derived features are then used in a binary classification model to predict whether a passenger will tip or not.

- For each unique value of a selected column, the **Build Counting Transform** module counts the number of examples belonging to each class. The module then outputs a transform that can be used to featurize the categorical values with default parameters.
- The **Modify Count Table Parameters** module can be used to change the parameters used when featurizing the categorical values.
- The **Apply Transformation** module applies the transform to a dataset with the same schema as the input of the **Build Counting Transform** module and replaces the original categorical values with features (such as log odds, counts of both classes, and the use of a backoff).

For more information about using counts in machine learning, see the online [help](https://msdn.microsoft.com/library/azure/81c457af-f5c0-4b2d-922c-fdef2274413c).

## Data

We used the New York City taxi dataset in this experiment, freely available [here](http://www.andresmh.com/nyctaxitrips/). The dataset consists of two sets of data: the trip data and the fare data.
A few lines of the trip data are shown below:

    medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
    89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
    0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
    0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
    DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
    DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868

We see that the trip data consists of driver details (medallion, hack\_license, vendor\_id) and trip details such as pickup and dropoff times, the number of passengers, trip time and distance, and the GPS coordinates of the pickup and dropoff.
The fare data, on the other hand, contains the fare details of each trip. A few lines are shown below:

    medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
    89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7
    0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7
    0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05 18:49:41,CSH,5.5,1,0.5,0,0,7
    DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:54:15,CSH,5,0.5,0.5,0,0,6

In addition to fields it shares with the trip data, such as the driver details, the fare data contains the fare amount, the tolls, surcharge, and taxes, and the tip amount.

## Binary classification problem

The binary classification problem we pose here takes the form: given the driver and trip details, will a passenger tip or not? We denote class 0 as "no tip" and class 1 as "tip".
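As a minimal sketch of how such a 0/1 label could be derived, the snippet below reads a few fare rows and marks a trip as class 1 when `tip_amount` is greater than zero. That threshold is an assumption on our part (the text only says 0 means no tip and 1 means a tip), and the helper name `tipped_label` is hypothetical. The first two rows are copied from the fare sample above; the third is built from the fare fields of a row in the joined sample that follows.

```python
import csv
import io

# Header and rows in the same layout as the fare data sample.
FARE_CSV = """\
medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7
0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7
413F4FE8B13419006400C2A8517D7A44,01A7DEBB426ABA1C9CFD9DC4711EF497,CMT,2013-12-07 21:59:14,CRD,9.0,0.5,0.5,2.5,0.0,12.5
"""


def tipped_label(tip_amount: float) -> int:
    """Return 1 ("tip") if any tip was paid, else 0 ("no tip").

    Assumption: any strictly positive tip counts as class 1.
    """
    return 1 if tip_amount > 0 else 0


# skipinitialspace drops the blank after each comma in the header line.
reader = csv.DictReader(io.StringIO(FARE_CSV), skipinitialspace=True)
labels = [tipped_label(float(row["tip_amount"])) for row in reader]
print(labels)
```

The two cash rows come out as class 0 and the card row with a 2.5 tip as class 1.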
After joining the trip and fare datasets on the medallion, hack\_license, and vendor\_id and attaching an additional column called "tipped", which is 0 for no tip and 1 for a tip, we obtain a dataset of the form shown below:

    medallion,hack_license,vendor_id,rate_code,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount,tip_bin_value,tipped
    413F4FE8B13419006400C2A8517D7A44,01A7DEBB426ABA1C9CFD9DC4711EF497,CMT,1,2013-12-07 21:59:14,2013-12-07 22:08:18,1,543.0,1.7,-73.9702,40.757236,-73.952904,40.769035,CRD,9.0,0.5,0.5,2.5,0.0,12.5,1,1
    413F4FE8B13419006400C2A8517D7A44,334CBA3C4F54A6A9E02BB506F74C674B,CMT,1,2013-06-29 01:31:18,2013-06-29 01:37:13,1,354.0,0.8,-73.993774,40.745815,-74.003891,40.742088,CSH,5.5,0.5,0.5,0.0,0.0,6.5,0,0
    413F4FE8B13419006400C2A8517D7A44,FD1176E5658567D01B51C43525BA5672,CMT,1,2013-01-20 03:26:40,2013-01-20 03:52:49,3,1568.0,14.5,-74.008926,40.726002,-73.828033,40.68576,CRD,42.0,0.5,0.5,10.75,0.0,53.75,3,1
    41410577B81EBF63D371BD07D3092DF9,2311C55F3F626C2956E79624CD0DA084,VTS,5,2013-11-17 01:54:00,2013-11-17 02:10:00,2,960.0,9.14,-73.989258,40.757542,-74.074799,40.764229,CRD,55.0,0.0,0.5,13.05,10.25,78.8,3,1

Note that we wish to predict the last column, "tipped". We will also refer to it as the label column in what follows.

## Label distribution

The label distribution is of interest in any classification problem. We show the label distribution of our train data below:

![][labelDistributionBinary]

The split between the "no tip" and "tip" classes is almost 50-50, i.e., we have a class-balanced dataset for this problem.

## Experiment

We now show the experiment in full, and then describe its various components.
![][fullExperimentBinary]

### Accessing the train and test datasets

We use the **Reader** module to access the NYC taxi datasets via publicly available blobs. By choosing the "PublicOrSAS" option, we can access data stored in public blob storage. We train our models on the train dataset and evaluate model performance on the test dataset.

## Feature Engineering

As mentioned in the introduction, this experiment showcases how to produce a compact representation of high-dimensional categorical features by using the learning with counts approach. In our data, some of the high-dimensional categorical features are "medallion", "hack\_license", and the GPS coordinates. Below, we list the number of unique values for a few of these categorical variables:

    medallion : more than 13000 unique values
    hack_license : more than 39000 unique values
    pickup_longitude : more than 42000 unique values

As we see, the number of unique values (and hence the dimensionality) of these categorical variables is very large. We expect count features to help by producing a compact representation of this high-dimensional data.

To use the count features in our modeling, the first step is to use the Build Count Transform module, as shown below, to generate the counting transform on our chosen categorical variables.

**Important Note:** In this sample experiment, we compute the counts on the train dataset and then use those counts to generate count features for the test dataset. In practice, it is better still to reserve an entirely separate dataset just for computing the counts.

### Build Count Transform

We use the [**Build Count Transform**](https://msdn.microsoft.com/library/azure/166586ff-5bba-46a9-b469-20179f179b6c) module, which looks like this:

![][buildCountTransformBinary]

1. We first select the number of classes; in our case, this is 2 since we are performing binary classification.
2. Next, we choose the number of bits of the hashing function for constructing a dictionary transform. We choose this to be 23.
3. Next, we choose a random seed; this makes the transform reproducible if needed.
4. In "Module type", we choose "Dataset". Note that the module also takes data from a blob or MapReduce (where the data is stored in HDFS).
5. In "Label column index or name", we select "tipped".
6. In "Select columns to count", we select "medallion", "hack\_license", "vendor\_id", "pickup\_longitude", "pickup\_latitude", "dropoff\_longitude", and "dropoff\_latitude".
7. In "Label column", we specify the index of the column that is chosen as the label. In our case, we choose the index of the column "tip\_bin\_value".
8. Finally, we specify the type of count table to be constructed: either a dictionary or a count-min sketch. We choose a dictionary and do not delve into the differences between these approaches here.

This module outputs a transform that can be used to featurize the selected data columns.

### Modify Count Table Parameters

To control the output of the count transform, we can use the **Modify Count Table Parameters** module. For this experiment, we select the **LogOddsOnly** option for output features and select the option **Ignore back off column**. We use default values for the other parameters. This is shown below.

![][modifyCountTableParamsBinary]

Note that the count features themselves are generated by the **Apply Transformation** module, described next.

### Apply transformation

To apply the counting transform to the train dataset, we use the **Apply Transformation** module and connect one of its input ports to the modified count transform and the other to the train dataset. This is shown below (the same procedure is repeated for the test dataset).
![][applyTransformationBinary]

An excerpt of the result of generating count features is shown below:

![][countFeaturesBinary]

### Project Columns

At this stage, we are ready to filter out columns that are possible target leaks, as well as columns that we think are not essential to the modeling process. For this binary classification problem, we filter out the following target leaks: "tip\_bin\_value", "total\_amount", "tip\_amount", and "randnum". To do this, we use the **Project Columns** module shown below:

![][projectColumnsBinary]

After doing all the above steps for both the train and test datasets, we are ready to build our binary classification model for this dataset.

### Choice of learner

For the binary classification problem, we choose a two-class logistic regression learner. To use this learner, we select **Two-class Logistic Regression** using the Search toolbar and drag and drop the module onto our experiment canvas. Then, we select the **Train Model** module and do the same. The inputs to the **Train Model** module are the train dataset and the learner, as shown below:

![][trainModelBinary]

For simplicity, we keep the default values of the learner parameters here.

### Scoring the model on test data

After the training is complete, we can measure the performance of our model on the test data by using the **Score Model** module, as shown below.

![][scoreModelBinary]

## Model Performance

We can now evaluate model performance using the **Evaluate Model** module shown below.

![][evaluateModelBinary]

Since this is a binary classification problem, the ROC curve and AUC are good metrics for measuring model performance. Below, we show these metrics on our dataset.

![][modelPerfBinary]

We obtain an AUC of 0.980, which is quite good. In addition, our precision, accuracy, and recall are good as well.
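For readers less familiar with AUC, the following standard-library sketch shows what the number measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores are made up for illustration and are not taken from this experiment, and the function name `roc_auc` is our own. In the experiment itself, the **Evaluate Model** module reports this metric.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: P(score of a random positive > score of a random negative).

    Ties contribute 0.5. This is the O(n_pos * n_neg) textbook form,
    fine for a small illustration but not for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scored examples: label 1 = "tip", label 0 = "no tip".
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here eight of the nine positive/negative pairs are ranked correctly, so the AUC is 8/9. A value of 1.0 would mean every tipped trip is scored above every non-tipped trip, and 0.5 corresponds to random ranking.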
We note that this performance can be improved further by two additional simple steps in the modeling process: (i) use the **Clean Missing Data** module to sanitize missing values in columns, and (ii) use a **Sweep Parameters** module to run parameter sweeps and pick the best logistic regression parameter values, rather than the default settings we chose here.

In this experiment, we demonstrated the learning with counts technique by using the **Build Counting Transform**, **Modify Count Table Parameters**, and **Apply Transformation** modules to generate new count-based features for binary classification on the NYC taxi dataset. In another experiment, we will show how to model this dataset as a multi-class classification problem.

## Summary

We use the learning with counts approach to succinctly represent high-dimensional categorical variables in our modeling. This typically results in smaller models, faster run times, and sometimes better model performance. For simplicity, we have shown binary classification on a sample of the NYC taxi dataset, but the learning with counts technique is very scalable and has been demonstrated internally on very large datasets.
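To make the counting idea concrete, here is a minimal Python sketch of the three steps: build per-value class counts, turn them into a log-odds feature, and apply the table to new rows. It is an illustration under stated assumptions, not the modules' implementation. The add-one smoothing in `log_odds`, the handling of unseen values as zero counts, and all function names are our own choices, and the sketch keys the table by the raw value rather than by a hash.

```python
import math
from collections import defaultdict


def build_count_table(rows, column, label):
    """For each value of `column`, count rows in class 0 and class 1."""
    table = defaultdict(lambda: [0, 0])
    for row in rows:
        table[row[column]][row[label]] += 1
    return dict(table)


def log_odds(counts, smoothing=1.0):
    """Smoothed log-odds of class 1 versus class 0.

    Assumption: add-`smoothing` to both counts so that values with few
    or no observations do not produce infinite features.
    """
    n0, n1 = counts
    return math.log((n1 + smoothing) / (n0 + smoothing))


def apply_counts(rows, column, table):
    """Replace the categorical `column` with its log-odds feature."""
    out = []
    for row in rows:
        # Values never seen while counting fall back to zero counts,
        # which gives a neutral log-odds of 0.
        counts = table.get(row[column], [0, 0])
        new_row = {k: v for k, v in row.items() if k != column}
        new_row[column + "_log_odds"] = log_odds(counts)
        out.append(new_row)
    return out


# Toy data with shortened, hypothetical medallion values.
train = [
    {"medallion": "A", "tipped": 1},
    {"medallion": "A", "tipped": 1},
    {"medallion": "A", "tipped": 0},
    {"medallion": "B", "tipped": 0},
]
test = [{"medallion": "A"}, {"medallion": "C"}]

table = build_count_table(train, "medallion", "tipped")
features = apply_counts(test, "medallion", table)
print(table)
print(features)
```

However many distinct medallions there are, each row ends up with a single numeric feature per counted column, which is where the compactness comes from. Medallion "A" was seen with two tips and one non-tip, so its feature is positive, while the unseen medallion "C" gets 0.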
<!-- Images -->
[labelDistributionBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/labelDistributionBinary.PNG
[buildCountTransformBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/buildCountTransformBinary.PNG
[fullExperimentBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/fullExperimentBinary.PNG
[modifyCountTableParamsBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/modifyCountTableParamsBinary.PNG
[applyTransformationBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/applyTransformationBinary.PNG
[countFeaturesBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/countFeaturesBinary.PNG
[projectColumnsBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/projectColumnsBinary.PNG
[trainModelBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/trainModelBinary.PNG
[scoreModelBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/scoreModelBinary.PNG
[evaluateModelBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/evaluateModelBinary.PNG
[modelPerfBinary]: https://az712634.vo.msecnd.net/samplesimg/v1/41/modelPerfBinary.PNG