Online Fraud Detection: working with unbalanced class data

December 18, 2015
Apply down/over sampling techniques to deal with the unbalanced class problem in a fraud detection use case.
# Introduction

This is a modification of Step 4 of Online Fraud Detection. You may also go to the Cortana Analytics Gallery and search for "online fraud detection" to review all five steps of the original experiment.

The labels in fraud transaction data are usually extremely unbalanced. In other words, most of the transactions are good transactions, and the fraud rate is typically under 1%. If we train a model directly on the original data, the few fraud examples make it very difficult to find a proper decision boundary, which results in poor performance on the testing data set. Here we introduce both down sampling and over sampling techniques to deal with the unbalanced class problem.

# Details of the Experiment

Four independent sub-experiments are included: train the model on the original data; train the model with random down sampling only; train the model with over sampling only; and train the model with both random down sampling and over sampling. The following graph shows the experiments.

![enter image description here][1]

The description of each sub-experiment is the following:

1. No down sampling or over sampling. This is the same experiment as step 4 of the original online fraud detection template. We train the model directly on the data with an extremely unbalanced label (i.e., fraud rate 0.57%).
2. Down sampling only. We split the training data into two groups: Label=0 (non-fraud) and Label=1 (fraud). All observations in the Label=1 group are kept, while observations in the Label=0 group are randomly selected with a sampling rate of 0.05. Finally, we combine the down-sampled group (Label=0) with the untouched group (Label=1). The fraud rate increases to 10.36%.
3. Over sampling only. We apply SMOTE (Synthetic Minority Over-sampling Technique) to over sample the fraud transactions (Label=1). A simple example of how SMOTE simulates an observation for the minority class: 1) pick one observation in the minority class; 2) find its nearest neighbor in the minority class; 3) randomly select one point along the line segment between the two points picked above. The over sampling rate is 2000% and the number of nearest neighbors used is 1. The fraud rate is 10.82% after SMOTE.
4. Down sampling and over sampling. This combines the down sampling of experiment 2 with the over sampling of experiment 3, using different rates. We down sample the good transactions (Label=0) with a down sampling rate of 0.1 and, in the meantime, over sample the bad transactions (Label=1) using SMOTE with an over sampling rate of 200%. The fraud rate is 14.77% after combining the two.

# Results

The four data sets are used to train the same model (gradient boosted tree), and each trained model is tested on the same testing data set. The AUC values are the following:

1. No down sampling or over sampling: 0.710
2. Down sampling only: 0.744
3. Over sampling only: 0.741
4. Down sampling and over sampling: 0.761

The following graph shows the ROC curves. The blue one is for experiment 1 (no down sampling or over sampling); the red one is for experiment 4 (combination of down sampling and over sampling). We can see that the red curve dominates the blue one.

![enter image description here][2]

The graph below shows the ROC curves at the account level, again with blue for no sampling techniques and red for the combination of down sampling and over sampling.

![enter image description here][3]

# Summary

Down/over sampling of the original data improves model performance: the AUC rises from 0.710 with no sampling to 0.744 with down sampling only, 0.741 with over sampling only, and 0.761 when the two are combined. Down sampling and over sampling are typical techniques for dealing with unbalanced data. The 0-1 ratio (non-fraud to fraud) is sometimes pre-determined based on prior knowledge, or it may be fine-tuned for each specific case.

Created by a Microsoft Employee

[1]:
[2]:
[3]:
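The down sampling and SMOTE steps described for experiments 2 to 4 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the experiment: the function names `down_sample` and `smote_1nn`, the two-feature toy data, and the random seed are all hypothetical. It also assumes that an over sampling rate of N% adds N/100 synthetic points per original fraud observation, a reading that is consistent with the fraud rates reported above.

```python
import numpy as np


def down_sample(X, y, rate, rng):
    """Keep every fraud row (y == 1) and a random fraction `rate` of non-fraud rows."""
    fraud_idx = np.flatnonzero(y == 1)
    good_idx = np.flatnonzero(y == 0)
    n_keep = int(round(rate * good_idx.size))
    kept_good = rng.choice(good_idx, size=n_keep, replace=False)
    idx = np.concatenate([kept_good, fraud_idx])
    return X[idx], y[idx]


def smote_1nn(X_min, n_new_per_point, rng):
    """SMOTE-style over sampling with a single nearest neighbor (k = 1).

    For each minority point, draw `n_new_per_point` synthetic points on the
    line segment between that point and its nearest minority neighbor.
    """
    # Pairwise squared distances; O(n^2) memory, fine for a small toy example.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = dist.argmin(axis=1)

    base = np.repeat(X_min, n_new_per_point, axis=0)
    neighbor = np.repeat(X_min[nearest], n_new_per_point, axis=0)
    gap = rng.random((base.shape[0], 1))  # random position along the segment
    return base + gap * (neighbor - base)


# Hypothetical toy data with a 0.57% fraud rate (57 fraud rows out of 10,000).
rng = np.random.default_rng(0)
n_good, n_fraud = 9943, 57
X = np.vstack([rng.normal(0.0, 1.0, (n_good, 2)),
               rng.normal(2.0, 1.0, (n_fraud, 2))])
y = np.concatenate([np.zeros(n_good, dtype=int), np.ones(n_fraud, dtype=int)])

# Settings of experiment 4: down sampling rate 0.1, over sampling rate 200%.
X_ds, y_ds = down_sample(X, y, rate=0.1, rng=rng)
synthetic = smote_1nn(X_ds[y_ds == 1], n_new_per_point=2, rng=rng)
X_bal = np.vstack([X_ds, synthetic])
y_bal = np.concatenate([y_ds, np.ones(len(synthetic), dtype=int)])

fraud_rate = y_bal.mean()
print(f"fraud rate after down + over sampling: {fraud_rate:.2%}")
```

With these toy counts, the balanced set holds 994 non-fraud rows and 171 fraud rows (57 original plus 114 synthetic), a fraud rate of roughly 14.7%, which is close to the 14.77% reported for experiment 4. In practice the resampling should be applied to the training data only, as in the experiment, so that the testing data set keeps its original class distribution.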