Online Fraud Detection: Step 3 of 5, feature engineering
By AzureML Team for Microsoft March 18, 2015
This experiment demonstrates the steps in building an online transaction fraud detection solution.
#Online Transaction Fraud Detection Template

Fraud detection is one of the earliest industrial applications of data mining and machine learning. As part of the Azure Machine Learning offering, Microsoft provides a template that helps data scientists easily build and deploy an online transaction fraud detection solution. The template includes a collection of pre-configured machine learning modules, as well as custom R scripts in the **Execute R Script** module, to enable an end-to-end solution.

Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. Therefore, model performance is measured by using account-level metrics, which will be discussed in detail later.

This template uses the example of online purchase transactions to demonstrate the fraud detection modeling process. The process is split into five separate experiments, each containing one of the following steps.
![workflow]

Here are the links to each step (experiment) of the template:

- **[Online Fraud Detection: Step 1 of 5: Generate tagged data](http://gallery.azureml.net/Details/8e9fe4e03b8b4c65b9ca947c72b8e463)**
- **[Online Fraud Detection: Step 2 of 5: Data Preprocessing](http://gallery.azureml.net/Details/433909cccd4c4a8c9a49bc3a6a04fb61)**
- **[Online Fraud Detection: Step 3 of 5: Feature engineering](http://gallery.azureml.net/Details/33619e6f184b40ba8f8fdb4e9cb52802)**
- **[Online Fraud Detection: Step 4 of 5: Train and Evaluation Model](http://gallery.azureml.net/Details/4f1114c317c34d81b5b18a31712cd576)**
- **[Online Fraud Detection: Step 5 of 5: Publish as web service](http://gallery.azureml.net/Details/66e6ca668a5041058f388d35730c05af)**

<!---
- [Step 1: Data Aggregation]
- [Step 2: Data Preprocessing]
- [Step 3: Feature engineering]
- [Step 4: Train and Evaluation Model]
- [Step 5: Publish as web service]

[Step 1: Data Aggregation]:#step1
[Step 2: Data Preprocessing]:#step2
[Step 3: Feature engineering]:#step3
[Step 4: Train and Evaluation Model]:#step4
[Step 5: Publish as web service]:#step5
-->

##<a id="step1"></a> Step 1: Data Aggregation

The following graphic shows the experiment for this step:

![step1]

The fraud template takes two datasets as inputs: the transactional data and the transaction-level fraud data. The transactional dataset uses a recommended data schema for transaction data, which consists of three groups of data fields:

1. **Transactional Data**. Data fields related to the current transaction.
2. **Account Data**. Data fields related to the transacting user account.
3. **Aggregated Data**. Data aggregated over the account's historical activity. This can be precomputed, or computed in the feature engineering step.

The following table shows a subset of the data schema, the columns in the untagged dataset. You can review these data columns in the online transaction data and fraud data provided as samples in Azure ML Studio.
Note that the date and time format specified in this schema needs to be strictly followed in order to use the fraud detection template. However, you can add additional data fields if available.

![dataSchema1]

**1.0** The Reader reads the sample transactional data and fraud data from a public blob. You can replace it with your own data, provided that it follows the same data schema.

**1.1** The **Metadata Editor** module is used to make sure that the field `accountId` is non-numeric. Azure Machine Learning Studio always treats data fields with all numbers as numeric data types, which is not always the desired behavior. In this template, we changed the account number to a string to enable correct comparison of the account numbers in the sorting step.

**1.2** The field `transactionTime` has the format “HHMMSS”, but sometimes input data will leave out the leading zeros. Therefore, an R script is used to format the time data in the “HHMMSS” format, which is required for subsequent sorting.

**1.3** Both the transaction data and the fraud data are sorted by account and then by datetime. Having sorted data makes it easier to do account-level fraud tagging, account-level feature aggregation, and account-level performance measurement.

**1.4** Use the **Remove Duplicates** module to remove duplicate records from both files.

**1.5** In this step, labels are added to the transactional data by referencing the transaction-level fraud data, a process we call tagging. One of the inputs to the **Execute R Script** module is an R utility file, named `fraudTemplateUtil.zip`, that defines a custom R function, `tagTranDataWithFraud()`. When called, the function generates a new **Label** column, using the tags described below.

    tagTranDataWithFraud<-function(df, frd_df, mode)

The following table lists the parameters that you can use when calling this function.
![func1]

When labeling in transaction-level mode, the following tags are used:

- 0 - NonFraud (nonfraud transactions)
- 1 - Fraud (fraudulent transactions)

When labeling in account-level mode, the following tags are used:

- 0 - NonFraud (transactions from a nonfraud account)
- 1 - Fraud (transactions within the fraud window)
- 2 - Pre-fraud and Post-fraud (transactions outside the fraud window)

_Pre-fraud_ transactions are transactions from the fraudulent account that happened before the fraudulent transactions. _Post-fraud_ transactions are transactions from the fraudulent account that happened after the fraud window. In a business scenario, fraudulent accounts are usually blocked once a fraud transaction is found. Thus pre-fraud and post-fraud transactions are typically excluded from training and evaluation.

**1.6** Run the experiment and save the tagged data as a dataset. Name the dataset `Online Fraud- Tagged Data` so that you can use it in the next step.

##<a id="step2"></a> Step 2: Data Preprocessing

This step performs data preprocessing.

![step2]

**2.0** The reader reads the tagged transactions from a public blob. You can replace the reader with the previously saved dataset (`Online Fraud- Tagged Data`) for your experiment.

**2.1** First, missing values in the train set are handled using the **Clean Missing Data** module. The same missing value transform can be applied to the test set.

**2.2** The tagged data is partitioned into training and test sets using a custom R function, `splitDataByKey()`. As a result of the R script, a `trainFlag` Boolean column is added to the dataset to indicate whether a row belongs to the train or test set. Then the data is further split into train and test sets using the **Split** module. For fraud detection, in order to measure account-level performance, all transactions for any single account must be placed in either the training set or the test set, but not in both.
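The key-based split described in 2.2 can be illustrated with a short Python sketch. This is not the template's R implementation: the function name `split_by_key`, the single `train_rate` parameter, and the sample rows are assumptions made for illustration, and the sketch does not model separate sampling rates for fraud and nonfraud accounts.

```python
import random


def split_by_key(rows, key, train_rate, seed=0):
    """Return copies of `rows` with a `trainFlag` field (1 = train, 0 = test).

    Every row that shares the same value of `key` receives the same flag,
    so no account is split across the two sets.
    """
    rng = random.Random(seed)            # seeded for a reproducible split
    keys = sorted({row[key] for row in rows})
    rng.shuffle(keys)
    n_train = round(len(keys) * train_rate)
    train_keys = set(keys[:n_train])
    return [dict(row, trainFlag=1 if row[key] in train_keys else 0) for row in rows]


# Hypothetical sample: 10 accounts with 3 transactions each.
rows = [
    {"accountID": f"A{i}", "transactionAmount": 10.0 * (j + 1)}
    for i in range(10)
    for j in range(3)
]
flagged = split_by_key(rows, "accountID", train_rate=0.7, seed=1)
train_accounts = {r["accountID"] for r in flagged if r["trainFlag"] == 1}
test_accounts = {r["accountID"] for r in flagged if r["trainFlag"] == 0}
```

Because the sampling is done over account keys rather than over rows, the 70/30 ratio applies to accounts; the ratio of transactions can differ when accounts have different numbers of transactions.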

    # Split and sample the data by key: all transactions with the same key will stay in the same sample population.
    # A `trainFlag` column is added: `trainFlag` = 1 for the training set, `trainFlag` = 0 for the test set
    splitDataByKey<-function(dataset, key, NFrate, Frate= NFrate)

The following table lists the parameters used for the `splitDataByKey()` function:

![func2]

The user can adjust the percentage of data allocated to the training and test sets, which in this template have been set to 70% and 30% respectively.

**2.3** In this step, check for non-scoring conditions, such as invalid values in `transactionAmount`, `transactionDate`, or `transactionTime`. We demonstrate this check using the **Split** and **Execute R Script** modules, respectively.

**2.4** Run the experiment and save the training and test data to Azure ML Studio, naming the datasets `Online Fraud- Train Data` and `Online Fraud- Test Data`. Click the right-hand output of the **Clean Missing Data** module and save the transformation used to correct the missing values with the name `Online Fraud- MV Transform`. You can then re-use this transform in step 5 of the template (the scoring experiment).

##<a id="step3"></a> Step 3: Feature Engineering

In this step, training and testing features are generated based on the training and test datasets saved in the previous step. In the following graphic, the modules in the left half of the experiment canvas use the training dataset, while the modules in the right half of the experiment use the testing dataset. The features that are created can be grouped into four categories, which appear as four parallel workflows in the experiment graph.

- Count based features
- Binary features
- Aggregate features
- Selected raw features

![step3]

**3.0** The reader reads the train data and test data from a public blob. You can replace the readers with the previously saved datasets (`Online Fraud- Train Data` and `Online Fraud- Test Data`) for your experiment.

**3.1 Count based features**.
Using the **Project Columns** module, we selected data fields from the training dataset that have a large number of unique categorical values. Then, we used the **Build Count Transform** module to generate a transform that can be applied to a dataset to replace the original categorical values with count-based features. Essentially, it does the following:

- Builds a count table that summarizes the number of nonFraud and fraud samples for each unique value
- Generates count-based features (log odds, raw counts for each class, and a garbage bin indicator)
- Outputs a transform that can be applied to a new dataset

The **Modify Count Table Parameters** module is used to change the parameters used in the featurization of categorical values (e.g., smoothing, selection of output features).

The **Apply Transformation** module is used to replace the original values in the train and test datasets with new count-based features, such as log odds. The postal code is a good field to try with this type of count-based featurization. The count transform that was derived from the training dataset is also applied to featurize the test dataset.

Typically, after you have used a data column as a basis to create count-based features, you will exclude that column from later use, because the count-based features retain the information and you no longer need the high-dimensionality fields.

**3.2 Binary features**. Azure Machine Learning can create a number of binary flags representing states in categorical input data columns. We can also create additional binary variables by comparing value pairs: for example, we might compare the shipping address to the user address and generate a flag that indicates whether the addresses match.

The following two groups of features are provided as part of the original input dataset.

**3.3 Aggregated account-level features**. These features summarize the historical activity of the account, such as the total amount of purchases during the last week or the last month.
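As an illustration of such an aggregate, the Python sketch below computes, for each transaction, the total amount that the same account spent during the preceding seven days. The function name `trailing_totals`, the tuple layout, the sample values, and the choice to exclude the current transaction are all assumptions made for this example; the template itself does not prescribe how the aggregates are calculated.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def trailing_totals(transactions, window=timedelta(days=7)):
    """For each (account_id, timestamp, amount) tuple, return the total amount
    spent by the same account during `window` before that transaction.

    The current transaction is excluded (an assumption of this sketch).
    Results are returned in the same order as the input.
    """
    recent = defaultdict(deque)          # account -> (timestamp, amount) pairs still in the window
    totals = [0.0] * len(transactions)
    # Process in chronological order while remembering each original position.
    order = sorted(range(len(transactions)), key=lambda i: transactions[i][1])
    for i in order:
        account, ts, amount = transactions[i]
        past = recent[account]
        while past and past[0][0] < ts - window:
            past.popleft()               # drop activity older than the window
        totals[i] = sum((a for _, a in past), 0.0)
        past.append((ts, amount))
    return totals


# Hypothetical transactions: (accountID, datetime, transactionAmount)
sample = [
    ("A", datetime(2015, 3, 1, 9, 0), 10.0),
    ("A", datetime(2015, 3, 3, 9, 0), 20.0),
    ("A", datetime(2015, 3, 9, 9, 0), 5.0),
    ("B", datetime(2015, 3, 2, 9, 0), 7.0),
]
weekly_totals = trailing_totals(sample)
```

In this sample, the third transaction of account A sees only the 20.0 purchase from March 3, because the March 1 purchase is more than seven days old by March 9.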
These features can be pre-computed and included in the input to the model directly, as shown in the sample fraud data. For example, you could use Azure Stream Analytics to compute these values before uploading the data to Azure Machine Learning Studio.

**3.4 Selected raw features**. We excluded the raw data columns that cannot be directly used as model inputs, such as `accountID`, `transactionID`, `transactionDate`, and `transactionTime`. However, these columns are useful for other purposes, such as account-level sampling and performance evaluation.

**3.5** Finally, all four types of features are combined by using the **Add Columns** module. The pre-fraud and post-fraud transactions are excluded from the training and testing feature datasets by using the **Split** module. (Note that in **3.1** we could have selected the columns in the **Build Count Table** module itself without using the **Project Columns** module, and we do not have to single out the aggregate features from the input data. We did so to better illustrate the four categories of features.)

**3.6** Run the experiment and save the train and test features as datasets (named `Online Fraud- Train Features` and `Online Fraud- Test Features`) to be used for training and evaluation in the next step. Save the generated count table and its metadata as datasets (named `Online Fraud- Count Table` and `Online Fraud- Count Table Metadata`) to be used in the scoring step (step 5).

##<a id="step4"></a> Step 4: Model Training and Evaluation

Of the features constructed in Step 3, the training features are used for training, and the test features are used for evaluation.

![step4]

**4.0** Read the train and test features from a public blob. You can replace these readers with previously saved datasets (`Online Fraud- Train Features` and `Online Fraud- Test Features`) in this experiment.

**4.1** Training a model.
Train a model based on the **Two-Class Boosted Decision Tree** algorithm, passing the training dataset as an input to the **Train Model** module.

**4.2** Generating scores. Use the **Score Model** module with the trained boosted decision tree model to score the test features.

**4.3** Evaluating the model's performance. Evaluate the accuracy of the model, at the transaction and account level, by using the **Evaluate Model** module together with a custom R script in the **Execute R Script** module.

The metric used for assessing accuracy (performance) depends on how the original cases are processed. If each case is processed on a transaction-by-transaction basis, you can use a standard performance metric, such as the transaction-based ROC curve or AUC. You can calculate and visualize both metrics using the **Evaluate Model** module.

However, for fraud detection, account-level metrics are typically used, based on the assumption that once a transaction is discovered to be fraudulent (for example, via customer contact), an action will be taken to block all subsequent transactions. A major difference between account-level metrics and transaction-level metrics is that an account confirmed as a false positive (that is, fraudulent activity was predicted where it did not exist) will typically not be contacted again for a short period of time, to avoid inconveniencing the customer.

The industry-standard fraud detection metrics for account-level performance are ADR vs. AFPR and VDR vs. AFPR. They are reported alongside transaction-level performance and are defined as follows:

- ADR - Fraud Account Detection Rate. The percentage of detected fraud accounts in all fraud accounts.
- VDR - Value Detection Rate. The percentage of monetary savings, assuming the current fraud transaction triggered a blocking action on subsequent transactions, over all fraud losses.
- AFPR - Account False Positive Ratio. The ratio of detected false positive accounts over detected fraud accounts.

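To make the three definitions concrete, here is a minimal Python sketch that computes them at a single score threshold. It is an illustration only, not the template's R implementation: the function name `account_level_metrics`, the tuple layout, and the sample rows are hypothetical, and the sketch ignores the contact period and sampling-rate corrections. It also assumes that an account is alerted at its first transaction scoring at or above the threshold, and that only fraud transactions after the alerting transaction count as savings.

```python
from collections import defaultdict


def account_level_metrics(rows, threshold):
    """Compute (ADR, VDR, AFPR) as fractions at one score threshold.

    rows: iterable of (account_id, timestamp, amount, label, score),
    where label 1 = fraud and label 0 = nonfraud.
    """
    by_account = defaultdict(list)
    for account, ts, amount, label, score in rows:
        by_account[account].append((ts, amount, label, score))

    fraud_accounts = detected_fraud = false_positive = 0
    fraud_loss = saved = 0.0
    for txns in by_account.values():
        txns.sort(key=lambda t: t[0])    # chronological order within the account
        is_fraud = any(label == 1 for _, _, label, _ in txns)
        # Index of the first transaction that raises an alert, if any.
        alert_at = next((i for i, t in enumerate(txns) if t[3] >= threshold), None)
        if is_fraud:
            fraud_accounts += 1
            fraud_loss += sum(a for _, a, l, _ in txns if l == 1)
            if alert_at is not None:
                detected_fraud += 1
                # Assumption: transactions after the alert are blocked, so their value is saved.
                saved += sum(a for _, a, l, _ in txns[alert_at + 1:] if l == 1)
        elif alert_at is not None:
            false_positive += 1

    adr = detected_fraud / fraud_accounts if fraud_accounts else 0.0
    vdr = saved / fraud_loss if fraud_loss else 0.0
    afpr = false_positive / detected_fraud if detected_fraud else None  # undefined without detections
    return adr, vdr, afpr


# Hypothetical scored transactions: (accountID, time order, amount, Label, score)
scored = [
    ("A", 1, 100.0, 1, 0.2),
    ("A", 2, 50.0, 1, 0.9),
    ("A", 3, 150.0, 1, 0.8),
    ("B", 1, 40.0, 1, 0.1),
    ("C", 1, 30.0, 0, 0.95),
    ("D", 1, 20.0, 0, 0.05),
]
adr, vdr, afpr = account_level_metrics(scored, threshold=0.5)
```

Sweeping the threshold over the range of scores and plotting ADR and VDR against AFPR would give curves of the kind described above.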

The following R function generates the above account-level fraud statistics and additional transaction-level statistics. The output of the function can be plotted in the right-hand output of the **Execute R Script** module to show the performance of the model.

    scr2stat <-function(dataset, contactPeriod, sampleRateNF,sampleRateFrd)

The following table lists the parameters for the `scr2stat()` function:

![func3]

In addition to the parameters in the table, the `scr2stat()` function requires the following data columns to run successfully: `accountID`, `transactionDate`, `transactionTime`, `transactionAmount`, `Label`, `Scored Probabilities`. The function returns all performance statistics, which are then plotted using the R `plot` function. The left-hand output of the **Execute R Script** module contains the performance statistics, and the right-hand output shows the plots of the above performance curves.

Here is an example of the account-level performance curves, generated from a sample dataset with approximately 1.5 million transactions. When the data size is small and the fraud rate is very low, the account-level curves do not look smooth, as you may see from the curves generated by the sample data used in this template (around 205,000 transactions).

![perf1] ![perf2]

**4.4** Save the model as a trained model (name it `Online Fraud- Trained Model`) by clicking the output of **Train Model**, so that the model can be used in the scoring experiment.

##<a id="step5"></a> Step 5: Publish as a Web Service

The trained model is now ready for deployment as a web service. To deploy a model, we need to connect all the data processing, feature engineering, and scoring modules, the saved transforms (e.g., MV transform, count transform), and the trained model to form one scoring experiment, as shown in the following graphic.
![step5]

We read the test transaction data, count table, and count table metadata from a public blob. You can replace these with the `Online Fraud- Test Data`, `Online Fraud- Count Table` and `Online Fraud- Count Table Metadata` datasets saved in previous steps.

When you run the experiment, you will be prompted to select a web service **Input** point and an **Output** point. After doing so, click the **Publish web service** button. Your first fraud detection scoring web service is up and running! Click the Web Service icon to see the published web service, as shown below:

![webService]

Click the Online Fraud web service to see the following:

![webAPI]

###Consuming the Web Service

The web service can be consumed in two modes:

- RRS (request-response service)
- BES (batch execution service)

Click the **API help page** link to open a help page that shows sample code (C#/Python/R) for calling the web service. The following code sample illustrates how you can call the RRS web service by using Python.

![apiCode]

##Summary

Microsoft Azure Machine Learning provides a cloud-based platform for data scientists to easily build and deploy machine learning applications. This fraud detection template used an online purchase scenario to illustrate the process of fraud detection, but it can be adapted to other fraud detection scenarios. This template, along with other templates published by Microsoft, helps users to rapidly prototype and deploy machine learning solutions.

<!-- Images -->
[workflow]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/workflow.PNG
[step1]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/step1.PNG
[dataSchema1]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/dataSchema1.PNG
[step2]:https://az712634.vo.msecnd.net/samplesimg/v2/T3/step2.PNG
[step3]:https://az712634.vo.msecnd.net/samplesimg/v3/T3/step3.PNG
[step4]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/step4.PNG
[step5]:https://az712634.vo.msecnd.net/samplesimg/v2/T3/step5.PNG
[webService]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/webService.PNG
[webAPI]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/webAPI.PNG
[apiCode]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/apiCode.PNG
[func1]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/func1.PNG
[func2]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/func2.PNG
[func3]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/func3.PNG
[perf1]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/perf1.PNG
[perf2]:https://az712634.vo.msecnd.net/samplesimg/v1/T3/perf2.PNG