Binary Classification: Network intrusion detection

September 2, 2014
Develop a model that uses various network features to detect which network activities are part of an intrusion/attack.
# Binary Classification: Network Intrusion Detection

In this experiment we use various network features to detect which network activities are part of an intrusion/attack.

## Dataset

We used a modified [dataset]() from the KDD Cup 1999 competition. The dataset includes both a training set and a test set. Each row contains features describing network activity and a label giving the type of activity. All activities except one (with the value 'normal') indicate a network intrusion.

The training set has approximately 126K examples. It has 41 feature columns, a label column, and an auxiliary 'diff-level' column that estimates the difficulty of correctly classifying a given example (see [1] for a detailed description of this column). The feature columns are mostly numeric, with a few string/categorical features. The test set has approximately 22.5K examples with the same 43 columns as the training set.

We upload the training and test sets into [Azure blob storage]() using the following PowerShell commands:

```
Add-AzureAccount
$key = Get-AzureStorageKey -StorageAccountName <your storage account name>
$ctxt = New-AzureStorageContext -StorageAccountName $key.StorageAccountName -StorageAccountKey $key.Primary
Set-AzureStorageBlobContent -Container <container name in your storage account> -File "network_intrusion_detection.csv" -Context $ctxt
Set-AzureStorageBlobContent -Container <container name in your storage account> -File "network_intrusion_detection_test.csv" -Context $ctxt
```

For the purpose of this sample experiment we uploaded the files to the 'datasets' public container of the 'azuremlsampleexperiments' storage account.

## Data Preprocessing

We import the training set into Studio using the **Reader** module with the following parameters:

![Reader][reader]

Note that to read from public blob storage we choose the authentication type 'PublicOrSAS'. The test set is imported in a similar way.

The original label column, called 'class', is of string type and has many values.
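For readers following along outside Studio, the label handling described here can be sketched in plain Python. The column name 'class' and the value 'normal' come from the dataset description above; the sample rows and feature names below are hypothetical stand-ins for rows of network_intrusion_detection.csv:

```python
def binarize_labels(rows, label_col="class"):
    """Replace the multi-valued string label with a binary one:
    1 for 'normal' activity, 0 for any kind of attack."""
    out = []
    for row in rows:
        row = dict(row)  # avoid mutating the caller's rows
        row["class_normal"] = 1 if row.pop(label_col) == "normal" else 0
        out.append(row)
    return out

# Hypothetical rows; real rows have 41 feature columns.
sample = [
    {"duration": "0", "protocol_type": "tcp", "class": "normal"},
    {"duration": "0", "protocol_type": "udp", "class": "neptune"},
]
print([r["class_normal"] for r in binarize_labels(sample)])  # [1, 0]
```

This mirrors the Studio modules described below, which perform the same replacement declaratively rather than in code.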
Each string value corresponds to a different type of attack. Some attacks have few examples in the training set, and the test set contains new attacks. To simplify this sample experiment we build a model that does not distinguish between different types of attacks. For this purpose we replace the 'class' column with a binary column that has 1 if an activity is normal and 0 if it is an attack.

Studio provides built-in modules that ease this preprocessing step. The binarization of the 'class' column is achieved by using **Metadata Editor** to change the type of the 'class' column to categorical, getting a binary column with the **Indicator Values** module, and selecting the 'class-normal' column with the **Project Columns** module. This sequence of steps is shown below:

![Label processing][label_processing]

We apply this transformation to both the training and test sets.

## Comparison of Classifiers

We compare two machine learning algorithms: **Two-Class Logistic Regression** and **Two-Class Boosted Decision Tree**. We also compare two training sets: the first with the original 41 features, and the second with the 15 most important features as found by the **Filter Based Feature Selection** module. The parameters of this module are shown below:

![Feature selection][feature_selection]

For every combination of learning algorithm and training set we train a model and generate predictions using the following sequence of steps, illustrated below:

![Training][training]

1. Split the training set into 5 folds. This is done with the **Partition and Sample** module, with the 'Partition or sample mode' option set to 'Assign to Folds'.
2. Do 5-fold cross-validation over the training set. This and the next two steps are done by the **Sweep Parameters** module. We connect the partitioned training set to the 'training dataset' input of **Sweep Parameters**. Since we use **Sweep Parameters** in cross-validation mode, we leave the module's right output unconnected.
3. Find the best hyperparameters of the learning algorithm on the given training set. We would like to use the AUC metric to evaluate the performance of our model, so we set the 'Metric for measuring performance for classification' option of **Sweep Parameters** to 'AUC'.
4. Train the learning algorithm over the training set using the best hyperparameter values from the previous step.
5. Score the test set using the **Score Model** module.
6. Compute AUC over the test set. We use **Evaluate Model** to compute various metrics and **Project Columns** to extract the AUC values.

Having computed AUCs for all 4 combinations of learning algorithm and training set, we use **Execute R Script** to generate a table that summarizes all results. This module has the following R code:

```
dataset1 <- maml.mapInputPort(1)
dataset2 <- maml.mapInputPort(2)
data.set <- data.frame(c("Logistic Regression, all features",
                         "Boosted Decision Tree, all features",
                         "Logistic Regression, 15 features",
                         "Boosted Decision Tree, 15 features"),
                       rbind(dataset1, dataset2))
names(data.set) <- c("Algorithm, features", "AUC")
maml.mapOutputPort("data.set")
```

## Results

The final output of the experiment is the left output of the last **Execute R Script** module:

![Results][results]

We conclude that Boosted Decision Tree, trained with all available features, achieves the best AUC.

## References

[1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, "A Detailed Analysis of the KDD CUP 99 Data Set," Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.

<!-- Images -->
[reader]:
[experiment]:
[label_processing]:
[feature_selection]:
[training]:
[results]:
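Since the whole pipeline is tuned and evaluated on AUC, it may help to recall what **Evaluate Model** is computing. The sketch below is a minimal pure-Python rank-based AUC (the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting half); the toy labels and scores are made up for illustration:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank formulation:
    fraction of (positive, negative) pairs where the positive
    example receives the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scored test set: 1 = normal activity, 0 = attack; scores made up.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.35, 0.6]
print(auc(labels, scores))  # 1.0: every positive outranks every negative
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of normal activity from attacks, which is why it is a reasonable single number for comparing the four models above.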