Binary Classification: Network Intrusion Detection
Develop a model that uses various network features to detect which network activities are part of an intrusion or attack.
# Binary Classification: Network Intrusion Detection
In this experiment we use various network features to detect which network activities are part of an intrusion or attack.
## Dataset
We used a modified [dataset](http://nsl.cs.unb.ca/NSL-KDD/) from the KDD Cup 1999 competition. The dataset includes both a training set and a test set. Each row contains features describing network activity plus a label indicating the type of activity. All label values except one ('normal') indicate a network intrusion. The training set has approximately 126K examples. It has 41 feature columns, a label column, and an auxiliary 'diff-level' column that estimates the difficulty of correctly classifying a given example (see [1] for a detailed description of this column). The feature columns are mostly numeric, with a few string/categorical features. The test set has approximately 22.5K examples (with the same 43 columns as the training set).
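Before uploading, it can be useful to sanity-check the files locally. Below is a minimal R sketch; it assumes the CSV has a header row and that the label column is named 'class' (as described above), so adjust if your copy of the file differs.

```r
# Load the training set and take a quick look at its shape and labels.
train <- read.csv("network_intrusion_detection.csv", stringsAsFactors = FALSE)

dim(train)          # expect roughly 126K rows and 43 columns (41 features + label + diff-level)
table(train$class)  # label distribution; every value except 'normal' is an attack
```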
We upload the training and test sets to [Azure blob storage](http://azure.microsoft.com/en-us/services/storage/) using the following PowerShell commands:
```powershell
Add-AzureAccount
$key = Get-AzureStorageKey -StorageAccountName <your storage account name>
$ctxt = New-AzureStorageContext -StorageAccountName $key.StorageAccountName -StorageAccountKey $key.Primary
Set-AzureStorageBlobContent -Container <container name in your storage account> -File "network_intrusion_detection.csv" -Context $ctxt
Set-AzureStorageBlobContent -Container <container name in your storage account> -File "network_intrusion_detection_test.csv" -Context $ctxt
```
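Note that Add-AzureAccount prompts for your Azure credentials, and that these are the classic (Service Management) Azure PowerShell cmdlets, so the Azure PowerShell module must be installed first.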
For the purposes of this sample experiment, we uploaded the files to the public 'datasets' container of the 'azuremlsampleexperiments' storage account.
## Data preprocessing
We import the training set into Studio using the **Reader** module with the following parameters:
![Reader][reader]
Note that to read from public blob storage we choose the authentication type 'PublicOrSAS'. The test set is imported in the same way.
The original label column, called 'class', is of string type and has many distinct values; each value corresponds to a different type of attack.
Some attacks have few examples in the training set, and the test set contains attacks that do not appear in the training set at all. To simplify this sample experiment, we build a model that does not distinguish between different types of attacks. For this purpose we replace the 'class' column with a binary column that is 1 if an activity is normal and 0 if it is an attack. Studio provides built-in modules that make this preprocessing step straightforward.
We binarize the 'class' column by using **Metadata Editor** to change its type to categorical, producing a binary indicator column with the **Indicator Values** module, and selecting the resulting 'class-normal' column with the **Project Columns** module. This sequence of steps is shown below:
![Label processing][label_processing]
We apply this transformation to both the training and test sets.
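For reference, the net effect of this transformation can be sketched in plain R (assuming the training set is loaded in a data frame `train` with the label column 'class', as in the sketch above; `class.normal` stands in for the 'class-normal' column produced by **Indicator Values**):

```r
# Binarize the label: 1 for normal activity, 0 for any kind of attack.
train$class.normal <- ifelse(train$class == "normal", 1, 0)
train$class <- NULL  # drop the original multi-valued label column
```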
## Comparison of classifiers
We compare two machine learning algorithms: **Two-Class Logistic Regression** and **Two-Class Boosted Decision Tree**. We also compare two training sets: the first with the original 41 features, and the second with the 15 most important features found by the **Filter Based Feature Selection** module. The parameters of this module are shown below:
![Feature selection][feature_selection]
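The screenshot above shows the module's exact settings. As an illustration of what a filter-based method does, the following R sketch ranks numeric features by their absolute Pearson correlation with the binary label and keeps the 15 highest-scoring ones; this only approximates the module, whose actual scoring method is chosen in its parameters.

```r
# Illustrative filter-based feature selection (not the module's exact method):
# score each numeric feature by |Pearson correlation| with the binary label.
# 'diff.level' (the auxiliary difficulty column) is excluded along with the label.
num_cols <- setdiff(names(train)[sapply(train, is.numeric)],
                    c("class.normal", "diff.level"))
scores   <- sapply(num_cols, function(f) abs(cor(train[[f]], train$class.normal)))
top15    <- names(sort(scores, decreasing = TRUE))[1:15]  # sort() drops NA scores
train15  <- train[, c(top15, "class.normal")]
```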
For every combination of learning algorithm and training set, we train a model and generate predictions using the sequence of steps illustrated below:
![Training][training]
1. Split the training set into 5 folds. This is done using the **Partition and Sample** module with the 'Partition or Sample mode' option set to 'Assign to Folds'.
2. Perform 5-fold cross-validation over the training set. This step and the next two are done by the **Sweep Parameters** module. We connect the partitioned training set to the 'training dataset' input of **Sweep Parameters**. Since we use **Sweep Parameters** in cross-validation mode, we leave the module's right output unconnected.
3. Find the best hyperparameters of the learning algorithm on the given training set. Since we want to evaluate our models by AUC, we set the 'Metric for measuring performance for classification' option of **Sweep Parameters** to 'AUC'.
4. Train the learning algorithm over the training set using the best hyperparameter values from the previous step.
5. Score the test set using the **Score Model** module.
6. Compute AUC over the test set. We use **Evaluate Model** to compute various metrics and **Project Columns** to extract the AUC value. An illustrative R sketch of this whole sequence follows the list.
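The sketch below is a simplified stand-in for steps 1-6, shown for the logistic regression / 15-feature combination. It sweeps glmnet's regularization strength `lambda` as the hyperparameter (the experiment itself sweeps each algorithm's own parameters via **Sweep Parameters**) and uses a rank-based AUC; `train15` comes from the feature-selection sketch above, and `x_test`/`y_test` are hypothetical test-set objects assumed to be prepared the same way.

```r
library(glmnet)

# Rank-based (Mann-Whitney) AUC for binary labels in {0, 1}.
auc <- function(scores, labels) {
  r  <- rank(scores)
  n1 <- sum(labels == 1); n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

x <- as.matrix(train15[, setdiff(names(train15), "class.normal")])
y <- train15$class.normal

folds   <- sample(rep(1:5, length.out = nrow(x)))  # step 1: assign rows to 5 folds
lambdas <- 10 ^ seq(-4, 0, length.out = 9)         # candidate hyperparameter values

# Steps 2-3: 5-fold cross-validated AUC for every candidate lambda.
cv_auc <- sapply(lambdas, function(l) {
  mean(sapply(1:5, function(k) {
    fit <- glmnet(x[folds != k, ], y[folds != k], family = "binomial", lambda = l)
    auc(as.numeric(predict(fit, x[folds == k, ], type = "response")), y[folds == k])
  }))
})
best <- lambdas[which.max(cv_auc)]

# Step 4: retrain on the full training set with the best lambda.
model <- glmnet(x, y, family = "binomial", lambda = best)

# Steps 5-6: score the test set and compute its AUC
# ('x_test'/'y_test' are hypothetical, prepared like 'x' and 'y' above).
test_auc <- auc(as.numeric(predict(model, x_test, type = "response")), y_test)
```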
Having computed AUCs for all 4 combinations of learning algorithm and training set, we use an **Execute R Script** module to generate a table that summarizes the results. This module contains the following R code:
```r
# Each input port carries the AUC values for two of the four combinations.
dataset1 <- maml.mapInputPort(1)
dataset2 <- maml.mapInputPort(2)

# Stack the AUCs and attach a descriptive label to each row.
data.set <- data.frame(c("Logistic Regression, all features", "Boosted Decision Tree, all features",
                         "Logistic Regression, 15 features", "Boosted Decision Tree, 15 features"),
                       rbind(dataset1, dataset2))
names(data.set) <- c("Algorithm, features", "AUC")
maml.mapOutputPort("data.set")
```
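Given the order of the row labels, the first input port is expected to carry the two all-features AUCs and the second the two 15-features AUCs, each pair presumably combined upstream (for example with **Add Rows**) before reaching this module.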
## Results
The final output of the experiment is the left output of the last **Execute R Script** module:
![Results][results]
We conclude that the Boosted Decision Tree, trained with all available features, achieves the best AUC.
## References
[1] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.
<!-- Images -->
[reader]:http://az712634.vo.msecnd.net/samplesimg/v1/3/reader.PNG
[label_processing]:http://az712634.vo.msecnd.net/samplesimg/v1/3/label_processing.PNG
[feature_selection]:http://az712634.vo.msecnd.net/samplesimg/v1/3/feature_selection.PNG
[training]:http://az712634.vo.msecnd.net/samplesimg/v1/3/training.PNG
[results]:http://az712634.vo.msecnd.net/samplesimg/v1/3/results.PNG