Step 2. Train a VW Model

September 15, 2016

This example shows how to train a Vowpal Wabbit model using a training dataset stored in Azure blob storage, then evaluate its performance using a validation set.
This is step 2 in the [Vowpal Wabbit Samples Collection](https://gallery.cortanaintelligence.com/Collection/Vowpal-Wabbit-Samples-2). To see how the training and validation datasets are generated, please see [step 1](https://gallery.cortanaintelligence.com/Experiment/Convert-Dataset-to-VW-Format-2). In the previous experiment, we converted the Adult Income Census data, which contains 32k+ rows, into the VW file format, then split it into a training set and a validation set. You can also download these two files from [here](http://az754797.vo.msecnd.net/docs/vw/income.zip).

**NOTE**: even though the dataset in this example is relatively small, with only 32k+ rows, exactly the same approach can be followed when the training set is well beyond the roughly 10 GB limit that most Azure ML modules allow. VW is an online learner, so it can read data from blob storage and train in a streaming fashion, without having to load the entire dataset into memory.

In this experiment, we use the training and validation sets produced in the previous experiment to train and evaluate the model.

![train](http://az754797.vo.msecnd.net/docs/vw/vw_train_small.png)

The fact that the Train VW module can only read training data from Azure blob storage may appear rather limiting, but it is actually a very efficient way to stream data into VW. The storage account keys in the Train VW module and the Import Data module have been removed, and will appear empty if you open this experiment in Studio. You can download the files, upload them into your own Azure blob storage account, and then configure these modules accordingly.

In the VW arguments property of the Train VW module, we supply the following, which tells VW to use logistic regression:

```
--loss_function logistic
```

You can also use *squared*, *hinge*, or *quantile*; see the VW documentation for more details.

The other notable detail is that an Execute R Script module is used to attach metadata to the result columns so that they can be picked up by the Evaluate Model module and visualized properly. Here is the R code inside the Execute R Script module:

```R
dataset <- maml.mapInputPort(1) # class: data.frame

# set the decision threshold
threshold <- 0.5

# set the negative class
dataset$MyScoredLabels[dataset$Results < threshold] <- -1

# set the positive class
dataset$MyScoredLabels[dataset$Results >= threshold] <- 1

# "Results" holds the probability when "--link logistic" is on;
# rename it to MyScoredProbabilities
names(dataset)[names(dataset) == "Results"] <- "MyScoredProbabilities"

# set metadata for the scored probability and label columns
dataset <- set.binary.classification.scores(dataset, "MyScoredProbabilities", "MyScoredLabels")

# set metadata for the true label column
dataset <- set.true.label(dataset, "Labels")

maml.mapOutputPort("dataset")
```

And here are the evaluation results:

![evaluate results](https://azuremluxcdnprod001.blob.core.windows.net/homepage/images/eval.PNG)

After you are happy with the performance of the model, you are ready to move on to [step 3, operationalize the trained VW model](https://gallery.cortanaintelligence.com/Experiment/Part-III-Operationalize-a-VW-model-1).
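As an aside, because the training and validation files are plain VW-format text, you can reproduce this step on your own machine with the VW command-line tool using the same arguments. Below is a minimal sketch; the file names `income.train.txt`, `income.validate.txt`, and `income.model` are placeholders for whatever is in your copy of the downloaded zip, so adjust them accordingly:

```
# Train a logistic regression model on the VW-format training set.
# --link logistic makes VW output probabilities rather than raw scores,
# matching what the Execute R Script above expects in "Results".
vw income.train.txt --loss_function logistic --link logistic -f income.model

# Score the validation set: -t turns off learning, -i loads the saved
# model (which retains the link transform), and -p writes one predicted
# probability per line to predictions.txt.
vw income.validate.txt -t -i income.model -p predictions.txt
```

Thresholding `predictions.txt` at 0.5, exactly as the R script does, yields the same -1/+1 scored labels.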