Data Scientists’ Guide to Azure Machine Learning Studio

October 2, 2015
This experiment, along with a detailed description, is published to the Cortana Intelligence Gallery as a guide for data scientists new to Azure Machine Learning Studio. Created by a Microsoft Employee.
## **1 Introduction**

As a data scientist, I have experience with statistical modeling and machine learning using R. When I was first exposed to Azure Machine Learning Studio, I spent a couple of weeks trying to understand the big picture. I read all the [online documentation][doc link] and worked through the examples described there. Along the way I ran into many questions and asked around about them, and I felt the need for a tutorial aimed at someone with a background like mine.

In this tutorial, I try to cover the things I found most relevant from my own experience, some of which are explained in the Azure Machine Learning Studio documentation and some of which are not. The purpose is to help you grasp the core elements of using Azure Machine Learning Studio in about 3-4 hours: managing a workspace, fitting models, evaluating models, setting up a web service, consuming a web service, and running R scripts. Once you understand the big picture as described here, you can learn more about specific topics by reading the online documentation.

For illustration purposes, a linear regression model will be developed using the Boston housing dataset that comes with the R package MASS. While some prior R knowledge is helpful, it is not required for understanding the material here. If all you want to learn is the basic steps for developing models with Azure Machine Learning Studio, and NOT topics like understanding the workspace and setting up web services, you might find the [Tutorial on creating your first experiment][tut link] helpful.

As a caveat, Azure Machine Learning Studio is constantly being improved, which means that some of the information (e.g., module names) in this guide might be different by the time you read it. However, I expect that you will be able to follow this guide by using some good judgment.
If you want the figure links in this document to open in a new browser tab, you can use ctrl-click.

## **2 Set Up the Workspace**

Go to the [Azure Machine Learning][AzureML link] website and sign in with your [Microsoft account][ms account]. If you don't have a subscription to Azure Machine Learning, you can click the [Azure subscription][Azure sub] link on the page and try Azure for free. Azure Machine Learning is one of the products in Azure, so your Azure subscription gives you access to Azure Machine Learning.

After signing in, click on “Studio” to enter Azure Machine Learning Studio, where you are automatically signed into a WORKSPACE. Think of a WORKSPACE as a project folder. In [Figure 1][pic 1] I'm in a workspace named "my-free-workspace", as indicated by the name at the **top right** corner.

[![Figure 1][pic 1]][pic 1] Figure 1

Clicking on the workspace name will show you whether you have other workspaces. [Figure 2][pic 2] shows that I have two other workspaces.

[![Figure 2][pic 2]][pic 2] Figure 2

I'll use "my-free-workspace" in this example. By clicking on SETTINGS in the left pane, you can manage the workspace by changing its name ([Figure 3][pic 3]), checking your authorization tokens, and adding or removing users. Change the workspace name to anything you like and leave the other fields unchanged.

[![Figure 3][pic 3]][pic 3] Figure 3

In addition to the SETTINGS tab, there are five other tabs in your workspace’s left pane, and you can think of them as subfolders: Experiments, Web Services, Notebooks, Datasets, and Trained Models. All of them should be empty now since you have not started working on the project.

## **3 Create an Experiment**

### **3.1 Start an Experiment**

Now let's start the project by creating a new EXPERIMENT. At the bottom of your workspace page there is a toolbar, and the first icon on it is "+New". Click on it and a new window pops up. From here you can add a new DATASET, MODULE, EXPERIMENT, or NOTEBOOK.
By clicking on "EXPERIMENT" in the left pane of this pop-up window you'll see a window similar to [Figure 4][pic 4].

[![Figure 4][pic 4]][pic 4] Figure 4

There are some samples available for use. Here let's click on the first one, "Blank Experiment." Now a new EXPERIMENT has been generated with the title "Experiment created on [today’s date]" ([Figure 5][pic 5]).

[![Figure 5][pic 5]][pic 5] Figure 5

You can change the title to something like "My 1st Experiment" ([Figure 6][pic 6]). Within an EXPERIMENT, there are some tabs in the left pane, e.g., "Saved Datasets", "Trained Models", etc. We’ll be using many items from this pane during the demo. In the middle is a canvas, and to the right you can enter some Summary and Description information about the EXPERIMENT. EXPERIMENTS are created by editing contents on the canvas.

[![Figure 6][pic 6]][pic 6] Figure 6

### **3.2 Key Elements of an Experiment**

What characterises the development of an EXPERIMENT in Azure Machine Learning Studio is the use of DATASETS and MODULES. MODULES perform different functions on datasets, e.g., reading data, converting data formats, initializing a model, fitting a model, scoring a new dataset, and evaluating a fitted model. MODULES can also be used to execute R or Python scripts. The typical process of developing a model using MODULES in Azure Machine Learning Studio is shown in [Figure 7][pic 7].

[![Figure 7][pic 7]][pic 7] Figure 7

### **3.3 Load Data**

In this example we'll use a dataset from R. Click on "R Language Modules" in the left pane and you'll see two modules: "Create R Model" and "Execute R Script". Click and hold "Execute R Script" and move it onto the canvas ([Figure 8][pic 8]).

[![Figure 8][pic 8]][pic 8] Figure 8

Click on "Execute R Script" on the canvas and the right pane shows options related to this module: R Script and Random Seed.
Click within the R Script window in the right pane, delete the default lines of code, and replace them with the following lines of code ([Figure 9][pic 9]):

```r
library(MASS)
maml.mapOutputPort("Boston")
```

[![Figure 9][pic 9]][pic 9] Figure 9

Now let's run the EXPERIMENT by clicking on the "RUN" button on the toolbar at the bottom of the workspace. This automatically saves the project. After the program finishes running (as indicated by the checkmark in the top right corner of the canvas), let's check the dataset contents by right-clicking on the left output node of the "Execute R Script" module ([Figure 10][pic 10]) and selecting Visualize.

[![Figure 10][pic 10]][pic 10] Figure 10

Now a new window similar to [Figure 11][pic 11] shows up.

[![Figure 11][pic 11]][pic 11] Figure 11

We see that there are 506 rows and 14 columns in the dataset. Azure ML Studio currently allows preview for a maximum of 100 rows and 100 columns. By clicking on any column you'll be able to see the summary statistics and a histogram for the variable. The last variable, “medv”, represents the median value of homes in $1000s. Other variables have information about crime rate, index of accessibility to radial highways, etc. Now close this window and we’ll use the dataset to fit a linear model.

### **3.4 Initialize Model**

Here we are going to fit a linear regression model. Before fitting the model we need to initialize it (as explained in [Figure 7][pic 7]). The module for initializing a linear regression can be found in the left pane of the EXPERIMENT: Machine Learning -> Initialize Model -> Regression -> Linear Regression. You can see that different types of models can be initialized, which are classified as Anomaly Detection, Classification, Clustering, and Regression. To learn more about which models should be used, you can refer to the [cheat sheet][cheatsheet link]. For this EXPERIMENT, drag the “Linear Regression” module onto the canvas.
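As an optional check outside the Studio, the row and column counts reported in Section 3.3 can be reproduced in a local R session. This is a minimal sketch that assumes R and the MASS package are installed locally; it does not use any Studio-specific function such as `maml.mapOutputPort`.

```r
library(MASS)  # provides the Boston data frame

# Number of rows and columns in the dataset
print(dim(Boston))

# Column names; "medv" is the response used later in this guide
print(names(Boston))

# Summary statistics for the response variable
print(summary(Boston$medv))
```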
Now click on the "Linear Regression" module in the canvas and you'll see some parameters for it in the right pane ([Figure 12][pic 12]): Solution method, L2 regularization weight, Random number seed, etc.

[![Figure 12][pic 12]][pic 12] Figure 12

Follow [Figure 13][pic 13] and set Solution method to Ordinary Least Squares and L2 regularization weight to 0. Check "Include intercept term" and "Allow unknown categorical levels". Leave "Random number seed" empty since it won't affect the solution method we selected.

[![Figure 13][pic 13]][pic 13] Figure 13

### **3.5 Train Model**

Next we add the "Train Model" module onto the canvas ([Figure 14][pic 14]), which can be found in the left pane of the EXPERIMENT: Machine Learning -> Train -> Train Model.

[![Figure 14][pic 14]][pic 14] Figure 14

The different "Train" modules are used to train different types of models. For example, "Train Clustering Model" is for training clustering models. Now link the output node of the "Linear Regression" module to the left input node of the "Train Model" module, and link the left output node of the "Execute R Script" module to the right input node of the "Train Model" module ([Figure 15][pic 15]).

[![Figure 15][pic 15]][pic 15] Figure 15

Now the "Train Model" module knows which model to fit and what data to use. One more thing we need to do is to specify the response/dependent and predictor/independent variables. To specify the response/dependent variable, click on the "Train Model" module in the canvas and then "Launch column selector" in the right pane. In the window that pops up ([Figure 16][pic 16]), select "column names" from the 2nd dropdown menu and "medv" for the third box.

[![Figure 16][pic 16]][pic 16] Figure 16

Click on the checkmark in the bottom right corner of the popup window ([Figure 17][pic 17]) to confirm the selection and return to the canvas.
[![Figure 17][pic 17]][pic 17] Figure 17

All the columns in the "Train Model" module's right input node are now treated as predictor/independent variables except for "medv." In practice you’ll need to carry out variable selection to select the right predictor variables. Since we are just trying to demonstrate the process of using the Studio, however, we won’t do this here. Click on the "RUN" button on the toolbar and wait for it to finish ([Figure 18][pic 18]).

[![Figure 18][pic 18]][pic 18] Figure 18

Right-click on the output node of the "Train Model" module ([Figure 19][pic 19]) and select "Visualize."

[![Figure 19][pic 19]][pic 19] Figure 19

A new window pops up with information about the fitted model ([Figure 20][pic 20]).

[![Figure 20][pic 20]][pic 20] Figure 20

So the fitted regression can be written as:

*medv = 36.4595 - 17.7666 nox + 3.8099 rm + 2.6867 chas - 1.4756 dis - 0.9527 ptratio - 0.5248 lstat + 0.3060 rad - 0.1080 crim + 0.0464 zn + 0.0206 indus - 0.0123 tax + 0.0093 black + 0.0007 age*

### **3.6 Score Data and Evaluate Model**

In order to check the performance of the fitted model, we need to first score a dataset and then evaluate the model. The "Score Model" module can be found at Machine Learning -> Score -> Score Model. Drag it onto the canvas ([Figure 21][pic 21])

[![Figure 21][pic 21]][pic 21] Figure 21

and connect the nodes as shown in [Figure 22][pic 22].

[![Figure 22][pic 22]][pic 22] Figure 22

This tells Azure Machine Learning Studio we are using the fitted model to make predictions for the training dataset. Click on "RUN" on the toolbar and wait until it's finished. Now right-click the output node and select Visualize ([Figure 23][pic 23]).

[![Figure 23][pic 23]][pic 23] Figure 23

In the pop-up window ([Figure 24][pic 24]), the last column gives the predictions.

[![Figure 24][pic 24]][pic 24] Figure 24

Please note that in practice you can use the “Score Model” module to score a new dataset and evaluate the model’s performance on it.
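
Scoring data that the model has not seen can be sketched in plain R, outside the Studio. The following is only an illustration under stated assumptions: the 70/30 proportion and the seed are arbitrary choices that do not come from this guide, and the variable names (`train_idx`, `train`, `test`, `rmse_test`) are hypothetical.

```r
library(MASS)  # provides the Boston data frame

set.seed(42)   # arbitrary seed so the split is reproducible
n <- nrow(Boston)

# Randomly pick 70% of the rows for training (an arbitrary proportion)
train_idx <- sample(n, size = floor(0.7 * n))
train <- Boston[train_idx, ]
test <- Boston[-train_idx, ]

# Fit on the training rows only, then score the held-out rows
fit <- lm(medv ~ ., data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows
rmse_test <- sqrt(mean((pred - test$medv)^2))
print(rmse_test)
```

Error metrics computed on held-out rows are usually a more honest estimate of predictive performance than the same metrics computed on the training data.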
As an example, you can split the original dataset into 2 smaller datasets, using one for training and one for scoring. To do this in the Studio you can use the “Split” module. In this example, we are keeping it simple and evaluating the model’s performance on the training dataset only.

Next add the "Evaluate Model" module, which can be found at Machine Learning -> Evaluate -> Evaluate Model. Drag the module onto the canvas ([Figure 25][pic 25])

[![Figure 25][pic 25]][pic 25] Figure 25

and connect it with the "Score Model" module ([Figure 26][pic 26]).

[![Figure 26][pic 26]][pic 26] Figure 26

The "Evaluate Model" module allows two inputs to evaluate and compare two models, but we'll just use one in this example. Run the EXPERIMENT, wait until it finishes, and then right-click the output node of the "Evaluate Model" module ([Figure 27][pic 27]).

[![Figure 27][pic 27]][pic 27] Figure 27

Now click on "Visualize" to bring up the "Evaluation results" window ([Figure 28][pic 28]). We can see that the model has an R-squared value of 0.7406. The other evaluation metrics calculated for linear regression include mean absolute error, root mean squared error, relative absolute error, and relative squared error.

[![Figure 28][pic 28]][pic 28] Figure 28

## **4 Web Service**

### **4.1 Set up a Web Service**

Now that we've developed the model and evaluated its performance, we can set up a web service so that others can use it to make predictions. Return to the canvas by closing the pop-up window. Click "SET UP WEB SERVICE" on the toolbar and two options are presented: "Predictive Web Service (Recommended)" and "Deploy Web Service" ([Figure 29][pic 29]).

[![Figure 29][pic 29]][pic 29] Figure 29

Click on "Predictive Web Service (Recommended)" and a new EXPERIMENT will be created ([Figure 30][pic 30]).

[![Figure 30][pic 30]][pic 30] Figure 30

You can read the tips that pop up if you want to. After going through all the tips, in the left pane go to Web Service -> Input.
Drag this Input module onto the canvas and connect it to the "Score Model" module ([Figure 31][pic 31]).

[![Figure 31][pic 31]][pic 31] Figure 31

This way we’re telling the web service to use for its input the same column names and data types as those from the “Execute R Script” module’s output. In this Predictive experiment, you can switch between experiment view and web service view by clicking the last button following the zoom in/out buttons. [Figure 31][pic 31] shows the web service view, in which the active links are highlighted. Now click RUN, wait for it to finish, and then click "DEPLOY WEB SERVICE" ([Figure 32][pic 32]).

[![Figure 32][pic 32]][pic 32] Figure 32

A new web service is created and information about it is shown in the window that follows ([Figure 33][pic 33]). One very important piece of information on this page is the API key.

[![Figure 33][pic 33]][pic 33] Figure 33

### **4.2 Consume a Web Service**

There are different ways to consume this web service and we'll talk about three of them.

The first way is to click "Test" under the Test column in [Figure 33][pic 33]. A window similar to [Figure 34][pic 34] shows up. Here you can enter values for the different variables and click the check mark in the lower right corner to get predictions. I entered the values for the first sample in the training dataset.

[![Figure 34][pic 34]][pic 34] Figure 34

For the second way, return to the Web Service information page ([Figure 33][pic 33]) and click on "REQUEST/RESPONSE" under "API HELP PAGE". On the page that follows, scroll down to the Sample Code section ([Figure 35][pic 35]). Here C#, Python, and R code is provided for consuming the web service. Copy the R code to a local R console and make two changes: 1) copy the API Key for your API and replace the default value of "abc123" with it ([Figure 33][pic 33] has the API key for my example and it won’t work for your example); 2) enter some meaningful values for "values" in the "req" list.
I entered the values for the first two samples in the training dataset. Notice that all values should be within quotation marks. Now run the R script and you should get predictions for the inputs.

[![Figure 35][pic 35]][pic 35] Figure 35

The 3rd way of consuming the web service is to use an Excel file. Go back to the Web Service information page ([Figure 33][pic 33]) and click "Download Excel Workbook" under the APPS column to download an Excel file. After opening the Excel file, you'll notice that there is a Web Service in the Azure Machine Learning window ([Figure 36][pic 36]).

[![Figure 36][pic 36]][pic 36] Figure 36

After clicking on the web service you'll see options for the web service ([Figure 37][pic 37]).

[![Figure 37][pic 37]][pic 37] Figure 37

Follow the steps in [Figure 38][pic 38]: 1) make sure the pointer is in the top left cell (Cell A1), 2) click on "Use sample data" to generate some samples, 3) select all the sample data by clicking on the data selection button, 4) enter value "A10" in the "Output: output1" box, 5) click on "Predict."

[![Figure 38][pic 38]][pic 38] Figure 38

Now you should see prediction results starting from cell A10 ([Figure 39][pic 39]). You can change the values for the input predictors and click on "Predict" to update the predictions. When the checkbox for "Auto-predict" is selected, changes in the input section will lead to updated predictions automatically.

[![Figure 39][pic 39]][pic 39] Figure 39

Another way of consuming the web service, which we will not illustrate here, is to use BATCH EXECUTION, which also shows up in [Figure 33][pic 33]. This approach allows you to save your input data and predictions in your Azure storage account, a topic we did not cover in this Guide. Once you set up your storage account, you’ll be able to use this service following steps similar to those in the 2nd approach above.
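
To make the second approach more concrete, here is a hedged sketch of how the `req` list might be assembled in R for the first two rows of the training data. The field names (`Inputs`, `input1`, `ColumnNames`, `Values`, `GlobalParameters`) and the `Bearer` authorization header are assumptions about the shape of the generated sample code; the code shown on your own API help page is authoritative. The sketch only builds the request in memory and sends nothing over the network.

```r
library(MASS)  # provides the Boston data frame

api_key <- "abc123"  # placeholder; replace with the API key of your own web service

# Take the first two samples and convert every value to a character string,
# since all values are passed within quotation marks
rows <- Boston[1:2, ]
values <- lapply(seq_len(nrow(rows)), function(i) {
  as.character(unlist(rows[i, ], use.names = FALSE))
})

# Assumed request layout: column names plus one character vector per sample
req <- list(
  Inputs = list(
    input1 = list(
      ColumnNames = names(Boston),
      Values = values
    )
  ),
  GlobalParameters = list()
)

# Assumed request headers
headers <- c(
  "Content-Type" = "application/json",
  "Authorization" = paste("Bearer", api_key)
)

str(req)
```

Serializing `req` to JSON and posting it to the service URL is left to the generated sample code, which already includes the HTTP client calls for your specific endpoint.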
## **5 How Does Azure Machine Learning Studio's Algorithm Compare with R?**

Now that we've fitted a model in Azure Machine Learning Studio using its own algorithm, some might want to check the results against those from R. For example, are the coefficients the same? What about model performance metrics such as R-squared? To make the comparison, we'll fit the linear model with the same data, this time using R. The good news for R users is that you can do this within Azure Machine Learning Studio. As we did before for importing data, we add a new "Execute R Script" module to the canvas ([Figure 40][pic 40]).

[![Figure 40][pic 40]][pic 40] Figure 40

Next we replace the default R script for the newly added module with the following code and connect the two "Execute R Script" modules ([Figure 41][pic 41]).

[![Figure 41][pic 41]][pic 41] Figure 41

```r
Boston <- maml.mapInputPort(1) # class: data.frame
lm1 <- lm(medv ~ ., data = Boston)
summary(lm1)

# In-sample predictions and the error metrics reported by "Evaluate Model"
pred <- predict(lm1)
mae <- mean(abs(pred - Boston$medv))
rmse <- sqrt(mean((pred - Boston$medv)^2))
rae <- mean(abs(pred - Boston$medv)) / mean(abs(Boston$medv - mean(Boston$medv)))
rse <- mean((pred - Boston$medv)^2) / mean((Boston$medv - mean(Boston$medv))^2)

print(paste("Mean Absolute Error: ", as.character(round(mae, digits = 6)), sep = ""))
print(paste("Root Mean Squared Error: ", as.character(round(rmse, digits = 6)), sep = ""))
print(paste("Relative Absolute Error: ", as.character(round(rae, digits = 6)), sep = ""))
print(paste("Relative Squared Error: ", as.character(round(rse, digits = 6)), sep = ""))
```

After running the experiment, click on the "Execute R Script" module in the canvas again and then the "View output log" link in the middle of the right pane ([Figure 42][pic 42]).

[![Figure 42][pic 42]][pic 42] Figure 42

In the page that pops up, scroll down until you see the results from running the R script ([Figure 43][pic 43]).
You'll notice that the results are exactly the same as those we saw in [Figure 20][pic 20] (for coefficients) and [Figure 28][pic 28] (for performance metrics).

[![Figure 43][pic 43]][pic 43] Figure 43

## **6 Conclusion**

In this tutorial, we've covered some of the most important features of Azure Machine Learning Studio: setting up the workspace, fitting models, setting up a web service, and consuming the web service. We also compared the model fitting results from Azure Machine Learning Studio with those from R. The Azure Machine Learning Studio [documentation website][doc link] provides detailed information about all the topics we've covered as well as many others we did not describe. The [data science process][ds process] helps you quickly identify the documents for specific topics. Reading this tutorial should have allowed you to understand Azure Machine Learning Studio from an overall perspective.

## **7 Additional Good-to-Knows**

For fitting a linear model, Azure Machine Learning Studio allows you to add an L2 regularization weight. In our example, we assigned a value of 0 to this weight. If we used a value higher than zero, we would be fitting a ridge regression.

Azure Machine Learning Studio provides two solution techniques for linear regression: Ordinary Least Squares and Online Gradient Descent. Online Gradient Descent might perform better when the dataset is very large.

In addition to linear regression, Azure Machine Learning Studio provides many other machine learning techniques that are classified into four groups: Anomaly Detection, Classification, Clustering, and Regression. To see the list of these models, from the left pane inside an Experiment, you can click on Machine Learning -> Initialize Model.

For feature selection, you can use the "Filter Based Feature Selection" module to score the relationship between each feature and the response variable.
For some machine learning techniques, certain parameters need to be optimized (e.g., L2 regularization weight in linear regression). One module that does this is the "Sweep Parameters" module.

Azure Machine Learning Studio also provides many modules for managing data. For example, after we specified the response column in the "Train Model" module, all the other columns of the input dataset were used as predictor variables. If we only need a subset of these other columns as predictors, we can add the "Project Columns" module between the "Execute R Script" and "Train Model" modules and specify the columns we need (including the response variable) in the "Project Columns" module.

Other modules that can be useful include the “Split”, "Reader", and "Writer" modules. For instance, the “Reader” module allows you to read data from different data sources (Web URL or HTTP, Azure SQL Database, Azure Table, etc.) and makes it possible to work with large datasets on the cloud.

If you know the name of a module and want to find it quickly, you can search for it using the search tool within an experiment (e.g., the **top left** corner in [Figure 42][pic 42]).

In addition to the online documentation, another good source of information is the [Azure Machine Learning Gallery][AzureGallery Link], which has sample experiments published by Microsoft as well as other users. Those created by Microsoft showed up when we were creating an experiment in [Figure 4][pic 4].
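
The remark in this section about the L2 regularization weight and ridge regression can be illustrated in R. This is a sketch under assumptions rather than a reproduction of the Studio's algorithm: it uses `lm.ridge` from the MASS package, whose `lambda` penalty is not guaranteed to be on the same scale as the Studio's "L2 regularization weight", and the value 10 is an arbitrary choice.

```r
library(MASS)  # provides the Boston data frame and lm.ridge()

# Ordinary least squares fit, as in Section 5
ols <- lm(medv ~ ., data = Boston)

# Ridge fits: lambda = 0 applies no penalty, lambda = 10 is an arbitrary positive penalty
ridge0 <- lm.ridge(medv ~ ., data = Boston, lambda = 0)
ridge10 <- lm.ridge(medv ~ ., data = Boston, lambda = 10)

# Coefficients side by side, on the original scale of the predictors
comparison <- cbind(ols = coef(ols), ridge0 = coef(ridge0), ridge10 = coef(ridge10))
print(round(comparison, 4))
```

With a penalty of zero the ridge coefficients coincide with the least squares ones up to numerical precision, while a positive penalty shrinks the coefficients of the standardized predictors toward zero.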
[doc link]: [tut link]: [AzureML link]: [ms account]: [Azure sub]: [cheatsheet link]: [ds process]: [AzureGAllery Link]: [pic 1]: [pic 2]: [pic 3]: [pic 4]: [pic 5]: [pic 6]: [pic 7]: [pic 8]: [pic 9]: [pic 10]: [pic 11]: [pic 12]: [pic 13]: [pic 14]: [pic 15]: [pic 16]: [pic 17]: [pic 18]: [pic 19]: [pic 20]: [pic 21]: [pic 22]: [pic 23]: [pic 24]: [pic 25]: [pic 26]: [pic 27]: [pic 28]: [pic 29]: [pic 30]: [pic 31]: [pic 32]: [pic 33]: [pic 34]: [pic 35]: [pic 36]: [pic 37]: [pic 38]: [pic 39]: [pic 40]: [pic 41]: [pic 42]: [pic 43]: