Sample 6: Train, Test, Evaluate for Regression: Auto Imports Dataset

November 2, 2014
This experiment demonstrates how to build a regression model to predict the automobile's price. The process includes training, testing and evaluating the model on the Auto Imports dataset.
# Regression: Car price prediction

In this experiment, we demonstrate how to build a regression model, using car price prediction as an example.

## Data ##

In this experiment, we used the `Automobile price data (Raw)` dataset, which is sourced from the UCI Machine Learning Repository. The dataset contains 26 columns, which include information about automobiles by make, model, price, and vehicle features such as the number of cylinders and MPG, as well as an insurance risk score. The goal is to predict the price of the car.

## Creating the Experiment ##

We build the experiment in four steps.

- [Step 1: Get data]
- [Step 2: Data pre-processing]
- [Step 3: Train the model]
- [Step 4: Test, evaluate, and compare the model]

[Step 1: Get data]: #step-1-get-data
[Step 2: Data pre-processing]: #step-2-pre-process-data
[Step 3: Train the model]: #step-3-train-model
[Step 4: Test, Evaluate, and Compare the model]: #step-4-evaluate-model

![steps flowchart](http://az712634.vo.msecnd.net/samplesimg/v1/S6/steps.PNG)

### <a name="step-1-get-data"></a>Step 1: Get data ###

There are different ways of loading data into Azure ML Studio for use in experiments, depending on the location of the input data. You can upload data from the local file system, or use the **Reader** module to retrieve data from cloud storage locations such as Azure SQL databases, Hadoop (via Hive queries), or a web URL.

In this case, the input dataset is already available in Azure ML Studio as a saved dataset. We add the data to our experiment by dragging the dataset from the left panel onto the experiment canvas on the right.

![text](http://az712634.vo.msecnd.net/samplesimg/v1/S6/loadData_option1.PNG)

![text](http://az712634.vo.msecnd.net/samplesimg/v1/S6/loadData_option2.PNG)

### <a name="step-2-pre-process-data"></a>Step 2: Data pre-processing ###

The major data preparation tasks include data cleaning, integration, transformation, reduction, and discretization or quantization.
In Azure ML Studio, you can find modules to perform these operations and other data pre-processing tasks in the **Data Transformation** group in the left panel.

- Data cleaning tasks are handled by modules such as **Clean Missing Data** and **Remove Duplicate Rows**.
- Multiple datasets can be combined using **Join** if the datasets share a common key.
- Other modules such as **Normalize Data**, **Partition and Sample**, and **Quantize Data** prepare data for machine learning by transforming, reducing, and binning data.

In Step 2 of this experiment, we use **Clean Missing Data** to fill in empty cells with zero (0) values. Then we use **Project Columns** to exclude four columns (such as `num-of-doors`) that are less relevant for the analysis. This helps create a clean set of training data.

![text](http://az712634.vo.msecnd.net/samplesimg/v1/S6/missingData.PNG)

![Project Columns Module](http://az712634.vo.msecnd.net/samplesimg/v1/S6/projectColumn.PNG)

### <a name="step-3-train-model"></a>Step 3: Train the model ###

Machine learning problems vary in nature. Common machine learning tasks include classification, clustering, regression, and recommendation, each of which might require a different algorithm. Choosing an algorithm often depends on the requirements of the actual use case. After picking an algorithm, you must tune its parameters in order to train a more accurate model. All models must then be evaluated based on metrics such as accuracy, intelligibility, and efficiency.

In this experiment, the goal is to predict automobile prices. Since the label column (`price`) contains real numbers, a regression model is a good choice. Considering that the number of features is relatively small (fewer than 100) and these features are not sparse, the decision boundary is likely to be nonlinear. To compare the performance of different algorithms, we chose two nonlinear algorithms, **Poisson Regression** and **Decision Forest Regression**, to build models.
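
Before training, it may help to see the Step 2 pre-processing spelled out in code. The following is a minimal sketch in plain Python, not the Studio modules themselves. It assumes each row is a dict keyed by column name and that `None`, an empty string, or `?` marks an empty cell. The helper names `clean_missing` and `exclude_columns` are hypothetical, and the two sample rows are made-up values for illustration only. Only `num-of-doors` is excluded here because the text names only that one of the four excluded columns.

```python
# Minimal sketch of the Step 2 pre-processing in plain Python.
# Assumption: these markers count as "empty cells"; adjust for your data.
MISSING = {None, "", "?"}


def clean_missing(rows, fill_value=0):
    """Return copies of the rows with missing cells replaced by fill_value."""
    return [
        {col: (fill_value if val in MISSING else val) for col, val in row.items()}
        for row in rows
    ]


def exclude_columns(rows, excluded):
    """Return copies of the rows without the excluded columns."""
    excluded = set(excluded)
    return [
        {col: val for col, val in row.items() if col not in excluded}
        for row in rows
    ]


# Made-up sample rows (illustrative values, not taken from the dataset).
rows = [
    {"make": "audi", "num-of-doors": "four", "horsepower": 102, "price": 13950},
    {"make": "mazda", "num-of-doors": None, "horsepower": None, "price": 6795},
]

# Fill empty cells with zero, then drop a less relevant column.
cleaned = exclude_columns(clean_missing(rows), ["num-of-doors"])
```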

Both algorithms have parameters that you can modify, but we used the default values for this experiment. To determine the optimum parameters for your model, the **Sweep Parameters** module is recommended.

Using the **Split** module, we randomly divide the input data so that the training and testing datasets contain 60% and 40% of the original data, respectively.

![split](http://az712634.vo.msecnd.net/samplesimg/v1/S6/split.PNG)

### <a name="step-4-evaluate-model"></a>Step 4: Test, evaluate, and compare the model ###

We used two different sets of randomly chosen data to train and then test the model, as described in Step 3. Because the model is trained and tested on different datasets, the evaluation results are more objective.

After the model is trained, use the **Score Model** and **Evaluate Model** modules to generate predicted results and to evaluate the models. **Score Model** generates predictions for the test dataset using the trained model. The scores are then passed to **Evaluate Model** to generate evaluation metrics.

In this experiment, we used two instances of **Evaluate Model** to compare two pairs of models.

- First, the two algorithms are compared on the training dataset.
- Second, the two algorithms are compared on the testing dataset.

From these results, we observe that:

- The model built using **Poisson Regression** has lower accuracy than the model built using **Decision Forest Regression**.
- Both algorithms have higher accuracy on the training dataset than on the unseen testing dataset.

![result comparison](http://az712634.vo.msecnd.net/samplesimg/v1/S6/modelComparison.PNG)
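
The split-and-evaluate workflow described above can be sketched outside Studio in plain Python. This is an illustration under stated assumptions, not the implementation of the **Split**, **Score Model**, or **Evaluate Model** modules. The function names are hypothetical, the price list is synthetic, and the "model" is a placeholder baseline that always predicts the mean training price (it is neither Poisson Regression nor Decision Forest Regression). The metric choice (mean absolute error, root mean squared error, coefficient of determination) is also an assumption, since the text refers only to "evaluation metrics".

```python
import math
import random


def split_rows(rows, train_fraction=0.6, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def coefficient_of_determination(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def evaluate(actual, predicted):
    """Collect the three metrics into one dict, one call per dataset."""
    return {
        "mae": mean_absolute_error(actual, predicted),
        "rmse": root_mean_squared_error(actual, predicted),
        "r2": coefficient_of_determination(actual, predicted),
    }


# Synthetic prices for illustration only (not values from the dataset).
prices = [5000 + 1000 * i for i in range(10)]

# 60% of the rows for training, 40% for testing.
train, test = split_rows(prices, train_fraction=0.6, seed=0)

# Placeholder model: always predict the mean training price.
baseline = sum(train) / len(train)

# Evaluate once on the training data and once on the unseen test data,
# mirroring the two comparisons made in the experiment.
train_metrics = evaluate(train, [baseline] * len(train))
test_metrics = evaluate(test, [baseline] * len(test))
```

Fixing `seed` makes the split repeatable, so training and testing metrics can be compared across runs. To compare real models, replace the baseline with their predictions.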