Cortana Conf CA Milk R

September 2, 2015
This experiment contains a forecasting regression model using Azure ML and R for the 2015 Cortana Analytics Conference Data Science Tutorial.
# Cortana Analytics Conference Data Science Tutorial Example:
## Forecasting with Azure ML and R
Dr. Steve Elston, Quantia Analytics, LLC

## Overview
This experiment illustrates the basics of building and evaluating a regression forecasting machine learning model using Azure Machine Learning and R. It is part of a tutorial presented at the 2015 Cortana Analytics Conference.

You can download the code for this experiment from a GitHub repository: https://github.com/Quantia-Analytics/Contana-Data-Science-Example-R. Download the code as a .zip file by clicking Download Zip in the lower right of the repository page. If you have Git installed, you can clone the repository instead.

## Data
The goal of this experiment is to forecast monthly milk production for the State of California. The data set contains a time series of dairy production data for several products, along with milk fat pricing, for 128 months.

To account for the nonlinear trend, an Execute R Script module computes two columns containing the square and the cube of the month count. These new features are used in a polynomial regression of the time series trend.

Code in another Execute R Script module generates graphics for visualizing and exploring the data set. These plots show that the time series has a strong trend and that the data exhibit significant seasonal (monthly) variation.

A Metadata Editor module converts the string column containing the names of the months to a categorical feature. A Project Columns module selects only the columns used in the model.

The third Execute R Script module contains the final bit of data munging code. This module divides the data into training and test sets. The last 12 months of milk production data are held back to test the forecasting power of the model. Note that the Split module would not work in this case, since it randomly samples the data, and a forecasting test requires that the held-out months follow the training months in time.
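The data preparation steps described above can be sketched in plain R. This is not the code from the tutorial's Execute R Script modules; it is a minimal illustration that runs on synthetic data, and the column names (`Month.Count`, `Month`, `Milk.Prod`) and the shape of the synthetic series are assumptions made for the example.

```r
# Synthetic stand-in for the dairy data set (assumption: the real data
# and its column names may differ).
set.seed(42)
n_months <- 128
dairy <- data.frame(
  Month.Count = 1:n_months,
  Month = month.abb[(0:(n_months - 1)) %% 12 + 1],
  stringsAsFactors = FALSE
)
dairy$Milk.Prod <- 3 + 0.01 * dairy$Month.Count +
  0.2 * sin(2 * pi * dairy$Month.Count / 12) +
  rnorm(n_months, sd = 0.05)

# Polynomial trend features: square and cube of the month count.
dairy$Month.Count2 <- dairy$Month.Count^2
dairy$Month.Count3 <- dairy$Month.Count^3

# Treat the month name as a categorical feature rather than a string.
dairy$Month <- factor(dairy$Month, levels = month.abb)

# Hold back the last 12 months for testing, preserving time order
# instead of sampling rows at random.
n_test <- 12
train <- dairy[seq_len(nrow(dairy) - n_test), ]
test  <- dairy[(nrow(dairy) - n_test + 1):nrow(dairy), ]
```

Splitting by position rather than by random sampling keeps every test month later than every training month, which is what makes the held-out error a measure of forecasting accuracy.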
## Model and results
A linear regression model is used for both the trend and the seasonal variation. Note that an intercept is not computed. Seasonal variation is modeled as a monthly category (factor), with one level per month. Including an intercept term alongside these monthly levels would over-parameterize the problem, because the intercept would be collinear with the month indicators and the coefficients would not be uniquely determined.

Both the Evaluate Model module and custom code in an Execute R Script module measure model performance. The evaluation Execute R Script module outputs RMS error figures for the entire time series and for the 12 months used to test forecasting accuracy. As one would expect, the RMS error for the forecasted months is slightly greater than for the overall series.

The evaluation Execute R Script module also produces several useful charts for evaluating the model. Overall, the fit to both the trend and the seasonal change is good. The model residuals (errors) are close to normally distributed, with only a few outliers. Further, there is no particular trend or structure in these residuals over time.
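As a rough illustration of the modeling and evaluation steps, the sketch below fits a no-intercept linear model with polynomial trend terms and a monthly factor using base R's `lm`, then computes RMS error for the whole series and for the last 12 months. It is not the experiment's own code: the experiment uses Azure ML modules for training, and the synthetic data, column names, and the `rmse` helper here are assumptions made for the example.

```r
# Synthetic stand-in for the dairy series (assumption: illustrative only).
set.seed(42)
n_months <- 128
dairy <- data.frame(Month.Count = 1:n_months)
dairy$Month <- factor(month.abb[(dairy$Month.Count - 1) %% 12 + 1],
                      levels = month.abb)
dairy$Milk.Prod <- 3 + 0.01 * dairy$Month.Count +
  0.2 * sin(2 * pi * dairy$Month.Count / 12) +
  rnorm(n_months, sd = 0.05)

# The last 12 months are held back for testing.
is_test <- dairy$Month.Count > n_months - 12
train <- dairy[!is_test, ]

# "0 +" removes the intercept, so each of the 12 months gets its own
# level coefficient next to the cubic trend terms.
fit <- lm(Milk.Prod ~ 0 + Month.Count + I(Month.Count^2) +
            I(Month.Count^3) + Month, data = train)

# Hypothetical helper: root mean square error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Predict over the full series, then score all months and the held-out months.
pred <- predict(fit, newdata = dairy)
rmse_all  <- rmse(dairy$Milk.Prod, pred)
rmse_test <- rmse(dairy$Milk.Prod[is_test], pred[is_test])

# Residuals for checking normality and structure over time,
# for example with qqnorm(resid_all) or plot(dairy$Month.Count, resid_all).
resid_all <- dairy$Milk.Prod - pred
```

Comparing `rmse_test` with `rmse_all` mirrors the comparison described above between forecast error and overall error; the values obtained from this synthetic series say nothing about the real milk production data.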