Bike Rental Monte Carlo

March 11, 2015

Run a simple Monte Carlo simulation on the Bike Rental model to get a distribution of the expected number of rentals.
This experiment illustrates how to use the operationalization capabilities of AzureML to do more than execute a model on a single observation. Here we use a Boosted Decision Tree Regression model trained to predict bike rentals, using the sample dataset provided with AzureML. The basic task is to predict the number of bike rentals in an hour, based on the following input variables:

1. Month of the year (mnth: 1-12)
2. Hour of the day (hr: 1-24)
3. Holiday indicator (holiday: 1/0)
4. Weekday indicator (weekday: 1/0)
5. Working Day indicator (workingday: 1/0)
6. Weather (weathersit: 1 = Sunny, 2 = Cloudy, 3 = Light Rain, 4 = Heavy Rain)
7. Temperature (temp: 0-1)
8. Humidity (hum: 0-1)
9. Windspeed (windspeed: 0-1)

(Note that the weather data has been normalized.)

Once the model has been trained, it can be used to predict the number of bikes rented, given a specific set of conditions. However, in order to generate a prediction, the user would have to provide the forecasted weather. Since weather forecasts usually come with some uncertainty, it would be useful not to do a single evaluation of the model, but to run a Monte Carlo simulation that samples possible weather conditions given the uncertainty in the forecast. We illustrate a very simple version of this here, making the simplifying assumption that all the inputs are independent.

We assume that this model is going to be used behind some user interface. To make it more comprehensible, we want to enter the weather parameters in natural units (Celsius, percentage and m/s for temperature, relative humidity and windspeed respectively). To quantify the uncertainty, we let the user specify a standard deviation for each of these parameters. For the weather condition, we let the user specify the likelihood of each category.

A single observation of this input dataset was uploaded and is used as the input for the experiment. It comprises the 9 inputs listed above, with 8 additional columns:

10. Temperature standard deviation (TempSTD)
11. Humidity standard deviation (HumSTD)
12. Windspeed standard deviation (WindSTD)
13. Probability of Sunny (ProbSunny)
14. Probability of Cloudy (ProbCloudy)
15. Probability of light rain or snow (ProbLightRainSnow)
16. Probability of heavy rain or snow (ProbHeavyRain)
17. Number of samples to generate for the simulation (n)

This single-observation dataset represents a typical call to what will become the published API.

The first step is an R script that normalizes the weather inputs and generates the random samples. We use a simple Gaussian for temperature, humidity and windspeed, and sample the variable "weathersit" according to the probabilities provided by the user. We then score the generated dataset and strip off all the columns we don't need using a Project Columns module. A second Execute R Script module calculates the distributions of the predicted number of rentals, as well as of the input variables (as a check).

For a simple way to see the distribution of the predictions, visualize the output of the last R module. Select the rentals.x column, then, using the "compare to" control, select rentals.y. It is interesting to see that for this particular scenario the distribution of the number of rentals is roughly bimodal, with the most probable outcome around 325 rentals, but with significant likelihood of between 150 and 200. Depending on staffing costs and other profit considerations, an operator might choose to prepare for either of the two possibilities.

In order to use the model, the first step is to publish it as a service. After a successful execution of the experiment, click the "Publish Web Service" button at the bottom of the canvas. The default location for the Web Service Output module is incorrect - move it to the left-hand (dataset) output port of the Execute R Script module. Once published, one would call the service with .../scoremultirow instead of the usual .../score for the REST endpoint.
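The sampling step performed by the first Execute R Script module can be sketched as follows. This is written in Python for brevity (the experiment itself uses R), and the normalization constants (dividing by assumed dataset maxima of 41 degrees C and 67 m/s) are illustrative assumptions, not the exact values used in the experiment:

```python
import numpy as np

def sample_weather(n, temp_c, temp_sd, hum_pct, hum_sd, wind_ms, wind_sd,
                   weather_probs, seed=None):
    """Draw n candidate weather scenarios for the Monte Carlo run.

    Temperature, humidity and windspeed are sampled as Gaussians in
    natural units (Celsius, %, m/s), then normalized to the 0-1 scale
    the model expects; weathersit is drawn from the user-supplied
    category probabilities.
    """
    rng = np.random.default_rng(seed)
    # Normalization constants below are assumptions for illustration.
    temp = np.clip(rng.normal(temp_c, temp_sd, n) / 41.0, 0.0, 1.0)
    hum = np.clip(rng.normal(hum_pct, hum_sd, n) / 100.0, 0.0, 1.0)
    wind = np.clip(rng.normal(wind_ms, wind_sd, n) / 67.0, 0.0, 1.0)
    # Categories 1=Sunny, 2=Cloudy, 3=Light Rain, 4=Heavy Rain
    weathersit = rng.choice([1, 2, 3, 4], size=n, p=weather_probs)
    return {"temp": temp, "hum": hum, "windspeed": wind,
            "weathersit": weathersit}
```

Each of the n sampled rows is then combined with the fixed calendar inputs (mnth, hr, holiday, weekday, workingday) before being sent to the scoring module.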
This will return an array containing 100 rows, as in the output of the experiment. [I am a Microsoft employee]
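Once the returned rows are in hand, summarizing the predicted-rental distribution (the job of the second Execute R Script module inside the experiment) amounts to a histogram and quantile computation. A minimal sketch, again in Python rather than the R used in the experiment:

```python
import numpy as np

def summarize_rentals(predictions, bins=10):
    """Empirical distribution of predicted rentals over the Monte Carlo samples."""
    predictions = np.asarray(predictions, dtype=float)
    counts, edges = np.histogram(predictions, bins=bins)
    freq = counts / counts.sum()          # relative frequency per bin
    p5, median, p95 = np.percentile(predictions, [5, 50, 95])
    return {"bin_edges": edges, "freq": freq,
            "p5": p5, "median": median, "p95": p95}
```

For a bimodal result like the one described above, the histogram would show two peaks (one near 325 rentals, one in the 150-200 range), which a single point prediction would hide.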