Regression: Demand estimation

September 2, 2014
This experiment demonstrates demand estimation using regression with UCI bike rental data.
# Regression: Demand Estimation #

This experiment demonstrates the **feature engineering** process for building a **regression** model, using bike rental demand prediction as an example. We demonstrate that effective feature engineering leads to a more accurate model.

## Data ##

The Bike Rental UCI dataset is used as the raw input data for this experiment. This dataset is based on real data from the Capital Bikeshare company, which operates a bike rental network in Washington, DC, in the United States.

The dataset contains 17,379 rows and 17 columns, each row representing the number of bike rentals within a specific hour of a day in the years 2011 or 2012. Weather conditions (such as temperature, humidity, and wind speed) were included in this raw feature set, and the dates were categorized as holiday vs. weekday, etc.

The field to predict is "cnt", which contains a count value ranging from 1 to 977, representing the number of bike rentals within a specific hour.

![complete experiment image](

## Feature Engineering ##

Because our goal was to construct effective features in the training data, we built four models using the same algorithm, but with four different training datasets. The input data was split in such a way that the training data contained records for the year 2011 and the testing data contained records for 2012.

The four training datasets that we constructed were all based on the same raw input data, but we added different additional features to each training set.

* Set A = weather + holiday + weekday + weekend features for the predicted day
* Set B = number of bikes that were rented in each of the previous 12 hours
* Set C = number of bikes that were rented in each of the previous 12 days at the same hour
* Set D = number of bikes that were rented in each of the previous 12 weeks at the same hour and the same day

Each of these feature sets captures different aspects of the problem:

- Feature set B captures very recent demand for the bikes.
- Feature set C captures the demand for bikes at a particular hour.
- Feature set D captures demand for bikes at a particular hour and particular day of the week.

The four training datasets were built by combining the feature sets as follows:

- Training set 1: feature set A only
- Training set 2: feature sets A+B
- Training set 3: feature sets A+B+C
- Training set 4: feature sets A+B+C+D

## Model ##

We used a regression model because the label column (number of rentals) contains numeric values (rental counts) rather than class labels. Given that the number of features is relatively small (fewer than 100) and these features are not sparse, the decision boundary is very likely to be nonlinear. Based on these observations, we decided to use the **Boosted Decision Tree Regression** algorithm for the experiment.

## Running the Experiment ##

Overall, the experiment had five major steps:

- [Step 1: Get data]
- [Step 2: Data pre-processing]
- [Step 3: Feature engineering]
- [Step 4: Train the model]
- [Step 5: Test, evaluate, and compare the model]

[Step 1: Get data]: #step-1-get-data
[Step 2: Data pre-processing]: #step-2-pre-process-data
[Step 3: Feature engineering]: #step-3-define-features
[Step 4: Train the model]: #step-4-train-model
[Step 5: Test, evaluate, and compare the model]: #step-5-evaluate-model

### <a name="step-1-get-data"></a>Step 1: Get data ###

There are different methods to load data into the Azure ML experiment, depending on where the input data is stored. The data can be loaded from the local file system, or through the **Reader** module, which can access data from many other persistent storage locations such as Azure SQL Database, Hive, or a web URL.

However, the UCI Bike Rental dataset is already available in Azure ML Studio as a saved dataset, so you can simply drag it from the list of datasets onto the experiment canvas.
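For readers working outside Azure ML Studio, the following is a minimal Python sketch of reading hourly rental records from a CSV source. It is not part of the experiment. The column list is an assumption based on the public UCI hourly file (this text names only some of the 17 columns), the function name `load_rentals` is hypothetical, and the two sample rows are made up for illustration.

```python
import csv
import io

# Assumed column names of the hourly UCI file (17 columns). Only some of
# these names are mentioned in this document; the rest are an assumption.
COLUMNS = ["instant", "dteday", "season", "yr", "mnth", "hr", "holiday",
           "weekday", "workingday", "weathersit", "temp", "atemp", "hum",
           "windspeed", "casual", "registered", "cnt"]

# Two made-up rows for illustration only; these are not real records.
SAMPLE_CSV = "\n".join([
    ",".join(COLUMNS),
    "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.29,0.81,0.0,3,13,16",
    "2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.27,0.80,0.0,8,32,40",
]) + "\n"


def load_rentals(handle):
    """Read rental rows from an open CSV handle into a list of dicts.

    The year flag, hour, and label are converted to int; every other
    field is left as the string read from the file.
    """
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        for name in ("yr", "hr", "cnt"):
            row[name] = int(row[name])
        rows.append(row)
    return rows


rows = load_rentals(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows;", len(rows[0]), "columns")
```

To read a real file instead of the in-memory sample, pass a handle opened with `open(path, newline="")`, where `path` points to your local copy of the data.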
![loading data: from local file system](

![loading data: from reader module](

### <a name="step-2-pre-process-data"></a>Step 2: Data pre-processing ###

Data pre-processing is an important step in most real-world analytical applications. The major tasks include data cleaning, data integration, data transformation, data reduction, and data discretization and quantization.

In Azure ML Studio, you can find tools to help with many of these data processing tasks in the *Data Transformation* group. For example:

- If you need to combine multiple datasets, you can use the **Join**, **Add Rows**, or **Add Columns** modules.
- To clean and transform data, you can use these modules: **Clean Missing Data**, **Normalize Data**, **Partition and Sample**, or **Quantize Data**.

In this experiment, we used **Metadata Editor** and **Project Columns** to convert the two numeric columns "weathersit" and "season" into categorical variables and to remove four less relevant columns ("instant", "dteday", "casual", "registered").

![Metadata Editor Module](

![Project Columns Module](

### <a name="step-3-define-features"></a>Step 3: Feature engineering ###

Normally, when preparing training data you pay attention to two requirements:

- First, find the right data, integrate all relevant features, and reduce the data size if necessary.
- Second, identify the features that characterize the patterns in the data and, if they don't exist, construct them.

It can be tempting to include many raw data fields in the feature set, but more often you need to construct additional features from the raw data to provide better predictive power. This is called **feature engineering**.

In this experiment, we created four copies of the dataset resulting from **Project Columns** and used the **Execute R Script** module to construct different sets of derived features and to append the new features to each dataset. The following figure shows the R script included in one of these branches.
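Independently of the R script used in the experiment, the lagged demand features in sets B, C, and D can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the experiment's code: the helper names `lag_features` and `build_lag_sets` are hypothetical, and the sketch assumes the hourly counts form a gap-free series so that a positional offset of 24 means "same hour, previous day". The real dataset has 17,379 rows, fewer than the 17,544 hours in 2011 and 2012, so a faithful implementation would look up lags by timestamp rather than by position.

```python
HOURS_PER_DAY = 24
HOURS_PER_WEEK = 24 * 7


def lag_features(counts, index, step, n_lags=12):
    """Return the counts at index - step, index - 2*step, ..., n_lags values.

    Positions that fall before the start of the series yield None.
    Assumes `counts` is a gap-free hourly series (see the note above).
    """
    features = []
    for k in range(1, n_lags + 1):
        j = index - k * step
        features.append(counts[j] if j >= 0 else None)
    return features


def build_lag_sets(counts, index):
    """Build feature sets B, C, and D for the hour at position `index`."""
    return {
        "B": lag_features(counts, index, 1),               # previous 12 hours
        "C": lag_features(counts, index, HOURS_PER_DAY),   # same hour, previous 12 days
        "D": lag_features(counts, index, HOURS_PER_WEEK),  # same hour and weekday, previous 12 weeks
    }


# Synthetic series in which the count equals its position, so each lag is easy to read.
counts = list(range(12 * HOURS_PER_WEEK + 5))
latest = build_lag_sets(counts, len(counts) - 1)
early = build_lag_sets(counts, 5)
print(latest["B"][:3], latest["C"][:3], latest["D"][:3])
```

Rows near the start of the series have incomplete history (the `None` entries), so they need to be dropped or filled before training. Training sets 2 through 4 then correspond to appending B, then C, then D to the set A columns.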
![Feature Engineering - example of using "Execute R Script" module](

### <a name="step-4-train-model"></a>Step 4: Train the model ###

Next, we needed to choose an algorithm to use in analyzing the data. There are many kinds of machine learning problems (classification, clustering, regression, recommendation, etc.) with different algorithms suited to each task, depending on their accuracy, intelligibility, and efficiency.

For this experiment, because the goal was to predict a number (the demand for the bikes, represented as the number of bike rentals), we chose a regression model. Moreover, because the number of features is relatively small (fewer than 100) and these features are not sparse, the decision boundary is very likely to be nonlinear.

Based on these factors, we used the **Boosted Decision Tree Regression** module, a commonly used nonlinear algorithm, to build the models. We were not interested in comparing different algorithms, so we used the same module to train all four models.

You can change many parameters in the **Boosted Decision Tree Regression** module, but for this experiment we kept the default values. If you want to find the best parameters for a model, we recommend that you use the **Sweep Parameters** module.

![The example of the parameter settings of an algorithm](

We used the **Split** module to divide the input data in such a way that the training data was based on data from the year 2011 and the testing data was based on data from the year 2012. (In the dataset, see the "yr" column, in which 0 means 2011 and 1 means 2012.)

![module "Split"](

### <a name="step-5-evaluate-model"></a>Step 5: Test, evaluate, and compare the model ###

After the model was trained, we used the **Score Model** and **Evaluate Model** modules.

- **Score Model** scores a trained classification or regression model against a test dataset. That is, the module generates predictions using the trained model.
- **Evaluate Model** takes the scored dataset and uses it to generate evaluation metrics. You can then visualize the evaluation metrics.

![result comparison](

To understand the performance of the four models, see the comparison results in the following table.

- The best results were from the combination of features A+B+C and A+B+C+D.
- Feature set D does not provide additional improvement over A+B+C.

![result comparison](
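To make the split-then-evaluate flow concrete, here is a minimal Python sketch of dividing rows on the "yr" column and computing two common regression error metrics, mean absolute error and root mean squared error. This text does not state which metrics **Evaluate Model** reports, so the choice of metrics is an assumption. The function names are hypothetical, and the rows and predictions are made up for illustration; they are not results from the experiment.

```python
import math


def split_by_year(rows):
    """Split rows into (train, test) using yr == 0 for 2011 and yr == 1 for 2012."""
    train = [r for r in rows if r["yr"] == 0]
    test = [r for r in rows if r["yr"] == 1]
    return train, test


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Made-up rows for illustration only.
rows = [
    {"yr": 0, "cnt": 16},
    {"yr": 0, "cnt": 40},
    {"yr": 1, "cnt": 10},
    {"yr": 1, "cnt": 20},
    {"yr": 1, "cnt": 30},
]
train, test = split_by_year(rows)

actual = [r["cnt"] for r in test]
predicted = [12, 18, 33]  # hypothetical model outputs for the three test rows

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```

Running the same two functions on the scored output of each of the four models gives one row of metrics per training set, which is the kind of side-by-side comparison described above.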