Generate Lag Features

October 21, 2016

This module generates features based on the data in previous time periods. For example, it can add features for the number of bike rentals in each of the last 12 hours.
This experiment uses the Bike Rental dataset to demonstrate how creating additional lag features can improve model performance. In this experiment, the number of bike rentals is the output being predicted. Adding new features that give the number of bike rentals an hour ago, 2 hours ago, and so on, followed by additional features such as the number of bike rentals a day ago, two days ago, etc., significantly improves the accuracy of the model.

As part of this experiment, we are creating two new custom modules:

- **Generate Lag Features** takes a column and creates additional features for prior periods. For example, in this experiment it is used to generate features for the number of bike rentals in previous hours.
- **Generate Nested Lag Features** takes a column and a nested level to create additional features for prior periods. For example, in this experiment it is used to generate features for the number of bike rentals in previous hours, previous days and previous weeks.

### Generate Lag Features ###

This module generates additional features by taking the given column, finding its values in previous periods, and adding new columns that hold those values. It names the additional features "<Column Name> lag-<period>".

This module expects as inputs a dataset, a set of columns within the dataset, and the number of lag features to be generated. It also supports generating lag features for multiple columns in a single module: just select multiple columns in the column selector.

![](http://neerajkh.blob.core.windows.net/images/LagConfigCapture.PNG)

The output of this module for this specific experiment is shown below. As you can see, the module has added new columns such as "cnt lag-1", "cnt lag-2", etc. The "cnt lag-1" column starts at the second hour and holds the number of bike rentals in the first hour, "cnt lag-2" starts at the third hour and also holds the value from the first hour (two hours earlier), and so on.
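The lag logic described above can be sketched in a few lines of plain Python. This is only an illustration, not the module's actual implementation (see the gists linked under Source code): the function name `generate_lag_features`, the list-of-dicts row format, and the use of `None` for periods that have no earlier row are all assumptions made for this example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def generate_lag_features(rows: List[Row], columns: List[str], num_lags: int) -> List[Row]:
    """Return copies of `rows` with '<column> lag-<k>' keys added for k = 1..num_lags.

    Hypothetical helper: rows are assumed to be ordered by time, one row per period.
    """
    result: List[Row] = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows are left untouched
        for col in columns:
            for k in range(1, num_lags + 1):
                # Value from k rows earlier; None (an assumption) when no such row exists.
                new_row[f"{col} lag-{k}"] = rows[i - k][col] if i - k >= 0 else None
        result.append(new_row)
    return result


# Toy hourly rental counts (made-up numbers, not taken from the dataset).
hourly = [{"cnt": 16}, {"cnt": 40}, {"cnt": 32}, {"cnt": 13}]
lagged = generate_lag_features(hourly, ["cnt"], num_lags=2)
for r in lagged:
    print(r)
```

In this sketch the third row carries `cnt lag-1` from the second hour and `cnt lag-2` from the first hour, which mirrors the way the lag columns fill in one period later for each additional lag.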
![](http://neerajkh.blob.core.windows.net/images/LagOutputCapture.PNG)

### Generate Nested Lag Features ###

This module generates additional features by taking the given set of columns, finding their values in previous periods, and adding new columns that hold those values. It names the additional features "<Column Name><level> lag -<period>".

This module expects as inputs a dataset, a set of columns within the dataset, the number of lag features to be generated, and the space-separated periods within each level. In this experiment, we are trying to generate lag features for every hour, every day and every week. Hence, the rows per level are 1, 24, and 7 respectively: one row per hour, 24 hours per day, and 7 days per week.

![](http://neerajkh.blob.core.windows.net/images/NestedLagCapture.PNG)

The output of this module for this specific experiment is shown below. As you can see, this module adds new lag features for bike rentals at the hourly level and at the daily level. The hourly lag features are represented by cntlag*1*-1, cntlag*1*-2, etc., while the daily lag features are represented by cntlag*2*-1, cntlag*2*-2, etc.

![](http://neerajkh.blob.core.windows.net/images/Lag1Capture.PNG)

![](http://neerajkh.blob.core.windows.net/images/Lag2Capture.PNG)

### Experiment Graph ###

![](http://neerajkh.blob.core.windows.net/images/GenerateLagCapture.PNG)

### Source code ###

The source code for these modules is available on GitHub:

- **Generate Lag Features** - [code](https://gist.github.com/nk773/a2ed7cd0ce8020647f5e7711f749b3b5)
- **Generate Nested Lag Features** - [code](https://gist.github.com/nk773/4d70452c698b4e113c84235dbf78af4a)
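Separately from the linked gists, the nested lag idea from the Generate Nested Lag Features section can be illustrated with a small plain-Python sketch. It rests on several assumptions that the description does not confirm: that the rows-per-level values (1, 24, 7) multiply up into row offsets of 1, 24 and 168 hourly rows, that each nested lag is the single value found that many rows earlier rather than an aggregate, and that missing periods are filled with `None`. The function name and the `cntlag<level>-<period>` key format are chosen here only to resemble the column names quoted above.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def generate_nested_lag_features(
    rows: List[Row], column: str, num_lags: int, rows_per_level: List[int]
) -> List[Row]:
    """Add '<column>lag<level>-<k>' keys for each level and k = 1..num_lags.

    Hypothetical helper: rows are assumed to be ordered by time at the finest level.
    """
    # Cumulative row offset per level (assumption), e.g. [1, 24, 7] -> [1, 24, 168].
    steps: List[int] = []
    step = 1
    for n in rows_per_level:
        step *= n
        steps.append(step)

    result: List[Row] = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows are left untouched
        for level, level_step in enumerate(steps, start=1):
            for k in range(1, num_lags + 1):
                j = i - k * level_step  # row that lies k periods back at this level
                new_row[f"{column}lag{level}-{k}"] = rows[j][column] if j >= 0 else None
        result.append(new_row)
    return result


# Toy data: the count equals the row index, so each lag value shows which hour it came from.
hourly = [{"cnt": h} for h in range(400)]
nested = generate_nested_lag_features(hourly, "cnt", num_lags=2, rows_per_level=[1, 24, 7])
last = nested[-1]
print(last["cntlag1-1"], last["cntlag2-1"], last["cntlag3-1"])
```

With this toy data, the last row (hour 399) looks back 1 hour, 24 hours and 168 hours for the first lag at each level, which makes it easy to check that hourly, daily and weekly lags point at the intended rows.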