Create useful features for trains

October 13, 2015
Sometimes feature engineering is the most important part of the problem. Here's an example using trains.

Feature engineering is the practice of rearranging your data to make it more useful. Careful feature engineering usually does more to improve your results than using a fancy machine learning algorithm. Most data needs at least a little engineering. Some of it needs a lot. There are lots of ways to do feature engineering. Some are sophisticated, like hierarchical unsupervised neural networks (deep learning), but some are simple, like combining two variables to make a third.

To show how important this can be, I created a data set about trains. Walking through the Azure ML modules one at a time tells the story of why feature engineering matters.

![trains experiment][1]

1. I created a year's worth of fake data for the MBTA Red Line subway train in Boston running between Central Square and Kendall Square. It consists of three columns: arrival time at Kendall (in hours since midnight), departure time from Central (in hours since midnight), and maximum speed between stations (in kilometers per hour). [Here is a copy of the Python script.][2] And here is the head of the file:

   ![trains data head][3]

2. A boosted decision tree regression was initialized to learn the top speed based on the arrival and departure times. Regression, rather than classification, is called for here because we are predicting a number rather than a category. I left its hyperparameters at their defaults.

3. The model trained on the data.

4. The model tried to predict top speed based on the original arrival and departure data. Typically in machine learning we would split our data into a training set and a testing set and keep them apart. This lets us see when we are [overfitting][4] our model. Since this illustration is focused on feature engineering rather than model performance, we can safely ignore this convention and keep our Azure ML graph simpler.

5. The model's fit was measured.
   One very useful measure of how well a model fits is the coefficient of determination, often shown as R-squared. It measures how much of the variation in the data is described by the model. If the model perfectly describes all the variation in the data, R-squared is 1. If it can't describe any of it, then R-squared is 0. For this model on this data, R-squared is **0.016**. It's nearly worthless.

6. I manually saved the data set as one of "My Datasets." Then these modules read it back in and save it out to a comma-separated value file. I used [another Python script][5] to make scatterplots of the arrival and departure times against the maximum speed. These show that our regression algorithm is working just fine. There is just no trend or slope to the data.

   ![scatterplot of original data][6]

7. A logical next approach is to consider the interaction of departure and arrival times. It makes sense, after all, that speed is related not just to one or the other, but to an interaction of the two. A very common way to model interaction is by multiplying two features together. This module adds a new feature to the data set, the product of our original two features. This is a classic example of feature engineering.

   ![head of data file with multiplied feature][7]

8. Just like last time.

9. Just like last time.

10. Just like last time.

11. This time the R-squared value came in at **0.015**. Even worse than before. What happened?

12. The scatterplot of the multiplied feature against the top speed (right panel, below) shows again that there is no trend or slope to model. It's not our model's fault, it's our features'.

    ![scatter plot of multiplied feature][8]

13. This time we stop and think carefully about what the data means. The departure and arrival times are closely related to the speed that the train travels. In fact, the average speed is given by the distance between the stops divided by how long it took to travel it: the arrival time *minus* the departure time.
    We are trying to predict the peak speed, which is different but still related. Let's try it again, this time using the travel time, the difference between arrival and departure times, as our engineered feature.

    ![head of data file with difference feature][9]

14. Just like last time.

15. Just like last time.

16. Just like last time.

17. This time the model fits with a tremendously improved R-squared of **0.885**, more than 50 times higher than in either earlier attempt.

18. A look at the scatterplot confirms that the travel time shows a strong relationship with the peak speed.

    ![scatterplot of the travel time and peak speed][10]

Even though this was a simple example, it illustrates that the way to engineer features may not always be obvious, even when the need for it is. Complex data sets require careful thinking about how to combine features. There is no end to the list of possible methods, and there is no solution that will work in every case.

The secret that will guide you through your feature engineering is to **know your domain**. Applying what you know about the problem you are trying to solve will give you insights into how to engineer features. And if you don't know the domain already, there is no substitute for diving in, reading a few books, talking to a few experts, and getting your hands dirty.

This example is one of four that I talk about in a presentation called The Other Stuff: Getting from Machine Learning to Data Science. The rest of the examples from The Other Stuff are in [this collection][11], including [data visualization][12], [handling missing values][13], and [operationalization][14]. [Here are the slides][15] that walk through the examples.

If you found this helpful, please take a look at [my other submissions][16] to the Cortana Analytics Gallery and [follow me on Twitter][17].

I am a Senior Data Scientist at Microsoft.

[1]:
[2]:
[3]:
[4]:
[5]:
[6]:
[7]:
[8]:
[9]:
[10]:
[11]:
[12]:
[13]:
[14]:
[15]:
[16]:
[17]:
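
As a rough, self-contained sketch of the comparison in the walkthrough (not the original Azure ML experiment or the linked scripts), the Python below invents its own synthetic trips and scores each candidate feature against top speed. Everything in it is an assumption made for illustration: the 1.5 km station spacing, the 2 to 4 minute travel times, the rule that peak speed is about 1.5 times the average speed plus noise, and the helper names `make_trips`, `r_squared`, and `compare_features`. It also swaps the boosted decision tree for a plain one-variable least-squares line, whose R-squared is simply the squared correlation, so its numbers will not match the ones reported in the walkthrough.

```python
import random


def make_trips(n=2000, seed=0):
    """Generate synthetic (departure, arrival, top_speed) columns.

    Every constant here is an illustrative assumption, not the author's script.
    """
    rng = random.Random(seed)
    distance_km = 1.5  # assumed distance between the two stations
    departure, arrival, top_speed = [], [], []
    for _ in range(n):
        depart = rng.uniform(5.0, 24.0)        # hours since midnight
        travel = rng.uniform(2.0, 4.0) / 60.0  # trip duration in hours
        average = distance_km / travel         # average speed in km/h
        # Assume peak speed tracks average speed, plus some noise.
        peak = 1.5 * average + rng.gauss(0.0, 2.0)
        departure.append(depart)
        arrival.append(depart + travel)
        top_speed.append(peak)
    return departure, arrival, top_speed


def r_squared(x, y):
    """R-squared of a one-variable least-squares line (squared Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def compare_features(n=2000, seed=0):
    """Return R-squared against top speed for each raw and engineered feature."""
    departure, arrival, top_speed = make_trips(n, seed)
    features = {
        "departure": departure,
        "arrival": arrival,
        # Interaction feature: product of the two raw times.
        "product": [d * a for d, a in zip(departure, arrival)],
        # Domain-informed feature: how long the trip took.
        "travel_time": [a - d for d, a in zip(departure, arrival)],
    }
    return {name: r_squared(col, top_speed) for name, col in features.items()}


if __name__ == "__main__":
    for name, score in compare_features().items():
        print(f"{name:12s} R-squared = {score:.3f}")
```

Under these assumptions, the travel-time feature should score far higher than the departure time, the arrival time, or their product. That mirrors the pattern in the walkthrough, even though the model here is much simpler than a boosted decision tree.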