Methods for handling missing values
This experiment illustrates a variety methods for handling missing data on a sample data set.
Real world data is usually missing values, which trip up a lot of machine learning algorithms. There are lots of tricks for dealing with these, but you have to be careful. The way in which you fill them can change the result dramatically. Being explicit and thoughtful about how you handle missing values will get you the very best results.
I've illustrated a large handful of approaches to missing values here in a fake data set. The data shows a group of employees, some of their personal data, and some data regarding an upcoming office party. In every case, knowing what the data means is the most important part of handling it well.
![Employee data with missing values][1]
To get the most out of this, open the experiment in Studio. That way you can visualize the data set at each step by clicking on the output dot of each module and selecting "Visualize."
![Missing values experiment graph from Azure ML, top half][2]
1. Replace missing values with the mean. For this age data, we assume that missing values are distributed similarly to the values that are present. The formal name for this assumption is Missing Completely at Random (MCAR). In this case, substituting values that represent the existing distribution, such as the mean, is a reasonable approach.
2. Replace missing values with the median. This is another justifiable way to handle missing-at-random data, although note that it gives a different answer. For categorical data, it's also common to use the mode, the most commonly occurring value.
3. Replace missing values with an interpolated estimate. Inspection of the data shows that the values in this column, years seniority, is ordered, greatest to least. This structure can be exploited by interpolating the missing value. This approach is very effective when it is appropriate, usually with time-series data.
4. Replace missing values with a constant. Sometimes missing values are Missing Not at Random (MNAR), that is, that they are missing for some reason that is reflected by what their value should be. For income, we assume that values are missing because those with very high incomes preferred not to state them. In this case, we can make a reasonable guess for what "high" means and fill in the blanks. It will still be inaccurate, but more accurate than it was.
5. Replace missing values using imputation. Imputation is a way of using features to model each other. That way, when one is missing, the others can be used to fill in the blank in a reasonable way. This example shows Multivariate Imputation by Chained Equations (MICE) in action on age, years seniority, and income. This approach is particularly powerful when features are somewhat related. In this example, someone with many years seniority is likely to have a higher age. This means that the two features share information, and that when one is absent the other can be used to estimate it.
6. Replace missing values with a missing rank. Our knowledge of how parking spaces are numbered let us make a guess here. All the space numbers from 1-11 are accounted for, so the missing one might be 12. Reasoning along these lines can be applied when data is given in the form of a rank--each item has a positive integer value, and there are no duplicates and no gaps.
7. Replace missing values with a dummy value and create an indicator variable for "missing." When a missing value really means that the feature is not applicable, then that fact can be highlighted. Filling in a dummy value that is clearly different from actual values, such as a negative rank, is one way to do this. Another is to create a new true/false feature tracking whether the original feature is missing. These two approaches work together very nicely and are both illustrated here.
8. Replace missing values with 0. A missing numerical value can mean zero. In the case of an RSVP, invitees who are not planning to attend sometimes neglect to respond, but guests planning to attend are more likely to. In this case, filling in missing blanks with a zero is reasonable.
9. Replace missing values with 0 and create an indicator variable for "missing." Replacing missing values requires making assumptions. Whenever your confidence in those assumptions is low, it is safer to also create a true/false feature indicating that the value was missing. This allows many algorithms to learn to weight those differently. In the case of RSVPs, creating a missing value column is a way of noting that there is less confidence connected to it than to a zero in which an invitee sent their regrets.
![Missing values experiment graph from Azure ML, bottom half][3]
10. Replace missing values with a string. In categorical data, missing values can be replaced with a string communicating that they are missing.
11. Add an indicator variable showing which strings are considered "missing." Unfortunately, there are lots of ways to communicate the concept of "missing" in a string. These can be unified into a single indicator feature by explicitly listing them. Doing this is a way to make sure that all the missing values are handled the same way.
12. Delete columns that are missing too many values to be useful. If a feature is missing too many values, or if there is not enough information available to make reasonable assumptions about how to replace the missing values, you can delete the column entirely. If the feature does not provide useful information, including it can slow down the model's runtime. For many algorithms, having many noisy or uninformative columns can actually degrade their performance. In those cases, deleting these columns during data preparation is the best policy.
13. Delete rows that are missing critical values. As with columns, rows that are missing important features can be deleted. This is particularly useful when you have the luxury of hand-picking high-quality data, such as when training a model.
The rest of the examples from The Other Stuff are in [this collection][4], including [data visualization][5], [feature engineering][6], and [operationalization][7]. [Here are the slides][8] that walk through the examples.
If you found this helpful please take a look at [my other submissions][9] to the Cortana Analytics Gallery and [follow me on Twitter][10].
http://bit.ly/1jreneN links to this page.
I am a Senior Data Scientist at Microsoft.
[1]: https://github.com/brohrer-ms/public-hosting/raw/master/missing_values/missing_values_data.png
[2]: https://raw.githubusercontent.com/brohrer-ms/public-hosting/master/missing_values/missing_values_graph_top.png
[3]: https://raw.githubusercontent.com/brohrer-ms/public-hosting/master/missing_values/missing_values_graph_bottom.png
[4]: https://gallery.cortanaanalytics.com/Collection/The-other-stuff-Getting-from-machine-learning-to-data-science-1
[5]: https://gallery.cortanaanalytics.com/Experiment/Data-exploration-through-visualization-1
[6]: https://gallery.cortanaanalytics.com/Experiment/Create-useful-features-for-trains-1
[7]: https://gallery.cortanaanalytics.com/Experiment/Make-an-API-for-trains-1
[8]: https://github.com/brohrer-ms/public-hosting/raw/master/the_other_stuff_TDSC.pdf
[9]: https://gallery.cortanaanalytics.com/Home/Author?authorId=A45B9C46BE3D2A79C7BF21A0E91E205066C4F21409C28B5EB871E3018F4C298A
[10]: https://twitter.com/intent/user?screen_name=_brohrer_