Data exploration through visualization

October 13, 2015

Visualization can reveal data structure.
This Azure ML experiment demonstrates how turning data into a picture can show its structure. It is one of several examples I use in a talk on The Other Stuff: what it takes to turn Machine Learning into Data Science.

![The Azure ML experiment for creating the smiley data set.][1]

1. This Python script creates the "smiley distribution" data set. (Read the comments in the script for fine-grained details on how I do this.) After running it, I manually saved the data set in "My Datasets" as "Smiley distribution." When you download this experiment, that data set will be copied as well. By clicking on the left-hand output node of this Execute Python Script module, you can choose to Visualize it. This offers several options for displaying the data. When you create a scatter plot of x vs. y values, the structure of the data becomes obvious.

   ![Smiley data scatterplot][2]

2. These modules load the saved Smiley distribution data set and convert it to a comma-separated values (CSV) file. I then saved this to my local hard drive as smiley.csv. This was a necessary step for importing the data into [PowerBI][3]. PowerBI has more powerful visualization options and allows us to color-code the classes. This reveals still more of the structure in the data.

   ![Smiley data scatterplot in color][4]

The rest of the examples from The Other Stuff are in [this collection][5], including [handling missing values][6], [feature engineering][7], and [operationalization][8]. [Here are the slides][9] that walk through the examples.

If you found this helpful, please take a look at [my other submissions][16] to the Cortana Analytics Gallery and [follow me on Twitter][17]. http://bit.ly/1WgAN3Q links to this page.

I am a Senior Data Scientist at Microsoft.
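
If you want to try the idea outside Azure ML, here is a minimal sketch of how a smiley-shaped data set like the one in step 1 might be generated and then written to a CSV file as in step 2. It is not the script from the experiment. The function names `make_smiley` and `write_csv`, the part labels, and every size and noise level are assumptions I picked for illustration, and it uses only the Python standard library.

```python
import csv
import math
import random


def make_smiley(n_per_part=200, noise=0.03, seed=0):
    """Return (x, y, label) rows whose scatter plot resembles a smiley face.

    All shapes, labels, and parameters here are illustrative assumptions,
    not the values used in the original experiment.
    """
    rng = random.Random(seed)
    rows = []

    def jitter(value):
        # Add a little Gaussian noise so the points do not sit on a perfect curve.
        return value + rng.gauss(0.0, noise)

    # Face outline: noisy points on the unit circle.
    for _ in range(n_per_part):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        rows.append((jitter(math.cos(angle)), jitter(math.sin(angle)), "outline"))

    # Eyes: two small Gaussian blobs in the upper half of the face.
    for label, center_x in (("left_eye", -0.35), ("right_eye", 0.35)):
        for _ in range(n_per_part):
            x = center_x + rng.gauss(0.0, 0.05)
            y = 0.35 + rng.gauss(0.0, 0.05)
            rows.append((x, y, label))

    # Mouth: the lower arc of a smaller circle of radius 0.6.
    for _ in range(n_per_part):
        angle = rng.uniform(1.2 * math.pi, 1.8 * math.pi)
        rows.append((jitter(0.6 * math.cos(angle)), jitter(0.6 * math.sin(angle)), "mouth"))

    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file with a header, ready for import elsewhere."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["x", "y", "label"])
        writer.writerows(rows)
```

Calling `write_csv(make_smiley(), "smiley.csv")` would produce a file with `x`, `y`, and `label` columns. Looking at the raw rows tells you very little, but a scatter plot of `x` against `y`, colored by `label`, should make the structure visible in the same way the plots above do.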
[1]: https://raw.githubusercontent.com/brohrer-ms/public-hosting/master/smiley/smiley_graph_small.png
[2]: https://raw.githubusercontent.com/brohrer-ms/public-hosting/master/smiley/smiley_dots_small.png
[3]: https://powerbi.microsoft.com/en-us/
[4]: https://raw.githubusercontent.com/brohrer-ms/public-hosting/master/smiley/smiley_color_small.png
[5]: https://gallery.cortanaanalytics.com/Collection/The-other-stuff-Getting-from-machine-learning-to-data-science-1
[6]: https://gallery.cortanaintelligence.com/Experiment/Methods-for-handling-missing-values-1
[7]: https://gallery.cortanaanalytics.com/Experiment/Create-useful-features-for-trains-1
[8]: https://gallery.cortanaanalytics.com/Experiment/Make-an-API-for-trains-1
[9]: https://github.com/brohrer-ms/public-hosting/raw/master/the_other_stuff_TDSC.pdf
[16]: https://gallery.cortanaanalytics.com/Home/Author?authorId=A45B9C46BE3D2A79C7BF21A0E91E205066C4F21409C28B5EB871E3018F4C298A
[17]: https://twitter.com/intent/user?screen_name=_brohrer_