Clustering (K-Means) basic

June 27, 2017
This experiment groups similar companies into the same cluster based on their Wikipedia articles, and the trained model can be used to assign a cluster to a new company.
# Clustering: Find similar companies

This experiment demonstrates how to use the K-Means clustering algorithm to segment companies from the Standard & Poor's (S&P) 500 index, based on the text of the Wikipedia article about each company.

![enter image description here][1]

# Data

The Wikipedia articles were pre-processed outside Azure ML Studio to extract and partially clean the text content related to each company. The processing included:

- Removing wiki formatting
- Removing non-alphanumeric characters
- Converting all text to lowercase
- Adding company categories, where known

Articles could not be found for some companies, so the number of records is less than 500.

# Model

First, the contents of each Wikipedia article were passed to the Feature Hashing module, which tokenizes the text string and then transforms the data into a series of numbers, based on the hash value of each token.

Even after this transformation, the data is too high-dimensional and sparse to be used by the K-Means clustering algorithm directly. Therefore, Principal Component Analysis (PCA) was applied using a custom R script in the Execute R Script module to reduce the dimensionality to 10 variables. You can review the result of PCA by double-clicking the right-hand output of the Execute R Script module.

By trial and error, we learned that the first variable in the PCA-transformed data had the highest variance and appeared to have a detrimental effect on clustering. Therefore, we removed it from the feature set using Project Columns.

Once the data was prepared, we added a K-Means Clustering module and trained models on the text data. Finally, we used Metadata Editor to change the cluster labels into categorical values.

![enter image description here][99]

Thanks to Microsoft - Brandon Rohrer.
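
For readers who want to see the core idea outside Azure ML Studio, the feature hashing and K-Means steps described in the Model section can be sketched in plain Python. This is only an illustration under stated assumptions, not the experiment's actual implementation: the function names (`hash_features`, `kmeans`, `nearest`), the bucket count, the use of MD5 as the hash, and the toy documents are all hypothetical, and the PCA step is left out to keep the sketch short.

```python
import hashlib
import math
import random


def hash_features(text, n_buckets=16):
    """Map a text to a fixed-length count vector via the hashing trick."""
    vec = [0.0] * n_buckets
    for token in text.lower().split():
        # MD5 is an assumption made here for a stable hash; Python's built-in
        # hash() is salted per process and would change between runs.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_buckets] += 1.0
    # L2-normalise so long and short articles are comparable.
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    """Return the index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], point))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = [nearest(centroids, p) for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Recompute labels so they match the final centroids.
    labels = [nearest(centroids, p) for p in points]
    return centroids, labels


# Toy stand-ins for the cleaned article text (hypothetical, not the real data).
docs = [
    "bank loans credit mortgage bank",
    "credit bank insurance loans",
    "oil gas drilling pipeline oil",
    "gas pipeline refinery oil",
]
vectors = [hash_features(d) for d in docs]
centroids, labels = kmeans(vectors, k=2)

# Assign a previously unseen company to its nearest cluster.
new_label = nearest(centroids, hash_features("regional bank offering credit and loans"))
print(labels, new_label)
```

With so few buckets, unrelated tokens can collide in the same position, so the toy clusters may not separate cleanly. The experiment itself uses a much larger hashed feature space and then reduces it with PCA before clustering.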
https://gallery.cortanaintelligence.com/Experiment/Clustering-Find-similar-companies-23

----------

> This ML experiment is for [Microsoft Azure Machine Learning Course][101].<br>
> For the complete experiment list [Click here][102].<br>
> Laploy | laploy@gmail.com | 084 007 5544 | [www.laploy.com][103]<br>
> ![enter image description here][104]

----------

[101]: https://notebooks.azure.com/laploy/libraries/loyml/html/00001%20Sessions%20summary.ipynb
[102]: https://gallery.cortanaintelligence.com/Home/Author?authorId=81E333F747E3429B55A3445E6714C36F60B397C13B4D0B07F34DEF1421F64D73
[103]: http://laploy.com
[104]: https://raw.githubusercontent.com/laploy/mli/master//loy-small.jpg
[1]: https://raw.githubusercontent.com/laploy/mli/master//12520-000.PNG
[99]: https://raw.githubusercontent.com/laploy/mli/master//12520-099.PNG