Clustering (K-Means) basic

June 27, 2017
This experiment clusters similar companies into the same group based on their Wikipedia articles. The trained model can also be used to assign a cluster to a new company.
# Clustering: Find similar companies

This experiment demonstrates how to use the K-Means clustering algorithm to segment companies from the Standard & Poor's (S&P) 500 index, based on the text of the Wikipedia article about each company.

# Data

The articles from Wikipedia were pre-processed outside Azure ML Studio to extract and partially clean the text content related to each company. The processing included:

- Removing wiki formatting
- Removing non-alphanumeric characters
- Converting all text to lowercase
- Adding company categories, where known

For some companies, articles could not be found; therefore the number of records is less than 500.

# Model

First, the contents of each Wikipedia article were passed to the Feature Hashing module, which tokenizes the text string and then transforms the data into a series of numbers, based on the hash value of each token.

Even after this transformation, the data is too high-dimensional and sparse to be used by the K-Means clustering algorithm directly. Therefore, Principal Component Analysis (PCA) was applied using a custom R script in the Execute R Script module to reduce the dimensionality to 10 variables. You can review the result of PCA by double-clicking the right-hand output of the Execute R Script module.

From trial and error, we learned that the first variable in the PCA-transformed data had the highest variance and appeared to have a detrimental effect on clustering. Therefore, we removed it from the feature set using Project Columns.

Once the data was prepared, we added a K-Means Clustering module and trained models on the text data. Finally, we used Metadata Editor to change the cluster labels into categorical values.

Thanks to Microsoft - Brandon Rohrer.
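
As a rough illustration of the Model steps outside Azure ML Studio, the following Python sketch hashes tokens into a fixed number of buckets and then runs a plain K-Means loop over the resulting vectors. It is not the experiment's actual implementation: the function names (`hash_features`, `kmeans`, `nearest`), the bucket count, the MD5-based hash, and the toy documents are all assumptions made for illustration, and the PCA step is left out to keep the example short.

```python
import hashlib
import random


def hash_features(text, n_buckets=32):
    """Map a text to a fixed-length count vector (the hashing trick)."""
    vec = [0.0] * n_buckets
    for token in text.lower().split():
        # MD5 is stable across runs; Python's built-in hash() is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_buckets] += 1.0
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm with random initial centroids drawn from the data."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the last set of centroids.
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Hypothetical, already-cleaned article snippets (not the experiment's data).
docs = [
    "bank loans credit cards mortgage bank",
    "insurance bank credit financial services",
    "software cloud computing software devices",
    "semiconductor chips software hardware devices",
]

features = [hash_features(d) for d in docs]
labels, centroids = kmeans(features, k=2)

# Assigning a cluster to a new company reuses the trained centroids.
new_label = nearest(hash_features("regional bank offering loans"), centroids)
print(labels, new_label)
```

With real articles, the hashed vectors would be far wider and sparser than in this toy case, which is why the experiment reduces them with PCA before clustering.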
----------

> This ML experiment is for the Microsoft Azure Machine Learning Course.

Laploy