Clustering: Find similar companies

By for September 2, 2014
This experiment groups similar companies into the same cluster based on their Wikipedia articles, and can be used to assign a cluster to a new company.
This experiment demonstrates how to use the K-Means clustering algorithm to segment companies from the Standard & Poor's (S&P) 500 index, based on the text of the Wikipedia article about each company.

## Data

The Wikipedia articles were pre-processed outside Azure ML Studio to extract and partially clean the text content related to each company. The processing included:

* Removing wiki formatting
* Removing non-alphanumeric characters
* Converting all text to lowercase
* Adding company categories, where known

For some companies, articles could not be found; therefore, the number of records is less than 500.

## Model

First, the contents of each Wikipedia article were passed to the **Feature Hashing** module, which tokenizes the text string and then transforms the data into a series of numbers, based on the hash value of each token.

Even with this transformation, the data is too high-dimensional and too sparse to be used by the K-Means clustering algorithm directly. Therefore, Principal Component Analysis (PCA) was applied using a custom R script in the **Execute R Script** module to reduce the dimensionality to 10 variables. You can review the result of PCA by double-clicking the right-hand output of the **Execute R Script** module.

From trial and error, we learned that the first variable in the PCA-transformed data had the highest variance and appeared to have a detrimental effect on clustering. Therefore, we removed it from the feature set using **Project Columns**.

Once the data was prepared, we created several instances of the **K-Means Clustering** module and trained models on the text data. By trial and error, we found that the best results were obtained with 3 clusters, but models using 4 and 5 clusters were also tried.

Finally, we used **Metadata Editor** to change the cluster labels into categorical values, and saved the results in CSV format for downloading, using the **Convert to CSV** module.
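
For readers who want to try the same sequence of steps outside Azure ML Studio, the following is a minimal sketch in Python with scikit-learn. It is not the experiment's actual implementation (which uses Studio modules and an R script): the function name `cluster_articles`, the hash size, the toy texts, and the smaller component count in the demo are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import HashingVectorizer


def cluster_articles(texts, n_hash_features=2**10, n_components=10,
                     n_clusters=3, seed=0):
    """Hash text, reduce with PCA, drop the first component, run K-Means.

    Hypothetical helper: parameter names and defaults are illustrative only.
    """
    # Feature hashing: tokenize each article and map tokens to hashed columns.
    hasher = HashingVectorizer(n_features=n_hash_features,
                               alternate_sign=False, norm=None)
    hashed = hasher.transform(texts).toarray()  # dense array for PCA

    # PCA: reduce the hashed features to a small number of components.
    pca = PCA(n_components=n_components, random_state=seed)
    reduced = pca.fit_transform(hashed)

    # Drop the first (highest-variance) component, as described above.
    features = reduced[:, 1:]

    # K-Means on the remaining components.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    return labels, features


# Toy stand-ins for pre-processed article text (made up for this sketch).
texts = [
    "oil gas exploration drilling refinery pipeline",
    "crude oil production offshore drilling energy",
    "bank loans deposits mortgage credit cards",
    "insurance bank investment wealth management",
    "software cloud computing servers semiconductors",
    "hardware software devices chips networking",
]

# Only 6 toy documents, so use 4 components instead of the 10 in the write-up.
labels, features = cluster_articles(texts, n_components=4)
print(labels)
```

With real article text, `n_components=10` would match the description above; the hash size and the choice to densify the matrix before PCA are conveniences of this sketch and may not scale to large vocabularies.
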
![][image1]

## Results

To view the results from the sample experiment:

1. Right-click the output from **Metadata Editor** and select **Visualize**.
2. Plot the Category column (a known feature from the Wikipedia data) against the Assignments column.

The three clusters that we obtained correspond roughly to three plausible categories. Note, however, that the clusters are not clearly delineated.

![results][image2]

<!-- Images -->
[image1]:
[image2]:
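
The Category-versus-Assignments comparison can also be made numerically once the results are downloaded as CSV. The sketch below uses a pandas cross-tabulation; the column names `Category` and `Assignments` follow the text above, while the rows themselves are hypothetical placeholders rather than actual output from the experiment.

```python
import pandas as pd

# Hypothetical rows standing in for the downloaded CSV; in practice this
# frame would be loaded with pd.read_csv on the exported results file.
results = pd.DataFrame({
    "Category": ["Energy", "Energy", "Financials",
                 "Financials", "Technology", "Technology"],
    "Assignments": [0, 0, 1, 2, 2, 2],
})

# Count how many companies of each known category fall into each cluster.
table = pd.crosstab(results["Category"], results["Assignments"])
print(table)
```

Rows dominated by a single column suggest that a category maps cleanly to one cluster; counts spread across several columns reflect the overlap between clusters noted above.
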