Multiclass Classification: News Categorization

September 2, 2014
This sample demonstrates how to use multiclass classifiers and feature hashing in Azure ML Studio to classify news into different categories.
## Data ##

We used the 2004 Reuters news dataset. The training set has about 23,000 examples, and the test set has about 781,000 examples. The original dataset has 103 categories that are organized into four hierarchies:

- Corporate-Industrial (CCAT)
- Government and Social (GCAT)
- Economics and Economic Indicators (ECAT)
- Securities and Commodities Trading and Market (MCAT)

For this experiment, we used the names of the hierarchies as the label, that is, the attribute to predict. We were therefore solving a multiclass classification problem with four classes.

An article in the original dataset can belong to more than one hierarchy. For such articles, we created a separate example for each combination of article and label, so that the examples share the same features but have different labels. For instance, an article that belongs to both CCAT and GCAT produces two examples in the label dataset, one for CCAT and one for GCAT.

The training and test articles, as well as the labels, are available as files stored in Azure public blob storage.

## Data Preparation ##

Because the original Reuters data has no column headings, after reading the data from storage we replaced the dummy column headings with meaningful column names by using the **Metadata Editor** module. From the label data, we kept only the rows already tagged with hierarchy names (CCAT, ECAT, GCAT, MCAT). We then used the **Join** module to attach these labels to the unlabeled training and test data, and removed duplicate rows with the **Remove Duplicate Rows** module.

## Feature Engineering ##

We used the **Feature Hashing** module to convert the plain text of the articles to integers and used the integer values as input features to the model.
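
To make the idea of feature hashing concrete, here is a minimal Python sketch. It is not the implementation used by the **Feature Hashing** module: the tokenizer, the MD5-based hash, the `n_bits` parameter, and the function name `hash_features` are all assumptions chosen for illustration. Each word is hashed to one of `2 ** n_bits` integer buckets, and the per-bucket counts become the feature vector.

```python
import hashlib
import re
from collections import Counter


def hash_features(text, n_bits=10):
    """Return a sparse {bucket_index: count} dict of hashed word counts.

    Illustrative only: the hash function and tokenizer are assumptions,
    not the ones used by the Azure ML Studio module.
    """
    n_buckets = 2 ** n_bits
    counts = Counter()
    # Lowercase the text and split it into alphanumeric tokens.
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Use a stable hash so the same word always maps to the same bucket.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        counts[bucket] += 1
    return dict(counts)


# Hypothetical example sentence, not taken from the Reuters data.
features = hash_features("Oil prices rise as oil supply falls")
```

Because the number of buckets is fixed in advance, the feature vector has the same width for every article regardless of vocabulary size. The trade-off is that unrelated words can collide in the same bucket.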
![][image1]

## Model ##

We compared two nonlinear multiclass classifiers:

- **Multiclass Decision Forest**
- **One-vs-All** classifier using the **Two-Class Decision Forest** module as the base classifier

Both classifiers used their default parameters.

![][image2]

## Results ##

The **One-vs-All** classifier reached an accuracy of 71.7%, compared to 69.6% for the native multiclass classifier (**Multiclass Decision Forest**). All accuracy values were computed and compared using a custom script in the **Execute R Script** module.

The following graphic shows the confusion matrices for the **One-vs-All** classifier on the left and the **Multiclass Decision Forest** model on the right.

![][image3]

From these results, you can see that the "Economics and Economic Indicators (ECAT)" category has the worst prediction accuracy.

<!-- Images -->
[image1]:http://az712634.vo.msecnd.net/samplesimg/v1/1/dataGraph.PNG
[image2]:http://az712634.vo.msecnd.net/samplesimg/v1/1/modelGraph.PNG
[image3]:http://az712634.vo.msecnd.net/samplesimg/v1/1/perf.PNG
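
As a rough illustration of how overall accuracy and per-category accuracy relate to a confusion matrix, the following Python sketch computes both from lists of actual and predicted labels. It is not the custom R script used in the experiment. The helper names and the toy labels are made up for this example and do not reproduce the reported results.

```python
from collections import Counter

LABELS = ["CCAT", "ECAT", "GCAT", "MCAT"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Fraction of examples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


def per_class_recall(matrix, labels=LABELS):
    """For each actual class, the fraction of its examples predicted correctly."""
    return {
        label: (row[i] / sum(row) if sum(row) else 0.0)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }


# Toy labels for illustration only (not the experiment's data).
actual = ["CCAT", "CCAT", "ECAT", "ECAT", "GCAT", "MCAT"]
predicted = ["CCAT", "CCAT", "CCAT", "ECAT", "GCAT", "MCAT"]

matrix = confusion_matrix(actual, predicted)
overall = accuracy(matrix)
recall = per_class_recall(matrix)
```

In this toy data, one of the two ECAT articles is misclassified as CCAT, so ECAT has the lowest per-class value. That is the kind of pattern read off the confusion matrices in the Results section.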