Multiclass Text Classification Benchmark - end-to-end solution with almost zero code.

September 17, 2018
The goal is to create a multiclass text classification experiment for the 20newsgroups benchmark dataset.
The 20newsgroups dataset is available from the UCI repository: https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
The AML experiment, created with almost zero code, reaches an accuracy (82-83 %) similar to that of other methods.

To make the dependencies between the training and scoring parts clear, I put everything into one experiment. It covers the end-to-end solution: the left part is used for training, parameter tuning and validation, and the right branch shows how to implement a scoring experiment based on the trained model. In a real deployment you would need to separate the right part and use the saved trained model and the saved dictionary created by the training part.

The data are provided as training and test sets, already split by the data provider, and were converted from the original format to csv files. To avoid overfitting, the test dataset is split into two parts, a test set and a validation set: the test set is used for hyperparameter tuning and the validation set for the final model assessment.

Note: The full run of the experiment can take up to x hours. To shorten the time, you can remove the Tune Model Hyperparameters module and use the Train Model module instead.

You can check different approaches using SVM and DNN, including transfer learning: https://github.com/MartinMachac/TextClassification/blob/master/README.md

Reference:
https://gallery.azure.ai/Experiment/Text-Classification-Step-1-of-5-data-preparation-3
https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
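
For readers who want to see the idea outside AML, the split of the provided test data into a tuning part and a final-assessment part, as described above, can be sketched in plain Python. This is only an illustration, not how the experiment itself is built: the function name `stratified_halves`, the 50/50 fraction, the fixed seed and the toy labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_halves(labels, fraction=0.5, seed=0):
    """Split row indices into two parts with similar class proportions.

    Returns (tuning_idx, validation_idx). The fraction and seed are
    illustrative assumptions, not values taken from the experiment.
    """
    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    tuning, validation = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Send roughly `fraction` of each class to the tuning part.
        cut = int(round(len(indices) * fraction))
        tuning.extend(indices[:cut])
        validation.extend(indices[cut:])
    return sorted(tuning), sorted(validation)


# Toy stand-in for the labels of the provided test set (hypothetical data).
labels = ["sci.space"] * 6 + ["rec.autos"] * 4
tuning_idx, validation_idx = stratified_halves(labels)
```

The first part would be used only for hyperparameter tuning and the second only for the final model assessment, so the reported accuracy does not come from rows that influenced the parameter choice.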