Logistic Regression for Text Classification (Sentiment Analysis)

July 7, 2016
# Summary

This is a template experiment for performing document classification using logistic regression. It includes cross-validation and model output summary steps. The included data represents a variation on the common task of sentiment analysis, but the experiment structure is well suited to multiclass text classification more generally.

# Description

**Data**

This dataset contains approximately 10,000 tweets that have been labeled via the CrowdFlower platform as conveying either Happiness or Sadness. That gives the dataset an unusual perspective on the popular topic of sentiment analysis: where sentiment analysis typically focuses on expressions of positive or negative opinion, this data is instead grounded in emotional states.

The data contains three columns, two of which (label and features) are explicitly expected by the experiment as it is set up:

- id_nfpu: A unique identifier for each piece of data. This is useful if you are passing only part of your data to the classifier and want to stitch predictions back together with other metadata later on.
- label: The label assigned to a row of training data. Here, labels are either "happiness" or "sadness", representing the two emotional states of the posts being classified.
- features: The text to which the label applies. It is transformed into the features used by the model during training and prediction.

**Pipeline**

1. The data is subjected to three standard transformation/cleaning steps (sketched in the first code example after this list):
   - Converting the "label" column to a categorical variable
   - Removing any rows for which the label column is missing a value
   - Stripping out non-alphanumeric characters and converting text to lower case
2. Feature extraction is done using AML's native Feature Hashing module, set here to the fairly conservative parameters of unigram features and 12-bit hashing, i.e., a 2^12 = 4,096-dimensional feature space (see the hashing sketch below).
3. A logistic regression classifier is used. While the example data included with this experiment contains only two labels to predict, the model is created as a one-vs-all multiclass classifier, so it extends to more labels without structural changes.
4. In addition to training a model, cross-validation is included (defaulting to 5-fold). Summary statistics for cross-validation can be viewed directly via the output port of the Evaluate Model node, and predictions from the cross-validation run (averaged across folds) are also exported to CSV for inspection of model predictions (see the cross-validation sketch below).
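Outside AML Studio, the three cleaning steps from the pipeline's first stage can be reproduced in a few lines of pandas. This is a minimal sketch, not the experiment's actual modules; the file name `tweets.csv` is hypothetical, and the column names follow the data description above.

```python
import pandas as pd

# Load the labeled tweets. "tweets.csv" is a hypothetical file name; the three
# columns (id_nfpu, label, features) follow the data description above.
df = pd.read_csv("tweets.csv")

# Step 1a: convert the "label" column to a categorical variable.
df["label"] = df["label"].astype("category")

# Step 1b: remove any rows where the label is missing.
df = df.dropna(subset=["label"]).reset_index(drop=True)

# Step 1c: strip non-alphanumeric characters and lower-case the text.
# (fillna guards against missing text, which the hashing step cannot handle.)
df["features"] = (
    df["features"]
    .fillna("")
    .str.replace(r"[^a-zA-Z0-9\s]", " ", regex=True)
    .str.lower()
)
```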
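The Feature Hashing step maps each unigram into one of 2^12 = 4,096 buckets. A rough scikit-learn equivalent is `HashingVectorizer` with `n_features=2**12`; this is an assumption for illustration, since the AML module's internal hashing details differ.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# 12-bit hashing gives 2**12 = 4096 feature columns; unigrams only.
vectorizer = HashingVectorizer(
    n_features=2**12,
    ngram_range=(1, 1),    # unigram features
    alternate_sign=False,  # keep hashed term counts non-negative
)

# Hashing is stateless, so there is nothing to fit; transform maps each
# document straight into the 4096-dimensional sparse feature space.
X = vectorizer.transform(df["features"])
print(X.shape)  # (n_documents, 4096)
```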
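Finally, steps 3 and 4 train the one-vs-all logistic regression and score it with 5-fold cross-validation. The sketch below assumes `df` and `X` from the previous snippets are in scope; note that `cross_val_predict` yields one held-out prediction per row rather than the fold-averaged predictions the AML experiment exports, so it is only a rough stand-in.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# One-vs-all multiclass logistic regression, mirroring the experiment's setup.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
y = df["label"]

# 5-fold cross-validation: per-fold summary accuracy...
scores = cross_val_score(clf, X, y, cv=5)
print("fold accuracies:", scores)

# ...and one prediction per row (each row is predicted by the fold that did
# not train on it), exported with the id column for later inspection.
preds = cross_val_predict(clf, X, y, cv=5)
out = pd.DataFrame({"id_nfpu": df["id_nfpu"], "label": y, "prediction": preds})
out.to_csv("cv_predictions.csv", index=False)
```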