Email Classification for Automated Support Ticket Generation: Step 1 of 2, Train and Evaluate Models

October 20, 2017
An email classification experiment that assigns an email to one or more classes from a predefined set of classes or work queues.

The goal of this experiment is to classify an email into one or more predefined classes or categories, and then to create a support ticket or assign the email to the correct support team.

This experiment has two steps:

- [Step 1 of 2: Train model with data and Save trained models][1]
- [Step 2 of 2: Create an experiment using the trained models and publish it as web service][2]

An optional next step is to [Integrate Predictive Web service in CRM workflow][3].

In this experiment, the input dataset has two raw text columns, Email Subject and Email Body, and three label columns: Case Type, Case Subject, and Queue Name. Once trained, the models should predict the case type of the email, the case subject, and the queue name to which the email belongs.

This experiment has the following steps:

1. Import Data
2. Data Preprocessing
3. Text Preprocessing
4. Feature Engineering
5. Train and Evaluate Model
6. Save Trained Models

**Experiment Work Flow:**

![Work flow][4]

**Import Data:** The Import Data module of this experiment loads data from Azure Blob Storage. Users can replace it with their own datasets, such as a file on a local system or an online store such as Azure SQL Database.

**Data Preprocessing:** Select the required columns from the dataset, *clean missing values*, and provide the result as input to the next module. In this experiment, the email subject and body are combined into a single text column, Email_text.

![Import data and clean data][5]

**Text Preprocessing:** Use the **Preprocess Text** module to specify the required text preprocessing steps, such as removing stop words, removing duplicate characters, replacing numbers, converting to lower case, and stemming words.

![Preprocess Text Options][6]

**Feature Engineering:** In this step, we use **feature hashing** to convert variable-length text to an equal-length numeric feature vector. The objective of using feature hashing is to reduce dimensionality.

![Feature hashing][7]

Classification time and model complexity depend on the number of input features.
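
As a rough illustration of the feature hashing idea described above, the following Python sketch maps text of any length to a fixed-length count vector. It is not the Feature Hashing module's implementation: the function name `hash_features`, the `n_bits` parameter, the simple tokenizer, and the use of MD5 from the standard library are all assumptions made for illustration.

```python
import hashlib
import re


def hash_features(text, n_bits=10):
    """Map variable-length text to a count vector of length 2**n_bits."""
    size = 1 << n_bits
    vector = [0] * size
    # Hypothetical tokenizer: lowercase, then keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for token in tokens:
        # Hash the token and fold the first 4 bytes of the digest into a bucket index.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vector[index] += 1
    return vector


# Two emails of very different lengths produce vectors of the same length.
short_vec = hash_features("Password reset")
long_vec = hash_features(
    "I cannot log in to my account, please reset my password as soon as possible"
)
print(len(short_vec), len(long_vec))
```

Both emails end up as vectors of the same length, regardless of how many words they contain. Different words can collide in the same bucket; a larger `n_bits` makes collisions rarer at the cost of a longer vector.
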
The number of features resulting from the previous step is high. To find a more compact feature set without harming overall model accuracy, we use **Filter Based Feature Selection** to keep the top 5000 features, ranked by the **Chi-squared** score function.

![Filter based Feature selection][8]

**Train and Evaluate Model:** In this experiment, we train three models independently to predict three different output values for a given email subject and body.

![Training and tuning][9]

The output of Filter Based Feature Selection contains the 5000 features extracted from the email and one of the three label columns (case type, case subject, or queue name) for training. The email features, along with their label column, are given as input to the respective training model.

We use the Split Data module to split the data into training and test sets. As a rule of thumb, it can be divided in a 70/30 ratio for training and testing respectively.

The training data has to be provided as one of the inputs to **Tune Hyperparameters** to get the optimal values for the underlying algorithm parameters. In this experiment, the sweep mode is set to **random sweep**; you can choose other options, such as Entire grid. Choose any classification algorithm and connect it to the left input port of Tune Hyperparameters, and provide part of the testing data as input to Tune Hyperparameters. Specify the number of times the experiment should run the random sweep, and select one of the performance evaluation metrics to measure model performance. Use the performance metric among *AUC, accuracy, precision, recall,* and *F-score* that best suits your needs.

The Score Model module is used to make predictions on the test dataset.
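
The Chi-squared filter and the 70/30 split described above can be sketched in plain Python as follows. This is a simplified illustration that uses a common chi-squared formulation for non-negative count features; it is not the internals of the Filter Based Feature Selection or Split Data modules. The function names (`chi2_scores`, `select_top_k`, `split_data`) and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def chi2_scores(rows, labels):
    """Chi-squared score of each non-negative feature column against the labels."""
    n_samples, n_features = len(rows), len(rows[0])
    class_counts = defaultdict(int)
    observed = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value  # per-class sum of feature j
    scores = []
    for j in range(n_features):
        total = sum(observed[c][j] for c in class_counts)
        score = 0.0
        for c, count in class_counts.items():
            # Expected per-class sum if feature j were independent of the class.
            expected = total * count / n_samples
            if expected > 0:
                score += (observed[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(rows, labels, k):
    """Keep the k highest-scoring feature columns (the article keeps 5000)."""
    scores = chi2_scores(rows, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in rows]


def split_data(rows, labels, train_fraction=0.7, seed=0):
    """Shuffle row indices, then split them into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(indices)))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


# Toy hashed features: columns 0 and 1 track the class, column 2 is uninformative.
toy_rows = [[2, 0, 1], [3, 0, 1], [0, 2, 1], [0, 3, 1]]
toy_labels = ["billing", "billing", "technical", "technical"]

keep, selected_rows = select_top_k(toy_rows, toy_labels, k=2)
print(keep)  # [0, 1]

train_rows, train_labels, test_rows, test_labels = split_data(selected_rows, toy_labels)
```

In the experiment itself, this selection and split would be repeated once per label column (case type, case subject, queue name), since the three models are trained independently.
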
The Evaluate Model module evaluates these predictions against the true labels and visualizes the results. Once the desired output is achieved, click the right output port of each Tune Hyperparameters module and save the trained models using the **Save as Trained Model** option.

![Save trained model][10]

These trained models will be used in step 2 of this experiment.

[1]:
[2]:
[3]:
[4]:
[5]:
[6]:
[7]:
[8]:
[9]:
[10]: