This sample model performs product category classification: it classifies a product description into one of several product ids.
*Data*

The sample data set (210 examples and 2 categories) contains a number of products and customer search terms from Home Depot's website. The challenge is to predict the relevant product id from the product description.

Data Fields

1. product_uid - an id for the product
2. product_description - the text description of the product

For this experiment, I used product_uid as the label, that is, the attribute to predict. The data set contains 10 different product categories in total, and each category comes with several descriptions, which means the same product id repeats multiple times with different descriptions.

*Data Preparation*

A description text may contain punctuation, stopwords, HTML content, and so on. To get rid of these, I performed the following operations on the texts:

1. Removed punctuation
2. Removed stopwords such as and, an, is, are, etc.
3. Removed numbers from the text
4. Removed high-frequency words
5. Removed extra white space
6. Applied stemming

The model has to classify the product id for a given description, and many descriptions are available for the same product id. The main task is therefore to give the model the full set of descriptions so that it classifies the product id correctly. To do this, I created a document-term matrix, which gives a weighted score to each word present in a product description, for each product id. I wrote an R script for all of these steps and executed it in Azure.

*Model*

Because there are more than two product ids, a multi-class classification algorithm is needed. For this problem I used a two-class decision forest with the one-vs-all multiclass method. I will explain this method later.
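For readers unfamiliar with the one-vs-all method named above, the idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not the Azure module used in this experiment: a simple centroid-based two-class scorer stands in for the two-class decision forest, and the term-weight vectors and product ids below are made up. One binary model is trained per product id (that id versus all others), and the id whose model gives the highest score wins.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
BinaryModel = Tuple[List[float], float]  # (weights, bias)


def train_binary(rows: List[Vector], is_positive: List[bool]) -> BinaryModel:
    """Fit a toy two-class scorer from class centroids.

    Assumption: this stands in for a two-class decision forest, which is
    what the experiment actually used. Both classes must be non-empty.
    """
    n_features = len(rows[0])
    pos = [r for r, p in zip(rows, is_positive) if p]
    neg = [r for r, p in zip(rows, is_positive) if not p]
    mu_pos = [sum(r[j] for r in pos) / len(pos) for j in range(n_features)]
    mu_neg = [sum(r[j] for r in neg) / len(neg) for j in range(n_features)]
    weights = [p - n for p, n in zip(mu_pos, mu_neg)]
    # The bias puts the decision boundary halfway between the two centroids.
    bias = -0.5 * (sum(p * p for p in mu_pos) - sum(n * n for n in mu_neg))
    return weights, bias


def score(model: BinaryModel, row: Vector) -> float:
    """Higher score means 'more likely the positive class'."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, row)) + bias


def train_one_vs_all(rows: List[Vector], labels: List[str]) -> Dict[str, BinaryModel]:
    """Train one binary model per class: that class versus all the others."""
    return {
        label: train_binary(rows, [y == label for y in labels])
        for label in sorted(set(labels))
    }


def predict_one_vs_all(models: Dict[str, BinaryModel], row: Vector) -> str:
    """Return the class whose binary model gives the highest score."""
    return max(models, key=lambda label: score(models[label], row))


# Hypothetical data: each row is a term-weight vector over three terms,
# and each label is a made-up product id.
rows = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
]
labels = ["100001", "100001", "100002", "100002", "100003", "100003"]

models = train_one_vs_all(rows, labels)
predictions = [predict_one_vs_all(models, r) for r in rows]
print(predictions)
print(predict_one_vs_all(models, [0.8, 0.2, 0.0]))
```

With 10 product ids this scheme would train 10 binary models. Any two-class learner can be plugged in where `train_binary` and `score` are used here.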
I tried two or three different models and settled on this one because it gave 100% accuracy on both the train and test data sets. The multiclass decision forest gave 100% accuracy on the train data but 92% on the test data, which means it had overfitted the data. See below a screenshot of the steps performed.

*Result:*

The accuracy of the One-vs-All classifier was 100%. To check that this accuracy is right, I cross-validated the model. The following graphic shows the confusion matrices for the One-vs-All classifier.

![Confusion matrix](https://drive.google.com/drive/starred?ths=true)

From these results we can say that the model is now ready for product category classification.
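To make the confusion matrix and accuracy figures easier to read, here is a minimal Python sketch of how both are computed from actual and predicted labels. The labels below are hypothetical and chosen to include one mistake. They are not results from this experiment. A classifier with 100% accuracy would have every count on the diagonal.

```python
from collections import Counter
from typing import List, Tuple


def confusion_matrix(actual: List[str], predicted: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Count (actual, predicted) pairs; rows are actual classes, columns are predicted."""
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    return classes, matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Share of examples on the diagonal, where predicted equals actual."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical labels for illustration only (made-up product ids).
actual = ["100001", "100001", "100002", "100002", "100003", "100003"]
predicted = ["100001", "100001", "100002", "100003", "100003", "100003"]

classes, matrix = confusion_matrix(actual, predicted)
for label, row in zip(classes, matrix):
    print(label, row)
print("accuracy:", accuracy(matrix))
```

In cross-validation the same computation would be repeated on each held-out fold, so that every prediction is made on data the model did not see during training.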