HARMAN ANALYTICS: F3 Media Churn Analysis NN and SVM

October 23, 2015
A churn model using a neural net (NN) and an SVM for the ‘F3: Fall Footbal Fanatica’ app, with a demographic breakdown of customers who do and do not churn.
The online mobile app ‘F3: Fall Footbal Fanatica’ earns money by pushing ads to its subscribers. To these ends, in support of real-time ad auctions, it is important to understand who the subscribers are and who does and does not stay with the app. The ‘HARMAN ANALYTICS: F3 Media Churn Analysis NN & SVM’ experiment illustrates a neural net and an SVM within Azure ML that predict churn. Specifically, the experiment:

1. Runs a common imputation process to fill in missing data by interpolation.
2. Splits the dataset into a training (5%) subset and a reserve test (95%) subset.
3. Trains both a neural net and an SVM on the training set.
4. Evaluates the models by:
   a) confirming model predictivity using the training set and the reserve test set (which was excluded from training), and
   b) using R for simple graphics to illustrate the age and gender demographics of the true-positive, false-negative, true-negative and false-positive subsets.

![Scheme][1]

STEP 1: COMMON IMPUTATION PROCESS
---------------------------------

Real-world data always requires clean-up: shaping the data into a coherent model with properly cast field types, removing text strings where there should be numbers, eliminating values outside of possible ranges, and filling in holes where data is absent (imputation). The segment of the experiment immediately after reading the CSV is called the ‘Common Imputation Process’ since imputation is the most critical action that occurs in that segment.

![Imputation Segment][2]

The imputation occurs specifically in the two modules marked ‘Clean Missing Data’. The other operations in this segment are:

- The ‘Project Columns’ module: the original data has more than 200 fields, several of them redundant and all but 17 unused by the modelling. These are eliminated here to leave a workable dataset.
- The three ‘Metadata’ modules cast fields as the proper types (booleans rather than text, floats rather than ints, and so on).
- The two ‘Clean Missing Data’ modules use means and modes to replace nulls with reasonable replacement values.
- ‘Clip Values’ eliminates outliers.
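The clean-up pass above can be sketched outside Azure ML with pandas. The column names and values below are hypothetical stand-ins, not the experiment's actual fields:

```python
import pandas as pd

# Toy frame standing in for the raw CSV; 'tenure_months' and 'plan_type'
# are hypothetical field names, not the experiment's real columns.
df = pd.DataFrame({
    "tenure_months": [12, None, 460, 3, 8],   # 460 is an out-of-range outlier
    "plan_type": ["basic", "premium", None, "basic", "basic"],
})

# Cast fields to proper types (the role of the metadata modules).
df["tenure_months"] = df["tenure_months"].astype("float64")

# Clean Missing Data: mean for numeric fields, mode for categorical fields.
df["tenure_months"] = df["tenure_months"].fillna(df["tenure_months"].mean())
df["plan_type"] = df["plan_type"].fillna(df["plan_type"].mode()[0])

# Clip Values: cap numeric outliers at a plausible ceiling.
df["tenure_months"] = df["tenure_months"].clip(upper=120)
```

The same pattern (mean/mode fill, then clip) extends to however many numeric and categorical columns survive the column projection.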
Finally, ‘Normalize Data’ replaces the true ranges with ranges between 0 and 1 for selected columns.

STEP 2: SPLITTING DATA INTO TEST AND TRAINING SETS
--------------------------------------------------

The ‘Split’ module randomly places 95% of the dataset into a reserve test set and the other 5% into a training set. The neural net and the SVM are trained on the training set. The predictive ability of the models is then observed by predicting churn for the test set, which was not used in training and hence mimics a real-life situation.

STEP 3: REMOVING INDEX, ID AND DEMOGRAPHIC FIELDS BEFORE TRAINING
-----------------------------------------------------------------

The NN and the SVM do not use several of the fields. Obviously, training on a randomly generated customer ID is not defensible. The demographic data could have been used, but it was eliminated by stepwise regression in a prior exercise. However, the demographic data is needed for the analysis later, so it is not eliminated from the test set.

NN TRAINING: The neural net is trained and the model is later used for prediction. The NN has 10 hidden nodes and is trained over 1000 iterations. The model is used to score both the training and test sets.

SVM TRAINING: The SVM is trained and the model is later used for prediction. The kernel is assumed to be Gaussian, though the MSFT documentation is unclear on this. 10 iterations are employed and lambda is 0.001.

STEP 4: EVALUATION
------------------

Evaluation is performed for both the NN and the SVM, on both the training and test sets. The scored probabilities are scaled (by a call to the ‘Normalize Data’ module) to range from 0 to 1.00. The ‘Evaluate’ modules are passed the normalized data and report the AUC, TN, TP, FP and FN metrics at a 0.50 threshold.
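Steps 2 through 4 can be sketched outside Azure ML with scikit-learn. The data below is synthetic, and the mapping of the Azure ML modules onto `MLPClassifier` and `SVC` is an assumption for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, confusion_matrix

# Synthetic stand-in for the cleaned dataset: 17 modelling fields
# and a churn flag loosely tied to the first field.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 17))
y = (X[:, 0] + rng.normal(size=2000) > 1).astype(int)

# Step 2: 5% training subset, 95% reserve test subset.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.05, random_state=0)

# Step 3: NN with 10 hidden nodes over 1000 iterations; RBF ('Gaussian') SVM.
nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000).fit(X_tr, y_tr)
svm = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)

# Step 4: AUC plus confusion-matrix counts at the 0.50 threshold.
for name, model in [("NN", nn), ("SVM", svm)]:
    p = model.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, p >= 0.50).ravel()
    print(name, "AUC:", round(roc_auc_score(y_te, p), 3),
          "TP/FP/FN/TN:", tp, fp, fn, tn)
```

Note the deliberate imbalance of the split: with only 5% of the data used for training, the large reserve test set gives a conservative, near-production estimate of predictivity.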
The interested reader can check each of the four ‘Evaluate’ modules (NN training set, NN test set, SVM training set and SVM test set). To summarize: the NN has an AUC of 0.829 and selects a subset that is 16.54% of the whole and contains 66% of the churn occurrences. The SVM has an AUC of 0.808 and selects a subset that is 9.44% of the whole and contains 42% of the churn occurrences.

![Results Summary][3]

[1]: http://dgsprous0808.weebly.com/uploads/6/3/9/0/63905587/1583778_orig.jpg
[2]: http://dgsprous0808.weebly.com/uploads/6/3/9/0/63905587/4042163_orig.jpg
[3]: http://dgsprous0808.weebly.com/uploads/6/3/9/0/63905587/1445618789.png
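The two summary percentages follow directly from the confusion-matrix counts an ‘Evaluate’ module reports. The counts below are illustrative values chosen so the arithmetic reproduces the NN's reported 16.54% / 66% figures; the actual module counts may differ:

```python
# Illustrative counts (not the experiment's actual numbers): at the 0.50
# threshold a model flags TP + FP customers out of N scored.
tp, fp, fn, tn = 330, 497, 170, 4003
n = tp + fp + fn + tn

flagged_fraction = (tp + fp) / n   # share of the whole set selected by the model
churn_capture = tp / (tp + fn)     # share of all churners inside that subset

print(f"{flagged_fraction:.2%} of customers flagged, "
      f"capturing {churn_capture:.0%} of churn")
# prints: 16.54% of customers flagged, capturing 66% of churn
```

The same two ratios computed from the SVM's counts give its 9.44% / 42% summary.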