Women’s Health Risk Assessment – 1st Prize Predictive Experiment

October 6, 2016

This is the winning solution for the Women’s Health Risk Assessment data science competition on Microsoft’s Cortana Intelligence platform. On this page you can find the published Azure ML Studio experiment of the most successful submission to the competition, a detailed description of the methods used, and links to code and references.

**The competition**

To help improve women's reproductive health outcomes in underdeveloped regions, this competition called for optimized machine learning solutions that accurately categorize patients into different health risk segments and subgroups. More specifically, the objective of the competition is to build machine learning models that assign a young woman (15-30 years old) in one of 9 underdeveloped regions to a risk segment, and to a subgroup within that segment. With accurate segment and subgroup assignments in each region, healthcare practitioners can deliver services that protect the subjects from sexual and reproductive health risks (such as HIV infection). The services are personalized based on the risk segment and subgroup assignments; such customized programs have a better chance of reducing the patients' reproductive health risk.

A summary and a more detailed description of the competition are available here: https://gallery.cortanaintelligence.com/Competition/Women-s-Health-Risk-Assessment-1

The scripts used in R Studio and in ML Studio to produce and use the winning model uploaded to Azure ML Studio can be found here: https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition

The local R script that generated the model is at https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/WHRA_Onprem_XGBoost.R
**Introduction**

In the following sections I will describe in detail the steps I took to make the most successful submission to the competition through Azure ML Studio, which earned first place.

To begin, I chose to build my solution locally in R Studio and submitted it through Azure ML Studio following the competition's "Tutorial using R". Links to the code are provided below; I built my solution on top of the on-prem R script provided by Cortana Intelligence. The main reason I initially built the solution locally was the freedom to train and test any machine learning algorithm I wanted, although Azure ML Studio offers a very useful variety of the essential machine learning algorithms. More specifically, I was interested in an algorithm named XGBoost, or Extreme Gradient Boosting, and particularly its decision tree classification algorithm. The nature of the dataset immediately made me suspect that a decision-tree-based algorithm would perform well at classifying the subjects into the correct health risk categories, since the majority of the dataset's features are binary, and most decision trees are also binary. Here is a visualisation of a small part of the large and complicated decision tree produced by one of my XGBoost models, to illustrate how well this structure suits the dataset:

![enter image description here][1]

**Dataset**

The data for this competition was collected from around 9000 young (15 to 30 years old) women when they visited clinics in 9 underdeveloped regions, with around 1000 subjects in each region. Clinical practitioners asked each subject a series of questions and recorded her answers together with her demographic information. The practitioners then evaluated the sexual and reproductive health risks and assigned each subject to a risk segment and subgroup.
As mentioned, the majority of the dataset consists of simple yes/no answers to a series of questions about the patient, while the rest consists of categorical data. Its target variables are the columns geo, segment, and subgroup, which are combined into one label for convenience and better classification performance. The original training dataset is in CSV format and can be found in the competition's description, Azure ML's sample solution, and the R script's code, where it is automatically downloaded. You can find a detailed description of the dataset at the following link: https://az754797.vo.msecnd.net/competition/whra/docs/data-description.docx

**Pre-processing & Cleaning**

After the data is loaded, each column is converted to the right data type, mostly integers. Then the geo, segment, and subgroup target columns are combined into a single one, which will be the dataset's label to be predicted by the model. The combined label derives from the following formula: 100*geo + 10*segment + subgroup. The 3-digit label is then mapped to one of the 38 distinct classes found in the training set – not all possible 3-digit combinations exist in the set! It would be pointless to try to predict geo-segment-subgroup combinations that do not exist in the training set, as we wouldn't have enough information to link subjects to them. This is a small detail, but it is of very high importance for the classification task: classification algorithms behave differently when classifying into 38 or 100 or 1000 classes – the fewer the classes, the easier and more precise the classification, as decision boundaries are effectively "wider" with fewer classes.

Then some very basic data cleaning is performed: missing integer cells are filled with the mean of their column, and empty character cells are replaced with '0'.
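The label construction and basic cleaning described above can be sketched in base R. The values here are toy stand-ins for illustration; the actual script in the repository operates on the full dataset:

```r
# Combine geo, segment, subgroup into one 3-digit label,
# then map it to consecutive class indices via a factor.
geo      <- c(1, 1, 2, 9)   # toy values for illustration
segment  <- c(2, 3, 1, 4)
subgroup <- c(1, 2, 2, 3)

combined <- 100 * geo + 10 * segment + subgroup   # 121, 132, 212, 943
lab      <- factor(combined)       # levels = only the classes actually seen
lab.lev  <- levels(lab)            # used later to map predictions back
class.id <- as.integer(lab)        # 1..K integers fed to the classifier

# Basic cleaning: fill missing integer cells with the column mean
x <- c(3, NA, 5, NA)
x[is.na(x)] <- mean(x, na.rm = TRUE)   # each NA becomes 4
```

Using a factor means the classifier only ever sees the 38 classes that exist in the training data, and `lab.lev` recovers the original 3-digit codes afterwards.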
A vector containing the 38 possible geo-segment-subgroup categories is also created to help map the classification results back to the 3-digit class codes. What the classification algorithm actually sees as classes is integers between 1 and 38.

**Splitting & cross-validation**

The dataset is split into a training set (75%) and a validation set (25%) to train and evaluate the machine learning models on. The dataset is split randomly, without any stratification; note that applying stratification might actually increase the overall accuracy, as it would ensure that the training set contains a balanced proportion of all the classes. Also, cross-validation might have helped produce an even stronger, more robust model, but it is time-consuming to properly train many XGBoost models on different portions of the dataset.

**Feature engineering**

For feature engineering, I initially tried Azure ML Studio's feature selection modules, most notably the Filter Based Feature Selection module, which identifies the features in a dataset with the greatest predictive power, and the Permutation Feature Importance module, which computes permutation feature importance scores given a trained model and a test dataset. I tried excluding several features, training different models on different combinations of them, and realised that feature selection didn't really help improve the classification accuracy, at least not for the XGBoost algorithm. The way the dataset is structured – which reflects how the initial experiment was designed by specialists to classify the subjects into different categories – indicates that every single column is important and carries information that cannot be replaced by any other column or combination of columns. In other words, every column contains useful information that contributes to the classification power of the trained decision trees.
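The random 75/25 split described in the splitting section can be reproduced in base R along these lines (the data frame here is a synthetic stand-in for the real training set):

```r
set.seed(1234)  # any fixed seed makes the split reproducible
df <- data.frame(feature  = rbinom(1000, 1, 0.5),
                 class.id = sample(1:38, 1000, replace = TRUE))

train.idx <- sample(nrow(df), size = floor(0.75 * nrow(df)))
train     <- df[train.idx, ]
valid     <- df[-train.idx, ]   # held out for local evaluation
```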
Thus, I concluded that I would not use any feature engineering, apart of course from excluding all unique identifiers from the training set.

**Training – the XGBoost algorithm**

As mentioned earlier, the decision tree functionality of the XGBoost (Extreme Gradient Boosting) algorithm was chosen for the classification. The binary nature of the dataset suggests that a decision-tree-based algorithm would perform optimally at classifying the subjects into the correct health risk categories. The XGBoost algorithm is ideal for building extremely large and complex decision trees relatively fast. Its benefits include regularization, parallel processing, high flexibility, good handling of missing values, tree pruning, and built-in cross-validation. You can see a small part of one of the large and complicated decision trees produced by my XGBoost model:

![enter image description here][2]

XGBoost also offers a very large variety of parameters, which fall into 3 categories: general parameters, which define the overall functionality of the algorithm, such as the booster type; booster parameters, like the learning rate, child weight, and depth and leaf node limits; and learning task parameters, which define the optimization objective. I initially tuned the parameters of the XGBoost models manually, following advice from the parameter tuning chapter of this article: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/ I chose a relatively high learning rate and started with the suggested max_depth, min_child_weight, and gamma values. I then evaluated the constructed model to get feedback on how the parametrization actually affected the classification accuracy.
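A minimal sketch of training a multi-class XGBoost model on this kind of data follows. The features and labels are synthetic, and the parameter values are generic starting points rather than the tuned values of my final model:

```r
library(xgboost)

set.seed(1)
X <- matrix(rbinom(200 * 10, 1, 0.5), nrow = 200)   # binary features
y <- sample(0:3, 200, replace = TRUE)               # labels must be 0-based

dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(
  objective        = "multi:softmax",  # predict the class index directly
  num_class        = 4,                # 38 in the actual competition
  eta              = 0.3,              # relatively high learning rate
  max_depth        = 6,
  min_child_weight = 1,
  gamma            = 0
)

model <- xgb.train(params = params, data = dtrain, nrounds = 20)
pred  <- predict(model, X)   # integer class predictions, 0-based
```

Note the 0-based labels: the 1..38 class indices from the factor mapping have to be shifted down by one before training and shifted back up after prediction.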
**Evaluation**

The evaluation of the model for the competition is quite straightforward: it is based on the classification accuracy, i.e. the percentage of subjects classified into the right category. Note that the right category is the right combination of all three of the geo, segment, and subgroup labels. To evaluate the classification results of the model, I used the lab.lev factor mentioned earlier to map the 1-38 classes back to the corresponding 3-digit geo+segment+subgroup codes.

**Fine-tuning the model parameters**

Once we evaluate a model, we can change some of its parameters and retrain it to see how its classification accuracy is actually affected. As I mentioned, the parameters of the algorithm were initially tuned manually, following general rules, intuition, and trial and error, but I later used a grid-search parameter tuning algorithm. Doing this with XGBoost is quite tricky, as the number of tunable parameters is large, resulting in a grid with far too many possible combinations – too many models have to be trained, which is time- and resource-consuming. Again, I took advice from the previously mentioned article, and from this very useful XGBoost slide-show: http://www.slideshare.net/ShangxuanZhang/kaggle-winning-solution-xgboost-algorithm-let-us-learn-from-its-author

As practitioners agree, it is nearly impossible to give a set of universally optimal parameters, or a global algorithm that can find one. The key concerns for parameter tuning in XGBoost are understanding the bias-variance tradeoff and its importance, controlling overfitting, and handling the relatively imbalanced dataset.
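The accuracy computation and the mapping back through lab.lev can be sketched as follows; the vectors here are hypothetical stand-ins for real predictions:

```r
lab.lev <- c("111", "122", "213")       # hypothetical 3-digit class codes
pred.id <- c(1, 3, 2, 1)                # predicted 1..K class indices
true.id <- c(1, 2, 2, 1)                # ground-truth indices

accuracy   <- mean(pred.id == true.id)  # fraction correctly classified
pred.codes <- lab.lev[pred.id]          # back to geo-segment-subgroup codes
```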
The grid-search algorithm I used for tuning my model's parameters was based on the code that can be found here: https://www.r-bloggers.com/r-setup-a-grid-search-for-xgboost/ I just made sure to include the values that I had earlier found to improve classification performance. A full description of all the parameters of the algorithm can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

Once an optimal combination of parameters was selected – by comparing the classification accuracy of each of the created models and making sure that my selected model had the highest accuracy among the different parameter settings – I trained another model with the selected parameters, this time using the whole training dataset instead of the 75% split I had been using for local evaluation. The final model is exported and saved as an .rda file, which is used as the classification model in Azure ML Studio.

**Exporting and loading to Azure ML Studio**

Once our best model is exported, it can be used in Azure ML Studio to predict the classes of the hidden test dataset. Note that the same pre-processing and cleaning techniques have to be applied in the ML Studio experiment (see the R script module of the experiment for details). The data has to be in the same format, and cleaned with the same techniques, as the data used to train the XGBoost model. Also note that the lab.lev vector is again used to map the predicted classes back to the original combined label (geo-segment-subgroup), as explained earlier. lab.lev contains all the unique classes in the training set (a total of 38, since not all possible geo-segment-subgroup combinations actually exist in the set).
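The grid search boils down to enumerating parameter combinations with expand.grid, training one model per row, and keeping the combination with the best validation accuracy. In this sketch the scoring function is a hypothetical stand-in for the actual train-and-evaluate step:

```r
grid <- expand.grid(eta              = c(0.05, 0.1, 0.3),
                    max_depth        = c(4, 6, 8),
                    min_child_weight = c(1, 3))

# Stand-in for "train an XGBoost model with these parameters and
# return its validation accuracy" (hypothetical scoring function).
score <- function(eta, max_depth, min_child_weight) {
  0.8 - abs(eta - 0.1) - 0.01 * abs(max_depth - 6)
}

grid$acc <- mapply(score, grid$eta, grid$max_depth, grid$min_child_weight)
best <- grid[which.max(grid$acc), ]   # parameter row with top accuracy
```

Even this small grid already contains 3 × 3 × 2 = 18 combinations, which illustrates why a full grid over all XGBoost parameters quickly becomes impractical.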
The test and labels datasets are loaded into the R module, along with the exported .rda model from R Studio and the 2 required external libraries; the same pre-processing is applied, and the test set is evaluated by the exported model. Finally, the results are mapped back to the correct 3-digit codes and fed to the web service output to be officially evaluated.

**Results - Discussion**

What is very interesting is that I was able to achieve a quite satisfying accuracy with little effort on data pre-processing, cleaning, and feature engineering. More focus was put on correctly tuning the parameters of the XGBoost algorithm, since my initial instinct was that this particular algorithm would perform remarkably well given the almost binary structure of the dataset. As mentioned, XGBoost is basically a boosted decision tree algorithm, and decision trees are known to be very efficient for the type of multi-class classification problem posed by this competition.

**How to reproduce the results**

Reproducing the experiment is a relatively easy, straightforward process. These are the steps you need to take:

• Download the R code from https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/WHRA_Onprem_XGBoost.R
• Open the file in R Studio or the R IDE of your choice, and run the whole script.
• Make sure that the model is exported and saved in a file called ‘xgb11.rda’.
• Also export the lab.lev array into a CSV file called ‘lab.lev.csv’.
• Zip the exported .rda file into a zip file, along with the zip files of the R libraries called ‘xgboost’ and ‘magrittr’ (an xgboost dependency). Name that file xgbee.zip. You can find the library files online, or locally in your filesystem when you load them in R Studio.
• Upload the xgbee.zip and lab.lev.csv files to your Azure ML workspace: in ML Studio, click New > Dataset > From local file, and upload the 2 files.
• Build a predictive experiment in Azure ML to operationalize the model: open the Starter Experiment of the competition and keep only the Reader module, deleting all other modules. Save this experiment under a different name. Note that you should not build your predictive experiment from scratch via +New > Experiment, because it would not carry the metadata for this competition.
• Add an Execute R Script module, and add the xgbee.zip and lab.lev.csv files to the experiment from the Saved Datasets > My Datasets section in the toolbox. Also add a Web service input module and a Web service output module to the experiment. Connect them as follows (the CSV file must be linked to input port 2 of the R module, while the zip file must be linked to the 3rd port):

![enter image description here][3]

• Replace the R script in the Execute R Script module with the script found at https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/ML.R Note that the xgbee.zip file is automatically unzipped and xgb11.rda is dropped into the src folder of the sandboxed R runtime in Azure ML, which is why you can load it directly with load('src/xgb11.rda').
• Run the predictive experiment. If everything is in place, you should get the classification results of the uploaded model on the test data.
**My code can be found in:**

Local R script: https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/WHRA_Onprem_XGBoost.R
Azure ML Studio R module script: https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/ML.R
lab.lev.csv: https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/lab.lev.csv

**From the competition page:**

Original script: https://az754797.vo.msecnd.net/competition/whra/code/WHRA_Onprem_Solution.R
Tutorial using R for the competition: https://az754797.vo.msecnd.net/competition/whra/docs/WHRATutorial-in-R.docx
General tutorial using R + Azure ML: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-r-quickstart/

**Other useful links:**

http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/
http://www.slideshare.net/ShangxuanZhang/kaggle-winning-solution-xgboost-algorithm-let-us-learn-from-its-author
https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees

[1]: https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/123.png?raw=true
[2]: https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/321.png?raw=true
[3]: https://github.com/IonKl/Womens-Health-Risk-Assessment-Competition/blob/master/333.png?raw=true