Women’s Health Risk Assessment – 2nd Prize Training Experiment

October 13, 2016

Training a gradient boosting decision tree model to classify women patients in developing countries into different health segments and subgroups. https://github.com/evian123/whra
This is a multi-class classification problem with 37 possible classes. The whole training experiment:

![https://github.com/evian123/whra/blob/master/train.JPG?raw=true][1]

**Strategy**

At the very beginning, I considered building a stacked model using model ensembling, which requires several different types of models. I tried logistic regression with multiple classes, neural networks with 1 and 2 hidden layers, random forest, a naïve Bayes classifier and gradient boosting decision trees (xgboost). The best single model was an xgboost model with some parameter tuning. Since that model already put me in 1st place on the public leaderboard, I didn’t spend more time training other models and decided to use just a single xgboost model in this competition.

**Data preparation**

The data set in this competition is already stored in one table, so there is no need to merge tables. Because I use a tree-based model, there is also no need for variable imputation, which is commonly required for logistic regression or neural networks. I simply replaced all missing values with -9999. Only one variable, ‘religion’, is in character format, so I converted it to numeric. After these steps, all variables are numeric.

**Feature selection**

Xgboost can output feature importance, which is generally helpful for feature selection. But I didn’t do any feature selection, for two reasons:

1. Fewer than 50 features is not a lot.
2. Gradient boosting decision trees are not that sensitive to multicollinearity or noise.

All the available features are used in the model.

**Parameter tuning**

The biggest challenge in using xgboost is parameter tuning. The most important parameters include ‘tree depth’, ‘min_child_weight’, ‘lambda’ and ‘number of trees’. I tuned the parameters manually using 10-fold cross validation.
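
The manual tuning loop described above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the author's R code. The `fit_predict` callable is a hypothetical placeholder for "train an xgboost model with these parameters and predict the held-out rows". The key `num_rounds` and all candidate values in `grid` are assumptions made for illustration, since the post does not list the values that were tried. The folds here are plain shuffled splits; with 37 classes, stratified folds may be preferable.

```python
import itertools
import random
from statistics import mean


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle row indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_val_accuracy(fit_predict, X, y, params, n_folds=10, seed=0):
    """Mean held-out accuracy over the folds.

    fit_predict(params, X_train, y_train, X_valid) is a hypothetical callable
    that trains a model and returns predicted classes for X_valid.
    """
    scores = []
    for fold in kfold_indices(len(X), n_folds, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict(
            params,
            [X[i] for i in train],
            [y[i] for i in train],
            [X[i] for i in fold],
        )
        correct = sum(p == y[i] for p, i in zip(preds, fold))
        scores.append(correct / len(fold))
    return mean(scores)


def grid_search(fit_predict, X, y, grid, n_folds=10):
    """Score every parameter combination; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit_predict, X, y, params, n_folds)
        if score > best_score:  # ties keep the first combination seen
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values, chosen only to illustrate the loop.
grid = {
    "max_depth": [4, 6, 8],         # 'tree depth'
    "min_child_weight": [1, 5],
    "lambda": [1.0, 10.0],
    "num_rounds": [200, 500],       # 'number of trees' (assumed key name)
}
```

Calling `grid_search(fit_predict, X, y, grid)` with a real training function returns the combination with the highest mean 10-fold accuracy. Tuning by hand amounts to running this loop a few candidates at a time and narrowing the grid after each pass.
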
Automated ways to tune parameters also exist.

**Prediction**

I have two Azure Machine Learning (AML) experiments: one for model training and the other for prediction. Once the model is trained, I have to store it in the AML workspace. The xgboost model is stored in a data.frame in R, and the data.frame is then saved as a data set. In the predictive experiment, the saved model is loaded from the AML workspace, and the test data is prepared following exactly the same process as the training data. Finally, the xgboost model is applied to the prepared test data to make the prediction.

**Other thoughts**

- Model ensembling may help. Since the evaluation metric is classification accuracy, there are two simple directions:
   1. Averaging (with or without weights) the predicted probabilities from different models. The model output must then be probabilities rather than classes, which means using 'multi:softprob' as the objective instead of 'multi:softmax' in the xgboost parameters.
   2. Voting based on the predicted classes from different models.
- Model stacking may also be worth a try, but it is a little more complicated.
- In my original R code, the conversion between characters and integers occurred multiple times. For example, the class of each sample is converted from a concatenated string (geo+segment+subgroup) to an integer, and after prediction the predicted integers are converted back. I found that 'CatEncoders', an R package I wrote earlier, makes this process simpler. The modified R code using 'CatEncoders' can be found at https://github.com/evian123/whra/blob/master/v2.R

[1]: https://github.com/evian123/whra/blob/master/train.JPG?raw=true
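
The two ensembling directions mentioned under "Other thoughts" can be illustrated with a small sketch. It is written in Python with only the standard library, not the author's R code, and the three "models" are made-up probability tables over 3 classes rather than real xgboost output. Averaging needs per-class probabilities (what 'multi:softprob' produces), while voting needs only predicted classes.

```python
from collections import Counter


def average_probabilities(prob_sets, weights=None):
    """Weighted average of per-model class probabilities.

    prob_sets[m][i][c] is model m's probability that sample i is class c.
    """
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    return [
        [
            sum(w * probs[i][c] for w, probs in zip(weights, prob_sets)) / total
            for c in range(n_classes)
        ]
        for i in range(n_samples)
    ]


def argmax_classes(prob_rows):
    """Index of the most probable class in each row (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


def majority_vote(class_sets):
    """class_sets[m][i] is model m's predicted class for sample i.

    On ties, the class that was encountered first wins.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*class_sets)]


# Made-up probabilities for 2 samples and 3 classes (illustration only).
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.1, 0.5, 0.4], [0.3, 0.4, 0.3]]
model_c = [[0.1, 0.5, 0.4], [0.1, 0.2, 0.7]]
models = [model_a, model_b, model_c]

averaged = argmax_classes(average_probabilities(models))       # [1, 2]
voted = majority_vote([argmax_classes(m) for m in models])     # [1, 1]
```

On the second sample the two methods disagree: two models weakly prefer class 1, but the third is confident enough in class 2 to win the average. Voting discards this confidence information, which is one reason to keep probability outputs when ensembling.
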