Women’s Health Risk Assessment – 3rd Prize Predictive Experiment

October 14, 2016

Training code using xgboost is available at: http://github.com/demillan/WomenRisk
The competition calls for optimized machine learning solutions so that a patient can be accurately categorized into different health risk segments and subgroups.

Data

The dataset was collected via survey in 2015 as part of a Bill & Melinda Gates Foundation funded project exploring the wants, needs, and behaviors of women and girls with regard to their sexual and reproductive health in nine geographies.

Solution

I started by following the R tutorial for this competition. My first change was to replace the tutorial's initial multinomial model (nnet package) with an xgboost model. After a lot of feature engineering, I found that the best way to handle missing values was to replace them with -1. The new variables created through feature engineering didn't provide the expected results, so I discarded them. After this, I optimized the xgboost parameters of the model using separate training and validation datasets. I also tried SVM (e1071 package) and random forest (randomForest package), but their public leaderboard scores were lower than the one achieved with xgboost.

If you want to replicate the experiment, follow these steps:

1. Train the model locally using the script that you can find on GitHub: http://github.com/demillan/WomenRisk. Use the same R version locally and in Azure Machine Learning. I used Microsoft R Open 3.2.2.
2. Zip and upload your model as a dataset, and place it in your experiment. If you have any problems, follow the steps described in the competition R tutorial: https://az754797.vo.msecnd.net/competition/whra/docs/WHRATutorial-in-R.docx
3. Run your experiment, changing the R script box if necessary. You may need to change the feature index and the name of the loaded model.
4. Deploy the web service and enjoy your results!

Important note: your score may differ depending on the seed you use when you train your model.
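
As a rough illustration of two ideas from the solution above (replacing missing values with -1 and holding out a validation set under a fixed seed), here is a minimal sketch in plain Python. It is not the R training script from the repository: the function names, the toy data, the 20% validation fraction, and the seed value are all assumptions made for the example, and no model is trained.

```python
import random

MISSING = -1  # sentinel value the write-up reports using for missing entries


def fill_missing(rows, sentinel=MISSING):
    """Return a copy of rows with every None entry replaced by the sentinel."""
    return [[sentinel if v is None else v for v in row] for row in rows]


def train_validation_split(rows, labels, validation_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into two sets.

    The fraction and seed defaults are illustrative assumptions.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    n_valid = int(len(indices) * validation_fraction)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    valid = ([rows[i] for i in valid_idx], [labels[i] for i in valid_idx])
    return train, valid


# Hypothetical toy data: three features per respondent, None = unanswered.
rows = [[1, None, 3], [None, 2, 0], [4, 5, None], [0, 1, 1], [2, None, 2]]
labels = [0, 1, 2, 1, 0]

filled = fill_missing(rows)
(train_x, train_y), (valid_x, valid_y) = train_validation_split(
    filled, labels, seed=7
)
```

Because the shuffle depends on the seed, a different seed can put different rows in the validation set, which is one reason a retrained model may not reproduce the same score.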
If you have questions about the model or the solution, open an issue on GitHub and I will be happy to provide more insight.