Integrated tool for rapid assessment of multi-type regression machine learning models

April 2, 2018
The tool compares 14 types of regression models and presents the results in a single table, arranged by model accuracy, using five performance metrics. See Revision 2.
**Please proceed to the 2nd Revision of this experiment.**

Botchkarev, A. (2018). Revision 2 Integrated tool for rapid assessment of multi-type regression machine learning models. Experiment in Microsoft Azure Machine Learning Studio. Azure AI Gallery. https://gallery.azure.ai/Experiment/Revision-2-Integrated-tool-for-rapid-assessment-of-multi-type-regression-machine-learning-models

In the 2nd Revision, all regression models are assessed with a newly developed Enhanced Evaluation Model module. The number of evaluation performance metrics has been increased to 22, and errors noted in the earlier version have been fixed.

Experiment Highlights
---------------------

The purpose of this experiment is to build an Azure Machine Learning Studio (Azure MLS) tool for rapid assessment of multiple types of regression models. The tool offers an environment for comparing 14 types of regression models in a unified experiment.

The experiment includes 14 types of regression models:

- Azure built-in models: linear regression, Bayesian linear regression, decision forest regression, boosted decision tree regression, neural network regression, Poisson regression.
- R language models built with the Azure “Create R Model” module using R packages: Gaussian processes for regression, gradient boosted machine, nonlinear least squares regression, projection pursuit regression, random forest regression, robust regression, robust regression with MM-type estimators, support vector regression (support vector machine).

The tool presents assessment results arranged by model accuracy in a single table using five performance metrics: mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), relative squared error (RSE), and coefficient of determination (CoD).

Details of the experiment are presented in: Botchkarev, A. (2018).
Evaluating Hospital Case Cost Prediction Models Using Azure Machine Learning Studio. arXiv:1804.01825 [cs.LG]. [http://arxiv.org/abs/1804.01825][1]

Input Data
----------

The experiment used a simulated data set intended to mimic hospital information. The data set has the following columns (features): Age Group (Age Gr), Relative Intensity Weight (RIW), Length of Stay (LOS), and Cost. The data set contains 7,000 rows.

R Language Models
-----------------

Azure MLS built-in algorithms are complemented by models developed using the R language modules Execute R Script and Create R Model. The following R packages and functions were used in the experiment.

| Regression Type | R Package | Function |
|---|---|---|
| Gaussian Processes for Regression | kernlab | gausspr |
| Gradient Boosted Machine (GBM) | caret | train (gbm) |
| Nonlinear Least Squares Regression | stats | nls |
| Projection Pursuit Regression | stats | ppr |
| Random Forest Regression | randomForest | randomForest |
| Robust Regression | MASS | rlm |
| Robust Regression with MM-Type Estimators | robustbase | lmrob |
| Support Vector Regression (Support Vector Machine) | e1071 | svm |

Concluding Remarks
------------------

- The tool has been built and tested using numerical-only data with no missing (NA) values. Certain regression models do not accept categorical data, so conversion to a numerical format may be required.
- The data set used in the experiment is simulated, and no warranties are provided as to the validity of the data or how closely it simulates real-world information.
- Because the data set is simulated, the tool cannot be used to make actual predictions.
- The views, opinions and conclusions expressed in this document are those of the author alone and do not necessarily represent the views of the author’s current or former employers.

References
----------

caret: Classification and Regression Training. R package.
https://cran.r-project.org/web/packages/caret/index.html

e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071). R package. https://cran.r-project.org/web/packages/e1071/index.html

kernlab: Kernel-Based Machine Learning Lab. R package. https://cran.r-project.org/web/packages/kernlab/index.html

MASS: Support Functions and Datasets for Venables and Ripley's MASS. R package. https://cran.r-project.org/web/packages/MASS/index.html

MSBVAR: Markov-Switching, Bayesian, Vector Autoregression Models. R package. https://cran.r-project.org/web/packages/MSBVAR/index.html

randomForest: Breiman and Cutler's Random Forests for Classification and Regression. R package. https://cran.r-project.org/web/packages/randomForest/index.html

robustbase: Basic Robust Statistics. R package. https://cran.r-project.org/web/packages/robustbase/index.html

stats: The R Stats Package. R package. https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html

[1]: http://arxiv.org/abs/1804.01825
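
Appendix: Metric Calculation Sketch
-----------------------------------

The five performance metrics named under Experiment Highlights, and the idea of arranging models by accuracy in a single table, can be illustrated with a minimal R sketch. This is not the experiment's code. The function `regression_metrics`, the toy data, the train/test split and the choice to sort by RMSE are all assumptions made for illustration. The formulas are the standard textbook definitions and are assumed to match those used by the Azure MLS evaluation module. Only two models from base R (`lm` and `ppr` from the stats package) are fitted so that the example runs without extra packages.

```r
# Standard textbook definitions of the five metrics (an assumption;
# not copied from the experiment).
regression_metrics <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted), length(actual) > 1)
  err <- actual - predicted        # prediction errors
  dev <- actual - mean(actual)     # errors of the naive "predict the mean" model
  rse <- sum(err^2) / sum(dev^2)
  c(MAE  = mean(abs(err)),
    RMSE = sqrt(mean(err^2)),
    RAE  = sum(abs(err)) / sum(abs(dev)),
    RSE  = rse,
    CoD  = 1 - rse)
}

# Hypothetical toy data that only borrows the column names RIW, LOS and Cost;
# it is NOT the simulated data set used in the experiment.
set.seed(1)
n <- 200
toy <- data.frame(RIW = runif(n, 0.5, 5), LOS = rpois(n, 4) + 1)
toy$Cost <- 1000 + 4000 * toy$RIW + 300 * toy$LOS + rnorm(n, sd = 500)
train <- toy[1:150, ]
test  <- toy[151:200, ]

# Two of the model types, both available in base R's stats package.
models <- list(
  "Linear regression"             = lm(Cost ~ RIW + LOS, data = train),
  "Projection pursuit regression" = ppr(Cost ~ RIW + LOS, data = train, nterms = 2)
)

# One row per model, one column per metric.
results <- t(sapply(models, function(m) {
  regression_metrics(test$Cost, predict(m, newdata = test))
}))

# Sorting key is an assumption: lowest RMSE first.
results <- results[order(results[, "RMSE"]), , drop = FALSE]
print(round(results, 3))
```

Other model types from the table of R packages could be added to the `models` list in the same way, provided the corresponding package is installed and its fitted object supports `predict` with a `newdata` argument.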