Revision 2 Integrated tool for rapid assessment of multi-type regression machine learning models

May 13, 2018
The tool compares 14 types of regression models using 22 performance metrics.
**This is the second revision of the Integrated Tool for regression model evaluation.** In this revision, all regression models are assessed with a newly developed Enhanced Evaluation Model module. The number of performance metrics has been increased to 22, and errors noted in the earlier version have been fixed.

**Details of the experiment are presented in:** Botchkarev, A. (2018). Evaluating Hospital Case Cost Prediction Models Using Azure Machine Learning Studio. arXiv:1804.01825 [cs.LG].

**Experiment Highlights**

The purpose of this experiment is to build an Azure Machine Learning Studio (Azure MLS) tool for rapid assessment of multiple types of regression models. The tool offers an environment for comparing 14 types of regression models in a unified experiment and presents the assessment results in a single table using 22 performance metrics.

The experiment includes **14 types of regression models:**

**Azure built-in models:** linear regression, Bayesian linear regression, decision forest regression, boosted decision tree regression, neural network regression, Poisson regression.

**R Language models built with Azure “Create R Model” using R packages:** Gaussian processes for regression, gradient boosted machine, nonlinear least squares regression, projection pursuit regression, random forest regression, robust regression, robust regression with MM-type estimators, support vector regression (support vector machine).
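The unified-experiment idea described above (several models, one shared test set, one results table) can be sketched outside Azure MLS. The following is a minimal Python illustration, not the tool's implementation: the toy data, the two stand-in models, and every function name are assumptions made for this example, and only three of the 22 metrics (MAE, RMSE, CoD) are computed, using their standard textbook definitions.

```python
import math
from statistics import mean

# Hypothetical toy data (NOT the experiment's data set): one feature, one target.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
x_test = [7.0, 8.0, 9.0]
y_test = [14.0, 15.8, 18.1]


def fit_mean_baseline(x, y):
    """Return a predictor that always outputs the training mean."""
    y_bar = mean(y)
    return lambda x_new: [y_bar for _ in x_new]


def fit_simple_linear(x, y):
    """Return an ordinary least-squares line fitted to a single feature."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return lambda x_new: [intercept + slope * xi for xi in x_new]


def mae(actual, predicted):
    """Mean Absolute Error."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


def cod(actual, predicted):
    """Coefficient of Determination: 1 - SSE / total sum of squares."""
    a_bar = mean(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - a_bar) ** 2 for a in actual)
    return 1.0 - sse / sst


# Registries play the role of the model and metric lists in the experiment.
MODELS = {"mean baseline": fit_mean_baseline, "simple linear": fit_simple_linear}
METRICS = {"MAE": mae, "RMSE": rmse, "CoD": cod}


def evaluate_all(models, metrics):
    """Fit each model on the same training data and score it on the same test data."""
    table = {}
    for model_name, fit in models.items():
        predict = fit(x_train, y_train)
        predicted = predict(x_test)
        table[model_name] = {m: fn(y_test, predicted) for m, fn in metrics.items()}
    return table


results = evaluate_all(MODELS, METRICS)

# Print one row per model and one column per metric.
print(f"{'Model':<15}" + "".join(f"{m:>10}" for m in METRICS))
for model_name, row in results.items():
    print(f"{model_name:<15}" + "".join(f"{row[m]:>10.3f}" for m in METRICS))
```

Keeping the models and metrics in two separate registries means that adding a model or a metric is a one-line change, and every model is scored on identical data, which is what makes the rows of the table comparable.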
The tool presents assessment results in a single table using **22 performance metrics**:

**Metric Abbreviation >>> Metric Name**
CoD >>> Coefficient of Determination
GMRAE >>> Geometric Mean Relative Absolute Error
MAE >>> Mean Absolute Error
MAPE >>> Mean Absolute Percentage Error
MASE >>> Mean Absolute Scaled Error
MdAE >>> Median Absolute Error
MdAPE >>> Median Absolute Percentage Error
MdRAE >>> Median Relative Absolute Error
ME >>> Mean Error
MPE >>> Mean Percentage Error
MRAE >>> Mean Relative Absolute Error
MSE >>> Mean Squared Error
NRMSE_mm >>> Normalized Root Mean Squared Error (normalized to the difference between the maximum and minimum of the actual data)
NRMSE_sd >>> Normalized Root Mean Squared Error (normalized to the standard deviation of the actual data)
RAE >>> Relative Absolute Error
RMdSPE >>> Root Median Square Percentage Error
RMSE >>> Root Mean Squared Error
RMSPE >>> Root Mean Square Percentage Error
RSE >>> Relative Squared Error
sMAPE >>> Symmetric Mean Absolute Percentage Error
SMdAPE >>> Symmetric Median Absolute Percentage Error
SSE >>> Sum of Squared Error

**Input Data**

The experiment used a simulated data set intended to mimic hospital information. The data set has the following columns (features): Age Group (Age Gr), Relative Intensity Weight (RIW), Length of Stay (LOS), and Cost. The data set contains 7,000 rows.

**R Language Models**

The Azure MLS built-in algorithms are complemented by models developed using the R language modules Execute R Script and Create R Model. The following R packages and functions were used in the experiment.
**Regression Type >>> R Package >>> Function**
Gaussian Processes for Regression >>> kernlab >>> gausspr
Gradient Boosted Machine (GBM) >>> caret >>> train (gbm)
Nonlinear Least Squares Regression >>> stats >>> nls
Projection Pursuit Regression >>> stats >>> ppr
Random Forest Regression >>> randomForest >>> randomForest
Robust Regression >>> MASS >>> rlm
Robust Regression with MM-Type Estimators >>> robustbase >>> lmrob
Support Vector Regression (Support Vector Machine) >>> e1071 >>> svm

**Concluding Remarks**

Note that the tool has been built and tested using numerical-only data with no missing (NA) values. Certain regression models do not accept categorical data, so conversion to a numerical format may be required.

Note that the data set used in the experiment is simulated, and no warranties are provided as to the validity of the data or how closely it simulates real-world information.

Note that, because the data are simulated, the tool cannot be used to make actual predictions.

Note that the module is publicly shared for information and training purposes only. All efforts were taken to make this module error-free. However, we do not guarantee the correctness, reliability and completeness of the material, and the module is provided "as is", without warranty of any kind, express or implied. Any user is acting entirely at their own risk.

Note that the views, opinions and conclusions expressed in this document are those of the author alone and do not necessarily represent the views of the author’s current or former employers.

**References**

caret: Classification and Regression Training. R package. https://cran.r-project.org/web/packages/caret/index.html
e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071). R package. https://cran.r-project.org/web/packages/e1071/index.html
kernlab: Kernel-Based Machine Learning Lab. R package. https://cran.r-project.org/web/packages/kernlab/index.html
MASS: Support Functions and Datasets for Venables and Ripley's MASS. R package. https://cran.r-project.org/web/packages/MASS/index.html
MSBVAR: Markov-Switching, Bayesian, Vector Autoregression Models. R package. https://cran.r-project.org/web/packages/MSBVAR/index.html
randomForest: Breiman and Cutler's Random Forests for Classification and Regression. R package. https://cran.r-project.org/web/packages/randomForest/index.html
robustbase: Basic Robust Statistics. R package. https://cran.r-project.org/web/packages/robustbase/index.html
stats: The R Stats Package. R package. https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html