Revision 2 Integrated tool for rapid assessment of multi-type regression machine learning models
The tool compares 14 types of regression models using 22 performance metrics.
**This is the second revision of the Integrated Tool for regression model evaluation.**
In this revision, all regression models are assessed with a newly developed Enhanced Evaluation Model module. The number of performance metrics has been increased to 22, and errors noted in the earlier version have been fixed.
**Details of the experiment are presented in:**
Botchkarev, A. (2018). Evaluating Hospital Case Cost Prediction Models Using Azure Machine Learning Studio. arXiv:1804.01825 [cs.LG].
**Experiment Highlights**
The purpose of this experiment is to build an Azure MLS tool for rapid assessment of multiple types of regression models. The tool offers an environment for comparing 14 types of regression models in a unified experiment and presents the assessment results in a single table using 22 performance metrics.
The experiment includes **14 types of regression models:**
**Azure built-in models:**
linear regression,
Bayesian linear regression,
decision forest regression,
boosted decision tree regression,
neural network regression,
Poisson regression.
**R Language models built with Azure “Create R Model” using R packages:**
Gaussian processes for regression,
gradient boosted machine,
nonlinear least squares regression,
projection pursuit regression,
random forest regression,
robust regression,
robust regression with MM-type estimators,
support vector regression (support vector machine).
The tool presents assessment results in a single table using **22 performance metrics**:
**Metric Abbreviation >>> Metric Name**
CoD >>> Coefficient of Determination
GMRAE >>> Geometric Mean Relative Absolute Error
MAE >>> Mean Absolute Error
MAPE >>> Mean Absolute Percentage Error
MASE >>> Mean Absolute Scaled Error
MdAE >>> Median Absolute Error
MdAPE >>> Median Absolute Percentage Error
MdRAE >>> Median Relative Absolute Error
ME >>> Mean Error
MPE >>> Mean Percentage Error
MRAE >>> Mean Relative Absolute Error
MSE >>> Mean Squared Error
NRMSE_mm >>> Normalized Root Mean Squared Error (normalized to the difference between maximum and minimum actual data)
NRMSE_sd >>> Normalized Root Mean Squared Error (normalized to the standard deviation of the actual data)
RAE >>> Relative Absolute Error
RMdSPE >>> Root Median Square Percentage Error
RMSE >>> Root Mean Squared Error
RMSPE >>> Root Mean Square Percentage Error
RSE >>> Relative Squared Error
sMAPE >>> Symmetric Mean Absolute Percentage Error
SMdAPE >>> Symmetric Median Absolute Percentage Error
SSE >>> Sum of Squared Error
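As an illustration of how some of the listed metrics relate to each other, the following is a minimal Python sketch that runs outside Azure MLS. It is not the code of the Enhanced Evaluation Model module. The function name `regression_metrics` is hypothetical, and several conventions are assumptions: the error is taken as actual minus predicted, RAE and RSE are taken relative to a "predict the mean of the actual values" baseline, CoD is taken as 1 - RSE, and NRMSE_sd uses the sample standard deviation. The tool itself may use different conventions for these points.

```python
import math
import statistics


def regression_metrics(actual, predicted):
    """Return a dict with a subset of the metrics listed above (assumed definitions)."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]  # assumed sign: actual - predicted
    abs_errors = [abs(e) for e in errors]

    # Baseline deviations: how far each actual value is from the mean of the actuals.
    mean_actual = statistics.fmean(actual)
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)

    sse = sum(e * e for e in errors)
    mse = sse / n
    rmse = math.sqrt(mse)
    rse = sse / base_sq  # fails if all actual values are identical

    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs_errors) / n,
        "MdAE": statistics.median(abs_errors),
        "SSE": sse,
        "MSE": mse,
        "RMSE": rmse,
        # Percentage errors are undefined when an actual value is zero.
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "RAE": sum(abs_errors) / base_abs,
        "RSE": rse,
        "CoD": 1 - rse,
        "NRMSE_mm": rmse / (max(actual) - min(actual)),
        "NRMSE_sd": rmse / statistics.stdev(actual),  # sample standard deviation
    }


# Made-up values for illustration only; they are not taken from the experiment.
example = regression_metrics([100, 200, 300, 400], [110, 190, 310, 380])
for name, value in example.items():
    print(f"{name:>9}: {value:.4f}")
```

Returning all values in one dictionary mirrors the idea of a single results table: calling the function once per model yields one row per model with the same columns.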
**Input Data**
The experiment used a simulated data set intended to mimic hospital information. The data set has the following columns (features): Age Group (Age Gr), Relative Intensity Weight (RIW), Length of Stay (LOS), Cost. The data set contains 7,000 rows.
**R Language Models**
Azure MLS built-in algorithms are complemented by models developed using R language modules: Execute R Script and Create R Model. The following R packages and functions were used in the experiment.
**Regression Type >>> R Package >>> Function**
Gaussian Processes for Regression >>> kernlab >>> gausspr
Gradient Boosted Machine (GBM) >>> caret >>> train (gbm)
Nonlinear Least Squares Regression >>> stats >>> nls
Projection Pursuit Regression >>> stats >>> ppr
Random Forest Regression >>> randomForest >>> randomForest
Robust Regression >>> MASS >>> rlm
Robust Regression with MM-Type Estimators >>> robustbase >>> lmrob
Support Vector Regression (Support Vector Machine) >>> e1071 >>> svm
**Concluding Remarks**
Note that the tool has been built and tested using numerical-only data with no missing (n.a.) elements. Certain regression models do not accept categorical data, so conversion to numerical format may be required.
Note that the data set used in the experiment is simulated and no warranties are provided as to the validity of the data and how closely it simulates real-world information.
Note that, because of the previous point, the tool cannot be used to make actual predictions.
Note that the module is publicly shared for information and training purposes only. All efforts were taken to make this module error-free. However, we do not guarantee the correctness, reliability and completeness of the material and the module is provided "as is", without warranty of any kind, express or implied. Any user is acting entirely at their own risk.
Note that the views, opinions and conclusions expressed in this document are those of the author alone and do not necessarily represent the views of the author’s current or former employers.
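The first remark above, on numerical-only input without missing elements, can be checked before data is loaded into the tool. The following is a minimal Python sketch of such a check, run outside Azure MLS. The helper names `has_missing` and `encode_ordinal`, the age-group labels, and the sample rows are all hypothetical and are not taken from the experiment's data set. Integer coding is only one possible conversion and assumes that the categories have a natural order.

```python
import math


def has_missing(rows):
    """Return True if any cell is None, an empty string, or NaN."""
    for row in rows:
        for value in row.values():
            if value is None or value == "":
                return True
            if isinstance(value, float) and math.isnan(value):
                return True
    return False


def encode_ordinal(rows, column, ordered_levels):
    """Return copies of the rows with labels in `column` replaced by codes 1..k."""
    codes = {level: i for i, level in enumerate(ordered_levels, start=1)}
    # An unknown label raises KeyError, which surfaces unexpected categories early.
    return [{**row, column: codes[row[column]]} for row in rows]


# Hypothetical rows using the column names from the Input Data section.
rows = [
    {"Age Gr": "18-39", "RIW": 1.2, "LOS": 3, "Cost": 5400.0},
    {"Age Gr": "65+", "RIW": 2.7, "LOS": 9, "Cost": 16100.0},
    {"Age Gr": "40-64", "RIW": 0.9, "LOS": 2, "Cost": 3900.0},
]

if has_missing(rows):
    raise ValueError("data set contains missing elements")

numeric_rows = encode_ordinal(rows, "Age Gr", ["18-39", "40-64", "65+"])
print(numeric_rows)
```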
**References**
caret: Classification and Regression Training. R package. https://cran.r-project.org/web/packages/caret/index.html
e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071). R package. https://cran.r-project.org/web/packages/e1071/index.html
kernlab: Kernel-Based Machine Learning Lab. R package. https://cran.r-project.org/web/packages/kernlab/index.html
MASS: Support Functions and Datasets for Venables and Ripley's MASS. R package. https://cran.r-project.org/web/packages/MASS/index.html
MSBVAR: Markov-Switching, Bayesian, Vector Autoregression Models. R package. https://cran.r-project.org/web/packages/MSBVAR/index.html
randomForest: Breiman and Cutler's Random Forests for Classification and Regression. R package. https://cran.r-project.org/web/packages/randomForest/index.html
robustbase: Basic Robust Statistics. R package. https://cran.r-project.org/web/packages/robustbase/index.html
stats: The R Stats Package. R package. https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html