Integrated tool for rapid assessment of multi-type regression machine learning models
The tool compares 14 types of regression models and presents results arranged by model accuracy in a single table using five (5) metrics. See Revision 2.
**Please proceed to the 2nd Revision of this experiment.**
Botchkarev, A. (2018). Revision 2 Integrated tool for rapid assessment of multi-type regression machine learning models. Experiment in Microsoft Azure Machine Learning Studio. Azure AI Gallery. https://gallery.azure.ai/Experiment/Revision-2-Integrated-tool-for-rapid-assessment-of-multi-type-regression-machine-learning-models
In the 2nd Revision, all regression models are assessed with a newly developed Enhanced Evaluation Model module. The number of performance metrics has been increased to 22, and errors noted in the earlier version have been fixed.
Experiment Highlights
---------------------
The purpose of this experiment is to build an Azure Machine Learning Studio (Azure MLS) tool for rapid assessment of multiple types of regression models. The tool offers an environment for comparing 14 types of regression models in a unified experiment and presents assessment results arranged by model accuracy in a single table using five (5) performance metrics.
The experiment includes 14 types of regression models:
- Azure built-in models:
linear regression,
Bayesian linear regression,
decision forest regression,
boosted decision tree regression,
neural network regression,
Poisson regression.
- R Language models built with Azure “Create R Model” using R packages:
Gaussian processes for regression,
gradient boosted machine,
nonlinear least squares regression,
projection pursuit regression,
random forest regression,
robust regression,
robust regression with MM-type estimators,
support vector regression (support vector machine).
The tool presents assessment results arranged by model accuracy in a single table using five (5) performance metrics: mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), relative squared error (RSE), coefficient of determination (CoD).
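
The five metrics can be sketched in a few lines of plain Python. This is a minimal illustration and not the Azure MLS implementation. It assumes the usual definitions: RAE and RSE are normalized by the deviations of the actual values from their mean, and CoD = 1 - RSE. The function names `regression_metrics` and `rank_models` are hypothetical. Sorting by RMSE is also an assumption, since the text only says that results are arranged by model accuracy.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, RAE, RSE and CoD for paired actual/predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = len(actual)
    mean_actual = sum(actual) / n
    # Errors of the model.
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Errors of a naive predictor that always returns the mean of the actual values.
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    if sq_dev == 0:
        raise ValueError("actual values are constant; RAE and RSE are undefined")
    rse = sq_err / sq_dev
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / abs_dev,
        "RSE": rse,
        "CoD": 1 - rse,
    }


def rank_models(actual, predictions_by_model):
    """Build one row of metrics per model, sorted by RMSE (assumed ranking key)."""
    rows = [
        {"model": name, **regression_metrics(actual, preds)}
        for name, preds in predictions_by_model.items()
    ]
    return sorted(rows, key=lambda row: row["RMSE"])


# Toy values for illustration only; they are not taken from the experiment.
toy_actual = [1.0, 2.0, 3.0, 4.0]
toy_predictions = {
    "model_b": [2.0, 3.0, 4.0, 5.0],
    "model_a": [1.0, 2.0, 3.0, 5.0],
}
toy_table = rank_models(toy_actual, toy_predictions)
```

With the toy values above, `toy_table[0]` is the row for `model_a`, because it has the lower RMSE.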
Details of the experiment are presented in:
Botchkarev, A. (2018). Evaluating Hospital Case Cost Prediction Models Using Azure Machine Learning Studio. arXiv:1804.01825 [cs.LG]. [http://arxiv.org/abs/1804.01825][1]
Input Data
----------
The experiment used a simulated data set intended to mimic hospital information. The data set has 7,000 rows and the following columns (features): Age Group (Age Gr), Relative Intensity Weight (RIW), Length of Stay (LOS), Cost.
R Language Models
-----------------
Azure MLS built-in algorithms are complemented by models developed using the R language modules Execute R Script and Create R Model. The following R packages and functions were used in the experiment.

| Regression Type | R Package | Function |
|---|---|---|
| Gaussian Processes for Regression | kernlab | gausspr |
| Gradient Boosted Machine (GBM) | caret | train (method = "gbm") |
| Nonlinear Least Squares Regression | stats | nls |
| Projection Pursuit Regression | stats | ppr |
| Random Forest Regression | randomForest | randomForest |
| Robust Regression | MASS | rlm |
| Robust Regression with MM-Type Estimators | robustbase | lmrob |
| Support Vector Regression (Support Vector Machine) | e1071 | svm |

Concluding Remarks
------------------
Note that the tool has been built and tested using numerical-only data with no missing (NA) values. Certain regression models do not accept categorical data, and conversion to numerical format may be required.
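
The two preparation steps mentioned in this note, checking for missing values and converting a categorical column to numbers, could look roughly like the sketch below. It is not part of the experiment. The helper names `find_missing` and `encode_ordinal`, the age-group labels, and the sample rows are all hypothetical. Ordinal coding is only one possible conversion, chosen here because age groups have a natural order.

```python
import math


def find_missing(rows):
    """Return (row_index, column) pairs whose value is None, empty, or NaN."""
    missing = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or value == "" or is_nan:
                missing.append((index, column))
    return missing


def encode_ordinal(rows, column, ordered_levels):
    """Return copies of rows with `column` replaced by 1-based ordinal codes."""
    codes = {level: code for code, level in enumerate(ordered_levels, start=1)}
    encoded = []
    for row in rows:
        new_row = dict(row)
        # Raises KeyError if a level is not listed in ordered_levels.
        new_row[column] = codes[row[column]]
        encoded.append(new_row)
    return encoded


# Hypothetical rows using the column names from the Input Data section.
sample_rows = [
    {"Age Gr": "18-64", "RIW": 1.2, "LOS": 3, "Cost": 5400.0},
    {"Age Gr": "65+", "RIW": None, "LOS": 7, "Cost": 9100.0},
]

missing_cells = find_missing(sample_rows)
rows_with_gaps = {index for index, _ in missing_cells}
# Drop incomplete rows, then convert the categorical column.
complete_rows = [r for i, r in enumerate(sample_rows) if i not in rows_with_gaps]
numeric_rows = encode_ordinal(complete_rows, "Age Gr", ["0-17", "18-64", "65+"])
```

Dropping incomplete rows is the simplest choice for a sketch. Whether to drop or impute them depends on the data set.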
Note that the data set used in the experiment is simulated and no warranties are provided as to the validity of the data and how closely it simulates real-world information.
Note that, because of the previous point, the tool cannot be used to make actual predictions.
Note that the views, opinions and conclusions expressed in this document are those of the author alone and do not necessarily represent the views of the author’s current or former employers.
References
----------
caret: Classification and Regression Training. R package. https://cran.r-project.org/web/packages/caret/index.html
e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071). R package. https://cran.r-project.org/web/packages/e1071/index.html
kernlab: Kernel-Based Machine Learning Lab. R package. https://cran.r-project.org/web/packages/kernlab/index.html
MASS: Support Functions and Datasets for Venables and Ripley's MASS. R package. https://cran.r-project.org/web/packages/MASS/index.html
MSBVAR: Markov-Switching, Bayesian, Vector Autoregression Models. R package. https://cran.r-project.org/web/packages/MSBVAR/index.html
randomForest: Breiman and Cutler's Random Forests for Classification and Regression. R package. https://cran.r-project.org/web/packages/randomForest/index.html
robustbase: Basic Robust Statistics. R package. https://cran.r-project.org/web/packages/robustbase/index.html
stats: The R Stats Package. R package. https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html
[1]: http://arxiv.org/abs/1804.01825