Classification using K-Means Clustering vs. Multiclass Logistic Regression: Iris data
This sample demonstrates how to perform clustering using the k-means algorithm on the UCI Iris data set. It also applies multiclass logistic regression to the same classification task and compares its performance with that of k-means clustering.
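For readers who want to try the clustering step outside the experiment, here is a minimal local sketch in base R. It is not the experiment's own implementation: it assumes R's built-in `iris` data frame as a stand-in for the UCI data, and the 45-row hold-out, the seed, and the helper name `nearest_centroid` are all hypothetical choices made for illustration.

```r
# Local sketch only: R's built-in iris data stands in for the UCI Iris data set.
set.seed(42)                        # arbitrary seed, chosen only for reproducibility
features <- as.matrix(iris[, 1:4])  # the four numeric measurements

# Hold out 45 of the 150 rows as test samples (the split size is an assumption)
test_idx <- sample(nrow(features), size = 45)
train <- features[-test_idx, ]
test  <- features[test_idx, ]

# Learn three centroids from the training rows
fit <- kmeans(train, centers = 3, nstart = 10)

# Hypothetical helper: index of the centroid closest to x (squared Euclidean distance)
nearest_centroid <- function(x, centers) {
  d <- apply(centers, 1, function(ctr) sum((x - ctr)^2))
  unname(which.min(d))
}

# Assign every test row to its nearest learned centroid.
# R numbers clusters from 1; the experiment's Assignments column uses ids starting at 0.
assignments <- apply(test, 1, nearest_centroid, centers = fit$centers)

# Cross-tabulate the true species of the test rows against the cluster ids
print(table(Label = iris$Species[test_idx], Assignments = assignments))
```

The cluster ids carry no class meaning of their own, which is why the experiment needs an explicit id-to-class mapping before the two models can be compared.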
This sample builds upon [https://gallery.cortanaintelligence.com/Experiment/Clustering-Group-iris-data-2][1]. We added the ability to view the performance of the multiclass logistic regression and k-means clustering models side by side using the **Evaluate Model** module. After assigning the test samples to the learned centroids in k-means, we changed the **Select Columns in Dataset** module to select the following columns:
- Label
- Assignments
- F1
- F2
- F3
- F4
To create a dataset that matches the Scored dataset format, we wrote an R script, which is run using the **Execute R Script** module:

    # Map 1-based optional input ports to variables
    dataset1 <- maml.mapInputPort(1) # class: data.frame
    #dataset2 <- maml.mapInputPort(2) # class: data.frame

    # Contents of optional Zip port are in ./src/
    # source("src/yourfile.R");
    # load("src/yourData.rdata");

    # Sample operation. Note: "Scored Labels" holds raw cluster ids until it is relabeled near the end, so the == tests below match no rows and every score stays 0.05.
    #levels(dataset1$Assignments) <- list("Iris-setosa"=0, "Iris-versicolor"=1, "Iris-virginica"=2)
    names(dataset1) <- c("F1", "F2", "F3", "F4", "Label", "Scored Labels")
    dataset1[["Scored Probabilities for Class \"Iris-setosa\""]] <- 0.05
    dataset1[["Scored Probabilities for Class \"Iris-setosa\""]][dataset1[["Scored Labels"]] == "Iris-setosa"] <- 0.9
    dataset1[["Scored Probabilities for Class \"Iris-versicolor\""]] <- 0.05
    dataset1[["Scored Probabilities for Class \"Iris-versicolor\""]][dataset1[["Scored Labels"]] == "Iris-versicolor"] <- 0.9
    dataset1[["Scored Probabilities for Class \"Iris-virginica\""]] <- 0.05
    dataset1[["Scored Probabilities for Class \"Iris-virginica\""]][dataset1[["Scored Labels"]] == "Iris-virginica"] <- 0.9
    dataset1 <- dataset1[c(1,2,3,4,5,7,8,9,6)]
    attr(dataset1$Label, "label.type") <- "True Labels"
    attr(dataset1[["Scored Probabilities for Class \"Iris-setosa\""]], "feature.channel") <- "Multiclass Classification Scores"
    attr(dataset1[["Scored Probabilities for Class \"Iris-setosa\""]], "score.type") <- "Scored Probabilities for Class \"Iris-setosa\""
    attr(dataset1[["Scored Probabilities for Class \"Iris-versicolor\""]], "feature.channel") <- "Multiclass Classification Scores"
    attr(dataset1[["Scored Probabilities for Class \"Iris-versicolor\""]], "score.type") <- "Scored Probabilities for Class \"Iris-versicolor\""
    attr(dataset1[["Scored Probabilities for Class \"Iris-virginica\""]], "feature.channel") <- "Multiclass Classification Scores"
    attr(dataset1[["Scored Probabilities for Class \"Iris-virginica\""]], "score.type") <- "Scored Probabilities for Class \"Iris-virginica\""
    numbers <- factor(c(1, 0, 2))
    dataset1[["Scored Labels"]] <- factor(dataset1[["Scored Labels"]], levels = numbers[1:3])
    levels(dataset1[["Scored Labels"]]) <- list("Iris-setosa"=1, "Iris-versicolor"=0, "Iris-virginica"=2)
    attr(dataset1[["Scored Labels"]], "feature.channel") <- "Multiclass Classification Scores"
    attr(dataset1[["Scored Labels"]], "score.type") <- "Assigned Labels"
    str(dataset1)
    # You'll see this output in the R Device port.
    # It'll have your stdout, stderr and PNG graphics device(s).
    #plot(dataset1);
    # Select data.frame to be sent to the output Dataset port
    maml.mapOutputPort("dataset1");

Lines 26 and 28 of the script (the line that defines numbers and the line that assigns the levels of "Scored Labels") assume that cluster label 0 corresponds to Iris-versicolor, cluster label 1 to Iris-setosa, and cluster label 2 to Iris-virginica. If your cluster ids do not match these mappings, change these two lines accordingly.
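One way to check which class each cluster id corresponds to in your own run is to cross-tabulate the cluster ids against the true labels and take the most frequent class per cluster. The sketch below is an illustration only, not part of the original experiment: the data frame `toy` is hypothetical data that mimics the Label and Assignments columns, and `cluster_to_class` is a name introduced here.

```r
# Hypothetical toy data standing in for the Label and Assignments columns
toy <- data.frame(
  Label = c("Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica", "Iris-virginica", "Iris-virginica"),
  Assignments = c(1, 1, 0, 0, 2, 2, 0),
  stringsAsFactors = FALSE
)

# Contingency table: rows are cluster ids, columns are class names
counts <- table(toy$Assignments, toy$Label)
print(counts)

# For each cluster id (row), pick the class (column) with the highest count
majority <- colnames(counts)[apply(counts, 1, which.max)]
cluster_to_class <- setNames(majority, rownames(counts))
print(cluster_to_class)
```

A majority vote can map two clusters to the same class when the clusters are poorly separated, so it is worth inspecting the printed table before editing the mapping by hand.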
To compare the performance of the k-means clustering and multiclass logistic regression classifications, view the metrics by clicking the output port of the **Evaluate Model** module and selecting **Visualize**.
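As a rough local counterpart to that comparison, the following sketch computes a confusion matrix, overall accuracy, and per-class precision and recall from true and predicted labels in base R. It does not reproduce the module's output: the data frame `scored` and its column names are hypothetical, and the metric definitions are the standard ones, assumed here rather than taken from the module.

```r
# Hypothetical scored data: true and predicted labels for six samples
classes <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")
scored <- data.frame(
  Label  = factor(c("Iris-setosa", "Iris-setosa", "Iris-versicolor",
                    "Iris-versicolor", "Iris-virginica", "Iris-virginica"),
                  levels = classes),
  Scored = factor(c("Iris-setosa", "Iris-setosa", "Iris-versicolor",
                    "Iris-virginica", "Iris-virginica", "Iris-virginica"),
                  levels = classes)
)

# Confusion matrix: rows are actual classes, columns are predicted classes
confusion <- table(Actual = scored$Label, Predicted = scored$Scored)
print(confusion)

# Overall accuracy: correctly classified samples divided by all samples
overall_accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class precision (correct / predicted as class) and recall (correct / actually in class)
precision <- diag(confusion) / colSums(confusion)
recall    <- diag(confusion) / rowSums(confusion)

print(overall_accuracy)
print(round(rbind(precision, recall), 3))
```

Running the same calculation once on the k-means output and once on the logistic regression output gives two sets of numbers that can be compared directly.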
[1]: https://gallery.cortanaintelligence.com/Experiment/Clustering-Group-iris-data-2