Classification using K-Means Clustering vs. Multiclass Logistic Regression: Iris data

October 27, 2017
This sample demonstrates how to perform clustering using the k-means algorithm on the UCI Iris data set. It also applies multi-class logistic regression to the same data and compares its classification performance with that of k-means clustering.
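For readers who want to try the same comparison outside the experiment, the following is a minimal sketch in plain R, not the experiment itself. It assumes a local R session, the built-in `iris` data frame (whose species names are `setosa`, `versicolor`, `virginica` rather than the `Iris-` prefixed names in the UCI file), and the `nnet` package for multinomial logistic regression. For brevity both models are fit and evaluated on the full data set, so the tables are in-sample.

```r
# Assumption: local R session; nnet is one of R's recommended packages
library(nnet)

set.seed(42)  # k-means starts from random centroids
features <- iris[, 1:4]

# Unsupervised: k-means with 3 clusters (the species labels are not used)
km <- kmeans(features, centers = 3, nstart = 20)

# Supervised: multinomial (multi-class) logistic regression
fit <- multinom(Species ~ ., data = iris, trace = FALSE)
predicted <- predict(fit, newdata = iris)

# Cross-tabulate the true species against each model's output
cluster_table <- table(Species = iris$Species, Cluster = km$cluster)
class_table <- table(Species = iris$Species, Predicted = predicted)

print(cluster_table)
print(class_table)
```

The cluster ids in `cluster_table` are arbitrary integers, so the k-means table needs a cluster-to-species mapping before it can be read like a confusion matrix.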
This sample builds upon [https://gallery.cortanaintelligence.com/Experiment/Clustering-Group-iris-data-2][1]. We added the ability to view the performance of the multi-class logistic regression and k-means clustering models side by side using the **Evaluate Model** module.

After assigning the test samples to the learned centroids in K-Means, we changed the **Select Columns in Dataset** module to select the following columns:

- Label
- Assignments
- F1
- F2
- F3
- F4

To create a dataset that matches the Scored dataset format, we created an R script, which is run using the **Execute R Script** module.

```r
# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame
#dataset2 <- maml.mapInputPort(2) # class: data.frame

# Contents of optional Zip port are in ./src/
# source("src/yourfile.R");
# load("src/yourData.rdata");

# Sample operation
#levels(dataset1$Assignments) <- list("Iris-setosa"=0, "Iris-versicolor"=1, "Iris-virginica"=2)
# Rename the columns; this assumes they arrive in the order F1..F4, Label, Assignments
names(dataset1) <- c("F1", "F2", "F3", "F4", "Label", "Scored Labels")

# Map cluster ids to class names first, because the comparisons below match on class names
numbers <- factor(c(1, 0, 2))
dataset1[["Scored Labels"]] <- factor(dataset1[["Scored Labels"]], levels = numbers[1:3])
levels(dataset1[["Scored Labels"]]) <- list("Iris-setosa"=1, "Iris-versicolor"=0, "Iris-virginica"=2)

# Placeholder scores: 0.9 for the assigned class, 0.05 for the other two
dataset1[["Scored Probabilities for Class \"Iris-setosa\""]] <- 0.05
dataset1[["Scored Probabilities for Class \"Iris-setosa\""]][dataset1[["Scored Labels"]] == "Iris-setosa"] <- 0.9
dataset1[["Scored Probabilities for Class \"Iris-versicolor\""]] <- 0.05
dataset1[["Scored Probabilities for Class \"Iris-versicolor\""]][dataset1[["Scored Labels"]] == "Iris-versicolor"] <- 0.9
dataset1[["Scored Probabilities for Class \"Iris-virginica\""]] <- 0.05
dataset1[["Scored Probabilities for Class \"Iris-virginica\""]][dataset1[["Scored Labels"]] == "Iris-virginica"] <- 0.9

# Move "Scored Labels" (column 6) after the three probability columns
dataset1 <- dataset1[c(1,2,3,4,5,7,8,9,6)]

# Attach the column metadata of the Scored dataset format
attr(dataset1$Label, "label.type") <- "True Labels"
attr(dataset1[["Scored Probabilities for Class \"Iris-setosa\""]], "feature.channel") <- "Multiclass Classification Scores"
attr(dataset1[["Scored Probabilities for Class \"Iris-setosa\""]], "score.type") <- "Scored Probabilities for Class \"Iris-setosa\""
attr(dataset1[["Scored Probabilities for Class \"Iris-versicolor\""]], "feature.channel") <- "Multiclass Classification Scores"
attr(dataset1[["Scored Probabilities for Class \"Iris-versicolor\""]], "score.type") <- "Scored Probabilities for Class \"Iris-versicolor\""
attr(dataset1[["Scored Probabilities for Class \"Iris-virginica\""]], "feature.channel") <- "Multiclass Classification Scores"
attr(dataset1[["Scored Probabilities for Class \"Iris-virginica\""]], "score.type") <- "Scored Probabilities for Class \"Iris-virginica\""
attr(dataset1[["Scored Labels"]], "feature.channel") <- "Multiclass Classification Scores"
attr(dataset1[["Scored Labels"]], "score.type") <- "Assigned Labels"

str(dataset1)
# You'll see this output in the R Device port.
# It'll have your stdout, stderr and PNG graphics device(s).
#plot(dataset1);

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("dataset1");
```

The `numbers <- factor(c(1, 0, 2))` statement and the `levels(dataset1[["Scored Labels"]]) <- list(...)` statement assume that cluster label 0 corresponds to Iris-versicolor, cluster label 1 to Iris-setosa, and cluster label 2 to Iris-virginica. If your cluster ids do not match these mappings, change these two statements accordingly.

To evaluate the performance of the k-means clustering and multi-class logistic regression classifications, view the metrics by clicking the output port of the **Evaluate Model** module and selecting **Visualize**.

[1]: https://gallery.cortanaintelligence.com/Experiment/Clustering-Group-iris-data-2
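
The cluster-to-class mapping hard-coded in the script can be checked, or derived, by cross-tabulating the cluster assignments against the true labels and taking the most frequent label in each cluster. The sketch below illustrates the idea in plain R; `map_clusters_to_labels` is a hypothetical helper (not part of the experiment or of Azure ML), and the two input vectors are made-up toy data standing in for the `Label` and `Assignments` columns.

```r
# Hypothetical helper: for each cluster id, return the most frequent true label
map_clusters_to_labels <- function(labels, assignments) {
  counts <- table(assignments, labels)          # rows: cluster ids, columns: labels
  winners <- apply(counts, 1, which.max)        # column index of the majority label
  mapping <- colnames(counts)[winners]
  names(mapping) <- rownames(counts)            # name each entry by its cluster id
  mapping
}

# Toy data (assumption): seven samples with true labels and cluster ids
labels <- c("Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica", "Iris-virginica", "Iris-virginica")
assignments <- c(1, 1, 0, 0, 2, 2, 0)

mapping <- map_clusters_to_labels(labels, assignments)
print(mapping)

# Translate every cluster id into its majority label
mapped_labels <- unname(mapping[as.character(assignments)])
print(mapped_labels)
```

With this toy input, cluster 0 maps to Iris-versicolor, cluster 1 to Iris-setosa, and cluster 2 to Iris-virginica, which matches the mapping the script assumes. A majority vote can assign the same label to two clusters when the clustering is poor, so the result is worth inspecting before it is used to relabel data.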