# Clustering: Group Iris Data

February 19, 2015
This sample demonstrates how to perform clustering using the k-means algorithm on the UCI Iris data set. In this experiment, we perform k-means clustering using all the features in the dataset, and then compare the clustering results with the true class labels for all samples. We also use the **Multiclass Logistic Regression** module to perform multiclass classification and compare its performance with that of k-means clustering.

## Data

We used the Iris data set, a well-known benchmark data set for multiclass classification from the [UCI repository](https://archive.ics.uci.edu/ml/datasets/Iris). This data set has 150 samples with 4 features and 1 label (the last column). All features are numeric except the label, which is a string.

![][image1]

## Data Processing

We use the **Reader** module to read the data directly from the [UCI site](http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data). The **Reader** module also reads the final empty row, so for this experiment we use the **Clean Missing Data** module to remove the empty row from the data set. Next, we use the **Metadata Editor** module to change the column names; in particular, the name of the last column, which contains the class label, is changed to `Label`. We then use the **Split** module to divide the data set randomly into training and test sets, setting the fraction of rows in the training set to 0.6. After splitting, the training set has 90 samples and the test set has 60 samples.

## Model

In this experiment, we create a model using the **K-Means Clustering** module and compare it with a model created using the **Multiclass Logistic Regression** module. We use the **Train Clustering Model** module to train the model, attaching the untrained clustering model from the **K-Means Clustering** module to one input and an unlabeled dataset to the other input.
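The data-preparation and training steps above can be approximated locally with scikit-learn (an assumption: the original experiment uses Azure ML Studio's drag-and-drop modules, so the function and parameter names below belong to scikit-learn, not to the Studio):

```python
# A minimal local sketch of the experiment's setup, assuming scikit-learn's
# bundled copy of the Iris data in place of the Reader + Clean Missing Data
# modules (the bundled copy has no trailing empty row to clean).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 numeric features

# Mirror the Split module: 60% of rows go to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

# K-means with 3 centroids, one per iris species, as in the
# K-Means Clustering + Train Clustering Model modules.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
```

With this split the training set has 90 rows and the test set 60, matching the counts described above.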
You can then use the trained clustering model to return labels for the training data by connecting the output of the **Train Clustering Model** module to the **Assign to Clusters** module. We also use the **Assign to Clusters** module to predict cluster labels for the test dataset: it takes the trained k-means clustering model as its first input and the unlabeled test dataset as its second input.

Next, we build a model using the **Multiclass Logistic Regression** module and use the **Train Model** module to train it on the training data set. We use the **Score Model** module to create predictions on the test dataset, and pass those scores to the **Evaluate Model** module to compute the classification performance of the logistic regression model.

## Results

When the **Train Clustering Model** and **Assign to Clusters** modules output the clustering assignment results, the original data is also returned, including all features and the label column. To simplify these results, we use the **Project Columns** module to extract just two columns, `Label` and `Assignments`, for both the training and test datasets. The `Label` column (which we renamed using **Metadata Editor**) contains the true labels, and the `Assignments` column contains the predicted cluster labels.

To review the clustering results, right-click the output of either **Project Columns** module and select **Visualize**. Select the `Label` column in the left panel, and then select `Assignments` in the **Compare to** list in the right-hand Visualization panel, to plot a chart similar to a confusion matrix comparing `Label` and `Assignments`. The following figure shows the plot for the training data set.

![][image2]

In this plot, the values of `Label` are on the x-axis, and the predicted cluster labels in `Assignments` are on the y-axis.
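The scoring steps above can be sketched in scikit-learn terms (again an approximation of the Studio modules, not the Studio API itself; `predict` on a fitted `KMeans` plays the role of **Assign to Clusters**, and `LogisticRegression` stands in for the **Multiclass Logistic Regression** + **Train Model** + **Score Model** chain):

```python
# Sketch of scoring both models on the held-out test set,
# assuming the same 60/40 split as before.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

# "Assign to Clusters": predict cluster ids for the unlabeled test rows.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
test_assignments = kmeans.predict(X_test)

# Multiclass logistic regression trained on the labeled training set.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_scores = clf.predict(X_test)

# The counterpart of the Evaluate Model output: a confusion matrix
# and the overall accuracy of the classifier on the test set.
cm = confusion_matrix(y_test, test_scores)
acc = accuracy_score(y_test, test_scores)
```

The exact accuracy figures reported in this article come from the Studio experiment; this local sketch will give numbers in the same range but not necessarily identical ones.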
Cluster label 0 corresponds to Iris-versicolor, cluster label 1 to Iris-virginica, and cluster label 2 to Iris-setosa. Unfortunately, two of the Iris-versicolor samples are misclassified as Iris-virginica. A similar plot for the test data set (below) leads to similar observations.

![][image3]

For the **Multiclass Logistic Regression** module, you can view the output by clicking the output port of the **Evaluate Model** module and selecting **Visualize**. You can view several metrics, including overall accuracy, average accuracy, and micro- and macro-averaged precision and recall. A confusion matrix is also provided.

![][image4]

To evaluate the clustering results, we classify each example according to the majority label in its cluster. We observe from the confusion matrix that the clustering is very accurate, since only cluster 1 contains examples from two different classes. Moreover, the multiclass accuracy of our unsupervised clustering is 96.67%, which is very close to the multiclass accuracy of 98.33% obtained by supervised multiclass logistic regression.

<!-- Images -->
[image1]: https://az712634.vo.msecnd.net/samplesimg/v1/39/whole_exp.PNG
[image2]: https://az712634.vo.msecnd.net/samplesimg/v1/39/kmeans_train.PNG
[image3]: https://az712634.vo.msecnd.net/samplesimg/v1/39/kmeans_test.PNG
[image4]: https://az712634.vo.msecnd.net/samplesimg/v1/39/logistic_confusion.PNG
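The majority-vote evaluation described above can be sketched as follows (a scikit-learn approximation under the same assumed 60/40 split; the mapping from cluster ids to species, and hence the exact accuracy, may differ from the Studio run because k-means cluster numbering is arbitrary):

```python
# Score the unsupervised clustering as a classifier: map each cluster
# to the majority true label among its training members, then apply
# that mapping to the test-set cluster assignments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Majority label per cluster, computed on the training data.
majority = {c: np.bincount(y_train[kmeans.labels_ == c]).argmax()
            for c in range(3)}

# Translate test-set cluster ids into class predictions and score them.
pred = np.array([majority[c] for c in kmeans.predict(X_test)])
clustering_acc = (pred == y_test).mean()
```

This is the same idea as reading the near-diagonal structure off the confusion-matrix-style plot: a cluster "counts as" whichever species dominates it, and every sample in that cluster belonging to another species counts as an error.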