# Clustering sweep: diabetes dataset #

October 28, 2015
This experiment demonstrates how to use the new Sweep Clustering module to select the best number of centroids and optimize other parameters.
## Summary ##

This experiment uses a parameter sweep with the K-means clustering algorithm to select the best number of clusters and the best initial centroids, based on a clustering metric you select. In the experiment, the original dataset is divided into two parts, to handle missing data and to compare the effect of variables on clustering. The results of the models are compared using a principal components graph. The experiment also demonstrates how you can use the **Sweep Clustering** module to fill in values for a label column.

![Experiment overview](../media/completeExperimentGraph.PNG)

## Understanding the Data ##

The dataset contains 768 rows, drawn from a larger dataset that studies the incidence of diabetes among different populations. It includes the following clinical values, which are often used in diagnosing diabetes.

**Triceps skin fold measurements (TSF)** and **body mass index (BMI)** are thought to be correlated with body fat or obesity. However, many factors influence the bone, fat, and muscle composition of the arm, and measurements can vary greatly by age, gender, obesity, fitness training status, and race. Also, recent research has shown a lack of correlation between deep fat and subcutaneous fat.

**Glucose** can be measured in whole blood, serum, or plasma. The following list shows some sample _reference levels_ for plasma glucose, which is how blood samples taken at a doctor's office or lab are typically reported:

- Fasting plasma glucose: 70-99 mg/dL
- Postprandial plasma glucose at 2 hours: Less than 140 mg/dL
- Random plasma glucose: Less than 140 mg/dL

Glucose levels above the normal reference range indicate a medical problem, possibly a pre-diabetic stage.

The **diabetes pedigree score** is assigned using a pedigree function that calculates the _a priori_ risk for a disorder based on hereditary factors. A higher score indicates higher risk.

The **Class** variable contains the diagnosis, where 1 = diabetic and 0 = non-diabetic.
## Experiment Overview ##

The **Sweep Clustering** module is used, together with the **K-Means Clustering** module, to create a clustering model and iterate over many combinations of parameters to find the best model. Clustering models are known to be sensitive to initial parameters, so a parameter sweep has the advantage that in one pass you can try several different initial seeds and vary the number of target clusters.

Additionally, this walkthrough demonstrates a new feature in Azure Machine Learning that lets you include label columns in your clustering model. If your dataset is only partially labeled, you can use the clustering sweep to fill in the values of the label column.

### Create the experiment ###

1. Add the Pima Indians Diabetes Binary Classification dataset to your experiment.
2. Use the **Metadata Editor** module to abbreviate column names and make them easier to work with in later steps:
   - Number of times pregnant **-> Pregnancies**
   - Plasma glucose concentration at 2 hours in an oral glucose tolerance test **-> Glucose**
   - Diastolic blood pressure (mm Hg) **-> BP**
   - Triceps skin fold thickness (mm) **-> TSF**
   - 2-Hour serum insulin (mu U/ml) **-> Insulin**
   - Body mass index (weight in kg/(height in m)^2) **-> BMI**
   - Diabetes pedigree function **-> Pedigree**
   - Age (years) **-> Age**
   - Class variable (0 or 1) **-> Class**

### Divide and clean the data ###

This dataset contains many missing values, which are represented by zeros. The use of zero (0) for "missing" is problematic for several reasons:

- The values are biologically impossible in some columns, such as blood pressure.
- Because zero is a number, it affects the distribution of numeric values in the model.

For example, about half the sample cases are missing serum insulin measurements, even though insulin is considered one possible diagnostic marker of diabetes.
Additionally, values for blood pressure, glucose, BMI, and skin fold measurements are missing in 5-10 cases each. For the columns that have relatively few missing values, you can use the **Clean Missing Data** module to impute values. However, the cases without insulin are separated into a different dataset. The following diagram shows the workflow for cleaning and preparing the data.

![data preparation workflow](../media/dataPrepWorkflow.PNG)

1. Add the **Apply SQL Transformation** module, with the following CASE statement, to replace the zeros with nulls:

   ```
   select Pregnancies, Glucose, BP, TSF, Insulin, BMI, Pedigree, Age, Class,
   CASE WHEN (TSF != 0) THEN TSF ELSE null END as NewTSF,
   CASE WHEN (BMI != 0) THEN BMI ELSE null END as NewBMI,
   CASE WHEN (BP != 0) THEN BP ELSE null END as NewBP,
   CASE WHEN (Glucose != 0) THEN Glucose ELSE null END as NewGlucose
   from t1;
   ```

2. Add a **Split** module with the following relative expression, to divide the dataset into the cases with valid insulin values (394 cases) and those with no insulin values (374 cases):

   ```
   \"Insulin" != 0
   ```

3. Add the **Project Columns** module on the right-hand side of the split to exclude the [Insulin] column from the dataset.
4. On the left side, add the **Clean Missing Data** module. Choose [NewTSF], [NewBMI], [NewBP], and [NewGlucose] as the columns to be cleaned. For **Cleaning mode**, select **Replace using MICE**.
5. Add another instance of the **Clean Missing Data** module to the right-hand side of the experiment, and make the same selections. (You need to do this separately for the cases with and without insulin values, because MICE uses other columns to infer the missing values.)
6. Add an instance of the **Metadata Editor** module to both sides, select the [Class] column, and in the **Fields** dropdown list, choose **Label**. (This step is required because **Clean Missing Data** sometimes resets all fields to features.)
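Outside of Azure ML Studio, you can prototype the same zero-to-null CASE transformation and the insulin split with Python's built-in `sqlite3` module. The miniature table below is a made-up stand-in for the dataset; only a few columns are included for brevity:

```python
import sqlite3

# Hypothetical sample rows standing in for the diabetes dataset (t1):
# (Pregnancies, Glucose, BP, TSF, Insulin, BMI)
rows = [
    (1, 148, 72, 35, 0, 33.6),   # insulin recorded as 0 (missing)
    (0, 85, 66, 29, 94, 26.6),
    (8, 183, 64, 0, 0, 23.3),    # TSF and insulin recorded as 0 (missing)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (Pregnancies, Glucose, BP, TSF, Insulin, BMI)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?, ?, ?)", rows)

# Same CASE pattern as the Apply SQL Transformation step: 0 -> NULL.
cleaned = conn.execute("""
    SELECT Pregnancies, Glucose, BP, Insulin, BMI,
           CASE WHEN (TSF != 0) THEN TSF ELSE NULL END AS NewTSF,
           CASE WHEN (BMI != 0) THEN BMI ELSE NULL END AS NewBMI
    FROM t1
""").fetchall()

# Mirror the Split module's relative expression: Insulin != 0.
with_insulin = conn.execute("SELECT * FROM t1 WHERE Insulin != 0").fetchall()
without_insulin = conn.execute("SELECT * FROM t1 WHERE Insulin = 0").fetchall()
print(len(with_insulin), len(without_insulin))
```

This is only a sketch of the data-preparation logic; in the experiment itself, the transformation runs inside the **Apply SQL Transformation** module and the split is done by the **Split** module.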
### Create two clustering models using a parameter sweep ###

To build the clustering models, use the **Sweep Clustering** module together with the **K-Means Clustering** module.

1. Add the **K-Means Clustering** module in the left and right branches of the experiment, and configure them identically:
   - For **Create trainer mode**, select **Parameter Range**.
   - For **Initialization for sweep**, select **Use label column**.
   - For **Range for Number of Centroids**, type 2-10. Because you are using the label column to guide the initial selection of centroids, it is possible that you will end up with two clusters, one for each class; however, other numbers of clusters will be tested, and the optimal number will be selected based on test results.
   - For **Metric**, choose **Cosine**.
   - For **Assign Label Mode**, select **Ignore label column**. With this option, the column values are still referenced when building the model, but the values in [Class] won't be altered. If you had a [Class] column that was only partially populated, you could use the **Fill missing values** option, and the sweep would add new label values based on the clustering results. (You'll see how the other options work later in the experiment.)
2. Add the **Sweep Clustering** module to the experiment and connect it to the K-Means clustering model you just created. To the right-hand input dataset port, attach the 70% training set.
3. Configure the **Sweep Clustering** module as follows:
   - For **Metric for measuring cluster result**, select **Davies-Bouldin**. Note that the purpose of this metric is only to measure the internal quality of the clusters for the purpose of the parameter sweep.
   - Select the **Entire grid** option. Normally you might perform a random sweep for performance reasons, but sweeping the entire grid lets you view the full list of clusters that have been tried.
   - For **Column Set**, choose all feature columns.
Because you are using the label column to initialize the centroids, be sure to add the [Class] column as well.

## Evaluating the clustering model ##

There are multiple ways to evaluate the clustering models:

- You can use the principal components visualization to understand how the clusters are related to each other in a two-dimensional graph.
- The **Sweep Clustering** module outputs a set of metrics used for comparing the iterations during the sweep process. They don't represent the accuracy of the final model, but you can tell how each clustering scored on each iteration.
- You can use the assignments and the class labels, if available, to compute a simple confusion matrix to see how accurate the cluster assignments are.
- If you have a set of cluster assignments, **Evaluate Model** can generate a set of cluster metrics.

1. Add the **Assign Data to Clusters** module. Connect the **Best trained model** output of the **Sweep Clustering** module to the **Trained model** input of **Assign Data to Clusters**.
2. Connect the **Results dataset** output of the **Sweep Clustering** module to the **Dataset** input of **Assign Data to Clusters**.
3. Add the **Evaluate Model** module and connect the **Results dataset** from the **Sweep Clustering** module to the first **Scored dataset** input.
4. Add the **Apply SQL Transformation** module, connect the results of **Assign Data to Clusters** to the input port, and paste the following statement in the SQL Query Script text box:

   ```
   select Class, Assignments, count(*) as Summary
   from t1
   GROUP BY [Class], [Assignments];
   ```

5. Run the experiment.

### Sweep metrics ###

The metrics generated by the **Sweep Clustering** module show you what kind of clusters resulted when different parameters were tried, and how each clustering measured up given the metric you specified. If you run a random sweep, each run might use a different number of clusters and different parameters, but the best results are always listed at the top of the table.
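Conceptually, the sweep tries each candidate number of centroids and keeps the one with the best value of the chosen metric. The following rough sketch of that idea assumes scikit-learn is installed and uses synthetic data in place of the diabetes features; it is not the actual **Sweep Clustering** implementation:

```python
# Sweep k = 2..10 and keep the k with the best (lowest) Davies-Bouldin index.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in for the cleaned feature columns.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

best_k, best_score = None, float("inf")
for k in range(2, 11):  # mirrors "Range for Number of Centroids": 2-10
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = davies_bouldin_score(X, labels)  # lower is better
    if score < best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```

As in the module, the metric here measures only the internal quality of each clustering (compactness versus separation), not accuracy against any known labels.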
![clustering sweep results](../media/LevsSweepResults.PNG)

For this experiment, the grid sweep tested numbers of clusters ranging from 2-10, but the best runs had two clusters.

### PCA graph ###

The principal components (PCA) graph generated by the **Assign Data to Clusters** module captures the differences between clusters in the model using a simple chart. For example, although the cases in this model vary greatly by dimensions such as the number of pregnancies, age, blood pressure, and so forth, it is difficult for humans to visualize differences in five dimensions! Therefore, the PCA graph computes two principal component axes that summarize the multi-dimensional differences between the clusters, and the cluster assignments are plotted along these axes.

For example, this combined graph compares the clusters in the model with insulin measurements to the clusters in the model with no insulin values reported.

![PCA graph comparison](../media/PCA_combinedclusters.png)

From this chart, you can see that it is much easier to distinguish between cases in cluster 0 and cluster 1 in the model with good insulin measurements, although even those clusters overlap quite a bit.

Here is another example of a PCA graph, from a model that found five clusters; cluster 0 could be hard to separate from the others.

![5 cluster PCA graph](../media/Example5clusters.PNG)

### How accurate is the clustering model? ###

One way to estimate the accuracy of clustering, if labels are known, is to compute a simple confusion matrix. The following table shows the simple confusion matrix computed using **Apply SQL Transformation**.

![confusion matrix](../media/SimpleConfusionMatrix.PNG)

If labels are not known, you can use the results of the **Evaluate Model** module to assess the quality of the clustering model in terms of how many points are in each cluster, how much scatter there is in each cluster, and how far apart the clusters are.
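When labels are known, the simple confusion matrix boils down to counting (Class, Assignments) pairs, just like the GROUP BY query used earlier. A minimal standard-library sketch, with made-up label and assignment values:

```python
# Count each (Class, Assignments) combination, like the GROUP BY summary.
from collections import Counter

classes     = [0, 0, 0, 1, 1, 1, 1, 0]  # known diagnoses (illustrative)
assignments = [0, 0, 1, 1, 1, 1, 0, 0]  # cluster assignments (illustrative)

summary = Counter(zip(classes, assignments))
for (cls, assign), count in sorted(summary.items()):
    print(f"Class={cls} Assignments={assign} count={count}")
```

Large counts on the diagonal (class 0 in cluster 0, class 1 in cluster 1) suggest the clusters align well with the diagnosis; large off-diagonal counts suggest they do not.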
For example, the following graphic shows statistics for the model with insulin.

![metrics for a clustering model](../media/EvaluateTwoCluster.PNG)

## Changing labels in a clustering model ##

Remember how you used the label column to build the model, but elected to ignore the labels? That option was best because your dataset already contained the correct labels (which is useful information!), so you didn't want to overwrite the values. But suppose you had an incomplete set of labels? Or suppose you wanted to relabel all the data points that are assigned to a particular cluster? It's easy to do. To illustrate the results, in the right-hand branch of the experiment, we'll engineer some cases without labels.

1. Add the **Apply SQL Transformation** module and use the following statement to set the class to null for cases where [NewTSF] is missing:

   ```
   SELECT Pregnancies, Pedigree, Age, NewGlucose, NewBP, NewTSF, NewBMI, Class,
   CASE WHEN (NewTSF is null) THEN null ELSE Class END as NewClass
   from t1;
   ```

2. Add the **Sweep Clustering** and **K-Means Clustering** modules to build a clustering model exactly as before. However, in the **K-Means Clustering** module, set the option **Assign Label Mode** to **Fill missing values**.
3. Run the experiment.
4. Right-click the **Results dataset** output of the **Sweep Clustering** module and select **Visualize** to see the cases and the cluster assignments.
5. Now change the option **Assign Label Mode** to **Overwrite from closest to center** and run the experiment again.

Changing this option doesn't change the way the model is built, so the clusters are the same and the incoming data points are assigned to the same cluster as before. However, the label that is assigned to each data point is replaced with a new value, depending on the option you selected, as shown in the following graphic.

![fill vs.
overwrite](../media/Results-Combined.PNG)

- If you select the **Overwrite from closest to center** option, _all_ existing labels are overwritten with the cluster assignments provided by the model. This is useful, for example, if you want to change many detailed labels ("sports car", "coupe", "sedan") into a more generic label ("cars").
- If you select the **Fill missing values** option, only the cases that don't have a label (here, the nulls in the [NewClass] column) are overwritten with the Assignment values. This is useful if you have labels for only some cases and want to use the cluster assignments to generate new labels for the rest of the cases.

## Acknowledgements ##

Thanks to Lev Lipkin, Senior Software Developer at Microsoft, who created the original experiment. This experiment was developed to demonstrate the many options for clustering model optimization in the **Sweep Clustering** module.
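As a closing footnote, the two **Assign Label Mode** behaviors discussed above reduce to a very small piece of logic. This plain-Python sketch uses made-up label and assignment values (`None` standing in for a null label):

```python
# Sketch of the two label-assignment behaviors: "fill" replaces only
# missing labels with cluster assignments; "overwrite" replaces all of them.
def apply_label_mode(labels, assignments, mode):
    if mode == "fill_missing":
        return [a if lbl is None else lbl for lbl, a in zip(labels, assignments)]
    if mode == "overwrite":
        return list(assignments)
    raise ValueError(f"unknown mode: {mode}")

labels      = [0, None, 1, None]   # partially labeled cases (illustrative)
assignments = [1, 1, 0, 0]         # cluster assignments (illustrative)

print(apply_label_mode(labels, assignments, "fill_missing"))
print(apply_label_mode(labels, assignments, "overwrite"))
```

With "fill_missing", the two existing labels survive and only the `None` entries take their cluster assignments; with "overwrite", every case takes its assignment.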