Binary Classification: Breast Cancer Detection

September 2, 2014
This sample demonstrates how to split the data set using external data; it also demonstrates how to perform binary classification to detect breast cancer using a two-class boosted decision tree and how to compute a customized performance metric.
#Binary Classification: Breast Cancer Detection

This sample demonstrates how to train a binary classifier to detect breast cancer using Azure ML Studio. The data is from the KDD Cup 2008 challenge. In this experiment, we focus on the problem of early detection of breast cancer from X-ray images of the breast. The performance metric in this experiment is the area under the ROC curve where the probability of false positive is between 0.2 and 0.3.

##Data

In this experiment, we have two data sources:

- Breast Cancer Info
- Breast Cancer Features

Both data sets are available at the [KDD website](http://www.kdd.org/kdd-cup-2008-breast-cancer).

The *Breast Cancer Info* data set contains some metadata about the data set. Specifically, it contains 102,294 rows and 11 columns. We use the first 11 columns of this data set:

1. label
2. image-finding-id
3. study-finding-id
4. image-id
5. patient-id
6. leftbreast
7. MLO
8. x-location
9. y-location
10. x-nipple-location
11. y-nipple-location

Basically, this data set contains the label and several IDs for each examination: image-finding-id, study-finding-id, image-id, and patient-id.

The *Breast Cancer Features* data set has 102,294 rows and 118 columns. It contains the features for each example. There is a one-to-one correspondence between the rows of the two data sets. In our experiment, we use the label and ID information in the *Breast Cancer Info* data set to split the *Breast Cancer Features* data set into training and test data sets.

##Data Processing

First of all, we select the first 11 columns of the *Breast Cancer Info* data set by using the **Project Columns** module and then assign column names to these selected columns by using the **Metadata Editor** module. Then we split this data set into positive and negative subsets by using the **Split** module. Note that the splitting is based on patient ID, so we use **Remove Duplicate Rows** to keep only one row per patient ID. Then we further split the positive and negative patient IDs into training and test subsets by using the **Split** module. The following image shows the workflow of the preprocessing of the *Breast Cancer Info* data set.

![][image1]

Since there is a one-to-one correspondence between the *Breast Cancer Info* data set and the *Breast Cancer Features* data set, we can use the **Add Columns** module to combine these two data sets. After splitting based on patient ID in the *Breast Cancer Info* data set, we can map the split back to the *Breast Cancer Features* data set by using the **Join** module. We therefore obtain the following 4 subsets:

- Positive training examples
- Positive test examples
- Negative training examples
- Negative test examples

By using the **Add Rows** module, we can get the training data set by combining the positive training examples with the negative training examples. Similarly, we can get the test data set.
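To make the split logic more concrete, here is a minimal standalone R sketch of the same idea outside of Studio. It is not part of the experiment graph; the data frames `info` and `features`, the column names `label` and `patient.id`, the 0.7 training fraction, and the seed are all illustrative assumptions.

```r
# Illustrative sketch only: split by patient ID, separately for positives and negatives,
# then map the split back to the feature rows (one-to-one row correspondence).
set.seed(42)

split_patients <- function(ids, frac = 0.7) {
  unique_ids <- unique(ids)
  train_ids <- sample(unique_ids, size = round(frac * length(unique_ids)))
  ids %in% train_ids            # TRUE for rows that go to the training set
}

is_pos   <- info$label == 1
in_train <- logical(nrow(info))
in_train[is_pos]  <- split_patients(info$patient.id[is_pos])
in_train[!is_pos] <- split_patients(info$patient.id[!is_pos])

train_set <- cbind(info[in_train, ],  features[in_train, ])
test_set  <- cbind(info[!in_train, ], features[!in_train, ])
```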
###Generate New Data Set

Notice that this data set is imbalanced: the number of positive samples is significantly smaller than the number of negative samples. In classification, we are more interested in correctly classifying the positive samples. In order to increase the weight of the positive samples, we replicate them 93 times. The replication is done by the following R code, which is run using the **Execute R Script** module:

```r
dataset <- maml.mapInputPort(1)

# keep all negative samples (label == -1)
data.set <- dataset[dataset[,1]==-1,]

# append the positive samples (label == 1) 93 times
pos <- dataset[dataset[,1]==1,]
for (i in 1:93) data.set <- rbind(data.set, pos)

row.names(data.set) <- NULL

# send the resulting data frame to the module's output port
maml.mapOutputPort("data.set")
```

Note that the input of the **Execute R Script** module is an R data frame; the output should also be an R data frame.

##Feature Engineering

Since the patient ID is very important in this data set, we create 2 more training data sets by quantizing the patient ID. Specifically, we use the **Quantize Data** module with the bin edges set to 0, 20000, 100000, 500000, 4000000, and 4870000. After quantizing, we get a new variable that denotes which interval the patient ID falls into. We use **Metadata Editor** to convert this new variable to a categorical variable and then convert it into several boolean variables using the **Indicator Values** module. The following image shows the workflow used to quantize the data.

![][image2]
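As a side note, the same quantize-then-indicator transformation can be sketched in plain R. This is not part of the experiment graph; the data frame `train_set` and the column name `patient.id` are illustrative assumptions, while the bin edges are the ones used in the experiment.

```r
# Illustrative sketch: bin the patient ID with the experiment's bin edges and
# expand the resulting categorical variable into 0/1 indicator columns.
edges <- c(0, 20000, 100000, 500000, 4000000, 4870000)

# cut() assigns each patient ID to one of the five intervals defined by the edges
train_set$patient.id.bin <- cut(train_set$patient.id, breaks = edges, include.lowest = TRUE)

# model.matrix() creates one 0/1 indicator column per interval
indicators <- model.matrix(~ patient.id.bin - 1, data = train_set)
train_set <- cbind(train_set, indicators)
```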
##Model

In this experiment, we apply the **Two-Class Boosted Decision Tree** algorithm. Four different training data sets are used to train 4 different models:

- Original training data set
- Original training data set with the positive samples replicated 93 times
- Original training data set with the features created by quantizing the patient ID added
- Original training data set with the quantized patient ID features added and the positive samples replicated 93 times

We initialize the learning algorithm using the **Two-Class Boosted Decision Tree** module and then use the **Train Model** module to create the actual model. These models are used by the **Score Model** module to produce scores for the test examples. Since 4 different predictions on the test data set are generated by the 4 different models, we use **Add Columns** to combine these predictions. The following image shows the workflow of the model training and scoring.

![][image3]

In this experiment, we use an **Execute R Script** module to compute the customized metric. In this module, we define a function called *compute_auc*, which computes the fraction of the area under the ROC curve where the probability of false positive is between 0.2 and 0.3. This function is the R counterpart of the Matlab function get\_ROC\_KDD.m provided at the [KDD website](http://www.kdd.org/kdd-cup-2008-breast-cancer). The following is the R code of the *compute_auc* function. In this function, we assume that the input data frame has 3 columns: 1) true target; 2) prediction; and 3) patient ID.

```r
compute_auc <- function (dataset) {
  FA_low <- 0.2
  FA_high <- 0.3
  colnames(dataset) <- c("Y","Pred","PatientID")

  n_positive_patients <- length(unique(dataset$PatientID[dataset$Y==1]))
  n_patients <- length(unique(dataset$PatientID))
  n_images <- 4*n_patients

  # sort the examples by decreasing prediction score
  data <- dataset[order(-dataset$Pred),]
  n_points <- dim(data)[1]+1

  num_FA <- vector(mode="numeric",length=n_points)
  num_D <- vector(mode="numeric",length=n_points)
  num_PatientsDetected <- vector(mode="numeric",length=n_points)
  patients_detected_till_now <- vector(mode="numeric",length=0)

  # sweep the decision threshold over the sorted scores,
  # counting false alarms and newly detected patients at each point
  for (i in 2:n_points) {
    if (data$Y[i-1]==1) {
      num_FA[i] <- num_FA[i-1]
      patients_detected_till_now <- union(patients_detected_till_now, data$PatientID[i-1])
      num_PatientsDetected[i] <- length(patients_detected_till_now)
    } else {
      num_PatientsDetected[i] <- num_PatientsDetected[i-1]
      num_FA[i] <- num_FA[i-1]+1
    }
  }

  FA_per_image <- num_FA / n_images
  Pd_patient_wise <- num_PatientsDetected / n_positive_patients

  # restrict to the region where the false-alarm rate per image is in [FA_low, FA_high]
  index1 <- min(which(FA_per_image>=FA_low))
  index2 <- max(which(FA_per_image<=FA_high))
  AUC <- Pd_patient_wise[index1:(index2-1)] %*%
         (FA_per_image[(index1+1):index2] - FA_per_image[index1:(index2-1)])

  return(AUC)
}
```

The following is the main code, which calls the *compute_auc* function to compute the final performance metrics.

```r
dataset_input <- maml.mapInputPort(1)
ids <- maml.mapInputPort(2)

# number of features+labels in the original dataset
n_cols <- 120

data.set <- data.frame(matrix(nrow=4,ncol=3))
names(data.set) <- c("features","training set","fraction of area under ROC")

# add annotations of results
data.set[1] <- c("image features",
                 "image features",
                 "image features + quantized patient ID",
                 "image features + quantized patient ID")
data.set[2] <- c("replication of positives",
                 "no replication of positives",
                 "no replication of positives",
                 "replication of positives")

# compute metrics for the first two training sets
for (i in 1:2) {
  dataset <- cbind(dataset_input[,(i-1)*n_cols+1],dataset_input[,i*n_cols],ids)
  data.set[i,3] <- compute_auc(dataset)
}

# compute metrics for the last two training sets
base <- n_cols * 2
n_cols <- 126
for (i in 1:2) {
  dataset <- cbind(dataset_input[,base+(i-1)*n_cols+1],dataset_input[,base+i*n_cols],ids)
  data.set[i+2,3] <- compute_auc(dataset)
}

maml.mapOutputPort("data.set")
```

##Results

The final results of the experiment, obtained by right-clicking the **Visualize** output of the **Project Columns** module, are:

![][image4]

This table summarizes the customized metric, i.e., the fraction of the area under the ROC curve, for the 4 different approaches. It can be observed that the best performance is achieved by replicating the positive samples and quantizing the patient ID.

<!-- Images -->
[image1]:http://az712634.vo.msecnd.net/samplesimg/v1/6/info_preprocessing.png
[image2]:http://az712634.vo.msecnd.net/samplesimg/v1/6/quantize.png
[image3]:http://az712634.vo.msecnd.net/samplesimg/v1/6/model.png
[image4]:http://az712634.vo.msecnd.net/samplesimg/v1/6/breast_results.png