Lymphoma survival prediction
Predict lymphoma survival from patient gene expression, using real-world data and approaches from an original research article (Rosenwald et al., 2002).
# Diffuse large B cell lymphoma survival prediction from gene expression data
## Description
Differences in the behavior of cancerous cells between patients can affect the probability of response to chemotherapy. One way to quantify differences in behavior between cells is to measure the level of expression of each gene. In this experiment, we attempt to predict patient survival following chemotherapy using gene expression data. Our approach closely parallels the strategy described by [Rosenwald et al. (2002)](http://www.nejm.org/doi/full/10.1056/NEJMoa012914) and uses the open dataset they have made available through the [National Cancer Institute](http://www.cancer.gov/) [caARRAY database](https://wiki.nci.nih.gov/display/caArray2/caArray). This study focuses on patients with cancers originating from diffuse large B cells (the immune system cell type that produces antibodies).
In this experiment, the number of genes for which expression data is available vastly exceeds the number of patients available for training and validating a survival prediction model. This issue is addressed by:
* Removing from consideration any gene that does not correlate significantly with patient survival.
* Performing hierarchical clustering to group genes based on correlations in expression across patients.
* Averaging the expression of genes within each cluster to generate a small number of features for survival prediction.
## Source data
The data used in this experiment can be obtained from the [National Cancer Institute](http://www.cancer.gov/) [caARRAY database](https://wiki.nci.nih.gov/display/caArray2/caArray), where it is listed as experiment [staud-00261](ftp://caftpd.nci.nih.gov/pub/caARRAY/experiments/caArray_staud-00261/). Our approach is based heavily on the one described in the dataset's original publication:
* Rosenwald A et al. (2002). "The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma." NEJM 346(25):1937-1947. ([Journal](http://www.nejm.org/doi/full/10.1056/NEJMoa012914) | [NCBI](http://www.ncbi.nlm.nih.gov/pubmed/12075054))
### Gene expression data
Two input files containing patient data will be used in this experiment. The first, `NEJM_Web_Fig1data.txt`, contains expression data for 240 patients (and additional samples), measured using 7,399 probes. (A given gene's expression may be measured using multiple probes, measured using a single probe, or not measured at all.) Each probe has a unique identifier (`UNIQID`) and a descriptive name (`NAME`).
UNIQID NAME MLC94.46_LYM009... MLC96.45_LYM186... ...other patients...
1 27481 |U76248|~AA053221|... 0.2340 -0.17250
2 17013 |U76248|*AA053221|... 0.4523 -0.03870
3 24751 |U76248|*AA768908|... 0.4050 -0.04135
4 27498 |U76248|~AA053221|... 0.1146 -0.02420
5 27486 |U76248|*AA053221|... 0.2491 -0.10280
... other probes ...
The expression data reported in this file are log ratios between the level of expression in the patient's cancerous cells and the patient's normal cells. Notice that each patient's identifier includes a `LYM` number, which can be used to join the expression data to other facts about the patient.
### Patient survival and cancer type data
The second file, `DLBCL_patient_data_NEW.txt`, contains all other information on the patient, including:
* Whether they were part of the training or validation set in the Rosenwald et al. study.
* The patient's International Prognostic Index (IPI) classification. The IPI is an indicator of survival probability that is based on patient age, whether the cancer has spread beyond the lymph nodes, the stage of the cancer, whether the patient's mobility has been affected, and the activity level of an enzyme in the blood.
* The type of B cells from which the cancer originated (GCB: Germinal center B cells; ABC: Activated [peripheral] B cells; TypeIII: other large B cells).
* When follow-up occurred and whether the patient was alive at the time of follow-up.
* The patient's LYM number (linking them to the expression data described above).
LYM Analysis Set Follow-up (yrs) Status Subgroup IPI Group
4 Training 4.9 Alive GCB Medium
6 Training 5.6 Alive GCB Low
### Groups of related genes
Rosenwald et al. provide four files listing genes associated with each of four processes:
* *Genes involved in cellular proliferation*: Fast growth and division of cancerous cells is associated with a negative patient outcome.
* *Genes expressed in lymph node cells*: Lymphomas are cancers that originate in the [lymphatic system](https://en.wikipedia.org/wiki/Lymphatic_system). Loss of expression of these genes may be a sign of more extreme changes in the behavior of cancerous cells.
* *Genes expressed in the [Germinal center](https://en.wikipedia.org/wiki/Germinal_center) B cells*: B cells are the immune system cells which produce antibodies. Cancers originating from these cells have a higher survival probability following treatment than the other subtypes of lymphomas examined here.
* *[Major Histocompatibility Complex](https://en.wikipedia.org/wiki/Major_histocompatibility_complex) (MHC) class II-related genes*: MHCs are cell surface proteins that enable the immune system to identify cells producing abnormal proteins. Reduced expression of these proteins can impair the immune system's ability to recognize and kill cancerous cells. Since MHC class II genes are expressed only in certain cell types, their expression can also help label the cell type from which a cancer originated.
## Converting probe expression data to overall gene expression data
Our first course of action will be to average, for each gene, the log relative expression levels for all probes for that gene. To do this, we will use the "`NAME` field to guess which probes measure the same gene. This process is complicated by the fact that some probe names which are slightly different clearly correspond to the same gene, e.g.
|X02761|~R62612|Hs.287820|fibronectin 1
|X02761|*R62612|Hs.287820|fibronectin 1
|M14745|*AA810891|Hs.79241|B-cell CLL/lymphoma 2
|M14745|*W63749|Hs.79241|B-cell CLL/lymphoma 2
In both of these example pairs, the second field of the name (as delimited by `|`) is the only field which differs. However, there are other cases where the only information available for finding pairs is the second field, e.g.
||*AA830781||
We have used an **Execute R Script** module to group probes with similar names according to the rule that two probes are considered to correspond to the same gene if their names match exactly on the second field (excluding the first character) or all other fields, and only if the fields are not empty:
expression_df <- maml.mapInputPort(1) # read the expression data into the R module
names_df <- subset(expression_df, select = +c(UNIQID, NAME)) # extract just the probe names and IDs
original_feature_names <- as.character(names_df$NAME)
original_feature_ids <- names_df$UNIQID
# Separate the NAME field into parts by splitting on '|'
temp <- strsplit(original_feature_names, '|', fixed = TRUE) # from each name: 5-element list, 1st empty
second_subfield_of_gene_name <- substring(sapply(temp,`[`,3), 2)
rest_of_gene_name <- paste(sapply(temp,`[`,2), sapply(temp,`[`,4), sapply(temp,`[`,5), sep = '')
# Find groups of probes (hopefully, each group is one gene)
# NB: this implementation will miss "chaining" if it is present (i.e. if A matches B on
# the 2nd subfield, but B matches C only on the other subfields.
grouped_features <- vector(mode = 'logical', length = length(names_df$NAME))
feature_to_group_map <- vector(mode = 'numeric', length = length(names_df$NAME))
counter <- 0
for (i in 1:length(grouped_features)) {
if (grouped_features[i]) {
next
}
counter <- counter + 1
feature_to_group_map[i] <- counter
grouped_features[i] <- TRUE
if (i == length(grouped_features)) {
next
}
my_second_field <- second_subfield_of_gene_name[i]
my_rest <- rest_of_gene_name[i]
for (j in (i+1):length(grouped_features)) {
if (((nchar(my_second_field) > 2) & (second_subfield_of_gene_name[j] == my_second_field)) |
((nchar(my_rest) > 2) & (rest_of_gene_name[j] == my_rest))) {
feature_to_group_map[j] <- counter
grouped_features[j] <- TRUE
}
}
}
This process grouped the 7,399 probes into 4,372 genes.
The **Execute R Script** module concludes by median-shifting all relative expression measurements and averaging the values for probes in the same gene to produce a relative expression value for each gene. Minor cosmetic changes -- transposing the data frame so that each patient's data is on one row, and extracting the LYM number from each patient's ID to create a new column -- are also performed.
## Adding patient data for survival analysis
Our first modification to the patient dataset is to simplify the column names using a **Metadata Editor** module.
Following Rosenwald et al., we will fit [Cox proportional hazards models](https://en.wikipedia.org/wiki/Proportional_hazards_model) to estimate survival probability from gene expression data. We will use the `coxph()` function found in the R package `survival`, which is pre-installed in Azure Machine Learning Studio. This function expects events to be recorded as binary characters. We therefore add a new column to the patient data, `dead_at_followup`, which has the value 1 if true and 0 otherwise. This new column is appended to the data using a short **Execute R Script** module.
Next, we use a **Join Data** module to link the gene expression data with the patient data, using the LYM number columns as our join keys. The data is then split into training and validation sets using a **Split Data** module. We use the same patient splitting as the authors of the original study by setting the Splitting mode to "Regular Expression" and inputting the regular expression:
\"analysis_set" Training
## Creating survival prediction features
To produce a small number of features from the gene expression data available, we will identify groups of genes whose expression patterns are similar across patients, then average the expression data within each group.
### Hierarchical clustering of genes
We group genes using the hierarchical clustering `hclust()` from the pre-installed `stats` package, which we execute within an **Execute R Script** module. Hierarchical clustering requires establishing a metric for dissimilarity between genes: to handle the presence of missing values in our dataset, we use the `daisy()` function in the pre-installed `cluster` package to calculate Euclidean distance between expression data of each gene pair. After clustering has been performed, we use the `cutree()` function from the `stats` package to assign membership of each gene to one of five groups.
### Identification of genes significantly associated with survival
Some of the clusters above contain more than one thousand genes, many of which are not significantly associated with survival: we remove these genes from consideration before performing the group average in each cluster. For each gene, we produce a Cox proportional hazards model for survival as a function of that gene's log relative expression: we retain genes which predict survival with a p-value less than 0.01 (283 of 4,372). We also remove any genes for which >10% of patients have missing expression data and genes whose expression levels do not vary substantially between patients. Only 68 genes met this combination of criteria: slightly more than might be expected by chance.
### Averaging expression data within clusters
We use **Execute R Script** modules to perform the averaging of genes included in each survival prediction feature. Before performing the average, we multiply each gene's log relative expression value by its coefficient sign. (This ensures genes that are positively vs. negatively associated with survival do not "cancel out" one another when the average is performed.) This procedure is performed in parallel on both the training and validation sets.
## Results
### Cox proportional hazards model
A Cox proportional hazards model was fit to the training data using the five features described above as predictors. A summary of the fit for this model, including a test of the proportional hazards assumption and one-way ANOVA, can be viewed in the output log of the corresponding **Execute R Script** module. The model is applied to the validation data to produce scores reflecting survival likelihood (more negative scores indicate a greater likelihood and longer timeframe of survival).
### Kaplan-Meier plots
The International Prognostic Index (IPI) is an existing, categorical predictor of survival probability that does not use gene expression data. To assess whether our survival prediction model contributes independently of the IPI, we plot Kaplan-Meier curves (fraction of patients surviving vs. time) for patients in three ranges of IPI score: low, medium, and high. Within each IPI range, we stratify patients based on their score in our survival prediction model (top vs. bottom 50%). The generated plots can be displayed by clicking the right output port of each **Execute R Script** module and selecting the Visualize option.

### Predicting survival after three years with the Cox proportional hazards model
Probability of survival after three years was estimated from the Cox proportional hazards model for patients in the validation dataset. In an **Execute R Script** module, a Receiver Operating Characteristic (ROC) curve was produced from these probability estimates: this curve can be viewed by selecting the Visualize option after right-clicking the right output port.

We found that the Cox proportional hazards model gave an area under the curve (AUC) of 0.679.
### Incorporating predictor score and prognostic index into a survival predictor
We trained a **Two-Class Support Vector Machine** to predict survival after three years using the gene expression features we extracted as well as patients' prognostic index (IPI) score. Incorporating this additional data improved the quality of the prediction, achieving an AUC of 0.730.
