Prepare the data.
The data used in this exercise were generated by whole-genome sequencing of metastatic prostate tumors. This assay allows us to identify structural variations in tumor genomes such as duplications, inversions, and large deletions.
For more details on the analysis and results, see Quigley et al. Cell 2018.
We counted the number of structural variations in each tumor sample. For our purposes, variants come in five types: inversions, deletions, duplications, insertions, and translocations. We also assessed each biopsy to identify biallelic inactivating mutations in each gene. Read in summaries of the number of structural variants present in each tumor and the presence/absence of mutations for a few selected genes.
# Change working_folder to the path where the files are located
working_folder = '/notebook/talks/2022_10_BMS_EDA/EDA'
fn_sum = paste0( working_folder, '/SV_summary_table.txt' )
fn_mut = paste0( working_folder, '/WCDT_mutations.txt')
sv = read.table(fn_sum, sep='\t', header=TRUE, stringsAsFactors=FALSE)
mut = read.table(fn_mut, sep='\t', header=TRUE, stringsAsFactors=FALSE)
# if you don't have ggplot2 or reshape2 installed, un-comment and run the next two lines:
#install.packages("ggplot2")
#install.packages("reshape2")
library(ggplot2)
library(reshape2)
The sv data frame reports counts of five types of structural variants (SV) in 101 patient tumor biopsies.
QUESTION 1.1: Plot a summary of the distributions for each type of SV individually. Do any of the distributions have outliers, defined as “values that exceed the whiskers of a boxplot”?
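The outlier definition used in this question can be made concrete with a small sketch on made-up data. By default, R draws each whisker out to the most extreme point that lies within 1.5 times the box length of the box, and boxplot.stats() returns the points beyond the whiskers in $out. The demo_counts vector below is synthetic, with one planted extreme value, and is not the sv data.

```r
# Synthetic counts with one planted extreme value (an assumption for illustration)
set.seed(1)
demo_counts = c(rpois(50, lambda = 20), 400)

# Whiskers reach the most extreme point within 1.5 x the box length of the box;
# points beyond the whiskers are drawn individually
boxplot(demo_counts, las = 1, ylab = "count")

# boxplot.stats() reports the points that fall beyond the whiskers in $out
outliers = boxplot.stats(demo_counts)$out
print(outliers)
```

The same call works on a single column of a data frame, so it can be applied to each SV type in turn.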
# ANSWER 1.1
QUESTION 1.2 Which SV types are most frequent? Least frequent?
# ANSWER 1.2
Some analyses are contingent on distributional assumptions. For example, they may assume values are normally distributed. We can visually assess whether a sample is consistent with a normal distribution using a QQ plot, where points from a normal sample fall close to the reference line:
set.seed(124)
x=rnorm(100)
qqnorm( x )
qqline( x )
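For contrast, here is a sketch of how a right-skewed sample looks on the same kind of plot. The sample x_skewed is simulated from a log-normal distribution purely for illustration. Points that curve away from the line, especially in the upper tail, indicate a departure from normality. shapiro.test() is included as one optional formal test to accompany the visual check; it is not part of the original exercise.

```r
set.seed(124)
x_skewed = rlnorm(100)   # simulated right-skewed sample (illustration only)

qqnorm( x_skewed )
qqline( x_skewed )

# Shapiro-Wilk test: a small p-value is evidence against normality
sw = shapiro.test( x_skewed )
print( sw$p.value )
```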
QUESTION 2.1 Generate a QQ plot to evaluate the assumption that deletions are normally distributed. Are deletions normally distributed?
# ANSWER 2.1
Data that are not normally distributed can be coerced towards a normal distribution by transforming the data. How could you transform the distribution of deletions so that it’s closer to normal?
QUESTION 3.1 Replot the QQ plot after performing a log transformation to see what effect your transformation had. Did this transformation make the sample more similar to a normal distribution?
Note: leave the data as their original counts for the rest of the questions. Just transform for this question.
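As a sketch of what a log transformation does, the simulated example below (not the sv data) compares a right-skewed count vector before and after log10(x + 1). The + 1 pseudocount is an assumption that keeps zero counts finite. The helper qq_cor() is hypothetical: it uses the correlation between theoretical and sample quantiles only as a rough numeric summary of how straight each QQ plot is.

```r
set.seed(7)
# Simulated right-skewed counts (illustration only)
sim_counts = round(rlnorm(100, meanlog = 4, sdlog = 1))
sim_log = log10(sim_counts + 1)   # transformed copy; the original is untouched

# Side-by-side QQ plots before and after the transformation
par(mfrow = c(1, 2))
qqnorm(sim_counts, main = "raw counts"); qqline(sim_counts)
qqnorm(sim_log, main = "log10(count + 1)"); qqline(sim_log)
par(mfrow = c(1, 1))

# Correlation of theoretical vs. sample quantiles: closer to 1 means straighter
qq_cor = function(v) {
  q = qqnorm(v, plot.it = FALSE)
  cor(q$x, q$y)
}
r_raw = qq_cor(sim_counts)
r_log = qq_cor(sim_log)
print(c(raw = r_raw, log = r_log))
```

Storing the transformed values in a separate variable keeps the original counts available for later questions.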
# ANSWER 3.1
Structural variations arise from DNA damage that is not repaired. By analyzing tumor genomes, we can figure out the kind of DNA damage that occurred by studying the patterns of SVs.
The null model is that there is no relationship between the counts of the different SV types. Alternatively, there might be an association between some of the SV types, suggesting something in common about their etiology.
QUESTION 4.1 Calculate pairwise correlation between all five types of SV and plot the resulting correlation coefficients as a heat map. Try both Pearson correlation (the parametric default method in R) and non-parametric Spearman rank correlation, to see if it matters.
Hint for plotting a simple correlation heatmap, where the matrix X contains the values to plot in the heatmap:
# round the coefficients so the tile labels stay readable
XTidy = melt( round( X, 2 ), value.name="val", varnames = c("x", "y") )
ggplot( XTidy, aes( x, y ) ) +
    geom_tile( aes( fill = val ) ) +
    geom_text( aes( label = val ) )
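To make the hint concrete, the sketch below builds the matrix with cor() and passes it through the same melt() and ggplot() steps. The data frame demo and its column names are made up; with the real data, the matrix would come from the SV count columns of sv, whose names are not shown here. Only the Spearman matrix is plotted; swapping in X_pearson gives the other heatmap.

```r
library(ggplot2)
library(reshape2)

# Synthetic counts: type_a and type_b share a common component, type_c does not
set.seed(42)
base = rpois(80, 30)
demo = data.frame(
  type_a = base + rpois(80, 5),
  type_b = base + rpois(80, 5),
  type_c = rpois(80, 30)
)

# Pairwise correlation matrices, rounded for display
X_pearson  = round(cor(demo, method = "pearson"), 2)
X_spearman = round(cor(demo, method = "spearman"), 2)

# Long format: one row per (x, y) pair, then one tile per pair
XTidy = melt(X_spearman, value.name = "val", varnames = c("x", "y"))
p = ggplot(XTidy, aes(x, y)) +
  geom_tile(aes(fill = val)) +
  geom_text(aes(label = val))
print(p)
```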
# ANSWER 4.1
QUESTION 4.2 Which types of SV are most strongly correlated with each other? Without formal testing, do the correlation data support the null model, or is there reason to investigate an alternative model? Which pairs of SV are most likely to occur in similar counts?
# ANSWER 4.2
QUESTION 4.3 (BONUS): Why are the correlation values so different when comparing Spearman rank correlation to Pearson correlation? What might be driving these differences? Does it matter?
# ANSWER 4.3
Let’s drill down on two contrasting pairs of SVs:
QUESTION 5.1 Create two scatter plots: duplications vs inversions, and deletions vs. inversions. Based on question 4, we’d expect the counts of these SVs to be somewhat correlated with each other. Does this hold up? Are there samples that are outliers from the linear trend in these comparisons?
Samples that deviate from a relationship like this might be of particular interest.
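One possible way to flag samples that deviate from a linear trend is to fit a line and look at standardized residuals. The sketch below uses simulated x and y values with one planted deviating point; the cutoff of 3 is an arbitrary assumption, and this is an illustration rather than the method the exercise requires.

```r
set.seed(3)
n = 50
x = rpois(n, 40)                  # simulated counts of one SV type
y = 2 * x + rnorm(n, sd = 5)      # second type, linearly related plus noise
planted = 10
y[planted] = y[planted] + 100     # one sample pushed far off the trend

# Fit the linear trend and standardize the residuals
fit = lm(y ~ x)
std_resid = rstandard(fit)
flagged = which(abs(std_resid) > 3)   # arbitrary cutoff, an assumption

# Scatter plot with the fitted line; flagged samples in red
plot(x, y, las = 1)
abline(fit)
points(x[flagged], y[flagged], col = "red", pch = 19)
```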
# ANSWER 5.1
Let’s drill down on the duplications. Look at your plot comparing the number of duplications to the number of inversions. Note that there are three samples that have far more duplications than any other sample, and four samples that have both a large number of inversions and a large (though not the highest) number of duplications. Looking at this plot, we might wonder if there is something special about the three samples that have a lot of duplications but not a lot of inversions.
The mut data frame that you loaded contains one row for each sample and one column for each of 15 genes, with a TRUE value if there is a biallelic inactivation of that gene in that sample.
QUESTION 6.1 We’ll test the hypothesis that tumors with a particular gene inactivation acquire a lot more duplications than tumors lacking that inactivation. Test this hypothesis by performing a Wilcoxon test comparing the number of duplications in tumors with vs. without mutation in each gene. Create a barplot of -log10( p ) and nominate the strongest hit as worthy of further investigation.
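The general pattern of running one Wilcoxon test per gene can be sketched on simulated data. Everything below is made up: demo_mut is a hypothetical logical matrix with placeholder gene names, and demo_dup is a simulated count vector in which only GENE_A carriers were given extra duplications. exact = FALSE is set because tied counts prevent an exact p-value.

```r
set.seed(11)
n = 60
# Hypothetical mutation calls: one logical column per gene
demo_mut = cbind(
  GENE_A = rep(c(TRUE, FALSE), times = c(15, 45)),
  GENE_B = rbinom(n, 1, 0.3) == 1,
  GENE_C = rbinom(n, 1, 0.3) == 1
)
# Simulated duplication counts, with a planted effect for GENE_A carriers
demo_dup = rpois(n, 30)
demo_dup[demo_mut[, "GENE_A"]] = demo_dup[demo_mut[, "GENE_A"]] + 100

# One Wilcoxon rank-sum test per gene: mutated vs. unmutated samples
pvals = sapply(colnames(demo_mut), function(g) {
  has_mut = demo_mut[, g]
  wilcox.test(demo_dup[has_mut], demo_dup[!has_mut], exact = FALSE)$p.value
})

barplot(-log10(pvals), las = 1, ylab = "-log10(p)")
```

Because sapply() iterates over the column names, the resulting p-value vector is named by gene, which labels the bars automatically.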
# ANSWER 6.1
Now let’s take a close look at deletions. Look at the scatter plot you made comparing inversions vs deletions. There tends to be a linear relationship, even for samples with large numbers of inversions, but there is a set of samples that have lots of deletions but not a lot of inversions.
QUESTION 7.1 You might hypothesize that tumors that have inactivated a DNA repair gene harbor a lot more deletions than tumors without that inactivation. Test this hypothesis by performing a Wilcoxon test comparing the number of deletions in tumors with vs. without mutation in each gene. This time, instead of a barplot, create a volcano plot to compare the difference in means (as the effect) vs. the -log10(p) as the statistical strength. Does this analysis nominate any gene as associated with large numbers of deletions?
# ANSWER 7.1
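effect_size = ?   # fill in: difference in mean deletions (mutated minus unmutated), one value per gene
logp = ?          # fill in: -log10 of the Wilcoxon p-value, one value per gene
gene_names = dimnames(mut)[[2]]
plot( effect_size, logp, las=1,
      xlim=c(-250,250),
      ylim=c(0,7),
      xlab="change in deletions associated with mutation",
      ylab="-log10(p)" )
text( effect_size, logp + 0.25, gene_names ) # show gene names on the plot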
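How the two axes of a volcano plot fit together can be sketched on simulated data. The matrix demo_mut, the gene names, and the count vector demo_del are all hypothetical, with a planted increase in deletions for GENE_A carriers only. The effect is taken as the mean in mutated samples minus the mean in unmutated samples, which is one reasonable reading of "difference in means".

```r
set.seed(21)
n = 60
# Hypothetical mutation calls and simulated deletion counts
demo_mut = cbind(
  GENE_A = rep(c(TRUE, FALSE), times = c(12, 48)),
  GENE_B = rbinom(n, 1, 0.3) == 1
)
demo_del = rpois(n, 50)
demo_del[demo_mut[, "GENE_A"]] = demo_del[demo_mut[, "GENE_A"]] + 150

# For each gene, return the effect size and -log10(p) as a 2-row matrix
stats = sapply(colnames(demo_mut), function(g) {
  m = demo_mut[, g]
  p = wilcox.test(demo_del[m], demo_del[!m], exact = FALSE)$p.value
  c(effect = mean(demo_del[m]) - mean(demo_del[!m]), logp = -log10(p))
})

# Volcano plot: effect size on x, statistical strength on y
plot(stats["effect", ], stats["logp", ], las = 1,
     xlab = "difference in mean deletions (mutated - unmutated)",
     ylab = "-log10(p)")
text(stats["effect", ], stats["logp", ], colnames(stats), pos = 3)
```

Genes of interest sit toward the top of the plot and far from zero on the x axis, combining a small p-value with a large effect.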
QUESTION 7.2 (BONUS) Re-run the test you performed in question 7.1, using a t-test rather than a Wilcoxon test. Explain why these tests can perform differently and produce different results. Discuss the difference between ranking candidates based on P value alone and based on a combination of P value and effect size.
# ANSWER 7.2