BAD Day 1: Multiple hypothesis testing on Golub

In this tutorial we will do multiple hypothesis testing on the previsouly studied Golub data set.

  • Null Hypothesis $H_0$: states the assumption to be tested. We begin by assuming that the null Hypothesis is True
  • Alternative Hypothesis $H_1$: is the opposite of the null hypothesis

1. Loading the dataset


In [1]:
require(multtest)
data(golub)

Gene expression data (3051 genes and 38 tumor mRNA samples) from the leukemia microarray study of Golub et al. (1999). Pre-processing was done as described in Dudoit et al. (2002). The R code for pre-processing is available in the file ../doc/golub.R.


In [2]:
golub.expr<-golub
row.names(golub.expr)=golub.gnames[,3]

Numeric vector indicating the tumor class, 27 acute lymphoblastic leukemia (ALL) cases (code 0) and 11 acute myeloid leukemia (AML) cases (code 1).


In [3]:
colnames(golub.expr)= golub.cl

Setting the sample sizes


In [4]:
n.ALL <- 27
n.AML <- 11
cancer.type <- c(rep("ALL", n.ALL), rep("AML", n.AML))

Adding the cancer type to the column name, for the display


In [5]:
colnames(golub.expr) <-cancer.type

2. t.test with a single gene


In [6]:
g <- 347

Alternatively, you can select a gene randomly


In [7]:
## g <- sample(1:nrow(golub.expr),1)
g.profile <- as.vector(as.matrix(golub.expr[g,]))

Draw a barplot wiht color-coded cancer type


In [9]:
plot.col <- c('ALL'='mediumpurple', 'AML'='midnightblue')

barplot(g.profile, main = paste("Golub (1999), gene", g),
        col = plot.col[cancer.type], border = 'white')

legend('topright', c("ALL","AML"),col = plot.col[c("ALL","AML")],
       pch = 15, bty = "o")
box()

png

Separate data in two vectors


In [10]:
sample.ALL <- g.profile[cancer.type=="ALL"]
sample.AML <- g.profile[cancer.type=="AML"]

Compute manually the t test paramenters (not necessary, just to practice!)

Estimate the population means


In [11]:
mean.est.ALL <- mean(sample.ALL)
mean.est.AML <- mean(sample.AML)

Compute the sample sd

Remember the sd( ) function automatically computes the estimate corrected with sqrt(n-1)


In [12]:
sample.sd.ALL <- sd(sample.ALL) * sqrt((n.ALL-1)/n.ALL)
sample.sd.AML <- sd(sample.AML) * sqrt((n.AML-1)/n.AML)

Estimate the population standard deviation


In [13]:
sd.est.ALL <- sd(sample.ALL)
sd.est.AML <- sd(sample.AML)

Estimate the standard errors on the mean


In [14]:
sd.err.est.ALL <- sd(sample.ALL) / sqrt(n.ALL)
sd.err.est.AML <- sd(sample.AML) / sqrt(n.AML)

Estimate the standard deviation of the difference between the two means according to Student’s formula


In [15]:
diff.sd.est <- sqrt((n.ALL*sample.sd.ALL^2 + n.AML*sample.sd.AML^2) * (1/n.ALL + 1/n.AML) /(n.ALL+n.AML-2))

Compute t.obs


In [16]:
d <- abs(mean.est.ALL - mean.est.AML)
t.obs.Student <- d/diff.sd.est

Compute the P- vale. Since we are performing the two-tail test, the single-tail probability has to be multiplied by 3 in order to obtai the alpha risk


In [17]:
P.val.Student <- 2 * pt(q = t.obs.Student, df = n.ALL + n.AML-2, lower.tail = F)

This is what you should be doing… FAST!

3. Apply the Student-Fischer t-test

Note: this assumes that the two populations have equal variance.


In [18]:
t.student <- t.test(sample.ALL,sample.AML, var.equal=TRUE)
print(t.student)

4. Apply the Welch t-test

Note: this does not assume that the two populations have equal variance.


In [19]:
t.welch <- t.test(sample.ALL,sample.AML, var.equal=FALSE)
print(t.welch)

We want to test all the genes, so we could just loop over what was written before … (note it can take some time)


In [20]:
t.statistics <- vector()
P.values <- vector()
for (g in 1:nrow(golub.expr)) {
  #  print(paste("Testing gene", g))
  g.profile <- as.vector(golub.expr[g,])
  sample.ALL <- g.profile[cancer.type=="ALL"]
  sample.AML <- g.profile[cancer.type=="AML"]
  t <- t.test(sample.ALL,sample.AML)
  t.statistics <- append(t.statistics, t$statistic)
  P.values <- append(P.values, t$p.value)
}


In [21]:
 print(P.values)

For a more efficient way, you can use apply instead (not shown here)


In [22]:
Data=cbind(golub.expr,P.values)
colnames(Data)[39]='Raw.p'

Order data by pvalue


In [23]:
Data = Data[order(Data[,'Raw.p']),]

Perform p-value adjustments and add to the data frame


In [24]:
Datadf <- as.data.frame(Data)


In [25]:
Datadf$Bonferroni =
  p.adjust(Datadf$Raw.p,
           method = "bonferroni")

Datadf$BH =
  p.adjust(Datadf$Raw.p,
           method = "BH")

Datadf$Holm =
  p.adjust(Datadf$ Raw.p,
           method = "holm")

Datadf$Hochberg =
  p.adjust(Datadf$ Raw.p,
           method = "hochberg")

Datadf$Hommel =
  p.adjust(Datadf$ Raw.p,
           method = "hommel")

Datadf$BY =
  p.adjust(Datadf$ Raw.p,
           method = "BY")

5. Plotting the obtained data


In [26]:
X = Datadf$Raw.p
Y = cbind(Datadf$Bonferroni,
          Datadf$BH,
          Datadf$Holm,
          Datadf$Hochberg,
          Datadf$Hommel,
          Datadf$BY)


In [27]:
matplot(X, Y,
        xlab="Raw p-value",
        ylab="Adjusted p-value",
        type="l",
        asp=1,
        col=1:6,
        lty=1,
        lwd=2)

legend('bottomright',
       legend = c("Bonferroni", "BH", "Holm", "Hochberg", "Hommel", "BY"),
       col = 1:6,
       cex = 1,    
       pch = 16)

abline(0, 1,
       col=1,
       lty=2,
       lwd=1)

png

Day1: outline
If download does not start: right-click and 'Download linked file as'