BAD Day 1: Tutorial
0. Source/install the needed packages
1. Exploratory data analysis
We will be usig the Gene Expression dataset from Golub et al (1999). The
gene expression data collected by Golub et al. (1999) are among the most
classical in bioinformatics. A selection of the set is called golub
which is
contained in the multtest
package loaded before.
The data consist of gene expression values of 3051 genes (rows) from 38 leukemia patients Pre-processing was done as described in Dudoit et al. (2002). The R code for pre-processing is available in the file ../doc/golub.R.
Source: Golub et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, Vol. 286:531-537. (http://www-genome.wi.mit.edu/MPR/).
Data set values:
golub
: matrix of gene expression levels for the 38 tumor mRNA samples, rows correspond to genes (3051 genes) and columns to mRNA samples.golub.cl
: numeric vector indicating the tumor class, 27 acute lymphoblastic leukemia (ALL) cases (code 0) and 11 acute myeloid leukemia (AML) cases (code 1).golub.names
: a matrix containing the names of the 3051 genes for the expression matrix golub. The three columns correspond to the gene index, ID, and Name, respectively.
- 3051
- 38
-1.45769 | -1.39420 | -1.42779 | -1.40715 | -1.42668 | -1.21719 | -1.37386 | -1.36832 | -1.47649 | -1.21583 | ⋯ | -1.08902 | -1.29865 | -1.26183 | -1.44434 | 1.10147 | -1.34158 | -1.22961 | -0.75919 | 0.84905 | -0.66465 |
-0.75161 | -1.26278 | -0.09052 | -0.99596 | -1.24245 | -0.69242 | -1.37386 | -0.50803 | -1.04533 | -0.81257 | ⋯ | -1.08902 | -1.05094 | -1.26183 | -1.25918 | 0.97813 | -0.79357 | -1.22961 | -0.71792 | 0.45127 | -0.45804 |
0.45695 | -0.09654 | 0.90325 | -0.07194 | 0.03232 | 0.09713 | -0.11978 | 0.23381 | 0.23987 | 0.44201 | ⋯ | -0.43377 | -0.10823 | -0.29385 | 0.05067 | 1.69430 | -0.12472 | 0.04609 | 0.24347 | 0.90774 | 0.46509 |
3.13533 | 0.21415 | 2.08754 | 2.23467 | 0.93811 | 2.24089 | 3.36576 | 1.97859 | 2.66468 | -1.21583 | ⋯ | 0.29598 | -1.29865 | 2.76869 | 2.08960 | 0.70003 | 0.13854 | 1.75908 | 0.06151 | 1.30297 | 0.58186 |
2.76569 | -1.27045 | 1.60433 | 1.53182 | 1.63728 | 1.85697 | 3.01847 | 1.12853 | 2.17016 | -1.21583 | ⋯ | -1.08902 | -1.29865 | 2.00518 | 1.17454 | -1.47218 | -1.34158 | 1.55086 | -1.18107 | 1.01596 | 0.15788 |
2.64342 | 1.01416 | 1.70477 | 1.63845 | -0.36075 | 1.73451 | 3.36576 | 0.96870 | 2.72368 | -1.21583 | ⋯ | -1.08902 | -1.29865 | 1.73780 | 0.89347 | -0.52883 | -1.22168 | 0.90832 | -1.39906 | 0.51266 | 1.36249 |
The gene names are collected in the matrix golub.gnames
of which the columns
correspond to the gene index, ID, and Name, respectively.
AFFX-HUMISGF3A/M97935_MA_at | -1.45769 | -1.39420 | -1.42779 | -1.40715 | -1.42668 | -1.21719 | -1.37386 | -1.36832 | -1.47649 | -1.21583 | ⋯ | -1.08902 | -1.29865 | -1.26183 | -1.44434 | 1.10147 | -1.34158 | -1.22961 | -0.75919 | 0.84905 | -0.66465 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AFFX-HUMISGF3A/M97935_MB_at | -0.75161 | -1.26278 | -0.09052 | -0.99596 | -1.24245 | -0.69242 | -1.37386 | -0.50803 | -1.04533 | -0.81257 | ⋯ | -1.08902 | -1.05094 | -1.26183 | -1.25918 | 0.97813 | -0.79357 | -1.22961 | -0.71792 | 0.45127 | -0.45804 |
AFFX-HUMISGF3A/M97935_3_at | 0.45695 | -0.09654 | 0.90325 | -0.07194 | 0.03232 | 0.09713 | -0.11978 | 0.23381 | 0.23987 | 0.44201 | ⋯ | -0.43377 | -0.10823 | -0.29385 | 0.05067 | 1.69430 | -0.12472 | 0.04609 | 0.24347 | 0.90774 | 0.46509 |
AFFX-HUMRGE/M10098_5_at | 3.13533 | 0.21415 | 2.08754 | 2.23467 | 0.93811 | 2.24089 | 3.36576 | 1.97859 | 2.66468 | -1.21583 | ⋯ | 0.29598 | -1.29865 | 2.76869 | 2.08960 | 0.70003 | 0.13854 | 1.75908 | 0.06151 | 1.30297 | 0.58186 |
AFFX-HUMRGE/M10098_M_at | 2.76569 | -1.27045 | 1.60433 | 1.53182 | 1.63728 | 1.85697 | 3.01847 | 1.12853 | 2.17016 | -1.21583 | ⋯ | -1.08902 | -1.29865 | 2.00518 | 1.17454 | -1.47218 | -1.34158 | 1.55086 | -1.18107 | 1.01596 | 0.15788 |
AFFX-HUMRGE/M10098_3_at | 2.64342 | 1.01416 | 1.70477 | 1.63845 | -0.36075 | 1.73451 | 3.36576 | 0.96870 | 2.72368 | -1.21583 | ⋯ | -1.08902 | -1.29865 | 1.73780 | 0.89347 | -0.52883 | -1.22168 | 0.90832 | -1.39906 | 0.51266 | 1.36249 |
- 'AFFX-HUMISGF3A/M97935_MA_at (endogenous control)'
- 'AFFX-HUMISGF3A/M97935_MB_at (endogenous control)'
- 'AFFX-HUMISGF3A/M97935_3_at (endogenous control)'
- 'AFFX-HUMRGE/M10098_5_at (endogenous control)'
- 'AFFX-HUMRGE/M10098_M_at (endogenous control)'
- 'AFFX-HUMRGE/M10098_3_at (endogenous control)'
- 'AFFX-HUMGAPDH/M33197_5_at (endogenous control)'
- 'AFFX-HUMGAPDH/M33197_M_at (endogenous control)'
- 'AFFX-HSAC07/X00351_5_at (endogenous control)'
- 'AFFX-HSAC07/X00351_M_at (endogenous control)'
- 'AFFX-HUMTFRR/M11507_5_at (endogenous control)'
- 'AFFX-HUMTFRR/M11507_M_at (endogenous control)'
- 'AFFX-HUMTFRR/M11507_3_at (endogenous control)'
- 'AFFX-M27830_5_at (endogenous control)'
- 'AFFX-M27830_M_at (endogenous control)'
- 'AFFX-M27830_3_at (endogenous control)'
- 'AFFX-HSAC07/X00351_3_st (endogenous control)'
- 'AFFX-HUMGAPDH/M33197_M_st (endogenous control)'
- 'AFFX-HUMGAPDH/M33197_3_st (endogenous control)'
- 'AFFX-HSAC07/X00351_M_st (endogenous control)'
Twenty-seven patients are diagnosed as acute lymphoblastic leukemia (ALL) and eleven as acute myeloid leukemia (AML). The tumor class is given by the numeric vector golub.cl, where ALL is indicated by 0 and AML by 1.
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AFFX-HUMISGF3A/M97935_MA_at | -1.45769 | -1.39420 | -1.42779 | -1.40715 | -1.42668 | -1.21719 | -1.37386 | -1.36832 | -1.47649 | -1.21583 | ⋯ | -1.08902 | -1.29865 | -1.26183 | -1.44434 | 1.10147 | -1.34158 | -1.22961 | -0.75919 | 0.84905 | -0.66465 |
AFFX-HUMISGF3A/M97935_MB_at | -0.75161 | -1.26278 | -0.09052 | -0.99596 | -1.24245 | -0.69242 | -1.37386 | -0.50803 | -1.04533 | -0.81257 | ⋯ | -1.08902 | -1.05094 | -1.26183 | -1.25918 | 0.97813 | -0.79357 | -1.22961 | -0.71792 | 0.45127 | -0.45804 |
AFFX-HUMISGF3A/M97935_3_at | 0.45695 | -0.09654 | 0.90325 | -0.07194 | 0.03232 | 0.09713 | -0.11978 | 0.23381 | 0.23987 | 0.44201 | ⋯ | -0.43377 | -0.10823 | -0.29385 | 0.05067 | 1.69430 | -0.12472 | 0.04609 | 0.24347 | 0.90774 | 0.46509 |
AFFX-HUMRGE/M10098_5_at | 3.13533 | 0.21415 | 2.08754 | 2.23467 | 0.93811 | 2.24089 | 3.36576 | 1.97859 | 2.66468 | -1.21583 | ⋯ | 0.29598 | -1.29865 | 2.76869 | 2.08960 | 0.70003 | 0.13854 | 1.75908 | 0.06151 | 1.30297 | 0.58186 |
AFFX-HUMRGE/M10098_M_at | 2.76569 | -1.27045 | 1.60433 | 1.53182 | 1.63728 | 1.85697 | 3.01847 | 1.12853 | 2.17016 | -1.21583 | ⋯ | -1.08902 | -1.29865 | 2.00518 | 1.17454 | -1.47218 | -1.34158 | 1.55086 | -1.18107 | 1.01596 | 0.15788 |
AFFX-HUMRGE/M10098_3_at | 2.64342 | 1.01416 | 1.70477 | 1.63845 | -0.36075 | 1.73451 | 3.36576 | 0.96870 | 2.72368 | -1.21583 | ⋯ | -1.08902 | -1.29865 | 1.73780 | 0.89347 | -0.52883 | -1.22168 | 0.90832 | -1.39906 | 0.51266 | 1.36249 |
Note that sometimes it is better to construct a factor which indicates the tumor
class of the patients. Such a factor could be used for instance to separate the
tumor groups for plotting purposes. The factor (gol.fac
) can be contructed as
follows.
The labels correspond to the two tumor classes. The evaluation of gol.fac==”ALL” returns TRUE for the first twenty-seven values and FALSE for the remaining eleven, which is useful as a column index for selecting the expression values of the ALL patients. The expression values of gene CCND3 Cyclin D3 from the ALL patients can now be printed to the screen, as follows.
- 1
- 0.88941
- 1
- 1.45014
- 1
- 0.42904
- 1
- 0.82667
- 1
- 0.63637
- 1
- 1.0225
- 1
- 0.12758
- 1
- -0.74333
- 1
- 0.73784
- 1
- 0.4947
- 1
- 1.12058
Creating the exploratory plots
1.1. Plotting the value of gene (CCND3) in all nRNA samples (M92287_at)
We shall first have a look at the expression values of a gener with manufacurer
name M92278_at
, which is known in biology as “CCND3 Cyclin D3”.
The expression values of this gene are collected in row 1042 of golub. To load the data and to obtain the relevant information from row 1042 of golub.gnames, use the following:
The data has now been stored in the golub
matrix. We will now plot the
expression values od the gene CCND3 Cyclin D3.
In the previous plot we just used the default plotting preferences within R base plotting.We can do some improvements so that the plot is easily understood.
In this plot the vertical axis corresponds to the size of the expression values and the horizontal axis the index of the patients.
1.2. Gene expression between patient 1 (ALL) and patient 38 (AML)
Adding diagonal lines to the plot and changing axes labels
1.3. Scatter plots to detect independence
1.4. Bar plot of 4 cyclin genes expression values in 3 ALL and AML patients
We will analyse the expression values of the D13639_at, M92287_at, U11791_at,
Z36714_AT
genes in three chosen AML and ALL patients
0 | 0 | 0 | 1 | 1 | 1 | |
---|---|---|---|---|---|---|
D13639_at | 2.09511 | 1.71953 | -1.46227 | -0.92935 | -0.11091 | 1.15591 |
M92287_at | 2.10892 | 1.52405 | 1.96403 | 0.73784 | 0.49470 | 1.12058 |
U11791_at | -0.11439 | -0.72887 | -0.39674 | -0.94364 | 0.05047 | 0.05905 |
Z36714_at | -1.45769 | -1.39420 | -1.46227 | -1.39906 | -1.34579 | -1.32403 |
The plot is not very easy to read, so we will add some colours and a legend so that we know which gene each bar segment corresponds to.
In this case the patients are indicated on the X
axis (0 and 1 respectively)
while the gene expression level is indicate on the Y
axis.
We can make some improvements to the plots.
Let’s have a look at the barplot
arguments:
We are going to focus on only a few of the histgram arguments:
beside
:TRUE
for the bars to be displayed as justapoxed bars,FALSE
for stacked barshoriz
:FALSE
bars displayed vertically with the first bar to the left,TRUE
bars are displayed horizontally with the first at the bottom.ylim
,xlim
: limits for the y and x axescol
: colour choices
In the plot above we presented the barplots horizontally and added some colours, which makes it easier to understand the data presented. You can also use the barplots to represent the mean and standard error which we will be doing in the following sections.
1.5. Plotting the mean
In the following we will compute the mean for the expression values of both the ALL and AML patients. We will be using the same 4 cycline genes used in the example above.
First we will compute the ALL and AML for all the patients. Once the means are computed they are combined into a single data frame.
Finally, the means are plotted using the barplot
function.
1.6. Adding error bars to the previous plot
In the previous section we computed the mean expression level for 4 cycline genes between the AML and ALL patients. Sometimes it is useful to add error bars to the plots (as the one above) to convey the uncertainty in the data presented.
For such a purpose we often use the Standard Deviation:
which in turn tells us how much the values in a certain group tend to deviate from their mean value.
Let’s start calculating the Standard Deviation of the data.
Another measure used to quantify the deviation is the standard error, which qutifies the variability in the means of our groups instead of reporting the variability among the data points.
A relatively straigtforward way to compute this is by assuming if we were to repeat a given experiment many many times, then it would roughly follow a normal distribution. Note – this is a big assumption. hence, if we assuemt hat the means follow a nosmal distribution, then the standard error (a.k.a. variability of group means) can be defined as:
which in layman terms can be read as “take the general variability of the points around their group means (the standard deviation), and scale this number by the number of points that you’ve collected”.
Since we have already computed the SD we can now compute the standard error (SE).
Now we can create a plot of the mean data as well as the SE and SD.
Note that the error bars for the SE are smaller than those for the SD. This is no coincidence!
As we increase N (in the SE equation), we will decrease the error. Hence the standard error will always be smaller than the SD.
2. Data representation
This section presents some essential manners to display and visualize data.
2.1 Frequency table
Discrete data occur when the values naturally fall into categories. A frequency table simply gives the number of occurrences within a category.
A gene consists of a sequence of nucleotides (A; C; G; T)
The number of each nucleotide can be displayed in a frequency table.
This will be illustrated by the Zyxin gene which plays an important role in cell adhesion The accession number (X94991.1) of one of its variants can be found in a data base like NCBI (UniGene). The code below illustrates how to read the sequence ”X94991.1” of the species homo sapiens from GenBank, to construct a pie from a frequency table of the four nucleotides .
a c g t
410 789 573 394
2.2 Stripcharts
An elementary method to visualize data is by using a so-called stripchart, by which the values of the data are represented as e.g. small boxes it is useful in combination with a factor that distinguishes members from different experimental conditions or patients groups.
Once again we use the CCND3 Cyclin D3 data to generate the plots.
From the above figure, it can be observed that the CCND3 Cyclin D3 expression values of the ALL patients tend to have larger expression values than those of the AML patient.
2.3 Histograms
Another method to visualize data is by dividing the range of data values into a number of intervals and to plot the frequency per interval as a bar. Such a plot is called a histogram.
We will now generate a histogram of the expression values of gene CCND3 Cyclin D3 as well as all the genes for the AML and ALL patients contained in the Golub dataset.
2.3 Boxplots
A popular method to display data is by drawing a box around the 1st and the 3rd quartile (a bold line segment for the median), and the smaller line segments (whiskers) for the smallest and the largest data values.
Such a data display is known as a box-and-whisker plot.
We will start by creating a vector with gene expression values sorted in
ascending order (using the sort
function).
- 0
- 0.45827
- 0
- 1.10546
- 0
- 1.27645
- 0
- 1.32551
- 0
- 1.36844
A view on the distribution of the gene expression values of the ALL
and AML
patients on gene CCND3 Cyclin D3 can be obtained by generating two separate
boxplots adjacent to each other:
It can be observed that the gene expression values for ALL are larger than those for AML. Furthermore, since the two sub-boxes around the median are more or less equally wide, the data are quite symmetrically distributed around the median.
We can create a histogram of the expression values of gene CCND3 Cyclin D3 of the acute lymphoblastic leukemia patients e.g.
Now we can observe the distribution of all gene expressions values in all 38 patients
To compute the exact values for the quartiles we need a sequence running from 0 to 1 with increments in steps of 0.25
- 0%
- 0.45827
- 25%
- 1.796065
- 50%
- 1.92776
- 75%
- 2.178705
- 100%
- 2.7661
Outliers are data points lying far apart from the pattern set by the majority of the data values. The implementation in R of the boxplot draws such outliers as smalle circles.
A data point x
is defined (graphically, not statistically) as an outlier point
if
2.4 Q-Q plots (Quantile-quantile plots)
A method to visualize the distribution of gene expression values is y the so- called quantile-quantile (Q-Q) plots. In such a plot the quantiles of the gene expression values are displayed against the corresponding quantiles of the normal distribution (bell-shaped).
A straight line is added to represent the points which correspond exactly to the quantiles of the normal distribution. By observing the extent in which the points appear on the line, it can be evaluated to what degree the data are normally distributed. That is, the closer the gene expression values appear to the line, the more likely it is that the data are normally distributed.
To produce a Q-Q plot of the ALL gene expression values of CCND3 Cyclin D3 one may use the following.
It can be seen that most of the data points are on or near the straight line, while a few others are further away. The above example illustrates a case where the degree of non-normality is moderate so that a clear conclusion cannot be drawn.
3. Loading tab-delimited data
‘data.frame’
Now try and do some exploratory analysis of your own on this data!
GvHD flow cytometry data
Only exract the CD3 positive cells
0.956021382271511