BAD DAY 2: Dimensionality reduction
1. PCA
(for more examples see http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual )
Principal Component Analysis (PCA) is a data reduction technique that allows to simplify multidimensional data sets to a lower dimensional space ( tipically using 2 or 3 dimensions) for a first broad visual analysis of the data (via plotting and visual variance analysis).
The prcomp( )
and princomp( )
R functions are devoted to PCA.
The BioConductor library pcaMethods
provides many additional PCA functions.
We are going to use the IRIS data set to demonstrate the use of PCA.
This data was used to show that these measurements could be used to differentiate between species of irises.
This data set contains the sepal/petal length and width measurements in centimeters for 50 flowers from each of 3 species of iris.
The Iris species are: setosa, versicolor, and virginica.
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Now we will create a scatterplot of the dimensions against each other for the three species.
PCA is used to create linear combinations of the original data that capture as
much information in the original data as possible.
We will use the prcomp
function.
-
If we work with standardized data (mean 0 std 1) we need to calculate the principal components through the correlation matrix.
-
If we work with raw data we need to calculate the principal components through the covariance matrix.
It is good practice to standardize our variables when these have different units and have very different variances. If they are in the same units either of the alteratives is appropriate.
In our example all variables are measured in centimetres but we will use the correlation matrix for simplicity’s sake.
First, we are going to examine the variability of all the numeric values
- Sepal.Length
- 0.685693512304251
- Sepal.Width
- 0.189979418344519
- Petal.Length
- 3.11627785234899
- Petal.Width
- 0.581006263982103
- 0.189979418344519
- 3.11627785234899
Maybe this range of variability is big in this context.
Thus, we will use the correlation matrix.
For this, we must standardize our variables using the scale()
function:
- Sepal.Length
- 1
- Sepal.Width
- 1
- Petal.Length
- 1
- Petal.Width
- 1
If we use the prcomp()
function, we indicate scale=TRUE
to use the
correlation matrix.
Standard deviations:
[1] 1.7083611 0.9560494 0.3830886 0.1439265
Rotation:
PC1 PC2 PC3 PC4
Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
This gives us the standard deviation of each component, and the proportion of
variance explained by each component.
The standard deviation is stored in (see str(pca)
):
- 1.70836114932762
- 0.956049408486857
- 0.38308860015839
- 0.143926496617611
Usually the trend is to select up to 3 components (for visualization purposes) but we can have an indication on how many principal components should be retained.
Usually a scree plot helps making a decision. In R we can use the screeplot()
function:
This plot together with the values of the Proportion of Variance and Cumulative
Proportion obtained by means of summary(pca)
suggests that we can retain just
the first 2 components accounting for over 95% of the variation in the original
data!
PC1 | PC2 | PC3 | PC4 | |
---|---|---|---|---|
Sepal.Length | 0.5210659 | -0.37741762 | 0.7195664 | 0.2612863 |
Sepal.Width | -0.2693474 | -0.92329566 | -0.2443818 | -0.1235096 |
Petal.Length | 0.5804131 | -0.02449161 | -0.1421264 | -0.8014492 |
Petal.Width | 0.5648565 | -0.06694199 | -0.6342727 | 0.5235971 |
The weights of the PC1 are all similar but the one associated to Sepal.Width variable (it is negative). This one principal component accounts for over 72% of the variability in the data.
All weights on the second principal component are negative. Thus the PC2 might seem considered as an overall size measurement. This component explain the 23% of the variability.
The following figure show the first two components and the observations on the same diagram, which helps to interpret the factorial axes while looking at observations location.
The first component discriminate on one side the Sepal.Width and on the other side the rest of variables (see biplot). According to the second component, When the iris has larger sepal and petal values than average, the PC2 will be smaller than average.
- setosa
- versicolor
- virginica
- 2
- 3
- 4
2. TSNE and comparison with PCA
The t-Distributed Stochastic Neighbor Embedding (t-SNE algorithm), by Laurens van der Maaten and Geoffrey Hinton, is a NON LINEAR dimensionality reduction algorithm. It enables the representation of high-dimensional data in two or three dimensions and allows visualization via scatter plots.
Barnes-Hut-SNE is a further improvement of the algorithm by Laurens van der Maaten, which uses Barnes-Hut approximations to significantly improve computational speed (O(N log N) instead of O(N2)). This makes it feasible to apply the algorithm to larger data sets.
We will follow the example proposed oh github <https://github.com/lmweber/Rtsne- example> using “Rtsne”, a package by Jesse Krijthe provides an R wrapper function for the C++ implementation of the Barnes-Hut-SNE algorithm.
The data set used in this example is the healthy human bone marrow data set
Marrow1
.
Here we will follow the example on GITHUB https://github.com/lmweber/Rtsne- example. The Marrow1 dataset is made of cells from different cell populations (types).
We will see how tsne will be able to group them as distinct clusters of points in the 2-dimensional projection. There is clear visual separation between clusters.
Amir et al. (2013) also independently verified the interpretation of the clusters using “manual gating” methods (visual inspection of 2-dimensional scatter plots), confirming that several clusters represent well-known cell types from immunology.
Event# | Time | Cell Length | 191-DNA | 193-DNA | 115-CD45 | 110,111,112,114-CD3 | 139-CD45RA | 141-pPLCgamma2 | 142-CD19 | ⋯ | 166-IkBalpha | 167-CD38 | 168-pH3 | 170-CD90 | 169-pP38 | 171-pBtk/Itk | 172-pS6 | 174-pSrcFK | 176-pCREB | 175-pCrkL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22933 | 42448 | 14.39525 | 321.2244 | 434.9970 | 142.39886 | 18.5329704 | 117.579620 | 4.8356605 | 50.8025017 | ⋯ | 2.019981 | 32.1595116 | 14.3242693 | 8.78977966 | 53.843872 | 82.910294 | 6.8211913 | 5.774820 | 12.3001671 | 1.4769655 |
228744 | 667323 | 41.27012 | 209.3467 | 563.6180 | 50.66235 | 0.2076987 | 19.171650 | -1.3943130 | 10.0177841 | ⋯ | 4.779097 | -0.5546998 | 1.5282539 | -0.26177970 | 17.979921 | 1.054635 | 5.0797348 | 7.080018 | -1.7290312 | -1.0569905 |
230819 | 549730 | 28.91957 | 829.5733 | 1408.7177 | 503.61078 | 294.2422791 | 5.135233 | 0.8516778 | -0.1638297 | ⋯ | 12.946030 | 47.7583084 | 18.4628849 | 0.54432803 | 37.236912 | 38.177971 | 79.4033279 | 2.626387 | 23.7090988 | -1.4266791 |
131780 | 359184 | 26.11232 | 643.0649 | 992.3835 | 178.37289 | 359.2935181 | 79.340591 | -0.2172089 | 5.7050629 | ⋯ | 24.191151 | 49.7774315 | 5.1872330 | -0.07430215 | 6.879214 | 11.992647 | -0.2527083 | 9.006838 | -0.6766897 | -0.4470886 |
32212 | 56694 | 36.61685 | 713.0907 | 1131.0656 | 34.13390 | -12.1336746 | 1.377896 | 3.1527698 | -0.2193382 | ⋯ | 3.426270 | 42.7431793 | -0.8187979 | 9.73161793 | 10.562280 | 26.534126 | -0.7640409 | 50.811363 | 16.6444397 | 3.9240763 |
403958 | 834589 | 29.39848 | 638.0286 | 968.7856 | 288.91412 | 65.9604187 | 6.649710 | -0.7853550 | -1.4377729 | ⋯ | 1.359477 | 89.9450684 | 15.9464016 | 3.46092296 | 4.090930 | 51.456581 | 7.6963782 | 28.520178 | 5.2173581 | -0.3729125 |
- 'Event#'
- 'Time'
- 'Cell Length'
- '191-DNA'
- '193-DNA'
- '115-CD45'
- '110,111,112,114-CD3'
- '139-CD45RA'
- '141-pPLCgamma2'
- '142-CD19'
- '144-CD11b'
- '145-CD4'
- '146-CD8'
- '148-CD34'
- '150-pSTAT5'
- '147-CD20'
- '152-Ki67'
- '154-pSHP2'
- '151-pERK1/2'
- '153-pMAPKAPK2'
- '156-pZAP70/Syk'
- '158-CD33'
- '160-CD123'
- '159-pSTAT3'
- '164-pSLP-76'
- '165-pNFkB'
- '166-IkBalpha'
- '167-CD38'
- '168-pH3'
- '170-CD90'
- '169-pP38'
- '171-pBtk/Itk'
- '172-pS6'
- '174-pSrcFK'
- '176-pCREB'
- '175-pCrkL'
- '144-CD11b'
- '160-CD123'
- '142-CD19'
- '147-CD20'
- '110,111,112,114-CD3'
- '158-CD33'
- '148-CD34'
- '167-CD38'
- '145-CD4'
- '115-CD45'
- '139-CD45RA'
- '146-CD8'
- '170-CD90'
- 10000
- 36
- 9902
- 13
Running the Rtsne (Barnes-Hut_SNE algorithm) without PCA step (see Amir et al. 2013, Online Methods, “viSNE analysis”)
Compare to PCA
About the interpretation of TSNE
SOME EXAMPLES FROM https://gist.github.com/mikelove/74bbf5c41010ae1dc94281cface90d32 on the delicate interpretation of tsne when the underlyng structure of the data is LINEAR
Explore tsne on linear data. First install (if needed) and load the required packages.
Error in runif(n, -1, 1): object 'n' not found
Traceback:
1. runif(n, -1, 1) # at line 5 of file <text>