BAD Day 1: Additional
1. Probability distributions
These can be either discrete or continuous (e.g. uniform, Bernoulli, normal), and are defined by a probability mass function $p(x)$ in the discrete case or a density function $f(x)$ in the continuous case.
1.1 Bernoulli distribution Be(p)
Flip a coin $(T = 0, H = 1)$. The probability of H is 0.1.
1.2 Binomial Random sampling
Generate 100 observations from Be(0.1).
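A minimal sketch of this sampling step. R has no dedicated Bernoulli sampler, so a Bernoulli draw is taken here as a binomial draw with a single trial (`size = 1`). The seed and the variable name `coin` are assumptions made only for reproducibility and illustration.

```r
set.seed(1)  # assumed seed, only so the draw can be repeated

# A Be(0.1) observation is a binomial observation with one trial
coin <- rbinom(n = 100, size = 1, prob = 0.1)

table(coin)  # number of tails (0) and heads (1)
mean(coin)   # observed proportion of heads, expected to be near 0.1
```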
1.3 Normal distribution
The data values are members of a normally distributed population with mean $\mu$ and variance $\sigma^2$.
The distribution function gives $P(X \leq x)$, the probability that a value drawn from the population is smaller than or equal to $x$.
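As a small illustration, the distribution function of the normal is available in R as `pnorm`. The sketch below assumes the standard normal ($\mu = 0$, $\sigma = 1$); the variable names are chosen here for illustration.

```r
# P(X <= 0) for a standard normal: half the mass lies below the mean
p_zero <- pnorm(0, mean = 0, sd = 1)
p_zero                      # 0.5

# P(X <= 1.96), roughly 0.975
p_upper <- pnorm(1.96)
p_upper

# P(-1 <= X <= 1) as a difference of two distribution function values
p_within_one_sd <- pnorm(1) - pnorm(-1)
p_within_one_sd             # roughly 0.683
```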
1.4 Normal Random sampling
Generate 1000 observations from N(0,1)
Histograms can be used to estimate densities!
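One possible way to draw the sample and compare its histogram with the true density. The seed, the number of breaks and the name `z` are assumptions; `freq = FALSE` rescales the bars so that their total area is 1, which makes the histogram comparable to a density.

```r
set.seed(1)  # assumed seed, only so the draw can be repeated

z <- rnorm(1000, mean = 0, sd = 1)

# Histogram on the density scale (total bar area equals 1)
h <- hist(z, breaks = 30, freq = FALSE,
          main = "1000 draws from N(0,1)", xlab = "z")

# Overlay the theoretical N(0,1) density for comparison
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, lwd = 2)
```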
1.5 Overview
For practical computations R has built-in functions for the binomial, normal and chi-squared distributions, among others. The prefix d stands for density, p for the (cumulative) probability distribution, q for quantiles, and r for drawing random samples, e.g.
Distribution | parameters | density | distribution | random sampling | quantiles |
---|---|---|---|---|---|
Binomial | n, p | dbinom(x, n, p) | pbinom(x, n, p) | rbinom(10, n, p) | qbinom($\alpha$, n, p) |
Normal | $\mu, \sigma$ | dnorm(x, $\mu, \sigma$) | pnorm(x, $\mu, \sigma$) | rnorm(10, $\mu, \sigma$) | qnorm($\alpha$, $\mu, \sigma$) |
Chi-squared | m | dchisq(x, m) | pchisq(x, m) | rchisq(10, m) | qchisq($\alpha$, m) |
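A few concrete calls may help to see how the four prefixes fit together. The parameter values (n = 10, p = 0.5, one degree of freedom) are chosen here purely for illustration.

```r
# d: probability of exactly 3 successes in 10 trials with p = 0.5
d_val <- dbinom(3, size = 10, prob = 0.5)
d_val                           # 120/1024 = 0.1171875

# p: probability of at most 3 successes
p_val <- pbinom(3, size = 10, prob = 0.5)
p_val                           # 176/1024 = 0.171875

# q: quantiles invert the distribution function
q_norm <- qnorm(0.975, mean = 0, sd = 1)
q_norm                          # roughly 1.96
pnorm(q_norm)                   # back to 0.975

q_chisq <- qchisq(0.95, df = 1)
q_chisq                         # roughly 3.84

# r: ten random draws (values change from run to run)
rbinom(10, size = 10, prob = 0.5)
```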
2. Descriptive statistics
2.1 Quantiles
(Theoretical) quantiles:
The p-quantile is the value with the property that there is a probability p of getting a value less than or equal to it.
Empirical quantiles:
The p-quantile is the value with the property that a fraction p (that is, 100p%) of the observations are less than or equal to it.
They can be easily obtained in R:

Quantile | Value |
---|---|
0% | -2.13649385561006 |
25% | -0.431830039618815 |
50% | -0.072877791466942 |
75% | 0.446190752401337 |
100% | 2.16860031716614 |

Quantile | Value |
---|---|
10% | -0.863526139969441 |
20% | -0.516638898288344 |
90% | 1.01977663867239 |
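A sketch of how such empirical quantiles can be computed with `quantile`. The sample below is an assumption (100 standard normal draws with an arbitrary seed), so the numbers it prints will differ from the values listed above.

```r
set.seed(1)      # assumed seed; the original sample is not known
x <- rnorm(100)  # hypothetical sample

# Default call: minimum, quartiles and maximum
q_default <- quantile(x)
q_default

# Any other probabilities can be requested through `probs`
q_custom <- quantile(x, probs = c(0.1, 0.2, 0.9))
q_custom

# Theoretical N(0,1) quantiles for the same probabilities, for comparison
qnorm(c(0.1, 0.2, 0.9))
```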
2.2 Statistical summary
We often need to quickly quantify a data set. This can be done using a set of summary statistics (e.g. mean, median, variance, standard deviation).
0.00291256261996373
-0.0594198974465339
1.26473767173818
1.04184965729599
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.272000 -0.608800 -0.059420 0.002913 0.655900 2.582000
You can use the summary function on almost any R object! (Remember that R is an object-oriented language, so it has classes and methods.)
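The statistics named in this section can be computed as in the sketch below. The sample is an assumption (100 standard normal draws with an arbitrary seed), so the results will not reproduce the numbers printed above.

```r
set.seed(1)      # assumed seed; the original sample is not known
x <- rnorm(100)  # hypothetical sample

mean(x)    # arithmetic mean
median(x)  # 50% quantile
var(x)     # sample variance
sd(x)      # sample standard deviation, the square root of var(x)

# Six-number summary: minimum, quartiles, mean and maximum
s <- summary(x)
s

# summary() is generic: on a data frame it summarises every column
summary(data.frame(x = x, group = factor(rep(c("a", "b"), 50))))
```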
2.3 Box-plots: understanding the plots
The median of the sample is denoted by the horizontal line within the box. The interquartile range (IQR) is the difference between the 75% and 25% quantiles: IQR = 75% quantile - 25% quantile.
1.291372268193
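A short sketch of drawing a box plot and computing the IQR both by hand and with the built-in `IQR` function. The sample is hypothetical, so the result will differ from the value shown above.

```r
set.seed(1)      # assumed seed; the original sample is not known
x <- rnorm(100)  # hypothetical sample

boxplot(x, main = "Box plot of a N(0,1) sample")

# IQR by hand: 75% quantile minus 25% quantile
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])
iqr_manual

# The built-in function gives the same value
IQR(x)
```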
2.4 QQ plot
Many statistical methods make some assumptions about the distribution of the data.
The quantile-quantile (QQ) plot provides a means to visually verify such assumptions.
The QQ plot shows the theoretical quantiles versus the empirical quantiles. If the assumed (theoretical) distribution is indeed correct, the points will fall approximately on a straight line.
Note that R's qqnorm function compares the data against the normal distribution only!
Clearly the t distribution with two degrees of freedom is different from the normal distribution.
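A sketch of a normal QQ plot for a sample from the t distribution with two degrees of freedom. The sample size and seed are assumptions. The heavy tails of the t distribution show up as points bending away from the reference line at both ends.

```r
set.seed(1)  # assumed seed, only so the draw can be repeated

t_sample <- rt(1000, df = 2)

# Empirical quantiles of the sample against theoretical normal quantiles
qqnorm(t_sample, main = "Normal QQ plot of a t(2) sample")

# Reference line through the first and third quartiles
qqline(t_sample)
```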
Comparing two samples
Exercise: Try with different values of df
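For the two-sample comparison, `qqplot` plots the empirical quantiles of one sample against those of another. The sketch below assumes a normal sample and a t sample; the names `a`, `b` and `df_value` are hypothetical, and `df_value` is the number to change for the exercise.

```r
set.seed(1)    # assumed seed, only so the draws can be repeated
df_value <- 2  # degrees of freedom to vary in the exercise

a <- rnorm(1000)              # sample from N(0,1)
b <- rt(1000, df = df_value)  # sample from the t distribution

# Quantiles of one sample against quantiles of the other
qqplot(a, b, xlab = "N(0,1) sample", ylab = "t sample")

# Points near this line indicate similar distributions
abline(0, 1)
```

As `df_value` grows, the t distribution approaches the normal and the points should move closer to the line.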
2.5 Scatter plots
Biological data sets often contain several variables, so they are multivariate. Scatter plots allow us to look at two variables at a time.
What can you tell about this data?
This kind of plot can be used to assess independence.
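A minimal sketch of a scatter plot. The data set behind the original figure is not available, so the two variables below are simulated and entirely hypothetical: `v` depends weakly and linearly on `u`.

```r
set.seed(1)  # assumed seed, only so the draws can be repeated

# Hypothetical data: v is a noisy linear function of u
u <- rnorm(200)
v <- 0.5 * u + rnorm(200)

plot(u, v, xlab = "u", ylab = "v", main = "Scatter plot of two variables")
```

If the two variables were independent, the point cloud would show no trend or structure.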
2.6 Scatter plots vs. correlation
Note that in the previous example the correlation between the two variables was 0.23
Note that correlation only measures linear dependence.
-0.0532811841413791
What is the correlation?
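To illustrate the point about linear dependence, the sketch below uses simulated, hypothetical data in which one variable is completely determined by the other, yet the correlation is close to zero because the relationship is not linear.

```r
set.seed(1)  # assumed seed, only so the draws can be repeated

u <- runif(500, min = -1, max = 1)
w <- u^2  # w is a deterministic, but non-linear, function of u

plot(u, w, main = "Strong dependence, weak correlation")

cor_nonlinear <- cor(u, w)
cor_nonlinear  # close to 0 despite the perfect dependence

cor_linear <- cor(u, 2 * u + 1)
cor_linear     # 1 for an exact increasing linear relationship
```

Looking at the scatter plot first is therefore a useful check before relying on a correlation value.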