2024-02-22
In the first 6 weeks we covered
Basic R
Tidyverse data manipulations
Plotting with ggplot
Over the next 3 lectures, we will use these R skills to work with data and understand the statistics behind t-tests
The origins of normal distributions and why we see them everywhere
Central limit theorem visually
Sample means make normal distributions
Explore these for yourself with the worksheet (last 30 mins)
Worksheets are on the class-worksheets GitHub
Watch the full video / 3Blue1Brown (0:00 - 0:54)
How to identify a normal distribution?
Central limit theorem in short:
The sum of many random variables approaches a normal distribution. The mean involves a sum! (mean <- sum(x) / length(x))
The central limit theorem (CLT) states that, under appropriate conditions, the distribution of a normalized version of the sample mean converges to a standard normal distribution. This holds even if the original variables themselves are not normally distributed. Source: Wikipedia.
Refer to the textbook Introduction to Probability and Mathematical Statistics for the mathematical definition
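A minimal sketch of the CLT statement above, in base R. The exponential distribution, the seed, and the sizes `n_means` and `n_per_sample` are assumptions chosen for illustration; any non-normal distribution with finite variance could be swapped in.

```r
set.seed(42)        # assumed seed, only so the numbers are reproducible

n_means      <- 10000  # how many sample means to simulate (assumed)
n_per_sample <- 30     # observations averaged in each sample (assumed)

# Draw from an exponential distribution, which is strongly right-skewed
draws <- matrix(rexp(n_means * n_per_sample, rate = 1), nrow = n_means)

# Each row is one sample; rowMeans() computes sum(x) / length(x) per row
sample_means <- rowMeans(draws)

# Normalized sample mean: (mean - mu) / (sigma / sqrt(n))
# For rexp(rate = 1), mu = 1 and sigma = 1
z <- (sample_means - 1) / (1 / sqrt(n_per_sample))

# If the CLT applies, z should have mean close to 0 and sd close to 1
round(c(mean = mean(z), sd = sd(z)), 2)
```

Calling `hist(z)` on the result should show a roughly bell-shaped histogram even though the raw draws are skewed.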
Galton board: at each peg, a ball adds a random step (+1 / -1), and the sum of these steps determines where the ball ends up
Watch the full video / 3Blue1Brown (1:53 - 5:22)
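The Galton board idea can be sketched in a few lines of base R. The number of balls, the number of peg rows, and the seed are assumptions for illustration, and each bounce is treated as an independent fair +1 / -1 step.

```r
set.seed(1)       # assumed seed for reproducibility

n_balls <- 5000   # balls dropped through the board (assumed)
n_rows  <- 12     # rows of pegs each ball passes (assumed)

# One row per ball, one column per peg row; each entry is a +1 or -1 bounce
steps <- matrix(sample(c(-1, 1), n_balls * n_rows, replace = TRUE),
                nrow = n_balls)

# A ball's final bin is the sum of all its bounces
final_position <- rowSums(steps)

# Counts per bin: most balls land near 0, few at the extremes
table(final_position)
```

Passing `final_position` to `hist()` or `barplot(table(final_position))` should give the familiar bell-shaped pile of balls.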
Normal distributions appear everywhere because many measured quantities can be expressed as a SUM of other hidden variables.
Heights: We can speculate that height depends on the summed action of multiple genes, plus the food and exercise you had every day until measurement
Gene expression in a bacterial/cell culture: The sum (or mean) of the expression of the millions of individual cells in the culture makes for an excellent normal distribution!
But it is important to recognize when effects are not strictly additive, such as when feedback loops are involved.
Income is famously not normally distributed. Because of positive feedback across generations, and because capital gains far outweigh labor gains, there are very few very rich people and very many not-very-rich people.
If your gene has a positive/negative feedback loop (activates/represses itself), then your gene expression will not be normally distributed. You might have a bimodal distribution with 2 peaks!
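A rough sketch of what "not strictly additive" can look like. Multiplying the effects together is used here as one simple stand-in for compounding (positive feedback); this is an assumption for illustration, not a model of income or of a gene circuit. The uniform effect sizes, the counts, and the seed are also assumed.

```r
set.seed(7)         # assumed seed for reproducibility

n_units   <- 10000  # simulated individuals (assumed)
n_effects <- 20     # hidden effects per individual (assumed)

# Each hidden effect is a random factor between 0.5 and 1.5
effects <- matrix(runif(n_units * n_effects, min = 0.5, max = 1.5),
                  nrow = n_units)

additive       <- rowSums(effects)        # effects add up
multiplicative <- apply(effects, 1, prod) # effects compound

# Symmetric distribution: mean and median are close
c(mean = mean(additive), median = median(additive))

# Right-skewed distribution: mean sits well above the median
c(mean = mean(multiplicative), median = median(multiplicative))
```

Comparing `hist(additive)` with `hist(multiplicative)` should show a bell shape for the sum and a long right tail for the product.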
Definition from the textbook: Modern Statistics with R
A key component of modern statistical work is simulation, in which we generate artificial data that can be used both in the analysis of real data (..) and for assessing different (statistical) methods.
# Pick a random variable (note: rounding is only for the presentation)
rnorm(5, mean = 0, sd = 1) %>% round(2) # Normal random variable (r.v)
[1] -0.23 0.12 0.32 0.14 0.31
[1] "j" "e" "g"
[1] 1
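The second and third outputs above look like a draw of three letters and a draw of a single number, but the calls that produced them are not shown. One possible sketch with base R's `sample()` follows; which calls the slides actually used is an assumption, as are the seed and the +1 / -1 choice set.

```r
set.seed(123)  # assumed seed; without it every run gives different values

# Five draws from a standard normal distribution, rounded for display
normal_draws <- round(rnorm(5, mean = 0, sd = 1), 2)

# Three distinct letters picked at random from a to z
letter_draws <- sample(letters, 3)

# A single random step, either -1 or +1 (assumed choice set)
step_draw <- sample(c(-1, 1), 1)

normal_draws
letter_draws
step_draw
```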
Avoid for() loops, and use vectorized functions
10 is no good, let’s try 1,000?
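A sketch of repeating a simulation without writing a `for()` loop. The outcome simulated here, the sum of 6 random +1 / -1 steps, is taken from the discussion of missing numbers in these slides; the helper name `one_outcome` and the seed are hypothetical.

```r
set.seed(2024)  # assumed seed for reproducibility

# One simulated outcome: the sum of n_steps random +1 / -1 steps
one_outcome <- function(n_steps = 6) {
  sum(sample(c(-1, 1), n_steps, replace = TRUE))
}

# replicate() repeats the call and collects the results in a vector
few  <- replicate(10, one_outcome())
many <- replicate(1000, one_outcome())

# Fully vectorized alternative: draw every step at once, then sum by row
many_vec <- rowSums(matrix(sample(c(-1, 1), 1000 * 6, replace = TRUE),
                           nrow = 1000))

table(few)   # too few outcomes to see a shape
table(many)  # the bell shape starts to appear
```

`replicate()` still loops internally, but it keeps the code short; the `rowSums()` version avoids the per-outcome function call entirely.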
Why are some numbers missing?
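Because of the discrete nature of the random variables, odd numbers cannot be made by adding an even number (6) of +1/-1s: with k steps of +1 and 6 - k steps of -1, the sum is 2k - 6, which is always even (if 0 was included as a possible step, odd sums would be possible)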
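The parity argument can be checked with a short simulation. The helper name `simulate_sums`, the number of simulations, and the seed are hypothetical choices for this sketch.

```r
set.seed(99)     # assumed seed for reproducibility

n_sims  <- 1000  # simulated outcomes (assumed)
n_steps <- 6     # steps added per outcome, as in the text

# Sum n_steps random draws from step_values, repeated n_sims times
simulate_sums <- function(step_values) {
  steps <- matrix(sample(step_values, n_sims * n_steps, replace = TRUE),
                  nrow = n_sims)
  rowSums(steps)
}

sums_without_zero <- simulate_sums(c(-1, 1))     # only +1 / -1 steps
sums_with_zero    <- simulate_sums(c(-1, 0, 1))  # 0 allowed as a step

table(sums_without_zero)  # only even sums appear
table(sums_with_zero)     # odd sums appear as well
```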
Let us plot each outcome on top of the histogram
Collecting a random subset of samples from a larger distribution is another way to begin a simulation. This is the technique behind bootstrapping (we will get to it later)
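A minimal sketch of drawing random subsets with `sample()`. The "larger distribution" here is a hypothetical vector of exponential values; its size, the subset size, the number of repeats, and the seed are all assumptions.

```r
set.seed(10)  # assumed seed for reproducibility

# Hypothetical larger distribution to draw from
population <- rexp(100000, rate = 1)

# One random subset of 50 values (without replacement by default)
one_subset <- sample(population, size = 50)

# Repeat the subsetting many times and keep the mean of each subset
subset_means <- replicate(2000, mean(sample(population, size = 50)))

# The subset means centre on the population mean
c(population_mean      = mean(population),
  mean_of_subset_means = mean(subset_means))
```

Bootstrapping itself resamples a single observed sample with replacement, for example `sample(one_subset, replace = TRUE)`.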
Please download the qmd worksheet for today from GitHub (scripts/ folder in the class-worksheets repository)
Understanding normal distributions
Sum or mean of many random variables produces a normal distribution
Using R to simulate random events and explore distributions