2024-02-27
Last class covered
How normal distributions are generated: sum of random variables / mean of a random sample
Used R to simulate random events and explore distributions
Finish segment: How do I test if my data is normal? Q-Q plots and the Shapiro-Wilk test
What questions do t-tests answer?
t-tests dissected: Layout of a general hypothesis test
Understand sampling distributions, standard error of mean (SEM) and confidence intervals (CI)
How to check if my data is normally distributed?
Mean +/- SD
Source: Wikipedia
Median, quartiles
Quartiles divide the area under the probability distribution into 4 equal quarters; the general term for such cut points is quantile (note the spelling!)
Quartiles (4)
Deciles (10)
quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities. Source: Wikipedia
probability distribution = geom_density plot. Hence quantiles divide the range so that each interval has equal area under the geom_density curve
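A minimal sketch of the quantile idea in base R. The data are simulated, and the seed, sample size, mean and SD are arbitrary choices for illustration, not values from the class.

```r
# Simulated data (assumption: any numeric vector would do here)
set.seed(1)
x <- rnorm(1000, mean = 50, sd = 10)

# Quartiles: cut points at 25%, 50% and 75%
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))

# Deciles: cut points at 10%, 20%, ..., 90%
deciles <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))

# Count the observations that fall into each of the 4 intervals
# defined by the quartiles; each should hold about a quarter of the data
counts <- table(cut(x, breaks = c(-Inf, quartiles, Inf)))

print(quartiles)
print(counts / length(x))
```

The proportions printed at the end should each be close to 0.25, which is the "equal area" property described above.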
Q-Q plot enables comparison of distribution shapes
If the points fall along the line, then the data is approximately normal.
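One way to try this out is with base R's qqnorm() and qqline(). This is only a sketch on simulated data (the two samples and the helper qq_cor() are made up for illustration); the plots shown in class may have been produced with a different function.

```r
set.seed(2)
normal_data <- rnorm(500)           # simulated normal sample
skewed_data <- rexp(500, rate = 1)  # simulated right-skewed (non-normal) sample

# Side-by-side Q-Q plots against the theoretical normal quantiles
op <- par(mfrow = c(1, 2))
qqnorm(normal_data, main = "Normal data"); qqline(normal_data)
qqnorm(skewed_data, main = "Skewed data"); qqline(skewed_data)
par(op)

# Hypothetical helper: correlation between theoretical and sample quantiles.
# Values closer to 1 mean the points lie closer to a straight line.
qq_cor <- function(v) {
  qq <- qqnorm(v, plot.it = FALSE)
  cor(qq$x, qq$y)
}

qq_cor(normal_data)
qq_cor(skewed_data)
```

The normal sample should hug the line, while the skewed sample should curve away from it at the ends.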
Normal distribution
[1] 881 665
Non-normal distribution
[1] 1293 1
Shapiro-Wilk test for normality
What is this testing for?
Shapiro-Wilk normality test
data: .
W = 0.99807, p-value = 0.315
Shapiro-Wilk normality test
data: life_exp
W = 0.95248, p-value < 2.2e-16
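The null hypothesis of the Shapiro-Wilk test is that the data were drawn from a normal distribution, so a small p-value is evidence against normality, while a large p-value only means the test found no evidence against it. A minimal sketch with base R's shapiro.test() on simulated data (not the datasets used in the outputs above):

```r
set.seed(3)
normal_data <- rnorm(500)           # simulated normal sample
skewed_data <- rexp(500, rate = 1)  # simulated non-normal sample

res_normal <- shapiro.test(normal_data)
res_skewed <- shapiro.test(skewed_data)

# W close to 1 indicates data consistent with a normal distribution
res_normal$statistic
res_skewed$statistic

# Small p-value: reject the null hypothesis of normality
res_normal$p.value
res_skewed$p.value
```

Note that shapiro.test() accepts between 3 and 5000 observations, and with large samples even small departures from normality can give a tiny p-value, so it is worth reading the result together with the Q-Q plot.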
further reading: statology.org
To make statistical conclusions of this kind based on data
But causality is really hard to establish, which is why we talk about correlation/association instead
“Smoking is associated with cancer”
“Smokers are statistically significantly at a higher risk of having cancer”
“Evidence linking smoking and cancer appeared in the 1920s.” Source: The cigarette controversy, 2007/section: Smoking Causes Cancer: When Did They Know?
2D data
Correlation
Regression (linear/non-linear)
1D data
Looking at the mean only is not enough:
3.15 is clearly > 2.46, does that alone satisfy the hypothesis?
This is what the t-statistic encompasses: it weighs the difference between the means against the spread of the data and the sample size. The t-test tests a hypothesis based on this statistic
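To make the statistic concrete, here is a sketch that computes it by hand and compares it with t.test(). The group means 3.15 and 2.46 are taken from the example above; the sample size and SD are assumptions, and the data are simulated. The formula shown is the Welch form, which is what t.test() uses by default.

```r
set.seed(4)
# Simulated groups (assumed n = 30 and sd = 1 for illustration)
group_a <- rnorm(30, mean = 3.15, sd = 1)
group_b <- rnorm(30, mean = 2.46, sd = 1)

# Difference between the sample means
diff_means <- mean(group_a) - mean(group_b)

# Standard error of that difference: combines spread and sample size
se_diff <- sqrt(var(group_a) / length(group_a) +
                var(group_b) / length(group_b))

# t-statistic = difference in means, measured in units of its standard error
t_manual <- diff_means / se_diff

# Same statistic from the built-in test
t_builtin <- unname(t.test(group_a, group_b)$statistic)

c(manual = t_manual, builtin = t_builtin)
```

The same difference in means gives a smaller t-statistic when the data are more spread out or the samples are smaller, which is why the means alone do not settle the question.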
Null hypothesis in English: “smoking is NOT associated with cancer”
In statistical terms - “the samples of smokers and non-smokers came from the same population”
(assumption) For a t-test, this source population is normally distributed
(Other assumption for Student’s t-test) Equal variance between the smoker and non-smoker datasets (this assumption can be relaxed by using Welch’s t-test)
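In R, the choice between the two tests is the var.equal argument of t.test(). A short sketch on simulated groups with deliberately unequal variances (the group names, means, SDs and sizes are made up for illustration):

```r
set.seed(5)
# Two simulated groups with different spreads
group_1 <- rnorm(40, mean = 10, sd = 1)
group_2 <- rnorm(40, mean = 11, sd = 3)

# Student's t-test: assumes equal variances in both groups
student <- t.test(group_2, group_1, var.equal = TRUE)

# Welch's t-test: does not assume equal variances (the default in t.test)
welch <- t.test(group_2, group_1)

# Degrees of freedom: n1 + n2 - 2 for Student, adjusted downward for Welch
student$parameter
welch$parameter

student$p.value
welch$p.value
```

With equal group sizes the two t-statistics coincide and only the degrees of freedom differ; with unequal group sizes and unequal variances the two tests can disagree more noticeably.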
Watch full video here: “Student’s t-test” : Bozeman science/youtube
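Draw a sample from the population, calculate a statistic (e.g. the mean) and record it (for each sample)
Repeat many times; the recorded values form the sampling distribution (of the mean/statistic)
The mean is nice because its sampling distribution is approximately normal for large enough samples (due to the central limit theorem!)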
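The sampling procedure can be simulated directly. This is a sketch under stated assumptions: the population is a made-up exponential one (chosen because it is clearly not normal), and the sample size of 50 and the 2000 repetitions are arbitrary.

```r
set.seed(6)
# A deliberately non-normal simulated population
population <- rexp(100000, rate = 1)
n <- 50

# Draw many samples and record the mean of each one
sample_means <- replicate(2000, mean(sample(population, n)))

# The histogram of recorded means is the sampling distribution of the mean
hist(sample_means, breaks = 40, main = "Sampling distribution of the mean")

# Its spread matches the standard error of the mean: SD / sqrt(n)
sd(sample_means)
sd(population) / sqrt(n)

# In practice there is only one sample: estimate the SEM from it
one_sample <- sample(population, n)
sem <- sd(one_sample) / sqrt(n)

# 95% confidence interval for the mean, using the t distribution
ci <- mean(one_sample) + c(-1, 1) * qt(0.975, df = n - 1) * sem
ci
```

Even though the population is skewed, the histogram of sample means should look roughly bell-shaped, and the two SEM values printed should be close to each other.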
We will walk through this activity to understand sampling well
Great explanation of sample size in 7.2 (FIGURE 7.12) in moderndive/chap7
Q-Q plots and the Shapiro-Wilk test enable checking data for normality
t-tests answer questions of association such as ‘smoking is associated with cancer’ (association, not causation)
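t-test: the p-value gives the probability of seeing a difference at least as extreme as the observed one if the NULL hypothesis (that both data groups were sampled from the same distribution) were TRUE; it is not the probability that the NULL hypothesis is true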
Understood sampling distributions, standard error of mean (SEM) and confidence intervals (CI)