with hands-on data analysis workshop
2024-01-11
Note:
From the feedback so far, and from hearing from students directly, I see that quite a few people feel overwhelmed. I hope the next three slides, which clarify the lecture format and assignments, will encourage people to stay on.
R
I believe that if you have taken one undergraduate-level course where you learnt to code in any programming language, it should not be too challenging to pick up R syntax with 3 weeks of practice.
That said, the examples we walk through in class and the assignments you work on (with help from our very enthusiastic TAs) will be great resources to guide the learning process.
This classroom is a safe space, so no question is too simple or too “silly” to ask.
The lectures are structured to be interspersed with 5-10 min coding sessions where we walk everyone through the code together. You essentially save all the work we do in class, and build on it in your own time for the assignments.
So 25% of the assignment will already be done in class!
Practicing by working through code in your own time is the only way to actually learn coding.
Thursday: Assignment uploaded and introduced in class
Get help in office hours during the week
Friday 7 pm: submission
Next Thursday: Brief discussion of the previous week’s assignment and common mistakes
Next Saturday: receive your grades and feedback from TAs
Please do finish the coding task at hand even if it seems too easy. There will surely be a thing or two to learn that you haven’t noticed before!
Try to help your peers and neighbors during the class.
BONUS CONTENT: We will also have some bonus content that is a good challenge for you to work on during class (once you finish the regular task, of course).
Do you know why?
Analyze data like in the B.C.E. times (the Before Computers Era!)
Not this way: this is number theory.
This way, for statistics.
Side point: Are these truly random?
Yes! It depends on how often the random number takes each possible value. The pattern of how often each value occurs is called a distribution, and we can visualize it.
Fair die: each outcome is equally likely. Let us simulate it.
2 4 1 3 1 6 5 6 3 1 1 1 4 4 4 2 6 1 6 2 4 3 1 3 2 5 2 2 3 5
Wonky die: certain outcomes are favored more than others.
5 1 6 1 2 3 5 1 2 5 1 1 2 5 5 1 1 5 2 1 3 1 1 5 1 1 5 6 5 1
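A minimal R sketch of how rolls like these could be generated (the fair-die call mirrors the sample() line at the end of these notes; the wonky-die probabilities below are made up for illustration, not the ones actually used in class):

# Fair die: each of the six faces is equally likely
fair_rolls <- sample(1:6, 30, replace = TRUE)

# Wonky die: some faces favored; these probabilities are illustrative only
wonky_rolls <- sample(1:6, 30, replace = TRUE,
                      prob = c(0.35, 0.10, 0.10, 0.05, 0.30, 0.10))

fair_rolls
wonky_rolls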
The fair die does look uniform when a large number of points is sampled. Here I drew 3,000 points.
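A rough sketch of that check in R, assuming a fair die and 3,000 draws; the bar for each face should come out roughly equal in height:

# Draw 3,000 fair-die rolls and histogram how often each face appears
big_sample <- sample(1:6, 3000, replace = TRUE)
hist(big_sample, breaks = 0:6, main = "3,000 fair-die rolls", xlab = "Face")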
This feeds into the assignment: take a photo of your distribution chart at the end of each task. These photos, along with one or two sentences that show your understanding, are this week’s assignment!
Data: We generated “populations” of data from 3 different distributions. For the final task, you can compare your data with another group’s data from a different population for best results.
We need 9 volunteers to sample the data for the 9 groups
Each group will get 30 data points. While everyone is forming teams (5+ members per team), I will briefly demo how to arrange them into a histogram.
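For reference, a minimal version of that demo in R; the 30 values here are hypothetical placeholders, and each group should of course use its own sampled points:

# Hypothetical stand-in for one group's 30 data points
x <- c(4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.7, 5.5, 4.9, 5.2,
       6.3, 4.6, 5.0, 5.8, 4.2, 5.4, 5.6, 4.7, 5.9, 5.1,
       4.5, 5.3, 6.1, 4.8, 5.2, 5.0, 4.9, 5.5, 5.7, 4.3)

# Arrange the points into a histogram
hist(x, main = "Group sample", xlab = "Value")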
How to form a good team?
Get to know new people in the class!
If you know a tiny bit of R, make new friends who are new to R
Teams
5 mins to form teams
5 mins to get to know each other
Where are you from?
What field are you in?
Data seekers
Show demo with ~5-10 data points.
Choose a bin width for your histogram. I recommend 0.25, 0.5, or 1.
Once you have a histogram, mark the outlines (top shape) of the distribution
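A sketch of how those recommended bin widths could be set in R, assuming your points are in a vector x like the placeholder one above; hist() takes explicit break points, so the bin width goes into seq():

# Bin width 0.5: break points every 0.5 units across the data range
hist(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 0.5))

# Swap in 0.25 or 1 for the `by` value and compare the shapes
hist(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 0.25))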
Central tendency: a single value that can describe the whole distribution (to an approximation).
Mark each of these quantities visually on the chart below the distribution (approximate location works)
Mode: the most repeated value (the tallest bin, in this case)
Median: the centermost value when the data are ordered
Mean: the average (sum / number of data points); a small R sketch of all three follows this list
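A small sketch of these three quantities in R, assuming your points are in a vector x. mean() and median() are built in; R has no built-in mode-of-the-data function, so the last line is a hypothetical one-liner via table():

mean(x)                       # mean: sum / number of data points
median(x)                     # median: centermost value when ordered
names(which.max(table(x)))    # mode: most frequent value (for continuous data, use the tallest bin instead)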
Ask the TA or your group’s R experts to give you the sd() output and note it down.
Visually demarcate the Mean +/- SD and Mean +/- 2 x SD locations on the chart.
Count the number of points within the 1 SD and 2 SD intervals.
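A sketch of the sd() step and the two counts in R, again assuming your points are in x; for roughly normal data, expect about 68% and 95% of the points in the two intervals:

m <- mean(x)
s <- sd(x)    # note this value down

# Points within mean +/- 1 SD and mean +/- 2 SD
sum(x > m - s & x < m + s)
sum(x > m - 2 * s & x < m + 2 * s)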
Merging the distributions: a visual way to tell whether two distributions are similar or not.
Befriend a neighboring group with a different colour of datapoints and merge your histograms into a single chart.
Together, both teams will assemble the two histograms in the same chart (start from either group’s chart), and mark the tops of the guest group’s bars for easy visualization.
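One way to sketch that merged chart in R, assuming the two groups’ points are in vectors x and y (the colours here are placeholders):

# Overlay the two histograms with semi-transparent colours
hist(x, col = rgb(1, 0, 0, 0.5), xlim = range(c(x, y)),
     main = "Two groups, one chart", xlab = "Value")
hist(y, col = rgb(0, 0, 1, 0.5), add = TRUE)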
Quantitative value: we need to rely on more concrete statistics than gut feeling, so we do hypothesis testing using t-tests.
Null hypothesis: you start with the hypothesis that both samples are from the same distribution, i.e., they are essentially identical. You then try to find quantitative evidence to reject this (~ disproving the null hypothesis).
Using R’s t.test() function: get the TAs or your group’s R experts to calculate this for you and note down the values.
Rule of thumb: if the p-value < 0.05, then the datasets have very little likelihood of being sampled from the same distribution.
The data were ~normally distributed, and the standard deviations were all the same, hence the assumptions for doing a t.test were met!
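A minimal sketch of that test in R, assuming the two groups’ points are in vectors x and y:

# Two-sample t-test; the null hypothesis is that both samples
# come from the same distribution
result <- t.test(x, y)
result            # full output: t statistic, degrees of freedom, p-value
result$p.value    # rule of thumb from above: < 0.05 suggests the samples differ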
We distribute data into bins to form a histogram.
The complexity in each individual dataset can be simplified by describing it with just a central tendency +/- spread (as a measure of variability).
We compare two (or more..) distributions of data using hypothesis tests, by starting with a null hypothesis and finding evidence against it.
Take the data points and the chart with you and explore more while doing the assignment. Note: groups work together, but we want each individual to turn in their own report, so that they gain a better understanding while writing.
Dall-E3, via MS Bing
By ICMA Photos - Coin Toss, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=71147286
germansektor.blogspot.com
# Simulate 30 rolls of a fair six-sided die
sample(1:6, 30, replace = TRUE)