I would define my progress this week as “meh”. I got through the unit Sampling Distributions but only made it through a handful of videos in the following unit Confidence Intervals. I would say the effort/amount of time I put in this week is fairly indicative of the amount of work I got done. Funny how that works. I think I worked on KA for a bit more than an hour each day which at this point doesn’t feel like enough anymore. Having switched back to my regular work schedule where I now work in the evenings, I want to put in a minimum of 1.5 hours a day, 5 days a week. In fact, I’ve just decided that that’s my new minimum from here on out.
The first video I watched this week went through an example question that calculated the mean of a sample taken from a given population. This was essentially review but I found it very useful, and it helped cement a few concepts that I didn’t have a solid understanding of. The following is more-or-less the question that was asked and the page from my notes I used to work through the question:
- Q. There are approximately 150 million men in the United States. What is the mean height of men in the U.S. that we could infer from a sample of 5 heights being 6.2ft, 5.5ft, 5.75ft, 6.3ft and 5.9ft?
Quantity | Population Notation | Sample Notation |
Mean | μ (referred to as “mean”) | x̅ (referred to as “sample mean”) |
Population/Sample Size | Uppercase ‘N’ (“total population”) | Lowercase ‘n’ (“sample size”) |
Proportion | Uppercase ‘P’ | p̂ (referred to as “p-hat”) |
- As you can see, the sample is taken from the population. To find the sample mean, you use the formula:
- x̅ = (Σ_{i=1}^{n} x_i)/n (As you can see, compared to how the formula looks in my notes, this is a very difficult formula to express linearly).
- = (x_1 + x_2 + x_3 + x_4 + x_5)/n
- = (6.2 + 5.5 + 5.75 + 6.3 + 5.9)/5
- = 29.65/5
- = 5.93
- Based on this small sample size and what the question asks, you would infer that the mean of the population is 5.93ft.
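The arithmetic above can be double-checked with a few lines of Python. This is only a minimal sketch using the five heights from the question; the variable names are made up for illustration.

```python
# The five sampled heights from the question, in feet.
heights_ft = [6.2, 5.5, 5.75, 6.3, 5.9]

n = len(heights_ft)                 # sample size (lowercase n)
sample_mean = sum(heights_ft) / n   # x-bar = (sum of all x_i) / n

print(f"n = {n}, sample mean = {sample_mean:.2f} ft")
```

The sum of the five heights is 29.65, so dividing by n = 5 gives the same 5.93ft as the hand calculation.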
Next I watched a video that made it very clear what the difference is between sample size and a sample (set):
- Sample Size (‘n’)
- The number of ‘samples’ taken within a sample (set). (In the question above, the sample size was 5 and there was only one sample taken.)
- Sample (Set)
- A set of individual samples grouped together to create a single sample (set).
- ‘Set’ isn’t necessary to say or add but it can be easier to understand what a sample is by adding that word because a single sample contains a ‘set’ of individual samples, a.k.a. the sample size.
Here’s a page from my notes that gives three examples of different sample sizes and their corresponding total samples:
As you can see, the first example has a sample size of n = 2 with four total samples, the second example has a sample size of n = 5, again, with four total samples, and the third example has a sample size of n = 8 with 5 total samples.
I then moved on to learning about what’s called the Central Limit Theorem. I think I have a fairly good understanding of what this is and how it works but I haven’t got it completely 100% figured out. Here’s what I know about it so far:
- Central Limit Theorem
- States that when taking multiple sample means from any type of distribution and plotting those means on another distribution, the new distribution will begin to resemble a ‘normal’ curve.
- In my own words this means, for example, if you had a population of 100 random numbers and took 20 samples where the sample size was 30 (a.k.a. n = 30), found the mean of each sample, and plotted each of those 20 sample means on a new distribution, that new distribution of sample means would 1) look roughly like a normal curve AND 2) have roughly the same mean as the original mean of the 100 random numbers.
- When n is larger, the new distribution will better resemble a normal curve than when n is smaller.
- I.e. when n -> infinity => Normal Distribution
- “When n approaches infinity, the distribution gets closer to Normal Distribution”
- (I somewhat understand why this is the case but not well enough to confidently put it into my own words.)
- For the C.L.T. to work (i.e. for the distribution of sample means to begin to take the shape of a normal curve), the rule of thumb is that n ≥ 30.
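The idea in the list above can be tried out with a small simulation. This is a rough sketch, not anything from the videos: the skewed "population" (exponential values), the fixed seed, and the helper name `sample_means` are all assumptions chosen for illustration, and the population and number of samples are much larger than the 100 numbers and 20 samples in my example so the pattern is easier to see.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable

# A deliberately non-normal, right-skewed "population".
population = [rng.expovariate(1.0) for _ in range(100_000)]

def sample_means(pop, n, num_samples, rng):
    """Draw num_samples samples of size n (with replacement); return each sample's mean."""
    return [statistics.fmean(rng.choices(pop, k=n)) for _ in range(num_samples)]

means = sample_means(population, n=30, num_samples=2_000, rng=rng)

pop_mean = statistics.fmean(population)
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.pstdev(means)

# For a normal curve, about 68% of values fall within one standard deviation of the mean.
within_one_sd = sum(abs(m - mean_of_means) <= sd_of_means for m in means) / len(means)

print("population mean:       ", round(pop_mean, 3))
print("mean of sample means:  ", round(mean_of_means, 3))
print("share within 1 std dev:", round(within_one_sd, 3))
```

If the theorem holds up, the two means should land close together and the share within one standard deviation should be somewhere near 68%, even though the original population looks nothing like a normal curve.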
Here are two pages from my notes that 1) give an example of how using C.L.T. on a ‘crazy’ distribution (lol) will create a distribution that more closely resembles a normal curve as n becomes larger, and 2) the formulas for the standard deviation and variance of a sample distribution created using C.L.T.:
Again, I still don’t fully understand why the standard deviation becomes smaller as n becomes larger. In my mind I think about it as n sitting in the denominator. For the variance, for example, if n = 3 and the numerator was 1 then [σ²/n = 1/3 ≈ 0.33], whereas if n = 3000 and the numerator was 1 then [σ²/n = 1/3000 ≈ 0.00033]. The standard deviation is the square root of the variance, so it shrinks too, just more slowly (1/√3 ≈ 0.58 versus 1/√3000 ≈ 0.018).
As for the sample distribution variance and standard deviation formulas, I also don’t fully understand how they’re derived, certainly not well enough to attempt to explain in my own words.
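One way to build some intuition without the derivation is to measure the spread of the sample means directly and compare it to the standard result that the standard deviation of the sample means is σ/√n (and the variance is σ²/n). The sketch below assumes a made-up population of uniform random numbers between 0 and 10, a fixed seed, and a hypothetical helper `sd_of_sample_means`; none of these come from the course.

```python
import math
import random
import statistics

rng = random.Random(1)  # arbitrary fixed seed for repeatability

# Hypothetical population: 50,000 uniform random numbers between 0 and 10.
population = [rng.uniform(0, 10) for _ in range(50_000)]
sigma = statistics.pstdev(population)  # population standard deviation

def sd_of_sample_means(pop, n, num_samples, rng):
    """Standard deviation of the means of num_samples samples of size n."""
    means = [statistics.fmean(rng.choices(pop, k=n)) for _ in range(num_samples)]
    return statistics.pstdev(means)

results = {}
for n in (3, 30, 300):
    observed = sd_of_sample_means(population, n, 2_000, rng)
    predicted = sigma / math.sqrt(n)  # sigma / sqrt(n)
    results[n] = (observed, predicted)
    print(f"n = {n:>3}  observed = {observed:.3f}  sigma/sqrt(n) = {predicted:.3f}")
```

The observed and predicted columns should track each other, and both should shrink as n grows: averaging more values per sample leaves less room for any one sample mean to stray far from the population mean.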
As I mentioned in my intro, the last thing I did this week was get through a few videos in the new unit Confidence Intervals. I don’t have much to say about confidence intervals right now since I just started learning about them, but here’s what I’ve figured out about them so far:
- Confidence Intervals
- “A confidence interval calculates the probability that a population parameter will fall between two set values.
- “Confidence intervals measure the degree of uncertainty in a sampling method.
- “Most often, confidence intervals reflect confidence levels of 95% [a.k.a. roughly 2 standard deviations away from the mean] or 99% [a.k.a. roughly 2.6 standard deviations away from the mean; 3 standard deviations corresponds to about 99.7%].”
- In my own words, a confidence level means something along the lines of “you can expect x% of y intervals to contain the (population’s) parameter of interest”.
Again, I really don’t understand confidence levels well enough to explain them in my own words, but it has something to do with not knowing the true parameter (‘P’) of a population. You infer where the true parameter is likely to be by taking multiple sample parameters (‘p̂’), building an interval around each one, and then figuring out how often those intervals overlap the true population parameter.
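The “x% of y intervals” idea can be simulated. This is a minimal sketch under stated assumptions: the true proportion of 0.6, the sample size of 100, the fixed seed, and the function name are all hypothetical, and the interval uses the common normal-approximation formula p̂ ± 1.96·√(p̂(1 − p̂)/n) for a 95% confidence level, which may not be exactly how the unit presents it.

```python
import math
import random

rng = random.Random(2)  # arbitrary fixed seed for repeatability

TRUE_P = 0.6           # hypothetical true population proportion (normally unknown)
N = 100                # sample size for each sample
Z_95 = 1.96            # z-value for a 95% confidence level
NUM_INTERVALS = 1_000  # how many samples (and so intervals) to build

def interval_for_one_sample(rng):
    """Take one sample of size N and return a 95% interval around its p-hat."""
    successes = sum(rng.random() < TRUE_P for _ in range(N))
    p_hat = successes / N
    margin = Z_95 * math.sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - margin, p_hat + margin

intervals = [interval_for_one_sample(rng) for _ in range(NUM_INTERVALS)]
hits = sum(low <= TRUE_P <= high for low, high in intervals)
coverage = hits / NUM_INTERVALS

print(f"{hits} of {NUM_INTERVALS} intervals contain the true proportion")
```

The share of intervals that contain the true proportion should come out somewhere around 95%. The 95% describes how often the interval-building method captures the true value across many samples, rather than a probability attached to any single interval.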
One last thing worth noting about confidence intervals is that, when calculating the variance/standard deviation of a sample in order to estimate the population’s, you must make the denominator n – 1. I forget why you must use n – 1 instead of simply n (to be honest, I don’t think I ever fully understood why you have to use n – 1 for a sample in the first place), but I know it has something to do with the value being a better estimate of the population’s variance, since dividing by n tends to come out too small.
(Just writing that last sentence, it’s pretty clear I really don’t have a great idea of how this all works…)
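A quick simulation at least shows what n – 1 does, even without the theory behind it. The sketch below assumes a made-up, roughly normal population (mean 50, standard deviation 10), a small sample size of 5, and a fixed seed. It relies on Python’s `statistics.pvariance`, which divides by n, and `statistics.variance`, which divides by n – 1.

```python
import random
import statistics

rng = random.Random(3)  # arbitrary fixed seed for repeatability

# Hypothetical population with a known variance to compare against.
population = [rng.gauss(50, 10) for _ in range(50_000)]
true_variance = statistics.pvariance(population)

n = 5            # small samples make the difference easy to see
trials = 10_000  # number of samples to draw

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(trials):
    sample = rng.choices(population, k=n)
    divide_by_n.append(statistics.pvariance(sample))         # denominator n
    divide_by_n_minus_1.append(statistics.variance(sample))  # denominator n - 1

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print(f"true population variance:    {true_variance:.1f}")
print(f"average estimate with n:     {avg_n:.1f}")
print(f"average estimate with n - 1: {avg_n_minus_1:.1f}")
```

Averaged over many samples, the n version should come out too low (around (n – 1)/n, or 80%, of the true variance when n = 5), while the n – 1 version should land close to the true variance.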
It feels like I still have a lot to learn when it comes to statistics. Similarly to when I learned trigonometry, however, I have a feeling that when I go back to review what I’ve learned about stats so far, much of it will seem fairly obvious compared to when I first worked through it and hopefully it will all become more clear. Going through the next two courses, High School Statistics and AP College Statistics, will be a good opportunity for me to review and hopefully most of the concepts that I’m struggling with will be easier to understand.
This coming week, Week 57, I’d like to at least get through this unit Confidence Intervals (0/800 M.P.) and hopefully get at least a few videos into the following unit, Significance Tests (Hypothesis Testing) (0/1500 M.P.). It’s looking less and less likely that I’ll get through all of stats by the end of October but that’s ok. I’ll still push hard to make it happen but won’t be too upset if it doesn’t. Hopefully bumping up the amount of time I spend working through KA to 1.5 hours will make a difference!