Well, I got through the unit Significance Tests (Hypothesis Testing) this week and about halfway through the following unit, Two-Sample Inference for the Difference Between Groups. I’m a bit concerned, however, that I don’t have a strong grasp on quite a bit of what I worked through. To be fair, I did find the unit test fairly simple (although I had to redo it once as I made a simple mistake on my first attempt), but I felt like I only knew which formulas to use based on the patterns in the way the questions were phrased. I don’t feel like I completely understand why the formulas actually work. I’m concerned that when I move on to calculus I won’t remember how to answer these types of questions because I don’t have it all completely mapped out in my head. BUT… I did come across an interesting comment in a Reddit post which made me feel a bit better, and I’ll talk about it at the end.
The first thing I did this past week was go through a video on how to derive the formulas for determining the Z-score of a sample proportion and a sample mean. Here’s a photo from my notes that breaks them both down:
In both explanations, you begin with (1) the null hypothesis (H_0) that the population mean or proportion in question is equal to a certain value (μ_0 or P_0), with the alternative hypothesis being that it simply isn’t. You then (2) take a sample from the population and calculate the sample mean or sample proportion. From there, you use different formulas to calculate the Z-score of each (I’ve put a small code sketch of both calculations after the list below):
- Sample Proportion
- Z = (p̂ – P_0)/σ_p̂
- (States, “the z-score equals the sample proportion minus the population proportion, divided by the standard deviation of the sample proportion.”)
- For some reason you often don’t know the sample proportion’s S.D. (which I don’t understand at this point), so you replace it with the formula:
- σ_p̂ = √(P_0(1 – P_0)/n)
- This makes the final formula look like:
- Z = (p̂ – P_0)/√(P_0(1 – P_0)/n)
- Sample Mean
- Z = (x̅ – μ_0)/σ_x̅
- (States, “the z-score equals the sample mean minus the population mean, divided by the standard deviation of the sample mean.”)
- Once again, you often won’t know the S.D. of the sample mean OR the S.D. of the population (this part I really don’t understand), so you replace it with the following formula:
- σ_x̅ = σ/√n ≈ S_x̅/√n
- BUT, since you end up using the sample S.D. (S_x̅), you then must use a T-table, with the final formula being:
- T = (x̅ – μ_0)/(S_x̅/√n)
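To check that I can actually run these numbers (and not just pattern-match the question wording), here’s a minimal Python sketch of both formulas. All of the values (P_0, p̂, μ_0, x̅, S, n) are made-up placeholders, not anything from the course:

```python
import math

# --- One-proportion z-statistic: Z = (p_hat - P0) / sqrt(P0 * (1 - P0) / n) ---
P0 = 0.50      # hypothesized population proportion (H_0)
p_hat = 0.56   # sample proportion
n = 200        # sample size

se_p_hat = math.sqrt(P0 * (1 - P0) / n)   # S.D. of the sample proportion under H_0
z = (p_hat - P0) / se_p_hat
print(f"z = {z:.3f}")

# --- One-sample t-statistic: T = (x_bar - mu0) / (S / sqrt(n)) ---
mu0 = 100      # hypothesized population mean (H_0)
x_bar = 103.2  # sample mean
S = 12.5       # sample standard deviation (used because sigma is unknown)
n = 25         # sample size

t = (x_bar - mu0) / (S / math.sqrt(n))    # compare to a t-table with n - 1 degrees of freedom
print(f"t = {t:.3f} (df = {n - 1})")
```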
I had a hard time understanding the difference between σ_x̅ and S_x̅, so I made a Reddit post asking for help and got this as an answer:
- “S_(x bar) is what you calculate from your sample data using the formula for standard deviation. This is your estimate of the population standard deviation, and it describes how spread-apart your data are. σ_(x bar) describes instead how wrong your estimate for the mean is going to be.
If you run a test over and over again you’ll get different values for the sample mean each time, and these different sample means follow a normal distribution with standard deviation σ_(x bar).”
As far as I understand it, S_x̅ is the sample standard deviation from a single sample. You would most likely take many samples in an experiment, however, and each sample would have its own mean. If you took all the means from all the samples and charted them in their own distribution, that distribution would look normal, and its standard deviation would be σ_x̅. This, in essence, describes how far any one sample mean is likely to be from the true population mean. (I get it now, but it’s still fairly confusing…)
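Here’s a quick simulation sketch (with arbitrary, made-up population numbers) that shows the difference: S comes from one sample, while σ_x̅ describes how much the sample means themselves bounce around from sample to sample:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 50, 10, 25          # "true" population mean/SD (unknown in practice) and sample size

# S: the standard deviation you compute from ONE sample -- an estimate of sigma
one_sample = rng.normal(mu, sigma, n)
S = one_sample.std(ddof=1)

# sigma_x_bar: how much the sample means themselves vary from sample to sample
sample_means = [rng.normal(mu, sigma, n).mean() for _ in range(10_000)]
sd_of_means = np.std(sample_means, ddof=1)

print(f"S (one sample's SD)       ≈ {S:.2f}  (estimates sigma = {sigma})")
print(f"SD of 10,000 sample means ≈ {sd_of_means:.2f}  (estimates sigma/sqrt(n) = {sigma / np.sqrt(n):.2f})")
```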
This week I figured out when to use a one-tailed z-score and when to use a two-tailed z-score. If the alternative hypothesis says the parameter is less than or greater than the hypothesized value, you calculate a one-tailed z-score, whereas if the alternative hypothesis simply says the parameter doesn’t equal the hypothesized value, you calculate a two-tailed z-score (there’s a small code sketch of the difference after the list below):
- One-Tailed
- H_a: μ > μ_0 OR μ < μ_0
- Two-Tailed
- H_a: μ ≠ μ_0
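A quick sketch of what that means for the p-value, using an arbitrary z-score of 1.8 (scipy’s norm.sf gives the tail area beyond a value):

```python
from scipy.stats import norm

z = 1.8   # an example z-score (placeholder value)

# One-tailed: H_a says the parameter is strictly greater (or strictly less) than the null value
p_one_tailed = norm.sf(z)            # area in the single tail beyond z

# Two-tailed: H_a only says the parameter is NOT EQUAL to the null value
p_two_tailed = 2 * norm.sf(abs(z))   # area in both tails combined

print(f"one-tailed p = {p_one_tailed:.4f}")   # ~0.0359
print(f"two-tailed p = {p_two_tailed:.4f}")   # ~0.0719
```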
I’m still having a hard time understanding when to use a Z-statistic versus when to use a T-statistic. A few things I learned this week that helped me understand when to use which one are (there’s a small code sketch after this list):
- Z-Statistic
- When n is greater than 30, you use a Z-statistic.
- To find a Z-score you must know:
- The population mean (μ)
- The population standard deviation (σ)
- The sample mean (x̅)
- The sample size (n)
- T-Statistic
- When n is less than 30, you use a T-statistic.
- You use a T-statistic when you have to estimate the population standard deviation.
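Here’s my simplified understanding of that decision rule as a rough helper function (Z when you actually know σ, T when you only have the sample S.D.); the example numbers are made up:

```python
from math import sqrt
from scipy.stats import norm, t

def one_sample_test(x_bar, n, mu0, sigma=None, S=None):
    # From my notes: use Z when you know the population S.D. (sigma);
    # when you only have the sample S.D. (S), use T with n - 1 degrees of freedom
    # (n > 30 is the rough cutoff where the two start to give similar answers).
    if sigma is not None:
        z = (x_bar - mu0) / (sigma / sqrt(n))
        return "Z", z, 2 * norm.sf(abs(z))            # two-tailed p-value
    t_stat = (x_bar - mu0) / (S / sqrt(n))
    return "T", t_stat, 2 * t.sf(abs(t_stat), n - 1)  # two-tailed p-value

print(one_sample_test(x_bar=103.2, n=25, mu0=100, S=12.5))    # sigma unknown -> T
print(one_sample_test(x_bar=103.2, n=64, mu0=100, sigma=12))  # sigma known   -> Z
```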
Lastly, I figured out what the term re-randomizing means in statistics. Re-randomizing is a way to build up a distribution of the differences you’d expect by chance, so you can determine how likely your experiment’s actual result was to occur by chance alone. It works as follows (there’s a code sketch of the whole procedure after the list):
- Start by creating an experiment with a treatment group and a control group, and find the difference between the mean scores of the two.
- Example – You run an experiment with 500 people to determine if a drug decreases cholesterol. You then split the 500 people into two groups of 250, one group being the treatment group that gets the drug and the other being the control group that gets a placebo.
- To re-randomize the scores, you then take all the scores after the experiment’s been run, mix them up in a hat (so to speak), pull them out at random and put each score into one of two new groups. You then take the difference of the means of the two new groups, record the difference and plot the result on a new distribution.
- From the example above, you’d take the 250 scores from each group, mix all 500 scores up in a hat and pull them out at random and create two new, random groups of 250 scores each. You calculate the difference of the means of each group, record the difference and plot the difference on a distribution.
- If you did this 150 times you’d state you “re-randomized the results 150 times.”
- From above, if you re-randomized the results 150 times, you’d have a roughly normal distribution with 150 data points. You’d then compare your initial result from the experiment to this new, 150-point distribution and, wherever the initial result fell on the 150-point distribution, you’d say that’s how likely it was to happen by chance. (I.e., if your result was 2 standard deviations away from the mean of the 150-point distribution, you’d say there was about a 4.55% probability of the experiment’s results happening by chance.)
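And here’s a small sketch of the whole re-randomization procedure with made-up (and much smaller) groups, so the shuffle-and-compare idea is concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up example data: cholesterol drops for a treatment and a control group
# (in the post's example these would each have 250 people; 10 apiece keeps the demo short)
treatment = np.array([12, 9, 15, 7, 11, 14, 8, 10, 13, 9])
control   = np.array([ 5, 8,  4, 9,  6,  7, 3,  8,  5, 6])
observed_diff = treatment.mean() - control.mean()

# Re-randomize: pool all scores, shuffle, split into two new groups, record the difference of means
pooled = np.concatenate([treatment, control])
n_rerandomizations = 150
diffs = []
for _ in range(n_rerandomizations):
    shuffled = rng.permutation(pooled)
    new_a, new_b = shuffled[:len(treatment)], shuffled[len(treatment):]
    diffs.append(new_a.mean() - new_b.mean())

# How often does a chance shuffle produce a difference at least as big as the real one?
diffs = np.array(diffs)
p_by_chance = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"observed difference: {observed_diff:.2f}")
print(f"proportion of re-randomizations at least that extreme: {p_by_chance:.3f}")
```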
I think I’m very close to working out these concepts in my mind. I need to think through them a little more, but I feel like I’m close to having it all figured out. The Reddit post I came across this week had to do with math, and I saw an interesting comment that resonated with how I’m feeling. The comment said something along the lines of, “a math professor once told [this person] that you won’t understand your current math class until your following math class.” This comment struck a chord with me, as a number of times I’ve said something quite similar. It’s often not until I get into the next subject and need to apply what I learned in the previous subject that the concepts I learned before seem to fall into place in my mind. It was reassuring to come across that comment randomly and realize that I’m not crazy for thinking or feeling this way.
This coming week I’d like to get through another 1.5 units. I only have 4 videos left to watch and a single practice exercise to get through in Two-Sample Inference for the Difference Between Groups (0 M.P.), so I should be able to wrap that up tomorrow. That will leave me with four days to get through the following unit, Inference for Categorical Data (Chi-square Tests) (0/700 M.P.). From there, I’ll only have two more units remaining in the course! Neither of them has any M.P., so I should be able to get through them both quickly. I doubt that I’ll be able to get through the entire course in the next 11 days, but I’ll give it a shot. I feel like the course test alone will take me a week to work through. Either way, I’m nearly through stats, which feels pretty good. 🙂