I didn’t get quite as far along as I hoped I would this week but managed to make some decent progress, nonetheless. It took me until Thursday to get through the unit Inference – Comparing Two Groups or Populations which I was hoping to have finished by Wednesday. It then took me 4 or 5 attempts to get through the Confidence Intervals unit test which was disappointing, though I was surprisingly happy with what I was able to recall about confidence intervals. On Saturday I had enough time to make one attempt at the Significance Tests (Hypothesis Testing) unit test and only scored 12/14 so unfortunately I didn’t get the KA hattrick this week. Similarly to last week, I once again feel like I’m on the brink of having all of these concepts figured out in my mind but haven’t quite got there and so I often still feel fairly lost. So frustrating!!
This week I picked up halfway through the unit Inference – Comparing Two Groups or Populations where I worked on coming up with T-statistic confidence intervals. One of the most frustrating parts from this week was that I’m still not 100% confident on when to use a T-statistic vs when to use a Z-statistic. As far as I understand, the times you would use each type of statistic are:
- T-statistic:
  - When n ≤ 30.
  - When given the sample’s standard deviation, S_X.
- Z-statistic:
  - When dealing with proportions, P and p̂.
  - When given the population’s standard deviation, σ.
I feel like there have been times where n ≥ 30 and I was given the sample S.D. and yet I still had to use a Z-table and a Z-statistic. This is what I find the most frustrating, that I have a pretty good idea of when to use each but I’m not completely certain.
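To pin the rule down, here is a minimal Python sketch of the convention usually taught in intro stats: proportions always use Z, and means use Z only when the population standard deviation σ is known, otherwise T regardless of n. This is an assumption about the convention, not something confirmed by the course, and the function name `choose_statistic` is made up for illustration.

```python
def choose_statistic(parameter: str, population_sd_known: bool = False) -> str:
    """Return 'z' or 't' under the common intro-stats convention (an assumption).

    parameter: 'proportion' or 'mean'.
    population_sd_known: True if sigma (the population S.D.) is given.
    """
    if parameter == "proportion":
        # Proportions use the normal approximation, so a Z-statistic.
        return "z"
    if parameter == "mean":
        # Known sigma -> Z; only the sample S.D. available -> T.
        return "z" if population_sd_known else "t"
    raise ValueError("parameter must be 'proportion' or 'mean'")


if __name__ == "__main__":
    print(choose_statistic("proportion"))
    print(choose_statistic("mean", population_sd_known=True))
    print(choose_statistic("mean"))
```

Under this convention sample size does not decide the question. For large n the T critical values get very close to the Z ones, which may be why the two sometimes seem interchangeable.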
Again this week, I went over the conditions for inference (i.e. the conditions needed to treat the sampling distribution as approximately normal) for both T-statistics and Z-statistics:
- Random
- Normal
  - Means:
    - 1) If parent distribution is normal.
    - 2) If n ≥ 30.
      - (Note: if finding the difference between two distributions, both distributions must have n ≥ 30.)
    - 3) If sample data is roughly symmetric.
  - Proportions (only for Z-stats, I believe):
    - (Successes) ≥ 10, and (Failures) ≥ 10.
- Independent
  - I.e. sample with replacement or 10% Rule.
To be clear, all three of these conditions need to be met in order to infer that the sampling distribution is approximately normal. If the sample meets all the necessary conditions, you’re then able to run it through Z-tests and/or T-tests to find a p-value, a confidence interval, etc.
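As a rough illustration, the three conditions for a proportion can be written as a small Python check. This is only a sketch: the function name `proportion_conditions_met` and its parameters are hypothetical, and the "random" condition can't be computed from the data, so it is simply passed in as a flag.

```python
def proportion_conditions_met(n, successes, population_size=None,
                              random_sample=True, with_replacement=False):
    """Check the Random / Normal / Independent conditions for a proportion.

    Returns a dict with one True/False entry per condition.
    """
    failures = n - successes

    # Normal: at least 10 successes and at least 10 failures.
    normal = successes >= 10 and failures >= 10

    # Independent: sampling with replacement, or the 10% Rule
    # (the sample is no more than 10% of the population).
    if with_replacement:
        independent = True
    elif population_size is not None:
        independent = n <= 0.10 * population_size
    else:
        independent = False  # can't verify without knowing the population size

    return {"random": random_sample, "normal": normal, "independent": independent}


if __name__ == "__main__":
    # Made-up numbers: 100 people sampled from 5,000, 40 of them "successes".
    checks = proportion_conditions_met(n=100, successes=40, population_size=5000)
    print(checks, "-> all met:", all(checks.values()))
```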
The last thing I went over in this unit was the difference between a Paired hypothesis test and a Two-Sample hypothesis test.
- Paired
  - For each subject being sampled, they produce two measurements/results that are compared to each other.
    - Ex. “Do people run faster wearing shorts or pants?”
- Two-Sample
  - Sampling two separate subjects/populations on the same variable.
    - Ex. “Do dog owners or cat owners spend more money on pet accessories?”
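The difference shows up in how the test statistic is built. The sketch below, using only Python's standard library, computes a paired T-statistic from the per-subject differences and a two-sample T-statistic from two separate groups (unpooled standard error). The function names and all of the numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """T-statistic for paired data: work with each subject's difference."""
    diffs = [b - a for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def two_sample_t_statistic(group_a, group_b):
    """T-statistic for two independent groups (unpooled standard error)."""
    se = sqrt(stdev(group_a) ** 2 / len(group_a)
              + stdev(group_b) ** 2 / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


if __name__ == "__main__":
    # Paired: the same three runners timed in shorts and then in pants.
    shorts = [10, 12, 11]
    pants = [11, 14, 12]
    print("paired t:", paired_t_statistic(shorts, pants))

    # Two-sample: spending by separate groups of dog owners and cat owners.
    dog_owners = [30, 45, 50, 38]
    cat_owners = [25, 40, 35, 28]
    print("two-sample t:", two_sample_t_statistic(dog_owners, cat_owners))
```

In the paired case there is one list of differences and one standard deviation. In the two-sample case each group keeps its own mean, standard deviation and sample size.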
After passing the Inference – Comparing Two Groups or Populations unit test, I then moved on to the unit Confidence Intervals which appeared as being 100% complete but I hadn’t attempted the unit test. I ended up having to redo the unit test a few times because I got 1 or 2 questions wrong each attempt for not choosing the right statistic to use, either Z-stat or T-stat. Redoing the test a few times and working through ~30 questions helped solidify my understanding of the confidence interval equation and understand why it looks the way it does:
- Confidence Intervals
  - Equation:
    - Ex. 90% Con.In._(P) = p̂ ± Z*_(90%)√(p̂(1 – p̂)/n)
    - = p̂ ± 1.65 * √(p̂(1 – p̂)/n)
  - A confidence interval tells you that, based on a given sample set of data, you can be x% confident that the true mean or true proportion of the population being sampled from falls within that interval. Strictly speaking, the x% describes the method: if you took many samples and built an interval from each one, about x% of those intervals would capture the true value.
  - The first part of the equation states where the interval is centred. In the example it’s p̂, but it could also be a sample mean, x̅, or the difference between two sample means. The remainder of the equation states the margin of error.
- Margin of Error
  - The way I understand it, the Margin of Error is the distance on either side of the centre of the interval (the sample statistic) within which the true mean is expected to fall at the chosen confidence level. I picture it like a set of soccer goal posts. The further you move the goal posts away from the centre, the more likely the interval is to capture the true mean of the population. The closer you move the goal posts towards the centre, the less likely the true mean is to fall within the interval.
  - Equation:
    - Z*_(x%)√(p̂(1 – p̂)/n)
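To see the equation in action, here is a minimal Python sketch that computes a Z-interval for a proportion using only the standard library. The function name `proportion_confidence_interval` and the sample numbers are made up for illustration; `NormalDist().inv_cdf` stands in for looking up Z* in a Z-table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, confidence=0.90):
    """Return (lower, upper, margin_of_error) for a one-sample proportion."""
    # Z* leaves (1 - confidence) / 2 in each tail of the standard normal.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    # Margin of error: Z* times the standard error of p-hat.
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)

    return p_hat - margin, p_hat + margin, margin


if __name__ == "__main__":
    # Made-up sample: 50 successes out of 100, so p-hat = 0.5.
    lower, upper, margin = proportion_confidence_interval(0.5, 100, 0.90)
    print(f"90% interval: {lower:.3f} to {upper:.3f} (margin {margin:.3f})")

    # A higher confidence level moves the goal posts further apart.
    lower, upper, margin = proportion_confidence_interval(0.5, 100, 0.99)
    print(f"99% interval: {lower:.3f} to {upper:.3f} (margin {margin:.3f})")
```

The computed Z* for 90% is about 1.645, which the Z-table rounds to 1.65.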
Lastly, I was asked a few Type 1 vs Type 2 Errors questions in the Confidence Intervals unit test and I initially couldn’t remember which was which. I had a vague idea of what the table I came up with a few weeks ago looked like but couldn’t remember it exactly and had to find it in my notes:
| | H_o True | H_o False |
| Reject H_o | Type 1 Error | Correct |
| Fail-to-Reject H_o | Correct | Type 2 Error |
To reiterate what it says, a Type 1 error occurs when in reality the original (null) hypothesis is true but, based on the sample data, you reject it in favour of the alternative hypothesis. A Type 2 error is when in reality the original hypothesis is false but, based on the sample data, you fail to reject it.
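The table is small enough to write as a lookup in Python, which might make it easier to remember. This is just a sketch of the four cells; the function name `classify_decision` is made up for illustration.

```python
def classify_decision(h0_is_true: bool, rejected_h0: bool) -> str:
    """Map (reality, decision) onto the four cells of the error table."""
    if h0_is_true and rejected_h0:
        return "Type 1 Error"  # rejected a true H_o
    if not h0_is_true and not rejected_h0:
        return "Type 2 Error"  # failed to reject a false H_o
    return "Correct"


if __name__ == "__main__":
    for h0_is_true in (True, False):
        for rejected_h0 in (True, False):
            print(f"H_o true={h0_is_true}, rejected={rejected_h0}:",
                  classify_decision(h0_is_true, rejected_h0))
```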
I’m hoping I’ll be able to get through the unit test in Significance Tests (Hypothesis Testing) (1360/1400 M.P.) on Tuesday this coming week. I’d then like to finish the following unit’s unit test, from Chi-Square Tests for Categorical Data (700/700 M.P.), by Thursday at the latest. If I can manage that I’ll have 2 days to get through the third-last unit of the course, More on Regression (240/300 M.P.), which should be doable considering how small it is. All in all, I’m pretty happy with where I am and feel like it’s likely I’ll get through all of stats by the end of the month. I realized this week that I began stats this summer on June 23rd which means I’m coming up to 6 months of working on statistics. I’ll be happier if I finish before the 23rd of this month but, if I don’t, there’ll be something satisfying about working through stats for half a year. Either way, as long as I’m not counting down the New Year while working through KA, I’ll be happy.