Week 67 – Dec. 6th to Dec. 13th

Once again, I could sum up this week’s progress simply with, “it was fine”. I got through 3 units, which sounds pretty good, but considering there were no exercises that needed to be done, only each unit’s unit test, it wasn’t that great. To give myself some credit though, the unit tests weren’t exactly easy, and I was also able to wrap my head around a few tricky concepts that I previously couldn’t understand. I’m happy to finally be getting closer to having most of stats mapped out in my head. Generally, I’m now able to visualize the distributions from the questions in my head, differentiate the components of the questions, identify each required step to answer the question, AND know how to proceed through each step. I’m glad to finally be reaching this point considering 1) I should be finished and moving on from stats in the next week or two, so it was essentially now-or-never, and 2) after 6 months of studying the same subject, I don’t think I have it in me to study stats for much longer anyways.

I had a pretty big breakthrough at the beginning of the week when I started to understand why Z-distributions and T-distributions are shaped slightly differently. Here’s a page from my notes that explains:

At this point, my understanding is that a Z-distribution takes on essentially the exact shape of a normal distribution. A T-distribution, however, looks similar, with the majority of the distribution centered at the mean and tapering off towards the ends (i.e. it’s unimodal), but, if you’re using a small sample, the tails of a T-distribution will be thicker (😏) because you’re less certain that the distribution will follow the exact normal shape when the sample is small. This is why you must look up the degrees of freedom when determining the p-value from a T-table: the sample size affects how thick (😏) the tails will be.
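To double-check this intuition numerically, here’s a minimal sketch (my own, not from the course; the degrees-of-freedom values are just ones I picked) comparing tail probabilities of the Z- and T-distributions with scipy:

```python
# Compare the probability of landing far out in the tails for a standard normal
# (Z) distribution vs. T-distributions with small and large degrees of freedom.
from scipy.stats import norm, t

for x in [2.0, 3.0]:
    print(f"P(beyond ±{x}):")
    print(f"  Z (normal):    {2 * norm.sf(x):.4f}")
    print(f"  T, d.f. = 4:   {2 * t.sf(x, df=4):.4f}")    # small sample -> thicker tails
    print(f"  T, d.f. = 100: {2 * t.sf(x, df=100):.4f}")  # large sample -> nearly normal
```

With d.f. = 4 the tail probabilities come out noticeably larger than the normal ones, while with d.f. = 100 they’re nearly identical, which is exactly the thicker-tails effect.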

After getting a question wrong on the Significance Tests (Hypothesis Testing) unit test, I had to review how to calculate a Z-score for a sample proportion. Afterwards, I finally started to understand and visualize how Z-scores for proportions work in relation to a normal distribution. The formula to calculate a proportion’s Z-score is:

  • Z = (p̂ – P)/√(P*(1 – P)/n), where p̂ is the sample proportion and P is the assumed population proportion

The way I conceptualize this formula is by thinking about a normal distribution where the mean is the assumed population proportion, P, and the standard deviations labelled along the bottom of the distribution are calculated using P. You then compare how ‘extreme’ the sample proportion, p̂, is relative to that distribution, which is based off the assumed population proportion. Before this, I would always get confused about whether to use the population proportion or the sample proportion to determine the S.D. (In retrospect, it seems pretty obvious that the S.D. of the distribution should be calculated with P.)
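To make that concrete, here’s a minimal sketch of the formula in Python (the numbers P = 0.5, p̂ = 0.6, and n = 100 are made up for illustration, not from a course question):

```python
# Z-score for a sample proportion, plus a two-tailed p-value via scipy.
from math import sqrt
from scipy.stats import norm

P = 0.5      # assumed population proportion (from H_o)
p_hat = 0.6  # observed sample proportion
n = 100      # sample size

z = (p_hat - P) / sqrt(P * (1 - P) / n)  # the S.D. uses P, not p_hat
p_value = 2 * norm.sf(abs(z))            # two-tailed tail probability
print(f"Z = {z:.3f}, p-value = {p_value:.4f}")  # Z = 2.000, p ≈ 0.0455
```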

Going through the unit Chi-Square Tests for Categorical Data, I finally understood that chi-square tests are always about comparing observed results with ‘expected results’ to determine the probability of the difference between the two having occurred by chance alone. The chi-square statistic is denoted with X^2. There are three different types of chi-square tests, which test for ‘goodness-of-fit’, ‘homogeneity’, and ‘independence’.

A chi-square goodness-of-fit test is used to determine how close observed results are to hypothesized expected results. Here is an example:

  • Q. In the game rock-paper-scissors, Kenny expects to win, tie, and lose with equal frequency. Kenny plays rock-paper-scissors often, but he suspected his own games were not following the pattern, so he took a random sample of 24 games and recorded their outcomes. Here are his results:
Outcome | Win | Loss | Tie
Games   |  4  |  13  |  7
  • He wants to use these results to carry out a X^2 goodness-of-fit test to determine how unlikely his distribution of outcomes would be if wins, losses, and ties really were all equally likely. What are the values of the test stat and the P-value for Kenny’s test?
    • H_o: Outcomes have equal probability
    • H_a: Outcomes have unequal probability
      • (Note: In a goodness-of-fit test, the null hypothesis, H_o, is always that the hypothesized/expected distribution is correct, and the alternative hypothesis, H_a, is always that the true distribution is something different.)
    • n = 24
    • D.F. = 2
      • (Note: the degrees of freedom is calculated with (rows – 1)(columns – 1), but since a goodness-of-fit test only has a single list of categories, that reduces to D.F. = (categories – 1) = (3 – 1) = 2.)
    • Expected outcomes:
Outcome  | Win | Loss | Tie
Games    |  4  |  13  |  7
Expected |  8  |   8  |  8
  • X^2 = Σ(Obs._i – Exp._i)^2/Exp._i
    • = (4 – 8)^2/8 + (13 – 8)^2/8 + (7 – 8)^2/8
    • = (–4)^2/8 + (5)^2/8 + (–1)^2/8
    • = 16/8 + 25/8 + 1/8
    • = 2 + 3.125 + 0.125
    • = 5.25
    • P-value = P(X^2 ≥ 5.25)
      • (At this stage you have to use a chi-square statistic table with X^2 = 5.25 and d.f. = 2 to figure out the ‘tail probability’, a.k.a. the probability of observed results at least this far from the expected results occurring by chance alone.)
      • (Going across the row where d.f. = 2, you see that X^2 = 5.25 falls between the two values 4.61 and 5.99. Going up to the top of each of those values’ columns, the tail probabilities are 0.10 and 0.05 respectively.)
    • P(X^2 ≥ 5.25): 0.05 < P-value < 0.10
      • (This means the probability of Kenny having gotten the results he did, assuming he should have gotten an equal distribution of wins, losses, and ties, was between 5% and 10%.)
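As a sanity check on the table bracketing, here’s a minimal sketch of Kenny’s test using scipy, which computes an exact p-value instead of a range:

```python
# Kenny's goodness-of-fit test: observed game outcomes vs. equal expected counts.
from scipy.stats import chisquare

observed = [4, 13, 7]  # wins, losses, ties out of 24 games
expected = [8, 8, 8]   # H_o: each outcome equally likely

stat, p_value = chisquare(f_obs=observed, f_exp=expected)  # d.f. = categories - 1 = 2
print(f"X^2 = {stat:.2f}, p-value = {p_value:.4f}")  # X^2 = 5.25, p ≈ 0.072
```

The exact p-value (~0.072) does land between 0.05 and 0.10, matching what the table gave.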

A chi-square test for homogeneity is used to determine if two groups have similar distributions of the same variable. It’s worth noting that you always set your H_o to be that the groups DO have similar distributions (a.k.a. they ARE homogeneous) and set H_a to be that the groups do NOT have similar distributions (a.k.a. they are NOT homogeneous). Here’s an example:

  • Q. Based on the sample data in the table below and using a significance level of 0.05, is there a difference between left-handed and right-handed people in their preference of subjects to study?
           | Right-Handed | Left-Handed | TOTAL
STEM       | 30           | 10          | 40
Humanities | 15           | 25          | 40
Equal      | 15           |  5          | 20
TOTAL      | 60           | 40          | 100
  • Step 1) – Determine H_o and H_a.
    • H_o: No difference between right-handed and left-handed people in terms of their subject preference.
    • H_a: There is a difference between right-handed and left-handed people and their subject preference.
       
  • Step 2) – Using the proportion of ALL people, right-handed and left-handed combined, that prefer each subject (i.e. each TOTAL-column value divided by the overall total of 100), determine how many people from each sample of right-handed and left-handed people you would expect to prefer each subject.
           | Right-Handed | Left-Handed | TOTAL
STEM       | 30           | 10          | 40
Expected   | 60*0.4 = 24  | 40*0.4 = 16 |
Humanities | 15           | 25          | 40
Expected   | 60*0.4 = 24  | 40*0.4 = 16 |
Equal      | 15           |  5          | 20
Expected   | 60*0.2 = 12  | 40*0.2 = 8  |
TOTAL      | 60           | 40          | 100
  • Step 3) – Find the chi-square statistic by calculating the sum of the difference between each observed value and its respective expected value, squared, and divided by its respective expected value.
    • X^2 = Σ(Obs._i – Exp._i)^2/Exp._i
      • = (30 – 24)^2/24 + (10 – 16)^2/16 + (15 – 24)^2/24 + (25 – 16)^2/16 + (15 – 12)^2/12 + (5 – 8)^2/8
      • = 1.5 + 2.25 + 3.375 + 5.0625 + 0.75 + 1.125
      • = 14.0625
  • Step 4) Calculate the degrees of freedom.
    • D.F. = (rows – 1)(columns – 1)
      • = (3 – 1)(2 – 1) 
      • = 2 * 1
      • = 2
  • Step 5) – Using a chi-square table, determine if the p-value of X^2 is less than the significance level of 0.05 and therefore whether or not to reject H_o.
    • Going across the row where d.f. = 2, the very last value is 9.21, which corresponds with a tail probability of 0.01. Since X^2 = 14.0625 is even further out, the p-value < 0.01, which is less than alpha, 0.05, meaning we can reject H_o, which suggests there IS a difference between right-handed and left-handed people in their preference of subjects.
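For what it’s worth, scipy can do Steps 2 through 5 in a single call; here’s a minimal sketch using chi2_contingency, which derives the expected counts, X^2, degrees of freedom, and p-value straight from the observed table:

```python
# Homogeneity test: do right- and left-handed people share a subject-preference
# distribution? chi2_contingency handles expected counts, X^2, d.f., and p-value.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: STEM, Humanities, Equal; columns: right-handed, left-handed.
observed = np.array([[30, 10],
                     [15, 25],
                     [15,  5]])

stat, p_value, dof, expected = chi2_contingency(observed)
print(f"X^2 = {stat:.4f}, d.f. = {dof}, p-value = {p_value:.5f}")  # 14.0625, 2, ≈ 0.0009
print(expected)  # matches the hand-calculated 24/16, 24/16, 12/8
```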

Finally, a chi-square test for independence is used to see if there is an association between two variables. It’s once again worth noting that the null hypothesis for chi-square independence tests is always that there is NO association between the two variables, and the alternative hypothesis is always that there IS an association between the variables. Here’s an example I wrote in Week 61 of how to do it:

  • Q. Based on the sample statistics between foot length and hand length in the table below, using a 0.05 significance level, is there an association between those two variables (i.e. if someone’s right foot is longer, is it likely their right hand is longer, etc.)?
                    | Right Foot Longer | Left Foot Longer | Both Feet the Same | TOTAL
Right Hand Longer   | 11                | 3                | 8                  | 22
(Expected)          |                   |                  |                    |
Left Hand Longer    | 2                 | 9                | 14                 | 25
(Expected)          |                   |                  |                    |
Both Hands the Same | 12                | 13               | 28                 | 53
(Expected)          |                   |                  |                    |
TOTAL               | 25                | 25               | 50                 | 100
  • Step 1) – Figure out H_o and H_a.
    • H_o: The variables are independent.
    • H_a: The variables are associated.
  • Step 2) – Assuming H_o, figure out the expected results.
    • This is done by multiplying a specific cell’s (row-total/overall-total) by that cell’s (column-total/overall-total) by the overall-total, which simplifies to (row-total * column-total)/overall-total.
      • (In layman’s terms, you’re multiplying the probability of a cell being in a specific row by the probability of it being in a specific column by the overall-total number of people in the sample.)
    • Ex. Expected(R.F.L. and R.H.L.) = (22/100)*(25/100)*100
      • = 5.5
                    | Right Foot Longer | Left Foot Longer | Both Feet the Same | TOTAL
Right Hand Longer   | 11                | 3                | 8                  | 22
(Expected)          | 5.5               | 5.5              | 11                 |
Left Hand Longer    | 2                 | 9                | 14                 | 25
(Expected)          | 6.25              | 6.25             | 12.5               |
Both Hands the Same | 12                | 13               | 28                 | 53
(Expected)          | 13.25             | 13.25            | 26.5               |
TOTAL               | 25                | 25               | 50                 | 100
  • Step 3) – Use the Chi-Square formula to figure out a Chi-Square statistic.
    • X^2 = Σ((Observed Result – Expected Result)^2/Expected Result)
      • = ((11 – 5.5)^2/5.5) + ((3 – 5.5)^2/5.5) + ((8 – 11)^2/11) + ((2 – 6.25)^2/6.25) + ((9 – 6.25)^2/6.25) + ((14 – 12.5)^2/12.5) + ((12 – 13.25)^2/13.25) + ((13 – 13.25)^2/13.25) + ((28 – 26.5)^2/26.5)
      • = ~11.942
  • Step 4) – Determine the Degrees of Freedom.
    • D.F. = (rows – 1)(columns – 1)
      • = (3 – 1)(3 – 1)
      • = 2*2
      • = 4
    • (Note: This is because if you know 2 values in the rows and 2 other values in the columns, you’re able to figure out the rest of the values in each cell.)
  • Step 5) – Using the X^2 and D.F., use a Chi-Square table to determine the P-Value.
    • Based on the table, going across the row of D.F. = 4 to the columns where X^2 = 9.49 and X^2 = 13.28, looking up you see the P-values for those respective X^2 values are 0.05 and 0.01. Since X^2 = 11.942 falls between them, the P-value is somewhere between these two probabilities.
  • Step 6) – Based on the significance level and p-value, determine if there is an association between the two variables or not.
    • Since the P-value < α (it falls between 0.01 and 0.05, so below the 0.05 significance level), we reject H_o, which supports H_a: there IS an association between foot-length and hand-length.
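Here’s a minimal sketch that mirrors Steps 2 through 5 by hand: building the expected counts from the row and column totals, summing the X^2 terms, and getting an exact p-value from the chi-square distribution:

```python
# Independence test done manually: expected counts, X^2 statistic, d.f., p-value.
import numpy as np
from scipy.stats import chi2

# Rows: right hand longer, left hand longer, both hands the same.
# Columns: right foot longer, left foot longer, both feet the same.
obs = np.array([[11,  3,  8],
                [ 2,  9, 14],
                [12, 13, 28]])

total = obs.sum()
# Expected count for each cell = row-total * column-total / overall-total.
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / total
x2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
p_value = chi2.sf(x2, dof)  # tail probability P(X^2 >= x2)
print(f"X^2 = {x2:.3f}, d.f. = {dof}, p-value = {p_value:.4f}")  # ≈ 11.942, 4, ≈ 0.018
```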

Lastly, when I went through the unit test from More on Regression, I got better at understanding computer-generated least-squares regression output (LSRO) and understanding how and why you would want to compare the slope of the regression line to a slope that equals 0. The reason you’d want to compare the regression line’s slope to 0 is to determine 1) if there is a linear relationship between the two variables, and 2) the probability of the sample data having its slope by chance alone.

First off, here’s a page from my notes that gives a bit of detail on a computer-generated LSRO output:

The formula for a regression line is ŷ = a + bx, where a is the y-intercept and b is the slope. On a LSRO output, the y-intercept and its related statistics are in the row labelled “Constant”, and the x-axis variable and its related statistics are in the row labelled with whatever variable is being measured on the x-axis (in the page from my notes, it’s “Fertility”). The y-intercept, a, is in the “Constant” row under the column “Coef” (i.e. 89.7), and the slope, b, is in the x-axis row, again, under the column “Coef” (i.e. -5.97). I’m still not confident on why a LSRO output looks the way it does, has the headings for each column and row that it does, or what information the columns “T” and “P” give you (though I’m pretty sure P stands for probability, I don’t know what probability it’s calculating for each row). I believe the column “S.E. Coef” is the standard error of each coefficient.
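Out of curiosity, here’s a sketch with made-up data using statsmodels, which prints a very similar output. From what I can tell (treat this as an assumption, since I can’t confirm it against the course’s table), the “T” column is just Coef divided by S.E. Coef, and “P” is the probability of seeing a coefficient that extreme if the true coefficient were 0:

```python
# Generate fake fertility/literacy-style data and fit a least-squares regression,
# printing a computer-style output with coef, std err, t, and P>|t| columns.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 8, size=30)         # made-up x-axis variable (e.g. fertility)
y = 90 - 6 * x + rng.normal(0, 5, 30)  # made-up y values scattered around a line

model = sm.OLS(y, sm.add_constant(x)).fit()  # "const" row = the y-intercept a
print(model.summary())  # t = coef / std err; P>|t| tests whether the true coef is 0
```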

Finally, after close to 6 months of working through statistics, I think I have most of the terms associated with and used on normal distributions memorized and mapped out, and have a good understanding of how to calculate them. I have a feeling that eventually all of this will seem very obvious and go from feeling like there’s an overwhelming amount of difficult things to remember to feeling like there are only a couple of simple things that are all straightforward and easy to remember. This coming week I should be able to quickly get through the last two units, Prepare for the 2020 AP Statistics Exam (no M.P.) and AP Statistics Standards Mappings (no M.P.), and then will attempt the course challenge. I have a feeling I will need to do the challenge a few times, so if I don’t get a good enough score this week, that will be ok. Either way, stats is just about over, which is incredibly relieving.