Things didn’t go exactly according to plan this week but, overall, I give myself a passing effort. I only got halfway through the unit Inference for Categorical Data (Chi-Square Tests) but, even still, ended up learning a number of new concepts. I wasn’t able to work on KA on Saturday or Sunday morning which, if I had been able to, I think would have made the difference between finishing the unit or not. (Long story short, I went to a cottage that ended up losing power Friday night/Saturday morning, which threw the entire weekend off and made it impossible to work on KA.) All that said, I feel the same way about what I learned this week as I have the past few weeks; I have a general idea of how the things I learned about work but haven’t fully grasped it all yet. Chi-Square tables seem somewhat similar to Z-tables and T-tables, however, so I’m hoping it won’t take me too long to feel comfortable with them.
Something I learned in Week 59 which I forgot to add to that post is that in the notation H_o, the ‘o’ doesn’t stand for original. It turns out that it should actually be written as H_0 where the ‘0’ is a zero. I figured this out after realizing that the notation for an alternative hypothesis can also be written as H_1 instead of H_a.
Before I got into Chi-Squared tables, I made a note about what I believe is the difference between a population distribution, a set of individual sample distributions, and a distribution of combined samples.
The way I’ve come to understand the difference between a population, individual sample, and combined sample is that you can take individual samples (X_1, X_2, X_3, X_4, and X_5) from a population (μ_X) and, as long as they meet certain conditions, you can infer that they’ll have a normal distribution, BUT they won’t necessarily have the same mean or S.D. as the population’s distribution. If you combine all the sample distributions, that’s how you get the distribution on the right, which I labelled as Sample(s) Distribution (x̅). I added the ‘(s)’ at the end of ‘Sample’ as a reminder to myself that it’s a distribution of multiple combined individual samples. What I think is most important to remember, and what I found difficult to understand, is that a sample distribution labelled with x̅ is a distribution of combined individual samples.
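The idea above can be simulated with a short, hedged sketch (all numbers here are made up for illustration): draw many individual samples from one population, then look at the distribution of their means, i.e. the x̅ distribution.

```python
# Minimal sketch: individual samples from one population vs. the
# distribution of their means (the x-bar distribution).
# Population parameters (mu ~ 50, sigma ~ 10) are arbitrary choices.
import random
import statistics

random.seed(0)
population = [random.gauss(50, 10) for _ in range(100_000)]

sample_means = []
for _ in range(1000):                       # 1000 individual samples...
    sample = random.sample(population, 30)  # ...each of size n = 30
    sample_means.append(statistics.mean(sample))

# The x-bar distribution centers on the population mean,
# but it is much narrower than the population itself.
print(round(statistics.mean(sample_means), 1))
print(statistics.stdev(sample_means) < statistics.stdev(population))
```

Each individual sample has roughly the population's S.D., but the distribution of combined sample means is far tighter, which is the distinction I was trying to capture with the ‘(s)’ label.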
One last slightly random side note – this week I learned that the reason you use the word ‘confident’ when describing a confidence interval is because, when constructing one, you have to estimate a population’s mean or parameter at some point with a sample mean or sample parameter. By swapping the sample mean/parameter for the population’s, you can only say, “we’re 95% confident that bla, bla, bla,” since you were using the sample’s mean/parameter and therefore can’t be certain that it’s the same as the population’s mean/parameter.
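As a rough sketch of that swap (the sample values below are made up, and I’m using the z* value 1.96 for simplicity, even though a t* value would be more appropriate for a sample this small):

```python
# Hypothetical sketch: building a 95% interval from the SAMPLE mean,
# which stands in for the unknown population mean.
import statistics

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
n = len(sample)
x_bar = statistics.mean(sample)   # sample mean (stand-in for mu)
s = statistics.stdev(sample)      # sample S.D. (stand-in for sigma)

# z* = 1.96 corresponds to 95% confidence
margin = 1.96 * s / n ** 0.5
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
```

Because x_bar and s came from the sample, not the population, the most you can claim is confidence, not certainty, that the interval captures the true mean.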
I’m still fairly confused about what a Chi-Square statistic is. Here’s what I’ve come to understand about it so far (keep in mind much of this could be wrong):
- Chi-Square Statistic
- Sounds like ‘Ki’ –Squared
- Denoted with X^2
- Note: The X isn’t just a ‘x’ in the English alphabet but the Greek letter Chi which looks like an x but isn’t exactly the same.
- Formula:
- X^2 = Σ(Obs._i – Ex._i)^2/Ex._i
- “Chi-Squared equals the sum of the observed results minus their respective expected results, squared, divided by each of their expected results.”
- An explanation from a website I found:
- “A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests to see whether distributions of categorical variables differ from each another.”
- “A very small chi square test statistic means that your observed data fits your expected data extremely well. In other words, there is a relationship.”
- “A very large chi square test statistic means that the data does not fit very well. In other words, there isn’t a relationship.”
- In my own words, it seems to be a statistic that tells you how close all the observed data came to their respective expected data, which you can then use to decide between the original hypothesis and the alternative hypothesis.
- Chi-Square Table
- To use the below table, you must 1) find the Chi-Square statistic and 2) determine the degrees of freedom (usually n – 1). Knowing those two values, you can then find the row for your degrees of freedom, move across to the column closest to your Chi-Square stat, and look to the top of the table to see the probability. For example, the probability of a Chi-Square statistic of 16.7496 with the degrees of freedom equaling 5 would be 0.005.
- Chi-Square Distribution
- To be honest, I don’t really know how to read this table. I believe you figure out the Chi-Square stat, which is marked on the X-axis, and find the degree of freedom, a.k.a. all the lines labelled as k, find the point on the graph where the specific k value in question and Chi-Square value intersect, and then move across to the Y-axis at that point to determine the probability. (This could definitely be wrong.)
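One way to sanity-check that table entry with only the standard library (the seed and trial count below are arbitrary choices of mine): a Chi-Square statistic with k degrees of freedom behaves like the sum of k squared standard-normal draws, so simulating those sums lets you estimate the right-tail probability the table reports.

```python
# Sketch: estimate P(X^2 >= 16.7496) with 5 degrees of freedom,
# which the table above says should be about 0.005.
import random

random.seed(1)
df = 5
cutoff = 16.7496
trials = 200_000

hits = 0
for _ in range(trials):
    # A chi-square draw = sum of df squared standard-normal draws
    stat = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    if stat >= cutoff:
        hits += 1

p = hits / trials
print(round(p, 3))  # should land near 0.005
```

This also clarifies what the table's probability means: it's the chance of seeing a statistic at least that large, i.e. the area in the right tail of the distribution, not a point on the curve.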
Here’s an example from my notes of a question where I had to find a Chi-Squared statistic:
- Question: Sal is looking to buy a restaurant. The owner of the restaurant in question tells Sal that the distribution of customers from Monday to Saturday follows the [Expected %] row in the table below. Sal spends a week observing the total number of customers that enter the restaurant from Monday to Saturday and marks down his results in the [Observed Result] row below. Was the probability of him getting his [Observed Results] less than 5%, based on the [Expected %] given to him by the owner?
|                 | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | TOTAL |
|-----------------|--------|---------|-----------|----------|--------|----------|-------|
| Expected %      | 10%    | 10%     | 15%       | 20%      | 30%    | 15%      | 100%  |
| Observed Result | 30     | 14      | 34        | 45       | 57     | 20       | 200   |
| Expected Result |        |         |           |          |        |          |       |
- The first step is to calculate the Expected Results. You do this by taking the hypothesized [Expected %] for each individual day and multiplying it by the total number of people (200). Doing so gives you the results of:
|                 | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | TOTAL |
|-----------------|--------|---------|-----------|----------|--------|----------|-------|
| Expected %      | 10%    | 10%     | 15%       | 20%      | 30%    | 15%      | 100%  |
| Observed Result | 30     | 14      | 34        | 45       | 57     | 20       | 200   |
| Expected Result | 20     | 20      | 30        | 40       | 60     | 30       | 200   |
- From here, you then use the Chi-Square formula with the [Observed Result] and [Expected Result] of each day to come up with the Chi-Squared statistic.
- X^2 = Σ(Obs._i – Ex._i)^2/Ex._i
- = ((30 – 20)^2/20) + ((14 – 20)^2/20) + ((34 – 30)^2/30) + ((45 – 40)^2/40) + ((57 – 60)^2/60) + ((20 – 30)^2/30)
- = ~11.44
- Then you need to calculate the Degrees of Freedom from the data:
- D.F. = (n – 1)
- = (6 – 1)
- = 5
- Lastly, you use the C.S. stat and the D.F. on a Chi-Square table to figure out the probability.
- Finding the row where D.F. = 5 on the left and moving across, you see that the 3rd column equals 11.0705 and the fourth column equals 12.8325. Their corresponding alpha levels, which are shown at the top of the table, equal 0.05 and 0.025. This means that the probability of Sal’s results being what they were, assuming the [Expected %] values given to him by the restaurant owner were true, was between 2.5% and 5%.
- Since the significance level was initially set to 0.05 and the results showed a probability of <0.05, we can reject the original hypothesis that the [Expected %] values were accurate.
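The worked example above can be sketched in a few lines of standard-library Python (the numbers are the ones from the tables, not anything new):

```python
# Sketch of the restaurant example: expected counts come from the
# owner's percentages times the 200 total customers observed.
expected_pct = [0.10, 0.10, 0.15, 0.20, 0.30, 0.15]  # Mon-Sat
observed = [30, 14, 34, 45, 57, 20]
total = sum(observed)                                # 200

expected = [p * total for p in expected_pct]         # 20, 20, 30, 40, 60, 30

# Chi-Square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                               # degrees of freedom

print(round(chi_sq, 2))  # → 11.44
print(df)                # → 5
```

From there, the last step is the same table lookup as in my notes: 11.44 with 5 degrees of freedom falls between the 0.05 and 0.025 columns.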
Just like there are conditions to determine if you can infer that a distribution will likely take on a normal curve, there are three conditions that must be met in order to run a Chi-Square Goodness-of-Fit test:
- Random
- Self-explanatory
- Large Counts
- The [Expected Result] must be at least 5 in every category.
- Independent
- Again, at this point, self-explanatory.
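The Large Counts condition is the only one of the three that's mechanical to check, so here's a tiny sketch of it using the expected counts from the example above:

```python
# Large Counts check: every EXPECTED count must be at least 5.
expected_pct = [0.10, 0.10, 0.15, 0.20, 0.30, 0.15]
total = 200
expected = [p * total for p in expected_pct]

large_counts_ok = all(e >= 5 for e in expected)
print(large_counts_ok)  # → True
```

Note that the condition is about the expected counts, not the observed ones.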
Lastly, I learned about contingency tables which, as far as I know, are tables that look exactly like the table I used in the above example. Here’s another example from my notebook:
In this example, the table at the top of the first page would be the contingency table. It indicates the number of people that took one of two sets of herbs or a placebo (top of table) and the number of people in all three categories who either got sick or didn’t get sick (left side of table). Based on what I’ve learned so far, it seems to me that a contingency table is simply a two-way table. In a way, that seems a bit too easy to be true, however, so I’ll have to keep my eye out for more information on contingency tables going forward.
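A hedged sketch of how a Chi-Square statistic comes out of a contingency table like that one (the counts below are made up for illustration, not the ones from my notebook):

```python
# Hypothetical two-way (contingency) table:
# columns are the groups, rows are sick / not sick.
table = [
    # Herb A, Herb B, Placebo
    [20,  30,  30],   # got sick
    [100, 110, 90],   # didn't get sick
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Expected count for each cell = (row total * column total) / grand total
chi_sq = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (obs - exp) ** 2 / exp

# Degrees of freedom for a two-way table = (rows - 1) * (columns - 1)
df = (len(table) - 1) * (len(table[0]) - 1)
print(round(chi_sq, 2), df)  # → 2.53 2
```

The main difference from the goodness-of-fit case is where the expected counts and degrees of freedom come from: here they're derived from the table's own row and column totals.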
There are only two more videos, 4 exercises, and the unit test remaining in Inference for Categorical Data (Chi-Square Tests) (320/700 M.P.). I already watched the final two videos but want to re-watch both of them and take better notes, which means I should be able to get through both quickly. I’m going to try and finish the final 3 units of this course by the end of the week, which I think could be doable since neither of the last two units, Advanced Regression (Inference and Transforming) and Analysis of Variance (ANOVA), have Mastery Points. It’s been a while since I’ve got a KA hattrick. I think I’m overdue for another one.