This week I got through two units, Inference for Categorical Data (Chi-Square Tests) and Advanced Regression (Inference and Transforming), but I would say my effort this week was, in a way, subpar. I’m fairly sure I spent more than 5 hours studying – I put in 1.5 hours Sunday morning, a day I usually don’t study – but I’d rate the quality of my effort as somewhat poor. I didn’t push myself any day this week; as soon as I got a bit bored (usually around the 1-hour mark) I would call it a day. What I’m learning right now is more conceptual/theoretical than what I enjoy most, which is working through exercises/equations, and I think that’s partly why I quit early each day. I’ve said this a million times, but I need to push myself if I want to get through stats by the end of the year, let alone by the end of this month.
I began this week learning about homogeneity and association in relation to Chi-Square statistics. I think I have a fairly decent understanding of both terms at this point but still haven’t mastered either concept yet. Here’s a photo from my notes that gives a bit of detail on both:
- Homogeneity
- 2 samples, 1 variable.
- “How similar things are.” – Sal
- Relative to Chi-Square stats, homogeneity is used when looking at 2 samples to see whether their distributions for a specific variable are similar or not.
- Association
- 1 sample, 2 variables.
- Tests to see if the two variables are independent.
- The opposite of “independent” is “associated”. Variables will either be independent or associated.
- Tells you if there is a correlation between two variables in a single sample.
In the photo above, I mistakenly drew a normal distribution which should actually be a Chi-Square distribution. As a reminder, this is a Chi-Square distribution:
On the unit test, I found the questions on homogeneity and association the most difficult. These questions asked me to determine 1) if samples were similar or not (i.e. homogeneous or not) and 2) if variables were associated or independent of each other. These word-based questions required me to figure out what H_o and H_a were in order to know whether the test was for homogeneity or association. I realized there was a pattern for both types of questions:
- To determine if you’re testing for homogeneity or association, figure out how many samples were used. If there’s 1 sample, you’re testing for association; if there are 2 samples, you’re testing for homogeneity. (I’ve encoded this pattern in a little sketch after this list.)
- Homogeneity
- If the P-Value > α (a.k.a. significance level), the samples are similar (i.e. “fail to reject H_o”).
- If the P-Value < α, the samples are NOT similar (i.e. “reject H_o”).
- Note: When you’re testing for homogeneity, you always assume the samples are similar at the start (a.k.a. H_o: samples are similar).
- Association
- If the P-Value > α, the variables are independent (i.e. “fail to reject H_o”).
- If the P-Value < α, the variables are associated (i.e. “reject H_o”).
- Note: When you’re testing for association, you always assume the variables are independent at the start (a.k.a. H_o: variables have no association).
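To test myself on this pattern, I turned it into a tiny Python sketch (the function name and the wording of the output are my own, not something from KA):

```python
def chi_square_conclusion(num_samples: int, p_value: float, alpha: float) -> str:
    """Encode the pattern above: the sample count picks the test,
    and comparing the P-Value to alpha picks the conclusion."""
    test = "association" if num_samples == 1 else "homogeneity"
    if p_value > alpha:
        # Fail to reject H_o.
        result = "variables are independent" if test == "association" else "samples are similar"
    else:
        # Reject H_o.
        result = "variables are associated" if test == "association" else "samples are NOT similar"
    return f"Test for {test}: {result}."

print(chi_square_conclusion(num_samples=1, p_value=0.02, alpha=0.05))
# Test for association: variables are associated.
```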
(Random side note: Writing all of the above out really helped me to wrap my head around the definitions of homogeneity and association. It’s interesting how putting a concept into your own words helps you to figure it out.)
Next I went through a video on Association Contingency Tables. It seemed like a contingency table was essentially a two-way table, something I learned about a few weeks ago. After looking up the definition I realized I was more-or-less correct. The definition I found on Google is:
- “A table showing the distribution of one variable in rows and another in columns, used to study the association between the two variables.”
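Just to make the idea concrete, here’s how a contingency table can be built in Python with pandas (the data below is made up purely to show the shape):

```python
import pandas as pd

# Hypothetical raw data: one row per person, two categorical variables.
data = pd.DataFrame({
    "foot": ["right longer", "left longer", "same", "right longer", "same", "left longer"],
    "hand": ["right longer", "left longer", "same", "same", "same", "left longer"],
})

# crosstab counts how often each (hand, foot) combination occurs, producing
# a contingency (two-way) table, with row/column totals added as margins.
table = pd.crosstab(data["hand"], data["foot"], margins=True, margins_name="TOTAL")
print(table)
```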
The following is the question I worked through on association contingency tables:
- Q. Based on the sample statistics between foot length and hand length in the table below, using a 0.05 significance level, is there an association between those two variables? (I.e. if someone’s right foot is longer, is it likely their right hand is longer, etc.?)
|  | Right Foot Longer | Left Foot Longer | Both Feet the Same | TOTAL |
| --- | --- | --- | --- | --- |
| Right Hand Longer | 11 | 3 | 8 | 22 |
| (Expected) |  |  |  |  |
| Left Hand Longer | 2 | 9 | 14 | 25 |
| (Expected) |  |  |  |  |
| Both Hands the Same | 12 | 13 | 28 | 53 |
| (Expected) |  |  |  |  |
| TOTAL | 25 | 25 | 50 | 100 |
- Step 1) – Determine H_o and H_a.
- H_o: The variables are independent.
- H_a: The variables are associated.
- Step 2) – Assuming H_o, figure out the (Expected) results.
- For each cell, this is done by multiplying the cell’s (row-total/overall-total) by its (column-total/overall-total) by the overall-total.
- (In layman’s terms, you’re multiplying the probability of a cell being in its specific row by the probability of it being in its specific column by the overall-total number of people in the sample. There’s also a code sketch after this worked example that runs the same calculation.)
- Example – P(R.F.L. and R.H.L.) = (22/100)*(25/100)*100
- = 5.5
|  | Right Foot Longer | Left Foot Longer | Both Feet the Same | TOTAL |
| --- | --- | --- | --- | --- |
| Right Hand Longer | 11 | 3 | 8 | 22 |
| (Expected) | 5.5 | 5.5 | 11 |  |
| Left Hand Longer | 2 | 9 | 14 | 25 |
| (Expected) | 6.25 | 6.25 | 12.5 |  |
| Both Hands the Same | 12 | 13 | 28 | 53 |
| (Expected) | 13.25 | 13.25 | 26.5 |  |
| TOTAL | 25 | 25 | 50 | 100 |
- Step 3) – Use the Chi-Square formula to figure out a Chi-Square statistic.
- X^2 = Σ((Observed Result – Expected Result)^2/Expected Result)
- = ((11 – 5.5)^2/5.5) + ((3 – 5.5)^2/5.5) + ((8 – 11)^2/11) + ((2 – 6.25)^2/6.25) + ((9 – 6.25)^2/6.25) + ((14 – 12.5)^2/12.5) + ((12 – 13.25)^2/13.25) + ((13 – 13.25)^2/13.25) + ((28 – 26.5)^2/26.5)
- = ~11.942
- Step 4) – Determine the Degrees of Freedom.
- When using contingency tables, degrees of freedom are the (number of rows minus 1) times the (number of columns minus 1).
- D.F. = (3 – 1)(3 – 1)
- = 2*2
- = 4
- (Note: This is because, given the row and column totals, once you know 2 values in the rows and 2 values in the columns you’re able to figure out the rest of the values in each cell.)
- Step 5) – Using the X^2 and D.F., use a Chi-Square table to determine the P-Value.
- (See Chi-Square Table below)
- Based on the table, going across the D.F. = 4 row, our statistic of ~11.942 falls between the columns where X^2 = 9.49 and X^2 = 13.28. Looking up, the P-Values at the top of those columns are 0.05 and 0.01, which means our P-Value is somewhere between these two probabilities, 0.01 and 0.05.
- Step 6) – Based on the significance level and p-value, determine if there is an association between the two variables or not.
- Since the P-Value < α, we reject H_o, which suggests H_a: there IS an association between foot-length and hand-length.
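To double-check the whole thing, I put together a small Python sketch that reproduces Steps 2 through 5 (SciPy isn’t something KA has taught me; the library calls here are my own addition):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the table above (rows = hands, columns = feet).
observed = np.array([
    [11, 3, 8],    # Right Hand Longer
    [2, 9, 14],    # Left Hand Longer
    [12, 13, 28],  # Both Hands the Same
])

# Step 2: expected counts under H_o – (row total / overall total) *
# (column total / overall total) * overall total for each cell.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()
expected = row_totals * col_totals / total
print(expected)  # [[5.5, 5.5, 11], [6.25, 6.25, 12.5], [13.25, 13.25, 26.5]]

# Step 3: the Chi-Square statistic, X^2 = sum((O - E)^2 / E).
x2 = ((observed - expected) ** 2 / expected).sum()
print(x2)  # ~11.942

# Step 4: degrees of freedom = (rows - 1) * (columns - 1).
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(df)  # 4

# Step 5: an exact P-Value instead of the table lookup.
print(chi2.sf(x2, df))  # ~0.018, which is indeed between 0.01 and 0.05

# SciPy can also do all of the above in one call.
print(chi2_contingency(observed))
```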
After finishing the unit on Chi-Square stats, I then began the unit Advanced Regression (Inference and Transforming) which, as the name suggests, had me revisit regression lines (a.k.a. least-squares lines). On an X-Y graph with a number of data points, a regression line is the line drawn through the graph that best fits all the data points – specifically, the line that minimizes the sum of the squared distances from all the data points to it. Here is a photo from my notes that shows how they work:
As shown in the photo above, the formulas for a regression line of a population and a sample are:
- (Note: Regression Lines are denoted with ŷ)
- Population
- ŷ = α + βx
- α = “Alpha” = y-intercept
- β = “Beta” = slope (= m)
- Sample
- ŷ = a + bx
- a = y-intercept (the sample estimate of α)
- b = slope (the sample estimate of β)
Apart from the notation being different, the formulas are exactly the same. They’re the standard formula for a line, y = mx + b, just written with different symbols. I wasn’t shown this week how to calculate the slope of a regression line from a given set of data points. I looked back through my notes to see if I’d written it down anywhere but wasn’t able to find anything. When I looked it up online, the formula I found for the ‘correlation coefficient’, which is needed to calculate the slope, looks pretty intense and I’m fairly certain I’ve never been taught it (I pasted the photo below this paragraph). For now, I’ll have to be content with the basic formulas I went over above and will have to wait to learn about the formula for the correlation coefficient later on.
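That said, here’s a sketch of my current understanding in Python: the standard formulas are b = r*(s_y/s_x) for the slope and a = ȳ – b*x̄ for the y-intercept, where r is the correlation coefficient (NumPy computes r for me, and the example data is made up):

```python
import numpy as np

# Hypothetical data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# The correlation coefficient r (NumPy handles the "intense" formula).
r = np.corrcoef(x, y)[0, 1]

# Slope and y-intercept of the least-squares line y-hat = a + bx:
# b = r * (s_y / s_x), and the line passes through (x-bar, y-bar).
b = r * (y.std(ddof=1) / x.std(ddof=1))
a = y.mean() - b * x.mean()
print(a, b)

# Sanity check against NumPy's own least-squares fit (returns [b, a]).
print(np.polyfit(x, y, deg=1))
```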
In the same way that there are conditions that must be met to infer that the distribution of a sample will be ‘normal’, there are conditions that must be met to infer that a regression line is appropriate to use on a sample set of data. The acronym for these conditions is L.I.N.E.R. which stands for:
- Linear
- There must be a linear relationship between the data points (e.g. the data points cannot appear to increase exponentially).
- Independent
- The data must have been sampled either with replacement or meet the 10% Rule.
- Normal
- This one is tricky to understand. At every X-value along the regression line, the data must be normally distributed above and below the line.
- Sal demonstrated this by making a 3D drawing of a normal distribution on top of a regression line, representing that the data piles up on the line and disperses away from each point on it in the shape of a normal distribution.
- Equal Variance
- The spread (variance) of the data around the regression line must be roughly equal all the way along the line.
- Random
- The sample data must be chosen at random.
I was given a number of questions where “the sample data was inputted into a computer” which generated some tables of data that looked very confusing to me. Here’s an example question I wrote down in my notes:
I’m not going to go through the entire question, but I will talk briefly about the table. To make it a bit clearer, this is what the table looked like:
| Predictor | Coef | SE Coef | T | P |
| --- | --- | --- | --- | --- |
| Constant | 127.092 | 57.07 | 2.210 | 0.032 |
| Speed | 6.084 | 2.029 | 2.99 | 0.004 |
I don’t know much about this table yet. I know that the cell of Speed/Coef equals b (the slope), and I believe Speed/SE Coef is the standard error – i.e. the standard deviation – of the slope. I’m pretty sure that T stands for T-value and P stands for P-Value. I don’t really know exactly what the term Constant means, but I’m fairly certain that Constant/Coef is the y-intercept, a.k.a. a. I’m hoping and think that KA will run me through more videos on this type of computer-generated table since I don’t think I’ve ever been taught anything about them before now.
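From what I can tell, tables like this come out of stats software. As a sketch of where the numbers come from, here’s how a similar table can be generated in Python with statsmodels (the library and the made-up data are my additions, not something from KA):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data, made up for illustration: speed vs. some response.
speed = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
response = np.array([260, 275, 310, 340, 365, 405, 430], dtype=float)

# Add a column of 1s so the model fits a y-intercept (the "Constant" row).
X = sm.add_constant(speed)
model = sm.OLS(response, X).fit()

# The summary includes a coef / std err / t / P>|t| table like the one
# above: "Coef" is the estimate (a for Constant, b for the slope),
# "SE Coef" is the standard error, and T = Coef / SE Coef.
print(model.summary())

# The same numbers individually:
print(model.params)   # [a, b] - y-intercept and slope
print(model.bse)      # standard errors
print(model.tvalues)  # T-values (params / bse)
print(model.pvalues)  # two-sided P-values
```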
This coming week I’d like to get through the final unit of the course, Analysis of Variance (ANOVA) (no M.P.), on Tuesday and then have 4 days to try and get through the course test. I’m almost positive I won’t get 100% on the test on my first attempt, but perhaps if I’m able to go through it a few times I’ll pass by the end of the week. Course tests are always difficult, however, so I won’t be upset if this doesn’t happen. I noticed this week that I’m 100% of the way through the next course, High School Statistics, but only 75% of the way through the following course, AP College Statistics. Taking this into consideration, I think it’s likely I’ll need to revise my goal of getting all the way through stats from the end of November to the end of December. My minimum goal from the start was to begin calculus by the start of the New Year anyway, though, so I’m not beating myself up too badly about it. That said, I’m still aiming to get through stats well before then.
(Also, I still don’t know how to end these blogs…)