Week 65 – Nov. 23rd to Nov. 29th

Although I wrote fewer notes this week than in any other week so far, I don’t feel too bad about the amount of work I got done. I think I averaged slightly more than an hour of work per day and got a better understanding of a few tricky concepts, most of which had to do with sampling distributions. I finished off the unit Sampling Distributions and got a bit less than halfway through the following unit, Inference – Comparing Two Groups or Populations. The most difficult part of the week was the consistent feeling that I’m on the brink of having my head wrapped around the majority of stats but, because I’ve only figured out ~95%, I still feel completely lost. An analogy would be that it’s as if I’m in a pitch-black room and have to stumble my way around to map out the room with my hands. Once I have the entire room mapped out in my head, it’s as if a light gets turned on and I can see everything plain and clear. I think I’m close to having stats mapped out, but not quite well enough for the light to come on.

It took me four tries to finish the unit test from Sampling Distributions. Going through it multiple times really helped me understand the difference between a sample mean and sample proportion:

  • Sample Mean
    • Denoted with x̅.
    • Conditions to infer a normal distribution:
      • 1) Random
      • 2) n ≥ 30
      • 3) Independent (sampled with replacement or 10% Rule)
    • The normal distribution will show the ‘raw’ numbers (as I call it) of the sample at the bottom.
  • Sample Proportion
    • Denoted with p̂.
    • Conditions to infer a normal distribution:
      • 1) Random
      • 2) np ≥ 10 AND n(1 – p) ≥ 10 (at least 10 expected successes and 10 expected failures)
      • 3) Independent (sampled with replacement or 10% Rule)
    • The normal distribution will show a range of decimals at the bottom to indicate the proportion of ‘successful outcomes’ in regards to whatever’s being measured.
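
As a sanity check for myself, the conditions above can be written as a small helper. This is a sketch of my own; the function names and the population-size parameter are my inventions, not anything from the course:

```python
# Sketch of the normality conditions above as code. These helper names and
# the population_size parameter are my own inventions, not from the course.
def mean_conditions_met(n, population_size, sampled_randomly=True):
    """Sample mean: random, n >= 30, and n at most 10% of the population."""
    return sampled_randomly and n >= 30 and n <= 0.10 * population_size

def proportion_conditions_met(n, p, population_size, sampled_randomly=True):
    """Sample proportion: random, at least 10 expected successes AND
    10 expected failures, and n at most 10% of the population."""
    return (sampled_randomly
            and n * p >= 10            # expected successes
            and n * (1 - p) >= 10      # expected failures
            and n <= 0.10 * population_size)

print(mean_conditions_met(n=40, population_size=10_000))                # True
print(proportion_conditions_met(n=50, p=0.1, population_size=10_000))   # False: only 5 expected successes
```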

I realized after going through the unit test and a number of exercises that these are the only two types of sample statistics covered: means and proportions. What I found particularly difficult in this unit, however, was understanding the formulas for the standard deviation of both a sample mean and a sample proportion. The formulas for each are:

  • Sample Mean
    • “The variance of the sample mean equals the variance of the population divided by the number of datapoints in the sample.”
    • (σ_x̅)^2 = σ^2/n
    • “The standard deviation of the sample mean equals the square root of the variance of the population divided by the number of datapoints in the sample.”
    • σ_x̅ = √(σ^2/n)
      • = √σ^2 / √n
      • = σ/√n
  • Sample Proportion
    • “The standard deviation of a sample proportion equals the square root of the probability of success multiplied by the probability of failure divided by the number of datapoints in the sample.”
    • σ_p̂ = √(P(1 – P)/n)
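
To convince myself these two formulas actually work, here’s a quick simulation sketch of my own (not from the course; the population values are made up) comparing the empirical standard deviations of many sample means and sample proportions against σ/√n and √(P(1 – P)/n):

```python
import random
import statistics

random.seed(0)

# Simulation sketch (my own; population values are made up) comparing the
# empirical standard deviations of sample means and sample proportions
# against the formulas sigma/sqrt(n) and sqrt(P*(1 - P)/n).
n, trials = 50, 20_000
population_mu, population_sigma = 10.0, 2.0  # population for the mean case
P = 0.3                                      # success probability for the proportion case

sample_means = [
    statistics.mean(random.gauss(population_mu, population_sigma) for _ in range(n))
    for _ in range(trials)
]
sample_props = [
    sum(random.random() < P for _ in range(n)) / n
    for _ in range(trials)
]

print(statistics.stdev(sample_means), population_sigma / n ** 0.5)  # both ~0.283
print(statistics.stdev(sample_props), (P * (1 - P) / n) ** 0.5)     # both ~0.065
```

The empirical and formula values land within a fraction of a percent of each other, which is what finally made the two formulas feel real to me.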

I couldn’t remember why you don’t square the numerator in the sample proportion S.D. formula, which bothered me so much that I spent the better part of a morning going back and reviewing it. I was able to find a video that explains the entire process. Going through the video gave me some good practice using algebra, which I realized I’m now quite rusty at. To highlight the fact that it’s actually fairly difficult, below is a picture from my notes where I worked through how to derive the standard deviation formula for a sample proportion from the probability-weighted Bernoulli standard deviation formula:

Considering how messy and out of order that entire note is, I’m quite sure no one will be able to decipher what I wrote. Here’s what I wrote cleaned up a bit:

  • (Side note: In a Bernoulli distribution, there are only two outcomes: failure, which equals 0, and success, which equals 1. Each outcome has a specific probability, and the two probabilities combine to equal 100%. The probability of success is denoted with P and the probability of failure is denoted with (1 – P). The mean of the distribution equals P.)
  •  σ^2 = P(failure)*(0 – μ)^2 + P(success)*(1 – μ)^2
    •  = (1 – P)*(0 – μ)^2 + (P)*(1 – μ)^2
      • = (1 – P)*(0 – P)^2 + (P)*(1 – P)^2
      • = (1 – P)*(0 – P)(0 – P) + (P)*(1 – P)(1 – P)
      • = (1 – P)*(0 – 0*P – 0*P + P^2) + (P)*(1 – P – P + P^2)
      • = (1 – P)(P^2) + (P)(1 – 2P + P^2)
      • = (P^2)(1 – P) + (P)(1 – 2P + P^2)
      • = (P^2 – P^3) + (P – 2P^2 + P^3)
      • = P^2 – P^3 + P – 2P^2 + P^3
      • = P^3 – P^3 – 2P^2 + P^2 + P
      • = –P^2 + P
      • = P – P^2
      • = P(1 – P)
    • √σ^2 = √(P(1 – P))
      • σ = √(P(1 – P))
  • (2nd side note: when dealing with a sample proportion, p̂, you would simply divide the variance by n before taking the square root to get σ_p̂ = √(P(1 – P)/n).)
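
The cleaned-up derivation can also be checked numerically. This quick loop (my own check, not from the video) confirms the probability-weighted variance collapses to P(1 – P) for a few values of P:

```python
# Numeric check (my own) that the probability-weighted Bernoulli variance
# P(failure)*(0 - mu)^2 + P(success)*(1 - mu)^2 simplifies to P*(1 - P).
for P in (0.1, 0.3, 0.5, 0.9):
    mu = P  # the mean of a Bernoulli distribution equals P
    weighted_variance = (1 - P) * (0 - mu) ** 2 + P * (1 - mu) ** 2
    print(P, weighted_variance, P * (1 - P))  # the last two values always match
```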

I was able to get through two sections in the following unit Inference – Comparing Two Groups or Populations where I worked through 1) calculating confidence intervals between the difference of two sample proportions and 2) using those types of calculations during hypothesis testing.

  • Sample Proportion Confidence Intervals
    • A calculation done on a sample of normally distributed data to generate an interval built so that x% of intervals constructed this way would capture the true population parameter (here, the difference between the two proportions).
    • Con-In. for difference between sample proportions:
      • Con-In. = (p̂_a – p̂_b) ± z* × σ_(p̂_a – p̂_b)
    • (Side note: I can barely wrap my head around this formula, the different parts to it, and why it works but not at all well enough to do a good job putting it into my own words.)
  • Difference Between Sample Proportions Using Hypothesis Testing
    • (Side note: hypothesis testing compares an “original hypothesis” a.k.a. the “null hypothesis”, H_o or H_0, to an “alternative hypothesis”, H_a or H_1.)
    • When comparing proportions in a hypothesis test, you begin with the null hypothesis that the true proportions of both populations, P_a and P_b, equal each other. The alternative hypothesis is that they don’t equal each other or sometimes, more specifically, that one is larger than the other. You take each sample proportion, p̂_a and p̂_b, run them through the confidence interval formula and, if the resulting interval includes 0, you fail to reject H_o; if not, the result suggests H_a.
      • (Side note: I don’t fully understand why, if the interval includes 0, you fail to reject H_o and, if the interval doesn’t include 0, it suggests H_a as being accurate.)
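
To make the “does the interval include 0” check concrete, here’s a sketch with made-up numbers. I’m assuming the unpooled standard error √(p̂_a(1 – p̂_a)/n_a + p̂_b(1 – p̂_b)/n_b) for σ_(p̂_a – p̂_b), which I believe is the standard form for this interval:

```python
import math

# Sketch of a confidence interval for the difference between two sample
# proportions. The sample numbers below are made up for illustration.
def two_proportion_ci(p_a, n_a, p_b, n_b, z_star=1.96):  # z* ~ 1.96 for 95%
    diff = p_a - p_b
    # Unpooled standard error of (p_a - p_b)
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff - z_star * se, diff + z_star * se

low, high = two_proportion_ci(0.55, 200, 0.48, 200)
print(round(low, 3), round(high, 3))       # interval straddles 0
print("0 in interval:", low <= 0 <= high)  # True -> fail to reject H_0
```

The intuition, as I understand it: the interval is the set of plausible values for the true difference P_a – P_b, so if 0 sits inside it, “no difference” is still plausible; if the whole interval sits above or below 0, it isn’t, and that suggests H_a.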

I’m realizing now as I try to write this post that I really don’t have a great understanding of how either confidence intervals or hypothesis testing works. I’m hoping I’ll get more clarity on both this coming week and better understand the nuances between the two, their purposes, and the intricacies of both of their formulas. I also realized writing this post that I didn’t take nearly enough notes this week. I need to make a point of writing more things down going forward.

Clearly, I’m not going to get through stats by Dec. 1st, i.e. this coming Tuesday. I will be thoroughly disappointed with myself if I’m unable to get through stats by the end of December, however, so it’s time to kick things into gear. I just realized that I didn’t do the unit tests for the previous two units, Confidence Intervals (800/800 M.P.) and Significance Tests (Hypothesis Testing) (1400/1400 M.P.), which I must have skipped over since they’re both marked 100% complete. I’m going to start this coming week by going back and doing both of those tests to get them out of the way. My goal is to get through both tests by Wednesday at the latest and then finish up the unit Inference – Comparing Two Groups or Populations (480/1200 M.P.) by the end of the week. If I can manage that, I’ll still have four units remaining but there’s only 60 M.P. left to get through between all four of them, so I’m hoping I can get through all four units in a single week. If I do, that will give me ~2 weeks to finish the course challenge. Hopefully finishing stats by the end of the month won’t be a photo finish.