Week 46 – June 13th to July 19th

I have three statistics courses left to get through before I begin calculus. My goal since the start of the year has been to begin calculus before the new year at the latest. Looking ahead at how many units there are in each statistics course, and considering my average pace of units per week, I’m concerned that the rate I’m getting through them isn’t quick enough. On top of that, I once again only made it through one unit this week, which is the bad news. The good news, however, is that I just realized the next two statistics courses contain many of the same units that make up this course! This means there’s a good chance I’ll have a fair amount of each course completed before I even begin them, which has restored my optimism that I’ll be able to reach my goal. Yay!

This week I once again got a better understanding of two more seemingly important definitions:

  • Percentile
    • Either the percent of data that is BELOW the amount in question, or
    • The percent of data that is AT OR BELOW the amount in question (the short sketch after this list shows the difference).
  • Probability
    • Very similar to percent, but not exactly the same, and written differently.
    • Think “how likely something is to happen.”
    • The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates impossibility of the event occurring and 1 indicates certainty that it will occur.
    • Where percent can go above 100 or below 0 (for instance, a stock could rise by 150% or drop by 20%), probability deals with chance, which is always between 0 (no chance) and 1 (a guarantee).
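To make the difference between those two percentile definitions concrete, here’s a quick Python sketch using a made-up list of test scores (not from KA, just for illustration):

    # Hypothetical data set, purely for illustration
    scores = [55, 60, 62, 68, 70, 75, 80, 85, 90, 95]
    value = 75

    # Definition 1: percent of data strictly below the value
    below = sum(1 for s in scores if s < value) / len(scores)
    # Definition 2: percent of data at or below the value
    at_or_below = sum(1 for s in scores if s <= value) / len(scores)

    print(f"Percent of data below {value}: {below:.0%}")              # 50%
    print(f"Percent of data at or below {value}: {at_or_below:.0%}")  # 60%

Depending on which definition a question uses, 75 in that little data set is either the 50th or the 60th percentile, which is exactly the ambiguity the two bullet points describe.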

I was introduced to Cumulative Relative Frequency (CRF) graphs this week, which are line graphs that show how accumulated data breaks down into its corresponding percentiles. A CRF graph shows where data from a population falls from 0-100%. Here is an example CRF graph:

This is a CRF graph I got from KA which measures the sugar content, in grams, of 32 drinks in a coffee shop. The x-axis indicates that the amount of sugar ranges from 0 to 50 grams and the y-axis measures the cumulative relative frequency. The blue line indicates how much sugar is in each successive drink when they’re ordered from least amount of sugar to greatest. As you can see, the purple arrows indicate that you must get to 0.5, a.k.a. the 50th percentile of sugary drinks, to have at least 25 grams of sugar. As another example, according to this graph, you must get to the 80th percentile of sugary drinks to have at least 40 grams of sugar.
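Since I can’t reproduce the actual KA graph here, the following Python sketch shows the same idea with made-up sugar amounts for 32 drinks (NOT the real data from the lesson); the way you read a percentile off the cumulative relative frequencies is the same:

    import numpy as np

    # Made-up sugar content (grams) for 32 drinks -- not the actual KA data
    rng = np.random.default_rng(0)
    sugar = np.sort(rng.uniform(0, 50, 32))

    # Cumulative relative frequency: fraction of drinks at or below each amount
    crf = np.arange(1, len(sugar) + 1) / len(sugar)

    # Read off roughly the 50th percentile: the largest amount with CRF <= 0.5
    median_sugar = sugar[crf <= 0.5][-1]
    print(f"About 50% of the drinks have at most {median_sugar:.1f} g of sugar")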

The next thing I learned about is what’s known as a Density Curve, which is a curved graph that looks like a hill of some sort. These graphs are used to represent probability. The ‘hill’ can take just about any shape. The total area under the ‘hill’ represents 100% of the probability, and the area above any range of x values gives the probability of an outcome landing in that range. The y-axis indicates density (how concentrated the data is at each x value) rather than reading off percentiles directly the way a CRF graph does. Here is a picture I found on Google of a density curve:
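As a sanity check on the “area under the hill equals 100%” idea, here’s a small sketch that adds up the area under one particular density curve (I’m using scipy’s normal curve as the example, but any proper density curve should total 1):

    import numpy as np
    from scipy.stats import norm

    # Evaluate a normal density curve on a fine grid and add up the area under it
    x = np.linspace(-6, 6, 10001)       # wide enough to capture essentially all the area
    dx = x[1] - x[0]
    area = (norm.pdf(x) * dx).sum()     # simple Riemann-sum approximation
    print(f"Total area under the curve: {area:.4f}")   # ~1.0000, i.e. 100%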

When reading a density curve, I learned that if the curve is skewed to the right (a.k.a. right-tailed, i.e. it tapers off to the right) the mean will be to the right of the median. The reverse is true if the curve is skewed to the left. Here is another photo from Google that helps make this concept more clear:
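Here’s a quick way to double-check the skew rule in Python. An exponential distribution is a classic right-skewed ‘hill’, so the mean of a large sample from it should land to the right of (i.e. be greater than) the median:

    import numpy as np

    # A right-skewed sample: exponentially distributed values (long tail to the right)
    rng = np.random.default_rng(1)
    data = rng.exponential(scale=10, size=100_000)

    print(f"mean   = {data.mean():.2f}")      # ~10.0, pulled toward the long right tail
    print(f"median = {np.median(data):.2f}")  # ~6.9, sits to the left of the mean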

The last thing I worked on this week was what’s known as the Normal Distribution, a.k.a. the Normal Curve or Bell Curve (the Standard Normal Distribution is the special case with a mean of 0 and a standard deviation of 1). From what I can tell, this is an incredibly important statistical concept based on how Sal talked about it. Here’s a basic photo of the normal curve:

Looking at the shape of it, you can see why it’s also called a Bell Curve. You can also see from this photo that the curve is symmetrical, meaning the data is equally distributed on either side of the mean. Because the curve rises in the centre, the majority of the data falls in the middle of the measurement. The above photo also makes it clear that in a normal distribution, the mean, median, and mode are all equal.
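A quick way to confirm the mean-median-mode point is to ask scipy for a normal distribution’s mean and median directly and then check where its curve peaks (the mode). The curve below is a made-up one centred at 50 with an S.D. of 10:

    import numpy as np
    from scipy.stats import norm

    dist = norm(loc=50, scale=10)          # a normal curve centred at 50 with S.D. 10

    x = np.linspace(0, 100, 10001)
    mode = x[np.argmax(dist.pdf(x))]       # the x value where the curve is highest

    print(f"mean   = {dist.mean():.1f}")   # 50.0
    print(f"median = {dist.median():.1f}") # 50.0
    print(f"mode   = {mode:.1f}")          # 50.0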

An important concept I learned about normal distributions is what’s known as the Empirical Rule, a.k.a. the 68-95-99.7 rule. You can see from the following photo how this concept works:

The empirical rule states that on a normal distribution curve 68% of the data falls within +/-1 standard deviation of the mean, 95% of the data falls within +/-2 standard deviations of the mean, and 99.7% of the data falls within +/-3 standard deviations of the mean. (As you can see from the photo, it’s actually 68.3% and 95.4% respectively, but my guess is they drop the three- and four-tenths of a percent to make things easier to remember.) Knowing this rule makes it quite a bit easier to quickly estimate the probability of an event occurring within a normal distribution without having to do too much math.
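To convince myself of the 68-95-99.7 numbers (and the more exact 68.3% and 95.4% the photo shows), here’s a quick check using scipy’s normal CDF:

    from scipy.stats import norm

    # Probability of landing within k standard deviations of the mean
    for k in (1, 2, 3):
        p = norm.cdf(k) - norm.cdf(-k)
        print(f"within +/-{k} S.D.: {p:.1%}")

    # within +/-1 S.D.: 68.3%
    # within +/-2 S.D.: 95.4%
    # within +/-3 S.D.: 99.7%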

Finally, I learned about z-scores and how they help give more specific answers about the probability of certain events occurring within a normal distribution. A z-score tells you how many standard deviations a value is above or below the mean. To turn a z-score into a probability, you look it up in a Z-table, which breaks each standard deviation into hundredths and shows the probability of a value falling at or below that point. Here are the negative and positive Z-tables:

The negative Z-table is used when measuring data to the left of the mean on a normal distribution and the positive Z-table is used when measuring data to the right. The way you read a Z-table is by figuring out how many S.D.’s you are away from the mean, looking at the left side of the table to find the ones and tenths places of the S.D. value, and then moving across to the right from there to find the hundredths place, if necessary.
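As an aside, scipy’s normal CDF gives the same numbers you’d look up in a Z-table, which is handy for checking your table reading. This sketch reproduces the z = 2.3 row of the positive Z-table, with the columns running across the hundredths place:

    from scipy.stats import norm

    # Reproduce the z = 2.3 row of a positive Z-table
    row = 2.3
    for hundredths in range(10):
        z = row + hundredths / 100
        print(f"z = {z:.2f}  ->  {norm.cdf(z):.4f}")   # e.g. z = 2.34 -> 0.9904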

For example, if you were told, “John’s test score was 2.34 standard deviations higher than the mean of the test – find his percentile,” you’d look for 2.3 on the left side of the positive Z-table and then move across to the .04 column, which shows a value of .9904. Multiplying that value by 100 gives the percentile, which means John’s test score would have been in the 99th percentile, i.e. he had a better test score than 99.04% of all other students. If the question then went on to ask, “The mean test score was 50 and the S.D. was 10 – what was John’s score?” you’d do the following calculation:

  • First calculate how many marks John’s score was above the mean:
    • 2.34/1 = x/10
      • 2.34 * 10 = x
      • 23.4 = x
  • Then add that to the mean to find out his score:
    • 23.4 + 50 = John’s score = 73.4
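
Here’s that whole John example in Python, using the numbers given in the question (mean of 50, S.D. of 10) and with scipy’s norm.cdf standing in for the Z-table lookup:

    from scipy.stats import norm

    z = 2.34            # John's score is 2.34 S.D.'s above the mean
    mean, sd = 50, 10   # from the question

    percentile = norm.cdf(z) * 100   # the Z-table lookup: area to the left of z
    score = mean + z * sd            # convert the z-score back into marks

    print(f"percentile = {percentile:.2f}")   # 99.04
    print(f"John's score = {score:.1f}")      # 73.4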

Using these types of calculations, you can find out the probability of something occurring between two points on a normal curve, as well. Lastly, two important things about a normal curve to remember are that 1) the curve goes on indefinitely in each direction, i.e. the curve never touches the x-axis, and 2) you must specify a range between two x-coordinates to get an approximate probability of an event occurring.
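For example, to get the probability of landing between two points you subtract the smaller cumulative area from the larger one. A quick sketch, sticking with the hypothetical test above (mean of 50, S.D. of 10):

    from scipy.stats import norm

    mean, sd = 50, 10

    # P(60 <= score <= 70): area to the left of 70 minus area to the left of 60
    p = norm.cdf(70, loc=mean, scale=sd) - norm.cdf(60, loc=mean, scale=sd)
    print(f"P(score between 60 and 70) = {p:.1%}")   # ~13.6%, the slice between +1 and +2 S.D.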

I have 5 weeks left before I reach the 1-year mark. It would be great to get through this course, Statistics and Probability, before I hit that milestone, but there are 12 units remaining in the course so I don’t think that’s likely. If I can rattle off a few KA hat tricks, I suppose it might be doable. The next three units are Exploring Bivariate Numerical Data (0/1300 M.P.), Study Design (0/900 M.P.), and Probability (0/1600 M.P.), each of which looks fairly challenging. If I don’t finish this course by the 1-year mark it won’t be the end of the world, but I figure I might as well try! That way, even if I don’t manage to do it, I’ll still be further along than I would have been if I didn’t push myself.