Week 45 – June 6th to July 12th

I’m fairly certain I’ve mentioned this before, but I currently work as a squash pro. I believe my best ranking as a junior was 8th in Canada. Since I was 11 years old, I remember wanting to climb the junior rankings to get as highly ranked as possible. I literally spent hundreds (perhaps even thousands) of hours as a kid practicing on my own. Yesterday I realized that what I’m doing here on KA isn’t all that different. Similarly to the vision/goal I had growing up of becoming 1st in the squash rankings, I now have a vision/goal of what I want my life to look like, and I feel like the hours I’m spending working on KA are analogous to the hours I spent practicing on the squash court. Realizing this made me think that the work ethic I developed as a kid on the squash court is being applied here which I thought was interesting and which I’m grateful to have developed.

I didn’t make it through two units this week, unfortunately, but got through the unit Summarizing Quantitative Data. I once again spent a lot of time learning new definitions. I also strengthened my understanding of previously learned definitions, which helped to make what I learned in the previous week much clearer. I was happy to also have learned what ‘standard deviation’ and ‘variance’ are and how to calculate them. Lastly, I learned about box plots, which I also found quite useful. On a bit of a side note, I’ve begun writing these blogs slightly differently; I start by going through my notebook and figuring out the important talking points to create an outline before I begin, which seems to make writing the post easier. I thought this was worth noting as it seems that I’m not only improving my understanding of statistics but also my writing skills!

I started this week by learning the definitions of the terms “sample” and “population” and 5 different notation symbols/letters. I don’t know how to add all of the symbols onto Microsoft Word so here’s a picture of what the symbols look like and their basic definitions (below the photo are more detailed definitions of each):

  • Population
    • Any data set that contains each and every data point of the entire set.
  • Sample
    • A data set that has a portion of all the data points but does not contain all the data points within the set.
  • Σ
    • A.k.a. “Sigma”
    • “Sum of”
    • Used to denote the sum of multiple terms.
    • In layman’s terms, this symbol indicates that you need to add up every value produced by the part of the equation that follows it.
    • Below the symbol is the position (index) of the first value to include and above it is the position of the last value. To the right of the symbol is the expression that each value is plugged into before the results are added together.
  • μ
    • A.k.a. “Mu”
    • The mean of a population.
      • It’s worth noting that Mu is what’s called a ‘parameter’, which is a number that summarizes data for an entire population, as opposed to a ‘statistic’, which is explained just below this.
  • x^-
    • (The symbol actually looks like an “x” with a line directly over top of it but I can’t write it here.)
    • The mean of a sample.
    • I don’t know what the name of this symbol is.
      • The sample mean, unlike Mu, is considered a ‘statistic’ as opposed to a ‘parameter’ which means it summarizes data from a sample and not a population.
  • N
    • An upper case N is conventionally used to indicate the total count of data points in a population.
  • n
    • A lower case n is conventionally used to indicate the total count of data points in a specific sample.
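
To make the notation above a bit more concrete, here is a rough Python sketch. The numbers are made up for illustration (a hypothetical class of 10 test scores treated as the whole population, with the first 4 scores treated as a sample), and the variable names simply mirror the symbols in the list.

```python
# Hypothetical population: the test scores of every student in a class of 10.
population = [72, 85, 90, 66, 78, 95, 88, 70, 81, 75]

N = len(population)        # upper case N: number of data points in the population
total = sum(population)    # Σ: the sum of every value in the population
mu = total / N             # μ: the population mean (a parameter)

# Hypothetical sample: only some of the data points from the population.
sample = population[:4]
n = len(sample)            # lower case n: number of data points in the sample
x_bar = sum(sample) / n    # the sample mean (a statistic)

print(N, mu)     # 10 80.0
print(n, x_bar)  # 4 78.25
```

The sample mean (78.25) lands close to the population mean (80.0) but not exactly on it, which is the practical difference between a statistic and a parameter.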

I then moved on to learning about variance and the standard deviation of a data set, the difference between the two, and how each equation is altered based on whether you’re measuring a sample or a population. The following is a picture of my notes of the two equations and their three variations when measuring a population, an unbiased sample, or a biased sample:

To be honest, I still don’t fully understand why each equation is the way that it is (which reminds me of when I first learned sinusoidal trig equations and I wasn’t able to fully wrap my head around the components of the equation until working on them for a while). From what I’ve come to understand, calculating the variance of a data set will give you a number from zero (meaning all the numbers in the data set are the same) to infinity (a very large number means the data is very spread out). The standard deviation has a similar purpose; it tells you the average distance each data point is away from the mean of the data set (as I edit this post, I realize this may be wrong but don’t know the right answer…). The standard deviation is the square root of the variance. At this point, I don’t know why the square root of the variance is the standard deviation. It’s also worth noting that the “unbiased sample” equations are the ones most often used when calculating either the variance or standard deviation of a sample.

As you can see from the above photo, the equation for calculating the variance or standard deviation changes depending on whether you’re using a sample or a population. The only difference between the two (for both the variance and the standard deviation) is that you divide by n – 1 for a sample as opposed to N for a population. When you divide by n – 1, it’s said that you’re calculating the “unbiased” sample variance or standard deviation, whereas simply dividing by n would be considered a “biased” calculation. I was shown 4 videos on why this is and really don’t understand how or why that’s the case. I plan on watching those 4 videos again to better understand this concept.
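
Here is a rough Python sketch of the calculations described above. The data set is made up for illustration and the function name `variance` and its `sample` flag are my own, not anything from the course. The only thing the flag changes is the divisor: N for a population and n – 1 for an (unbiased) sample. The last two lines check the results against Python’s built-in `statistics` module.

```python
import math
import statistics

def variance(values, sample=False):
    """Variance of a data set; divides by n - 1 when sample=True, else by N."""
    count = len(values)
    mean = sum(values) / count
    # Sum of the squared distances between each data point and the mean.
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = count - 1 if sample else count
    return squared_deviations / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set, mean = 5

population_variance = variance(data)             # 32 / 8 = 4.0
sample_variance = variance(data, sample=True)    # 32 / 7, roughly 4.571

# The standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)   # 2.0
sample_sd = math.sqrt(sample_variance)           # roughly 2.138

# Cross-check against the standard library.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))
```

Dividing by n – 1 always gives a slightly larger answer than dividing by n. As far as I can tell, the usual explanation is that a sample’s data points sit closer to their own sample mean than to the true population mean, so dividing by n tends to underestimate the spread.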

After that I learned more about Box Plots which I was taught are also known as Box and Whisker Plots (which I think is a bit of a silly name to be honest). Here is another photo from my notes of an example of a Box Plot and a few of its components:

  • Q1 and Q3
    • Stand for “quartile 1” and “quartile 3”.
    • Mark the 25th and 75th percentiles.
  • Median/Central Tendency
    • The middle number in a set of numbers.
    • Ex. #1 – In the following odd set of numbers, {1, 2, 3, 4, 5}, the median is 3 as it is the exact middle number of the set with two numbers to its left and two numbers to its right.
    • Ex. #2 – In the following even set of numbers, {4, 60, 60, 100, 120, 170}, because it is an even set of 6 numbers with 60 and 100 being the 3rd and 4th values, the median equals 80 as it is the average/mean of the two numbers closest to the middle.
    • I think of this as Q2.
    • On a box plot, this mark makes it easy to see exactly where the halfway point of the data set is.
  • Interquartile Range (IQR)
    • The ‘box’ on a box plot.
    • Indicates where the middle 50% of the values in the data set lie (i.e. between the 25th and 75th percentiles).
  • Min/Max Value Points
    • The points at the end of each ‘whisker’ (i.e. the lines stemming from the left and the right of the ‘box’).
    • Indicate where the lowest and highest data points of the data set fall (excluding any points considered outliers).
  • Outliers
    • When a data point falls well below or above the overall pattern of a distribution.
    • A commonly used rule to calculate outliers is:
      • Outlier < Q1 – 1.5 * IQR, and
      • Outlier > Q3 + 1.5 * IQR
      • These equations state that any data point that’s more than 1.5 times the interquartile range below Q1 or above Q3 is considered an outlier.
      • This is a common convention for determining outliers rather than a strict rule.
    • See photo below for an example of how to draw outliers on a box plot:
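
Separately from the photo, here is a rough Python sketch of the box plot pieces listed above. It assumes the “median of each half” way of finding Q1 and Q3 (other conventions exist and can give slightly different quartiles), and the function names and the data set are made up for illustration. The data is the even set from Ex. #2 with one extra large value added so that the 1.5 * IQR rule has something to flag.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using the 'median of each half' convention."""
    ordered = sorted(values)
    count = len(ordered)
    half = count // 2
    lower_half = ordered[:half]
    # When the count is odd, skip the middle value so it isn't in either half.
    upper_half = ordered[half + count % 2:]
    return (statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half))

def find_outliers(values):
    """Flag points more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [4, 60, 60, 100, 120, 170, 400]  # hypothetical data set

q1, median, q3 = quartiles(data)
outliers = find_outliers(data)
# The whiskers end at the lowest and highest points that aren't outliers.
non_outliers = [x for x in data if x not in outliers]

print(q1, median, q3)                        # 60 100 170
print(outliers)                              # [400]
print(min(non_outliers), max(non_outliers))  # 4 170
```

Here the IQR is 170 – 60 = 110, so anything below 60 – 165 = –105 or above 170 + 165 = 335 counts as an outlier, which is why only 400 gets flagged.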

The last thing I was taught in this unit was what’s known as the Mean Absolute Deviation (M.A.D.), which is the average distance each data point in a data set is away from the mean. To me this seems very similar to the standard deviation, which uses a very similar formula. The only difference I can tell between the two is that when calculating the S.D. you square the difference between each data point and the mean and then take the square root of the end result, whereas when calculating the M.A.D. you take the absolute value of each difference instead.

Here is a photo from my notes that explains how to calculate the M.A.D. and what its equation looks like:
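
In case it helps alongside my notes, here is a rough Python sketch comparing the M.A.D. with the (population) standard deviation. The data set and function names are made up for illustration. The two functions are identical except for how each distance from the mean is handled: absolute value for the M.A.D., squaring followed by a square root at the end for the S.D.

```python
import math

def mean_absolute_deviation(values):
    """Average of the absolute distances between each data point and the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

def population_standard_deviation(values):
    """Square root of the average squared distance from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set, mean = 5

mad = mean_absolute_deviation(data)       # (3+1+1+1+0+0+2+4) / 8
sd = population_standard_deviation(data)  # sqrt((9+1+1+1+0+0+4+16) / 8)

print(mad)  # 1.5
print(sd)   # 2.0
```

Squaring gives extra weight to the points that are far from the mean, so for the same data the standard deviation comes out at least as large as the M.A.D.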

This coming week I think it may be possible to get through the next two units, Modeling Data Distributions (0/900 M.P.) and Exploring Bivariate Numerical Data (0/1300 M.P.), although I once again think it may be tough. Looking ahead into each unit, they both contain a number of things which I don’t think I’ve ever worked on before so there’s a good chance those things will take longer for me to get through than I’d like them to. As I’ve mentioned before, however, I’m going to need to start regularly getting through more than one unit a week if I’d like to get into calculus before the start of the new year. No time like the present I suppose!