Week 44 – June 29th to July 5th

At the beginning of the week I thought I was going to be able to get a KA hat trick but I didn’t quite manage it. I did get through the first two units, however, Analyzing Categorical Data and Displaying and Comparing Quantitative Data. I had finished ~20-25% of each unit before I began them but I still went through all the videos in each unit as a review. Overall, I’m happy with the start I made in this course but, considering how many units there are to get through in this and the following two statistics courses, I’ll need to speed up my pace if I’d like to get into calculus before 2021.

I spent a large part of this week reviewing and memorizing definitions. My gut tells me I’ll be doing a lot of this for the next little while. Here is a data table that I copied from a KA video, which I’ll use to go through a number of important definitions following it:

Drink              | Type | Calories | Sugars (g) | Caffeine (mg)
Brewed coffee      | Hot  | 4        | 0          | 260
Caffe latte        | Hot  | 100      | 14         | 75
Caffe mocha        | Hot  | 170      | 27         | 95
Cappuccino         | Hot  | 60       | 8          | 75
Iced brewed coffee | Cold | 60       | 15         | 120
Chai latte         | Hot  | 120      | 25         | 60
  • Individuals
    • The objects/things which are being described by the data set.
    • The types of coffee are the individuals of this table.
  • Categorical Variable
    • A variable where the data can be broken up into specific groups.
    • The type of coffee is a categorical variable as it can only be either Hot or Cold.
  • Quantitative Variable
    • A variable where the data can be represented in amounts/numerical values.
    • Calories, sugars, and caffeine are all quantitative variables.
  • Median
    • The middle number in a set of numbers.
    • Calories can be broken down into a set of [4, 60, 60, 100, 120, 170]. Because it is an even set of 6 numbers with 60 and 100 being the 3rd and 4th values, the median of the set equals 80 because it’s the average/mean of the two numbers closest to the middle.
  • Midrange
    • The average/mean of the highest and lowest numbers in a set of numbers.
    • Ex. the midrange of caffeine in the above example would be (260+60)/2 = 160.
    • To remember this term, it helps me to think Mid(dle-of-the)range.
  • Mean/Average/Centre
    • The mean/average is found by adding up all the values in a dataset and dividing by the total number of values.
    • Ex. the mean value of sugar in the above table is (0 + 14 + 27 + 8 + 15 + 25)/6 = ~14.83.
    • As far as I know, the centre refers to the mean/average when specifically looking at and speaking about a graph or chart but is essentially the same thing.
  • Mode
    • The most common number in a dataset.
    • In the calories category, 60 would be the mode as it occurs twice while every other value only occurs once.
  • Range/Spread/Variability
    • The difference between the highest value and the lowest value.
    • The range in the calories category is 170 – 4 = 166.
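
The definitions above can be double-checked with a few lines of Python. This is only a minimal sketch: the lists are copied from the coffee table, the helper names `midrange` and `value_range` are my own, and the rest comes from Python's built-in `statistics` module.

```python
import statistics

# Values copied from the coffee table, in row order.
calories = [4, 100, 170, 60, 60, 120]
sugars_g = [0, 14, 27, 8, 15, 25]
caffeine_mg = [260, 75, 95, 75, 120, 60]


def midrange(values):
    """Mean of the highest and lowest values (hypothetical helper name)."""
    return (max(values) + min(values)) / 2


def value_range(values):
    """Difference between the highest and lowest values (hypothetical helper name)."""
    return max(values) - min(values)


# With an even number of values, median() averages the two middle values.
median_calories = statistics.median(calories)
midrange_caffeine = midrange(caffeine_mg)
mean_sugars = statistics.mean(sugars_g)
# mode() returns the most common value in the list.
mode_calories = statistics.mode(calories)
range_calories = value_range(calories)

print("Median calories:", median_calories)
print("Midrange caffeine:", midrange_caffeine)
print("Mean sugars:", round(mean_sugars, 2))
print("Mode calories:", mode_calories)
print("Range calories:", range_calories)
```

Sorting isn't needed before calling `statistics.median`, since the function orders the values itself.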

After going over those definitions, I then learned (or perhaps relearned?) more about two-way tables, a.k.a. joint-distribution tables. These are tables which organize the data based on two categorical variables. Take for example this page from my notes:

The table above shows a hypothetical breakdown of the scores students got on a test based on the number of minutes they studied. The top row indicates the number of minutes spent studying and the column on the far left side indicates the scores. Below the chart you’ll see two definitions:

  • Marginal Distribution
    • Focuses in on the totals of one dimension (i.e. all the columns or all the rows) and looks at each individual column’s or row’s percentage relative to the overall total.
    • For example, the right margin shows the breakdown of scores between the 200 students that took the test and indicates what percent of students got which score.
  • Conditional Distribution
    • Looks at one specific row or column and compares each data point in the row or column to the total value of the entire row or column.
    • For example, looking at the conditional distribution of scores in the category of students that spent 41-60 minutes studying, the breakdown is as follows:
      • 80 – 100%
        • 16/86 = ~19%
      • 60 – 79%
        • 30/86 = ~35%
      • 40 – 59%
        • 32/86 = ~37%
      • 20 – 39%
        • 8/86 = ~9%
      • 0 – 19%
        • 0/86 = 0%
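
The conditional distribution above is just each count divided by the total of its own column. Here is a minimal Python sketch of that arithmetic, assuming only the five counts listed for the 41-60 minute column; the function name `conditional_distribution` is my own.

```python
# Counts for the 41-60 minute column, copied from the breakdown above.
counts_41_to_60 = {
    "80-100%": 16,
    "60-79%": 30,
    "40-59%": 32,
    "20-39%": 8,
    "0-19%": 0,
}


def conditional_distribution(counts):
    """Divide each count by the total of its own row or column."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


distribution = conditional_distribution(counts_41_to_60)
for label, share in distribution.items():
    # Format each share as a whole-number percentage.
    print(f"{label}: {share:.0%}")
```

Because every share uses the same column total, the shares always add up to 100% (before rounding).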

One last definition I learned when working on two-way tables is the term:

  • “In counts”
    • In the above two-way table, if a question asked you to “find the marginal distribution of students who studied between 21-40 minutes ‘in counts’” it’s asking you to state the total value of the entire column (i.e. 30) as opposed to stating the percentage (i.e. 15%).
    • In layman’s terms, it just means state the total value as it appears at the bottom of the column or end of the row.
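
The difference between answering "in counts" and answering as a percentage can be sketched in a couple of lines of Python. The function name `marginal_entry` and its `in_counts` flag are made up for illustration; the numbers are the 30 students out of 200 from the example above.

```python
def marginal_entry(column_total, grand_total, in_counts=False):
    """Return a marginal total as a raw count or as a share of the grand total."""
    if in_counts:
        return column_total
    return column_total / grand_total


# 30 of the 200 students studied for 21-40 minutes.
as_count = marginal_entry(30, 200, in_counts=True)
as_share = marginal_entry(30, 200)

print("In counts:", as_count)
print(f"As a percentage: {as_share:.0%}")
```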

I continued the week going through different types of tables/graphs/charts and learning about their individual characteristics and idiosyncrasies.

  • Frequency Table
    • A list, table, or graph that displays the frequency of various outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences of a specific value within a particular group or interval.
    • Looks very similar to an X, Y coordinate breakdown chart.
  • Dot Plot
    • A chart consisting of data points plotted on a fairly simple scale, typically using filled in circles.
    • In my opinion, it’s the exact same thing as a bar graph but uses dots in place of bars.
  • Histogram
    • A bar graph that uses “bins” or “buckets” for a range of values on the X-axis.
    • Ex. in the histogram in the first photo above, the “buckets” were 0-3, 4-6, 7-9, and 10-11.
    • As noted in that photo, I likely drew that histogram wrong as I believe the bars of the histogram need to be touching.
  • Stem and Leaf Plot
    • A chart which displays numerical data by splitting each data point into a “leaf” (usually the last digit of the number/data point) and a “stem” (the preceding digit/digits).
  • Box Plot
    • A linear style of chart which depicts a breakdown of numerical data into quartiles.
    • The ‘box’ on a box plot indicates where the middle 50% of the data lies (i.e. the data between the 25th-75th percentile), and the width of the box is the Interquartile Range. The lines on the left and right indicate where 0-25% and 75-100% of the data falls, respectively.
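
The five numbers a box plot is built from (minimum, first quartile, median, third quartile, maximum) can be worked out with a short Python sketch. It assumes the "median of each half" method for quartiles, which is my understanding of what KA teaches; spreadsheets and other software sometimes use slightly different conventions. The caffeine values come from the coffee table earlier in this post and the function name is my own.

```python
import statistics


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-each-half method.

    Assumes at least two values. With an odd count, the middle value is
    left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )


caffeine_mg = [260, 75, 95, 75, 120, 60]
minimum, q1, median, q3, maximum = five_number_summary(caffeine_mg)
iqr = q3 - q1  # width of the 'box', i.e. the middle 50% of the data

print("Min:", minimum, "Q1:", q1, "Median:", median, "Q3:", q3, "Max:", maximum)
print("Interquartile range:", iqr)
```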

There were a few somewhat miscellaneous, but nonetheless notable, definitions I learned throughout the week which were:

  • Independent vs Dependent Events
    • Independent
      • An event in which the outcome isn’t affected by another event.
      • I.e. the likelihood of an event occurring will not change based on the outcome of a previous occurring event.
    • Dependent
      • An event which is affected by another event.
      • I.e. the likelihood that an event will occur changes based on the outcome of other events which happened before it.
  • Left- and Right-Tailed Graphs
    • When a bar graph tapers off to the left it is considered ‘left-tailed’ and when it tapers off to the right it is considered ‘right-tailed’.
    • A.k.a. the graphs are “skewed” to the left or right.
  • Standard Deviation
    • “Tells you how spread out the data is. It is a measure of how far each observed value is from the mean. In any distribution, about 95% of values will be within 2 standard deviations of the mean.” – Google
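    • A graph with a low S.D. has its data bunched tightly around the mean (a tall, narrow peak) whereas a graph with a high S.D. has its data spread further away from the mean (a wider, flatter shape, or peaks that sit far apart).
    • As far as I can tell, the 95% figure in the quote applies to bell-shaped (normal) distributions rather than to every distribution.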
    • In my opinion, KA didn’t do a great job of explaining what the standard deviation is. In fact, I don’t think I was shown/taught anything about S.D. but was given questions in the exercises and unit test about it. I spent close to an hour googling it to figure it out and, at this point, I still don’t have a great idea of what it is and have no clue how to calculate it.
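
For what it's worth, the standard textbook recipe for standard deviation is: find the mean, measure how far each value is from the mean, square those distances, average the squares, and take the square root. Below is a minimal Python sketch of that recipe using the sugar values from the coffee table. The function names are my own, and which version applies (population, dividing by n, or sample, dividing by n - 1) depends on whether the data is the whole group or only a sample of it.

```python
import math


def population_std_dev(values):
    """Square root of the average squared distance from the mean (divide by n)."""
    mean = sum(values) / len(values)
    squared_deviations = [(value - mean) ** 2 for value in values]
    return math.sqrt(sum(squared_deviations) / len(values))


def sample_std_dev(values):
    """Same idea, but divide by n - 1 because the data is only a sample."""
    mean = sum(values) / len(values)
    squared_deviations = [(value - mean) ** 2 for value in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))


# Sugar values (g) copied from the coffee table.
sugars_g = [0, 14, 27, 8, 15, 25]

print("Population S.D.:", round(population_std_dev(sugars_g), 2))
print("Sample S.D.:", round(sample_std_dev(sugars_g), 2))
```

Python's built-in `statistics.pstdev` and `statistics.stdev` do the same two calculations, so they make a handy cross-check.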

It seems a bit crazy to me that I’m beginning Week 45. I’m only 7 weeks away from hitting the one year mark! I’m very happy with the effort I’ve put in up to this point. This coming week I’m hoping to get through another two units although, looking ahead, it looks like the next two units, Summarizing Quantitative Data (0/1700 M.P.) and Modeling Data Distributions (0/900 M.P.), will get into things I’ve never learned before which means it could be difficult to finish both. Regardless, I’ll continue working away and keep making progress towards the calculus courses. Even though there are still plenty of stats units to get through before I make it to calculus, looking at how far I’ve come, it really doesn’t seem that far away.