Week 63 – Nov. 9th to Nov. 15th

Well, the bad news is that I made an error at the end of my post last week; I said I had 9 units left to complete in the final statistics course, AP College Statistics, when I actually had 11. The good news, however, is that I got through 5 units this week. FIVE!! Of course, all of the units were ~75-85% finished before I began them but, nonetheless, I’m pumped that I was able to cross that many off my list in one week. Rather than going back through the entirety of each unit, I only watched the videos and did the exercises that I hadn’t yet done in each unit, and then took each unit test afterwards. I figured that the questions I got wrong in the unit test would indicate which videos/exercises I would need to go back and review. Luckily, of the 50+ questions across all 5 unit tests, there were only a handful I couldn’t remember, so I didn’t have to spend much time going back through the material. I was happy about this because 1) obviously it meant I could get further ahead but, more importantly, 2) it meant I actually retained most of what I learned over the past 4 months. WOOO!!!

In the first unit I went through, Analyzing Categorical Data, the only new thing I learned about was what are known as Mosaic Plots. Here is a photo from my notes of an example of this type of plot:

In this photo, the exercise stated something along the lines of, “based on the given data set of adults, children, and infants that developed antibodies to condition x, plot the data in a segmented bar chart and a mosaic plot.” The steps are labelled 1-4 in the photo.

  1. Write down the information in counts, a.k.a. the raw numbers from the sample data of each category and their respective totals.
  2. Convert the raw data from counts to percentages within each category.
  3. Convert the table that indicates percent into a segmented bar chart.
    • Note: this is where some information is lost. Each bar shows the percent of each category that has/does-not-have the antibodies, but it doesn’t indicate the size of each category relative to the size of the other categories or the total sample size, as a whole.
  4. Convert tables (1) and (2) into a Mosaic Plot.
    • Note: in this plot you see that the heights of the ‘Yes’ and ‘No’ sections within each category are the same as in the segmented bar chart, but the width of each category has changed. The widths in the mosaic plot indicate the sample size of each category.
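
The four steps above can be sketched in a few lines of Python. This is only an illustration: the counts below are made up (the real numbers were in the photo of my notes), and `mosaic_layout` is a hypothetical helper name. It computes the two things a mosaic plot encodes: the width of each category (its share of the whole sample) and the heights of the ‘Yes’/‘No’ segments (the shares within that category).

```python
# Hypothetical counts, NOT the numbers from the original exercise.
counts = {
    "Adults":   {"Yes": 60, "No": 40},
    "Children": {"Yes": 30, "No": 20},
    "Infants":  {"Yes": 10, "No": 40},
}

def mosaic_layout(counts):
    """For each category, return its width (share of the total sample)
    and its segment heights (share of each answer within the category)."""
    grand_total = sum(sum(group.values()) for group in counts.values())
    layout = {}
    for name, group in counts.items():
        group_total = sum(group.values())
        layout[name] = {
            # Width: how big this category is relative to the whole sample.
            "width": group_total / grand_total,
            # Heights: same percentages a segmented bar chart would show.
            "heights": {answer: n / group_total for answer, n in group.items()},
        }
    return layout

layout = mosaic_layout(counts)
for name, info in layout.items():
    print(name, info)
```

A segmented bar chart only uses the `heights` values, which is why it loses the information about how large each category is; the mosaic plot adds the `width` values back in.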

In the second and fifth units I worked through, Displaying and Describing Quantitative Data and Exploring Bivariate Numerical Data respectively, I went through the characteristics of distributions and scatterplots that you should comment on when describing either:

  • Distributions
    • Shape
      • Left-Skew
      • Right-Skew
      • Symmetric
      • Uniform
        • i.e. all data is the same height on the Y-axis across the X-axis.
      • Bimodal
        • i.e. looks like two twin peaks on either side of X-axis with a valley between.
    • Center
      • Mean
      • Median
    • Spread
      • Range
      • Interquartile Range
      • Mean Absolute Deviation
      • Standard Deviation
    • Outliers
  • Scatterplots
    • Form
      • Linear
      • Non-Linear
    • Direction
      • Positive Slope
      • Negative Slope
    • Strength
      • Strong
        • i.e. correlation coefficient (r) close to 1 or -1.
      • Moderately Strong
      • Weak
        • i.e. correlation coefficient (r) close to 0.
    • Outliers
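
As a rough sketch of how the center, spread, and strength measures in the list above could be computed, here is a small Python example. The data values and the helper names `describe` and `correlation` are made up for illustration; IQR is left out here because it depends on a quartile convention.

```python
import statistics

def describe(data):
    """Center and spread measures for a single quantitative variable."""
    mean = statistics.mean(data)
    return {
        "mean": mean,
        "median": statistics.median(data),
        "range": max(data) - min(data),
        # Mean absolute deviation: average distance from the mean.
        "mean_abs_dev": sum(abs(x - mean) for x in data) / len(data),
        # Sample standard deviation (divides by n - 1).
        "sample_sd": statistics.stdev(data),
    }

def correlation(xs, ys):
    """Correlation coefficient r between two equally long lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data, chosen so the arithmetic is easy to check by hand.
summary = describe([2, 4, 4, 5, 10])
print(summary)

r_up = correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])    # perfectly linear, positive slope
r_down = correlation([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])  # perfectly linear, negative slope
print(r_up, r_down)
```

The sign of r gives the direction (positive or negative slope) and its distance from 0 gives the strength, but r only describes linear form, so it is still worth looking at the scatterplot itself.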

I failed the third unit I went through, Summarizing Quantitative Data, because of a question involving interquartile range (IQR). Before doing that unit test, I felt like I had a pretty good idea of what IQR was and knew how to calculate it. I essentially thought it was just a box-and-whisker plot which, to be fair, involves IQR, but there’s more to it than that. Here’s a summary of what I relearned after going through a couple of videos on IQR:

  • Interquartile Range
    • The range between the 25th and 75th percentile.
    • Ex. Q. What is the IQR of the following dataset of 9 numbers?
      • 43, 44, 44, 44, 45, 45, 47, 48, 48
      • Median = 45
      • 25th percentile = median of the 4 numbers below the median = ((2nd number) + (3rd number))/2
        • = (44 + 44)/2
        • = 88/2
        • = 44
      • 75th Percentile = median of the 4 numbers above the median = ((7th number) + (8th number))/2
        • = (47 + 48)/2
        • = 95/2
        • = 47.5
      • IQR = (75th Percentile) – (25th Percentile)
        • = 47.5 – 44
        • = 3.5
  • IQR Outlier Rule
    • A general rule of thumb states that an outlier can be considered any datapoint that’s < (25th Percentile) – 1.5 * IQR or > (75th Percentile) + 1.5 * IQR.
      • Ex. from above:
        • Outlier < (25th Percentile) – 1.5 * IQR
          • < 44 – 1.5 * 3.5
          • < 44 – 5.25
          • < 38.75
        • Outlier > (75th Percentile) + 1.5 * IQR
          • > 47.5 + 1.5 * 3.5
          • > 47.5 + 5.25
          • > 52.75
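
The IQR calculation and the outlier rule above can be checked with a short Python sketch. It assumes the same quartile convention used in the worked example (the median is left out of both halves when the dataset has an odd number of values); `quartiles_excluding_median` is a hypothetical helper name, and other quartile conventions can give slightly different values.

```python
import statistics

def quartiles_excluding_median(data):
    """Q1 and Q3 as the medians of the lower and upper halves,
    leaving the overall median out when the count is odd."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    upper = values[half + 1:] if len(values) % 2 else values[half:]
    return statistics.median(lower), statistics.median(upper)

data = [43, 44, 44, 44, 45, 45, 47, 48, 48]

q1, q3 = quartiles_excluding_median(data)
iqr = q3 - q1
print(q1, q3, iqr)            # 44.0 47.5 3.5

# 1.5 * IQR rule of thumb for outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
print(low_fence, high_fence)  # 38.75 52.75

outliers = [x for x in data if x < low_fence or x > high_fence]
print(outliers)               # [] (nothing falls outside the fences)
```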

Throughout the week there were a few definitions I wrote down that gave me a more thorough understanding of each of the following terms:

  • Y-hat (ŷ)
    • “Consider this to be ‘Y-estimate’ for a given x.” – Sal
    • This is why Y-hat is used for a regression line, since a regression line tries to estimate the best line of fit for all the datapoints.
  • Residuals
    • On a scatterplot, a residual is the distance each datapoint is in the ‘Y’ direction (i.e. vertically speaking) away from the regression line.
  • Leverage Point
    • (Regarding outliers)
    • On a scatter plot, when an outlier is far from most of the other datapoints in the ‘X’ direction, it will ‘pull’ on the regression line and change its slope, which wouldn’t happen to the same degree if the outlier were directly below or above the other datapoints in the ‘Y’ direction.
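
To make these three definitions concrete, here is a minimal Python sketch. It assumes the regression line is the usual least-squares line, and all of the datapoints are made up; `fit_line` and `residuals` are hypothetical helper names. It fits a line, computes the residuals, and then compares an outlier placed directly above the middle of the data with one placed far away in the ‘X’ direction.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y-hat = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residuals(xs, ys, slope, intercept):
    """Vertical distance of each point from the line: actual y minus y-hat."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Outlier directly above the middle of the data (x equals the mean of x):
# the line shifts upward but the slope stays the same.
slope_vertical, intercept_vertical = fit_line(xs + [3], ys + [12])
res_vertical = residuals(xs + [3], ys + [12], slope_vertical, intercept_vertical)
print(slope_vertical, res_vertical)

# Outlier far away in the X direction (a leverage point):
# the slope changes a lot.
slope_leverage, intercept_leverage = fit_line(xs + [10], ys + [2])
print(slope_leverage)
```

With these made-up numbers the point at x = 10 drags the slope from 2 down to roughly flat, while the point at x = 3 leaves the slope at 2 and only shows up as a large residual.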

In the unit Exploring Bivariate Numerical Data I was taught about what’s known as the Standard Deviation of the Residuals, a.k.a. the Root Mean Square Deviation (R.M.S.D.). Simply put, the R.M.S.D. is the average residual size based off of every datapoint on a scatterplot (I’m not sure if I need to add “relative to the regression line” or if it’s implied). Here’s a photo from my notes that goes through an example of this:

To calculate the R.M.S.D. you must add up the squared residuals of every datapoint, divide that sum by two less than the total number of datapoints, and take the square root of that final value. The formula for this is √(Σ(residual)² / (n – 2)). I’m not sure why it’s n – 2 and not just n or even n – 1 like it is when finding a sample S.D. I’m hoping this will get explained to me later on.
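
Here is a small Python sketch of that calculation, dividing by n – 2 as described. The residual values are made up for illustration (they are not the ones from my notes), and `residual_std_dev` is a hypothetical function name.

```python
import math

def residual_std_dev(residuals):
    """Standard deviation of the residuals: sqrt(sum of squared residuals / (n - 2))."""
    n = len(residuals)
    if n <= 2:
        raise ValueError("need more than 2 datapoints to divide by n - 2")
    return math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Hypothetical residuals (actual y minus predicted y) for 6 datapoints.
example_residuals = [2, -2, 2, -2, 0, 0]

rmsd = residual_std_dev(example_residuals)
print(rmsd)  # sum of squares = 16, n - 2 = 4, sqrt(16 / 4) = 2.0
```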

Lastly, in the 6th unit I began but didn’t finish, Study Design, I reviewed the difference between an Experiment and an Observational Study:

  • Types of Studies:
    • Experiment
      • Treatment Group
        • Receives some type of treatment to see whether it produces an effect, i.e. evidence against the null hypothesis (H_0).
      • Control Group
        • Receives a placebo.
      • Results from both groups are measured against each other to draw a conclusion.
    • Observational Study
      • Past Data
        • i.e. Retrospective
        • Compares historical statistics with one another.
      • Current Data
        • Will look at data from the present, i.e. a ‘snapshot’ of data from this exact moment.
        • Ex. A sample survey.
      • Future Data
        • i.e. Prospective
        • Measures data beginning in the present or at a specific future starting point and continues over an extended period of time until a specific end date.

My goal this coming week is to get through another 4 units. There are only 1160 M.P. to get through between the next 4, so I think it’s possible. If I do, there’ll only be 2 units remaining in the course, which will give me a legitimate shot at finishing the course (and statistics with it) by the end of November. At this point, I’m not too concerned whether I manage to do that or not, but it would definitely be nice! Either way, I’m feeling pretty good about my overall understanding of statistics at this point which, considering how intimidated I used to be by statistics, is a wonderful yet fairly surprising feeling.