It was once again a disappointing week in terms of getting through units. The only unit I got through, Exploring Bivariate Numerical Data, ended up being fairly challenging, but it certainly didn’t help that I didn’t have access to the internet for three of the five days. (I was away at a cottage and the internet ran out.) The unit was interesting in that the material varied from relatively simple to quite complicated. I’m hoping that I’ll eventually be shown more videos and given more exercises on the difficult material because, as of now, I really don’t understand a lot of it all that well.
The week started off easy, with simple stuff coming at the beginning of the unit. The first thing I worked through was an introduction to scatterplots. A scatterplot is a type of data display that shows the relationship between two numerical variables and ends up looking like a bunch of chickenpox on a graph. Each member of the dataset gets plotted as a point whose (x, y) coordinates correspond to its values for the two variables (i.e. the two things being measured by the x-axis and y-axis). The following, from my notes, lists different ways in which the data on a scatterplot can be distinguished, classified, and/or described:
- Positive vs Negative
- A ‘positive’ scatterplot will have data points whose y-values tend to increase as their x-values increase.
- A ‘negative’ scatterplot will have data points whose y-values tend to decrease as their x-values increase.
- Strong vs Weak
- A scatterplot with data points that have a ‘strong’ connection will have a correlation coefficient (explained below) close to either 1 or -1.
- The data points will all fall close to the Line-of-Best-Fit (also explained below) if they have a ‘strong’ connection.
- A scatterplot with data points that have a ‘weak’ connection will have a correlation coefficient close to 0.
- The data points will not fall close to the Line-of-Best-Fit if they have a ‘weak’ connection.
- Linear vs Non-Linear
- Data is considered ‘linear’ on a scatterplot if the data points more or less form a straight line in either the positive or negative direction.
- ‘Non-Linear’ data points will often form a parabola of some kind or take no specific shape at all.
- Clusters
- When data points are grouped together, they are considered a ‘cluster’ of data points.
- There aren’t any specific rules for what makes a group of data points a ‘cluster’; rather, it’s just a judgment call.
- Outliers
- When the majority of data points on a scatterplot follow a specific pattern, an ‘outlier’ is a data point that doesn’t fit that pattern.
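To tie the first two distinctions together, here’s a tiny Python sketch of how I’d map a correlation coefficient r (explained below) to these descriptions. The 0.5 cutoff for ‘strong’ vs ‘weak’ is an arbitrary pick of mine, not a rule from the unit:

```python
# Rough classifier from a correlation coefficient r to the descriptions
# above. The 0.5 cutoff for 'strong' vs 'weak' is my own arbitrary pick.
def describe(r):
    direction = "positive" if r >= 0 else "negative"
    strength = "strong" if abs(r) >= 0.5 else "weak"
    return f"{strength} {direction}"

print(describe(0.9))   # strong positive
print(describe(-0.2))  # weak negative
```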
I’ve worked with x-y graphs before and had been taught that the x-axis is commonly referred to as the independent variable and the y-axis as the dependent variable. I learned that when working in statistics, however, the x-axis is referred to as the Explanatory variable and the y-axis as the Response variable. As you can likely conclude, this means the x-axis is used to measure the variable that affects/predicts the outcome and the y-axis is used to measure the response. An example of this would be how temperature (x-axis) would affect ice cream sales (y-axis) during a hot period of time.
Two other important definitions I learned this week were:
- Bivariate
- “For each x-data point there is a corresponding y-data point.” – Sal
- “Involving or depending on two variables.” – Google
- As far as I know, bivariate data is any type of data that can be displayed on an (x, y) graph, meaning there are two variables being measured.
- Residual
- The vertical distance (i.e. along the y-axis) between a data point and the Line-of-Best-Fit, a.k.a. the Least-Squares Regression Line.
- “A residual is the difference between the measured value and the predicted value of a regression model. It is important to understand residuals because they show how accurate a mathematical function, such as a line, is in representing a set of data.” – Shodor.com
- Residual = [actual y-value of a specific data point] – [expected y-value based on the Least-Squares Regression Line].
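To make the residual formula concrete, here’s a small Python sketch. The line ŷ = 2x − 1 and the three data points are made-up examples of mine, not from the unit:

```python
# Residual = actual y-value minus the y-value the line predicts.
# The line y-hat = 2x - 1 and the points below are made-up examples.
def predicted(x):
    return 2 * x - 1

points = [(1, 1), (2, 4), (3, 4)]

for x, y in points:
    print(f"({x}, {y}): residual = {y - predicted(x)}")
# A point above the line gets a positive residual, one below it a negative one.
```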
Next I learned about the Correlation Coefficient, which is denoted with the letter r. The correlation coefficient is an indication of how weak/strong the relationship between the data points being measured on a scatterplot is. Its value always falls within −1 ≤ r ≤ 1. If r is negative, the regression line slopes downwards, and if r is positive, the regression line slopes upwards. When r equals or is close to 0, the data points are very spread out from each other and the relationship is ‘weak’, whereas if r equals or is fairly close to either 1 or −1, the data points are all fairly close to the regression line, a.k.a. the relationship is ‘strong’. The correlation coefficient can be thought of as “how well a line can describe the relationship between x_i (i.e. all of the data points simultaneously) and the regression line” – Sal.
Finding r is a somewhat difficult process and impossible for me to write out here simply because there are a lot of square root symbols which I don’t know how to add to a blog post. I wrote out an example in my notes of how to find r from four data points, (1, 1), (2, 2), (2, 3), and (3, 6), and labelled each step 1-7 below:
1. Find the mean of the x-values.
2. Find the standard deviation of the x-values.
3. Find the mean of the y-values.
4. Find the standard deviation of the y-values.
5. Write out the correlation coefficient formula.
6. Place each term/x- and y-value into the appropriate spot in the formula.
- It’s difficult to understand, but the formula gets you to find the z-scores of the x- and y-values for each set of (x, y) coordinates, multiply them together, and then find the sum of all the products.
- Side note – Data points (2, 2) and (2, 3) are crossed out because their products equal 0.
7. Combine the denominators, add the numerators of the products of the x- and y-z-scores, and multiply that sum by the first part of the equation.
- Side note – I don’t understand why the first part of the equation, (1/(n−1)), is part of the equation or why it works. (Actually, if I’m being honest, I don’t really understand how/why this equation works, in general….)
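The seven steps above can be sketched in Python using the same four data points. This is just my attempt at translating the process, so treat it as a sketch:

```python
import math

points = [(1, 1), (2, 2), (2, 3), (3, 6)]
n = len(points)
xs = [p[0] for p in points]
ys = [p[1] for p in points]

# Steps 1 and 3: the means of the x-values and y-values.
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Steps 2 and 4: the sample standard deviations (note the n - 1 divisor).
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# Steps 5-7: multiply each point's x- and y-z-scores together,
# sum the products, and scale by 1 / (n - 1).
# Points (2, 2) and (2, 3) contribute 0 here because x = x_bar for both.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in points) / (n - 1)

print(round(r, 4))  # -> 0.9449
```

For these four points r comes out to about 0.94, i.e. a strong positive relationship.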
After working through how to find the correlation coefficient, I was then introduced to what’s called the Regression Line (denoted with ŷ, where the “^” is literally called a (y-)“hat”, which I think is hilarious). This is a line drawn on a scatterplot that’s as close to all the data points as possible. The formula for the regression line is:
- ŷ = mx + b
- m = r * (s_y / s_x)
- “Slope (m) equals the correlation coefficient (r) multiplied by (the standard deviation of the y-values (s_y) divided by the standard deviation of the x-values (s_x)).”
- Therefore, ŷ = (r * (s_y / s_x))x + b
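Putting the formula together in Python with the same four data points from the correlation coefficient example. The intercept b = ȳ − m·x̄ isn’t spelled out above, but it follows from the regression line passing through the point (x̄, ȳ):

```python
import math

points = [(1, 1), (2, 2), (2, 3), (3, 6)]
n = len(points)
xs = [p[0] for p in points]
ys = [p[1] for p in points]

x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in points) / (n - 1)

# Slope: m = r * (s_y / s_x). Intercept: the line passes through
# (x_bar, y_bar), which gives b = y_bar - m * x_bar.
m = r * (s_y / s_x)
b = y_bar - m * x_bar

print(f"y-hat = {round(m, 4)}x + {round(b, 4)}")
```

For this dataset the line works out to ŷ = 2.5x − 2.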
I don’t fully understand it at this point, but the regression line is also referred to as the Line-of-Least-Squares. There may be some nuance between the two names that I just don’t understand, however. The reason, as far as I can tell, that it’s called the “Line-of-Least-Squares” is because the line minimizes the sum of the squares of the residuals of every data point by being, overall, as close to all the data points simultaneously as possible. The following is the best photo I could find on Google to visualize this concept:
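One way I convinced myself of the ‘least squares’ idea in Python: for the four data points from earlier, the regression line works out to ŷ = 2.5x − 2, and its sum of squared residuals should beat any other line. The ‘nudged’ slope and intercept here are arbitrary picks of mine:

```python
def sum_squared_residuals(m, b, points):
    # Square each vertical distance from the line and add them up.
    return sum((y - (m * x + b)) ** 2 for x, y in points)

points = [(1, 1), (2, 2), (2, 3), (3, 6)]

best = sum_squared_residuals(2.5, -2, points)      # the least-squares line
nudged = sum_squared_residuals(2.6, -2.1, points)  # an arbitrary nearby line

print(best, nudged)  # the least-squares line has the smaller sum
```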
After learning about what the Regression Line/Least-Squares Regression Line is, I was then shown how to calculate the line, which to me seemed incredibly difficult. It took Sal 9 videos, each ~10 minutes long, to demonstrate how to calculate it. I watched through the entire series but didn’t take any notes. For the most part I was able to follow along and understand what he was doing, but there were a few parts that went over my head. Generally speaking, however, I’m pretty sure the parts I didn’t understand had more to do with me not knowing statistical concepts such as “actual value vs expected value” and nothing to do with me not understanding the algebra he used.
As a final note, when Sal calculated the Least Squares Regression Line he used calculus which was the first time I’ve been exposed to it. I’m not going to lie, it definitely seemed very confusing, intimidating, and daunting. It makes me think that when I get to calculus I’m likely going to have to learn to think in a completely different way, i.e. think outside the box, in order to wrap my head around how it works. In a way, I’m excited to get into these concepts but also nervous that I might not be able to understand them.
There are still 11 units left in this course, Statistics and Probability, which makes it very unlikely I’ll get through it by the end of my 52nd week. The last 4 units, however, are quite small, so I suppose it’s not completely out of the question. As I mentioned at the end of my last post, though, I won’t be upset if it doesn’t happen. This week I’m going to push to get through at least the next two units, Study Design (0/900 M.P.) and Probability (0/1600 M.P.). I also just looked ahead into the next course, High School Statistics, and was pleasantly surprised to see that 1) I’ve already completed 64% of the course and 2) the last two units in the course are called Study Design and Probability! Looks like getting through statistics may not take nearly as much time as I thought. 🙂