Z-12: Correlation and Simple Least Squares Regression
Written by Madelon F. Zady, EdD, Assistant Professor
Learn about r squared, Pearson Products, and other things that will make you want to regress.
Strength of Correlation
|Size of r||Interpretation|
|0.90 to 1.00||Very high correlation|
|0.70 to 0.89||High correlation|
|0.50 to 0.69||Moderate correlation|
|0.30 to 0.49||Low correlation|
|0.00 to 0.29||Little if any correlation|
Here we are talking about the Pearson Product Moment Correlation, or r, that is calculated from the following formula which predicts Y from X or y/x:
The algebraic basis of r is the z-score, and this formula represents something called the Fisher z transformation. You may remember other formulae for calculating the correlation coefficient, such as the infamous "Raw Score Formula." It takes the average student about twenty minutes to calculate a correlation by hand (with a calculator) using the raw score formula, whereas a computer can perform the feat in seconds. (Try the Paired-data Calculator that is part of the method validation toolkit on this website.)
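To make the two routes to r concrete, here is a minimal Python sketch that computes the correlation once from z-scores and once from raw sums. It assumes the z-scores use the population standard deviation (dividing by n), which is one common convention and may differ from the formula intended in the text. The function names and the age and cholesterol values are hypothetical, invented only for illustration.

```python
import math

def pearson_r_zscores(xs, ys):
    """Pearson r as the mean product of paired z-scores (population SD, divide by n)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    z_products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / n

def pearson_r_raw(xs, ys):
    """Pearson r from raw sums only (the 'raw score' form)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Hypothetical ages (years) and cholesterol values (mg/dL), invented for illustration.
ages = [25, 35, 45, 55, 65]
cholesterol = [180, 195, 200, 220, 230]

r_z = pearson_r_zscores(ages, cholesterol)
r_raw = pearson_r_raw(ages, cholesterol)
print(f"r (z-score form)   = {r_z:.4f}")
print(f"r (raw score form) = {r_raw:.4f}")
```

Both forms are algebraically equivalent, so they should agree to within floating-point rounding; for these made-up values r comes out at about 0.988.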
When we talk about correlation this way, we are saying that Y is dependent upon X, which may suggest or imply that X causes Y. We need to be careful when we talk about causality! A correlation between two variables does not necessarily mean that the relationship is causal, i.e., that X causes Y to happen. There may be one or more variables intervening between X and Y, such as an unexamined variable Z. In this example, we really cannot say that age causes cholesterol to increase. As we all know, there are many intervening variables that are important: genetics, exercise, and diet, to name a few. Temporal sequence is also important when looking at causality, as X would always have to precede Y's occurrence in order to be causal. In our cholesterol example, this is not a problem. But if we wanted to say that exercise causes cholesterol to decrease, then we had better be sure that we first measure patients' cholesterol levels, put them on an exercise program, and then measure their cholesterol levels again after their participation in the program.
What if we wanted to use r to give us some idea about how closely the results of two methods compare? Presumably a high correlation between the results of two glucose methods would mean that the methods are comparable. Those who are knowledgeable about the use of statistics in method comparison studies will tell us that the correlation coefficient is not a fool-proof statistic and must be interpreted carefully. Any set of data where the points all fall on a straight line will give a high correlation coefficient. If one glucose method were consistently 50 mg/dL higher than another method, the results would still fall on a straight line and the correlation coefficient would be high, even though there is a serious systematic error or inaccuracy between the methods.
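The glucose scenario can be checked numerically. The sketch below uses hypothetical glucose results (invented for illustration) in which Method Y reads exactly 50 mg/dL higher than Method X, and shows that r stays essentially perfect while the average difference exposes the systematic error. The helper name `pearson_r` is an assumption of this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from sums of deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical glucose results in mg/dL; Method Y is always 50 mg/dL higher.
method_x = [60, 90, 120, 150, 200, 300]
method_y = [x + 50 for x in method_x]

r = pearson_r(method_x, method_y)
mean_bias = sum(y - x for x, y in zip(method_x, method_y)) / len(method_x)

print(f"r = {r:.4f}")                      # correlation looks essentially perfect
print(f"mean bias = {mean_bias} mg/dL")    # yet the methods disagree by 50 mg/dL
```

The correlation coefficient measures only how tightly the points follow a straight line, not whether that line is the line of agreement, which is why a separate look at the differences is needed.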
The square of the correlation coefficient or r² is called the coefficient of determination. We will examine this r² later in regression analysis.
If we really want a statistical test that is strong enough to attempt to predict one variable from another or to examine the relationship between two test procedures, we should use simple linear regression. Regression is more protected from the problems of indiscriminate assignment of causality because the procedure gives more information and demonstrates strength. In fact, the r that we have been talking about above is only one part of regression statistics.
Let's see how this prediction works in regression. Let us say that we have a data set for two variables X and Y. We have calculated the mean for each of these variables. Now let's say that we put all of the Y values into a container and draw one value out of that container at random. Before we look at that Y value, we first are going to guess what the number is. What value should we guess? What value would be most likely? The best guess for the value of Y would be the mean value for the Y data - the arithmetic average is always a good guess. But statisticians have worked out a better method. Another variable (X) can be used to approximate the Y. If Y is dependent upon this X, then the Y estimated this way will be closer to the true Y value than just guessing Y's mean.
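The "guess the mean" versus "use X" comparison can be sketched in a few lines of Python. The data below are hypothetical ages and cholesterol values invented for illustration, and the straight line is fitted with the usual least squares formulas; the variable names are assumptions of this example. The sketch totals the squared guessing errors for each strategy.

```python
# Hypothetical ages (years) and cholesterol values (mg/dL), invented for illustration.
ages = [25, 35, 45, 55, 65]
cholesterol = [180, 195, 200, 220, 230]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(cholesterol) / n

# Straight line fitted by least squares: predicted y = intercept + slope * x
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, cholesterol))
         / sum((x - mean_x) ** 2 for x in ages))
intercept = mean_y - slope * mean_x

# Strategy 1: always guess the mean of Y.
sse_mean = sum((y - mean_y) ** 2 for y in cholesterol)

# Strategy 2: guess Y from the paired X value using the fitted line.
sse_line = sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(ages, cholesterol))

print(f"squared error, guessing the mean: {sse_mean:.1f}")
print(f"squared error, predicting from X: {sse_line:.1f}")
```

For these made-up values the total squared error drops from 1600.0 to 37.5 when X is used, which is the sense in which the estimate from X is closer to the true Y than the mean alone.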
In its simplest form, regression is essentially the formula for a straight line that you learned in beginning algebra, and the prediction of Y from X depends on that formula. The first time you saw this formula it appeared as follows:
y = mx + b
Typically in algebra, the student is asked to set up a table of x and y values, graph the points, and draw the best straight line through those points. For example, if y = 2x + 1 the table and graph would appear as shown in the accompanying figure. When x is equal to zero, y is equal to 1 because the mx term falls out [2 times 0 is 0]. In this same expression, the b-term is 1. So when x is equal to zero, y is equal to 1 [y = (2*0) + 1]. If we look at this zero point on the x-axis, the line cuts the y-axis at the number 1. We call this number the y-intercept (or constant). Now if x is any number other than zero, the m-term or coefficient becomes important. Here m is the number 2. If we make x equal to 1, then y is 2 plus the 1 for the constant, or 3. Essentially what this m is telling us is that when x increases by one unit, y increases by 2 units. There is not 1:1 correspondence. Y is increasing twice as fast as X. And this causes the line to slant upward, so the coefficient m is called the slope of the line.
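The table of values for y = 2x + 1 can be generated with a short Python sketch. The function name `line` and the choice of x values 0 through 3 are assumptions made for illustration.

```python
def line(x, m=2, b=1):
    """Straight line y = m*x + b, with slope m and y-intercept b."""
    return m * x + b

xs = [0, 1, 2, 3]
ys = [line(x) for x in xs]

print(" x | y")
for x, y in zip(xs, ys):
    print(f" {x} | {y}")

# Change in y for each one-unit step in x; every step equals the slope m.
steps = [y_next - y_prev for y_prev, y_next in zip(ys, ys[1:])]
print("change in y per unit x:", steps)
```

The first row (x = 0) gives the y-intercept of 1, and every one-unit step in x adds the slope of 2 to y.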
In regression, the equation for the straight line is recast as y = bx + a. This change in terminology leads to confusion. Here a is the y-intercept or constant and b is the coefficient or slope of the line. A few more words of caution about regression: as in all of statistics, there are certain assumptions. The x value is a true measure, both the X and Y distributions are normal, and the data show homoscedasticity, i.e., the variance of y is the same for each value of x. Statisticians also often write the formula this way: y = bx + a + e, where e represents the error in prediction.
The objective in simple regression is to generate the best line between the two variables (the tabled values of X and Y), i.e., the line that best fits the data points. Regression uses one formula to calculate the slope, then another formula to calculate the y-intercept, assuming there is a straight-line relationship. The best line, or fitted line, is the one that minimizes the distances of the points from the line, as shown in the accompanying figure. Since some of the distances are positive and some are negative, the distances are squared so that they can be added without canceling each other, and the best line is the one that gives the lowest sum of squares, or least squares. For that reason, the regression technique will sometimes be called least squares analysis.
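As a rough illustration of the two formulas and of the "least squares" property, the sketch below fits y = bx + a using the standard textbook expressions, b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄, with distances measured vertically (in the y direction). The data are hypothetical values scattered around y = 2x, and the function names are assumptions of this example, not part of any particular software package.

```python
def least_squares_fit(xs, ys):
    """Return (b, a) for the line y = b*x + a that minimizes the sum of squared
    vertical distances between the points and the line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))   # slope
    a = mean_y - b * mean_x                       # y-intercept
    return b, a

def sum_of_squares(xs, ys, b, a):
    """Sum of squared vertical distances from the points to the line y = b*x + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data scattered around y = 2x, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b, a = least_squares_fit(xs, ys)
best = sum_of_squares(xs, ys, b, a)
print(f"slope b = {b:.3f}, intercept a = {a:.3f}, sum of squares = {best:.4f}")

# Nudging the slope or the intercept away from the fitted values makes the sum larger.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]
all_worse = all(sum_of_squares(xs, ys, b + db, a + da) > best for db, da in nudges)
print("every nudged line has a larger sum of squares:", all_worse)
```

The nudging step is only a spot check on four alternative lines, not a proof, but it shows what "lowest sum of squares" means in practice.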
The fitted regression line can tell us the actual ratio for the correspondence between x and y. In the case of cholesterol vs age, we don't expect a one-to-one correspondence. However, in a comparison of cholesterol results between two different analytical methods, we would want a one-to-one correspondence, i.e., we want Method Y to give about the same results as Method X. The line of 1:1 correspondence should have a particular tilt to it, i.e., the line should make a 45-degree angle with the x-axis. And the formula for this line would have a slope coefficient of 1 and a y-intercept or constant term equal to zero. If our original formula, y = 2x + 1, were plotted, we would see that y increases twice as fast as x. There is not 1:1 correspondence, and the angle of this line is different from 45 degrees. So, by merely inspecting the line generated by least squares regression, we can make some conclusions. Lessons 13 and 14 will give us more information on the usefulness of regression.
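The link between slope and the tilt of the line can be sketched with the arctangent. This small example assumes both axes are drawn on the same scale, since the 45-degree picture only holds in that case; the function name is hypothetical.

```python
import math

def line_angle_degrees(slope):
    """Angle between the line and the x-axis, assuming equally scaled axes."""
    return math.degrees(math.atan(slope))

# Slope 1 is the line of 1:1 correspondence; slope 2 is the earlier y = 2x + 1.
for slope in (1, 2):
    print(f"slope {slope}: angle = {line_angle_degrees(slope):.1f} degrees")
```

A slope of 1 corresponds to 45 degrees, while the slope of 2 from y = 2x + 1 gives a steeper angle of roughly 63 degrees. The y-intercept does not affect the angle; it only shifts the line up or down.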
- Westgard, J. O. Basic Method Validation. Madison, WI: Westgard QC, Inc., 1999.