Z12: Correlation and Simple Least Squares Regression 
Written by Madelon F. Zady, EdD, Assistant Professor
Learn about r squared, Pearson's product moment, and other things that will make you want to regress.

Strength of Correlation 

Size of r        Interpretation
0.90 to 1.00     Very high correlation
0.70 to 0.89     High correlation
0.50 to 0.69     Moderate correlation
0.30 to 0.49     Low correlation
0.00 to 0.29     Little if any correlation
Here we are talking about the Pearson Product Moment Correlation, or r, that is calculated from the following formula which predicts Y from X or y/x:
The algebraic basis of r is the z-score, and this formula represents something called the Fisher z transformation. You may remember other formulae for the calculation of the correlation coefficient; one example is the infamous "Raw Score Formula." It takes the average student about twenty minutes to calculate a correlation by hand (with a calculator) using the raw score formula. However, the computer can perform the feat in seconds. (Try the Paired Data Calculator that is part of the method validation toolkit on this website.)
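The formula referred to above is not reproduced in this text. As a hedged illustration only, one common z-score form of Pearson's r is the sum of the products of paired z-scores divided by N - 1 (when the sample standard deviation is used); the author's notation may differ. The age and cholesterol values below are invented for the example, and the function name is hypothetical.

```python
import math

def pearson_r_from_z_scores(x, y):
    """Pearson's r as the sum of paired z-score products over (n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divide by n - 1).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Convert each observation to a z-score.
    z_x = [(v - mean_x) / sd_x for v in x]
    z_y = [(v - mean_y) / sd_y for v in y]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

# Made-up data: age in years and cholesterol in mg/dL.
ages = [25, 35, 45, 55, 65]
cholesterol = [180, 195, 205, 225, 230]

r = pearson_r_from_z_scores(ages, cholesterol)
print(f"r = {r:.3f}")
```

With these invented values r falls in the "very high correlation" band of the table above, which says nothing about whether age causes the change.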
When we talk about correlation this way, we are saying that Y is dependent upon X, which may suggest or imply that X causes Y. We need to be careful when we talk about causality! A correlation between two variables does not always mean a causal relationship, i.e., X causes Y to happen. There may be one or more variables intervening between X and Y, such as an unexamined variable like Z. In this example, we really cannot say that age causes cholesterol to increase. As we all know, there are many intervening variables that are important: genetics, exercise and diet to name a few. And, temporal sequence is also important when looking at causality, as X would always have to precede Y's occurrence in order to be causal. In our cholesterol example, this is not a problem. But if we wanted to say that exercise causes cholesterol to decrease, then we had better be sure that we first evaluate patients' cholesterol levels, put them on an exercise program and then measure their cholesterol levels after their participation in the program.
What if we wanted to use r to give us some idea about how closely the results of two methods compare? Presumably a high correlation between the results of two glucose methods would mean that the methods are comparable. Those who are knowledgeable about the use of statistics in method comparison studies will tell us that the correlation coefficient is not a foolproof statistic and must be interpreted carefully. Any set of data where the points all fall on a line will give a high correlation coefficient. If a glucose method were consistently 50 mg/dL higher than another method, the results would fall on a line and the correlation coefficient would be high, even though there is a serious systematic error or inaccuracy between the methods.
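The glucose scenario can be checked numerically. The sketch below uses invented glucose values and a hypothetical helper, pearson_r, to show that a constant 50 mg/dL bias leaves the correlation essentially perfect.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from sums of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up glucose results (mg/dL) from a comparison method.
method_x = [60, 90, 120, 180, 250, 320]
# A second method that reads exactly 50 mg/dL high on every sample.
method_y = [value + 50 for value in method_x]

r = pearson_r(method_x, method_y)
mean_bias = sum(b - a for a, b in zip(method_x, method_y)) / len(method_x)

print(f"r = {r:.4f}, mean bias = {mean_bias:.1f} mg/dL")
```

The correlation is as high as it can be while every result is off by 50 mg/dL, so r alone cannot reveal the systematic error.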
Coefficient of Determination
The square of the correlation coefficient or r² is called the coefficient of determination. We will examine this r² later in regression analysis.
Simple Linear Regression or Ordinary Least Squares Prediction
If we really want a statistical test that is strong enough to attempt to predict one variable from another or to examine the relationship between two test procedures, we should use simple linear regression. Regression is more protected from the problems of indiscriminate assignment of causality because the procedure gives more information and demonstrates strength. In fact, the r that we have been talking about above is only one part of regression statistics.
Let's see how this prediction works in regression. Let us say that we have a data set for two variables X and Y. We have calculated the mean for each of these variables. Now let's say that we put all of the Y values into a container and draw one value out of that container at random. Before we look at that Y value, we are first going to guess what the number is. What value should we guess? What value would be most likely? The best guess for the value of Y would be the mean value for the Y data; the arithmetic average is always a good guess. But statisticians have worked out a better method. Another variable (X) can be used to approximate the Y. If Y is dependent upon this X, then the Y estimated this way will be closer to the true Y value than just guessing Y's mean.
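A small numerical sketch of this idea follows. It compares the total squared error from always guessing the mean of Y with the error from predicting Y from X along a least-squares line (the fitting technique described later in this lesson). The data are invented for illustration.

```python
# Made-up data: age in years (X) and cholesterol in mg/dL (Y).
x = [25, 35, 45, 55, 65]
y = [180, 195, 205, 225, 230]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Strategy 1: always guess the mean of Y.
sse_mean = sum((yi - mean_y) ** 2 for yi in y)

# Strategy 2: predict Y from X with a least-squares line.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse_line = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

print(f"squared error guessing the mean: {sse_mean:.1f}")
print(f"squared error predicting from X: {sse_line:.1f}")
```

For these values the prediction from X leaves far less squared error than the mean guess, which is what "closer to the true Y value" means in practice.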
Mathematical Formula for a Straight Line
In its simplest form, regression relies on the formula for a straight line that you learned in beginning algebra, and the prediction of Y from X depends on that formula. The first time you saw this formula, it appeared as follows:
y = mx + b
Typically in algebra, the student is asked to set up a table of x and y values, graph the points, and draw the best straight line through those points. For example, if y = 2x + 1, the table and graph would appear as shown in the accompanying figure. When x is equal to zero, the mx term drops out [2 times 0 is 0], leaving only the b-term, which is 1, so y is equal to 1 [y = (2*0) + 1]. At this zero point on the x-axis, the line cuts the y-axis at the number 1. We call this number the y-intercept (or constant). Now if x is any number greater than zero, the m-term or coefficient becomes important. Here m is the number 2. If we make x equal to 1, then y is 2 plus the 1 for the constant, or 3. Essentially what this m is telling us is that when x increases by one unit, y increases by two units. There is not 1:1 correspondence; y is increasing twice as fast as x. This causes the line to slant upward, so the coefficient m is called the slope of the line.
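The table of values for y = 2x + 1 can be generated with a few lines of code. This is only an illustrative sketch; the function name line and the chosen x values are assumptions, not part of the original lesson.

```python
def line(x, m=2, b=1):
    """Evaluate y = m*x + b for the example line y = 2x + 1."""
    return m * x + b

xs = [0, 1, 2, 3, 4]
table = [(x, line(x)) for x in xs]

print(" x   y")
for x, y in table:
    print(f"{x:>2}  {y:>2}")

# The y-intercept is the value of y when x is zero.
y_intercept = line(0)
# The slope is the change in y for a one-unit increase in x.
slope = line(1) - line(0)
print(f"y-intercept = {y_intercept}, slope = {slope}")
```

Each one-unit step in x raises y by 2, and the row for x = 0 gives the y-intercept of 1.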
In regression, the equation for the straight line is recast as y = bx + a. This change in terminology leads to confusion, because b, which was the intercept in the algebra formula, now denotes the slope. Here a is the y-intercept or constant and b is the coefficient or slope of the line. A few more words of caution about regression: as in all of statistics, there are certain assumptions. The x value is a true measure, both the X and Y distributions are normal, and there is homoscedasticity, i.e., the variance of y is the same for each value of x. Also, statisticians often write the formula this way: y = bx + a + e, where e represents the error in prediction.
Interpreting the Scattergram
The objective in simple regression is to generate the best line between the two variables (the tabled values of X and Y), i.e., the line that best fits the data points. Regression uses one formula to calculate the slope, then another formula to calculate the y-intercept, assuming there is a straight-line relationship. The best line, or fitted line, is the one that minimizes the vertical distances of the points from the line, as shown in the accompanying figure. Since some of the distances are positive and some are negative, the distances are squared to make them additive, and the best line is the one that gives the lowest sum of squares, or least squares. For that reason, the regression technique is sometimes called least squares analysis.
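The lesson does not show the two formulas at this point. As a hedged sketch, the standard least-squares expressions are slope b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and intercept a = ȳ - b·x̄. The code below applies them to invented data and checks that nudging the fitted line in any direction increases the sum of squared distances. The function names are hypothetical.

```python
def least_squares_fit(x, y):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_of_squares(x, y, slope, intercept):
    """Sum of squared vertical distances of the points from the line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Made-up points scattered around y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [2.9, 5.2, 6.8, 9.1, 11.0]

b, a = least_squares_fit(x, y)
best = sum_of_squares(x, y, b, a)

# Shifting the slope or the intercept slightly makes the fit worse.
perturbed = [
    sum_of_squares(x, y, b + 0.1, a),
    sum_of_squares(x, y, b - 0.1, a),
    sum_of_squares(x, y, b, a + 0.1),
    sum_of_squares(x, y, b, a - 0.1),
]

print(f"slope b = {b:.2f}, intercept a = {a:.2f}")
print(f"sum of squares at the fitted line = {best:.4f}")
print("all perturbed lines are worse:", all(p > best for p in perturbed))
```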
The fitted regression line can tell us the actual ratio for the correspondence between x and y. In the case of cholesterol vs age, we don't expect a one-to-one correspondence. However, in a comparison of cholesterol results between two different analytical methods, we would want a one-to-one correspondence, i.e., we want Method Y to give about the same results as Method X. The line of 1:1 correspondence should have a particular tilt to it, i.e., when both axes are drawn to the same scale, the line should make a 45-degree angle with the x-axis. The formula for this line would have a slope coefficient of 1 and a y-intercept or constant term equal to zero. If our original formula, y = 2x + 1, were plotted, we would see that y increases twice as fast as x. There is not 1:1 correspondence, and the angle of this line is different from 45 degrees. So, by merely inspecting the line generated by least squares regression, we can draw some conclusions. Lessons 13 and 14 will give us more information on the usefulness of regression.
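The inspection described above can be sketched in code. The example below converts a slope to the angle the line makes with the x-axis (assuming equal axis scales) and checks both the slope and the intercept against the ideal 1:1 line. The three lines and the function names are illustrative assumptions; the third line represents a method with a constant 50-unit bias.

```python
import math

def line_angle_degrees(slope):
    """Angle between the line and the x-axis, assuming equal axis scales."""
    return math.degrees(math.atan(slope))

def is_one_to_one(slope, intercept, tolerance=1e-9):
    """True only when slope is 1 and intercept is 0 (within tolerance)."""
    return abs(slope - 1.0) <= tolerance and abs(intercept) <= tolerance

# (label, slope, intercept) for three illustrative lines.
lines = [
    ("ideal 1:1 line", 1.0, 0.0),
    ("y = 2x + 1", 2.0, 1.0),
    ("constant bias, y = x + 50", 1.0, 50.0),
]

results = {}
for label, slope, intercept in lines:
    angle = line_angle_degrees(slope)
    results[label] = (angle, is_one_to_one(slope, intercept))
    print(f"{label}: angle = {angle:.2f} degrees, 1:1 = {results[label][1]}")
```

The third line has the 45-degree tilt but a nonzero intercept, so the slope and the intercept both have to be inspected before concluding that two methods agree.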
References
 Westgard, J. O. Basic Method Validation. Madison, WI: Westgard QC, Inc., 1999.