In many data analyses in social science, it is desirable to compute a coefficient of association. Coefficients of association are quantitative measures of the degree of relationship between two variables. Ultimately, the results of most techniques can be re-expressed as a coefficient of association, that is, as the degree of relationship between the variables in the analysis. For instance, with a t test, the correlation between group membership and score can be computed from the t value. There are many types of coefficients of association. They express the mathematical association in different ways, usually based on assumptions about the data. The most common coefficient of association you will encounter is the Pearson product-moment correlation coefficient (symbolized as the italicized r), and it is the only coefficient of association that can safely be referred to as simply the "correlation coefficient". It is common enough that, if no other information is provided, it is reasonable to assume it is what is meant.
Let's return to our data on IQ and achievement in the previous assignment, only this time, disregard the class groups. Just assume we have IQ and achievement scores on thirty people. IQ has been shown to be a predictor of achievement; that is, IQ and achievement are correlated. Another way of stating the relationship is to say that high IQ scores are matched with high achievement scores and low IQ scores are matched with low achievement scores. Given that a person has a high IQ, I would reasonably expect high achievement. Given a low IQ, I would expect low achievement. (Please bear in mind that these variables are chosen for demonstration purposes only, and I do not want to get into discussions of whether the relationship between IQ and achievement is useful or meaningful. That is a matter for another class.)
So, the Pearson product-moment correlation coefficient is simply a way of stating such a relationship and the degree or "strength" of that relationship. The coefficient ranges in value from -1 to +1. A value of 0 represents no linear relationship, and values of -1 and +1 indicate a perfect linear relationship. If each dot represents a single person, with that person's IQ plotted on the X axis and their achievement score plotted on the Y axis, we can make a scatterplot of the values, which allows us to visualize the degree of relationship or correlation between the two variables. The graphic below gives an approximation of how variables X and Y are related at various values of r:
The r value for a set of paired scores can be calculated as follows:
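As a rough sketch of one common computational form, r can be written as the sum of cross-products (SPxy) divided by the square root of the product of the two sums of squares (SSx and SSy). This matches the SPxy and SSx notation used later in this lesson, but it is an assumption that this is the same form as the formula intended here. The function name and the scores are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r computed as SPxy / sqrt(SSx * SSy)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

# Hypothetical scores, invented for illustration only
iq = [85, 90, 95, 100, 105, 110, 115]
achievement = [80, 88, 90, 97, 99, 108, 110]
r = pearson_r(iq, achievement)
print(round(r, 3))
```

Note that the sketch does not guard against a variable with no variance, in which case the denominator would be zero and r is undefined.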
There is another method of calculating r which helps in understanding what the measure actually is. Review the ideas in the earlier lessons of what a z score is. Any set of scores can be transformed into an equivalent set of z scores. The variable will then have a mean of 0 and a standard deviation of 1. The z scores above the mean are positive, and z scores below the mean are negative.
The r value for the correlation between the scores is then simply the sum of the products of the z scores for each pair, divided by the total number of pairs minus 1.
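The z-score method just described can be sketched as follows. This is a minimal illustration, assuming the z scores are computed with the sample standard deviation (n - 1 in the denominator); the function names are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Convert raw scores to z scores using the sample SD (n - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

def pearson_r_from_z(xs, ys):
    """Sum of the paired z-score products divided by (number of pairs - 1)."""
    zx = z_scores(xs)
    zy = z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical paired scores, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(round(pearson_r_from_z(x, y), 2))
```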
This method of computation helps to show why the r value signifies what it does. Consider several cases of pairs of scores on X and Y. Now, when thinking of how the numerator of the sum above is computed, consider only the signs of the scores and the signs of their products. If a person's score on X is substantially below the mean, then their z score is large and negative. If they are also below the mean on Y, their z score for Y is also large and negative. The product of these two z scores is then large and positive. The product is also large and positive if a person scores substantially above the mean on both X and Y. So, the more the z scores on X and Y are alike, the more positive the product sum in the equation becomes. Note that the more consistently people score in opposite directions on the two measures (negative z scores on X and positive z scores on Y, or the reverse), the more negative the product sum becomes. This system sometimes helps to give insight into how the correlation coefficient works. The r value is then an average of the products between z scores (dividing by n - 1 instead of n, to match the n - 1 used in the sample standard deviations behind the z scores). When the signs of the z scores are random throughout the group, there is roughly equal probability of having a positive z-score product or a negative z-score product. You should be able to see how this would tend to lead to a sum close to zero.
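To make the sign argument concrete, the sketch below lists the individual z-score products for two small made-up data sets, one where the scores on X and Y tend to be alike and one where they tend to be opposite. The data and the function name are hypothetical.

```python
from statistics import mean, stdev

def z_products(xs, ys):
    """Return the product zx * zy for each pair of scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]

x = [1, 2, 3, 4, 5]
alike = [2, 1, 4, 3, 5]      # high X tends to go with high Y
opposite = [5, 3, 4, 1, 2]   # high X tends to go with low Y

for label, y in (("alike", alike), ("opposite", opposite)):
    products = z_products(x, y)
    r = sum(products) / (len(x) - 1)
    print(label, [round(p, 2) for p in products], "r =", round(r, 2))
```

In the first data set most products are positive or zero, so the sum is positive; in the second they are negative or zero, so the sum is negative.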
Interpretation of r
One interpretation of r is that the square of the value represents the proportion of variance in one variable which is accounted for by the other variable. The square of the correlation coefficient is called the coefficient of determination. It is easy for most people to interpret quantities when they are on a linear scale, but the proportion of variance grows with the square of r rather than in a straight line, and this should be kept in mind when interpreting correlation coefficients in terms of "large", "small", etc. Note the graph below, which shows the proportion of variance accounted for at different levels of r. Note that not even half of the variance is accounted for until r reaches .71, and that values below .30 account for less than 10% of the variance. Note also how rapidly the proportion of variance accounted for increases between .80 and .90, as compared to between .30 and .40. Note that r = .50 accounts for only 25% of the variance. Be careful not to interpret r in a linear way, as if it were a percentage or proportion. It is the square which has that quality. That is, don't fall into the trap of thinking of r = .60 as "better than half", because it clearly is not (it is 36%).
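The figures quoted above can be checked with a few lines of code. This is only a sketch that squares a handful of r values; the function name is made up for illustration.

```python
def variance_explained(r):
    """Coefficient of determination: the square of r."""
    return r ** 2

for r in (0.10, 0.30, 0.40, 0.50, 0.60, 0.71, 0.80, 0.90, 1.00):
    print(f"r = {r:.2f}  ->  proportion of variance = {variance_explained(r):.2f}")

# The same .10 step in r adds much more variance at the high end
gain_low = variance_explained(0.40) - variance_explained(0.30)
gain_high = variance_explained(0.90) - variance_explained(0.80)
print(f"gain from .30 to .40: {gain_low:.2f}")
print(f"gain from .80 to .90: {gain_high:.2f}")
```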
There are some obvious caveats in correlation and regression. One has been pointed out by Teri in the last lesson. In order for r to have the various properties needed for its use in other statistical techniques, and in fact, to be interpreted in terms of proportions of variance accounted for, it is assumed that the relationship between the variables is linear. If the relationship between the variables is curvilinear as shown in the figure below, r will be an incorrect estimate of the relationship.
Notice that although the curvilinear relationship is actually stronger than the linear one, the r value is likely to be lower for the curvilinear case because the linearity assumption is not met. This problem can be addressed with something called nonlinear regression, which is a topic for advanced statistics. However, in many cases one can transform the Y variable (such as with log or square functions) to make the relationship linear, and then an ordinary linear regression can be run on the transformed scores. This is the basic idea behind the simplest ways of handling a curvilinear relationship.
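As a small sketch of the transformation idea, the made-up data below follow a perfect exponential curve. The r computed on the raw scores understates the relationship, while r computed after a log transformation of Y is essentially 1. The data, the 0.5 growth rate, and the function name are all assumptions chosen for illustration.

```python
from math import exp, log, sqrt

def pearson_r(xs, ys):
    """Pearson r computed as SPxy / sqrt(SSx * SSy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

x = list(range(1, 21))
y = [exp(0.5 * v) for v in x]          # hypothetical: Y is exactly exponential in X

r_raw = pearson_r(x, y)                          # linear r on the curved data
r_log = pearson_r(x, [log(v) for v in y])        # r after log-transforming Y

print("r on raw scores:   ", round(r_raw, 3))
print("r after log of Y:  ", round(r_log, 3))
```

Which transformation is appropriate depends on the shape of the curve; the log works here only because the data were built to be exponential.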
Another assumption is called homoscedasticity (HOMO-SEE-DAS-STI-CITY or HOMO-SKEE-DAS-STI-CITY). This is the assumption that the variance of one variable is the same across all levels of the other. The figure below shows a violation of the homoscedasticity assumption. These data are heteroscedastic (HETERO-SKEE-DASTIC). Note that Y is much better predicted at lower levels of X than at higher levels of X:
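One rough way to see heteroscedasticity in numbers is to compare the scatter of Y around the trend in a low band of X with the scatter in a high band. The sketch below uses made-up data in which the scatter grows with X, and it assumes the trend line (Y = X) is already known. It is an informal check for illustration, not a formal test.

```python
from statistics import pstdev

x = list(range(1, 21))
# Hypothetical data: Y follows X, but the scatter around the trend grows with X
y = [v + (0.3 * v if i % 2 == 0 else -0.3 * v) for i, v in enumerate(x)]

def residual_spread(xs, ys, lo, hi):
    """SD of (Y - X) for cases with lo <= X <= hi; the trend Y = X is assumed."""
    resid = [b - a for a, b in zip(xs, ys) if lo <= a <= hi]
    return pstdev(resid)

low = residual_spread(x, y, 1, 10)
high = residual_spread(x, y, 11, 20)
print("spread of Y around the trend, X from 1 to 10: ", round(low, 2))
print("spread of Y around the trend, X from 11 to 20:", round(high, 2))
```

With homoscedastic data the two spreads would be about the same; here the spread in the upper band is clearly larger.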
A related assumption is one of bivariate normality. This assumption is sometimes difficult to understand (and it crops up in even more complicated forms in multivariate statistics), and difficult to test or demonstrate. Essentially, bivariate normality means that for every possible value of one variable, the values of the other variable are normally distributed. You may be able to visualize this by looking at the figure below with thousands of observations (this problem is complicated enough to approach the limits of my artistic ability). Think of the normal curves as being frequency or density at their corresponding values of X or Y. That is, visualize them as perpendicular to the page.
Regression and correlation are very sensitive to these assumptions. The values for this type of analysis should not be overinterpreted. That is, quantitative predictions should be tempered by the validity of these assumptions.
It should be intuitive from the explanation of the correlation coefficient that a significant correlation allows some degree of prediction of Y if we know X. In fact, when we are dealing with z scores, the math for this prediction equation is very simple. The predicted z score for Y (z'y) is:
z'y = r(zx)
When the r value is used in this way, it is called a standardized regression coefficient, and the symbol used to represent it is often a lowercase Greek beta (β), so the standardized regression equation for the regression of Y on X is written as:
z'y = β(zx)
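The standardized prediction described above amounts to multiplying the z score on X by r. A minimal sketch, with a hypothetical function name and made-up values:

```python
def predict_z_y(z_x, r):
    """Predicted z score on Y, given a z score on X and the correlation r."""
    return r * z_x

# Hypothetical example: r = .60, and a person 2 SDs above the mean on X
print(predict_z_y(2.0, 0.60))
# A person exactly at the mean of X is predicted to be at the mean of Y
print(predict_z_y(0.0, 0.60))
```

Whenever r is between -1 and +1, the predicted z score on Y is closer to zero than the z score on X that it is based on.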
When we are not working with z scores, but we are attempting to predict Y raw scores from X raw scores, the equation requires a quantity called the unstandardized regression coefficient. This is usually symbolized as B1, and it allows for the following prediction equation for raw scores:
Y' = B0 + B1(X)
The unstandardized regression coefficient (B1) can be computed from the r value and the standard deviations of the two sets of scores (Equation a). B0 is the intercept of the regression line, and it can be computed by subtracting the product of B1 and the mean of the X scores from the mean of the Y scores (Equation b). Here SDx and SDy are the standard deviations, and MX and MY the means, of the X and Y scores.
(a) B1 = r(SDy / SDx)
(b) B0 = MY - B1 MX
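The two computations just described can be sketched as a small function. The function name, argument names, and the summary values are hypothetical and chosen only for illustration.

```python
def regression_coefficients(r, sd_x, sd_y, mean_x, mean_y):
    """Return (b0, b1) for predicting Y raw scores from X raw scores."""
    b1 = r * sd_y / sd_x          # slope: r times the ratio of the SDs
    b0 = mean_y - b1 * mean_x     # intercept: mean of Y minus b1 times mean of X
    return b0, b1

# Hypothetical summary statistics, invented for illustration only
b0, b1 = regression_coefficients(r=0.5, sd_x=15.0, sd_y=10.0,
                                 mean_x=100.0, mean_y=50.0)
print("B1 =", round(b1, 3))
print("B0 =", round(b0, 3))
# The regression line always passes through the point (mean of X, mean of Y)
print("prediction at the mean of X:", round(b0 + b1 * 100.0, 3))
```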
Now, suppose we are attempting to predict Y (achievement) from X (IQ). Assume we have IQ and Achievement scores for a group of 10 people. Suppose I want to develop a regression equation to make the best prediction of a person's Achievement if I am given their IQ score. I would proceed as follows:
First compute r.
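Now, it is a simple matter to compute B1. Because r = SPxy / sqrt(SSx SSy), and each standard deviation is the square root of its SS divided by n - 1, multiplying r by the ratio of the standard deviations reduces to SPxy / SSx.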
B1 = SPxy / SSx = 420 / 512.5 = 0.82
Now compute B0.
B0 = MY - B1 MX = 94.8 - 0.82(99.5) = 13.2
The regression equation for predicting Achievement from IQ is then
Y' = B0 + B1(X)
ACHIEVEMENT SCORE = 13.2 + 0.82 (IQ)
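The arithmetic in this example can be reproduced with the summary values given above (SPxy = 420, SSx = 512.5, MX = 99.5, MY = 94.8). The sketch rounds B1 to two decimals before computing B0, as the worked example does; the variable and function names, and the IQ of 100 used at the end, are made up for illustration.

```python
# Summary values taken from the worked example in the text
sp_xy = 420.0     # sum of cross-products
ss_x = 512.5      # sum of squares for X (IQ)
mean_x = 99.5     # mean IQ
mean_y = 94.8     # mean achievement

b1 = round(sp_xy / ss_x, 2)              # slope, rounded as in the text
b0 = round(mean_y - b1 * mean_x, 1)      # intercept, rounded as in the text

def predict_achievement(iq):
    """Predicted achievement score from an IQ score."""
    return b0 + b1 * iq

print("B1 =", b1)
print("B0 =", b0)
# Hypothetical use: predicted achievement for an IQ of 100
print("predicted achievement for IQ 100:", round(predict_achievement(100), 1))
```

Carrying the unrounded slope through the intercept calculation would give a slightly different B0, so small differences from the values above are due to rounding.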
Error of Prediction
Given an r value between the two variables, what kind of error in my predicted achievement score should be expected? This is a complicated problem, but there is an oversimplified way of dealing with it that is not too far off for anything other than extreme values. The standard error of the estimate can be thought of roughly as the standard deviation of the expected distribution of true Y values around a predicted Y value. The problem is that this distribution changes as you move across the X distribution, and so the standard error is not exactly correct for almost any individual prediction. However, it does give a reasonable basis for estimating the confidence interval around predicted scores. For standardized (z) scores, the standard error of the estimate is equation (a). For raw scores, it is equation (b):
For example, given a predicted Y score of 87, and a standard error of estimate of 5.0, we could speculate that our person's true score is somewhere between 87-2(5) and 87+2(5), that is, between 77 and 97, with roughly 95% confidence. Again, this is an oversimplification, and the procedures for making precise confidence intervals are best left for another time.
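The rough interval above can be sketched in code. The two standard error functions use the common simplified forms, sqrt(1 - r^2) for z scores and SDy times sqrt(1 - r^2) for raw scores; it is an assumption that these match the equations intended in this lesson, and some textbooks add a small-sample correction. All function names are hypothetical.

```python
from math import sqrt

def se_estimate_z(r):
    """Simplified standard error of the estimate for z scores (assumed form)."""
    return sqrt(1 - r ** 2)

def se_estimate_raw(r, sd_y):
    """Simplified standard error of the estimate for raw scores (assumed form)."""
    return sd_y * sqrt(1 - r ** 2)

def rough_interval(predicted, se, k=2):
    """Predicted score plus or minus k standard errors."""
    return predicted - k * se, predicted + k * se

# The example from the text: predicted score 87, standard error 5.0
low, high = rough_interval(87, 5.0)
print("rough interval:", low, "to", high)

# Hypothetical values: r = .60 and an SD of 10 for the Y scores
print("SE for z scores:  ", round(se_estimate_z(0.60), 2))
print("SE for raw scores:", round(se_estimate_raw(0.60, 10.0), 2))
```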