


Correlation Coefficient

In many data
analyses in social science, it is desirable to compute a coefficient of
association. Coefficients of association are quantitative measures of the amount
of relationship between two variables. Ultimately, most techniques can be
reduced to a coefficient of association and expressed as the amount of
relationship between the variables in the analysis. For instance, with a t
test, the correlation between group membership and score can be computed from
the t value. There are many types of
coefficients of association. They express the mathematical association in
different ways, usually based on assumptions about the data. The most common
coefficient of association you will encounter is the Pearson product-moment
correlation coefficient (symbolized as the italicized r), and it is the
only coefficient of association that can safely be referred to as simply the
"correlation coefficient". It is common enough so that if no other
information is provided, it is reasonable to assume that is what is meant. Let's return
to our data on IQ and achievement in the previous assignment, only this time,
disregard the class groups. Just assume we have IQ and achievement scores on
thirty people. IQ has been shown to be a predictor of achievement; that is, IQ
and achievement are correlated. Another way of stating the relationship
is to say that high IQ scores are matched with high achievement scores and low
IQ scores are matched with low achievement scores. Given that a person has a
high IQ, I would reasonably expect high achievement. Given a low IQ, I would
expect low achievement. (Please bear in mind that these variables are chosen
for demonstration purposes only, and I do not want to get into discussions of
whether the relationship between IQ and achievement is useful or meaningful.
That is a matter for another class.) So, the
Pearson product-moment correlation coefficient is simply a way of stating such
a relationship and the degree or "strength" of that relationship. The
coefficient ranges in value from −1 to +1. A value of 0 represents no relationship, and values of −1 and +1 indicate a perfect linear relationship. If each dot represents a single
person, and that person's IQ is plotted on the X axis, and their achievement score is plotted on the Y axis, we can make a scatterplot of the values, which allows us to visualize the degree of relationship or correlation between
the two variables. The graphic below gives an approximation of how variables X
and Y are related at various values of r.

The r value for a set of paired scores can be calculated as follows:

r = SPxy / √(SSx × SSy)

where SPxy is the sum of the products of the deviation scores, and SSx and SSy are the sums of the squared deviations for X and Y.

There is another method of calculating r which helps in understanding what the measure actually is. Review the ideas
measure actually is. Review the ideas
in the earlier lessons of what a z score is. Any set of
scores can be transformed into an equivalent set of z scores.
The variable will then have a mean of 0 and a standard deviation of
1. The z scores above the mean are positive, and z scores below the mean are negative. The r value for the correlation between the scores is then simply the sum of the products of the z scores for each pair divided by the total number of pairs minus 1:

r = Σ(z_{x} × z_{y}) / (n − 1)

This method of computation helps
to show why the r value signifies what it does. Consider several cases of pairs of scores on
X and Y. Now, when thinking of how the
numerator of the sum above is computed, consider only the signs of the scores
and signs of their products. If a
person's score on X is substantially below the mean, then their z score is large and negative. If they are also
below the mean on Y, their z score for Y is also large and
negative. The product of these two z
scores is then large and positive.
The product is also obviously large and positive if a person scores substantially above the mean on both X and Y.
So, the more the z scores on X and Y are alike, the more positive
the product sum in the equation becomes. Note that the more consistently people score in opposite directions on the two measures (negative z scores on X and positive z scores on Y, or vice versa), the more negative the product sum becomes.
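This sign logic is easy to check numerically. Below is a small sketch (the scores are made up for illustration, not the IQ/achievement data) that computes r as the sum of z-score products divided by n − 1 and checks it against a library routine:

```python
import numpy as np

# Made-up paired scores for illustration only.
x = np.array([85.0, 92.0, 100.0, 108.0, 115.0])
y = np.array([70.0, 78.0, 85.0, 90.0, 97.0])

n = len(x)

# Convert each variable to z scores using the sample standard deviation (n - 1).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Like-signed pairs give positive products; opposite-signed pairs give negative ones.
products = zx * zy

# r is the sum of the z-score products divided by n - 1.
r = products.sum() / (n - 1)

# This matches the usual Pearson r computation.
print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))
```

Because these made-up scores rise together, every product is positive and r comes out close to +1.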
This system sometimes helps to give insight into how the correlation
coefficient works. The r value is then an average of the products of the z scores (using n − 1 instead of n to correct for bias). When the signs of the z scores are random throughout the group, there is roughly equal probability of having a positive z-score product or a negative z-score product. You should be able to see how this would tend to lead to a sum close to zero.

Interpretation of r

One
interpretation of r is that the
square of the value represents the proportion of variance in one variable which
is accounted for by the other variable. The square of the correlation
coefficient is called the coefficient of determination. It is easy for most people to interpret quantities when they are on a linear scale, but squaring creates a curved (quadratic) relationship, which should be kept in mind when interpreting correlation coefficients in terms of "large", "small", etc. Note the graph
below which shows the proportion of variance accounted for at different levels
of r . Note that not even half
of the variance is accounted for until r reaches .71, and that values below .30 account for less than 10%
of the variance. Note also how rapidly
the proportion of variance accounted for increases between .80 and .90, as
compared to between .30 and .40. Note
that r = .50 accounts for only 25% of the variance. Be careful not to interpret r in a linear way, as if it were a percentage or proportion. It is the
square which has that quality. That is,
don't fall into the trap of thinking of r = .60 as "better than half",
because it clearly is not (it is 36%).

There are some obvious caveats in correlation and regression. One has
been pointed out by Teri in the last lesson.
In order for r to have the various properties needed for its
use in other statistical techniques, and in fact, to be interpreted in terms of
proportions of variance accounted for, it is assumed that the relationship
between the variables is linear. If the relationship between the
variables is curvilinear as shown in the figure below, r will be an incorrect estimate of the
relationship.
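A quick numeric sketch of this problem (using a made-up, perfectly U-shaped relation, not data from the lesson): even though Y is exactly determined by X, r comes out near zero because the relation is not linear.

```python
import numpy as np

# X values symmetric around zero; Y is a perfect (but curvilinear) function of X.
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2

# Pearson r only measures the linear component of the relationship.
r = np.corrcoef(x, y)[0, 1]

# Despite the perfect functional relationship, r is essentially 0.
print(round(r, 6))
```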
Notice that although the relationship between the curvilinear variables is actually stronger than the linear one, the r value is likely to be smaller for the curvilinear case because the linearity assumption is not met. This problem can be
addressed with something called nonlinear regression, which is a topic for
advanced statistics. However, it should
be obvious that one can transform the y variable (such as with log or square
functions) to make the relation linear, and then a normal linear regression can
be run on the transformed scores. This
is essentially how nonlinear regression works.

Another assumption is called homoscedasticity (pronounced "homo-see-das-ticity" or "homo-skee-das-ticity"). This is the assumption that the variance of one variable is the same across all levels of the other. The figure below shows a violation of the homoscedasticity assumption. These data are heteroscedastic. Note that Y is much better predicted at lower levels of X than at higher levels of X.

A related assumption is one of bivariate normality. This assumption is sometimes difficult to
understand (and it crops up in even more complicated forms in multivariate
statistics), and difficult to test or demonstrate. Essentially, bivariate
normality means that for every possible value of one variable, the values of
the other variable are normally distributed.
You may be able to visualize this by looking at the figure below with
thousands of observations (this problem is complicated enough to approach the
limits of my artistic ability). Think
of the normal curves as being frequency or density at their corresponding
values of X or Y. That is, visualize
them as perpendicular to the page.
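One way to get a feel for bivariate normality is to simulate it. The sketch below (the means, standard deviations, and the true r of .60 are arbitrary choices for illustration) draws a large bivariate normal sample and checks that the Y values within a narrow vertical slice of X center on the regression line, as the figure suggests:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameters: means, SDs, and a true correlation of .60.
mean = [100.0, 95.0]          # means of X and Y
sd_x, sd_y, rho = 15.0, 12.0, 0.60
cov = [[sd_x**2, rho * sd_x * sd_y],
       [rho * sd_x * sd_y, sd_y**2]]

# Draw a large bivariate normal sample.
xy = rng.multivariate_normal(mean, cov, size=100_000)
x, y = xy[:, 0], xy[:, 1]

# The sample r should be close to the true value.
r = np.corrcoef(x, y)[0, 1]

# Take a narrow vertical slice of X and look at the Y values within it.
in_slice = np.abs(x - 110.0) < 1.0
y_slice = y[in_slice]

# Under bivariate normality, the slice of Y is itself roughly normal,
# centered on the regression line: M_Y + rho * (sd_y / sd_x) * (110 - M_X).
predicted_center = mean[1] + rho * (sd_y / sd_x) * (110.0 - mean[0])
print(round(r, 3), round(y_slice.mean(), 1), round(predicted_center, 1))
```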
Regression and correlation are very sensitive to these
assumptions. The values for this type of analysis should not be overinterpreted. That is, quantitative predictions should be tempered by the validity of these assumptions.

Regression

It should be intuitive from the
explanation of the correlation coefficient that a significant correlation
allows some degree of prediction of Y if we know X. In fact, when we are dealing with z scores, the math for this prediction equation
is very simple. The predicted z for the Y score (z'_{y}) is:

z'_{y} = r × z_{x}

When the r value is used in this way, it is called a standardized regression coefficient, and the symbol used to represent it is often a lowercase Greek beta (β), so the standardized regression equation for regression of Y on X is written as:

z'_{y} = β × z_{x}

When we are not working with z scores, but we are
attempting to predict Y raw scores from X raw scores, the equation requires a
quantity called the unstandardized regression coefficient. This is usually symbolized as B_{1}, and allows for the following prediction equation for raw scores:

Y' = B_{0} + B_{1}(X)

The unstandardized regression coefficient (B_{1}) can be computed from the r value and the standard deviations of the two sets of scores (equation a). B_{0} is the intercept for the regression line, and it can be computed by subtracting the product of B_{1} and the mean of the X scores from the mean of the Y scores (equation b):

(a)  B_{1} = r × (s_{y} / s_{x})
(b)  B_{0} = M_{Y} − B_{1} × M_{X}

Now, suppose we are attempting to predict Y (achievement)
from X (IQ). Assume we have IQ and
Achievement scores for a group of 10 people. Suppose I want to develop a regression equation to make the best prediction of a person's Achievement if I am given their IQ score. I would proceed as follows.

First compute r. Now, it is a simple matter to compute B_{1}:

B_{1} = SPxy / SSx = 420 / 512.5 = 0.82

Now compute B_{0}:

B_{0} = M_{Y} − B_{1} × M_{X} = 94.8 − 0.82(99.5) = 13.2

The regression equation for predicting Achievement from IQ is then Y' = B_{0} + B_{1}(X), or:

ACHIEVEMENT SCORE = 13.2 + 0.82(IQ)

Error of Prediction

Given an r value between the two variables, what kind of error in my
predicted achievement score should be expected? This is a complicated problem, but an oversimplified way of dealing with it can be stated which is not too far off for anything other than extreme values. The standard error
of the estimate can be thought of
roughly as the standard deviation of the expected distribution of true Y values
around a predicted Y value. The problem
is that this distribution changes as you move across the X distribution, and so the standard error is not strictly correct for any particular prediction. However, it does give a reasonable estimate
of the confidence interval around predicted scores. For standardized (z) scores, the standard error of the estimate is equation (a). For raw scores, it is equation (b):

(a)  s_{est} = √(1 − r²)
(b)  s_{est} = s_{y} × √(1 − r²)

For example, given a predicted Y score of 87, and a standard error of estimate of 5.0, we could speculate that our person's true score is somewhere between 87 − 2(5) and 87 + 2(5), for roughly 95% confidence. Again, this is an oversimplification, and
the procedures for making precise confidence intervals are best left for
another time. 
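The worked regression example and the rough error-of-prediction rule can be put together in a short sketch. The summary quantities (SPxy = 420, SSx = 512.5, the two means, and the standard error of 5.0) are taken straight from the lesson; everything else is just the arithmetic shown above.

```python
# Summary quantities from the worked example in the lesson.
sp_xy = 420.0      # sum of products of deviation scores, SPxy
ss_x = 512.5       # sum of squared deviations for X (IQ), SSx
mean_x = 99.5      # mean IQ, M_X
mean_y = 94.8      # mean Achievement, M_Y

# Unstandardized regression coefficient: B1 = SPxy / SSx (about 0.82).
b1 = sp_xy / ss_x

# Intercept: B0 = M_Y - B1 * M_X (about 13.3 unrounded; the lesson
# rounds B1 to 0.82 first and gets 13.2).
b0 = mean_y - b1 * mean_x

def predict_achievement(iq):
    """Predicted Achievement from IQ using Y' = B0 + B1 * X."""
    return b0 + b1 * iq

# Rough confidence interval: predicted score plus or minus 2 standard
# errors of the estimate (roughly 95% confidence, per the
# oversimplified rule above).
def rough_interval(predicted, se_est):
    return predicted - 2.0 * se_est, predicted + 2.0 * se_est

print(round(b1, 2), round(b0, 1))
print(predict_achievement(100.0))
print(rough_interval(87.0, 5.0))
```

For the lesson's example of a predicted score of 87 with a standard error of 5.0, `rough_interval` reproduces the 77-to-97 band.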


