Difference between Correlation and Regression
 This topic has 27 replies, 20 voices, and was last updated 12 years ago by S.


October 22, 2006 at 2:36 am #44976
Kiran Varri (Participant, @KiranVarri)
Hi,
I have recently completed my Six Sigma Green Belt certification. I would like to know, with examples if any, the best way to explain the difference between correlation and regression. I am sorry if I sound dumb, but I am still a learner. Your kind responses will be appreciated.
October 22, 2006 at 10:59 am #145414
Kiran,
I do not consider myself a specialist but below is my attempt to answer your question:
Correlation:
If X and Y are correlated, there will be a relationship between the increase (or decrease) of the Y and the increase (or decrease) of the X.
Minitab provides two pieces of information:
The strength of the relationship (a coefficient whose magnitude lies between 0 and 1)
A p-value which tells you if the two variables are statistically correlated (if p is less than 0.05, there is correlation).
Regression:
Regression analysis calculates a mathematical relationship between the X (or Xs) and the Y.
By doing so, it will:
Tell you whether the X is statistically significant, that is, whether the factor should be taken into account in the mathematical equation (if p is less than 0.05, the term is significant)
Give you the value associated with the X to estimate the Y (example: Y = 3X)
Give you the accuracy of the mathematical relationship by providing an R-Sq value (if R-Sq is 0.95, your regression equation explains 95% of the variation in the data).
As far as I can see, if there is a strong correlation between two variables, it is very likely that the regression term will be considered significant because both use the same sort of variance analysis to perform the calculations.
Please note that regression provides other information, but I tried to keep my answer simple.
PS: Concerning correlation, I think you have to be cautious about the possible difference between correlation and causation. A correlation might exist just by chance even though no real link exists between the X and the Y.
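As a rough illustration of the two outputs described above, the following Python sketch computes a correlation coefficient and a least-squares line for a small made-up data set. The data values and the helper names `pearson_r` and `fit_line` are assumptions for illustration only; this is not Minitab's procedure, and it leaves out the p-values.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Made-up data: y is roughly 3 * x with a little scatter.
x = [1, 2, 3, 4, 5, 6]
y = [3.2, 5.9, 9.1, 11.8, 15.3, 17.9]

r = pearson_r(x, y)
slope, intercept = fit_line(x, y)

# Correlation gives one number for the strength of the linear link;
# regression gives an equation that can be used to estimate Y from X.
print(f"correlation r = {r:.3f}")
print(f"regression:   Y = {slope:.2f} * X + {intercept:.2f}")
```

The correlation answers "how strongly do X and Y move together?", while the fitted line answers "how much does Y change per unit of X?".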
Vincent
October 22, 2006 at 8:54 pm #145431
A good point about correlation and causation. For example, you might take data concerning washing your car and it raining the next day. You might have a good correlation coefficient, but one really doesn’t cause the other to happen.
One way I like to check causation is asking the question backwards.
First hypothesis: Every time I wash my car, it rains.
Second confirmation hypothesis: Every time it rains, I wash my car…
Obviously, washing your car has no effect on the weather.
October 23, 2006 at 6:30 pm #145529
Kiran Varri (Participant, @KiranVarri)
Thanks, Vincent and Bret. The response was very helpful and easy to understand. Kiran
October 30, 2006 at 6:02 pm #146133
Guido S (Participant, @GuidoS)
Hi Kiran,
The R-Sq from regression is r*r from correlation! For linear relationships, R-Sq and r carry the same meaning.
For quadratic and higher-order relationships there is a difference. Correlation is only for linear relationships!
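Both claims can be checked with a small sketch. The data below are made up, and the helper names `pearson_r` and `r_squared_of_line` are assumptions for illustration. The first data set is close to a straight line; the second is an exact parabola, symmetric about zero.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared_of_line(x, y):
    """R-squared of the least-squares straight line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Roughly linear, made-up data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y_linear = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
r_lin = pearson_r(x, y_linear)
rsq_lin = r_squared_of_line(x, y_linear)
print(f"linear data:    r*r = {r_lin ** 2:.4f}, R-Sq = {rsq_lin:.4f}")

# Exact quadratic relationship, symmetric about zero.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_quad = [v ** 2 for v in x_sym]
r_quad = pearson_r(x_sym, y_quad)
print(f"quadratic data: r = {r_quad:.4f}")
```

In the second case Y is completely determined by X, yet the linear correlation is zero, because r only measures the straight-line part of a relationship.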
Best regards, Guido
October 30, 2006 at 6:22 pm #146140
Jonathon Andell (Participant, @JonathonAndell)
Maybe it’s just semantics here, but regression is just an analytical tool. Correlation is one of several bits of information that can come from regression.
Causation may be the issue here. I heard that somebody was able to correlate birth rates to the number of stork nests, which spawned the belief that storks bring babies. (I don’t know whether that one is true, but it makes a good point.)
The point is, two factors can be correlated without one causing the other. Sometimes it’s because both correlated factors are caused by yet another factor. Occasionally it’s due to an as-yet unexplained factor.
I hope this helps a bit.
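The "yet another factor" case can be simulated. In the sketch below, a hidden factor `z` drives both `x` and `y`. The variable names, noise levels and sample size are assumptions chosen for illustration, not data from any real study.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical hidden factor z (an assumption of this sketch).
z = [random.gauss(0, 1) for _ in range(2000)]

# x and y each depend on z plus their own independent noise;
# neither one is ever computed from the other.
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]
r_xy = pearson_r(x, y)

# Remove the shared factor and correlate what is left.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = pearson_r(x_rest, y_rest)

print(f"r between x and y:                  {r_xy:.2f}")
print(f"r after removing the shared factor: {r_rest:.2f}")
```

Here x and y are strongly correlated even though neither causes the other; once the common factor is taken out, the correlation essentially disappears.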
January 28, 2008 at 5:25 am #167823
amir saeed khan (Participant, @amirsaeedkhan)
Hello, how are you? I am a student of BS bioinformatics in Pakistan. I had a lot of problems with statistics terms like correlation and regression, but now I haven’t. You are great. Thanks a lot.
January 28, 2008 at 7:07 am #167824
Kim Niles (Participant, @Kniles)
Kiran:
Good question. I had a statistics teacher answer this question once by telling me that correlation analysis is always through visual interpretation and regression analysis is always mathematical. I’ve stuck with that definition even though I’m not very comfortable with it. I hope to read more posts on this string.
KN – http://www.KimNiles.com
January 28, 2008 at 8:06 am #167827
Yes
Correct
What is difficult in that? It is common sense.
February 7, 2008 at 9:45 pm #168330
Hi all,
Stan, could you please explain a bit more?
Thanks,
Jane
May 21, 2008 at 5:39 pm #172139
Jude Ogunade (Participant, @JudeOgunade)
First, keep in mind that both terms refer to a relationship. Secondly, know that this relationship is between two variables. Thirdly, know that whilst correlation is concerned only with the variations in this relationship (another name for correlations could be covariations), regression seeks to find the best fit of the variations between these two variables, i.e. finding where the two variations come close to being the same thing.
July 11, 2008 at 4:56 am #173737
Thanks. Would you please tell me your mailing address?
July 15, 2008 at 8:01 pm #173853
Jude Ogunade (Participant, @JudeOgunade)
July 26, 2008 at 6:07 pm #174197
Alessio Toraldo (Participant, @AlessioToraldo)
Sorry to say that, but many of the above replies are, strictly speaking, inaccurate.
Statistics does not know anything about causal relationships, neither for correlation (C) nor for regression (R).
The real difference is the statistical model assumed by the person applying the analysis:
C: a bivariate Gaussian is assumed
R: a bivariate Gaussian is not assumed
To say the same thing in another way, the underlying model is:
Y = aX + b + E
a and b are constants; E is random noise, normally distributed, with mean = 0.
And here comes the difference: correlation assumes that X is normally distributed; regression does NOT assume so.
Perhaps less intuitive, but correct ;)
Cheers
AT
July 28, 2008 at 1:22 am #174215
Robert Butler (Participant, @rbutler)
From p. 33 of Applied Regression Analysis (Draper and Smith, First Edition) we have the following section:
1.6 The Correlation between X and Y
If X and Y were both random variables following some (unknown) bivariate distribution then we could define the correlation coefficient between X and Y as
rhoXY = covariance(X,Y)/sqrt[V(X)*V(Y)]
On pp. 34-35 of the same book we have the section
Correlation and Regression, which states:
(If we have a simple linear model Y = b0 + b1*X + e) …b1 is a scaled version of the correlation coefficient. (The two) are closely related but provide different interpretations. The correlation coefficient measures association between X and Y while b1 measures the size of the change in Y, which can be predicted when a unit change is made in X.
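The "scaled version" remark can be written out explicitly for the least-squares fit. The notation below (sums of squares S_XX, S_YY, S_XY, sample standard deviations s_X, s_Y, and sample correlation r) is assumed here for illustration and is not taken from the quoted text:

```latex
b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
    = \frac{S_{XY}}{S_{XX}},
\qquad
r = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}}
\quad\Longrightarrow\quad
b_1 = r\,\sqrt{\frac{S_{YY}}{S_{XX}}} = r\,\frac{s_Y}{s_X}.
```

So b1 and r always share the same sign, and b1 is r rescaled by the ratio of the spread of Y to the spread of X, which is why b1 carries units (Y per unit of X) while r does not.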
This is the difference between the two. Neither correlation nor regression assumes a bivariate Gaussian with respect to either X or Y.
August 3, 2008 at 11:28 pm #174521
Alessio Toraldo (Participant, @AlessioToraldo)
There is confusion between “correlation” and “regression” as statistical inference techniques, and “r” and “b” (coefficients) as simple mathematical indices.
Of course the two indices, b and r, can always be computed from any sample, irrespective of the shape of the underlying distribution: they are pure mathematical objects and nothing prevents us from computing them, much in the same way as nobody prevents us from, say, adding up 15 metres and 30 secs, getting 45 …?
The issue here is NOT with the COMPUTABILITY of b and r – they are, of course, always computable. The issue is their interpretation – i.e. the use of them in statistical inference.
Now – to draw a valid inference on the slope coefficient (b) in linear regression, you have to satisfy four basic assumptions: (i) that the function relating X to the expected value of Y, E(Y|X), is linear; (ii) that residuals from the regression line are normally distributed, with 0 mean, for each X value; (iii) that the variance of residuals is equal for all values of X (homoscedasticity); (iv) that residuals on different Xs are independent of each other.
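A very rough way to look at assumptions (ii) and (iii) is to fit the line and inspect the residuals. The sketch below uses simulated data that satisfy the assumptions by construction; the true slope, intercept and noise level are assumptions for illustration, and the split-in-half comparison is only a crude informal check, not a formal test.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Simulated data: straight line plus normal noise with constant spread.
x = [i / 10 for i in range(1, 201)]
y = [2.0 * xi + 1.0 + random.gauss(0, 1.0) for xi in x]

# Least-squares fit.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
intercept = my - slope * mx

# Residuals: observed Y minus the Y predicted by the fitted line.
residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]


def variance(values):
    """Sample variance with the usual n - 1 denominator."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Residuals of a least-squares line with intercept average to zero.
mean_resid = sum(residuals) / n

# Crude homoscedasticity check: compare the residual spread
# for the lower half of the X values with the upper half.
half = n // 2
var_low = variance(residuals[:half])
var_high = variance(residuals[half:])

print(f"fitted line: Y = {slope:.2f} * X + {intercept:.2f}")
print(f"mean residual: {mean_resid:.2e}")
print(f"residual variance, low X: {var_low:.2f}, high X: {var_high:.2f}")
```

With real data, a plot of residuals against X and a normal probability plot of the residuals are the usual next steps; a residual spread that grows or shrinks clearly with X would suggest that assumption (iii) is violated.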
These conditions need to be satisfied in order to have a valid statistical test on the regression angular coefficient b.
Instead, correct statistical interpretation of Pearson’s r requires the distribution to be a bivariate Gaussian (a different model from the minimal one required by regression, which – see above – only requires the residuals to be normally distributed).
It is very misleading to say that neither index assumes normality, because it draws attention to an irrelevant fact – the mere “computability” of the two indices in every possible situation – and takes attention away from what is really important for practical use, i.e. that r does not have a clear meaning if the distribution is not a bivariate Gaussian, and that b does not have a clear meaning either if the residuals are not normally distributed and independent.
August 4, 2008 at 1:21 am #174523
Michael Mead (Participant, @MichaelMead)
That was a great answer.
August 5, 2008 at 1:20 pm #174569
Robert Butler (Participant, @rbutler)
The original question posted back in 2006 was the following: what is the best way to explain the difference between Correlation and Regression?
The answer to that question is:
The correlation coefficient measures association between X and Y while b1 measures the size of the change in Y, which can be predicted when a unit change is made in X.
As stated, the question was one concerning the explanation of a concept. The answer that was provided did just that: it explained and (hopefully) it clarified. I suppose one could view this kind of answer as misleading and irrelevant, with a focus on mere computability and thus of no practical value, but I disagree.
When someone asks me to explain regression, my choice of explanation focuses on explaining what the machine is doing when it is told to regress Y on X. My explanation will indeed focus on the mere computability, that is, a clear explanation of what least squares regression does – because, in my experience, that is precisely what the individual posing the question wishes to know.
After having provided a clear explanation of the basic concept, I can use the understanding resulting from the comprehension of my explanation to go in any direction I or the individual asking the question may choose.
If the focus shifts to hypothesis testing and the difference between tests concerning r and b1, one can then begin by discussing the issues of normality that have been provided.
As noted, for most tests concerning r both X and Y have to be normal. However, if you just leave it at that, your statement is going to be a major call for inaction since, in my experience, people are going to spend an inordinate amount of time worrying about normality instead of getting on with the work. A better answer would be the following:
(For most tests concerning r, both X and Y have to be normal.) Often a bivariate population is far from normal. In some cases a transformation of the variables X and Y brings their joint distribution close to the bivariate normal, making it possible to estimate r in the new scale. (In spite of the nonnormality) we may still want to examine whether two variables are independent or whether they vary in the same or in opposite directions. For a test of the null hypothesis that there is no correlation, r may be used provided that one of the variables is normal. When neither variable seems normal, the best-known procedure is that in which X and Y are both converted to rankings. The rank correlation coefficient, due to Spearman, is the ordinary correlation coefficient r between ranked values of X and Y.
Statistical Methods, 7th Edition, Snedecor and Cochran, pp. 191-192
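The last sentence of the quotation can be shown directly: rank both variables, then compute the ordinary r on the ranks. This is a minimal sketch; the helper names `ranks` and `spearman_rho` and the data are assumptions for illustration, and ties are given their average rank.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Made-up data: Y always increases with X, but far from linearly.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]

print(f"Pearson r:    {pearson_r(x, y):.3f}")
print(f"Spearman rho: {spearman_rho(x, y):.3f}")
```

Because the ranking step discards the actual magnitudes, the rank correlation only reflects whether the two variables move in the same direction, which is why it is less sensitive to the shape of the distributions.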
What I don’t have, and would be very interested in reading, is a clear discussion of the issue of the word “close”, in particular just how robust tests concerning the correlation coefficient are to nonnormal behavior in X and Y. The text cited above says only that methods of expressing the amount of correlation in nonnormal data by means of a parameter like r have not proceeded very far.
Given that this book has a copyright of 1980, it’s probably safe to assume that things have proceeded farther, and it would be interesting to know just how far that is.
August 17, 2008 at 4:34 pm #174957
Alessio Toraldo (Participant, @AlessioToraldo)
I agree entirely on the final comment about how “serious” departures from normality are. Normality is more the exception than the rule, so tests should be used provided that violations are not huge and the test is robust (incidentally, I recall a paper which directly showed that even largely skewed distributions of X and Y have relatively little impact on the distribution of r, especially if Rho = 0 under the null hypothesis).
I suggested neither that the normality caveats should paralyze a researcher, nor that r and b can be computed with no regard as to their underlying distribution.
I just mentioned what the assumptions are, because this important point had not been addressed yet, and is indeed more relevant than other, typically misleading issues, like that of causality.
To reply to that statement by saying that “neither r nor b assume normality” is very detrimental to comprehension by other readers – it is like saying “X and Y are not eigenvalues” to spectators who do not have a background in mathematics.
August 17, 2008 at 4:38 pm #174958
Alessio Toraldo (Participant, @AlessioToraldo)
PS Incidentally, the discussion DID shift to hypothesis testing. I did the shift :)
August 17, 2008 at 4:57 pm #174959
Alessio Toraldo (Participant, @AlessioToraldo)
Ok – I posted a reply, but only the PS came out.
Well, briefly, I wrote that I agree entirely on the robustness issue – no one should be paralyzed by the suspicion of non-normality. I know of a paper addressing the issue of robustness of r, but it is even older (1977) than that cited by Robert – see reference below. They say the test is very robust with skewed and leptokurtic distributions. Much more recent work (Chin-Diew Lai, of Massey University, New Zealand) claimed that when the original distributions are lognormal – very clearly skewed – r is a biased estimator of the parameter Rho (the true correlation of the population from which the sample was extracted); the bias can be huge and nullified only with millions (!) of data points. Clearly, the positions are different.
Effect of the violation of assumptions upon significance levels of the Pearson r.
By Havlicek, Larry L.; Peterson, Nancy L.
Psychological Bulletin. 1977 Mar Vol 84(2) 373-377
Chin-Diew Lai, Department of Statistics, Massey University, New Zealand
John C W Rayner, School of Mathematics and Applied Statistics,
University of Wollongong, Australia
T P Hutchinson, School of Behavioural Sciences, Macquarie University, Australia
Most statistics students know that the sample correlation coefficient R is used to estimate the population correlation coefficient ρ. If the pair (X, Y) has a bivariate normal distribution, this would not cause any trouble. However, if the marginals are nonnormal, particularly if they have high skewness and kurtosis, the estimated value from a sample may be quite different from the population correlation coefficient ρ. Our simulation analysis indicates that for the bivariate lognormal, the bias in estimating ρ can be very large and it can be substantially reduced only after a large number (3-4 million) of observations. This example could serve as an exercise for the statistics students to realise some of the pitfalls in using the sample correlation coefficient to estimate ρ.
November 29, 2008 at 7:40 am #178164
Hi,
I am from Pakistan and doing a master’s. I have an assignment about this topic, and all this information helped me a lot. Thank you all. Bye
December 3, 2008 at 2:39 pm #178277
zeeshan ahmad (Member, @zeeshanahmad)
Correlation: It tells us about the relation between two variables
It is a two-way relation. It is symmetric.
Example: smoking is related to cancer
Regression: It tells us about the cause and effect
It is a one-way relation. It is asymmetric.
Example: smoking causes cancer
Here you cannot say cancer causes smoking. So it is a one-way relation.
December 3, 2008 at 3:21 pm #178279
This explanation is inaccurate. Regression does not tell us “cause and effect.” Such a conclusion would be based on the design, not the analysis. Ryan
April 22, 2009 at 2:40 pm #183601
anand singh (Participant, @anandsingh)
Thank you
April 22, 2009 at 4:12 pm #183604
Linear regression investigates and models the linear relationship between a response (Y) and predictor(s) (X). Both the response and predictors are continuous variables.
A Pearson correlation coefficient measures the extent to which two continuous variables are linearly related.
One models the relationship; the other quantifies it.
September 24, 2009 at 6:47 am #185655
Alessio Toraldo (Participant, @AlessioToraldo)
I fully agree with Ryan. Zeeshan ahmad: read the previous posts before repeating (wrong) solutions. A discussion with as many as 20 posts might well have provided better insights as to the solution.
So, I’d better repeat what the real difference between regression and correlation is. Regression pays attention to the change in Y as a function of a one-step change in X. The question it poses and investigates is in scalar units, e.g., one might wonder by how many centimeters (Y) children grow in one year of age (X). Correlation, instead, does not care about units of measurement. It is a pure number (no units) telling you how closely two variables X and Y match – to what extent they carry the same information. For instance, the age of children and their education (number of years at school) essentially measure the same thing. Indeed (almost) all children in first year are 6 years old; (almost) all children in second year are 7 years old, and so on. Correlation here is close to +1, meaning that the degree of redundancy of the two variables is extreme. By contrast, age and height in a set of adults are completely uncorrelated – to know how tall a given adult is does not tell you anything at all about his age. The correlation here is 0.
Another, seemingly more technical but crucial point: regression assumes that errors in Y are normally distributed, and nothing else; correlation much more strictly assumes that both X and Y are normally distributed. If these assumptions are not – at least vaguely – met, then hypothesis testing on regression (the slope, usually referred to as Beta) and on correlation (the Rho coefficient) cannot be properly carried out.
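The point about units can be checked numerically. In the sketch below the ages and heights are made up for illustration, and the helper names `pearson_r` and `slope_of` are assumptions. The same children are described once with age in years and once with age in months.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def slope_of(x, y):
    """Least-squares slope of y on x (change in y per unit of x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up ages and heights of children, for illustration only.
age_years = [4, 5, 6, 7, 8, 9, 10]
height_cm = [102, 109, 115, 122, 127, 133, 138]

# Same ages expressed in a different unit.
age_months = [12 * a for a in age_years]

slope_per_year = slope_of(age_years, height_cm)
slope_per_month = slope_of(age_months, height_cm)
r_years = pearson_r(age_years, height_cm)
r_months = pearson_r(age_months, height_cm)

print(f"slope: {slope_per_year:.2f} cm per year, {slope_per_month:.2f} cm per month")
print(f"r with age in years:  {r_years:.4f}")
print(f"r with age in months: {r_months:.4f}")
```

The regression slope changes with the unit of X because it answers a question posed in units; the correlation stays the same because it is a pure number.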
Do not listen to people telling you that normality assumptions are irrelevant for regression or correlation. These people refer to the mere computability of indices. They do not understand that the mere computability of something is useful only to the experts – if one has to explain to a learner what the essence of regression and correlation is, s/he should stress what is important to know for practice, not for theoretical, abstract (didactically empty) reasoning.
October 6, 2009 at 5:24 am #185934
I don’t know