Regression and Correlation

The basis for this is: Curve Fitting and the Least Squares Method

"That two phenomena appear at the same time does not,
for that simple reason alone, necessarily imply any connection between them"

 

In the link Curve Fitting and the Least Squares Method we built up the foundation for treating Regression and Correlation, which address the same kind of questions. Regression describes the form of an (asserted) relation between the variables; correlation indicates the strength of this (asserted) relation. There might well be a measured correlation between the production of chewing gum in the USA in the 1950s and the production of steel in the Soviet Union in the same period, but to call this anything beginning or ending with "cause" is just too much. I once co-operated with an ambitious chemical engineer. She claimed that we could not reject the possibility that distilled water causes cancer; it had not been investigated, was her argument. I replied that nobody knows precisely what causes cancer. The correlation coefficient is the figure indicating the strength, but remember, always remember: you choose the variables to be analyzed, and the empirical coefficient of correlation may often differ considerably from its true value, especially because of the small number of observations included in your so-called experience.

 

We continue from that link with the Least Squares Method and repeat the last calculation, the determination of the least-squares line.

 

Xg: The arithmetic mean of the variable X

Yg: The arithmetic mean of the variable Y

Y: The regressand (dependent variable) of the regression line

X: The regressor (independent variable) of the regression line

Ŷ: The fitted value of Y given by the regression line (similarly X̂ for the regression of X on Y)

s: The standard deviation

s^2: The variance

r: The coefficient of correlation

r^2: The coefficient of determination

b: The estimated slope of the regression line - the coefficient of regression - of Y on X (b' for X on Y).

 

 

Height X   Weight Y   x = X-Xg   y = Y-Yg       xy      x^2       y^2
   70        155         3.2        0.8         2.56    10.24      0.64
   63        150        -3.8       -4.2        15.96    14.44     17.64
   72        180         5.2       25.8       134.16    27.04    665.64
   60        135        -6.8      -19.2       130.56    46.24    368.64
   66        156        -0.8        1.8        -1.44     0.64      3.24
   70        168         3.2       13.8        44.16    10.24    190.44
   74        178         7.2       23.8       171.36    51.84    566.44
   65        160        -1.8        5.8       -10.44     3.24     33.64
   62        132        -4.8      -22.2       106.56    23.04    492.84
   67        145         0.2       -9.2        -1.84     0.04     84.64
   65        139        -1.8      -15.2        27.36     3.24    231.04
   68        152         1.2       -2.2        -2.64     1.44      4.84

ΣX = 802     ΣY = 1850                 Σxy =     Σx^2 =    Σy^2 =
Xg = 66.8    Yg = 154.2              616.32    191.68   2659.68

The required least square line has this equation (according to the least square link):

Y - 154.2 = 3.22(X – 66.8) or

Y = 3.22X - 60.9 (1)
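The calculation above can be checked with a short script (plain Python, no libraries; the means are kept unrounded here, so the intercept comes out at -60.7 rather than the hand-rounded -60.9):

```python
# Least-squares line for the height/weight data, from the raw observations.
X = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
Y = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
n = len(X)
Xg, Yg = sum(X) / n, sum(Y) / n          # arithmetic means
Sxy = sum((x - Xg) * (y - Yg) for x, y in zip(X, Y))
Sx2 = sum((x - Xg) ** 2 for x in X)
b = Sxy / Sx2                            # slope of the regression of Y on X
a = Yg - b * Xg                          # intercept
print(round(b, 2), round(a, 1))          # → 3.22 -60.7
```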

The variables X and Y above are not chosen without an a priori supposition of some linear relation: "height" and "weight" must follow each other to some degree, but precisely to which degree?

Figure (not reproduced): the regression line of Y on X, the line Y = Yg, and one example of an explained deviation with its corresponding unexplained deviation.

We take (21) from the link "Curve Fitting and the Least Squares Method":

y = (Σxy/Σx^2)x, which is the regression of Y on X, or x = (Σxy/Σy^2)y, which is the regression of X on Y (21 LS)

b = 616.32/191.68 = 3.22

b' = 616.32/2659.68 = 0.232

This means that the equation of the regression line may also be expressed as:

Y - 154.2 = 3.22(X - 66.8) (1)

In the figure, the example of explained variation refers to the linear regression of Y on X, and the unexplained variation is the rest of the total variation. X is called the regressor, Y the regressand.

Explained variation: Σ(Ŷj - Yg)^2, or measured as a variance: b^2 Σxj^2/1

(number of degrees of freedom = 1)

In the example with "height and weight": variance = 3.22^2 × 191.68 = 1,987

Unexplained variation: Σ(Yj - Ŷj)^2, or measured as a variance: s^2 = Σ(Yj - Ŷj)^2/(n-2)

(number of degrees of freedom = n-2 = 12-2 = 10, explained in the least squares link)

In the example: 677.4, so s^2 = 67.7

Total variation: Σ(Yj - Yg)^2 or Σy^2

In the example: 2,659.68; the divergence from the sum of explained and unexplained variation is caused by rounding. (number of degrees of freedom = n-1 = 12-1 = 11)

F ratio to be tested = (explained variance)/(unexplained variance) = b^2 Σxj^2/s^2

= [b^2 Σxj^2]/[Σ(Yj - Ŷj)^2/(n-2)]

In the example: 1,987/67.7 = 29.3 (degrees of freedom: 1 in the numerator, n-2 = 10 in the denominator)

Total number of degrees of freedom: 1 + 10 = 11
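A minimal sketch of the decomposition from the table sums (b is kept unrounded here, so the explained variation comes out at 1982 rather than the hand-rounded 1,987):

```python
# Variation decomposition and the F ratio, from the table sums.
Sxy, Sx2, Sy2, n = 616.32, 191.68, 2659.68, 12
b = Sxy / Sx2                      # slope, about 3.22
explained = b ** 2 * Sx2           # explained variation, 1 d.f.
unexplained = Sy2 - explained      # unexplained variation, n-2 d.f.
s2 = unexplained / (n - 2)         # unexplained variance
F = explained / s2                 # F ratio with (1, n-2) d.f.
print(round(explained), round(s2, 1), round(F, 1))   # → 1982 67.8 29.2
```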

Regression of Y on X                    Regression of X on Y

Ŷ = Yg + bx    Y - Ŷ    (Y - Ŷ)^2      X̂ = Xg + b'y    X - X̂    (X - X̂)^2
  164.50       -9.50      90.33           66.99          3.01       9.06
  141.96        8.00      64.00           65.83         -2.83       8.01
  170.94        9.06      82.01           72.79         -0.79       0.62
  132.30        2.70       7.27           62.35         -2.35       5.50
  151.62        4.38      19.15           67.22         -1.22       1.48
  164.50        3.50      12.22           70.00          0.00       0.00
  177.38        0.62       0.38           72.32          1.68       2.82
  148.40       11.60     134.47           68.15         -3.15       9.89
  138.74       -6.74      45.48           61.65          0.35       0.12
  154.84       -9.84      96.90           64.67          2.33       5.45
  148.40       -9.40      88.44           63.27          1.73       2.98
  158.06       -6.06      36.77           66.29          1.71       2.93

(first row worked out: Ŷ = 154.2 + 3.22 × 3.2 = 164.50 and X̂ = 66.8 + 0.232 × 0.8 = 66.99)

Σ(Y - Ŷ)^2 = 677.4                      Σ(X - X̂)^2 = 48.86
s^2 = Σ(Y - Ŷ)^2/(n-2) = 67.7           s^2 = Σ(X - X̂)^2/(n-2) = 4.89
s = 8.2                                 s = 2.2
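Both residual sums can be reproduced directly from the raw data (plain Python; small differences from the table come from rounding in the hand calculation):

```python
# Residual sums of squares for both regressions, from the raw data.
X = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
Y = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
n = len(X)
Xg, Yg = sum(X) / n, sum(Y) / n
Sxy = sum((x - Xg) * (y - Yg) for x, y in zip(X, Y))
b  = Sxy / sum((x - Xg) ** 2 for x in X)      # slope of Y on X
bp = Sxy / sum((y - Yg) ** 2 for y in Y)      # slope of X on Y
ss_y = sum((y - (Yg + b  * (x - Xg))) ** 2 for x, y in zip(X, Y))
ss_x = sum((x - (Xg + bp * (y - Yg))) ** 2 for x, y in zip(X, Y))
print(round(ss_y, 1), round(ss_x, 2))
```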

F-test and t-test

There is an alternative to the F test of the null hypothesis (H0): to test b = 0 by the t distribution, t = b/sb. For the moment, just remember that an alternative to H0 is H1, for example b > 0.

Learn more about Tests of Hypotheses in another link.

The second method, the t ratio, is preferable if a confidence interval is also desired. The two tests are equivalent because the t statistic is related to F by t^2 = F. There are three equivalent ways of testing the null hypothesis that the regressor has no effect on Y: the F test, the t test of b = 0, and the test of r = 0 in a bivariate normal population for various sample sizes n. Forget the last one for the moment (it is mentioned only to be complete).

F test of the ratio from the example: F = 29.3 with 1 and 10 degrees of freedom. Checking a table of F, F0.05 = 4.96 and F0.01 = 10.0, so the prob-value is well below 5% (indeed below 1%) and the null hypothesis must be rejected.

The estimate b can be tested by the t distribution as mentioned; the t distribution is related to F with one degree of freedom in the numerator by t^2 = F. The 95% confidence interval is

b ± t0.025 · s/√(Σx^2), with t0.025 = 2.23 for n-2 = 10 degrees of freedom

= 3.22 ± 2.23 × 8.2/√191.68

= 3.22 ± 1.32

Since b =0 is not included in the confidence interval, the null hypothesis must be rejected at the 5% level.
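The t statistic and the interval can be sketched as follows (t0.025 = 2.228 for 10 degrees of freedom is taken from a t table; note that t^2 reproduces the F ratio):

```python
import math

# t test of b = 0 and the 95% confidence interval for the slope.
Sxy, Sx2, Sy2, n = 616.32, 191.68, 2659.68, 12
b = Sxy / Sx2
s = math.sqrt((Sy2 - b * Sxy) / (n - 2))   # residual standard deviation
sb = s / math.sqrt(Sx2)                    # standard error of b
t = b / sb                                 # t statistic; t^2 equals F
t_crit = 2.228                             # t_0.025 with 10 degrees of freedom
lo, hi = b - t_crit * sb, b + t_crit * sb
print(round(t, 2), round(lo, 2), round(hi, 2))   # → 5.41 1.89 4.54
```

Since the interval lies entirely above zero, the code reaches the same conclusion as the hand calculation.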

The coefficient of determination and coefficient of correlation:

r^2 = [Σ(Ŷj - Yg)^2]/[Σ(Yj - Yg)^2] = (explained variation of Y)/(total variation of Y)

r = √(r^2) is called the coefficient of correlation.

The maximum value of the right-hand side is 1, so r lies between -1 and +1. If r^2 = 0 the regression line explains nothing. When r = 0, then b = 0; these are two equivalent ways of formally stating that there is no observed linear relation between X and Y. In the example, r^2 = 1,987/2,659.68 = 0.75 and r = 0.86.
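These quantities follow directly from the three sums in the table; the product of the two slopes b and b' also reproduces r^2, a useful check, since b·b' = (Σxy)^2/(Σx^2 · Σy^2):

```python
import math

# Coefficient of correlation and determination, plus the b * b' check.
Sxy, Sx2, Sy2 = 616.32, 191.68, 2659.68
r = Sxy / math.sqrt(Sx2 * Sy2)     # coefficient of correlation
b, bp = Sxy / Sx2, Sxy / Sy2       # the two regression slopes
print(round(r, 2), round(r ** 2, 2), round(b * bp, 2))   # → 0.86 0.75 0.75
```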

Would you propose to calculate the regression line and measure the correlation with Y as the regressor and X as the regressand?

Here I would prefer to start from pure logic: the reason why the "weight" is high or low may to some degree be determined by the "height", but not the other way round. Notice that the regression technique certainly does not provide such knowledge. And notice that in many cases the logic is not as smooth as it is in our example.

And this is very central to underline ("the regression fallacy"): it is easy to find, mathematically, an apparent correlation between two or more variables. But still it is not enough merely to request you to use logic; you certainly must use it before anything else, especially before applying a perfectly impartial but limited method.

Compare r with the standard deviation s (the rules are not proved here):

r = √(explained variation/total variation) = √(1 - sY.X^2/sY^2), or sY.X = sY √(1 - r^2), where sY.X is the standard error of estimate of Y on X

r = sXY/(sX sY), where sXY is the covariance of X and Y.

Y - Yg = (r sY/sX)(X - Xg), or y = (r sY/sX)x

Similarly for the regression of X on Y:

X - Xg = (r sX/sY)(Y - Yg), or x = (r sX/sY)y

The empirical coefficient of correlation: rXY = Σxy/[(n-1) sX sY]
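The same r via the covariance formula, from the raw observations (the (n-1) factors cancel, so it agrees with Σxy/√(Σx^2 · Σy^2)):

```python
import math

# Empirical correlation as covariance over the product of standard deviations.
X = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
Y = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
n = len(X)
Xg, Yg = sum(X) / n, sum(Y) / n
s_xy = sum((x - Xg) * (y - Yg) for x, y in zip(X, Y)) / (n - 1)  # covariance
s_x = math.sqrt(sum((x - Xg) ** 2 for x in X) / (n - 1))
s_y = math.sqrt(sum((y - Yg) ** 2 for y in Y) / (n - 1))
r = s_xy / (s_x * s_y)
print(round(r, 2))   # → 0.86
```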

Not seldom, as in the example above, one of the regressions is of such dominating practical interest that it is unequivocally clear which regression must express the relevant knowledge. On the other hand, the mere act of measuring might affect the measurement equipment, so that in reality you are testing the equipment and not the environment, with or without knowing it, and with or without admitting it.

 

The rules for multiple regression involving 3 or more variables correspond perfectly to those of this simple regression in 2 variables. Perhaps we should include the variable "age" in our example above to have a little more success in pointing out a linear relation between X and Y. The material could be grouped according to age, or age could be included as a third variable. Some of the basis has been laid in the link "least square". Regression is very tempting, among other things because the method does not assume that the material consists of values of a two-dimensional stochastic variable. The sample sizes of Y might vary, and the way the material is grouped might also vary; but take care, this last variation might decide whether you reach a good result seen from a purely "stochastic" point of view.
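A sketch of how the same machinery extends to a third variable, here with a purely hypothetical "age" column invented for illustration only (numpy's least-squares solver stands in for the normal equations):

```python
import numpy as np

# Multiple regression of weight on height and a second regressor.
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
age    = [34, 28, 41, 25, 30, 38, 45, 31, 26, 33, 29, 36]   # hypothetical values
weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
A = np.column_stack([np.ones(len(height)), height, age])    # design matrix
coef, *_ = np.linalg.lstsq(A, np.array(weight, float), rcond=None)
a, b1, b2 = coef          # intercept and the two partial regression coefficients
resid = np.array(weight) - A @ coef
print(b1, b2, float(resid @ resid))
```

Adding a regressor can only reduce the residual sum of squares, which is one reason the method is so tempting.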

To put it plainly, material might also be manipulated to help somebody's career. The sources of the data, and how they were selected, are often of the greatest importance. The regression method gives you no warning signals. A knife, too, has many different applications.

Perhaps this practical reality tells you something: the causal interpretation is almost always left open.