Curve Fitting and The Method of Least Squares
We often want to find the relation between different phenomena, such as high speed in cars and fatal accidents, or acid rain and the death of forests. Perhaps we have no sure knowledge of the real causes, but expect some kind of explanation to be the right one. What do we do? Are we able to narrow the gap between knowledge and belief or imagination? As things develop or fall apart faster and faster, the need to know the difference grows ever more quickly. On TV you have heard reports of several so-called investigations that show this or that. How are such investigations made, and are they always trustworthy? We intend to give you some honest advice and a foundation to judge upon. If we accept transforming the real problems or relations into problems within a mathematical model, we are able to handle the questions. But we have to be very careful to understand the difference between reality and the mathematical models. If we just believe what the reporters tell us, because we do not know how to check whether they are lying in order to make us follow the rulers for good (or bad), we are not independent, knowing individuals; in the end we are just a herd of sheep running the way the rulers want us to.
Relationship between variables
Very often in practice a relationship is found to exist between two (or more) variables. For example: the weights of adult males depend to some degree on their heights; the circumferences and areas of circles depend on their radii; and the pressure of a given mass of gas depends on its temperature and volume.
The last two types of relation can be proven purely mathematically or physically. But what about the thinning of the ozone layer and occurrences of skin cancer? Or women's inclination to have abortions in relation to where the women live, in cities and towns or in the country?
It is frequently desirable to express such relations in mathematical form by determining an equation connecting the variables.
Curve Fitting
To aid in determining an equation connecting variables, a first step is the collection of data showing corresponding values of the variables under consideration.
For example, suppose X and Y denote respectively the height and the weight of adult males. Then a sample of N individuals would reveal the heights X1, X2, ..., XN and the corresponding weights Y1, Y2, ..., YN.
A next step is to plot the points (X1,Y1), (X2,Y2), ..., (XN,YN) on a rectangular coordinate system. The resulting set of points is sometimes called a scatter diagram.
From the scatter diagram it is often possible to visualize a smooth curve approximating the data. Such a curve is called an approximating curve. In the next diagram, for example, the data appear to be approximated well by a straight line, and perhaps a linear relationship exists between the two variables. In the diagram to the right some other relation between the two variables seems to exist, perhaps some curve.
The general problem of finding equations of approximating curves which fit given sets of data is called curve fitting.
Equations of Approximating Curves
For purposes of reference we have listed below several common types of approximating curves and their equations. All letters other than X and Y represent constants. The variables X and Y are often referred to as independent and dependent variables respectively, although these roles can be interchanged.
No. | Equation | Name
1 | Y = a0 + a1X | Straight line
2 | Y = a0 + a1X + a2X^2 | Parabola
3 | Y = a0 + a1X + a2X^2 + a3X^3 | Cubic curve
4 | Y = a0 + a1X + a2X^2 + a3X^3 + a4X^4 | Quartic curve
5 | Y = a0 + a1X + a2X^2 + ... + anX^n | nth degree curve
6 | Y = 1/(a0 + a1X) or 1/Y = a0 + a1X | Hyperbola
7 | Y = ab^X or log Y = log a + X log b = a0 + a1X | Exponential curve
8 | Y = aX^b or log Y = log a + b log X | Geometric curve
9 | Y = ab^X + g | Modified exponential curve
10 | Y = aX^b + g | Modified geometric curve
11 | Y = pq^(b^X) or log Y = log p + b^X log q = ab^X + g | Gompertz curve
12 | Y = pq^(b^X) + h | Modified Gompertz curve
13 | Y = 1/(ab^X + g) or 1/Y = ab^X + g | Logistic curve
14 | Y = a0 + a1(log X) + a2(log X)^2 | Parabola in log X
The right sides of the above equations (1-5) are called polynomials of the first, second, third, fourth and nth degree respectively. The functions defined by the first four of these equations are sometimes called linear, quadratic, cubic and quartic functions respectively.
To decide which curve should be used, it is helpful to obtain scatter diagrams of transformed variables. For example, if a scatter diagram of log Y vs X shows a linear relationship the equation has the form (7), while if log Y vs log X shows a linear relationship the equation has the form (8). Semi-logarithmic or double-logarithmic graph paper, on which one or both scales are calibrated logarithmically, can be used to show such relationships directly.
Freehand Method of Curve Fitting
Individual judgment can often be used to draw an approximating curve to fit a set of data. This is called a freehand method of curve fitting. If the type of equation of this curve is known, it is possible to obtain the constants in the equation by choosing as many points on the curve as there are constants in the equation. For example, if the curve is a straight line, two points are necessary; if it is a parabola, three points are necessary. This method has the disadvantage that different observers will obtain different curves and equations. The method is not unambiguous.
The straight line
The simplest type of approximating curve is a straight line, whose equation can be written:
Y = a0 + a1X (15)
Given the two points (X1,Y1) and (X2,Y2) on the line, the constants a0 and a1 can be determined. The resulting equation can be written:
Y - Y1 = (X - X1)(Y2 - Y1)/(X2 - X1) or Y - Y1 = m(X - X1), where m = (Y2 - Y1)/(X2 - X1) is called the slope of the line and represents the change in Y divided by the corresponding change in X.
When the equation is written in the form (15) the constant a1 is the slope m. The constant a0, which is the value of Y when X = 0, is called the Y intercept.
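The two-point form can be sketched in a few lines of Python. This is only an illustration: the function name `line_through` is an assumption of this sketch, and the two points are the ones read off the freehand line in Problem 1 further down.

```python
def line_through(p1, p2):
    """Return (a0, a1) for the line Y = a0 + a1*X through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    a1 = (y2 - y1) / (x2 - x1)  # slope m: change in Y per unit change in X
    a0 = y1 - a1 * x1           # Y intercept: the value of Y when X = 0
    return a0, a1


# Two points chosen on a freehand line (taken from Problem 1).
a0, a1 = line_through((66, 156), (68, 170))
print(f"Y = {a1:g}X + ({a0:g})")
```

Choosing a different pair of points gives a different line, which is exactly the weakness of the freehand method.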
The Method of Least Squares
To avoid individual judgment in constructing lines, parabolas or other approximating curves to fit sets of data, it is necessary to agree on a definition of a "best fitting line", "best fitting parabola", etc.
To motivate a possible definition, consider the next figure, in which the data points are given by (X1,Y1), (X2,Y2), ..., (Xn,Yn). For a given value of X, say X1, there will be a difference between the value Y1 and the corresponding value determined from the curve. As indicated in the figure we denote this difference by D1; it is sometimes referred to as a deviation, error or residual and may be positive, negative or zero. Similarly, corresponding to the values X2, ..., Xn we obtain the deviations D2, ..., Dn.
A measure of the "goodness of fit" of the curve to the given data is provided by the quantity D1^2 + D2^2 + D3^2 + ... + Dn^2. Somebody might propose the plain sum D1 + D2 + D3 + ... + Dn to calculate the total divergence.
But that does not work. If the curve has been drawn accurately, the positive deviations balance the negative ones and the sum is close to zero; the same cancellation, however, can occur for a curve that fits the data badly.
Best fitting curve: Σ D^2 is a minimum, where Σ D^2 denotes the sum of all the D^2 from 1 to n. A curve having this property is said to fit the data in the least square sense and is called a least square curve.
Thus a line having this property is called the least square line, a parabola with this property is called a least square parabola, etc.
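A small Python sketch may make the difference between the two measures concrete. The data are the eight points of Problem 2 further down; the two candidate lines (a horizontal line through the average of Y, and the line Y = 6/11 + (7/11)X found in that problem) and all function names are choices made for this illustration only.

```python
def deviations(points, a0, a1):
    """Vertical deviations D = Y - (a0 + a1*X) of each point from the line."""
    return [y - (a0 + a1 * x) for x, y in points]


def sum_of_squares(points, a0, a1):
    """The goodness-of-fit measure D1^2 + D2^2 + ... + Dn^2."""
    return sum(d * d for d in deviations(points, a0, a1))


points = [(1, 1), (3, 2), (4, 4), (6, 4), (8, 5), (9, 7), (11, 8), (14, 9)]

candidates = {
    "horizontal line Y = 5": (5.0, 0.0),
    "line Y = 6/11 + (7/11)X": (6 / 11, 7 / 11),
}
for name, (a0, a1) in candidates.items():
    plain = sum(deviations(points, a0, a1))
    squared = sum_of_squares(points, a0, a1)
    print(f"{name}: sum of D = {plain:.3f}, sum of D^2 = {squared:.3f}")
```

For both lines the plain sum of deviations is zero up to rounding, so it cannot tell them apart, while the sum of squares is far smaller for the second line.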
It is customary to employ the above definition when X is the independent variable and Y is the dependent variable. If X is the dependent variable the definition is modified by considering the horizontal instead of the vertical deviations, which amounts to an interchange of the X and Y axes. These two definitions in general lead to different least square curves. Unless otherwise specified we shall consider Y as the dependent and X as the independent variable.
It is possible to define another least square curve by considering perpendicular distances from each of the data points to the curve instead of either vertical or horizontal distances. However, this is not often used.
The Least Square Line
The least square line approximating the set of points (X1,Y1), (X2,Y2), ..., (XN,YN) has the equation:
Y = a0 + a1X (18)
where the constants a0 and a1 are determined by solving simultaneously the equations:
Σ Y = a0N + a1Σ X
Σ XY = a0Σ X + a1Σ X^2 (19)
which are called the normal equations for the least square line (18). Solving them gives:
a0 = [(Σ Y)(Σ X^2) - (Σ X)(Σ XY)]/[NΣ X^2 - (Σ X)^2]
a1 = [NΣ XY - (Σ X)(Σ Y)]/[NΣ X^2 - (Σ X)^2] (20)
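As an illustration, the closed-form solution (20) can be written out in Python. This is a sketch, not a reference implementation: the function name `least_square_line` is an assumption, exact fractions are used only to avoid rounding, and the data are the points of Problem 2 further down.

```python
from fractions import Fraction


def least_square_line(points):
    """Return (a0, a1) of Y = a0 + a1*X from the closed-form solution (20)."""
    n = len(points)
    sx = sum(Fraction(x) for x, _ in points)       # sum of X
    sy = sum(Fraction(y) for _, y in points)       # sum of Y
    sxx = sum(Fraction(x) ** 2 for x, _ in points)  # sum of X^2
    sxy = sum(Fraction(x) * y for x, y in points)   # sum of XY
    denom = n * sxx - sx ** 2
    if denom == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    a0 = (sy * sxx - sx * sxy) / denom
    a1 = (n * sxy - sx * sy) / denom
    return a0, a1


points = [(1, 1), (3, 2), (4, 4), (6, 4), (8, 5), (9, 7), (11, 8), (14, 9)]
a0, a1 = least_square_line(points)
print(a0, a1)
```

Both constants share the same denominator, which is zero only when every X value is the same; in that case no least square line of Y on X exists.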
The normal equations (19) are easily remembered by observing that the first equation can be obtained formally by summing on both sides of (18), i.e. Σ Y = Σ (a0 + a1X) = a0N + a1Σ X, while the second equation is obtained by first multiplying both sides of (18) by X and then summing, Σ XY = Σ X(a0 + a1X) = a0Σ X + a1Σ X^2. Note that this is not a derivation of the normal equations but simply a means of remembering them. A derivation requires the calculus: the sum of squared deviations S is minimized by differentiating it with respect to a0 and a1 and setting each derivative equal to zero (a standard rule of calculus, not proved here), where
S = (a0 + a1X1 - Y1)^2 + ... + (a0 + a1XN - YN)^2
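For readers who want to see the calculus step, it can be sketched as follows. The index i running from 1 to N is notation added here; everything else follows the symbols of (18) and (19).

```latex
\begin{aligned}
S &= \sum_{i=1}^{N} \left(a_0 + a_1 X_i - Y_i\right)^2 \\
\frac{\partial S}{\partial a_0} &= 2\sum_{i=1}^{N} \left(a_0 + a_1 X_i - Y_i\right) = 0
  \;\Longrightarrow\; \sum Y = a_0 N + a_1 \sum X \\
\frac{\partial S}{\partial a_1} &= 2\sum_{i=1}^{N} X_i \left(a_0 + a_1 X_i - Y_i\right) = 0
  \;\Longrightarrow\; \sum XY = a_0 \sum X + a_1 \sum X^2
\end{aligned}
```

The two conditions on the right are exactly the normal equations (19).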
The labor involved in finding a least square line can sometimes be shortened by transforming the data so that x = X - Xg and y = Y - Yg, where the subscript g denotes the average. The equation of the least square line can then be written:
y = (Σ xy/Σ x^2)x or x = (Σ xy/Σ y^2)y (21)
where the first form is the line with Y as the dependent variable and the second the line with X as the dependent variable. This is easily proved, but the proof is not included here.
In particular, if X is such that Σ X = 0, i.e. Xg = 0, the first form becomes:
Y = Yg + (Σ XY/Σ X^2)X (22)
From these equations it is at once evident that the least square line passes through the point (Xg,Yg), called the centroid or center of gravity of the data.
If the variable X is taken as the dependent instead of the independent variable, we write (18) as X = b0 + b1Y. Then the above results hold if X and Y are interchanged and a0 and a1 are replaced by b0 and b1 respectively. The resulting least square line, however, is in general not the same as that obtained above.
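The deviation form (21) and the two different lines can be tried out with a short Python sketch. The helper name `centred_sums` is an assumption of this illustration, and the data are the points of Problem 2 further down.

```python
def centred_sums(points):
    """Return Xg, Yg and the sums of xy, x^2 and y^2 of the deviations."""
    n = len(points)
    xg = sum(x for x, _ in points) / n
    yg = sum(y for _, y in points) / n
    sxy = sum((x - xg) * (y - yg) for x, y in points)
    sxx = sum((x - xg) ** 2 for x, _ in points)
    syy = sum((y - yg) ** 2 for _, y in points)
    return xg, yg, sxy, sxx, syy


points = [(1, 1), (3, 2), (4, 4), (6, 4), (8, 5), (9, 7), (11, 8), (14, 9)]
xg, yg, sxy, sxx, syy = centred_sums(points)

# Y as dependent variable: Y = a0 + a1*X (vertical deviations)
a1 = sxy / sxx
a0 = yg - a1 * xg

# X as dependent variable: X = b0 + b1*Y (horizontal deviations)
b1 = sxy / syy
b0 = xg - b1 * yg

print(f"Y on X: Y = {a0:.3f} + {a1:.3f}X")
print(f"X on Y: X = {b0:.3f} + {b1:.3f}Y")
```

Both lines pass through the centroid (Xg,Yg), but solved for Y the second line has slope 1/b1, which differs from a1 unless the points lie exactly on a line.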
Non-linear Relationships
Non-linear relationships can sometimes be reduced to linear relationships by an appropriate transformation of variables. The logarithm function can sometimes be used for this. But the problem is often to find out whether the original data fit a known mathematical function at all.
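As a hedged illustration of such a transformation, the exponential curve Y = ab^X becomes a straight line in X and log Y, so a least square line can be fitted to the transformed points. The function name `fit_exponential` and the sample data are hypothetical, and the sketch assumes every Y is positive.

```python
import math


def fit_exponential(points):
    """Fit Y = a * b**X by a least square line through the points (X, log Y)."""
    n = len(points)
    xs = [x for x, _ in points]
    logs = [math.log(y) for _, y in points]  # requires Y > 0
    sx, sl = sum(xs), sum(logs)
    sxx = sum(x * x for x in xs)
    sxl = sum(x * l for x, l in zip(xs, logs))
    denom = n * sxx - sx * sx
    slope = (n * sxl - sx * sl) / denom        # estimates log b
    intercept = (sl * sxx - sx * sxl) / denom  # estimates log a
    return math.exp(intercept), math.exp(slope)


# Hypothetical data generated exactly from Y = 2 * 3**X.
points = [(x, 2 * 3 ** x) for x in range(5)]
a, b = fit_exponential(points)
print(round(a, 6), round(b, 6))
```

Note that this minimizes the squared deviations of log Y, not of Y itself, so for scattered data the result generally differs from a direct least square fit of the exponential curve.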
The Least Square Parabola
The least square parabola approximating the set of points (X1,Y1), (X2,Y2), ..., (XN,YN) has the equation:
Y = a0 + a1X + a2X^2 (23)
where the constants a0, a1 and a2 are determined by solving simultaneously the equations:
Σ Y = a0N + a1Σ X + a2Σ X^2
Σ XY = a0Σ X + a1Σ X^2 + a2Σ X^3
Σ X^2Y = a0Σ X^2 + a1Σ X^3 + a2Σ X^4 (24)
called the normal equations for the least square parabola (23).
A least square cubic or quartic curve can be determined by extending the multiplying technique described after (20): multiply the equation of the curve by 1, X, X^2, ... in turn and sum each time.
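The multiplying technique lends itself to a general sketch in Python: build one normal equation per power of X and solve the resulting system. The function names, the plain Gaussian elimination and the sample data are assumptions of this illustration; for high degrees or large X values the normal equations become numerically delicate.

```python
def solve_linear(a, b):
    """Solve the square system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("the normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def least_square_polynomial(points, degree):
    """Return [a0, ..., a_degree] solving the normal equations."""
    size = degree + 1
    # Row j is the curve's equation multiplied by X**j and summed over the data.
    a = [[sum(x ** (j + k) for x, _ in points) for k in range(size)]
         for j in range(size)]
    b = [sum(x ** j * y for x, y in points) for j in range(size)]
    return solve_linear(a, b)


# Hypothetical points lying exactly on Y = 1 + 2X + 3X^2.
points = [(x, 1 + 2 * x + 3 * x * x) for x in range(-2, 4)]
print(least_square_polynomial(points, 2))
```

With `degree=1` the same function reproduces the least square line, and with `degree=3` or `degree=4` the cubic and quartic curves.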
Regression
Often, on the basis of sample data, we wish to estimate the value of a variable Y corresponding to a given value of a variable X. This can be accomplished by estimating the value of Y from a least square curve which fits the sampled data. The resulting curve is called a regression curve of Y on X, since Y is estimated from X.
If we desired to estimate the value of X from a given value of Y we would use a regression curve of X on Y, which amounts to interchanging the variables in the scatter diagram so that X is the dependent variable and Y is the independent variable. This is equivalent to replacing vertical deviations in the definition of the least square curve by horizontal deviations. This rests on simple mathematics not included here. More about regression and correlation can be found under another link.
Applications to Time Series
If the independent variable X is time, the data shows the values of Y at various times. Data arranged according to time are called time series. The regression line or curve of Y on X in this case is often called a trend line or trend curve and is often used for purposes of estimation, prediction or forecasting.
Problems Involving More Than Two Variables
Problems involving more than two variables can be treated in a manner analogous to that for two variables. For example, there may be a relationship between the three variables X, Y and Z which can be described by the equation:
Z = a0 + a1X + a2Y (25)
which is called a linear equation in the variables X, Y and Z.
In a three-dimensional rectangular coordinate system this equation represents a plane, and the actual sample points (X1,Y1,Z1), ..., (XN,YN,ZN) might "scatter" not too far from this plane, which we can call an approximating plane.
By extension of the method of least squares, we can speak of a least square plane approximating the data. If we are estimating Z from given values of X and Y, this would be called a regression plane of Z on X and Y. The normal equations corresponding to the least square plane (25) are given by:
Σ Z = a0N + a1Σ X + a2Σ Y
Σ XZ = a0Σ X + a1Σ X^2 + a2Σ XY
Σ YZ = a0Σ Y + a1Σ XY + a2Σ Y^2 (26)
and can be remembered as obtained from (25) by multiplying by 1, X and Y successively and then summing.
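The three normal equations (26) can be set up and solved directly. The Python sketch below uses Cramer's rule, which is adequate for a 3 by 3 system; the function names and the sample points are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def least_square_plane(points):
    """Return (a0, a1, a2) of Z = a0 + a1*X + a2*Y from the normal equations (26)."""
    n = len(points)
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    sxz = sum(x * z for x, _, z in points)
    syz = sum(y * z for _, y, z in points)
    coeff = [[n, sx, sy], [sx, sxx, sxy], [sy, sxy, syy]]
    rhs = [sz, sxz, syz]
    d = det3(coeff)
    if d == 0:
        raise ValueError("X and Y do not determine a unique plane")
    solution = []
    for col in range(3):
        # Cramer's rule: replace one column by the right-hand side.
        m = [row[:] for row in coeff]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return tuple(solution)


# Hypothetical points lying exactly on Z = 1 + 2X - 3Y.
points = [(x, y, 1 + 2 * x - 3 * y) for x in range(3) for y in range(3)]
print(least_square_plane(points))
```

The determinant is zero when the (X,Y) pairs all lie on one straight line, in which case the data cannot separate the influence of X from that of Y.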
More complicated equations than (25) can also be considered. These represent regression surfaces. If the number of variables exceeds three, geometric intuition is lost, since we then require spaces of four, five or more dimensions.
Problems involving estimation of a variable from two or more variables are called problems of multiple regression and will be dealt with in more detail under another link.
Problems
1.
Corresponding data of the variables X and Y, height and weight of males
Height X (inches) | 70 | 63 | 72 | 60 | 66 | 70 | 74 | 65 | 62 | 67 | 65 | 68
Weight Y (pounds) | 155 | 150 | 180 | 135 | 156 | 168 | 178 | 160 | 132 | 145 | 139 | 152
Obtain a scatter diagram from the data:
The scatter diagram is obtained by plotting the points (70,155), (63,150), ..., (68,152). Using a ruler you will find several straight lines that apparently suit the relation in question. Choosing any two points on the line just drawn, you can calculate the slope and the equation of a fitting line, here with the points (66,156) and (68,170):
Y - Y1 = (X - X1)(Y2 - Y1)/(X2 - X1)
(170 - 156)/(68 - 66) = 7
Y - 156 = 7(X - 66)
Y = 7X - 306
If X = 63 then Y = 7*63 - 306 = 135, provided that the line expresses the relation between height and weight among males in the right way. We chose the best fitting line in the diagram, we hope. As we shall see below, this method is certainly not exact. Instead of the point (66,156) we could have chosen (65,139) and got the result Y = 10.33X - 532.7. A more exact method is practised below.
2.
Find the least square line to the following data: (1,1), (3,2), (4,4), (6,4), (8,5), (9,7), (11,8), (14,9)
The equation of the line is Y = a0 + a1X. The normal equations are:
Σ Y = a0N + a1Σ X
Σ XY = a0Σ X + a1Σ X^2
The work involved in computing the sums can be arranged as in the following table. Although the last column is not needed for this part of the problem, it has been added to the table for use with X as the dependent variable, which gives quite another result. The latter is called the regression of X on Y.
X | Y | X^2 | XY | Y^2
1 | 1 | 1 | 1 | 1
3 | 2 | 9 | 6 | 4
4 | 4 | 16 | 16 | 16
6 | 4 | 36 | 24 | 16
8 | 5 | 64 | 40 | 25
9 | 7 | 81 | 63 | 49
11 | 8 | 121 | 88 | 64
14 | 9 | 196 | 126 | 81
Σ X = 56 | Σ Y = 40 | Σ X^2 = 524 | Σ XY = 364 | Σ Y^2 = 256
Since there are 8 pairs of values of X and Y, N = 8 and the normal equations become:
8a0 + 56a1 = 40
56a0 + 524a1 = 364
Solved simultaneously, a0 = 6/11 or 0.545, a1 = 7/11 or 0.636.
Another method:
a0 = [(Σ Y)(Σ X^2) - (Σ X)(Σ XY)]/[NΣ X^2 - (Σ X)^2] = [(40*524) - (56*364)]/[(8*524) - (56)^2] = 6/11 or 0.545
a1 = [NΣ XY - (Σ X)(Σ Y)]/[NΣ X^2 - (Σ X)^2] = [(8*364) - (56*40)]/[(8*524) - (56)^2] = 7/11 or 0.636
The required least square line is therefore Y = 0.545 + 0.636X.
Perhaps we should try to estimate the regression line from the example with heights and weights a little more exactly:
Height X | Weight Y | x=X-Xg | y=Y-Yg | xy | x^2 | y^2
70 | 155 | 3.2 | 0.8 | 2.56 | 10.24 | 0.64
63 | 150 | -3.8 | -4.2 | 15.96 | 14.44 | 17.64
72 | 180 | 5.2 | 25.8 | 134.16 | 27.04 | 665.64
60 | 135 | -6.8 | -19.2 | 130.56 | 46.24 | 368.64
66 | 156 | -0.8 | 1.8 | -1.44 | 0.64 | 3.24
70 | 168 | 3.2 | 13.8 | 44.16 | 10.24 | 190.44
74 | 178 | 7.2 | 23.8 | 171.36 | 51.84 | 566.44
65 | 160 | -1.8 | 5.8 | -10.44 | 3.24 | 33.64
62 | 132 | -4.8 | -22.2 | 106.56 | 23.04 | 492.84
67 | 145 | 0.2 | -9.2 | -1.84 | 0.04 | 84.64
65 | 139 | -1.8 | -15.2 | 27.36 | 3.24 | 231.04
68 | 152 | 1.2 | -2.2 | -2.64 | 1.44 | 4.84
Σ X = 802, Xg = 66.8 | Σ Y = 1850, Yg = 154.2 | | | Σ xy = 616.32 | Σ x^2 = 191.68 | Σ y^2 = 2659.68
The required least square line has this equation:
Y - 154.2 = 3.22(X - 66.8) or
Y = 3.22X - 60.9
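The hand calculation can be checked with a short Python sketch that uses the unrounded averages instead of 66.8 and 154.2. The variable names and the helper `estimate_weight` are assumptions of this illustration; the data are the twelve pairs from the table.

```python
heights = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
weights = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]

n = len(heights)
xg = sum(heights) / n  # average height, not rounded
yg = sum(weights) / n  # average weight, not rounded

# Sums of products and squares of the deviations from the averages.
sxy = sum((x - xg) * (y - yg) for x, y in zip(heights, weights))
sxx = sum((x - xg) ** 2 for x in heights)

a1 = sxy / sxx        # slope of the least square line of Y on X
a0 = yg - a1 * xg     # intercept, since the line passes through (Xg, Yg)


def estimate_weight(height):
    """Estimate weight (pounds) from height (inches) with the fitted line."""
    return a0 + a1 * height


print(f"Y = {a1:.2f}X + ({a0:.1f})")
print(f"estimated weight at 63 inches: {estimate_weight(63):.1f}")
```

With unrounded averages the slope is still about 3.22 but the intercept comes out near -60.7, a small difference caused by the rounding in the table. The estimate for a height of 63 inches is roughly 142 pounds, against 135 from the freehand line in Problem 1.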
Sometimes, when the raw figures are large, it is an advantage to subtract a large constant from one or both of the variables before calculating. You must then remember to add the same constants back to the averages you compute to get the right result.
As you have now seen, the least square method does not have many rewarding uses.
In practice investigators often look upon the straight line or proportional relations as holy, perhaps because the straight line is the simplest to handle mathematically.
Phenomena in society can seldom be described in general as ordered variables following one or another known mathematical function.
The least square method presupposes that they can.
But the development of variables in time (time series) especially is often expressed in trend and season curves estimated with the use of least square curves. Here the time variable represents the regressor on the X-axis. As we shall see in another link, you might solve problems with more variables and still try to measure the linear dependence of the regressand (Y) on one or more regressors at a time. The problem often seems to be that you have perhaps collected too small a sample (when the expected results did not materialize); or should we rather claim that your assumptions were perhaps wrong, or adjusted just a little to more or less determining circumstances not included in your analyses? In reality, take good care of your assumptions and beliefs; they are the beginning and the end of your reality. Perhaps this practical reality tells you something: Causal interpretations almost left..