Linear regression
In statistics, linear regression is a method of estimating the conditional expected value of one variable y given the values of some other variable or variables x. The variable of interest, y, is conventionally called the "response variable". The terms "endogenous variable" and "output variable" are also used. The other variables x are called the explanatory variables. The terms "exogenous variables" and "input variables" are also used, along with "predictor variables". The term independent variables is sometimes used, but should be avoided as the variables are not necessarily statistically independent. The explanatory and response variables may be scalars or vectors. Multiple regression includes cases with more than one explanatory variable.
The term explanatory variable suggests that its value can be chosen at will, and the response variable is an effect, i.e., causally dependent on the explanatory variable, as in a stimulus-response model. Although many linear regression models are formulated as models of cause and effect, the direction of causation may just as well go the other way, or indeed there need not be any causal relation at all. For that reason, one may prefer the terms "predictor / response" or "endogenous / exogenous," which do not imply causality.
Regression, in general, is the problem of estimating a conditional expected value.
It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of [y = \alpha + \beta x] is a line. But in fact, if the model is
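- [y_i = \alpha + \beta x_i + \gamma x_i^2 + \varepsilon_i]
then the problem is still one of linear regression, even though the graph of y against x is not a straight line: x_i and x_i^2 simply play the role of two explanatory variables.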
Linear regression is called "linear" because the relation of the response to the explanatory variables is assumed to be a linear function of some parameters. Regression models which are not a linear function of the parameters are called nonlinear regression models. A neural network is an example of a nonlinear regression model.
Still more generally, regression may be viewed as a special case of density estimation. The joint distribution of the response and explanatory variables can be constructed from the conditional distribution of the response variable and the marginal distribution of the explanatory variables. In some problems, it is convenient to work in the other direction: from the joint distribution, the conditional distribution of the response variable can be derived.
- 1 Historical remarks
- 2 Statement of the linear regression model
- 3 Parameter estimation
- 3.1 Robust regression
- 3.2 Summarizing the data
- 3.3 Estimating beta
- 3.4 Estimating alpha
- 3.5 Displaying the residuals
- 3.6 Ancillary statistics
- 4 Multiple linear regression
- 5 Scientific applications of regression
- 6 See also
- 7 References
- 8 External links
Historical remarks
The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805, and by Gauss in 1809. The term "least squares" is from Legendre's term, moindres carrés. However, Gauss claimed that he had known the method since 1795.
Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun. Euler had worked on the same problem (1748) without success. Gauss published a further development of the theory of least squares in 1821, including a version of the Gauss-Markov theorem.
The term "reversion" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant ancestors. Francis Galton studied this phenomenon, and applied the slightly misleading term "regression towards mediocrity" to it (parents of exceptional individuals also tend on average to be less exceptional than their children). For Galton, regression had only this biological meaning, but his work (1877, 1885) was extended by Karl Pearson and Udny Yule to a more general statistical context (1897, 1903). In the work of Pearson and Yule, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925. Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.
Statement of the linear regression model
A linear regression model is typically stated in the form
- [ y = \alpha + \beta x + \varepsilon. \, ]
An equivalent formulation which explicitly shows the linear regression as a model of conditional expectation is
- [ \mathrm{E}(y|x) = \alpha + \beta x. \, ]
A linear regression model need not be affine, let alone linear, in the explanatory variables x. For example,
- [y = \alpha + \beta x + \gamma x^2 + \varepsilon]
is a linear regression model, because the right-hand side is linear in the parameters α, β and γ.
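As an illustration of why a model with a quadratic term still counts as linear regression, the following sketch fits α, β and γ by ordinary least squares. It treats 1, x and x² as three columns of a design matrix and solves the resulting normal equations. The function names `solve` and `fit_quadratic` and the data are made up for this example.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_quadratic(xs, ys):
    """Least-squares estimates of (alpha, beta, gamma) in y = alpha + beta*x + gamma*x^2."""
    rows = [[1.0, x, x * x] for x in xs]  # design matrix with columns 1, x, x^2
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(XtX, Xty)  # normal equations (X'X) d = X'y


# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
alpha, beta, gamma = fit_quadratic(xs, ys)
print(alpha, beta, gamma)
```

The fit is nonlinear in x, but the estimation step is exactly the linear least-squares problem, which is the sense in which the model is "linear".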
Often in linear regression problems statisticians rely on the Gauss-Markov assumptions:
- The random errors ε_i have expected value 0.
- The random errors ε_i are uncorrelated (this is weaker than an assumption of probabilistic independence).
- The random errors ε_i are "homoskedastic", i.e., they all have the same variance.
Sometimes stronger assumptions are relied on:
- The random errors ε_i have expected value 0.
- They are independent.
- They are normally distributed.
- They all have the same variance.
A statistician will usually estimate the unobservable values of the parameters α and β by the method of least squares, which consists of finding the values of a and b that minimize the sum of squares of the residuals
- [e_i = y_i - (a + bx_i). \,]
Notice that, whereas the errors are independent, the residuals cannot be independent because the use of least-squares estimates implies that the sum of the residuals must be 0, and the scalar product of the vector of residuals with the vector of x-values must be 0, i.e., we must have
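- [e_1 + \cdots + e_n = 0 \,]
and
- [e_1 x_1 + \cdots + e_n x_n = 0. \,]
These two linear constraints confine the vector of residuals to an (n − 2)-dimensional subspace, which is why one speaks of "n − 2 degrees of freedom for error". If the errors are also assumed to be independent and normally distributed, two further facts can be shown. First, the sum of squares of residuals
- [e_1^2 + \cdots + e_n^2 \,]
is distributed as
- [\sigma^2 \chi^2_{n-2}, \,]
i.e., the sum of squares divided by the error variance σ² has a chi-square distribution with n − 2 degrees of freedom. Second, the sum of squares of residuals is probabilistically independent of the estimates a, b of the parameters α and β.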
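The two constraints on the residuals can be checked numerically. The sketch below fits a least-squares line to a small made-up data set and confirms that the residuals sum to zero and are orthogonal to the x-values. The helper name `least_squares_line` is an assumption of this example, and it uses the centered form of the least-squares formulas.

```python
def least_squares_line(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals y_i - (a + b*x_i)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

sum_e = sum(residuals)                               # first constraint
sum_ex = sum(e * x for e, x in zip(residuals, xs))   # second constraint
dof = len(xs) - 2                                    # degrees of freedom for error

print(sum_e, sum_ex, dof)  # both sums are zero up to rounding error
```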
Parameter estimation
By recognizing that the regression model [y_i = \alpha + \beta x_i + \varepsilon_i ] is a system of linear equations, we can express the model using a data matrix X, a target vector Y and a parameter vector δ. The ith rows of X and Y contain the x and y values for the ith data sample. Then the model can be written as
- [ \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} ]
- [Y = X \delta + \varepsilon \,]
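Then it can be shown that the least-squares estimate of δ is
- [\widehat{\delta} = (X' X)^{-1}\; X' Y \,]
(where X′ is the transpose of X), and that the sum of squares of the residuals is
- [Y' (I_n - X (X' X)^{-1} X')\, Y,]
where I_n is the n × n identity matrix. The estimate
- [\widehat{\delta} = (X'X)^{-1}X'Y\,]
is a linear function of the vector Y.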
The matrix I_n − X(X′X)⁻¹X′ that appears above is a symmetric idempotent matrix of rank n − 2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix M can be diagonalized by an orthogonal matrix G, i.e., the matrix G′MG is a diagonal matrix. If the matrix M is also idempotent, then the diagonal entries in G′MG must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. Since the rank is n − 2, the matrix I_n − X(X′X)⁻¹X′, after diagonalization, has n − 2 1s and two 0s on the diagonal. That is most of the work in showing that the sum of squares of residuals has a chi-square distribution with n − 2 degrees of freedom.
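The idempotency and rank claims can be checked on a small example. The sketch below builds the matrix M = I_n − X(X′X)⁻¹X′ for a design matrix with columns (1, x), then verifies that MM = M and that the trace equals n − 2. For an idempotent matrix the trace equals the rank. The function names and data are assumptions of this example.

```python
def residual_maker(xs):
    """Return M = I_n - X (X'X)^{-1} X' for the design matrix X with columns (1, x)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx            # determinant of the 2x2 matrix X'X
    inv = [[sxx / det, -sx / det],     # closed-form inverse of X'X = [[n, sx], [sx, sxx]]
           [-sx / det, n / det]]
    X = [[1.0, x] for x in xs]
    H = [[sum(X[i][p] * inv[p][q] * X[j][q] for p in range(2) for q in range(2))
          for j in range(n)] for i in range(n)]
    return [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


xs = [1.0, 2.0, 4.0, 7.0, 11.0]        # hypothetical x-values
n = len(xs)
M = residual_maker(xs)
MM = matmul(M, M)

max_idempotency_gap = max(abs(MM[i][j] - M[i][j]) for i in range(n) for j in range(n))
max_asymmetry = max(abs(M[i][j] - M[j][i]) for i in range(n) for j in range(n))
trace = sum(M[i][i] for i in range(n))

print(max_idempotency_gap, max_asymmetry, trace)  # gaps near 0, trace near n - 2
```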
Regression parameters can also be estimated by Bayesian methods. This has the advantages that
- confidence intervals can be produced for parameter estimates without the use of asymptotic approximations,
- prior information can be incorporated into the analysis.
Robust regression
A useful alternative to linear regression is robust regression in which mean absolute error, or some other function of the residuals, is minimized instead of mean squared error as in linear regression. Robust regression is computationally much more intensive than linear regression and is somewhat more difficult to implement as well. However, least squares estimates are known to be very poor under certain circumstances, particularly when outliers are present. Since most real data are non-normal and often subject to outliers, robust estimation will frequently be preferable to least squares in practice.
Robust regression sometimes means linear regression with robust (Huber-White) standard errors (e.g. relaxing the assumption of homoskedasticity).
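To make the effect of an outlier concrete, the sketch below compares a least-squares line with a least-absolute-deviations line on made-up data containing one wild observation. The absolute-error fit is found by brute force. For a straight-line model, an optimal least-absolute-deviations line can always be chosen to pass through two of the data points, so the code tries every pair. This is only practical for small samples and is not how production robust-regression software works. All names and data here are assumptions of this example.

```python
from itertools import combinations


def least_squares_line(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


def least_absolute_line(xs, ys):
    """Return (a, b) minimizing the sum of absolute residuals.

    Brute force over all lines through two data points (O(n^3) work).
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # vertical line, not a valid regression line
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        cost = sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]


# Hypothetical data: y = 1 + 2x exactly, except for an outlier at x = 5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 40.0]

ls_a, ls_b = least_squares_line(xs, ys)
lad_a, lad_b = least_absolute_line(xs, ys)
print("least squares:  a=%.3f b=%.3f" % (ls_a, ls_b))
print("least absolute: a=%.3f b=%.3f" % (lad_a, lad_b))
```

On this data the least-squares slope is pulled far above 2 by the single outlier, while the absolute-error fit recovers the line followed by the other five points.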
Summarizing the data
We sum the observations, the squares of the Ys and Xs and the products XY to obtain the following quantities.
- [S_X = x_1 + x_2 + \cdots + x_n \,]
- [S_Y = y_1 + y_2 + \cdots + y_n \,]
- [S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2 \,]
- [S_{YY} = y_1^2 + y_2^2 + \cdots + y_n^2 \,]
- [S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \,]
Estimating beta
We use the summary statistics above to calculate b, the estimate of β.
- [b = {n S_{XY} - S_X S_Y \over n S_{XX} - S_X S_X}. \,]
Estimating alpha
We use the estimate of β and the other statistics to estimate α by:
- [a = {S_Y - b S_X \over n}. \,]
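The summary-statistic recipe can be written out as a short sketch. It accumulates the sums S_X, S_Y, S_XX and S_XY, then applies the standard closed-form least-squares expressions b = (n S_XY − S_X S_Y) / (n S_XX − S_X S_X) and a = (S_Y − b S_X) / n. The function name `fit_from_sums` and the data are made up for this example.

```python
def fit_from_sums(xs, ys):
    """Estimate (a, b) in y = a + b*x from the summary statistics of the data."""
    n = len(xs)
    s_x = sum(xs)
    s_y = sum(ys)
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * s_xy - s_x * s_y) / (n * s_xx - s_x * s_x)  # estimate of beta
    a = (s_y - b * s_x) / n                              # estimate of alpha
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_from_sums(xs, ys)
print(a, b)  # 1.0 2.0
```

Working from running sums means the data need only be read once, although for badly scaled data the subtraction in the denominator can lose precision.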
Displaying the residuals
The first method of displaying the residuals uses a histogram or cumulative distribution to depict their similarity (or lack thereof) to a normal distribution. Non-normality suggests that the model may not be a good summary description of the data.
We plot the residuals,
- [e_i=y_i - a - bx_i \,]
against the explanatory variable x. If the model is satisfactory for the data, the plot should show no discernible trend or pattern. Some possible problems are:
- Residuals increase (or decrease) as the explanatory variable increases: this indicates mistakes in the calculations, because least-squares residuals are uncorrelated with the explanatory variable. Find the mistakes and correct them.
- Residuals first rise and then fall (or first fall and then rise): this indicates that the appropriate model is (at least) quadratic. Adding a quadratic term (and then possibly higher) to the model may be appropriate. See polynomial regression.
- One residual is much larger than the others: this suggests that there is one unusual observation which is distorting the fit. Either
  - verify its value before publishing, or
  - eliminate it, document your decision to do so, and recalculate the statistics.
- Studentized residuals can be used in outlier detection.
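The "fall and then rise" pattern is easy to reproduce. The sketch below fits a straight line to made-up data that actually follow y = x² and lists the residuals in order of x. The helper name `least_squares_line` is an assumption of this example.

```python
def least_squares_line(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


# Hypothetical data from a quadratic relationship, fitted with a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x * x for x in xs]

a, b = least_squares_line(xs, ys)
residuals = [y - a - b * x for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle, which is the signature of a missing quadratic term rather than of random error.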
Ancillary statistics
The sum of squared deviations can be partitioned as in ANOVA to indicate what part of the dispersion of the response variable is explained by the explanatory variable.
The correlation coefficient, r, can be calculated by
- [r = {n S_{XY} - S_X S_Y \over \sqrt{(n S_{XX} - S_X^2) (n S_{YY} - S_Y^2)}}. \,]
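A minimal sketch of the correlation calculation from the same kind of summary sums follows. It uses the standard formula r = (n S_XY − S_X S_Y) / sqrt((n S_XX − S_X²)(n S_YY − S_Y²)). The function name `correlation` and the data are made up for this example. In simple linear regression, r² is the fraction of the dispersion of the response explained by the explanatory variable.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient computed from summary sums."""
    n = len(xs)
    s_x = sum(xs)
    s_y = sum(ys)
    s_xx = sum(x * x for x in xs)
    s_yy = sum(y * y for y in ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * s_xy - s_x * s_y
    denominator = math.sqrt((n * s_xx - s_x ** 2) * (n * s_yy - s_y ** 2))
    return numerator / denominator


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

r = correlation(xs, ys)
r_squared = r * r  # fraction of the variation in y explained by the fitted line
print(r, r_squared)
```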
Multiple linear regression
Linear regression can be extended to functions of two or more variables, for example
- [Z = a X + b Y + c +\varepsilon. \,]
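Here X and Y are explanatory variables. The values of the parameters a, b and c are estimated by the method of least squares, which minimizes the sum of squares of the residuals
- [S_r = \sum_{i=1}^n (a x_i + b y_i + c - z_i)^2. \,]
Setting the partial derivatives of S_r with respect to a, b and c equal to zero gives a system of three linear equations, the normal equations, for the estimates:
- [ \begin{bmatrix} \sum_{i=1}^n x_i^2 & \sum_{i=1}^n x_i y_i & \sum_{i=1}^n x_i \\ \\ \sum_{i=1}^n x_i y_i & \sum_{i=1}^n y_i^2 & \sum_{i=1}^n y_i \\ \\ \sum_{i=1}^n x_i & \sum_{i=1}^n y_i & n \\ \end{bmatrix} \begin{bmatrix} a \\ \\ b \\ \\ c \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n x_i z_i \\ \\ \sum_{i=1}^n y_i z_i \\ \\ \sum_{i=1}^n z_i \end{bmatrix} ]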
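The three-parameter case can be sketched by assembling the 3 × 3 normal-equation matrix from sums of products and solving it directly. Here the system is solved by Gaussian elimination with partial pivoting. The function names `solve` and `fit_plane` and the data are assumptions of this example, and a numerical library would normally be used instead.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_plane(xs, ys, zs):
    """Least-squares estimates of (a, b, c) in z = a*x + b*y + c."""
    n = len(xs)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    syy = sum(y * y for y in ys)
    sx, sy, sz = sum(xs), sum(ys), sum(zs)
    sxz = sum(x * z for x, z in zip(xs, zs))
    syz = sum(y * z for y, z in zip(ys, zs))
    normal_matrix = [[sxx, sxy, sx],
                     [sxy, syy, sy],
                     [sx, sy, float(n)]]
    return solve(normal_matrix, [sxz, syz, sz])


# Hypothetical noise-free data generated from z = 2x + 3y + 1.
xs = [0.0, 1.0, 0.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0]
zs = [2.0 * x + 3.0 * y + 1.0 for x, y in zip(xs, ys)]

a, b, c = fit_plane(xs, ys, zs)
print(a, b, c)
```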
Scientific applications of regression
Linear regression is widely used in biological and behavioural sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines. For example, early evidence relating cigarette smoking to mortality came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized experiments are considered to be more trustworthy than a regression analysis.
See also
References
Historical
- A.M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes (1805). "Sur la Méthode des moindres quarrés" appears as an appendix.
- C.F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum. (1809)
- C.F. Gauss. Theoria combinationis observationum erroribus minimis obnoxiae. (1821/1823)
- Charles Darwin. The Variation of Animals and Plants under Domestication. (1869) (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)
- Francis Galton. "Typical laws of heredity", Nature 15 (1877), 492-495, 512-514, 532-533. (Galton uses the term "reversion" in this paper, which discusses the size of peas.)
- Francis Galton. Presidential address, Section H, Anthropology. (1885) (Galton uses the term "regression" in this paper, which discusses the height of humans.)
- Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15:246-263 (1886).
- G. Udny Yule. "On the Theory of Correlation", J. Royal Statist. Soc., 1897, p. 812-54.
- Karl Pearson, G. U. Yule, Norman Blanchard, and Alice Lee. "The Law of Ancestral Heredity", Biometrika (1903)
- R.A. Fisher. "The goodness of fit of regression formulae, and the distribution of regression coefficients", J. Royal Statist. Soc., 85, 597-612 (1922)
- R.A. Fisher. Statistical Methods for Research Workers (1925)
Modern theory
- N.R. Draper and H. Smith. Applied Regression Analysis. Wiley Series in Probability and Statistics (1998)
External links
- [Regression Analysis]
- [ZunZun.com] Online curve and surface fitting.
- [Earliest Known uses of some of the Words of Mathematics]. See: [link] for "error", [link] for "Gauss-Markov theorem", [link] for "method of least squares", and [link] for "regression".
- [Online linear regression calculator.]
- [Online regression by eye (simulation).]
- [Leverage Effect] Interactive simulation to show the effect of outliers on the regression results
- [Linear regression as an optimisation problem]
- [Visual Statistics with Multimedia]
- [Multiple Regression] by Elmer G. Wiens. Online multiple and restricted multiple regression package.
From Wikipedia, the Free Encyclopedia.
All text is available under the terms of the GNU Free Documentation License.
