Are there non-probabilistic assumptions under which least squares regression can be justified as a rational method?

asked Mar 09 '11 at 09:02

Nikita Zhiltsov

edited Mar 09 '11 at 09:05

What is a rational method? Probability gives one way of defining rationality (it says that a rational method minimizes expected risk), but I'm not aware of a general definition.

Other than saying "whenever you would use a square loss function (that is, when the cost of making a mistake grows with the square of the error)" I'm not sure what you're looking for.

(Mar 09 '11 at 09:08) Alexandre Passos ♦

3 Answers:

The justification for least squares as a data fitting method is geometric and analytic.

It's based on the assumption that you want the model line that best fits the observed data. Under the least squares criterion, the best-fit line is the one that minimizes the sum of the squared vertical distances between each data point and the corresponding predicted point on the line.

That minimization is an optimization problem. It can be posed as a multivariate calculus problem: for the least squares criterion, set the partial derivatives of the sum of squared errors with respect to the model parameters to zero, then solve the resulting system of linear equations, for example with Cramer's rule.
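As a minimal sketch of that calculus route for a straight line y = a*x + b (the function name `fit_line` and the sample data are made up for illustration), the two partial derivatives give a 2x2 linear system that Cramer's rule solves directly:

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b via the normal equations and Cramer's rule."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Setting dS/da = 0 and dS/db = 0 for S = sum (y - a*x - b)^2 gives
    #   a*sxx + b*sx = sxy
    #   a*sx  + b*n  = sy
    det = sxx * n - sx * sx
    if det == 0:
        raise ValueError("all x values are identical; the line is not unique")
    # Cramer's rule: replace one column of the coefficient matrix at a time.
    a = (sxy * n - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b


# Illustrative data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

Since the sample points lie exactly on a line, the fit should recover that line's slope and intercept; with noisy data it returns the line with the smallest sum of squared vertical errors.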

Alternatively, one can use Chebyshev's approximation criterion (minimize the largest absolute deviation), which can be written as a linear program and solved with the simplex method.
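For a line y = a x + b fitted to m data points (x_i, y_i), one way to write the Chebyshev criterion as a linear program is sketched below. The symbol r for the largest absolute deviation is my notation, not necessarily the book's:

$$
\begin{aligned}
\text{minimize } \quad & r \\
\text{subject to } \quad & r - (y_i - a x_i - b) \ge 0, \\
& r + (y_i - a x_i - b) \ge 0, \qquad i = 1, \dots, m.
\end{aligned}
$$

The two constraints per point together say that |y_i - a x_i - b| <= r, so minimizing r minimizes the worst-case error. Note that a and b are unrestricted in sign, so a textbook simplex implementation that expects nonnegative variables would need them split into differences of nonnegative variables first.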

There are also numerous other methods with other requirements. Least squares is often used because it is computationally easier to work with and adapts well to fitting data of other functional forms (e.g. power functions).
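To illustrate the point about other functional forms, here is a small sketch that fits y = a*x^n with the exponent n fixed in advance. Setting the derivative of the sum of squared errors with respect to a to zero gives a = sum(x^n * y) / sum(x^(2n)). The function name and the data are hypothetical:

```python
def fit_power_coefficient(xs, ys, exponent):
    """Least-squares coefficient a for the model y = a * x**exponent (exponent fixed)."""
    # d/da of sum (y - a*x^n)^2 = 0  =>  a = sum(x^n * y) / sum(x^(2n))
    numerator = sum((x ** exponent) * y for x, y in zip(xs, ys))
    denominator = sum(x ** (2 * exponent) for x in xs)
    if denominator == 0:
        raise ValueError("all x values are zero; the coefficient is not determined")
    return numerator / denominator


# Illustrative data lying exactly on y = 3 * x**2.
coefficient = fit_power_coefficient([1.0, 2.0, 3.0], [3.0, 12.0, 27.0], 2)
print(coefficient)
```

Only one parameter is being estimated here, so there is a single equation instead of a system. If the exponent also had to be estimated, the problem would no longer be linear in the parameters.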

F. Giordano, W. Fox, S. Horton, and M. Weir, A First Course in Mathematical Modeling, discusses this and more at length in chapter 3.

It only becomes probabilistic when you ask what population of potential datasets the least squares line could represent. In that view, each point on the line is the mean of a normal distribution at that x value, and the data points you actually observed are a sample from those distributions. This model is reasonable under its assumptions (and under some modest violations of them). This explanation (pdf) might also provide some interesting insights for your question. Further analysis would be based on goodness of fit and other regression model evaluations and diagnostics.
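To make the link between the two views concrete, here is the standard derivation, assuming independent normal errors with a common variance (the symbols a, b, sigma and m are my notation). If y_i = a x_i + b + e_i with e_i drawn independently from N(0, sigma^2), the log-likelihood of the m observations is

$$
\log L(a, b) = -\frac{m}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{m} (y_i - a x_i - b)^2 .
$$

Only the last term depends on a and b, so maximizing the likelihood over a and b is the same as minimizing the sum of squared errors. Under these assumptions the probabilistic answer and the geometric one coincide.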

I hope that helps. If I've misunderstood the meaning of your question I apologize, but please feel free to follow up.

answered Mar 09 '11 at 12:25

Chris Simokat

From a geometric standpoint, minimizing the sum of squares makes sense when the underlying geometry is Euclidean. However, it often makes sense to minimize other quantities. For instance, if you want to do regression on vectors that represent word frequencies, the following measure of distance has proven to work better:

$$\arccos\left(\sum_i \sqrt{x_i y_i}\right)$$

Guy Lebanon's "Axiomatic geometries for text documents" talks more about geometry based approach to distance choice.
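A minimal sketch of that distance, assuming x and y are word frequencies normalized to sum to 1 (the function names are made up for illustration, and the clamping step is only there to guard against floating-point rounding):

```python
import math


def normalize(counts):
    """Turn raw nonnegative word counts into frequencies that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def arccos_distance(x, y):
    """arccos(sum_i sqrt(x_i * y_i)) for two frequency vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    overlap = sum(math.sqrt(xi * yi) for xi, yi in zip(x, y))
    # Rounding can push the sum slightly outside [0, 1], where acos is undefined.
    overlap = min(1.0, max(0.0, overlap))
    return math.acos(overlap)


same = arccos_distance(normalize([5, 3, 2]), normalize([5, 3, 2]))
disjoint = arccos_distance([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

Identical distributions come out at (numerically close to) zero, and distributions with no words in common come out at pi/2, which is the largest value this distance can take.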

answered Mar 09 '11 at 16:06

Yaroslav Bulatov

Unfortunately that chapter does not seem to be freely available online. Could you give the intuition as to why using that measure works better?

(Mar 09 '11 at 17:35) Oscar Täckström

That chapter is not available from official sites, but the electronic book containing it seems to be circulating. As to why it works better: it's derived from the Fisher information metric, so you are taking into account the "shape" of the sampling noise and incorporating it into the distance metric. If you look at a multinomial distribution over 3 words, the cloud of samples from that distribution looks like a triangle with rounded corners, which is captured by this metric.

(Mar 09 '11 at 17:54) Yaroslav Bulatov

"From a geometric standpoint, minimizing the sum of squares makes sense when the underlying geometry is Euclidean." In this case a least squares solution corresponds to the orthogonal projection of the vector of target values onto the subspace spanned by the basis functions (e.g. linear ones in the case of ordinary least squares). Is that right?

(Mar 10 '11 at 05:29) Nikita Zhiltsov
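One way to check the projection picture numerically: if the fitted values are the orthogonal projection of the target vector onto the span of the basis vectors (here the all-ones vector and the x vector), then the residual vector must be orthogonal to both. A small sketch with made-up data and a hypothetical helper name:

```python
def fit_line_centered(xs, ys):
    """Ordinary least-squares line y = a*x + b, computed from centered sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the line is not unique")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Illustrative noisy data (not from any real dataset).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

a, b = fit_line_centered(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

# Inner products of the residual with each basis vector.
dot_with_ones = sum(residuals)
dot_with_x = sum(r * x for r, x in zip(residuals, xs))
print(dot_with_ones, dot_with_x)
```

Both inner products should vanish up to floating-point rounding, which is exactly the statement that the residual is perpendicular to the subspace spanned by the basis functions.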

The book Optimal Estimation of Dynamic Systems (158488391X) has a great description of least squares and other algorithms (including Kalman filters). For least squares they give both a probabilistic and an analytic derivation, but as I recall the rest of the algorithms are derived using analytic techniques based on Lagrange multipliers.

They'd set up the typical least-squares fitting problem, take the appropriate derivatives, set up an expression involving Lagrange multipliers, solve for the multipliers, and finally plug them back in.

Their description was pretty straightforward, though I don't have the book in front of me to quote anything for you. Google Books or Amazon might have a preview of the sections I'm talking about.

-Brian

This answer is marked "community wiki".

answered Mar 09 '11 at 14:44

Brian Vandenberg

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.