I wonder why we have the Gaussian assumption in modelling the error. In Stanford's ML course, Prof. Ng describes it basically in two ways: (1) it is mathematically convenient, and (2) by the central limit theorem, the error aggregates many independent unmodelled effects, so it should be approximately Gaussian.
I'm interested in the second part, actually. The central limit theorem works for i.i.d. samples as far as I know, but we cannot guarantee that the underlying samples are i.i.d. Do you have any ideas about the Gaussian assumption on the error?
Usually it's for analytic convenience, but it's more than that. If the error has nonzero mean or nonzero odd moments, it can strongly bias the inference procedure in one direction (and you can't know which direction that is a priori). Of course, one could argue that it's a better idea to use, for example, Laplace-distributed error (which amounts to putting more mass on zero-valued error), spike-and-slab error (if you want to model that some points are exactly correct while others have Gaussian error), Cauchy-distributed error (for heavy-tailed error whose spread is far larger than any Gaussian would allow), etc., but all these choices lead to a higher cost at inference time, and some can even make exact inference intractable.
The rationale based on the central limit theorem is more like this: say we're estimating a person's weight given their height. There are many other factors that interfere: gender, obesity incidence, age, etc., but these factors are roughly mutually independent, so as long as there are many of them it shouldn't be too unreasonable to model their total contribution to the weight as Gaussian. In other words, assuming normal error means assuming that there are many other independent variables we're not modelling, and that their effects add up independently (a small simulation illustrating this appears after the comment thread below). Of course, it's always better to add more information if available.
Thank you for the detailed explanation. The error types were welcome. I didn't know about spike-and-slab error and couldn't find a good resource for the related distribution. Would you suggest any? Best regards.
(Feb 25 '11 at 09:03)
İsmail Arı
Spike-and-slab error is usually represented as a graphical model that assigns probability p to "this data point has no error" and probability 1-p to "this data point has Gaussian error with known variance sigma", and you have to do Bayesian inference to find the parameters.
(Feb 25 '11 at 09:11)
Alexandre Passos ♦
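For concreteness, here is a minimal generative sketch (in Python) of that spike-and-slab error model; the toy regression, `p_exact`, and `sigma` are made-up illustration values, and actually inferring them from data would require the Bayesian machinery discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression setup: y = x * true_slope plus spike-and-slab error.
n, true_slope, p_exact, sigma = 200, 2.0, 0.3, 1.0
x = rng.uniform(0, 10, size=n)

# With probability p_exact a point is noise-free (the "spike" at zero error),
# otherwise it gets Gaussian error with known variance sigma**2 (the "slab").
exact = rng.random(n) < p_exact
noise = np.where(exact, 0.0, rng.normal(0.0, sigma, size=n))
y = true_slope * x + noise

print(f"{exact.sum()} of {n} points lie exactly on the line")
```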
A good reference on spike-and-slab in general is http://arxiv.org/pdf/math/0505633
(Feb 25 '11 at 09:11)
Alexandre Passos ♦
Not to hijack this thread too hard, but I'm really digging the spike-and-slab model. Thanks for the citation.
(Mar 19 '11 at 12:14)
Andrew Rosenberg
@Andrew: Indeed, I also find spike-and-slab to be really interesting conceptually, and sometimes practically, but inference is not always very nice, as you pretty much have to use sampling if you want a sparse model (you can do loopy BP and variational, but then you'll probably get a dense continuous model that looks a lot like an elastic net).
(Mar 19 '11 at 12:36)
Alexandre Passos ♦
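A small simulation of the central-limit-theorem rationale from the answer above (a toy sketch of my own, not part of the original discussion): the total contribution of many independent, individually non-Gaussian factors ends up looking close to Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_factors = 100_000, 30

# Each unmodelled factor has its own (non-Gaussian) distribution;
# their sum plays the role of the error term for each sample.
factors = [rng.uniform(-1, 1, n_samples),
           rng.exponential(1.0, n_samples) - 1.0,      # centred exponential
           rng.binomial(1, 0.5, n_samples) - 0.5]      # centred coin flip
factors += [rng.uniform(-1, 1, n_samples) for _ in range(n_factors - 3)]
error = np.sum(factors, axis=0)

# Skewness and excess kurtosis of a Gaussian are both 0; the sum gets close.
z = (error - error.mean()) / error.std()
print("skewness:", (z**3).mean(), " excess kurtosis:", (z**4).mean() - 3)
```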
Just to take the other side of the argument, beyond whether samples are truly i.i.d.: a distribution I'm quite fond of is the log-normal distribution. A good introduction, with many good reasons for using it instead of Gaussians, is: E. Limpert, W. Stahel and M. Abbt (2001), "Log-normal Distributions across the Sciences: Keys and Clues", BioScience, 51 (5), 341–352.
What is interesting is that many measurements in the world can never take on negative values. A few that Alexandre mentions in his post actually fit in this category: weight, height, age... In fact I would argue that, more often than not, you're going to see this restriction in the sciences. Modelling these values with a Gaussian (especially if they're close to zero, i.e. the variance is large relative to the mean) can often lead to some rather strange results.
Gaussian distributions are useful as a pedagogical tool and they simplify the maths. If I were a being living in an abstract mathematical world, the best distribution to choose would be the normal distribution. However, we live in a world where measurements are often positive, discrete (my age is not 29.17808..., it's 29), sometimes bimodal and (if it's a really 'interesting' problem) long-tailed. Gaussians are often insufficient. So my point? Things aren't as simple in the real world as people often realize, and whether they realize it or not, they're using Gaussians primarily for mathematical convenience. I like being contrarian :)
The log-normal distribution seems to be cool. I enjoyed reading the sections "Why the normal distribution is so popular" and "Why the log-normal distribution is usually the better model for original data". The paper says that multiplication may be more important than addition for some processes in nature, which favours log behaviour. Thus it is nice to look at the data and, with intuition and background knowledge, decide on the distribution. The question becomes: what is the numerical correspondence when we use the log-normal distribution? The normal-distribution assumption leads to the pseudoinverse, which is a very nice, simple numerical technique.
(Feb 25 '11 at 11:19)
İsmail Arı
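On the numerical-correspondence question: one natural reading (an assumption here, not something settled in the thread) is that a log-normal error model for positive targets is just a Gaussian error model on the log of the targets, so the same pseudoinverse / least-squares machinery applies after a log transform. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy positive-valued data: y is log-normal around exp(X @ beta_true).
n = 500
X = np.column_stack([np.ones(n), rng.uniform(0, 3, n)])
beta_true = np.array([0.5, 1.2])
y = np.exp(X @ beta_true + rng.normal(0, 0.2, n))    # strictly positive

# Gaussian-error assumption on log(y) -> ordinary least squares via pseudoinverse.
beta_hat = np.linalg.pinv(X) @ np.log(y)
print("estimated coefficients:", beta_hat)           # should be near [0.5, 1.2]
```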
To add to Janto's point, many processes exist that cannot be easily (or sometimes tractably) modeled using 'simple' distributions like the normal, log-normal, etc. Among the most obvious are the various markets (stocks, commodities, foreign exchange, etc). These types of processes exhibit heavy-tailed behavior, and can be accurately modeled (though not necessarily predicted) using 'stable' distributions. Most stable distributions are rather difficult to work with and almost always require numerical approximations to determine probabilities and whatnot. The parameters can be tweaked such that samples from the chosen stable distribution are strictly positive, negative, etc (satisfying Janto's needs where he is using the log-normal distribution), or can exhibit non-symmetric behavior (e.g., the normal is symmetric around its mean; stable distributions can be asymmetric without being one-sided like the log-normal distribution). Chances are, no one reading this has actually heard of an ML technique using stable distributions, because they're difficult to work with, both algorithmically and computationally, as well as when deriving formulas/expressions. In order to overcome hurdles like this, where a process may be modeled accurately by an intractable distribution, it's not at all uncommon for people to put on their blinders and use a different distribution with the expectation that the actual behavior won't differ significantly from the true model. -Brian
(Feb 26 '11 at 03:49)
Brian Vandenberg
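For anyone who wants to experiment, recent SciPy versions ship a stable-distribution implementation as `scipy.stats.levy_stable` (treat the exact availability and parameterisation as an assumption of this sketch); sampling is straightforward even though the densities require numerical approximation:

```python
import numpy as np
from scipy import stats

# alpha < 2 gives heavy tails; beta != 0 makes the distribution asymmetric.
# (alpha = 2 recovers the Gaussian; alpha = 1, beta = 0 is the Cauchy.)
alpha, beta = 1.7, 0.5
samples = stats.levy_stable.rvs(alpha, beta, loc=0, scale=1,
                                size=100_000, random_state=0)

# Heavy tails show up as extreme quantiles far beyond anything Gaussian-like.
print("1% / 99% quantiles:", np.quantile(samples, [0.01, 0.99]))
print("most extreme draw:", samples[np.argmax(np.abs(samples))])
```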
I am currently reading a book entitled "The Black Swan: The Impact of the Highly Improbable". So far it is very interesting reading. I mention the book because, among other things, it criticizes the widespread use of Gaussian distributions in modeling. Whether you agree with the book's arguments or not (I'm still deciding), I recommend it for the critical thinking it stimulates.
(Mar 10 '11 at 20:28)
Art Munson
The classic central limit theorem is just one of a family of central limit theorems that give conditions under which sums of random variables, not necessarily i.i.d., approach a normal distribution. To copy a relevant segment from Terry Tao's blog post on universality: "Roughly speaking, this theorem asserts that if one takes a statistic that is a combination of many independent and randomly fluctuating components, with no one component having a decisive influence on the whole, then that statistic will be approximately distributed according to a law called the normal distribution".
There are also theorems which show Gaussian behaviour for sums of dependent variables; Wikipedia has an overview. Basically, you get Gaussian behaviour if your variables are not too dependent on each other.
An intuitive way to understand why we end up with a Gaussian is that adding a random variable increases entropy. The Gaussian is the highest-entropy distribution for a fixed variance, so sums of random variables will tend towards a Gaussian. Multiplying by a random variable can decrease entropy, so this argument doesn't apply to products of random variables.
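A quick toy illustration of the sum-versus-product contrast (my own sketch, assuming nothing beyond NumPy): sums of i.i.d. positive variables drift towards a Gaussian shape, their products do not, but the log of the product does, because the log turns the product into a sum.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_terms = 100_000, 50
u = rng.uniform(0.1, 1.0, size=(n_samples, n_terms))   # positive, non-Gaussian terms

sums = u.sum(axis=1)               # approximately Gaussian (classic CLT)
prods = u.prod(axis=1)             # heavily skewed, nothing like a Gaussian
log_prods = np.log(u).sum(axis=1)  # log of the product is a sum -> Gaussian-ish

for name, x in [("sum", sums), ("product", prods), ("log(product)", log_prods)]:
    z = (x - x.mean()) / x.std()
    print(f"{name:12s} skewness = {(z**3).mean():+.3f}")   # ~0 means symmetric
```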
As always, very interesting. I thought that if you modeled the error as the product of positive random variables then log-error should be normal by the central limit theorem. Is this invalidated by your entropy argument or is there something I'm not seeing?
(Feb 25 '11 at 18:23)
Alexandre Passos ♦
The product is not normal, but logarithm of the product is normal -- you can bring the log into the product which gives you a sum of random variables
(Feb 25 '11 at 18:50)
Yaroslav Bulatov
So the entropy can decrease in exp-space but increase in log-space by the same operation?
(Feb 25 '11 at 18:52)
Alexandre Passos ♦
Alexandre -- yes. Consider multiplying an N(0,1) random variable by a constant below 1: it decreases the differential entropy, whereas adding a constant does not change the entropy. There are probably a few technical conditions I'm missing to make the entropy argument rigorous; I think the "generating function" proof of the CLT on Wikipedia is the easiest rigorous way to show why you get a Gaussian.
(Feb 25 '11 at 19:10)
Yaroslav Bulatov
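A quick numerical check of that comment (a sketch using `scipy.stats.norm.entropy`, which reports differential entropy in nats): scaling an N(0,1) variable by c = 0.5 lowers its entropy by |ln 0.5|, while shifting it leaves the entropy unchanged.

```python
import numpy as np
from scipy import stats

h = stats.norm(loc=0, scale=1).entropy()            # 0.5 * ln(2*pi*e) ~= 1.419
h_scaled = stats.norm(loc=0, scale=0.5).entropy()   # multiply X ~ N(0,1) by c = 0.5
h_shifted = stats.norm(loc=3, scale=1).entropy()    # add a constant to X

print("original:", h, " scaled:", h_scaled, " shifted:", h_shifted)
print("change from scaling:", h_scaled - h, "= ln(0.5) =", np.log(0.5))
```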
There's also a more general version of the central limit theorem that applies to stable distributions. The normal distribution is a special case of the stable distributions, and the generalized central limit theorem basically says the same thing, but replacing i.i.d. random variables with random variables drawn from stable distributions (not necessarily the same distributions). In short: the sum of many stable random variables is also stable.
(Feb 26 '11 at 16:38)
Brian Vandenberg
The stable distribution is what the sum converges to; the individual components of the sum don't need to be stable.
(Feb 27 '11 at 18:47)
Yaroslav Bulatov
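A tiny illustration of the difference (my own sketch): averages of Cauchy samples (an alpha = 1 stable law) never concentrate as n grows, because the mean of n i.i.d. Cauchy variables is again Cauchy with the same scale, whereas Gaussian averages tighten like 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(4)

def iqr_of_means(draw, n_terms, n_repeats=10_000):
    """Interquartile range of the sample mean of n_terms i.i.d. draws."""
    means = draw(size=(n_repeats, n_terms)).mean(axis=1)
    q1, q3 = np.quantile(means, [0.25, 0.75])
    return q3 - q1

for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  gaussian IQR={iqr_of_means(rng.standard_normal, n):.3f}"
          f"  cauchy IQR={iqr_of_means(rng.standard_cauchy, n):.3f}")
# The Gaussian column shrinks with n; the Cauchy column stays roughly constant.
```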
Keep in mind that worrying about the error distribution as such is more of a statistician's fixation. I think the machine learning / data mining perspective runs more along the lines of:
1. What is an appropriate performance measure for this problem (mean absolute error, for instance)?
2. How do I minimize it (typically as measured on holdout data)?
This difference of perspective is explored in Friedman's "Data Mining and Statistics: What's the Connection?"
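As a purely illustrative sketch of that workflow (made-up data, plain NumPy): pick a performance measure, fit on a training split, and report the measure on the held-out part.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data: a noisy linear relationship.
X = rng.uniform(0, 10, size=(300, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2, size=300)

# Simple holdout split: first 200 points for fitting, last 100 for evaluation.
train, test = slice(0, 200), slice(200, 300)
A = np.column_stack([np.ones(200), X[train, 0]])
w = np.linalg.lstsq(A, y[train], rcond=None)[0]

# Evaluate the chosen performance measure (here, mean absolute error) on holdout.
pred = w[0] + w[1] * X[test, 0]
print("holdout MAE:", np.abs(pred - y[test]).mean())
```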
Adding to what will said: it can also be convenient to choose a distribution for the simple fact that it's easy to work with. The sigmoid is an example of this: it is the CDF of a logistic distribution and satisfies a simple ODE, so its derivatives (1st, 2nd, etc.) can be written in terms of the sigmoid's own output rather than its input. The Gaussian is convenient because (among other things) many real-world processes, when added together, behave approximately like a Gaussian (central limit theorem). The Poisson distribution can be used to model counts of events (in discrete or continuous time, via the Poisson process), it has convenient algebraic properties, and I'm sure there are plenty of other reasons to use it. The list could go on.
(Feb 28 '11 at 13:46)
Brian Vandenberg
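To make the sigmoid remark concrete (standard identities, not anything specific to this thread): the logistic sigmoid is the CDF of the standard logistic distribution, and its derivative can be written purely in terms of its own output, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).

```python
import numpy as np
from scipy import stats

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
s = sigmoid(x)

# The sigmoid is the CDF of the standard logistic distribution...
print(np.allclose(s, stats.logistic.cdf(x)))             # True

# ...and its derivative (the logistic pdf) depends only on the sigmoid's
# own output: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
print(np.allclose(s * (1 - s), stats.logistic.pdf(x)))    # True
```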