I know that generative means "based on P(x,y)" and discriminative means "based on P(y|x)," but I'm confused about several points:

  • Wikipedia (+ many other hits on the web) classify things like SVMs and decision trees as being discriminative. But these don't even have probabilistic interpretations. What does discriminative mean here? Has discriminative just come to mean anything that isn't generative?

  • NB is generative because it captures P(x|y) and P(y), and thus you have P(x,y) (as well as P(y|x)). Isn't it trivial to make, say, logistic regression (the poster boy of discriminative models) "generative" by simply computing P(x) in a similar fashion (same independence assumption as NB, such that P(x) = P(x0) P(x1) ... P(xd), where the MLEs for the P(xi) are just empirical frequencies)?

  • I know that discriminative models tend to outperform generative ones. What's the practical use of working with generative models? Being able to generate/simulate data is often cited, but when does this come up? (I personally only have experience with regression, classification, and collaborative filtering over structured data, so are those uses irrelevant to me here?)

  • What's with the completely contradictory quotes from Wikipedia? "generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks" vs. "discriminative models can generally express more complex relationships between the observed and target variables"

Related question that got me thinking about this.

asked Jun 27 '11 at 16:27 by Yang Zhang (edited Jun 28 '11 at 15:10)

3 Answers:

I think of generative algorithms more as algorithms that provide a model of how the data is actually generated (I think of them as giving you a model of both P(X|Y) [and P(Y)], rather than of P(X, Y), though I guess it's equivalent), and discriminative algorithms as algorithms that do a slightly more black-box classification (which doesn't necessarily have to be probabilistic).

Compare, for instance, Gaussian mixture models and k-means clustering. In the former, we have a nice probabilistic model for how points are generated (pick a component with some probability, and then emit a point by sampling from the component's Gaussian distribution), but in the latter there's no such story of how the points arise.
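
To make that generative story concrete, here is a minimal sketch in Python (mine, not the answerer's; all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters: component weights, means, and std devs.
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 2.0])
stds = np.array([0.5, 1.0])

def sample_gmm(n):
    """Generate n points by following the GMM's generative story."""
    # Step 1: pick a component for each point with probability `weights`.
    components = rng.choice(len(weights), size=n, p=weights)
    # Step 2: emit each point by sampling from that component's Gaussian.
    return rng.normal(means[components], stds[components])

points = sample_gmm(1000)
```

k-means, by contrast, only gives you centroids and hard assignments; there is no distribution here to sample new points from.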

[Note that generative algorithms have discriminative properties, since you can get P(Y|X) once you have P(X|Y) and P(Y) (by Bayes' Theorem), though discriminative algorithms don't really have generative properties.]

1: Discriminative algorithms are any algorithms that allow you to classify points, without providing a model of how the points are actually generated. So these could be either algorithms that try to learn P(Y|X) directly (e.g., logistic regression) or algorithms that try to learn the mappings directly from the points to the classes (e.g., perceptron and SVMs -- these pretty much simply give you a separating hyperplane, but no model of generating new points). So yep, discriminative classifiers are any classifiers that aren't generative.

Another way of thinking about this is that generative algorithms make structural assumptions about your model, while discriminative algorithms make fewer assumptions. For example, Naive Bayes (generative) assumes conditional independence of your features, while logistic regression (its discriminative "counterpart") does not; similarly, latent Dirichlet allocation (generative) assumes your topic mixtures are sampled from a Dirichlet distribution, while latent semantic analysis (not sure I'd quite call it a "discriminative algorithm", since it's not really a classifier, but it's more discriminative than generative) is pretty much just a mathematical algorithm.

2: Yep, Naive Bayes is generative because it captures P(X|Y) and P(Y). For example, if we know that P(Y = English) is 0.7 and P(Y = French) is 0.3, along with English and French word probabilities (e.g., P(hello|English) = 0.09, P(bonjour|French) = 0.08), then we can now generate a new document by first choosing the language of the document (English with probability 0.7, French with probability 0.3), and then generating words according to the chosen language's word probabilities.
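
As a toy sketch of that generation process in Python (the probabilities beyond those quoted above are invented so that each word distribution sums to one; a real model would have a full vocabulary per language):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: P(Y) and per-language word distributions P(word | Y).
p_language = {"English": 0.7, "French": 0.3}
p_word = {
    "English": {"hello": 0.09, "world": 0.41, "the": 0.50},
    "French":  {"bonjour": 0.08, "monde": 0.42, "le": 0.50},
}

def generate_document(n_words):
    # Step 1: choose the document's language by sampling from P(Y).
    langs = list(p_language)
    lang = rng.choice(langs, p=[p_language[l] for l in langs])
    # Step 2: emit words i.i.d. from that language's P(word | Y).
    words = list(p_word[lang])
    probs = [p_word[lang][w] for w in words]
    return lang, [rng.choice(words, p=probs) for _ in range(n_words)]

print(generate_document(5))
```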

Yeah, I guess you could make logistic regression generative in that fashion, but only by adding something to logistic regression that's not already there. That is, when you're performing a Naive Bayes classification, you're directly computing P(Y|X) ∝ P(X|Y) P(Y) [the terms on the right, P(X|Y) and P(Y), are what allow you to generate a new document]; but when you're computing P(Y|X) in logistic regression, you're not computing these two things, you're just applying a logistic function to a dot product.
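
A minimal sketch of the contrast between the two prediction rules (mine, not from the answer; it assumes already-fitted parameters: w, b for logistic regression, and log-probability tables for a multinomial Naive Bayes):

```python
import numpy as np

def predict_logistic(x, w, b):
    # Discriminative: P(Y=1 | X=x) is just a logistic function of a dot
    # product; nothing here tells you how to generate a new x.
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def predict_naive_bayes(x, log_p_y, log_theta):
    # Generative: combine P(X|Y) and P(Y) via Bayes' theorem. Here `x` is
    # a word-count vector and log_theta[k, i] = log P(word i | class k),
    # i.e. multinomial NB (constants that cancel across classes dropped).
    log_joint = log_p_y + log_theta @ x   # log P(x, y) for each class
    log_joint -= log_joint.max()          # for numerical stability
    joint = np.exp(log_joint)
    return joint / joint.sum()            # P(Y | X=x)
```

The same log_p_y and log_theta tables that drive predict_naive_bayes are exactly what the document-generation sketch above samples from; predict_logistic has no such tables to sample from.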

3: Elaborating a bit on what Alexandre Passos said, generative models often outperform discriminative models on smaller datasets because their generative assumptions place some structure on your model that prevents overfitting. For example, consider Naive Bayes vs. logistic regression. The Naive Bayes assumption is of course rarely satisfied, so logistic regression will tend to outperform Naive Bayes as your dataset grows (since it can capture dependencies that Naive Bayes can't). But when you only have a small dataset, logistic regression might pick up on spurious patterns that don't really exist, so the Naive Bayes assumption acts as a kind of regularizer on your model that prevents overfitting. There's a paper by Andrew Ng and Michael Jordan on discriminative vs. generative classifiers that discusses this: http://robotics.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
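
One quick way to see this effect is a learning-curve experiment; here is a rough sketch using scikit-learn (the dataset and training sizes are arbitrary choices of mine, not from the paper):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Compare test accuracy as the training set grows: NB tends to do
# relatively better early on, logistic regression tends to catch up
# and pass it as n increases.
for n in [10, 30, 100, 300, 1000, 2500]:
    nb = GaussianNB().fit(X_tr[:n], y_tr[:n])
    lr = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    print(n, nb.score(X_te, y_te), lr.score(X_te, y_te))
```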

4: I think what it means is that generative models can actually learn the underlying structure of the data if you specify your model correctly and the model actually holds, but discriminative models can outperform when your generative assumptions are not satisfied (since discriminative algorithms are less tied to a particular structure, and the real world is messy, so assumptions are rarely perfectly satisfied anyway). [I'd probably just ignore these quotes if they're confusing.]

answered Jun 28 '11 at 13:12 by grautur (edited Jun 28 '11 at 16:40)

  1. SVMs and decision trees are seen as discriminative in that they don't really bother saying anything about the Xs, just how to compute the Y given the X. And actually the probabilistic extension to SVMs is half generative, which is rather nice.

  2. You could make a generative model that has a logistic regression approach to generating the class labels. However, this would gain you nothing in terms of predictive performance, as the Xs are always observed, so there would be no edges in the graphical model connecting the Ys with whatever prior structure you had. As far as the classification goes, it would still be discriminative. A good rule of thumb is: "if you observe a new X without its Y, do you learn anything that can change the classifier?" If so, your model is at least a bit generative; otherwise it is discriminative (see the sketch after this list).

  3. Usually generative models are nicer to train (maximizing P(y,x) requires less normalization fudging than P(y|x)) and they work better when you don't have a lot of data. Also, it is easier to add hierarchical structure to them.

  4. Think of the first quote as doing something like learning a complicated class-based topic model to discriminate between two or more classes of text documents. You usually can't do this in discriminative models, as a lot of this structure is in the data distribution itself. Conversely, since discriminative models only model P(y|x), they don't have to spend model capacity on information about the Xs, so they can be more flexible in cases of model mismatch, which is what the second quote is trying to say.
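
The sketch promised in point 2 (a toy example of mine, not part of the original answer): a generative model with one-feature Gaussian class-conditionals can fold an unlabeled x into its parameter estimates via a single EM step, which moves the decision boundary, whereas logistic regression's objective has no term for an x without a y:

```python
import numpy as np

# Labeled data: two classes, class-conditional Gaussians, known std = 1.
x_lab = np.array([-2.0, -1.5, 1.5, 2.0])
y_lab = np.array([0, 0, 1, 1])

# Generative fit from labeled data alone: P(Y) and the means of P(X|Y).
prior = np.array([np.mean(y_lab == 0), np.mean(y_lab == 1)])
mu = np.array([x_lab[y_lab == 0].mean(), x_lab[y_lab == 1].mean()])

def responsibilities(x):
    """P(Y | X=x) under the current generative model."""
    lik = np.exp(-0.5 * (x - mu) ** 2)  # Gaussian likelihoods, std = 1
    post = prior * lik
    return post / post.sum()

# A new unlabeled point. E-step: soft-assign it to the classes;
# M-step: re-estimate each class mean with it included.
x_unl = 0.9
r = responsibilities(x_unl)
for k in (0, 1):
    xs = np.append(x_lab[y_lab == k], x_unl)
    ws = np.append(np.ones((y_lab == k).sum()), r[k])
    mu[k] = np.average(xs, weights=ws)

print(mu)  # the class means (hence the boundary) moved without any label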

answered Jun 27 '11 at 17:33 by Alexandre Passos ♦

For (3), is this because generative models can depend on the prior P(x) when little data is available to get a good estimate of P(y|x)?

(Jul 01 '11 at 15:29) crdrn

There are also two kinds of discriminative approaches: discriminative models and discriminative functions.

The former gives you p(y|x), while the latter only gives you $f: X \rightarrow Y$ -- it maps the input directly to a class.
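
A small illustration of the distinction using scikit-learn (my sketch, not Bishop's or the answerer's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron

X, y = make_classification(n_samples=200, random_state=0)

# Discriminative model: exposes p(y|x).
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba(X[:1]))  # class probabilities

# Discriminative function: only the map f: X -> Y.
func = Perceptron().fit(X, y)
print(func.predict(X[:1]))         # a class label, no probabilities
# Perceptron has no predict_proba; decision_function gives only an
# uncalibrated score, not p(y|x).
```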

edit: For the downvoters: This definition is not mine, it's Chris Bishop's.

answered Jun 27 '11 at 17:43 by Justin Bayer (edited Jun 29 '11 at 02:11)
