This has been asked before, but I still haven't grasped it completely. I know that generative models model the feature distribution, i.e. they model P(x|y) and P(y), which are not required if all we want is to classify (find P(y|x)).
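
(By "model P(x|y) and P(y)" I mean that a generative model recovers the classifier indirectly through Bayes' rule, P(y|x) = P(x|y)P(y) / sum over y' of P(x|y')P(y'), whereas a discriminative model parameterises P(y|x) directly.)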

Question: Many textbooks say that it is easier to include features in discriminative models, but this is rarely explained. They also mention that discriminative models allow overlapping features (features that are interdependent). Could anybody explain what this means and why it is true, or point me to something to read? It seems possible to include features in generative models as well, and I can't see why it should be easier or more efficient in discriminative models.

asked May 06 '14 at 11:48

Nawar Halabi

Which textbooks are you talking about?

(May 06 '14 at 13:58) eder

Machine Learning: A Probabilistic Perspective. On page 268 it just mentions that discriminative models handle feature preprocessing, unlike generative models, by replacing x with f(x) in the model, and I can't figure out why that is not possible in generative models. There are many other examples to list. Jebara's PhD thesis (http://www.cs.columbia.edu/~jebara/papers/jebara4.pdf) is really good, but I could not answer this question after reading it. I have also taken a look at this book, http://research.microsoft.com/en-us/um/people/cmbishop/prml/, and it did not work for me either.
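
(My reading of that passage, as far as I understand it, is that the model simply becomes something like p(y=1|x) = sigm(w^T f(x)) for logistic regression, i.e. the transform f is applied to the input before the linear part.)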

(May 09 '14 at 12:48) Nawar Halabi

One Answer:

OK, this is only a partial answer based on what I understand so far:

  1. It is easier to make the dataset separable (linearly or not) by applying a feature extraction step (a feature transformation, or preprocessing step as it is sometimes called) before classification than to make the dataset's class-conditional probability distributions follow an assumption (Gaussian, uniform, ...). Why? I am not entirely sure, but it seems that assuming a probability distribution is a stronger assumption than linear separability, in the case of logistic regression vs. a Bayes (generative) classifier. So assuming that, after the feature extraction, this distributional assumption is going to hold is even more ambitious than achieving linear separability by mapping into another, possibly high-dimensional, space (as in Support Vector Machines). See the sketch after this list.
  2. In the literature, feature functions are sometimes included as part of the logistic regression model (and sometimes not), but never for the generative counterpart. So notational convenience is my answer here, though deep down I am not convinced that this matters.
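
To make point 1 more concrete, here is a toy sketch (my own illustration using scikit-learn, not something taken from the books above): a discriminative logistic regression model can simply absorb a feature transform f(x) made of overlapping, interdependent polynomial features, while Gaussian naive Bayes keeps its class-conditional Gaussian-and-independence assumption on the raw inputs regardless of what the data actually looks like.

    # Toy sketch: discriminative model with a feature transform f(x)
    # (overlapping polynomial features) vs. a generative model that
    # assumes class-conditional Gaussians with independent features.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB

    # Two classes that are neither linearly separable nor Gaussian per class.
    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Discriminative: logistic regression on f(x) = polynomial features of x.
    # x1, x2, x1*x2, x1^2, ... are heavily interdependent ("overlapping"),
    # and the model never has to describe how they are distributed.
    disc = make_pipeline(PolynomialFeatures(degree=3),
                         LogisticRegression(max_iter=1000))
    disc.fit(X_tr, y_tr)

    # Generative: Gaussian naive Bayes models P(x|y) and P(y) directly,
    # assuming conditionally independent Gaussian features, which is false here.
    gen = GaussianNB().fit(X_tr, y_tr)

    print("logistic regression on f(x):", disc.score(X_te, y_te))
    print("Gaussian naive Bayes on raw x:", gen.score(X_te, y_te))

The exact numbers don't matter; the point is that adding f(x) only changed the discriminative model's input, while fixing the generative model would mean changing its distributional assumptions about P(x|y).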

Happy to hear some thoughts on this.

answered May 13 '14 at 12:29

Nawar Halabi

edited May 13 '14 at 12:30
