Document representations are used as the input to a supervised classifier in document-level tasks such as sentiment analysis, subjectivity detection, document categorization, language identification, etc.

However, a bag-of-words representation (perhaps with tf*idf weights) gives poor parameter estimates for terms that are rare in the supervised data. If words that are rare in the supervised corpus are more common in a large unsupervised corpus, how can we use that unsupervised corpus to induce document representations to serve as features for the supervised classifier? Moreover, I would like an online training algorithm.

What is an online, unsupervised technique for inducing document representations?

asked May 20 '10 at 16:30

Joseph Turian ♦♦

edited Oct 09 '10 at 11:50

7 Answers:

LSA / LSI (latent semantic indexing) is a classic linear technique for inducing low-dimensional dense document representations. The gensim package in Python implements LSA using incremental SVD, so it does not need to store the unsupervised document corpus in RAM. Gensim also implements LDA and random indexing for inducing document representations. According to their workshop paper (Rehurek and Sojka, 2010), they use gensim to induce LSA and LDA models over 270 million word tokens, with a vocabulary size of 300K word types.
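As a minimal sketch of the LSA idea (plain numpy batch SVD on a toy count matrix — not gensim's incremental implementation, which avoids materializing the full matrix; the data here is illustrative):

```python
import numpy as np

# Toy term-document count matrix: rows = documents, columns = vocabulary.
# In practice you would use tf*idf weights and a far larger vocabulary.
X = np.array([[2, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 2],
              [0, 1, 1, 1]], dtype=float)

# LSA = truncated SVD of the (weighted) term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_repr = U[:, :k] * s[:k]       # dense k-dim document representations

# "Fold-in": project a new document into the same latent space,
# usable directly as features for a supervised classifier.
new_doc = np.array([1.0, 0.0, 1.0, 1.0])
new_repr = new_doc @ Vt[:k].T
```
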

One can also apply deep learning to induce document representations. See for example:

  • "Semantic Hashing", Salakhutdinov (2007), wherein one can do constant-time semantic (not sure keyword-based) information retrieval by learning a 32-bit code for each document. Purely unsupervised.
  • Ranzato + Szummer (2008), wherein they compare dense and sparse hidden layers for a deep architecture. Unsupervised, followed by supervised fine-tuning for document categorization.

The difficulty with these deep approaches is that each training update scales as the size of the vocabulary, not as the number of non-zeros in the document's bag-of-words representation. Ugh. So while there might only be 50 different word types in the document, if your vocabulary is 100K you have to do a matrix operation over all 100K dimensions. The authors above only go up to a vocab size of 10K or 20K, if I recall.
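To make the scaling point concrete, here is a hypothetical sketch (toy sizes, a made-up constant gradient) contrasting a sparse update that touches only the document's non-zero word types with the dense update a reconstruction term would require:

```python
import numpy as np

vocab_size, hidden = 100_000, 50
W = np.zeros((vocab_size, hidden))

# A document with only 3 non-zero word counts out of a 100K vocabulary.
word_ids = np.array([5, 42, 999])
counts = np.array([2.0, 1.0, 1.0])

grad_h = np.ones(hidden)  # stand-in for a hidden-layer gradient

# Sparse update: touches only len(word_ids) rows -- O(nnz * hidden).
W[word_ids] -= 0.01 * np.outer(counts, grad_h)

# A dense reconstruction term (as in an RBM or autoencoder over the full
# bag-of-words) would instead update all 100K rows -- O(vocab * hidden),
# regardless of how few word types the document actually contains.
```
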

Finally, I have some not-yet-tested ideas for deep techniques for inducing document representations. Please post a followup comment if you want to learn more about these ideas.

answered May 20 '10 at 16:43

Joseph Turian ♦♦

edited May 20 '10 at 16:47

I would love to learn more about your ideas. I just took a quick look at that Semantic Hashing paper and have recently run into similar roadblocks. Right now I'm playing with LDA -- in (what feels like) a feeble attempt to bypass these limitations.

(Jun 30 '10 at 16:19) Andrew Nelder

The inputs for semantic hashing are actually bag-of-words vectors. (I think RBMs with Poisson-distributed inputs were introduced in that paper for this reason.)

Also, due to the pretraining required by deep nets composed out of RBMs, this is not an online solution.

(Oct 10 '10 at 09:46) Justin Bayer

Justin, yes, semantic hashing input is a bag-of-words.

The pretraining operates stochastically or over mini-batches, so training is purely online. If you had an infinite stream of documents, you could train your semantic hashing over the beginning of this stream. If that's not online, I'm not sure what is.

(Oct 10 '10 at 11:39) Joseph Turian ♦♦

However, training is divided into pretraining and fine-tuning. As soon as you get a new item from the stream, how are you going to handle it? Repeat pretraining on the whole dataset (would that still be online?)? Fine-tune only on the new item? It's not really clear to me.

(Oct 11 '10 at 22:11) Justin Bayer

@Justin Bayer: what is usually done is online training in two phases: for the first K examples you pretrain; for the others you fine-tune. An alternative, which I recall being described by Yann LeCun, is to do the pre-training online and use all the nodes of the network as features for a discriminative layer that is trained independently, also online. As long as the pre-training changes more slowly than this separate fine-tuning layer (easy to ensure with a good choice of learning rates, especially since this last layer is linear and linear models train fast), you can make it perfectly online.
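A toy sketch of this two-track scheme (an online autoencoder standing in for RBM pretraining, with a separate linear hinge-loss layer trained at a faster rate on its features; all data, labels, and rates are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 20, 5

W = rng.normal(scale=0.1, size=(d, h))   # unsupervised encoder weights
w_clf = np.zeros(h)                       # separate linear classifier layer

for t in range(1000):
    x = rng.normal(size=d)
    y = 1.0 if x[0] > 0 else -1.0        # hypothetical label

    # Unsupervised track: one SGD step on reconstruction error (slow rate).
    code = np.tanh(x @ W)
    err = code @ W.T - x                  # reconstruction residual
    W -= 0.001 * (np.outer(x, (err @ W) * (1 - code ** 2))  # encoder grad
                  + np.outer(err, code))                     # decoder grad

    # Supervised track: hinge-loss SGD on the current features (faster rate).
    if y * (code @ w_clf) < 1:
        w_clf += 0.01 * y * code
```

Both tracks see each example exactly once, so the whole procedure is online; the classifier simply tracks the slowly drifting features.
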

(Oct 11 '10 at 22:49) Alexandre Passos ♦

Thanks for the elaboration.

(Oct 12 '10 at 01:26) Justin Bayer

But then again, this will only work if your topics are static. If your corpus grows over a long period of time and you wish to continuously learn from it, the exchangeability assumption over documents that underlies LDA may no longer correspond to reality. Topics will most probably disappear and appear throughout training, and your model will get confused. See Blei's paper I referred to in my answer below. Online learning is to be used with care with LDA... there are other models, though, that deal with this kind of time dynamics (besides the Blei paper below, there's also a paper by Wang and McCallum).

(Oct 12 '10 at 08:54) Breno

You should probably also look at Latent Dirichlet Allocation (LDA). There is quite a bit of software available for LDA and the good news is that training an LDA model scales relatively nicely: generally it is easier to train than any neural net (or deep architecture) based approach.

You can train an LDA model and slowly add datapoints to do incremental online learning. Specific details for LDA can be found in this work.
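As a hedged sketch of incrementally adding datapoints to an LDA model (using scikit-learn's online variational Bayes via partial_fit, not necessarily the specific method in the linked work; the data is synthetic):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy stream of bag-of-words count vectors: 3 batches of 10 docs,
# over a 50-word vocabulary.
batches = [rng.poisson(0.5, size=(10, 50)) for _ in range(3)]

lda = LatentDirichletAllocation(n_components=5,
                                learning_method="online",
                                random_state=0)
for X in batches:          # slowly add datapoints as they arrive
    lda.partial_fit(X)

# 5-dim doc-topic mixtures, usable as document features downstream.
features = lda.transform(batches[-1])
```
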

answered Jun 30 '10 at 02:55

Jurgen

Maybe it's unrelated to this, but this EMNLP paper by Sauper, Haghighi, and Barzilay might be of interest to you: Incorporating content structure into text analysis applications. The idea behind it is how to tie together an unsupervised generative model (such as a hidden Markov model or a topic model) with a supervised logistic regression classifier. The techniques they propose are not online, but since they're based on EM, it should be possible to adapt them to an online setting using something like Percy Liang's paper on online EM for unsupervised models, although I haven't looked too closely at this combination. You can also use a classifier more complex than logistic regression, as the M step for this model is essentially a gradient step on the classifier plus an update of the count matrices of the unsupervised models.

answered Oct 09 '10 at 16:26

Alexandre Passos ♦

There's also been at least one straightforward extension of LDA that appears to provide somewhat more intuitive semantic categories than standard LDA: The Hidden Markov Topic Model: A Probabilistic Model of Semantic Representation.

hope that helps, =joe

answered Oct 10 '10 at 01:47

Joseph Austerweil

LDA is definitely one way to go; it's unsupervised and works pretty well.

Apart from Canini's paper, there's also one by McCallum and two of his students here.

Nevertheless, you should be aware that LDA is not really suitable for online learning. The exchangeability assumption on the documents is stressed if you decide to learn incrementally over a long period of time. This assumption is a fundamental part of LDA, and this question is also discussed by Blei himself here.

answered Oct 10 '10 at 17:18

Breno


This was the topic of my thesis. The method I tried is similar to and motivated by Salakhutdinov's approach. But instead of using autoencoders directly on the tf vectors, I first ran PCA on the tf-idf feature vectors, projected them onto the top 100 principal components, and used those to train a much smaller RBM with Gaussian visible units. This is much more scalable than applying an autoencoder to the tf vectors, and it actually improves over PCA alone. Also, since you train a single RBM, you don't need expensive fine-tuning. I improved the results even more by splitting the RBM into smaller RBMs and training those, as described in the question I posted earlier on parallel RBMs. (I still haven't found a bug or a mistake in the results, so I believe they are OK for now.)
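A minimal numpy sketch of this pipeline, with synthetic data standing in for real tf-idf vectors and illustrative sizes and hyperparameters (PCA by SVD, then CD-1 on a small RBM with Gaussian visible units):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # stand-in for tf-idf vectors

# Step 1: PCA -- project onto the top-k principal components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10
Z = Xc @ Vt[:k].T                          # low-dim inputs for the RBM

# Step 2: small RBM with Gaussian visible units, trained with CD-1.
n_hid = 8
W = 0.01 * rng.normal(size=(k, n_hid))
b_v, b_h = np.zeros(k), np.zeros(n_hid)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

for epoch in range(20):
    h_prob = sigmoid(Z @ W + b_h)                   # positive phase
    h_samp = (rng.random(h_prob.shape) < h_prob) * 1.0
    v_recon = h_samp @ W.T + b_v                    # Gaussian visibles: mean
    h_recon = sigmoid(v_recon @ W + b_h)            # negative phase
    lr = 0.01 / len(Z)
    W += lr * (Z.T @ h_prob - v_recon.T @ h_recon)
    b_v += lr * (Z - v_recon).sum(axis=0)
    b_h += lr * (h_prob - h_recon).sum(axis=0)

doc_repr = sigmoid(Z @ W + b_h)            # final document representations
```
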

answered Feb 11 '11 at 11:39

Oliver Mitevski

The best solution of an autoencoder with one layer of linear units is the PCA solution. In other words, PCA directly finds the "best" linear transformation that such an autoencoder can realize. So pretraining + fine-tuning is a suboptimal and less scalable way to find that transformation for a one-layer linear autoencoder, compared with the very scalable implementations of SVD.
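A small numpy demonstration of this point: the rank-k SVD reconstruction (PCA) has squared error no higher than any other rank-k linear encode/decode pair (the random encoder here is illustrative, paired with its own optimal least-squares decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))  # correlated data
Xc = X - X.mean(axis=0)

k = 3
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# PCA reconstruction with k components: the optimal rank-k linear autoencoder.
pca_recon = Xc @ Vt[:k].T @ Vt[:k]
pca_err = np.sum((Xc - pca_recon) ** 2)

# Any other rank-k linear encoder does no better, even with the best
# possible linear decoder fitted for it.
E = rng.normal(size=(8, k))                       # random linear "encoder"
D = np.linalg.lstsq(Xc @ E, Xc, rcond=None)[0]    # optimal decoder for E
rand_err = np.sum((Xc - Xc @ E @ D) ** 2)
```
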

(Apr 21 '11 at 05:05) Oliver Mitevski

Is your code available?

(Aug 05 '12 at 15:44) Joseph Turian ♦♦

Just be sure you compare with a plain old bag-of-words, log-TF * IDF representation. It's surprisingly hard to beat if the learner is reasonably regularized.
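A hedged sketch of that baseline (scikit-learn's sublinear_tf option gives the log-TF weighting; the toy documents, labels, and C value are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment-style data standing in for a real supervised corpus.
docs = ["good movie great plot", "terrible movie bad acting",
        "great acting good film", "bad plot terrible film"]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),   # sublinear_tf: 1 + log(tf)
    LogisticRegression(C=1.0),            # C controls the regularization
)
clf.fit(docs, labels)
preds = clf.predict(docs)
```
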

answered Aug 09 '11 at 22:54

Dave Lewis

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.