Revision history

Revision 1 (May 20 '10 at 16:43, Joseph Turian)
LSA / LSI (Latent semantic indexing) is a classic linear technique for inducing low-dimensional dense document representations. The gensim package in Python uses incremental SVD. Gensim also implements LDA and random indexing for docreprs.

One can also apply deep learning to induce document representations. See for example:

  • "Semantic Hashing", Salakhutdinov (2007), wherein one can do constant-time semantic (not sure keyword-based) information retrieval by learning a 32-bit code for each document. Purely unsupervised.
  • Ranzato + Szummer (2008), wherein they compare dense and sparse hidden layers for a deep architecture. Unsupervised, followed by supervised fine-tuning for doccat.

The difficulty with these deep approaches is that each training update scales as the size of the vocabulary, not as the number of non-zeros in the document bag-of-words representation. Ugh. So while there might only be 50 different word types in the document, if your vocabulary is 100K you have to do a matrix operation in 100K. The authors above only go up to a vocab size of 10K or 20K, if I recall.

Finally, I have some not-yet-tested ideas for deep techniques for inducing document representations. Please post a followup comment if you want to learn more about these ideas.

Revision 2 (May 20 '10 at 16:47, Joseph Turian)

LSA / LSI (Latent semantic indexing) is a classic linear technique for inducing low-dimensional dense document representations. The gensim package in Python implements LSA using incremental SVD, so that it does not need to store the unsupervised document corpus in RAM. Gensim also implements LDA and random indexing for inducing docreprs. According to their workshop paper (Rehurek and Sojka, 2010), they use Gensim to induce LSA and LDA models over 270 million word tokens, with a vocabulary size of 300K word types.
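
For concreteness, here is a minimal sketch of that gensim workflow. It is written against the current gensim API (which may differ in detail from the 2010 release), and the toy corpus and topic count are invented for illustration:

    # Minimal sketch: streaming LSA / LDA docreprs with gensim (illustrative only;
    # the three toy documents and num_topics=2 are made up).
    from gensim import corpora, models

    docs = [
        ["human", "computer", "interaction"],
        ["graph", "minors", "survey"],
        ["graph", "trees", "computer"],
    ]  # in practice this would be a generator streaming tokenized docs from disk

    dictionary = corpora.Dictionary(docs)                    # word type -> integer id
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]   # sparse bag-of-words

    tfidf = models.TfidfModel(bow_corpus)                    # optional reweighting
    lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)

    # incremental SVD: fold in more documents later without revisiting old ones
    lsi.add_documents(tfidf[bow_corpus])

    # dense low-dimensional docrepr for an unseen document
    new_doc = dictionary.doc2bow(["computer", "survey"])
    print(lsi[tfidf[new_doc]])

    # LDA docreprs work the same way
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2)
    print(lda[new_doc])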

One can also apply deep learning to induce document representations. See for example:

  • "Semantic Hashing", Salakhutdinov (2007), wherein one can do constant-time semantic (not sure keyword-based) information retrieval by learning a 32-bit code for each document. Purely unsupervised.
  • Ranzato + Szummer (2008), wherein they compare dense and sparse hidden layers for a deep architecture. Unsupervised, followed by supervised fine-tuning for doccat.
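
As a rough illustration of the semantic-hashing idea promised above (this is not Salakhutdinov and Hinton's actual model, which stacks RBMs and fine-tunes them), here is a toy single-layer autoencoder that learns a 32-bit code from bag-of-words vectors. The data, layer sizes, learning rate, and epoch count are all invented for the example:

    # Toy semantic-hashing-style autoencoder (NOT the stacked-RBM model from the
    # paper): learn a 32-unit sigmoid code from fake bag-of-words vectors by
    # gradient descent on squared reconstruction error, then threshold to bits.
    import numpy as np

    rng = np.random.default_rng(0)

    V, CODE_BITS, N = 2000, 32, 500                       # made-up sizes
    X = (rng.random((N, V)) < 0.01).astype(float)         # fake sparse bag-of-words
    X /= np.maximum(X.sum(axis=1, keepdims=True), 1.0)    # term-frequency normalization

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    W_enc = rng.normal(0.0, 0.01, (V, CODE_BITS))         # encoder: V -> 32
    W_dec = rng.normal(0.0, 0.01, (CODE_BITS, V))         # decoder: 32 -> V

    lr = 0.5
    for epoch in range(100):
        code = sigmoid(X @ W_enc)       # soft code in (0, 1)
        recon = code @ W_dec            # linear reconstruction of the bag-of-words
        err = recon - X                 # gradient of 0.5 * squared error w.r.t. recon
        # these gradients touch every one of the V vocabulary dimensions -- the
        # per-update cost complained about in the next paragraph
        grad_dec = (code.T @ err) / N
        grad_enc = (X.T @ ((err @ W_dec.T) * code * (1.0 - code))) / N
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc

    bits = (sigmoid(X @ W_enc) > 0.5).astype(np.uint8)    # one 32-bit code per document
    print(bits.shape)                                     # (500, 32)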

The difficulty with these deep approaches is that each training update scales with the size of the vocabulary, not with the number of non-zeros in the document's bag-of-words representation. Ugh. So even if a document contains only 50 distinct word types, a 100K vocabulary means each update involves a matrix operation over all 100K dimensions. The authors above only go up to a vocab size of 10K or 20K, if I recall.
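
A back-of-the-envelope comparison makes the gap concrete (the hidden-layer size here is hypothetical):

    # Rough per-document cost of the reconstruction layer, assuming a hidden/code
    # layer of size H (all numbers are hypothetical).
    V, H, nnz = 100_000, 128, 50

    dense_cost = V * H       # a dense decoder scores every word type in the vocabulary
    sparse_cost = nnz * H    # an update that only touched the words present in the doc

    print(dense_cost // sparse_cost)   # -> 2000x more work than the sparsity warrants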

Finally, I have some not-yet-tested ideas for deep techniques for inducing document representations. Please post a followup comment if you want to learn more about these ideas.
