LSA / LSI (latent semantic analysis / indexing) is a classic linear technique for inducing low-dimensional dense document representations. The gensim package in Python implements it with incremental SVD; gensim also implements LDA and random indexing for document representations.
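A minimal LSA sketch in plain NumPy (gensim's LsiModel does the same thing incrementally and at scale): take the truncated SVD of a document-term count matrix and keep the top-k components as dense document vectors. The toy counts below are made up for illustration.

```python
import numpy as np

# Toy document-term matrix: rows = documents, columns = terms.
# Docs 0-1 use one pair of terms, docs 2-3 another.
X = np.array([
    [3., 1., 0., 0.],
    [1., 3., 0., 0.],
    [0., 0., 1., 2.],
    [0., 0., 2., 1.],
])

k = 2                             # target dimensionality
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vecs = U[:, :k] * s[:k]       # k-dim dense document representations

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Docs with shared vocabulary end up close in the latent space,
# docs with disjoint vocabulary end up far apart.
print(cos(doc_vecs[0], doc_vecs[1]))   # high
print(cos(doc_vecs[0], doc_vecs[2]))   # near zero
```

In gensim the equivalent is `models.LsiModel(corpus, id2word=dictionary, num_topics=k)`, which never materializes the full matrix.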
One can also apply deep learning to induce document representations. See for example:
- "Semantic Hashing", Salakhutdinov & Hinton (2007), wherein one does constant-time semantic (not keyword-based) information retrieval by learning a 32-bit code for each document. Purely unsupervised.
- Ranzato + Szummer (2008), wherein they compare dense and sparse hidden layers in a deep architecture. Unsupervised pretraining, followed by supervised fine-tuning for document categorization.
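To make the semantic hashing idea concrete, here is a sketch of the retrieval side: each document is mapped to a 32-bit code, documents are bucketed by code, and lookup is a constant-time hash-table probe. The paper learns the codes with a deep autoencoder; as a stand-in for that learned encoder, this sketch uses a thresholded random projection (random-hyperplane hashing), so the buckets here are not semantically meaningful the way trained codes would be.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, BITS = 1000, 32
# Stand-in for the trained encoder: 32 random hyperplanes.
planes = rng.standard_normal((BITS, VOCAB))

def code(bow):
    """Map a dense bag-of-words vector to a 32-bit integer code."""
    bits = (planes @ bow) > 0
    return int(bits.astype(np.uint64) @ (1 << np.arange(BITS, dtype=np.uint64)))

# Index a toy corpus: bucket documents by their code.
docs = rng.poisson(0.01, size=(500, VOCAB)).astype(float)
index = {}
for i, d in enumerate(docs):
    index.setdefault(code(d), []).append(i)

# Constant-time retrieval: all documents sharing the query's code.
query = docs[7]
print(index[code(query)])   # contains 7
```

In practice one would also probe codes within a small Hamming distance of the query code to widen the candidate set.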
The difficulty with these deep approaches is that each training update scales with the size of the vocabulary, not with the number of non-zeros in the document's bag-of-words representation. Ugh. So while a document might contain only 50 distinct word types, if your vocabulary is 100K you still have to do a matrix operation over all 100K dimensions. The authors above only go up to a vocabulary size of 10K or 20K, if I recall.
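The asymmetry is easy to see in code. Below is a sketch with made-up sizes (100K vocabulary, 50 word types in the document, 32 hidden units): the encoding pass can exploit sparsity and touch only the 50 nonzero rows of the weight matrix, but reconstructing the input with a softmax over the vocabulary touches every row, so the update cost is vocabulary-sized regardless of how sparse the document is.

```python
import numpy as np

V, H = 100_000, 32                      # vocab size, hidden units
rng = np.random.default_rng(0)
W = rng.standard_normal((V, H)) * 0.01  # input-to-hidden weights

# Sparse bag-of-words: 50 distinct word types with small counts.
nz_idx = rng.choice(V, size=50, replace=False)
nz_cnt = rng.integers(1, 5, size=50).astype(float)

# Encoding exploits sparsity: only 50 of the 100K rows of W are touched.
h = nz_cnt @ W[nz_idx]                  # cost ~ 50 * H

# Reconstruction (softmax over the vocabulary) cannot: every row of W
# participates, so this step is ~ V * H no matter how sparse the input.
logits = W @ h                          # cost ~ V * H
p = np.exp(logits - logits.max())
p /= p.sum()

print(h.shape, p.shape)  # (32,) (100000,)
```

The gradient of the reconstruction term is dense for the same reason, which is why training cost scales with V rather than with the document's nonzero count.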
Finally, I have some not-yet-tested ideas for deep techniques for inducing document representations. Please post a followup comment if you want to learn more about these ideas.