What is the best approach for calculating semantic distance? Examples: relatedness of questions, the distance between two chunks of text, or the distance between the interests of Facebook users. The most interesting aspect is how to use Wikipedia or another very large open collection to solve this problem.
One approach would be to project the documents onto the semantic space of the corpus, for example by doing a latent semantic analysis (or its probabilistic version), and then compute the distances in that space. If you have a multilingual corpus, you can use the cross-lingual variants of latent semantic analysis. Alternatively, some people use multidimensional scaling or self-organizing maps to compute an embedding of the data and then compute distances in that space. Also see this paper for an empirical evaluation of some of the existing approaches.

Which of those methods can be applied to a very large collection like Wikipedia? To my knowledge, SOM and LDA can only work on limited-size data.
(Jul 31 '10 at 13:07)
yura
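As an illustration of the LSA-plus-cosine-distance idea above, here is a minimal sketch with scikit-learn (the library, the toy documents, and the number of components are my own choices, not from the answer):

```python
# Minimal LSA pipeline: TF-IDF -> truncated SVD -> cosine similarities.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock markets fell sharply on monday",
]

tfidf = TfidfVectorizer().fit_transform(docs)            # sparse term-document matrix
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)  # project onto a latent semantic space
print(cosine_similarity(lsa))                            # pairwise similarities in that space
```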
Self-organizing maps may be limited in dimensionality, but LDA is not so much. If you are willing to settle for suboptimal solutions, a collapsed Gibbs sampler (or a collapsed variational algorithm) plus an on-disk database to store the assignments and the documents (keeping only the counts in memory) can give you a very-close-to-optimal solution in very few iterations (around 5 or 10), and each iteration takes time proportional to the number of words times the number of topics. I don't think a dimensionality reduction method can be much faster than that.
(Jul 31 '10 at 13:29)
Alexandre Passos ♦
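A sketch of the topic-model route with scikit-learn (note that its LatentDirichletAllocation uses online variational Bayes rather than the collapsed Gibbs sampler described above; documents and parameters are illustrative):

```python
# Represent documents by their topic proportions and measure distances there.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock markets fell sharply on monday",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)        # per-document topic proportions
print(cosine_distances(theta))           # semantic distances in topic space
```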
I agree with Alexandre above. However, if you want to use latent semantic analysis (LSA), which can be expensive for large data, one way to speed things up is to first use random projections to reduce the data dimensionality to a manageable size (note that this can be much faster than PCA/SVD) and then apply LSA to the resulting data. See this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.6984&rep=rep1&type=pdf
(Jul 31 '10 at 13:37)
spinxl39
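A minimal sketch of this speed-up with scikit-learn (dimensions here are tiny and illustrative; on a Wikipedia-sized corpus the projected and latent dimensions would be in the hundreds):

```python
# Random projection first, then LSA on the much smaller matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock markets fell sharply on monday",
    "investors worried about rising interest rates",
]

tfidf = TfidfVectorizer().fit_transform(docs)              # very high-dimensional in practice
reduced = SparseRandomProjection(n_components=4, random_state=0).fit_transform(tfidf)
lsa = TruncatedSVD(n_components=2).fit_transform(reduced)  # LSA on the projected data
print(lsa.shape)
```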
A few remarks: all previous answers mostly consider (I think) the bag-of-words model. In the BOW model you start with a representation of your document as a vector of word counts. For instance, if your vocabulary is A B C and your document is "A A B A C C A", then your first representation of the document is {A: 4, B: 1, C: 2}, which is the same representation as for this document: "A C A C A B A". Whatever dimensionality reduction method you apply afterwards, you won't be able to overcome this limitation of the model. While the ambiguity is almost unimportant when you just want to know the theme of a document, it becomes problematic if you want to analyze questions, short sentences or even paragraphs, and want to obtain semantics in the logical sense of the term (that is, the implications of the document). The problem is that we don't yet have a simple way to take grammatical constructions (and how they combine the words' meanings) and discourse structure into account. There is some hope with recurrent neural networks, though.

About using an external primary source (Wikipedia in your question): it can be a good thing if the data you want to analyze is in the same domain and at the same language level. I don't think Wikipedia is a good dataset for general question analysis, but neither is the "web as a corpus".
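To make the word-order point concrete, here is a tiny illustration of the bag-of-words counts described above (plain Python, my own example):

```python
# Two documents with different word order collapse to the same bag of words.
from collections import Counter

doc1 = "A A B A C C A".split()
doc2 = "A C A C A B A".split()

print(Counter(doc1))                   # Counter({'A': 4, 'C': 2, 'B': 1})
print(Counter(doc1) == Counter(doc2))  # True: word order (and thus syntax) is lost
```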
You could use semantic hashing and compare any two documents by the Hamming distance between their binary codes.
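The semantic hashing step itself (training the deep model that produces the binary codes) is not shown here, but once every document has a short binary code, the comparison is just a bit count. A minimal sketch in Python with made-up 20-bit codes:

```python
# Hamming distance between two binary document codes.
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions where the two codes differ."""
    return bin(code_a ^ code_b).count("1")

doc_a = 0b10110010101100101011   # hypothetical 20-bit semantic hashes
doc_b = 0b10110110101000101011
print(hamming_distance(doc_a, doc_b))  # 2
```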
Thank you a lot! I was going to recommend that autoencoders tutorial. What do you mean by "how to use it"? Did you train it on the given data (images)? Are you unsure about how to apply it to text data?
(Jul 31 '10 at 10:09)
Alexandre Passos ♦
You should extract the 2000 or 5000 most frequent words and assign them to specific dimensions, then compute word frequency vectors for each document, normalize them, and use them as input to the SdA of the deep learning tutorial, for instance. You can also have a look at the hashing trick to vectorize documents with low memory requirements and reduce the dimensionality at the same time (a sketch of this vectorization step follows below).
(Jul 31 '10 at 10:14)
ogrisel
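A minimal sketch of the vectorization step ogrisel describes, using scikit-learn (the library and the sizes are my choice, not specified in the comment):

```python
# Normalized frequency vectors over the most frequent words, or the hashing
# trick to avoid keeping a vocabulary in memory at all.
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import normalize

docs = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock markets fell sharply on monday",
]

# Vocabulary limited to the N most frequent words (2000-5000 in practice).
counts = CountVectorizer(max_features=5000).fit_transform(docs)
autoencoder_input = normalize(counts)                 # L2-normalized count vectors

# Hashing trick: fixed output dimension, no stored vocabulary.
hashed = HashingVectorizer(n_features=2 ** 12).fit_transform(docs)
print(autoencoder_input.shape, hashed.shape)
```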
Thank you for your replies, Alexandre and ogrisel! I am trying to figure out how to apply the existing code to text. I mean, the desired result is like what is described in the semantic hashing paper: the input is a bunch of text and the output is a 20-bit binary code, so that I can use the Hamming distance to compare similarity. Are there any detailed docs/blogs/tutorials/FAQs/links for this? Thank you a lot!
(Jul 31 '10 at 10:33)
LeeeeCAN
Please read about "bag of words" features for document classification and clustering, then the Vector Space Model article on Wikipedia, the TF-IDF article too, and then the aforementioned hashing trick papers. Then have a look at the following feature extractor source code from the scikits.learn lib, which mixes all of the above. There are some tests that can be helpful for understanding how to use it.
(Aug 01 '10 at 20:26)
ogrisel
To use Hinton's code for hashing, I'm pretty sure you go to the backprop.m and rbmlinear.m files and replace the top-level linear units with logistic ones (e.g. softmax), then in backprop.m add some random Gaussian noise (using MATLAB's normrnd) to the input of those units. I'd welcome correction if I'm wrong... :)
(Aug 09 '11 at 19:53)
E Bennett
It really depends (obviously) on what sort of semantics you want to consider. If your answer is "any", then "any" dimensionality reduction method (except random projections, probably) plus a standard distance function should work, as pointed out in spinxl39's answer. However, things get interesting if you are after more specific semantics; for example, if you want to compare two chunks of text by sentiment, or the interests of two Facebook users by topic, etc. I think this is still mostly an open problem (and one reason it is open is that it seems hard to evaluate a solution).

About random projections: I have seen some papers that show good results on text data using random projections alone; see this paper: www.cis.hut.fi/ella/publications/randproj_kdd.pdf, or using random projections as a preprocessing step to speed up latent semantic indexing; see this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.6984&rep=rep1&type=pdf. However, keep in mind that applying random projection alone (the first paper) is expected to be useful only if the distances in the original high-dimensional dataset are actually meaningful, because random projection is known to preserve distances (a small sketch of this point follows below).
(Jul 30 '10 at 21:21)
spinxl39
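A small sketch of the distance-preservation point (synthetic data, parameter values illustrative):

```python
# Random projections approximately preserve pairwise Euclidean distances,
# so they only help if those original distances were meaningful to begin with.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(20, 10000)                          # 20 "documents" in a huge feature space
X_low = GaussianRandomProjection(n_components=500, random_state=0).fit_transform(X)

orig = euclidean_distances(X)[np.triu_indices(20, k=1)]
proj = euclidean_distances(X_low)[np.triu_indices(20, k=1)]
print((proj / orig).min(), (proj / orig).max())  # ratios stay close to 1
```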
The baseline approach is to take the cosine distance between the TF-IDF vector representations of the documents. This is called the Vector Space Model. More advanced methods usually involve learning a mapping into a lower-dimensional space whose dimensions are closer to semantic concepts than to simple word frequencies. This mapping can be linear, as in Latent Semantic Analysis using an SVD decomposition, or non-linear, using deep neural networks such as stacked autoencoders or deep belief networks (a.k.a. stacked Restricted Boltzmann Machines).
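A minimal sketch of this baseline with scikit-learn (the documents are toy examples of mine):

```python
# Vector Space Model baseline: cosine similarity of TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "how do I measure semantic distance between questions",
    "what is a good way to compute similarity of two texts",
    "stock markets fell sharply on monday",
]

tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)          # sim[i, j] in [0, 1]; distance = 1 - sim[i, j]
print(sim.round(2))
```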
Deep autoencoders (there are some implementations in MATLAB; you probably shouldn't try to implement them yourself, though) are a good technique. They encode a document into a rather low-dimensional space very effectively using a large, specialized neural network. The approaches I know of are based on word frequency, though, so you might want a more sophisticated approach.
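A toy sketch of the idea in PyTorch (my own illustration, not the MATLAB implementations mentioned above; the architecture and sizes are arbitrary): compress TF-IDF vectors through a small bottleneck and compare documents in that code space.

```python
# Tiny document autoencoder: reconstruct TF-IDF vectors through a bottleneck,
# then use the bottleneck activations as low-dimensional document codes.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "a cat and a dog played outside",
        "stock markets fell sharply on monday"]
X = torch.tensor(TfidfVectorizer().fit_transform(docs).toarray(), dtype=torch.float32)

n_features, code_size = X.shape[1], 2
model = nn.Sequential(
    nn.Linear(n_features, 16), nn.ReLU(),
    nn.Linear(16, code_size),            # bottleneck = document code
    nn.Linear(code_size, 16), nn.ReLU(),
    nn.Linear(16, n_features),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(200):                     # train to reconstruct the inputs
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    optimizer.step()

codes = model[:3](X).detach()            # encoder half only (up to the bottleneck)
print(torch.cdist(codes, codes))         # pairwise distances in code space
```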
See also the discussion here: http://stats.stackexchange.com/questions/2476/measuring-document-similarity