1
1

What would be a good kernel for clustering text documents? Does it make sense to use the gaussian kernel, which works very well for digits? I should think there are some specific, like these string kernels This seems like a general case, i.e. for any kind of strings. Is there any work done to adapt these for natural language, for instance adapting these for tf-idf feature vectors?

asked Nov 12 '10 at 08:31

Oliver%20Mitevski's gravatar image

Oliver Mitevski
753172640


3 Answers:

Gives two text document d1 and d2, one possibility is to use: K(d1,d2) = d1' * P' * P* d2 (where ' means transpose and * means dot product). P is the proximity matrix of terms in your dictionary. P_{ij} > 0 if term i is related to term j (for example, if they are synonyms). The proximity matrix can be known a priori (e.g., from a source like Word-net), or could be learned from the corpus. You can even nonlinearize the above mapping by using a polynomial or gaussian kernel on top of it, e.g. (K(d1,d2) + 1)^m, where m is the degree of the polynomial kernel). In fact, latent semantic indexing (LSI) can be seen as using a special form of the P matrix.

LSI itself can be seen as learning a kernel: the projections you obtain can be seen as a semantic representation of each document and you can compute the similarities in the semantic subspace by simply using a linear kernel.

Also see the paper Latent Semantic Kernels for more details.

answered Nov 12 '10 at 12:56

spinxl39's gravatar image

spinxl39
3458104368

In my experience simply using any common wordnet similarity as a kernel makes things slightly worse for classification and clustering (higher entropy at fixed K, for k-means, for example). To get this to work one needs to do serious disambiguation, which is generally harder than learning a good enough kernel with PCA or LSI, as you mentioned.

(Nov 12 '10 at 12:59) Alexandre Passos ♦

String kernels do not work with tf-idf feature vectors. They compute, given two strings, a dot product in a very high dimensional space where each sequence of characters possible is a basis and the weight of each sequence in each string is proportional to its length and its gaps (noncontiguous sequences have far less weight, but are computed).

I can't say, however, with which kernels are better for clustering text documents.

answered Nov 12 '10 at 10:09

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

It depends on the nature of the documents. If they are very short documents, then you might get improvements by taking term proximity into account. String kernels can also be useful, since they will handle morphological variation. On the other hand they might also introduce spurious relations compared to using good morphological normalization. When you have sufficiently many documents and they are not too short, then I believe a standard word-based representation would work just as well or better than using more complex kernels.

answered Nov 17 '10 at 04:49

Oscar%20T%C3%A4ckstr%C3%B6m's gravatar image

Oscar Täckström
1459102743

Given the suggestions, what I would do is construct a term proximity matrix from some good semantic representation (for example random indexing) for the terms and then use it as a kernel. It should work much better than LSI (which is linear) if the proximity matrix is good, shouldn't it? And using polynomial kernel on top of it should improve things further, but it's not always the case, is it? Do you think more complex kernels are used in this context in industry or academia still? I got the impression from the previous paper (Latent Semantic Kernels) that the hype about these kernels is kind of over, since the paper is from 2003 and research in this direction has decreased since. That's just my impression, maybe I am wrong.

(Nov 17 '10 at 06:59) Oliver Mitevski

@Oliver: I don't understand why random indexing should be better than LSI, specially since both are linear and LSI optimizes reconstruction error, while random indexing doesn't optimize anything. Please clarify.

(Nov 17 '10 at 07:00) Alexandre Passos ♦

Random indexing only adds noise (albeit a small amount when using many dimensions in the projection matrix), while the full-rank SVD keeps all information. I would rather suggest extracting a sparse matrix of high precision term similarities. Then again, if you want to do clustering, you already capture a lot of this information in the cluster centroids. k-Means for example is highly related to PCA.

(Nov 17 '10 at 07:09) Oscar Täckström

@Alexandre: Basically I would like to learn the term proximity matrix by using terms-by-context matrix as in random indexing (Kanerva et al., 2000), instead of terms-by-documents matrix as in LSI.

(Nov 17 '10 at 07:57) Oliver Mitevski

@Oscar: Sparse matrix of high precision terms I intend to get by thresholding my term proximity matrix, so I get only the significant term similarities, and discard any far-fetched semantic similarities this way. Then again this could only be as good as constructing the proximity matrix from wordnet, except that it's corpus based, which could be advantageous.

(Nov 17 '10 at 08:18) Oliver Mitevski
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.