I am currently working on a binary text classification task over a collection of documents. The documents are noisy: they contain many word variations such as misspellings, lexemes and orthographic variants, which leads to a highly sparse feature representation. For example, {bcuz, bcause, becoz, bcoz}, {what, wht, wat, wt}, {whr, vr}, {india, indians, indo}, etc. I also perform some linguistic profiling (content as well as genre/style based) of my positive class, primarily using lexicon lookups. Because of the large number of word variants, the profiling isn't accurate.

I was wondering if I could cluster such word variants before feeding them as features to my machine learning framework. I suspect it could help in two ways: 1] reduce the dimensionality of the feature space, 2] help in my linguistic profiling. I had a look at Joseph Turian's t-SNE visualization of neural word embeddings, and the clusters formed there look pretty similar to what I would like to achieve, albeit for noisy text.

Is there any way I could use one of the existing techniques to help with my task? Which one would be the most appropriate? Ideally, I would like these clusters to be formed: {bcuz, bcause, becoz, bcoz}, {what, wht, wat, wt}, {whr, vr}, {india, indians, indo}.

More specifically, I am looking at unsupervised word clustering.

asked Feb 23 '11 at 06:40

Dexter


edited Feb 23 '11 at 07:30

One Answer:

One way of countering the data sparseness is to use an n-gram kernel. The simplest way is to add character n-grams as additional features to your classifier. A more powerful variant is a gappy n-gram kernel, which is implemented in, for example, the OpenFST toolkit (http://openfst.cs.nyu.edu/twiki/bin/view/Kernel/KernelQuickTour). You could also cluster the n-gram representations of the words to find clusters of orthographically similar words.
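As a rough illustration of clustering words by their character n-grams, here is a minimal sketch. The function names, the boundary padding with `^` and `$`, the Dice coefficient and the greedy single-link merging are all assumptions made for the example, not something prescribed above, and the threshold would need tuning on real data.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with boundary markers."""
    padded = "^" + word + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the character n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def cluster(words, threshold=0.5):
    """Greedy single-link clustering: a word joins (and merges) every cluster
    that contains at least one member with similarity >= threshold."""
    clusters = []
    for w in words:
        matches = [c for c in clusters if any(dice(w, m) >= threshold for m in c)]
        merged = {w}
        for c in matches:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    vocab = ["bcuz", "bcoz", "becoz", "what", "wht", "wat", "india", "indians"]
    for group in cluster(vocab):
        print(sorted(group))
```

Note that very short words are a weak spot for this kind of similarity: "whr" and "wht" share half of their bigrams, while "whr" and "vr" share only the final one, so purely orthographic clustering will both over-merge and miss some of the desired groups.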

Another alternative, if the words to be clustered are not necessarily orthographically similar, is to induce distributed word representations. A very simple way of achieving this is to collect statistics of co-occurrences with neighbouring words, using random projections to reduce the dimensionality. With enough text, you should be able to obtain decent representations. You could then either add the feature vectors for each word to the document representation (which in my experience does not work very well), or use clustering software to cluster the induced representations. The problem with this approach is that although you will probably get good representations of some words, you will also add a lot of noise.
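A minimal sketch of the co-occurrence approach with random projections could look like the following. It assumes pre-tokenized sentences and sparse ternary index vectors; the dimensionality, number of non-zero entries, window size and all function names are illustrative choices for this example only.

```python
import math
import random
from collections import defaultdict


def index_vector(word, dim, nonzero, seed):
    """Sparse random index vector for a context word, as (position, sign) pairs."""
    rng = random.Random(f"{seed}:{word}")  # deterministic per word
    positions = rng.sample(range(dim), nonzero)
    return [(pos, rng.choice((-1, 1))) for pos in positions]


def random_indexing(sentences, dim=300, nonzero=10, window=2, seed=0):
    """For each word, sum the index vectors of all neighbours within `window` tokens."""
    vectors = defaultdict(lambda: [0] * dim)
    index = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j == i:
                    continue
                neighbour = tokens[j]
                if neighbour not in index:
                    index[neighbour] = index_vector(neighbour, dim, nonzero, seed)
                for pos, sign in index[neighbour]:
                    vectors[word][pos] += sign
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    corpus = [
        ["i", "left", "bcoz", "it", "rained"],
        ["i", "left", "becoz", "it", "rained"],
        ["wht", "is", "the", "time"],
    ]
    vecs = random_indexing(corpus)
    print(cosine(vecs["bcoz"], vecs["becoz"]))
    print(cosine(vecs["bcoz"], vecs["wht"]))
```

Words that occur in the same contexts end up with similar vectors regardless of spelling, which is what makes this usable for variants such as "whr" and "vr". The resulting vectors can be passed to any standard clustering tool.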

Other variants that you could try, which I am not that practically familiar with, are neural-network based and Brown clustering based word embeddings. Perhaps you could use the pre-trained models (http://metaoptimize.com/projects/wordreprs/) that Joseph makes available directly?

Edit 1: As discussed in another post, topic models such as LDA might also be something to consider. The problem, again, is that even though topics capture spelling variations etc., they also capture a lot of other things that you would not want to collapse. Joseph links to implementations of the various embedding algorithms on his page, as well as to pre-trained models.

Edit 2: If you are mostly concerned with orthographic variations, then perhaps you could use n-gram similarity as a way of filtering out distributionally similar words that should not be considered to belong to the same class.

answered Feb 23 '11 at 08:20

Oscar Täckström

edited Feb 23 '11 at 12:15

Oscar, thanks for the detailed response. I will look into the n-gram kernel or character n-gram representations soon. However, I am currently looking more for a solution that helps my linguistic profiling, hence your induced representations look like a good bet. I would like to try them out. I suspect I can't use the pre-trained models, as they are trained on more formal text.

Since I am more or less an amateur, let me try to understand this step by step. Is there any out-of-the-box library that can help me induce word representations? If not, I don't mind trying it out myself. First, let's assume I have a decent collection of text documents. My first step, as I understand it, would be to collect statistics of co-occurrences with neighboring words. How can I do that? Is it simple n-gram collocation statistics?

(Feb 23 '11 at 09:44) Dexter

Yes, the simplest method I described for word representations would be based on simple n-gram collocations. See http://metaoptimize.com/projects/wordreprs for links to implementations of the various algorithms. The simple co-occurrence model is described as "random indexing" on that page.

(Feb 23 '11 at 10:03) Oscar Täckström

Oscar, thanks. I don't have any training data available. I want a completely unsupervised method. Hence, neural language models may not suit my task. The only library I found useful was Brown clusters, but how do I decide the number of clusters to be formed? I have no prior knowledge whatsoever about that.

(Feb 23 '11 at 11:58) Dexter

I don't think there is a general way of deciding the number of clusters. To keep it simple, I would start by treating the number of clusters as a hyper-parameter and try a small set of different values, such as 100, 500, 1000, 2000. If you have a lot of training data for the supervised classification task, I would assume that you could make use of a larger number of clusters.
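Treating the number of clusters as a hyper-parameter could be sketched roughly as below. The `evaluate` callback is hypothetical: it stands for whatever you do to re-cluster the vocabulary into k clusters, retrain the classifier with cluster features and return a held-out score.

```python
def pick_num_clusters(candidates, evaluate):
    """Score each candidate number of clusters and return the best one with all scores.

    `evaluate(k)` is a user-supplied (hypothetical) function that returns a
    held-out score for k clusters; higher is assumed to be better.
    """
    scores = {k: evaluate(k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Toy stand-in for a real evaluation; replace with your own pipeline.
    def toy_score(k):
        return -abs(k - 500)

    best, scores = pick_num_clusters([100, 500, 1000, 2000], toy_score)
    print(best, scores)
```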

I guess you could in principle make some power-law assumptions to decide the number of clusters relative to the number of types in your data. Note that words like "india" and "what" will probably behave quite differently distributionally ("india" belongs to an open word class, while "what" belongs to a closed one). I am not sure how to take this into account in the clustering (if it is at all necessary to do so).

(Feb 23 '11 at 12:12) Oscar Täckström

Oscar, It looks really difficult for me to decide the number of clusters. :-(

(Feb 23 '11 at 14:55) Dexter

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.