I have a collection of documents belonging to two classes. The documents consist of free-form text containing user-generated content from a popular Web 2.0 website. I performed my baseline text classification using the traditional bag-of-words (BOW) model. I am using NLTK (Natural Language ToolKit) for the classification task. I experimented with the NaiveBayes, MaxEnt (which I read is the same as logistic regression) and DecisionTree classifiers, and observed that MaxEnt performs best among the three.
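To make the setup concrete, here is a toy sketch of the bag-of-words featurization (the `documents` list and its labels are made up for illustration); the resulting featuresets are the format that `nltk.NaiveBayesClassifier.train()` and the MaxEnt/DecisionTree trainers expect:

```python
def bow_features(text):
    # Presence features: one boolean per lowercased token.
    return {"contains(%s)" % w: True for w in text.lower().split()}

# Hypothetical two-class toy corpus standing in for the real documents.
documents = [("what r u doing", "informal"),
             ("The meeting is at noon.", "formal")]
featuresets = [(bow_features(text), label) for text, label in documents]
print(featuresets[0][0])  # {'contains(what)': True, 'contains(r)': True, ...}
```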

I would now like to increase the accuracy of my classifier by adding more (and probably more advanced) feature sets. One important feature set consists of lexicon-based features. I perform a lexicon lookup (I use the General Inquirer) and observe the following problems, due to the inherent nature of the underlying noisy text (misspellings, transliterated text, etc.):

The lexicons are not domain-specific. For example, the "region" lexicon in GI doesn't contain some regional word variants (misspellings, transliterated words) that occur in my documents, so the lexicon lookup fails for the aforementioned reasons. A fuzzy string matching approach doesn't yield great accuracy either. All of the above leads to feature sparseness, both for the BOW model and for the advanced feature sets I am trying to add.

Is there a way I could group clusters of words together depending on their distribution in the corpus? For example, could I group "wht", "wat" and "what" into a cluster corresponding to the word "what"? That would help reduce the feature sparsity in the BOW model and also help me form better advanced feature sets. I am unsure about how to go about this.
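A toy sketch of the kind of collapsing I have in mind, mapping noisy tokens onto the closest frequent word by edit distance (the vocabulary and the distance threshold here are made-up illustrations, not a proposed solution):

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def canonicalize(token, frequent_vocab, max_dist=2):
    # Map a noisy token to the closest frequent word within max_dist edits,
    # leaving it alone if nothing is close enough.
    best = min(frequent_vocab, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else token

vocab = ["what", "where", "doing"]
print([canonicalize(t, vocab) for t in ["wht", "wat", "doing", "zzzzz"]])
# -> ['what', 'what', 'doing', 'zzzzz']
```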


asked Jan 24 '11 at 15:34

Dexter
2 Answers:

Have you considered running a spell checker on the content? Hunspell is an open source example. A white list of common abbreviations and misspellings would also help.
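A minimal sketch of the whitelist idea, assuming a small hand-built mapping of common chat abbreviations (the entries below are illustrative, not a recommended list):

```python
# Hypothetical whitelist mapping noisy forms to canonical words.
WHITELIST = {"wht": "what", "wat": "what", "ur": "your", "u": "you", "r": "are"}

def normalize(text):
    # Replace any whitelisted token with its canonical form.
    return " ".join(WHITELIST.get(tok, tok) for tok in text.lower().split())

print(normalize("wht r u doing"))  # -> "what are you doing"
```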

If you're trying to cluster misspelled words together, you probably want to do so in terms of either co-occurrence or context. If the words 'wht' and 'doing' appear in the same sentence 5% of the time and so do 'what' and 'doing', that implies a degree of mutual information. Matching contexts can be a stronger clue: take ngrams around a word and blank out the word: how many of these ngrams overlap? E.g., if you've got the sentences "wht is ur name?" and "what is ur name?", that implies a lower information loss if you replace 'wht' with 'what'. Ideally you'd use the spell checker, context similarity, co-occurrences, and edit distance, and train some sort of model to get word-pair distances.
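The context-matching idea can be sketched like this: blank the word out of its surrounding trigram and count how many contexts two candidates share (the toy corpus is made up for illustration):

```python
from collections import defaultdict

def contexts(corpus):
    # Map each word to the set of (left, right) neighbours it appears between.
    ctx = defaultdict(set)
    for sentence in corpus:
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(toks) - 1):
            ctx[toks[i]].add((toks[i - 1], toks[i + 1]))
    return ctx

corpus = ["wht is ur name", "what is ur name", "what is your age"]
ctx = contexts(corpus)
shared = ctx["wht"] & ctx["what"]
print(shared)  # contexts shared by both spellings, e.g. ('<s>', 'is')
```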

answered Jan 25 '11 at 11:00


Paul Barba

Paul, Thanks !

Can I use a sorted (alphabetical order) term-frequency matrix and apply standard approximate string matching algorithms like edit distance, weighted edit distance, or Jaccard similarity to collapse such variations in the text first?
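Something like this toy sketch of that proposal: sort the vocabulary, then compare each term only against a small window of its neighbours using Jaccard similarity over character bigrams (the vocabulary, window size and threshold are illustrative assumptions):

```python
def char_bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard(a, b):
    ba, bb = char_bigrams(a), char_bigrams(b)
    return len(ba & bb) / len(ba | bb) if (ba | bb) else 0.0

vocab = sorted(["what", "whats", "wht", "where", "doing"])
# Alphabetical sorting keeps many variants near each other, so comparing each
# term with only a few neighbours avoids the full O(n^2) scan.
pairs = [(a, b) for i, a in enumerate(vocab) for b in vocab[i + 1:i + 4]
         if jaccard(a, b) >= 0.25]
print(pairs)  # -> [('what', 'whats'), ('what', 'wht')]
```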

(Jan 25 '11 at 12:08) Dexter

Approximate string matching is probably a very good place to start if you want to prune down to likely misspellings first. It's often necessary but not sufficient that the misspelling 'look' like the real version.

The TF matrix isn't likely to be useful on its own, except as an intermediate step toward a context matrix: I wouldn't expect misspellings to appear in the same document as the correct version (a writer who mistypes a word once has established a tendency to mistype it again). I also wouldn't expect the same TF counts: correct spellings are generally more common. But if you cross the TF matrix with itself, you can find words that co-occur with the same sorts of words. If the content is short, that should be a very useful feature. If it's longer-form content, you may want TF broken down by sentence or paragraph.
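A toy sketch of crossing the term-document matrix with itself (the documents here are made up, and a real corpus would call for sparse matrices and cosine normalization rather than raw dot products):

```python
from collections import Counter

docs = ["wht is ur name", "what is your name", "what is ur name"]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document counts.
tf = {w: [Counter(d.split())[w] for d in docs] for w in vocab}

def cooc_profile(word):
    # Sum the counts of every vocabulary word across the documents
    # containing `word` -- one row of TF crossed with TF.
    return tuple(sum(tf[v][i] for i, c in enumerate(tf[word]) if c)
                 for v in vocab)

def overlap(a, b):
    # Toy similarity: dot product of co-occurrence profiles.
    return sum(x * y for x, y in zip(cooc_profile(a), cooc_profile(b)))

# 'wht' co-occurs with the same sorts of words as 'what' does.
print(overlap("wht", "what") > overlap("wht", "your"))
```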

(Jan 25 '11 at 13:09) Paul Barba

Paul, I would probably want to use approximate string matching to reduce the size of "unique" terms as opposed to what's going on currently. I am currently looking at LingPipe's String Comparison tutorial.

Are you aware of any other (probably better) approximate string matching algorithms? The number of unique terms (I converted all terms to lower case) is fairly large: for a collection of 289 documents I have ~32k unique terms.
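One common trick for keeping pairwise comparison tractable at that scale, sketched here with made-up terms and thresholds, is blocking: group terms by a cheap key (here, first letter plus a coarse length bucket) and only compare within a block, using a stdlib similarity ratio as a stand-in for whichever matcher is eventually chosen:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def blocks(terms):
    buckets = defaultdict(list)
    for t in terms:
        # Coarse blocking key: candidate variants usually share it.
        buckets[(t[0], len(t) // 3)].append(t)
    return buckets

def similar_pairs(terms, threshold=0.8):
    pairs = []
    for bucket in blocks(terms).values():
        for i, a in enumerate(bucket):
            for b in bucket[i + 1:]:
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    pairs.append((a, b))
    return pairs

result = similar_pairs(["what", "wht", "where", "doing", "doin"])
print(result)
```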

(Jan 25 '11 at 16:41) Dexter

This idea may be obvious, but what about basic summaries of the documents, like: mean and standard deviation of words per sentence, total word count, mean and standard deviation of word length (with large enough documents, the proportion of words of various lengths), the ratio of "interesting" words (words other than "the", "of", etc.) to all words, and so on?
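A toy sketch of such summary features (the naive sentence splitting and the tiny stopword set are illustrative simplifications):

```python
import statistics

STOPWORDS = {"the", "of", "a", "is", "to"}  # illustrative, not exhaustive

def style_features(text):
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.replace(".", " ").split()
    lengths = [len(w) for w in words]
    return {
        "total_words": len(words),
        "mean_words_per_sentence": len(words) / len(sentences),
        "mean_word_length": statistics.mean(lengths),
        "stdev_word_length": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "interesting_ratio": sum(w.lower() not in STOPWORDS for w in words) / len(words),
    }

print(style_features("The cat sat. The dog barked loudly."))
```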

answered Jan 25 '11 at 16:26


Will Dwinnell


Such features tell us more about the style than the content of the documents. Though that information is important for my task, I don't have a problem processing those terms. I would like to extract content-bearing features, like a lexicon lookup, which is where direct lookup falters. Hence the need for an approximate string matching algorithm.

(Jan 25 '11 at 16:45) Dexter

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.