I have a collection of documents belonging to two classes. The documents consist of free-form text containing user-generated content from a popular Web 2.0 website. I performed my baseline text classification task using the traditional Bag-of-Words (BOW) model. I am using NLTK (Natural Language ToolKit) for the classification task. I experimented with the NaiveBayes, MaxEnt (which I read is the same as logistic regression) and DecisionTree classifiers, and I observe that MaxEnt performs the best of the three.
I would now like to increase the accuracy of my classifier by adding more (and probably more advanced) feature sets. One important feature set consists of lexicon-based features. I perform a lexicon lookup (I use General Inquirer) and observe the following problems due to the inherently noisy underlying text (misspellings, transliterated text, etc.):
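For concreteness, my current setup looks roughly like the sketch below (the toy documents and the train/test split are placeholders for my real data):

```python
import nltk

# Toy placeholder corpus: (text, label) pairs standing in for the real documents.
docs = [("wht r u doing 2day", "class_a"),
        ("the committee approved the new policy", "class_b"),
        ("lol that movie was gr8", "class_a"),
        ("quarterly results exceeded expectations", "class_b")]

def bow_features(text):
    # Bag-of-words: every token becomes a boolean "contains(word)" feature.
    return {"contains(%s)" % w.lower(): True for w in nltk.word_tokenize(text)}

featuresets = [(bow_features(text), label) for text, label in docs]
train_set, test_set = featuresets[:3], featuresets[3:]  # toy split, for illustration only

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```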
The lexicons are not domain-specific. For example, the "region" lexicon in GI doesn't contain some regional words (misspellings, transliterated words) that occur in my documents. The lexicon lookup is therefore incorrect for similar reasons, and a fuzzy string matching approach doesn't yield great accuracy either. All of the above leads to feature sparseness, both for the BOW model and for the advanced feature sets I am trying to add.
Is there a way I could group clusters of words together depending on their distribution in the corpus? For example, could I group "wht", "wat" and "what" into a cluster corresponding to the word "what"? That would help reduce the feature sparsity in the BOW model and also help me form a better advanced feature set. I am unsure how to go about this.
asked Jan 24 '11 at 15:34
Have you considered running a spell checker on the content? Hunspell is an open source example. A white list of common abbreviations and misspellings would also help.
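A rough sketch of that normalization step, assuming the pyhunspell bindings and the en_US dictionaries are installed (the dictionary paths and whitelist entries below are just illustrative):

```python
import hunspell  # pyhunspell bindings; dictionary paths below are assumptions

# Hand-built whitelist of common abbreviations/misspellings (illustrative entries).
WHITELIST = {"wht": "what", "wat": "what", "ur": "your", "u": "you", "2day": "today"}

hs = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                       "/usr/share/hunspell/en_US.aff")

def normalize(token):
    token = token.lower()
    if token in WHITELIST:
        return WHITELIST[token]
    if not hs.spell(token):
        suggestions = hs.suggest(token)
        if suggestions:
            return suggestions[0].lower()  # fall back to the top spell-checker suggestion
    return token

print([normalize(t) for t in "wht r u doing 2day".split()])
```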
If you're trying to cluster misspelled words together, you probably want to do so in terms of co-occurrence or context. If the words 'wht' and 'doing' appear in the same sentence 5% of the time, and so do 'what' and 'doing', that implies a degree of mutual information. Matching contexts can be a stronger clue: take n-grams around a word and blank out the word itself, then ask how many of those n-grams overlap. For example, if you've got the sentences "wht is ur name?" and "what is ur name?", that implies a low information loss if you replace 'wht' with 'what'. Ideally you'd combine the spell checker, context similarity, co-occurrence and edit distance, and train some sort of model to get word-pair distances.
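A rough sketch of the context-overlap plus edit-distance scoring described above (the window size and the tiny corpus are illustrative assumptions):

```python
from collections import defaultdict
import nltk

def context_sets(sentences, window=2):
    # For each token, collect its surrounding windows with the token itself blanked out.
    contexts = defaultdict(set)
    for sent in sentences:
        tokens = [t.lower() for t in nltk.word_tokenize(sent)]
        for i, tok in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts[tok].add(tuple(left + ["_"] + right))
    return contexts

def pair_score(w1, w2, contexts):
    c1, c2 = contexts[w1], contexts[w2]
    overlap = len(c1 & c2) / float(len(c1 | c2) or 1)  # Jaccard overlap of contexts
    distance = nltk.edit_distance(w1, w2)              # orthographic distance
    return overlap / (1.0 + distance)                  # crude combination of both signals

sentences = ["wht is ur name ?", "what is ur name ?", "what are you doing ?"]
ctx = context_sets(sentences)
print(pair_score("wht", "what", ctx))  # higher score -> better merge candidate
```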
answered Jan 25 '11 at 11:00
This idea may be obvious, but what about basic summaries of the documents, like: mean and standard deviation of words per sentence, total word count, mean and standard deviation of word length (with large enough documents: the proportion of words of various lengths), the ratio of "interesting" words (words other than "the", "of", etc.) to all words, and so on?
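Something along these lines, using NLTK's tokenizers and its English stopword list as a stand-in for "uninteresting" words (requires the punkt and stopwords data packages):

```python
import math
import nltk
from nltk.corpus import stopwords  # needs the 'stopwords' and 'punkt' data downloaded

def summary_features(text):
    sents = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
    stops = set(stopwords.words("english"))
    sent_lens = [len(nltk.word_tokenize(s)) for s in sents]
    word_lens = [len(w) for w in words]

    def mean(xs):
        return sum(xs) / float(len(xs)) if xs else 0.0

    def std(xs):
        m = mean(xs)
        return math.sqrt(mean([(x - m) ** 2 for x in xs]))

    return {
        "total_words": len(words),
        "mean_sent_len": mean(sent_lens),
        "std_sent_len": std(sent_lens),
        "mean_word_len": mean(word_lens),
        "std_word_len": std(word_lens),
        # share of words that are not stopwords ("interesting" words)
        "interesting_ratio": mean([w.lower() not in stops for w in words]),
    }

print(summary_features("What are you doing today? The weather is nice."))
```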
answered Jan 25 '11 at 16:26