|
Hi, I am performing linguistic pre-processing on a collection of text documents. I pass my text documents through the following pipeline (I use NLTK for the same) : Sentence Splitter, Part-Of-Speech Tagger and filter out the Proper Nouns. I am left with a set of tokens and pass each of them through a stemmer/lemmatizer. I experiment with Porter/Lancaster Stemmer and the WordNet lemmatizer. I see no effect of the stemmer or lemmatizer on any of the tokens viz. no difference in the output before and after the processing. These tokens have a lot of syntactic variations(both spelling mistakes and morphological variations). I would like to collapse these variations. For example, I would like to collapse {country, country&, countries...., country!,country/,country...now } to one class country. Another example would be, {lol....., lolz, .lol, .....................lol } to one class lol. I saw a very interesting thread which mentioned how LDA can capture such variations (Aditya Mukherji's answer and Alexandre Passos' comment). I would like to probe more on the same. I am also open to other methods. Currently, I am trying standard string matching approximate algorithms like edit-distance, Jaro-Winkler, Jaccard Distance but all these algorithms return a numerical quantity making it difficult to "scientifically" decide the closest matches. PS: A spell checker may not work as tokens also consist of transliterated text i.e. text in other native languages written in English.
showing 5 of 6
show all
|
Have you tried using a distance-based clustering algorithm (preferrably one that can infer the number of clusters) using edit distance?
Alexandre, Not as yet. Are there any quick out-of-the-box implementations of the same for me to try?
A poor man's version of clustering with an edit distance metric is to do clustering with a character n-gram feature map. Of course this is not as powerful as edit distance, but it's way faster, O(n) compared to O(n^2) per comparison, and extremely simple to implement. I used a similar approach to cluster questions posed to a QA system and it worked remarkably well. I used the hierarchical clustering in http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm, which should scale up to at least a couple of thousand points in reasonable time. Hierarchical clustering is great if you want to inspect the results manually, but if you want to extract equivalence classes as your post indicates, then you will have to tune a threshold parameter to select the number of clusters.
What are you going to use the inferred classes for?
Oscar, Thanks. You are spot on that I would like to extract equivalence classes.
I am training a machine learning classifier for a binary text classification task. I use the BOW model and then add a few linguistic feature sets to improve my accuracy. There are a few lexicon look ups which I perform but due to the syntactic variation (as I mentioned earlier stemmer/lemmatizer doesn't work) it leads to errors in lookup. Instead of performing lookups on word tokens I would like to perform lookup on equivalence classes. Makes sense?
Any tips/insights on tuning the threshold parameter to select the number of clusters?
If the lexicon is fixed, I think using clustering might be overkill. Even though there are heuristics for choosing the number of clusters, it would probably be easier to just set a threshold for edit-distance by manually inspecting the matches/misses you get on some data sample.
However, if you want to use hierarchical clustering, you could map all points dominated by a lexicon entry to the class of the lexicon entry (where x dominates y if y belongs to a subtree of x).
Oscar, There are ridiculously large syntactic variations of most words due to which the feature dimensions to the machine learning classifier are increasing exponentially leading to more confusion. Though hierarchical clustering may be an overkill I may have to do it.
Lets say I have a set of Strings S ={S1,...Sm} and set of word tokens W ={W1,...Wn} each String S1,...,Sm will be compared to each token in W giving me 'n' edit distances for each string in S. What underlying data would the hierarchical clustering work on? All m * n points or individually on a set of Sx*n (where x= 1...m) points ?