|
I'm not sure if this is the best place to ask this question, but here goes: I'm in a position where I need to do some text mining without being an expert. One of the things we wanted to do was simply discover some groups or clusters in our text and see if anything interesting would come out, or at least give us a place to start. Latent Dirichlet Allocation came very highly recommended, so I downloaded a tool called Mallet that implements it and started playing around. The results have not been good so far: we can't find a solid relationship among the words within each group. There are a couple of particulars of our situation that I think might be causing us problems: 1) The text is very domain-specific, so there is a lot of jargon and a lot of abbreviations. 2) Many of the people who compose the text are not native speakers of English, which affects not only spelling but also word choice and more. I have some ideas on how to improve the results (e.g. determining appropriate stop words for the domain), but I was wondering if anyone can give me some good "beginner" pointers. Thanks |
|
I am not an expert in LDA but have tried using it a lot to play with product reviews and study different aspects of them.
I mostly just used the Mallet package to do the actual sampling, so my answer might refer to things that package supports out of the box. 2) Hyper-parameter optimization might also yield better topics (Mallet has a flag which lets you turn this on). 3) There are tons of really cool LDA variants built by various people in academia, some of which might fit your needs better than the basic model. One of them is Supervised LDA, which lets you input a set of document-variable tuples, where the variable might be something like a product rating. The model then returns a topic model such that a regression on a document's topic distribution estimates the rating well. |
|
A lot of times you can flexibly model stop words by having a hierarchical emission model, meaning there is a "background" topic which is always active across documents with some small, usually constant, probability. The Wallach et al. paper mentioned in Alexandre's answer also addresses how to have "flexible" stop words, but I've found hierarchical emission more satisfying. All that being said, if you're not happy with the topics that come out of LDA, you should be able to seed topics with a handful of words. I'm not sure if Mallet allows for this, but I'll be adding LDA to my UMass NLP package and will definitely have this facility. |
|
In my experience applying LDA, usually some topics will be interpretable and some will be noise. Depending on the dataset this varies, but I've seen up to 35% of the topics being noisy. What specific issue are you having? Are the topics too broad, grouping together things that do not make sense? If so, you can try to use more topics. Or are they too narrow, with most words occurring in just a couple of (or a dozen) documents? If so, you can either try fewer topics or use stemming/lemmatizing, canonicalize your abbreviations, and spellcheck your corpus, so that incorrect terms don't compete with their correct representations. About stopwords: at the last NIPS there was a paper showing that if you use an asymmetric prior for the topic distributions in LDA, you can get rid of stopwords and find better topics than with the usual symmetric distribution, although I couldn't find code for that.
I would say that the biggest problems are 1) words appearing in the top-list of multiple topics, and 2) abbreviations causing multiple representations of the same word to appear in the top-list for a topic. The answer to 1) sounds like removing stop words, and I have removed a number of words for that reason, but sometimes words that seem really relevant appear in multiple topics. I'd have to look at my results again (I'm not in a place where I can do that) to be sure that they are not two different senses of the same word. The answer to 2) is probably some sort of abbreviation or acronym expansion, but of course this is a non-trivial problem and I am not by any means well-versed in NLP. An abbreviation like "AM", for example, is very difficult to expand appropriately. Any thoughts?
(Jul 04 '10 at 00:27)
Troy Raeder
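For problem 1), one quick check is to list which words show up in the top lists of more than one topic, so they can be reviewed by hand as stop word candidates or as genuinely ambiguous terms. The following is only a minimal sketch: `shared_top_words` is a hypothetical helper, and the topic lists are made-up stand-ins for whatever top-word output your LDA tool produces.

```python
from collections import Counter

def shared_top_words(topic_top_words, min_topics=2):
    """Map each word to the number of topics whose top list contains it,
    keeping only words that appear in at least `min_topics` lists."""
    counts = Counter()
    for words in topic_top_words:
        counts.update(set(words))  # count each word once per topic
    return {w: c for w, c in counts.items() if c >= min_topics}

# Made-up top-word lists, one per topic (illustration only).
topics = [
    ["battery", "charge", "product", "hours"],
    ["screen", "crack", "product", "replace"],
    ["refund", "order", "product", "charge"],
]
shared = shared_top_words(topics)
print(shared)
```

A word that appears in nearly every topic is a likely domain stop word; one that appears in only two or three may be a real multi-sense term and is worth inspecting before removal.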
If abbreviations are causing the "same" word to appear multiple times, why not just remove abbreviations? If it's just exploratory analysis, maybe that will be enough to tell you if LDA is going to work or not.
(Jul 08 '10 at 18:08)
aditi
That's worth trying, although removing abbreviations is a non-trivial problem. People rarely use proper capitalization or punctuation with them, so they could be difficult to identify. I guess maybe a spell-check would be a place to start? Short words that are not in an English dictionary are probably abbreviations. Does that make sense?
(Jul 09 '10 at 08:59)
Troy Raeder
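The "short words not in a dictionary" heuristic could be sketched roughly as follows. This is an assumption-laden illustration: `abbreviation_candidates` is a hypothetical helper, the vocabulary is a toy set standing in for a real English word list, and the length cutoff of 4 is arbitrary.

```python
import re

def abbreviation_candidates(texts, vocabulary, max_len=4):
    """Collect short alphabetic tokens that are absent from `vocabulary`.
    Case is ignored, since capitalization is unreliable in this kind of text."""
    candidates = set()
    for text in texts:
        for token in re.findall(r"[A-Za-z]+", text):
            word = token.lower()
            if len(word) <= max_len and word not in vocabulary:
                candidates.add(word)
    return candidates

# Toy vocabulary; in practice this would be loaded from a real word list.
vocabulary = {"the", "unit", "was", "sent", "back", "to", "for", "a", "new", "part"}
texts = [
    "The unit was sent back to mfr for a new pcb",
    "RMA sent to the mfr",
]
cands = abbreviation_candidates(texts, vocabulary)
print(sorted(cands))
```

The output will also include short misspellings and jargon, so it is better treated as a list to review by hand than as something to delete automatically.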
With a few regexes or a substitution table you can go very far, at least for the most common abbreviations.
(Jul 09 '10 at 09:01)
Alexandre Passos ♦
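A substitution table along these lines might look like the sketch below. The entries in `ABBREVIATIONS` are invented for illustration; a real table would be built from the abbreviations actually found in the corpus.

```python
import re

# Hypothetical substitution table (illustration only).
ABBREVIATIONS = {
    "mfr": "manufacturer",
    "pcb": "circuit board",
    "acct": "account",
}

# One case-insensitive pattern matching any table key as a whole word.
# Longer keys come first so they win over shorter keys sharing a prefix.
_keys = sorted(ABBREVIATIONS, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(map(re.escape, _keys)) + r")\b", re.IGNORECASE)

def expand_abbreviations(text):
    """Replace every known abbreviation in `text` with its expansion."""
    return _pattern.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

expanded = expand_abbreviations("Sent the PCB back to the mfr per acct notes")
print(expanded)
```

Ambiguous abbreviations such as "AM" do not fit this approach, because a table can hold only one expansion per key; those are probably best left out of the table.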
Are you using unigrams as your word feature vector? If so, have you tried including bigrams (or trigrams) in the feature vector?
(Sep 19 '10 at 00:20)
tommy chheng
If your corpus is big enough bigrams should make for clearer topics, but if it isn't you will end up with a lot more noise (and your model will be a lot slower).
(Sep 19 '10 at 07:43)
Alexandre Passos ♦
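One way to limit the extra noise is to add only bigrams that recur across the corpus. The sketch below assumes documents are already tokenized; `add_frequent_bigrams` and the `min_count` threshold are hypothetical choices for illustration, not something Mallet provides under that name.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append bigram features (joined with '_') to each tokenized document,
    keeping only bigrams seen at least `min_count` times in the whole corpus."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    result = []
    for tokens in docs:
        bigrams = [
            f"{a}_{b}"
            for a, b in zip(tokens, tokens[1:])
            if counts[(a, b)] >= min_count
        ]
        result.append(list(tokens) + bigrams)
    return result

# Toy tokenized corpus (illustration only).
docs = [
    ["power", "supply", "failed"],
    ["replace", "power", "supply"],
    ["supply", "order", "late"],
]
features = add_frequent_bigrams(docs)
print(features)
```

Raising `min_count` trades recall for a smaller, cleaner vocabulary, which matters more on a small corpus.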
Abbreviations are harder than they look. For instance, in a medical setting the abbreviation AIDS isn't something that you can get rid of, but keeping it in a system that does conventional stemming is a total disaster. I have generally had to build special-purpose statistical recognizers that detect abbreviations based on very strange statistical behavior.
(Sep 24 '10 at 02:18)
Ted Dunning
Could you elaborate at all on what "very strange statistical behavior" means?
(Jan 08 '11 at 12:58)
Troy Raeder
seconded; please share your preprocessing gems, Ted :-)
(Feb 08 '11 at 13:01)
Radim
|
Can I ask what exactly you want to do with the text analysis? Whether or not LDA is appropriate depends a lot on what you're actually trying to do. We can much better answer your questions with some more specifics on what you want.
We have a database of textual correspondence from customers. It could be questions, comments, complaints or whatever, but it is all in some form or another about our products.
The customer service reps have a series of categories that they tag these bits of correspondence with. Ideally, these could be used for metrics but they currently aren't because the human application of them is highly inconsistent and spotty.
We would like to derive an automatic set of categories/tags that would be consistent and useful and could be automatically applied. Once we have this, then they can be used for metrics to see what kinds of correspondence we are getting with increasing/decreasing frequency or whatever.
For this, LDA seemed natural. Any thoughts?
For this scenario, the first thing I would try would be clustering the correspondences. Probably each correspondence has only one topic anyway, so LDA might be a misfit. You could try hierarchical clustering, or k-means where you seed the centroids with examples from the categories you already have.
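As a rough illustration of the seeded k-means idea, here is a minimal pure-Python sketch. It assumes whitespace tokenization, raw word counts, and cosine similarity; the category names and texts are invented, and a real run would use proper preprocessing and an established library.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words counts over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)  # adds counts key by key
    return {k: c / len(vectors) for k, c in total.items()}

def seeded_kmeans(docs, seeds, iterations=10):
    """Cluster `docs`, initializing one centroid per category from `seeds`
    (a dict mapping category label -> list of example texts)."""
    vectors = [vectorize(d) for d in docs]
    centroids = {
        label: mean_vector([vectorize(t) for t in examples])
        for label, examples in seeds.items()
    }
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            max(centroids, key=lambda lab: cosine(v, centroids[lab]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for label in centroids:
            members = [v for v, a in zip(vectors, assignment) if a == label]
            if members:  # keep the old centroid if a cluster is empty
                centroids[label] = mean_vector(members)
    return assignment

# Invented seed examples and correspondence (illustration only).
seeds = {
    "billing": ["charged twice on my invoice"],
    "shipping": ["package arrived late and damaged"],
}
docs = [
    "my invoice shows I was charged twice",
    "the package arrived damaged",
    "invoice charged the wrong amount",
    "delivery was late and the package was damaged",
]
labels = seeded_kmeans(docs, seeds)
print(labels)
```

Because the centroids start from the existing categories, the resulting clusters keep recognizable labels, which should make them easier to compare against the tags the customer service reps already apply.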