|
I have 11 text documents, and I want to extract keywords from each document so I can use these keywords in clustering. How can I extract keywords from a text document?
|
First, you can do clustering without keyword extraction by extracting word or bigram counts and then applying TF-IDF normalization. This approach is used in this example script. Now, if you really want to extract keywords, you could use a part-of-speech tagger (or even a chunker, a.k.a. shallow parser) for your language and extract noun phrases. You can then rank each noun phrase by averaging the IDF values of its individual words (using a geometric mean, for instance).
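To make both ideas concrete, here is a minimal sketch in plain Python. It is not the example script mentioned above: the tokenizer, the smoothed IDF formula, the function names, and the three toy documents are all assumptions made for illustration. It builds TF-IDF vectors over words and bigrams, compares documents by cosine similarity, and scores a candidate phrase by the geometric mean of its words' IDF values.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def terms(text):
    """Return the unigrams plus the bigrams of one document."""
    words = tokenize(text)
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def idf_table(docs_terms):
    """Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1."""
    n = len(docs_terms)
    df = Counter(t for doc in docs_terms for t in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

def tfidf_vector(doc_terms, idf):
    """Sparse TF-IDF vector, scaled to unit length."""
    counts = Counter(doc_terms)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def phrase_score(phrase, idf):
    """Geometric mean of the IDF values of the known words in a phrase."""
    vals = [idf[w] for w in tokenize(phrase) if w in idf]
    if not vals:
        return 0.0
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

# Toy documents, invented for this sketch.
docs = [
    "The cat sat on the mat with another cat",
    "A cat and a dog sat on the mat",
    "Stock markets fell as interest rates rose",
]
docs_terms = [terms(d) for d in docs]
idf = idf_table(docs_terms)
vectors = [tfidf_vector(t, idf) for t in docs_terms]

sim_01 = cosine(vectors[0], vectors[1])  # two documents about cats and mats
sim_02 = cosine(vectors[0], vectors[2])  # no shared terms at all
```

With real documents you would feed the pairwise similarities (or the vectors themselves) to whatever clustering method you prefer, and call `phrase_score` on the noun phrases produced by your tagger or chunker.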
|
A simple, straightforward method would be to:
I guess this would be quite naive, but 11 documents are too few to do much with.
|
There's a well-known NLP task called 'Keyphrase Extraction'. Many scientific papers require their authors to provide keywords or keyphrases, and this task is the automated equivalent. Most methods for it let you lower the threshold to get more keyphrases, which might help given the small number of documents. Check out some of the papers cited by the SemEval task, as well as some of the submitted systems (which are more domain-dependent, being geared toward scientific papers). I'd agree with ogrisel that a TF-IDF approach might work best given your small number of documents, but if you really do need to extract keywords/keyphrases then there's a good amount of literature, and the above link is a good place to start.
A tool for keyphrase extraction: KEA (Keyphrase Extraction Algorithm), http://www.nzdl.org/Kea/
(Sep 27 '11 at 19:02)
y2p
|
|
You can try topic modelling.
11 text documents seem too few to actually use LDA, since I do not think he has a rich mixture of topics, which is the base assumption of LDA.
(Sep 27 '11 at 01:08)
Leon Palafox
Yes, you are right. I agree!
(Oct 02 '11 at 02:33)
y2p
|
I edited the title to make the question a little more relevant.
How long are the individual documents?
What problem are you really trying to solve? What do you want clusters on these 11 docs to tell you? Presumably you only expect 2, 3, or possibly 4 clusters.
I have 11 text documents, and I want to know which documents are close to each other. For my work, I need to extract the keywords of each document and use these keywords to cluster the 11 documents. This is my first time working on clustering. I extracted the keywords using a keyword extraction tool, but the extracted keywords contain a lot of stop words and many repeated words. I am asking for a good keyword extraction method, and then how to do the clustering based on these keywords.
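For the stop-word and repetition problem specifically, a rough sketch like the following might be enough to start with. Everything in it is an assumption for illustration: the tiny stop-word list, the function names, and the sample keywords are invented, and Jaccard overlap is just one simple way to measure how close two keyword sets are.

```python
# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is", "for"}

def clean_keywords(keywords):
    """Lowercase, drop stop words and duplicates, keep first-seen order."""
    seen = set()
    cleaned = []
    for kw in keywords:
        kw = kw.strip().lower()
        if not kw or kw in STOP_WORDS or kw in seen:
            continue
        seen.add(kw)
        cleaned.append(kw)
    return cleaned

def jaccard(a, b):
    """Shared keywords divided by all distinct keywords of both documents."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented keyword lists, standing in for the output of an extraction tool.
raw = {
    "doc1": ["The", "clustering", "of", "documents", "clustering"],
    "doc2": ["documents", "and", "clustering", "keywords"],
    "doc3": ["protein", "folding", "in", "the", "cell"],
}
keywords = {name: clean_keywords(kws) for name, kws in raw.items()}

for x, y in [("doc1", "doc2"), ("doc1", "doc3"), ("doc2", "doc3")]:
    print(x, y, round(jaccard(keywords[x], keywords[y]), 2))
```

Documents whose cleaned keyword sets overlap heavily get a score near 1, and documents with nothing in common get 0, so the scores can serve as the "nearness" between the 11 documents.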