
I have 11 text documents and I want to extract keywords from each one, so that I can use those keywords for clustering. How can I extract keywords from a text document?

asked Sep 26 '11 at 23:18


rasha Elagamy

edited Sep 27 '11 at 02:21


Robert Layton


I edited the title to make the question a little more relevant.

(Sep 27 '11 at 02:21) Robert Layton

How long are the individual documents?

(Sep 27 '11 at 10:29) Daniel Mahler

What problem are you really trying to solve? What do you want the clusters on these 11 docs to tell you? Presumably you only expect 2, 3, or possibly 4 clusters.

(Sep 27 '11 at 10:35) Daniel Mahler

I have 11 text documents, and I want to know which documents are close to each other. For my work I need to extract the keywords of each document and use them to cluster the 11 documents. This is my first time working on clustering. I extracted keywords with a keyword extraction tool, but the result contains many stop words and many repeated words. I am asking for a good keyword extraction method, and for how to then cluster the documents based on those keywords.

(Sep 28 '11 at 03:39) rasha Elagamy

4 Answers:

First, you can cluster the documents without any keyword extraction: extract word or bigram counts and then apply TF-IDF normalization. This approach is used in this example script.
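As a rough illustration of the counts-plus-TF-IDF idea, here is a minimal sketch in plain Python. It is not the example script mentioned above: the stop-word list, the smoothed IDF formula, the helper names and the three toy documents are all assumptions made for the example. It builds unit-length TF-IDF vectors over words and bigrams and compares documents by cosine similarity, which is one way to see which documents are near each other.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def terms(text):
    """Return unigrams plus bigrams of adjacent remaining tokens."""
    words = tokenize(text)
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def tfidf_vectors(docs):
    """Build one L2-normalized sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(terms(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for c in counts for t in c)
    # Smoothed IDF (an assumed variant) so every term keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for c in counts:
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


docs = [
    "The cat sat on the mat.",
    "A cat and a kitten sat on a mat.",
    "Stock markets fell sharply in early trading.",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # two documents about cats
sim_02 = cosine(vectors[0], vectors[2])  # unrelated documents
```

With the pairwise similarities in hand, any standard clustering method (hierarchical clustering, for instance) can be applied; with only 11 documents, simply inspecting the similarity matrix may already be informative.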

If you really do want to extract keywords, you could run a part-of-speech tagger (or even a chunker, a.k.a. shallow parser) for your language and extract noun phrases. You can then rank each noun phrase by averaging the IDF values of its individual words (using a geometric mean, for instance).
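The ranking step could look roughly like the sketch below. It assumes the noun phrases have already been produced by some tagger or chunker (the `candidates` list here is made up by hand), and the smoothed IDF formula and the handling of unseen words are assumptions for the example, not part of the answer.

```python
import math


def idf_table(tokenized_docs):
    """Smoothed IDF per word (assumed variant; always >= 1, so its log is defined)."""
    n = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return {w: math.log((1 + n) / (1 + d)) + 1 for w, d in df.items()}


def phrase_score(phrase, idf, default):
    """Geometric mean of the IDF values of the words in the phrase."""
    values = [idf.get(w, default) for w in phrase.split()]
    return math.exp(sum(math.log(v) for v in values) / len(values))


def rank_phrases(phrases, idf):
    """Sort phrases from highest to lowest geometric-mean IDF."""
    # Assumption: a word never seen in the corpus is treated as maximally rare.
    default = max(idf.values())
    return sorted(phrases, key=lambda p: phrase_score(p, idf, default), reverse=True)


# Toy corpus, already tokenized and lowercased (hypothetical data).
tokenized_docs = [
    ["neural", "network", "training", "data"],
    ["neural", "network", "model", "data"],
    ["support", "vector", "machine", "data"],
]
# Hypothetical noun phrases, as a chunker might return them.
candidates = ["neural network", "support vector machine", "training data"]

idf = idf_table(tokenized_docs)
ranked = rank_phrases(candidates, idf)
```

In this toy corpus the phrase made of words that occur in a single document ranks first, while a phrase containing the ubiquitous word "data" is pulled down by the geometric mean.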

answered Sep 27 '11 at 03:13


ogrisel

edited Sep 27 '11 at 08:49

A simple, straightforward method would be to:

  • Count how many times each word appears in the document (ignoring stop words), and turn the counts into probabilities (number of times the word appears / total number of words).

  • Count how many times pairs of words co-occur in the same sentence, which gives you a co-occurrence matrix.

  • You could argue that words that co-occur with high-count words are also keywords. You can also compute the pointwise mutual information between words to see how strongly each pair of words is related.
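The three steps above can be sketched in plain Python as follows. The stop-word list, the naive sentence splitter, the helper names and the sample text are assumptions for illustration. PMI is computed here from sentence-level probabilities, i.e. PMI(a, b) = log(p(a, b) / (p(a) p(b))), where each probability is the fraction of sentences containing the word or the pair.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def sentences(text):
    """Naive sentence splitter on ., ! and ? (an assumption for the sketch)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def word_probabilities(text):
    """Relative frequency of each non-stop word in the document."""
    words = tokenize(text)
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}


def cooccurrence(text):
    """Count the sentences in which each word, and each unordered word pair, appears."""
    pair_counts = Counter()
    word_counts = Counter()
    n_sentences = 0
    for s in sentences(text):
        unique = sorted(set(tokenize(s)))
        if not unique:
            continue
        n_sentences += 1
        word_counts.update(unique)
        # Pairs are stored as alphabetically ordered tuples.
        pair_counts.update(combinations(unique, 2))
    return pair_counts, word_counts, n_sentences


def pmi(pair_counts, word_counts, n_sentences):
    """Pointwise mutual information for every observed word pair."""
    return {
        (a, b): math.log(
            (c / n_sentences)
            / ((word_counts[a] / n_sentences) * (word_counts[b] / n_sentences))
        )
        for (a, b), c in pair_counts.items()
    }


text = (
    "Clustering groups similar documents. Keywords summarize documents. "
    "Clustering uses keywords. Stop words carry little meaning."
)
probs = word_probabilities(text)
pair_counts, word_counts, n_sentences = cooccurrence(text)
scores = pmi(pair_counts, word_counts, n_sentences)
```

Note that PMI favors rare pairs: two words that each appear once and appear together get a high score, so in practice it is usually combined with a minimum count threshold.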

I guess this is quite naive, but 11 documents are too few to do much more with.

answered Sep 27 '11 at 01:13


Leon Palafox

There's a well-known NLP task called 'Keyphrase Extraction'. Many scientific papers require their authors to provide keywords or keyphrases, and this task is the automated equivalent. Most methods for this task let you lower the threshold to get more keyphrases, which might help given the small number of documents. Check out some of the papers cited by the SemEval task, as well as some of the submitted systems (which are more domain-dependent, being aimed at scientific papers).

I'd agree with ogrisel that a TF-IDF approach might work best given your small number of documents, but if you really do need to extract keywords/keyphrases then there's a good amount of literature and the above link is a good place to start.

answered Sep 27 '11 at 13:45


Kirk Roberts

A tool for keyphrase extraction: KEA (Keyphrase Extraction Algorithm), http://www.nzdl.org/Kea/

(Sep 27 '11 at 19:02) y2p

You can try topic modelling using LDA. Here is a tutorial for an R package:

topicmodels: An R Package for Fitting Topic Models

answered Sep 27 '11 at 00:44


y2p

edited Sep 27 '11 at 00:45

11 text documents seem too few to actually use LDA, since I do not think the collection has a rich mixture of topics, which is the basic assumption of LDA.

(Sep 27 '11 at 01:08) Leon Palafox

Yes, you are right. I agree!

(Oct 02 '11 at 02:33) y2p

Asked: Sep 26 '11 at 23:18

Seen: 1,070 times

Last updated: Oct 02 '11 at 02:33


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.