I am summing the tf-idf values of a word across all documents. Is it possible for the sum to be > 500?
Asked: Mar 09 '11 at 20:43
Seen: 600 times
Last updated: Mar 10 '11 at 09:36
This falls under "possible but unlikely". More context would help: what corpus are you using, which word sums to over 500, and how many documents does it appear in? With raw term counts nothing caps the sum, so I could see a single nasty document containing hundreds of instances of a garbage token that no other document contains.
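To make the "single nasty document" case concrete, here is a minimal sketch. It assumes raw-count tf and idf = ln(N / df), which may not match the variant you are using, and the corpus, the token `zzzz`, and the helper `summed_tfidf` are all made up for illustration.

```python
import math
from collections import Counter

def summed_tfidf(docs):
    """Return {term: sum over documents of tf * idf}.

    Assumes tf is the raw count in a document and idf = ln(N / df).
    Other tf-idf variants will give different numbers.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    totals = Counter()
    for c in counts:
        for term, tf in c.items():
            totals[term] += tf * math.log(n_docs / df[term])
    return dict(totals)

# Hypothetical corpus: 999 ordinary documents plus one document that
# repeats a garbage token 100 times.
docs = [["the", "cat", "sat"] for _ in range(999)]
docs.append(["zzzz"] * 100 + ["the"])

totals = summed_tfidf(docs)
print(round(totals["zzzz"], 1))  # 690.8, i.e. 100 * ln(1000)
```

Under these assumptions one document is enough to push a term past 500, while a word that appears in every document sums to 0 because its idf is ln(1).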
Processing Wikipedia, I've seen pretty extreme tf-idf values. One example was "crown", a relatively uncommon word. Wikipedia has a page "List of Royal Crowns", where just about every third word was "crown" (State Crown of George I, Crown of Norway, Pahlavi Crown...). If the word you're asking about is causing you trouble, I'd recommend a) throwing out excessively dense term occurrences and/or b) normalizing tf-idf values so that every word sums to the same value.
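Both suggestions can be sketched roughly as follows. This is only one possible reading of each: the function names, the density threshold, and the sample weights are hypothetical, and the per-document weights are assumed to be stored as one `{term: tfidf}` dict per document.

```python
from collections import Counter, defaultdict

def drop_dense_terms(doc, max_density=0.25):
    """Option a): drop any term that makes up more than max_density
    of a document's tokens. The threshold is an arbitrary choice."""
    counts = Counter(doc)
    limit = max_density * len(doc)
    return [tok for tok in doc if counts[tok] <= limit]

def normalize_per_term(weights, target=1.0):
    """Option b): rescale so each term's tf-idf values sum to `target`
    across all documents. `weights` is a list of {term: tfidf} dicts."""
    totals = defaultdict(float)
    for doc_weights in weights:
        for term, value in doc_weights.items():
            totals[term] += value
    return [
        {term: (target * value / totals[term] if totals[term] else 0.0)
         for term, value in doc_weights.items()}
        for doc_weights in weights
    ]

# A toy stand-in for a page where every other word is "crown".
crown_page = ["crown", "of", "norway", "crown", "state", "crown"]
filtered = drop_dense_terms(crown_page, max_density=0.34)
print(filtered)  # ['of', 'norway', 'state']

# Made-up weights for two documents.
weights = [{"crown": 300.0, "king": 2.0}, {"crown": 1.5, "king": 2.0}]
normalized = normalize_per_term(weights)
print(normalized[0]["king"])  # 0.5
```

Note that after option b) the per-word sum is the same for every word by construction, so the sum itself no longer tells you anything; what remains is how each word's weight is distributed across documents.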