I am summing the tf-idf values for a single word across all documents. Is it possible for the sum to be greater than 500?

asked Mar 09 '11 at 20:43

Alex Hernandez

This falls under "possible but unlikely". Maybe more context would help. What are you using for a corpus? What word is summing to over 500? How many documents is the word in? I could see a single nasty document containing hundreds of instances of a garbage token that no other document contains.

(Mar 09 '11 at 21:23) Kirk Roberts
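
To make the "single nasty document" case concrete, here is a minimal sketch. The question does not say which tf-idf variant is in use, so this assumes raw-count tf and idf = ln(N/df); the corpus, the token `zzz`, and the helper `tfidf_sum` are all made up for illustration.

```python
import math

def tfidf_sum(term, docs):
    """Sum tf-idf for `term` over tokenized docs.

    Assumes tf = raw count and idf = ln(N / df); other variants
    (log-scaled tf, smoothed idf) will give different numbers.
    """
    n_docs = len(docs)
    counts = [doc.count(term) for doc in docs]
    df = sum(1 for c in counts if c > 0)  # documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(n_docs / df)
    return sum(c * idf for c in counts)

# Hypothetical corpus: one document repeats a garbage token 200 times,
# and the other 999 documents never contain it.
docs = [["zzz"] * 200] + [["the", "cat"] for _ in range(999)]
total = tfidf_sum("zzz", docs)
print(total > 500)
```

Under these assumptions the sum is 200 * ln(1000), which is well over 500, so one dense document is enough on its own.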

Processing Wikipedia, I've seen pretty extreme tf-idf values. One example was "crown", a relatively uncommon word. Wikipedia has a page "List of Royal Crowns", where just about every third word was "crown" (State Crown of George I, Crown of Norway, Pahlavi Crown...). If the word you're referencing is causing you trouble, I'd recommend a) throwing out excessively dense term occurrences and/or b) normalizing tf-idf values so that every word sums to the same value.

(Mar 10 '11 at 09:36) Paul Barba
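
One possible reading of suggestions a) and b), as a rough sketch rather than the commenter's actual method: drop a term's count in any document where it makes up more than some fraction of the tokens, and rescale each term's scores so they sum to 1 across documents. The function names and the `max_density=0.2` threshold are assumptions picked for illustration.

```python
def drop_dense(term_counts, doc_lengths, max_density=0.2):
    """Zero out the term's count in documents where it exceeds
    `max_density` of all tokens (hypothetical threshold)."""
    return [
        0 if n > 0 and c / n > max_density else c
        for c, n in zip(term_counts, doc_lengths)
    ]

def normalize_per_term(scores):
    """Rescale one term's per-document scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores] if total else list(scores)

# Made-up numbers: the term is a third of the first document's tokens.
counts = [120, 2, 1]
lengths = [360, 400, 500]
kept = drop_dense(counts, lengths)
normalized = normalize_per_term([6.0, 3.0, 1.0])
```

If counts are dropped this way, document frequency and idf would presumably need to be recomputed from the filtered counts before scoring.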
