I'm deciding which document to assign to which user with a Naive Bayes classifier (NBC) and each user's document history. Currently this works pretty well with a small set of users whose interests don't overlap much. Each user has their own classifier with two categories, ['interesting', 'not interesting'], where 'interesting' is that user's documents and 'not interesting' is everyone else's documents.
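To make the current setup concrete, here is a minimal sketch of the one-classifier-per-user scheme. It assumes a multinomial Naive Bayes with add-one smoothing over pre-tokenized documents; the `NaiveBayes` class, the `build_user_classifiers` helper, and the toy `history` data are illustrative names and values, not the actual system.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.vocab = set()

    def train(self, tokens, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_scores(self, tokens):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


def build_user_classifiers(history):
    """history: dict of user -> list of token lists. Returns user -> NaiveBayes."""
    classifiers = {}
    for user in history:
        nb = NaiveBayes()
        for owner, docs in history.items():
            # A user's own documents are positives; everyone else's are negatives.
            label = "interesting" if owner == user else "not interesting"
            for tokens in docs:
                nb.train(tokens, label)
        classifiers[user] = nb
    return classifiers


# Toy data, made up for illustration only.
history = {
    "alice": [["python", "bayes", "classifier"], ["bayes", "prior", "python"]],
    "bob": [["football", "league", "goal"], ["goal", "match", "league"]],
}
classifiers = build_user_classifiers(history)
new_doc = ["bayes", "python", "prior"]
labels = {user: nb.classify(new_doc) for user, nb in classifiers.items()}
```

In this sketch every classifier is trained on every document, so training cost grows roughly with users times documents, which is the scaling problem I'm asking about.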

I'm trying to understand my options for scaling this system to thousands of users (many with similar preferences) and millions of documents. As I see it I can use:

  • A single NBC with one category per user, trained on the full set of documents
  • A two-category NBC for each user
    • trained on the full set of documents
    • trained on all of the documents from users that aren't in the same k-means cluster as them
    • trained on a ~5k document sample from users that aren't in the same k-means cluster as them
  • Skip the NBC and just use clustering
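As a rough sketch of the sampled variant above, the negative ('not interesting') training set for one user could be drawn only from users in other clusters and capped at a fixed size. The `sample_negatives` helper, the `cluster_of` mapping (assumed to come from a k-means run done elsewhere), and the toy data are all hypothetical.

```python
import random


def sample_negatives(user, history, cluster_of, cap=5000, seed=0):
    """Collect documents from users outside `user`'s cluster, capped at `cap`.

    history:    dict of user -> list of documents
    cluster_of: dict of user -> cluster id (assumed precomputed, e.g. by k-means)
    """
    pool = [
        doc
        for owner, docs in history.items()
        if cluster_of[owner] != cluster_of[user]  # skips the user and cluster-mates
        for doc in docs
    ]
    if len(pool) <= cap:
        return pool
    # A seeded generator keeps the sample reproducible between training runs.
    return random.Random(seed).sample(pool, cap)


# Toy data, made up for illustration only.
history = {
    "alice": ["a1", "a2"],
    "carol": ["c1", "c2", "c3"],
    "bob": ["b1", "b2", "b3", "b4"],
    "dave": ["d1", "d2"],
}
cluster_of = {"alice": 0, "carol": 0, "bob": 1, "dave": 1}
negatives = sample_negatives("alice", history, cluster_of, cap=3)
```

With the cap in place, the size of each user's negative set stays bounded no matter how many documents the system holds; whether a ~5k sample is representative enough is part of what I'm unsure about.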

Are all of these feasible? Is there a clear best path? Can you recommend any reading material that deals with this?

asked Nov 27 '13 at 17:40

Paul Denya


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.