Hi all, I recently ran into this problem: my vocabulary is too large (Chinese, almost 400 million entries after preprocessing) and the number of documents is 200,000, so the TF-IDF matrix occupies about 64TB. It cannot fit in my 16GB of memory at all. Does anyone know how to deal with this, and are there any papers on this kind of problem? Thanks a lot!

asked Aug 23 '11 at 02:24

ylqfp

My area is not NLP, but you might try looking into sparse matrix management.

(Aug 23 '11 at 06:34) Leon Palafox

One Answer:

200K documents should take far less than 64TB to store in any decent sparse matrix representation. Indeed, most LDA implementations use sparse matrices by default, since doing text processing on dense matrices is dangerous.
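To make the dense versus sparse gap concrete, here is a back-of-the-envelope sketch. The average of 1,000 distinct terms per document, the 4-byte values and indices, and the 8-byte row pointers are assumptions for illustration, not figures from this thread; the layout modeled is a generic compressed sparse row (CSR) format.

```python
def dense_bytes(n_docs, vocab_size, bytes_per_value=4):
    """Memory for a dense docs x vocab matrix."""
    return n_docs * vocab_size * bytes_per_value


def csr_bytes(n_docs, avg_nnz_per_doc, bytes_per_value=4,
              bytes_per_index=4, bytes_per_ptr=8):
    """Memory for a CSR matrix: one value and one column index per
    nonzero, plus one row pointer per document (and one extra)."""
    nnz = n_docs * avg_nnz_per_doc
    return nnz * (bytes_per_value + bytes_per_index) + (n_docs + 1) * bytes_per_ptr


n_docs = 200_000
vocab_size = 400_000_000
avg_nnz_per_doc = 1_000  # assumed: distinct terms per document

dense = dense_bytes(n_docs, vocab_size)
sparse = csr_bytes(n_docs, avg_nnz_per_doc)

print(f"dense:  {dense / 1e12:.1f} TB")
print(f"sparse: {sparse / 1e9:.2f} GB")
```

Under these assumptions the sparse size depends on the number of nonzero entries, not on the vocabulary size, which is why it stays in the gigabyte range.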

However, if your vocabulary really has 400M entries, each topic should take 400MB-1GB depending on the representation, which can indeed be costly. I'm not even sure LDA can do something useful with 400M words. Also, as far as I know, Chinese has on the order of tens of thousands of characters, and most people don't have a vocabulary anywhere near 100K words. What are you counting as a word? All possible character n-grams? Maybe you should run a word segmentation algorithm before doing LDA to trim the vocabulary size.
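The per-topic cost comes from storing one weight per vocabulary entry for every topic. A rough sketch of that arithmetic follows; the bytes per weight and the topic count of 100 are assumptions, not values from this thread.

```python
def topic_table_bytes(vocab_size, n_topics, bytes_per_weight):
    """Memory for a dense topics x vocab weight table."""
    return vocab_size * n_topics * bytes_per_weight


vocab_size = 400_000_000
n_topics = 100  # assumed topic count

for bytes_per_weight in (1, 2, 4):
    per_topic = topic_table_bytes(vocab_size, 1, bytes_per_weight)
    total = topic_table_bytes(vocab_size, n_topics, bytes_per_weight)
    print(f"{bytes_per_weight} B/weight: "
          f"{per_topic / 1e9:.1f} GB per topic, "
          f"{total / 1e9:.0f} GB for {n_topics} topics")
```

With 1 or 2 bytes per weight a topic takes 0.4 to 0.8 GB; with 4-byte floats it takes 1.6 GB. Either way, even a modest number of topics exceeds 16GB of memory.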

answered Aug 23 '11 at 07:32

Alexandre Passos ♦

Thanks for answering, Alex. This is for Input Method Editor (IME) software, in the word recommendation setting. Segmentation won't work well for recommendation. For example, the Chinese pinyin "jiqi" corresponds to "machine" (机器), "extremely" (极其), "remember" (记起), and many other Chinese words. In this situation you must remember every phrase the user has typed and recommend it the next time they want to use it. If you segment, you can only obtain word suggestions rather than phrase suggestions.

(Aug 23 '11 at 10:27) ylqfp

And why do you think topic models are a good idea for word recommendation? Also, even if your model uses texts with segmented words you can easily do a lookup on words that have as a prefix any character n-gram the user types, and it is probably not only much faster than a topic model with everything thrown in it but also should lead to a better model.
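As an illustration of the prefix lookup idea, here is a minimal sketch using a sorted list and binary search. The `PrefixIndex` class, the pinyin keys, and the tiny lexicon are all hypothetical; a real IME would use a far larger dictionary and some ranking of the candidates.

```python
from bisect import bisect_left


class PrefixIndex:
    """Sorted (key, phrase) pairs supporting prefix queries."""

    def __init__(self, entries):
        self._entries = sorted(entries)
        self._keys = [key for key, _ in self._entries]

    def lookup(self, prefix, limit=10):
        """Return up to `limit` phrases whose key starts with `prefix`."""
        # All keys sharing a prefix are contiguous in sorted order,
        # so binary-search for the first one and scan forward.
        i = bisect_left(self._keys, prefix)
        results = []
        while (i < len(self._entries)
               and len(results) < limit
               and self._keys[i].startswith(prefix)):
            results.append(self._entries[i][1])
            i += 1
        return results


# Hypothetical lexicon: pinyin key -> phrase.
index = PrefixIndex([
    ("jiqi", "机器"),
    ("jiqi", "极其"),
    ("jiqiren", "机器人"),
    ("jiqixuexi", "机器学习"),
    ("jisuanji", "计算机"),
])

print(index.lookup("jiqi"))
print(index.lookup("jiqir"))
```

Each query costs one binary search plus a scan over the matches, so it scales with the number of candidates returned rather than with the size of the lexicon.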

(Aug 23 '11 at 21:06) Alexandre Passos ♦

Those are different tasks. In the real scenario, the word pool is too big, so it has to be split into smaller ones. Our work is just to find the small word pools that are highly related to the user's previous and current interests, and to push (suggest, advertise) those word pools to the user for future use. So I think LDA could be used in this kind of situation...

(Aug 24 '11 at 01:51) ylqfp

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.