I recently made a small plugin for Google Chrome that saves web pages offline, and I want to present the user with automatically generated tags drawn from his small collection of documents. Of course I could use Scriptaculous and Delicious and all that... but... I won't. :) I want to apply something else. I am thinking word and topic clusters like LDA, but the data set is small. The word distributions should be fairly indicative of the various topics present in the documents. When the user selects a document for tagging, I would suggest as tags the top-k most probable words from the clusters that are representative of that document; that is, select the clusters that most closely match the document and show the words from those. The application is allowed to contact a backend server that can store large amounts of data (like Wikipedia dumps). What are my options here? I am also looking to write some snazzy HTML5 canvas code that shows the user a nice visualization of the topic clusters (something like what t-SNE clusters look like), and when the user clicks a document the relevant clusters should zoom in and display their topic words... This is the primary way tags will be served to the user. I am looking to piece together battle-tested components and keep it simple. The service should be responsive so the user does not have to wait for tags... I would eventually use this as a tag suggestion service.
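For the tag-serving step itself, here is a minimal sketch of the "pick the closest clusters, then show their top-k words" idea, assuming you already have a fitted model exposed as a document-topic vector and a topic-word matrix (the names `doc_topic`, `topic_word`, and `vocab` are hypothetical, not from any particular library):

```python
import numpy as np

def suggest_tags(doc_topic, topic_word, vocab, n_clusters=3, k=5):
    # take the few topics that best explain the selected document...
    top_topics = np.argsort(doc_topic)[::-1][:n_clusters]
    tags = []
    for t in top_topics:
        # ...and surface each topic's k most probable words as candidate tags
        top_words = np.argsort(topic_word[t])[::-1][:k]
        tags.extend(vocab[w] for w in top_words)
    # deduplicate while keeping the highest-probability words first
    return list(dict.fromkeys(tags))
```

Since this is just a couple of argsorts over small vectors, it is cheap enough to run on every click, which keeps the tag suggestions responsive.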
OK, so I have the 1000-topic model ready, and I also made 300- and 500-topic models. Apparently I had to use PLDA with 16 GB of RAM. Now one thing that was bugging me was how to do inference on unseen documents... I know PLDA has an inference module... but there is a problem with it: the 300- and 500-topic models take upward of 2 GB of RAM depending on their size, and the 1000-topic one takes 5 GB. I am looking to compress or distribute this, since I don't have that much RAM on the server hosting the web service, but I do have 5-6 web servers in close proximity. I am thinking of an in-memory distributed Redis setup across 2-3 machines to share the memory load. One point is bugging me here: if anyone has worked on PLDA, why does the memory usage double as more processors are added? It's a really bad parallel implementation... I am guessing they load the entire VOCAB x NUM_TOPICS count matrix into memory on every worker, but that is only needed for fast Gibbs sampling... Shouldn't parallel implementations also put a premium on speed when doing inference, not just learning? No one seems to be doing that... unless there are existing implementations I don't know about.
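On the distribution idea, here is a minimal sketch of how the topic-word counts could be sharded across a few Redis instances and fetched one word at a time during inference. The host names, key scheme, and sparse hash layout are assumptions for illustration, not anything PLDA provides:

```python
import zlib
import numpy as np
import redis

# Hypothetical shard layout: one Redis instance per machine; the row of topic
# counts for each word lives on the shard chosen by a stable hash of the word.
SHARDS = [redis.Redis(host=h, port=6379) for h in ("node1", "node2", "node3")]

def shard_for(word):
    return SHARDS[zlib.crc32(word.encode("utf-8")) % len(SHARDS)]

def store_word_topic_counts(word, counts):
    """counts: 1-D array of length num_topics from the trained model."""
    mapping = {str(t): int(c) for t, c in enumerate(counts) if c > 0}  # store sparsely
    if mapping:
        shard_for(word).hset(f"wt:{word}", mapping=mapping)

def fetch_word_topic_counts(word, num_topics):
    row = np.zeros(num_topics)
    for t, c in shard_for(word).hgetall(f"wt:{word}").items():
        row[int(t)] = float(c)
    return row
```

Because inference on a new document only touches the rows for the words it actually contains, each request pulls a few hundred small hashes rather than the whole 5 GB matrix.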
So, picking up from where this discussion left off... still no luck with a topic model. :( I have tried running the LDA implementations from Mahout (Hadoop) and MALLET over Wikipedia, but they just don't seem to scale; I always run out of memory. Here is what I am using:

1) ~47,000 selected articles from Wikipedia, released by them as a supposedly well-curated set.

2) Preprocessing of the documents to plain-text format, followed by stemming (Porter) and stop-word removal; after filtering some errors I got 41,000 non-empty docs.

3) The unique vocabulary count from the output of the last step is about 440,000 (dictionary size).

LDA options chosen: I chose about 1000 topics to fit the model, with a smoothing of 0.05 (50/num_topics), and decided to use Mahout and MapReduce (MALLET is dead on arrival; it can't scale). Here is Mahout's implementation of LDA. Some filtering techniques are listed there; how can I use them to get a good data set mix? Even this I think is too big: I get an out-of-memory exception on Hadoop with 2.00 GB of heap space per node, on a 2-node cluster with 20 map processes and 4 reducers. Any suggestions for improvement here are welcome.
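One cheap filtering step that usually helps before touching the cluster configuration is pruning the dictionary by document frequency; a 440,000-term vocabulary from Wikipedia is mostly rare tokens. A rough sketch, where the thresholds are illustrative guesses rather than recommended values:

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=5, max_df_frac=0.5):
    """Keep only terms that appear in at least min_df documents and in at
    most max_df_frac of all documents; drop everything else."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # document frequency, not term frequency
    keep = {w for w, c in df.items() if c >= min_df and c <= max_df_frac * n_docs}
    return [[w for w in doc if w in keep] for doc in tokenized_docs], keep
```

Dropping terms that occur in only a handful of documents (misspellings, rare named entities), plus the very frequent ones the stop list missed, tends to shrink the vocabulary by a large factor, and the topic x word count matrix shrinks with it.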
I guess the problem is that with this vocabulary size and number of topics you'd need 3.52 gigabytes of RAM just to store a dense topic x word count matrix (440,000 words x 1,000 topics x 8 bytes). I don't know exactly how the implementations work, but I think they keep this matrix in memory and do Gibbs steps on each document in some order (since a Gibbs step on a word of a document can be done in constant time and constant space). So I guess you can do a few things:

(1) use fewer topics, or otherwise make sure the topic x word count matrix is small enough to fit in RAM;

(2) prune some words from the vocabulary at random at first, and then re-add them per document, or something along those lines;

(3) switch to a database-backed implementation, where you keep essentially four things in the database: the topic x word count matrix, the document x topic count matrix (better stored as one row per document, since you will always need all the counts for a single document at once), the documents themselves, and a per-document assignment vector stating which topic each word is assigned to. In this setting a Gibbs step is essentially a database transaction with one random-number draw in the middle (with this many topics and words it shouldn't hurt if you allow inconsistent collapsed Gibbs steps and don't lock the topic x word count table);

(4) switch to variational / uncollapsed sampling. It will take more iterations to converge, but each iteration is embarrassingly parallel and can be done reasonably fast if you have everything backed by a database / key-value store as mentioned in (3).

I guess to actually scale you need to know the size of your data and where it will fit.
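To make option (3) a bit more concrete, here is a sketch of a single collapsed Gibbs update with the count structures held in NumPy arrays; in a database-backed version each decrement/increment group would become reads and writes against the store instead. This is the generic LDA update, not code from PLDA, Mahout, or MALLET, and all the names are hypothetical:

```python
import numpy as np

def gibbs_step(d, word, z_old, counts, alpha=0.05, beta=0.01):
    """One collapsed Gibbs update for a token of document d currently
    assigned to topic z_old. `counts` bundles the structures from option (3):
    topic-word counts, per-topic totals, and per-document topic counts."""
    n_tw, n_t, n_dt = counts["topic_word"], counts["topic_total"], counts["doc_topic"]
    K, V = n_tw.shape  # number of topics, vocabulary size

    # remove the token's current assignment from all counts
    n_tw[z_old, word] -= 1
    n_t[z_old] -= 1
    n_dt[d, z_old] -= 1

    # unnormalized conditional p(z | everything else) for this token
    p = (n_dt[d] + alpha) * (n_tw[:, word] + beta) / (n_t + beta * V)
    z_new = np.random.choice(K, p=p / p.sum())

    # record the new assignment
    n_tw[z_new, word] += 1
    n_t[z_new] += 1
    n_dt[d, z_new] += 1
    return z_new
```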
(Oct 22 '10 at 20:02)
Alexandre Passos ♦
I'd say: build a representative enough stoplist, get a large dataset (this question might be of interest for that part), train a topic model on the server, and create an API where you can get a per-word topic posterior and use that to build a topic posterior for a document in linear time. Otherwise you can use one of the recent streaming LDA algorithms (see "Efficient Methods for Topic Model Inference on Streaming Document Collections", Limin Yao, David Mimno and Andrew McCallum) to train a topic model incrementally as you get more data, probably starting from a model pre-trained on a larger corpus. To get topic clusters, a nice thing you can do is run a topic model on the topics themselves; this is essentially the idea behind pachinko allocation.

So first the corpus expansion and then a topic model (say LDA) over it. One question: how many topics should I choose here to train my topic model (say I use LDA)?
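The linear-time per-document posterior mentioned above can be as simple as averaging the per-word topic posteriors returned by the API. A rough sketch, where `word_topic_posterior` is a hypothetical dict mapping each word to a probability vector over topics:

```python
import numpy as np

def doc_topic_posterior(doc_tokens, word_topic_posterior, num_topics):
    """Approximate a document's topic posterior in one linear pass by
    averaging the per-word topic posteriors served by the backend."""
    acc = np.zeros(num_topics)
    n = 0
    for w in doc_tokens:
        if w in word_topic_posterior:  # skip out-of-vocabulary words
            acc += word_topic_posterior[w]
            n += 1
    # fall back to a uniform distribution if nothing in the doc is known
    return acc / n if n else np.full(num_topics, 1.0 / num_topics)
```

This is only an approximation to a proper fold-in (there is no iteration over topic assignments), but it is fast enough to serve tags without making the user wait.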
(Oct 07 '10 at 19:37)
kpx
A good rule of thumb is a large number of topics with a small alpha hyperparameter for the document-topic Dirichlet distribution.
(Oct 07 '10 at 19:40)
Alexandre Passos ♦
:) Big huh? How about square root of the number of terms in the initial document query...
(Oct 07 '10 at 20:22)
kpx
LDA has two sets of distributions: each topic has a distribution over words and each document has a distribution over topics. Each of these sets of distributions usually has a hyperparameter, and reducing it makes LDA act like an approximation of HDP-LDA (a model with a potentially infinite number of topics), so it will effectively use as many topics as needed. Heuristics such as the one you mentioned are useful, but I'm not aware of any principled ones.
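A small illustration of why a large topic count with a small alpha still behaves sparsely: samples from a symmetric Dirichlet concentrate their mass on a handful of components as alpha shrinks, so each document effectively uses only the topics it needs. The numbers below are just a sanity check, not from any of the models discussed here:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 1000  # topics

for alpha in (1.0, 0.05):
    theta = np.sort(rng.dirichlet([alpha] * K))[::-1]   # one document-topic sample
    needed = int(np.searchsorted(np.cumsum(theta), 0.9)) + 1
    print(f"alpha={alpha}: roughly {needed} topics cover 90% of the mass")
```

With alpha around 1 a document spreads its mass over hundreds of topics; with a small alpha it needs only a few dozen, which is the "as many topics as needed" behaviour described above.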
(Oct 07 '10 at 20:26)
Alexandre Passos ♦
That is helpful; I am going with your initial advice. I have started parallel LDA on a 64-processor cluster over filtered Wikipedia dumps. I hope to build a nice public tagging service on top of it, serving tags from the topic model given a query document from the user.
(Oct 09 '10 at 05:48)
kpx