|
Hey all, I'm doing some prep work for a machine learning / NLP project at my summer job where I'm using several years of classification data for content at various URLs (blog posts in particular), and at some point also getting hold of the original blog pages themselves. Given that I'll be dealing with on the order of millions of such items and their associated text, I am wondering: what's a good way to store this data for ML/NLP work?
I'm not averse to a wee bit of programming work to make things work the way I like, though I will admit that I lean towards a low-overhead interactive approach in a language such as Scala, Haskell, or Scheme.
|
Here is my recommendation: since you mention a few million files of text and markup, and a MacBook with 4GB of RAM, I would suggest that you store them in a simple directory on a normal file system. If your work is machine learning / NLP related (let's assume blog classification), the important part will be implementing those algorithms rather than achieving low latency. You could use a document-oriented database, but the rate-limiting step in your pipeline would almost always be the ML/NLP algorithm rather than the disk reads, so introducing a database would only eat into your development time. Those databases are well suited to serving pages; only if you are planning to build a scalable server would I recommend using one.

That's somewhat what I'm thinking, though the metadata / by-hand classification data I already have is in a DB (though one not organized for ML/NLP use), so I need to have that organized somewhere nice. That being said, half of the appeal of using e.g. MongoDB or CouchDB is that if I need to kill a run of whatever algorithms I'm using, the data on disk remains unborked. [edit: fixed typo]
(Jul 03 '10 at 21:04)
Carter Tazio Schonwald
Metadata, token counts, and other datasets that have structure are useful to keep in a DB. As for an algorithm borking your dataset, a few hundred GB isn't large enough to be difficult to replicate manually. You can also have the files belong to a different user and give the algorithm read-only access. It is rare for an algorithm to unintentionally overwrite data on disk when the process is killed; at least it has never happened to me. IMHO, avoid complexity in every aspect other than the actual algorithm.
(Jul 03 '10 at 21:20)
DirectedGraph
It seems like, for my use case, this'll be the right way to go, though I'm not sure if it's the "right" answer for this question in a broader setting.
(Jul 03 '10 at 21:38)
Carter Tazio Schonwald
I feel most of the questions on this forum do not have a single right answer, including this one.
(Jul 03 '10 at 22:02)
DirectedGraph
@DirectedGraph: I feel that you are correct that with questions of this nature, there are rarely any "right" answers.
(Jul 03 '10 at 22:31)
Joseph Turian ♦♦
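To make the plain-filesystem recommendation above concrete, here is a minimal sketch (assuming a hypothetical `corpus/` directory holding one UTF-8 text file per blog post) of streaming the documents through a processing step one at a time, so nothing larger than a single post has to fit in the 4GB of RAM:

```python
import os

CORPUS_DIR = "corpus"  # hypothetical directory: one .txt file per blog post

def iter_documents(corpus_dir=CORPUS_DIR):
    """Yield (doc_id, text) pairs, one document at a time."""
    for name in sorted(os.listdir(corpus_dir)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(corpus_dir, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            yield name, f.read()

if __name__ == "__main__":
    # Example: count tokens per document without holding the corpus in memory.
    for doc_id, text in iter_documents():
        print(doc_id, len(text.split()))
```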
|
|
I have been using MySQL for a similar project (on Twitter data) and I have been very happy (and yes, I know that the hype is all about the NoSQL movement at the moment). My datasets are typically around 10GB; I don't know if it handles much larger databases well, but I have never had problems with it. I use Python and a home-brewed MVC framework to analyze my data (I assume you could also use Django for talking to the DB if you are using Python). I have to confess, though, that I am thinking about taking it up a notch and moving to HBase.
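As a rough illustration of the Python-plus-MySQL setup described above (the connection parameters, table, and column names here are made up, and PyMySQL is just one of several possible MySQL drivers):

```python
import pymysql  # one possible MySQL driver; MySQLdb or mysql-connector work similarly

# Connection parameters and schema are hypothetical.
conn = pymysql.connect(host="localhost", user="nlp", password="secret",
                       database="blogcorpus")
try:
    with conn.cursor() as cur:
        # Pull a batch of documents and their hand-assigned labels for training.
        cur.execute("SELECT url, body, label FROM posts "
                    "WHERE label IS NOT NULL LIMIT 1000")
        for url, body, label in cur.fetchall():
            # Hand off to whatever feature extraction / learning code you use.
            print(url, label, len(body.split()))
finally:
    conn.close()
```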
|
Cory's answer reminded me: of course you need an Apache Solr (HTTP / RESTful interface for Lucene) instance lying around. It is very nice for fast lookups in your collection, including fuzzy searches; you get access to plenty of interesting stats on the tokens of your data using Luke; and you can perform similarity queries using MoreLikeThisQuery. There is also an n-gram token filter named shingle. And if you don't like the verbosity of Java, either use the REST interface or the native JVM interface from Scala or Clojure. EDIT: However, AFAIK the data held in a Lucene/Solr index is not map-reduceable on a Hadoop cluster for batch processing, so keeping a raw filesystem/HDFS copy of the original data is always a good idea.

+1, I have gotten used to MongoDB for canonical data storage and Solr for indexing. I run my ML/NLP algorithms against Mongo and then index the results in Solr. Rinse and repeat and you get good results.
(Aug 27 '10 at 11:18)
Andrew Montalenti
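For reference, querying Solr from Python over its HTTP interface looks roughly like this. The core name `blogs` and the `text` field are assumptions about your schema, and the MoreLikeThis handler has to be enabled in solrconfig.xml for the second query to work:

```python
import requests  # plain HTTP is enough; pysolr is an alternative client

SOLR = "http://localhost:8983/solr/blogs"  # hypothetical core name

# Keyword search; the trailing ~ enables fuzzy matching on the term.
resp = requests.get(f"{SOLR}/select",
                    params={"q": "text:classification~", "rows": 10, "wt": "json"})
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"))

# Similarity query via the MoreLikeThis handler (must be configured in solrconfig.xml).
mlt = requests.get(f"{SOLR}/mlt",
                   params={"q": "id:post-123", "mlt.fl": "text", "rows": 5, "wt": "json"})
print(mlt.json()["response"]["numFound"])
```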
|
|
I am using Mongo with good success at ~20GB. It is wonderfully easy to use and would be nice if you just wanted to store the full text of your URLs, perhaps with a few tags, for later processing. Caveat: it doesn't have any built-in facilities for full-text search; you have to roll your own using arrays/sets. However, if you wanted to transform to a vector model, it would probably be dumb to choose Mongo over a structured DB, just for performance reasons. And of course, there's good ol' Lucene. But then, you mentioned you don't like verbosity, and Java's the king of verbosity.

Well, that's the beauty of Scala: native Java library compatibility :)
(Jul 03 '10 at 19:50)
Carter Tazio Schonwald
If you like Scala and Scheme, you should definitely give Clojure a try. Also check out the Cascalog framework for map-reduce queries in a Clojure DSL on a Hadoop cluster.
(Jul 03 '10 at 20:24)
ogrisel
1
Clojure is amazing. One of my projects has literally 10x fewer lines than the Java equivalent.
(Jul 03 '10 at 20:25)
Cory Giles
I think I'll look into using Mongo or the like for tracking the metadata, thanks!
(Jul 05 '10 at 00:37)
Carter Tazio Schonwald
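A minimal sketch of the "store the full text plus a few tags in Mongo" idea from the answer above, using PyMongo (the database, collection, and field names are invented for illustration):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["blogcorpus"]["posts"]  # hypothetical database/collection names

# Store one document: raw HTML, extracted text, and the hand-assigned labels.
posts.insert_one({
    "url": "http://example.com/some-post",
    "html": "<html>...</html>",
    "text": "plain text of the post",
    "labels": ["politics", "2009"],
})

# Later, stream everything with a given label back out for training.
for doc in posts.find({"labels": "politics"}, {"text": 1, "labels": 1}):
    print(doc["_id"], doc["labels"])
```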
|
|
Depends on the size. For more than a couple hundred GB of data I would rather go for a format that is easily map-reduceable on a Hadoop cluster (which also gives you backups through redundancy for free if you have several machines). Maybe you could use the elephant-bird tooling by the Twitter folks, which adds very fast LZO compression that can be very interesting for text data (disclaimer: I have not tried it myself yet).

And what about the sub-100s of GB but still large-scale case? I'm not sure how big the dataset'll be, but I'm pretty sure it'll be below that scale.
(Jul 03 '10 at 19:08)
Carter Tazio Schonwald
1
Right now I am experimenting with the following: raw original text files on the FS, and reduced-dimension bag-of-words features in an HDF5 database using the PyTables library. So far so good.
(Jul 03 '10 at 19:23)
ogrisel
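A small sketch of the HDF5/PyTables side of that setup: storing a dense bag-of-words feature matrix plus the matching document ids (the file name, array names, and shapes are illustrative only):

```python
import numpy as np
import tables  # PyTables

# Pretend we already reduced each document to a 100-dimensional feature vector.
features = np.random.rand(1000, 100).astype(np.float32)
doc_ids = np.array([f"doc{i:05d}".encode() for i in range(1000)])

with tables.open_file("features.h5", mode="w") as h5:
    h5.create_array(h5.root, "features", features, title="bag-of-words features")
    h5.create_array(h5.root, "doc_ids", doc_ids, title="document ids")

# Reading back only a slice, without loading the whole matrix into RAM.
with tables.open_file("features.h5", mode="r") as h5:
    first_ten = h5.root.features[:10]
    print(first_ten.shape)
```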
@ogrisel, I've used elephant-bird with Hadoop with success, although it was on large-scale graph data.
(Jul 05 '10 at 17:16)
Delip Rao
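If the data does end up on a Hadoop cluster, the lowest-friction way to run Python over it is Hadoop Streaming; a token-count mapper/reducer pair would look something like the sketch below (the file name, jar path, and input/output paths are hypothetical):

```python
#!/usr/bin/env python
# streaming_wordcount.py -- run as "streaming_wordcount.py map" for the mapper
# and "streaming_wordcount.py reduce" for the reducer under Hadoop Streaming.
import sys
from itertools import groupby

def mapper():
    # Emit (token, 1) for every whitespace token read from stdin.
    for line in sys.stdin:
        for token in line.split():
            print(f"{token.lower()}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so equal tokens arrive consecutively.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for token, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{token}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Example invocation (jar path and HDFS paths are hypothetical):
    #   hadoop jar hadoop-streaming.jar -files streaming_wordcount.py \
    #     -mapper "streaming_wordcount.py map" -reducer "streaming_wordcount.py reduce" \
    #     -input /corpus -output /token_counts
    mapper() if sys.argv[-1] == "map" else reducer()
```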
|
Since we're getting answers involving MapReduce, question: is parallelization an option for you? Do you have a cluster? (My own opinion is that for a < 100GB problem, networked MapReduce is more trouble than it's worth unless you're doing deep parsing.)
Umm, I don't think there's a cluster in this project's future (unless the employer springs for it, which seems unlikely). At least at first I'll just be using my MacBook + 4GB of RAM.
So it is perhaps worth delineating between: works on your laptop, works on a big server, works on a small cluster, and "I am Jeff Bezos, I own EC2" levels of machines :)
I'm really liking the remarks people are putting down thus far, though I'll wait till after the weekend holiday stuff to select the "accepted answer".