Hi All,

So I have very large lexicons that need to be stored efficiently but, more importantly, queried efficiently. The lexicons are used in a number of machine learning and NLP processes with an extremely high lookup rate. They don't change too often, so a rebuild is acceptable.

I'm currently reviewing two solutions: Compressed Suffix Arrays and D. J. Bernstein's cdb (constant database).

My question: what do you use, or what would you suggest, as the "best" lexicon storage format?

Thanks,

Christopher

asked Dec 15 '10 at 21:56

communicating

What exactly are you storing? A list of words? What do the queries look like?

(Dec 16 '10 at 19:12) Alexandre Passos ♦

4 Answers:

I've worked with language processing a little bit (~8-12 months of naive Bayes for document classification, using an SQL backend for sample storage), so my exposure is limited. However, I can offer advice from a data warehousing perspective, which I do have a lot of experience with, so I hope it helps.

While investigating database tools for storing and retrieving information rapidly in a format that allows high-speed analysis, the best solutions I found were vertical column stores (for data warehousing purposes, of course). Most DBMS systems use a row-store layout, where all the columns of a row are stored together as a single tuple. Column stores do the opposite, in order to capitalize on columnar analyses.

For example, in warehouses I've built previously it wasn't at all unheard of to have hundreds of columns in a single table, yet any given query usually reads only 2-10 of them. When the columns are stored individually, the engine can read and analyze rows faster because it does fewer seeks.

Additionally, many column-store systems implement tricks for storing and extracting summary information (e.g., sums, averages, rolling sums/averages) from chunks of a column, but those sorts of performance enhancements probably wouldn't help much for lexical analysis.
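To illustrate the layout difference, here is a minimal Java sketch (the class and field names are hypothetical, not taken from any particular system): a row store keeps every field of a record together, while a column store keeps each column in its own array, so a query that touches only one column scans only that array.

    import java.util.ArrayList;
    import java.util.List;

    public class StorageLayouts {

        // Row store: every column of a record is stored together as one object/tuple.
        static class RowRecord {
            String term;
            long frequency;
            double weight;
            // ... potentially hundreds more columns
        }

        // Column store: each column lives in its own contiguous array.
        static class ColumnTable {
            List<String> term = new ArrayList<>();
            List<Long> frequency = new ArrayList<>();
            List<Double> weight = new ArrayList<>();

            // A query that only needs 'frequency' scans a single array
            // and never touches the other columns.
            long sumFrequency() {
                long total = 0;
                for (long f : frequency) {
                    total += f;
                }
                return total;
            }
        }
    }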

-Brian

This answer is marked "community wiki".

answered Dec 16 '10 at 11:02

Brian Vandenberg

edited Dec 16 '10 at 13:57

If the dictionary can fit in memory, I usually just use a Java HashMap with Java's native serialization: super fast and no external libraries. For larger data I generally use Berkeley DB or memcached, disk-resident key/value stores that are ultra fast.
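For the in-memory case, a minimal sketch of the HashMap-plus-native-serialization approach might look like the following (the file name and value type are just placeholders):

    import java.io.*;
    import java.util.HashMap;

    public class LexiconCache {

        // Write the in-memory lexicon to disk using Java's native serialization.
        static void save(HashMap<String, Integer> lexicon, File file) throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(lexicon);
            }
        }

        // Load it back; the cast is unchecked but safe if we wrote the map ourselves.
        @SuppressWarnings("unchecked")
        static HashMap<String, Integer> load(File file)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(file))) {
                return (HashMap<String, Integer>) in.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            HashMap<String, Integer> lexicon = new HashMap<>();
            lexicon.put("example", 42);
            File file = new File("lexicon.ser");   // hypothetical path
            save(lexicon, file);
            System.out.println(load(file).get("example"));
        }
    }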

answered Dec 17 '10 at 00:03

downer

Hi,

Another possibility is to use a key-value database (or something similar from the so-called "NoSQL" camp). I have used Redis in the past, which offers various commands (and data structures such as lists, sets, sorted sets, etc.) in addition to the simple set/get commands, and I was very satisfied with it.

You can find a good tutorial on its capabilities here. Redis can be used from a variety of languages, and there's even an R package.
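As a rough sketch of what the usage looks like from Java, assuming the Jedis client library and a Redis server running on localhost (the key names are hypothetical):

    import redis.clients.jedis.Jedis;

    public class RedisLexiconExample {
        public static void main(String[] args) {
            // Assumes a Redis server on localhost:6379 and the Jedis client on the classpath.
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Simple key/value lookup.
                jedis.set("lexicon:term:example", "42");
                String value = jedis.get("lexicon:term:example");

                // A set can hold the lexicon itself; SISMEMBER is a constant-time membership test.
                jedis.sadd("lexicon:terms", "example", "another-term");
                boolean present = jedis.sismember("lexicon:terms", "example");

                System.out.println(value + " " + present);
            }
        }
    }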

Stelios

answered Dec 16 '10 at 06:22

Stelios Sfakianakis

edited Dec 16 '10 at 08:55

Thanks for the feedback. :)

I'm storing strings, and the lookups are simple matching queries (with support for fuzzy operators). The database is bigger than memory.

I use a column store in my application now, but it's not ideally suited to this use case. Thanks for the detailed answer, though; it gives me something to think about.

Christopher

answered Dec 17 '10 at 22:29

communicating
