Hi All, I have very large lexicons that need to be stored efficiently and, more importantly, queried efficiently. The lexicons are used in a number of machine learning and NLP processes with an extremely high lookup rate. They don't change too often, so a full rebuild is acceptable. I'm currently reviewing two solutions: Compressed Suffix Arrays and D. J. Bernstein's cdb. My question: what do you use and/or suggest as the "best" lexicon storage format? Thanks, Christopher
Hi, another possibility is to use a key-value database (or something similar from the so-called "NoSQL" camp). I have used Redis in the past, which offers various commands (and data structures such as lists, sets, sorted sets, etc.) in addition to the simple set/get commands, and I was very satisfied with it. You can find a good tutorial of its capabilities here. Redis can be used from a variety of languages, and there is even an R package. Stelios
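For illustration, here is a minimal sketch of lexicon membership checks against a Redis set, written with the Jedis Java client. Jedis itself, the host/port, and the key name "lexicon" are assumptions for the example, not anything from the post; any Redis client exposes the same SADD/SISMEMBER commands.

    import redis.clients.jedis.Jedis;

    public class RedisLexiconSketch {
        public static void main(String[] args) {
            // Host, port, and the key name "lexicon" are illustrative assumptions.
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Load a few entries into a Redis set.
                jedis.sadd("lexicon", "apple", "banana", "cherry");

                // Membership lookup; SISMEMBER runs in O(1) on the server.
                boolean hit = jedis.sismember("lexicon", "banana");
                System.out.println("banana in lexicon? " + hit);
            }
        }
    }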
I've only worked with language processing a little (~8-12 months of naive Bayes document classification, using an SQL backend for sample storage), so my exposure there is limited. However, I can offer advice from a data warehousing perspective; that is something I do have a lot of experience with, so I hope it helps.

While investigating database tools for storing and retrieving information rapidly in a format that allows high-speed analysis, the best solutions I found were vertical column stores -- for DWH purposes, of course. Most DBMS systems use a row-store layout where all the columns of a row are stored together as a single vector/tuple of data. Vertical stores do the opposite in order to capitalize on columnar analyses. For example, in warehouses I've built previously it wasn't at all unheard of to have hundreds of columns in a single table, yet any given query usually reads only 2-10 of them. Because each column is stored individually, the engine can scan and analyze the relevant data faster, since it performs far fewer seeks; see the sketch below.

Additionally, many column-store systems implement tricks for storing/extracting information from chunks of a column when you only need summary information (e.g. sum, avg, rolling sums/averages, etc.), but those sorts of performance enhancements probably wouldn't help much for lexical analysis. -Brian
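To make the row-store vs. column-store contrast concrete, here is a toy in-memory sketch in Java; it is not modeled on any particular DBMS, and the table and values are made up. The columnar layout lets the aggregate scan a single array instead of walking whole records.

    import java.util.Arrays;

    public class ColumnVsRowDemo {
        public static void main(String[] args) {
            // Row store: every record keeps all of its columns together,
            // so a query that only needs "score" still walks whole records.
            String[][] rows = {
                {"doc1", "sports",  "0.91"},
                {"doc2", "finance", "0.34"},
                {"doc3", "sports",  "0.77"},
            };
            double rowSum = 0;
            for (String[] row : rows) {
                rowSum += Double.parseDouble(row[2]); // steps over id and topic
            }

            // Column store: each column lives in its own contiguous array,
            // so summing "score" scans one array and never touches the rest.
            double[] scores = {0.91, 0.34, 0.77};
            double colSum = Arrays.stream(scores).sum();

            System.out.println(rowSum + " == " + colSum);
        }
    }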
If the dictionary can fit in memory, I usually just use a Java HashMap with Java's native serialization: super fast and no external libraries. For larger data I generally use BerkeleyDB or memcached, ultra-fast key/value stores (BerkeleyDB disk-resident, memcached in-memory).
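As a minimal sketch of the in-memory approach described above, using only the JDK (the file name and sample entries are illustrative):

    import java.io.*;
    import java.util.HashMap;

    public class LexiconSnapshot {
        // Write the map to disk using Java's native serialization.
        static void save(HashMap<String, Integer> lexicon, File file) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(lexicon);
            }
        }

        // Read the serialized map back; lookups are then plain HashMap gets.
        @SuppressWarnings("unchecked")
        static HashMap<String, Integer> load(File file) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (HashMap<String, Integer>) in.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            HashMap<String, Integer> lexicon = new HashMap<>();
            lexicon.put("apple", 1);
            lexicon.put("banana", 2);

            File snapshot = new File("lexicon.ser"); // illustrative file name
            save(lexicon, snapshot);
            System.out.println(load(snapshot).get("banana")); // prints 2
        }
    }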
Thanks for the feedback. :) I'm storing strings, and the lookups are simple matching queries (with support for fuzzy operators). The DB is bigger than memory. I use a column store in my application now, but it's not ideally suited to this use case. Thanks for the detailed answer, though; it gives me something to think about. Christopher
What exactly are you storing? A list of words? What do the queries look like?