Let's say you have a corpus of journal papers or textbooks, and you want to classify whether a given resource is relevant to your current field of research. This would be a semi-supervised learning application, with the classifier starting from your current body of research: the papers you've published, the textbooks you've read, etc. The point is that you have a clear 'initial' set of sources against which the classifier compares each new instance to decide whether it falls within the 'relevant' or 'irrelevant' class.

Because this is semi-supervised, there'd be a training set with journal papers, textbooks, etc. that are classified as relevant, and others that are tagged as irrelevant. Assuming we have a set of features, and assuming the classification method is robust, the exercise is pretty trivial for small datasets that will fit in memory.

Now, let's consider that you have an extremely large database of raw data that you're looking to pre-process for easy classification later on. Do any of you happen to know of any relational database modeling tips for NLP applications?

Specifically, what are some best practices for storing data for use as features? If you're using something like a bag-of-words model to represent a document, could you have a dictionary table and then a junction table between entries in the dictionary and documents in the corpus? I realize that for a large corpus with millions of records and thousands of words per record, you're talking billions of entries in the junction table. But, with indexing, would there be any major performance penalties for such a representation?
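To make the dictionary-plus-junction-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (document, dictionary, doc_term) and the whitespace tokenizer are assumptions for illustration only, and a real corpus would need bulk loading rather than row-by-row inserts. The junction table is keyed on (doc_id, term_id) and has a second index on (term_id, doc_id), so lookups in either direction can be served from an index.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative.
SCHEMA = """
CREATE TABLE document   (doc_id  INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE dictionary (term_id INTEGER PRIMARY KEY, term TEXT UNIQUE NOT NULL);
CREATE TABLE doc_term (
    doc_id  INTEGER NOT NULL REFERENCES document(doc_id),
    term_id INTEGER NOT NULL REFERENCES dictionary(term_id),
    PRIMARY KEY (doc_id, term_id)
);
-- Second index so "which documents contain this term?" is also indexed.
CREATE INDEX doc_term_by_term ON doc_term (term_id, doc_id);
"""


def build_index(docs):
    """docs is a list of (title, text) pairs; returns an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for doc_id, (title, text) in enumerate(docs, start=1):
        conn.execute("INSERT INTO document VALUES (?, ?)", (doc_id, title))
        # Naive whitespace tokenization; set() keeps one row per (doc, term).
        for term in set(text.lower().split()):
            conn.execute(
                "INSERT OR IGNORE INTO dictionary (term) VALUES (?)", (term,)
            )
            (term_id,) = conn.execute(
                "SELECT term_id FROM dictionary WHERE term = ?", (term,)
            ).fetchone()
            conn.execute("INSERT INTO doc_term VALUES (?, ?)", (doc_id, term_id))
    return conn


def docs_containing(conn, term):
    """Return the titles of documents containing the term, in doc_id order."""
    rows = conn.execute(
        """
        SELECT d.title
        FROM dictionary AS t
        JOIN doc_term AS dt ON dt.term_id = t.term_id
        JOIN document AS d ON d.doc_id = dt.doc_id
        WHERE t.term = ?
        ORDER BY d.doc_id
        """,
        (term,),
    )
    return [row[0] for row in rows]
```

Whether this holds up at billions of junction rows depends on the engine, the hardware, and the access pattern, so treat it as a starting point for measurement rather than an answer.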

Going a step further, we could have another junction table between the dictionary and document indicating the term frequency within that document, and yet another for POS tags (either n-gram- or HMM-based; I'm not sure which is more accurate for POS tagging). Again, these would yield many billions of records, but at least in the term frequency case, it enables us to calculate the tf-idf with one SELECT statement (with a lot of inner joins and SQL magic to get the tf and the idf, and then to compute the product of the two).
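As a rough sketch of the "one SELECT" tf-idf, assuming a single hypothetical term_freq(doc_id, term, freq) table and the common variant tf × ln(N / df) (other weightings exist). Not every SQLite build ships a logarithm function, so the example registers Python's math.log under the made-up name py_ln; a server RDBMS would normally have its own.

```python
import math
import sqlite3
from collections import Counter

# tf comes straight from the table; df and N are derived in subqueries.
TFIDF_SQL = """
SELECT tf.doc_id,
       tf.term,
       tf.freq * py_ln(1.0 * n.total / df.doc_count) AS tfidf
FROM term_freq AS tf
JOIN (SELECT term, COUNT(*) AS doc_count
      FROM term_freq GROUP BY term) AS df
  ON df.term = tf.term
CROSS JOIN (SELECT COUNT(DISTINCT doc_id) AS total FROM term_freq) AS n
ORDER BY tf.doc_id, tf.term
"""


def tfidf_rows(docs):
    """docs is a list of raw text strings; returns (doc_id, term, tfidf) rows."""
    conn = sqlite3.connect(":memory:")
    # py_ln is a name chosen for this sketch, backed by math.log.
    conn.create_function("py_ln", 1, math.log)
    conn.execute(
        "CREATE TABLE term_freq ("
        " doc_id INTEGER, term TEXT, freq INTEGER,"
        " PRIMARY KEY (doc_id, term))"
    )
    for doc_id, text in enumerate(docs, start=1):
        counts = Counter(text.lower().split())  # raw term counts per document
        conn.executemany(
            "INSERT INTO term_freq VALUES (?, ?, ?)",
            [(doc_id, term, freq) for term, freq in counts.items()],
        )
    return conn.execute(TFIDF_SQL).fetchall()
```

With this weighting, a term that appears in every document gets a score of zero, since ln(N / N) = 0.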

This is all well and good for relatively simple features, such as the tf, tf-idf, document length, word senses, etc. It gets a little more complicated when you're using vectorial semantics techniques, like LSA. I guess the normal solution would be to create an LSA table of three columns (corresponding to the matrix row number, column number, and cell value), with the cell value of each record describing the occurrence of the term in col-1 within the document in col-2. Indexing on col-1 would give, for each term, the documents it occurs in, and indexing on col-2 would give, for each document, the terms it contains; the dot product between two term rows then gives the correlation between those terms over the documents. Another table, LSA-1, could be the matrix product (LSA)(LSA') (where ' indicates matrix transpose), and another table, LSA-2, could be the matrix product (LSA')(LSA). The SVD of LSA would provide the eigenvectors of LSA-1 (the left singular vectors) and LSA-2 (the right singular vectors). I'm not really sure if this is the best way to store the data, and if you keep a large number of dimensions (a large value of k, in the thousands), you'd be required to do a pretty expensive computation.
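For what it's worth, here is a small sketch of the (row, column, value) representation and its link to the SVD, using NumPy on a dense toy matrix. The function names are made up for illustration, and a dense array is only workable at toy scale; a real corpus would need a sparse matrix and an iterative truncated solver (for example scipy.sparse.linalg.svds).

```python
import numpy as np


def dense_from_triplets(triplets, n_terms, n_docs):
    """Build a term-by-document matrix from (term_idx, doc_idx, value) rows."""
    matrix = np.zeros((n_terms, n_docs))
    for term_idx, doc_idx, value in triplets:
        matrix[term_idx, doc_idx] = value
    return matrix


def truncated_svd(matrix, k):
    """Keep the k largest singular triplets of the matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


if __name__ == "__main__":
    # Toy data: 3 terms, 2 documents, stored as rows of a triplet table.
    triplets = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0), (2, 0, 1.0)]
    a = dense_from_triplets(triplets, n_terms=3, n_docs=2)
    u, s, vt = truncated_svd(a, k=2)

    # Columns of u are eigenvectors of A A' with eigenvalues s**2 ...
    print(np.allclose(a @ a.T @ u, u * s**2))
    # ... and rows of vt are eigenvectors of A' A with the same eigenvalues.
    print(np.allclose(a.T @ a @ vt.T, vt.T * s**2))
```

Since the singular vectors already carry the eigenvectors of both products, the LSA-1 and LSA-2 tables may not need to be materialized at all.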

Anyway, these are just a couple ideas I had, and I'd really appreciate any additional insight that the community can provide, specifically in the form of database best practices.

asked Sep 06 '11 at 09:09

kmore

Right now, I'm debating using either a NoSQL or RDBMS database engine to store data like this. One of my primary tasks is in the field of smart query searching of a document with several distinct sections: abstract, references, and body of text (which itself is a collection of sections, paragraphs, sentences, and words). This type of hierarchical representation is better suited for a document store or graph database, but the trouble arises when you want to do anything with that data.

For example, in query searching/document clustering, things like tf-idf, LSA, pLSA, LDA, and semantic hashing come to mind. They all have intermediate data generated by means of data preprocessing, and this information would be valuable to store with each document. Having something like a dictionary table allows you to maintain a list of all words and their distinct senses, hypernyms, synonyms, parts of speech, etc. within a single table, and have each document point to records in that table. Maintaining a junction table (in the RDBMS solution) that indexes on both the foreign key to the dictionary table and the foreign key to the document table would make it pretty simple to quickly find which words are within a document. Moreover, it would let you go a step further, being able to quickly retrieve nouns and other parts of speech that you deem important for later studies.
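A rough sketch of what I mean by retrieving nouns through the junction table, again with sqlite3. The schema, the coarse tag names ("NOUN", "VERB"), and the hand-tagged tokens are all assumptions for illustration; the tagger itself is out of scope here.

```python
import sqlite3


def make_db():
    """Create a hypothetical dictionary table with one row per (term, POS)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dictionary (
            term_id INTEGER PRIMARY KEY,
            term    TEXT NOT NULL,
            pos     TEXT NOT NULL,
            UNIQUE (term, pos)
        );
        CREATE TABLE doc_term (
            doc_id  INTEGER NOT NULL,
            term_id INTEGER NOT NULL REFERENCES dictionary(term_id),
            PRIMARY KEY (doc_id, term_id)
        );
        CREATE INDEX doc_term_by_term ON doc_term (term_id, doc_id);
        """
    )
    return conn


def add_tagged_document(conn, doc_id, tagged_tokens):
    """tagged_tokens is an iterable of (term, pos) pairs from some tagger."""
    for term, pos in set(tagged_tokens):
        conn.execute(
            "INSERT OR IGNORE INTO dictionary (term, pos) VALUES (?, ?)",
            (term, pos),
        )
        (term_id,) = conn.execute(
            "SELECT term_id FROM dictionary WHERE term = ? AND pos = ?",
            (term, pos),
        ).fetchone()
        conn.execute("INSERT INTO doc_term VALUES (?, ?)", (doc_id, term_id))


def terms_with_pos(conn, doc_id, pos):
    """Return the terms of one document that carry the given POS tag."""
    rows = conn.execute(
        """
        SELECT t.term
        FROM doc_term AS dt
        JOIN dictionary AS t ON t.term_id = dt.term_id
        WHERE dt.doc_id = ? AND t.pos = ?
        ORDER BY t.term
        """,
        (doc_id, pos),
    )
    return [row[0] for row in rows]
```

Senses, hypernyms, and synonyms could hang off the same term_id in further tables without touching the junction table.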

In the RDBMS solution, it seems easy to accomplish something like that. However, in something like CouchDB, you'd have to re-compile views each time that you add new data (for me, it'd be maybe 10,000 records weekly, so not too many).

These are just some opening thoughts, so hopefully someone with more experience could jump in.

(Sep 07 '11 at 03:38) kmore

One Answer:

I won't bash traditional DB methods for accomplishing a task like this, but in my experience, I have had many problems making them work. What most people have done historically is use text files and unix-based tools, which I have found both powerful and intuitive. The great thing with the text file + unix tool setup is that it is very flexible while being efficient.

If you are willing to break some rules that DB people hold as sacred (normal forms, etc), then you can get a lot of simplicity and performance. The downside is that you lose robustness for transaction systems with mutable data. If you plan to use your data in a read-only (semi-immutable) fashion, I would recommend text files + unix tools even more.

Granted, I haven't really answered your question, but you have a difficult problem with lots of details, and you are the best one to answer it. I would recommend learning awk, and spending some time thinking about all that you can do with it. Store everything in text files. Learn sort, uniq, tr, wc, python/perl, and some shell (which are mostly relatively small, dare I say simple, tools). Text files are your friend :)
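To give a flavor of the text-file approach, here is a minimal Python sketch that streams a file one line at a time and counts document frequencies, roughly what a tr/sort/uniq -c pipeline would produce. The one-document-per-line layout and the function name are assumptions for illustration.

```python
from collections import Counter


def document_frequencies(lines):
    """Count, for each term, how many lines (documents) contain it.

    `lines` can be any iterable of strings, including an open file, so the
    corpus is streamed and never has to fit in memory all at once.
    """
    df = Counter()
    for line in lines:
        # set() so a term repeated within one document is counted once.
        df.update(set(line.lower().split()))
    return df


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python df.py corpus.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        for term, count in document_frequencies(handle).most_common(10):
            print(count, term)
```

Only the running counts are held in memory, so memory grows with the vocabulary rather than with the size of the corpus.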

answered Sep 07 '11 at 12:39

Travis Wolfe

Using text files is a great thought, and I imagine it finds its use when memory is large enough to hold the corpus. However, disk IO operations are very expensive, even if you're using an SSD. I'm just not sure if text files are scalable to the millions-of-documents case.

When doing NLP, of course the data set is immutable; however, the intermediate results are not, and in many cases, pre-processing limits the number of calculations you'd have to repeat. Storing those results turns what, at first glance, seems like immutable data into highly mutable data, and the subsequent IO operations will grind your ML process to a halt.

(Sep 08 '11 at 10:03) kmore

IO will not necessarily kill you. If you do your experiments in batch, and there is a non-trivial amount of work being done on the CPU, then you will be amazed at how little difference it makes. About having "duplicate" representations in text files, you are right, you can end up with a lot of files if you are not careful. If you need to do a ton of different things with your data, then you may want to avoid text files, but if you only have a few tasks you can usually find a representation or two (in text) that will support everything you need to do.

(Sep 08 '11 at 10:07) Travis Wolfe

I really detest the use of text files for this type of work, especially when the results are likely to be queried in complex ways. The problem of storing vast quantities of complex data and maintaining data integrity, while also allowing sophisticated query and retrieval, was solved 20 years ago, and it's called an RDBMS. With implementations like MySQL/PostgreSQL that are dead-simple to set up, there's really very little reason not to use one.

(Sep 08 '11 at 14:28) Cerin

@cerin, "data integrity" is the key point of your comment. if you have to seriously worry about data integrity, i agree, use an RDBMS. querying the data is something you do with a database, no question about it. if however, you just need to read your data, either to translate it or to feed it to some kind of machine learning, then I think you are better off with text files. the bottom line is, for many people, myself included, it is very difficult to understand what an RDBMS is doing at any given time, and for large data they can be difficult to optimize.

(Sep 08 '11 at 14:36) Travis Wolfe

@Travis Certainly if you can understand what the algorithms that you're applying are doing, you can understand (at whatever level's necessary) what your database is doing :)

(Sep 08 '11 at 21:17) kmore

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.