|
Say you are generating collocation information for a large corpus and counting the number of times two words appear together, so you end up with a matrix with a million rows and columns (one row and one column for each unstemmed word). My question is: what are the proper techniques for working with such a huge, sparse matrix, from basic sparse storage techniques to trickier ones? I am working in Python, so any words of wisdom (a.k.a. libraries) that help me work with this matrix more efficiently are definitely appreciated.
|
I suggest you try a dictionary-of-dictionaries approach. Searching a list for a value takes O(n) time, while Python dictionaries are hash tables and offer O(1) lookup on average.
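A minimal sketch of the dictionary-of-dictionaries idea, assuming co-occurrence means "both words appear in the same sentence" and that whitespace tokenization is good enough. The function names `count_cooccurrences` and `get_count` and the sample sentences are made up for illustration. Each unordered pair is stored only once (alphabetically smaller word as the outer key), which roughly halves memory compared to storing both directions.

```python
from collections import defaultdict
from itertools import combinations


def count_cooccurrences(sentences):
    """Count how often two distinct words appear in the same sentence."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Sorting makes every pair come out as (smaller, larger),
        # so each unordered pair is stored under a single key order.
        words = sorted(set(sentence.split()))
        for w1, w2 in combinations(words, 2):
            counts[w1][w2] += 1
    return counts


def get_count(counts, a, b):
    """Look up a pair regardless of argument order, without adding keys."""
    w1, w2 = sorted((a, b))
    return counts.get(w1, {}).get(w2, 0)


# Hypothetical toy corpus.
sentences = ["the cat sat", "the cat ran", "a dog ran"]
counts = count_cooccurrences(sentences)
print(get_count(counts, "the", "cat"))
print(get_count(counts, "dog", "the"))
```

Only pairs that actually occur take up memory, which is what makes this workable for a mostly empty million-by-million matrix.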
This answer is marked "community wiki".
That's my usual approach too. You can also try interning the words in your dictionaries to save some more space (though some say this optimization already happens in Python; I don't know the details).
(Jul 09 '10 at 07:13)
Amaç Herdağdelen
|
|
scipy.sparse has many classes for dealing with this sort of matrix. The idea is that you first build the matrix in the LIL (list of lists) representation, lil_matrix, and then convert it to a compact format (there is more than one option, e.g. CSR or CSC) for fast access when using it.
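A rough sketch of that build-then-convert workflow, assuming words have already been mapped to integer ids. The `vocab` mapping and the `pairs` list are invented toy data; a real vocabulary would have on the order of a million entries.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Hypothetical word -> row/column id mapping.
vocab = {"the": 0, "cat": 1, "sat": 2, "ran": 3}
n = len(vocab)

# LIL is cheap to modify incrementally, so use it while counting.
m = lil_matrix((n, n), dtype=np.int32)

# Hypothetical stream of observed co-occurring word pairs.
pairs = [("the", "cat"), ("the", "cat"), ("cat", "sat"), ("cat", "ran")]
for w1, w2 in pairs:
    i, j = vocab[w1], vocab[w2]
    m[i, j] += 1
    m[j, i] += 1  # keep the matrix symmetric

# CSR is compact and gives fast row slicing and arithmetic.
csr = m.tocsr()
cat_row = csr.getrow(vocab["cat"])  # 1 x n sparse row
print(cat_row.toarray())
```

CSR suits row lookups; if you mostly slice by column, `tocsc()` is the analogous conversion.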
This answer is marked "community wiki".
|
|
I have a similar problem, but in my scenario I need to write such a matrix once and then read it many times on a memory-limited web server. Instead of using a DB-based approach, I created one dict for each row of the matrix and pickle-dumped each dict as a line in a plain data file. I also keep a separate poor man's index file that stores the file position of the line for each word. That index can be kept in memory as a dict of words -> file positions. When I have to look up a vector, I look up its position, seek to the corresponding byte of the data file, read one line, and unpickle it. Something like that lets me index the lines of a big matrix file that contains each row's label as the first field.
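The write-once, read-many scheme could look roughly like this. It is a variant under stated assumptions rather than the exact code described above: the pickles are written back to back in binary mode and located purely by byte offset, instead of being stored one per text line, since binary pickles may themselves contain newline bytes. The names `write_rows` and `read_row`, the file name, and the sample rows are hypothetical.

```python
import os
import pickle
import tempfile


def write_rows(rows, data_path):
    """Dump each row dict to data_path; return a word -> byte offset index."""
    index = {}
    with open(data_path, "wb") as f:
        for word, row in rows.items():
            index[word] = f.tell()  # where this row's pickle starts
            pickle.dump(row, f, protocol=pickle.HIGHEST_PROTOCOL)
    return index


def read_row(word, index, data_path):
    """Seek to the stored offset and load exactly one pickled row."""
    with open(data_path, "rb") as f:
        f.seek(index[word])
        return pickle.load(f)


# Hypothetical sparse rows: word -> {co-occurring word: count}.
rows = {"cat": {"the": 2, "sat": 1}, "dog": {"ran": 1}}
data_path = os.path.join(tempfile.mkdtemp(), "rows.pkl")

index = write_rows(rows, data_path)
cat_row = read_row("cat", index, data_path)
print(cat_row)
```

Only the small `index` dict has to stay in memory; it can itself be pickled to a separate file and loaded when the server starts.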
Why not use:
(Jul 09 '10 at 10:07)
DirectedGraph
|