Say you are generating collocation information for a large corpus. And you are counting the number of times two word appear together so you will have a matrix with million rows and columns (we will have a row and column for each unstemmed word). My question is what are the proper techniques to work with this huge and sparse matrix (from basic sparse storage techniques to more tricky ones)

I am working with Python so any words of wisdom (a.k.a library) that helps me work with this matrix more efficiently is definitely appreciated.

asked Jul 08 '10 at 20:02

Mark%20Alen's gravatar image

Mark Alen
1118233442


3 Answers:

I suggest you try dictionaries of dictionary approach. Since list have an O(n) lookup time, while Dictionaries offer O(log n) lookup time.
NLTK has sparse vector encoding which is implemented as a python dict

This answer is marked "community wiki".

answered Jul 09 '10 at 01:29

DirectedGraph's gravatar image

DirectedGraph
54531422

edited Jul 09 '10 at 01:57

That's my usual approach too. You can also try to intern the words in you dictionaries to try to save some more space (though some say this optimization already happens in Python, I don't know the details).

(Jul 09 '10 at 07:13) Amaç Herdağdelen

scipy.sparse has many classes to deal with this sort of matrix. The idea is that you first use their lol (list of lists) representation to build the matrix, and then convert to a compact (there is more than one option) when using it, for fast access.

This answer is marked "community wiki".

answered Jul 08 '10 at 20:28

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1899744214335

I have a similar problem but in my scenario, I need to write once such a matrix and than read it many times in a memory-limited web-server. Instead of using a DB-based approach, I created one dict for each row in the matrix and pickle-dumped the dict as a line in a plain data file. I also keep a separate poor man's index file which keeps the file positions of the lines for each word. That index can be kept in memory as a dict of words->file positions. When I have to lookup a vector, I lookup the position, seek to the current byte of data file and read one line and pickle.load it.

Something like that helps me to index the lines of a big matrix file which contains the row's label as the first field.

def index_write(data_file):        
    f = open(data_file)
    pos = f.tell()
    n = 0
    while True:
        line = f.readline().strip() 
        if line == "":
            break
        n += 1            
        key = line.split()[0] # First field is the row label
        print key,pos            
        pos = f.tell() # read current position

answered Jul 09 '10 at 07:22

Ama%C3%A7%20Herda%C4%9Fdelen's gravatar image

Amaç Herdağdelen
1763813

Why not use:
for line in file(data_file):

(Jul 09 '10 at 10:07) DirectedGraph
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.