|
I am working on a text mining project for which I need to build a weighted co-occurrence network where words are nodes, and edge weight between the two nodes is defined as the number of songs they co-occur in (which is equivalent to computing the Euclidean or cosine similarity between the columns corresponding to the two words in the Document Term Matrix(DTM). What I would like to understand is, what should be the ideal data structure which should be used to develop the network and is there anywhere I can read about the algorithm to build one? Also, is there any python/Java package which provides similar functionality out of the box? |
|
I'd suggest using a hashmap in which keys are words and values are lists of songs they occur in (ordered somehow, e.g. lexiconigraphically, so that cosine similarity can later be computed efficiently with sparse dot product. If your list of songs is large, this is vital). Words that appear in a sufficiently small number of songs can be disregarded, since cost is O(kn + mn^2), where k is the total number of songs, m is number of songs the average word appears in and n is number of words. pseudocode is as follows:
And once you have simMatrix, you can easily threshold it to turn it into a sparse network, or you can simply use it as an adjacency matrix, or you can turn it into an adjacency list easily. I have implemented this before, and I'd give you my code if I had access to it (though it isn't too clean tbqh). A library called CLAIRLIB for perl has all sorts of good NLP stuff. NLTK for python might also. Thank you Jacob for your help. I am slightly confused regarding a couple of things. Shouldn't the words be ordered lexiconigraphically instead of the songs so that I can store only half of the graph. Can you please explain the significance of sorting the list of songs for each word.
(Jun 05 '11 at 00:19)
aseembehl
1
(I'm aware k and m are the reverse of typical notation use below. humor me, I messed up above and want to stay consistent) SimMatrix only fills in the upper half of the triangle, its lower half is symetric. You can store it whatever way you want, but you don't need to sort it any particular way as long as words are consistently ordered with respect to row/column of the matrix. Sorting songs allows you to computer dot product (for similarity) very quickly if m (average number of songs each word appears in) << k (total set of songs). Instead of taking a dot product of k-length binary indicator vectors, you can just take two sparse m-length lists of words, put an iterator on each, increment the iterator that is lexiconigraphically behind the other, and add 1 to the co-occurence count if the iterators are on the same word. This gives an O(m) dot product instead of an O(k) dot product, although it has higher overhead typically. @Alexandre Passos below gives an algorithm that is more straight-forward than mine and easily suitable for any small/mid-scale dataset. Asymptotically, however, it is O(n*m^2) instead of O(kn + mn^2), although in this case that's almost certainly better (the approach I give is appropriate m > n, more words per document than documents per word, i.e. not song lyrics). If you give the number of songs in your dataset, I can give you more guidance, but unless its a very large number you should probably have no trouble using an algorithm already given.
(Jun 05 '11 at 01:08)
Jacob Jensen
Thanks, your comment makes the algorithm a lot more clear. I have around 6500 songs in my corpus, so I think either of the algorithms should be fine.
(Jun 05 '11 at 05:14)
aseembehl
|
|
The simplest thing is keeping a sparse matrix in memory (maybe a hash map of (string,string) pairs to counts) and iterating over the dataset, for each document, for each word in the document, for each other word in the document, incrementing the corresponding counts. Note that as the graph is undirected you only need to store one half of the full matrix. A simple way to ensure this is to use the lexicographic order, and store only count[i,j] for word is < j. +1 for sparse matrix representations
(Jan 11 at 13:12)
mcenley
|
|
Rather than using a HashMap of string pairs, I'd recommend using a sparse matrix format and a mapping from strings to unique indices. This should also make it a lot easier to compute similarities between words. I have a fairly scalable java library written for learning distributional semantics. We've included a lot of support for storing large co-occurrence matrices, which are essentially the same as your co-occurrence network. It's called the S-Space Package. The code for doing what you want using my package should look something like the following:
You can then do a lot of comparisons between vectors by using the methods in the
|