I am working on a text mining project for which I need to build a weighted co-occurrence network where words are nodes, and edge weight between the two nodes is defined as the number of songs they co-occur in (which is equivalent to computing the Euclidean or cosine similarity between the columns corresponding to the two words in the Document Term Matrix(DTM).

What I would like to understand is, what should be the ideal data structure which should be used to develop the network and is there anywhere I can read about the algorithm to build one?

Also, is there any python/Java package which provides similar functionality out of the box?

asked Jun 04 '11 at 10:14

aseembehl's gravatar image

aseembehl
568913

edited Jun 10 '11 at 05:09


3 Answers:

I'd suggest using a hashmap in which keys are words and values are lists of songs they occur in (ordered somehow, e.g. lexiconigraphically, so that cosine similarity can later be computed efficiently with sparse dot product. If your list of songs is large, this is vital). Words that appear in a sufficiently small number of songs can be disregarded, since cost is O(kn + mn^2), where k is the total number of songs, m is number of songs the average word appears in and n is number of words.

pseudocode is as follows:

HashMap<String,List<String>> wordMap 
for each song:
    tokenize set of words and create set of all included words 
         (include counts if you want to do tf-idf)
    for each word in set:
         wordMap.get(word).add(song)
for each word:
   wordMap.get(word).sort()
simMatrix = new float[n][n] (or a sparse matrix structure if n is too large)
for each word i = 1:n:
    for each word j = i+1:n:
         simMatrix[i][j] = innerProduct(wordMap.get(word i),wordMap.get(word j))

And once you have simMatrix, you can easily threshold it to turn it into a sparse network, or you can simply use it as an adjacency matrix, or you can turn it into an adjacency list easily. I have implemented this before, and I'd give you my code if I had access to it (though it isn't too clean tbqh). A library called CLAIRLIB for perl has all sorts of good NLP stuff. NLTK for python might also.

answered Jun 04 '11 at 12:29

Jacob%20Jensen's gravatar image

Jacob Jensen
1644285360

edited Jun 04 '11 at 12:34

Thank you Jacob for your help. I am slightly confused regarding a couple of things. Shouldn't the words be ordered lexiconigraphically instead of the songs so that I can store only half of the graph. Can you please explain the significance of sorting the list of songs for each word.

(Jun 05 '11 at 00:19) aseembehl
1

(I'm aware k and m are the reverse of typical notation use below. humor me, I messed up above and want to stay consistent) SimMatrix only fills in the upper half of the triangle, its lower half is symetric. You can store it whatever way you want, but you don't need to sort it any particular way as long as words are consistently ordered with respect to row/column of the matrix. Sorting songs allows you to computer dot product (for similarity) very quickly if m (average number of songs each word appears in) << k (total set of songs). Instead of taking a dot product of k-length binary indicator vectors, you can just take two sparse m-length lists of words, put an iterator on each, increment the iterator that is lexiconigraphically behind the other, and add 1 to the co-occurence count if the iterators are on the same word. This gives an O(m) dot product instead of an O(k) dot product, although it has higher overhead typically. @Alexandre Passos below gives an algorithm that is more straight-forward than mine and easily suitable for any small/mid-scale dataset. Asymptotically, however, it is O(n*m^2) instead of O(kn + mn^2), although in this case that's almost certainly better (the approach I give is appropriate m > n, more words per document than documents per word, i.e. not song lyrics). If you give the number of songs in your dataset, I can give you more guidance, but unless its a very large number you should probably have no trouble using an algorithm already given.

(Jun 05 '11 at 01:08) Jacob Jensen

Thanks, your comment makes the algorithm a lot more clear. I have around 6500 songs in my corpus, so I think either of the algorithms should be fine.

(Jun 05 '11 at 05:14) aseembehl

The simplest thing is keeping a sparse matrix in memory (maybe a hash map of (string,string) pairs to counts) and iterating over the dataset, for each document, for each word in the document, for each other word in the document, incrementing the corresponding counts. Note that as the graph is undirected you only need to store one half of the full matrix. A simple way to ensure this is to use the lexicographic order, and store only count[i,j] for word is < j.

answered Jun 04 '11 at 14:01

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

+1 for sparse matrix representations

(Jan 11 at 13:12) mcenley

Rather than using a HashMap of string pairs, I'd recommend using a sparse matrix format and a mapping from strings to unique indices. This should also make it a lot easier to compute similarities between words. I have a fairly scalable java library written for learning distributional semantics. We've included a lot of support for storing large co-occurrence matrices, which are essentially the same as your co-occurrence network. It's called the S-Space Package.

The code for doing what you want using my package should look something like the following:

import edu.ucla.sspace.basis.StringBasisMapping
import edu.ucla.sspace.matrix.YaleSparseMatrix

...

public void makeNetwork(List<String> sentences, int vocabSize) {
    StringBasisMapping basis = new StringBasisMapping();
    SparseMatrix network = new YaleSparseMatrix(vocabSize, vocabSize);
    for (String sentence : sentences) {
        //Somehow iterate over word pairs in your sentence as desired.
        ....
        String word1 = ...;
        String word2 = ...;
        int i = basis.getDimension(word1);
        int j = basis.getDimension(word2);
        network.set(i,j, 1+network.get(i,j));
    }
}

You can then do a lot of comparisons between vectors by using the methods in the edu.ucla.sspace.common.Similarity class. This includes things like Euclidean Distance, Cosine Similarity, Jaccard Similarity, and a slew of others. All you have to do is something like this:

public void compareRows(SparseMatrix sm, int row1, int row2) {
    return Similarity.cosineSimilarity(sm.getRowVector(row1), sm.getRowVector(row2));
}

answered Jan 19 at 16:17

Keith%20Stevens's gravatar image

Keith Stevens
4363820

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.