|
I have a CSV file with 4 million edges of a directed network representing people communicating with each other (e.g. John sends a message to Mary, Mary sends a message to Ann, John sends another message to Mary, etc.). I would like to do two things:
I would like to do this on the command line on a Linux server, since my laptop does not have much power. I have R and the statnet library installed on that server. I found this 2009 post by someone more competent than me who tried to do the same thing and ran into problems. So I was wondering if anyone has pointers on how to do this, preferably step by step, since I only know how to load the CSV file and nothing else. Just to give you an idea, this is what my CSV file looks like:
|
|
Degree should be easy. I don't know about betweenness. For eigenvector centrality, though, I did it for a similarly sized problem in this piece of Python code, which computes the first eigenvector for 5M edges that represent links between Wikipedia articles. I used both a randomized SVD approach and the more classical power iteration method (the method behind PageRank). Both were tractable on a single MacBook Pro (only one core is used). For larger datasets I would probably go for the distributed implementation of the Lanczos algorithm available in Apache Mahout, which requires setting up a Hadoop MapReduce cluster. |
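To make the degree and power iteration ideas concrete, here is a minimal sketch using only the Python standard library. It is not the code linked above. It assumes a header-less, two-column CSV of the form `sender,receiver`, which may not match your file, and the function names (`read_edges`, `degree_counts`, `pagerank`) as well as the damping factor of 0.85 are illustrative choices of mine. The PageRank-style damping is there because plain power iteration on a directed graph can collapse to zero or oscillate.

```python
import csv
import io
from collections import Counter


def read_edges(handle):
    """Yield (sender, receiver) pairs from a two-column CSV (assumed: no header)."""
    for row in csv.reader(handle):
        if len(row) >= 2:
            yield row[0].strip(), row[1].strip()


def degree_counts(edges):
    """Return (in_degree, out_degree) counters; repeated messages are counted."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg


def pagerank(edges, damping=0.85, max_iter=100, tol=1e-10):
    """Power iteration with PageRank-style damping (damping value is an assumption)."""
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = Counter(src for src, _ in edges)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(max_iter):
        # Nodes that never send anything spread their rank uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict.fromkeys(nodes, base)
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_weight[src]
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy data mirroring the example in the question; for a real file,
# pass open("edges.csv", newline="") instead (hypothetical file name).
SAMPLE = "John,Mary\nMary,Ann\nJohn,Mary\n"
edges = list(read_edges(io.StringIO(SAMPLE)))

in_deg, out_deg = degree_counts(edges)
rank = pagerank(edges)

print("in-degree:", dict(in_deg))
print("out-degree:", dict(out_deg))
for name in sorted(rank, key=rank.get, reverse=True):
    print(name, round(rank[name], 4))
```

Each iteration makes one pass over the edge list, so the cost grows linearly with the number of edges. For 4 million edges the edge list should fit in memory on a typical server, although pure Python will be noticeably slower than a sparse-matrix or igraph implementation.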
|
The igraph library in R should easily be able to handle this; a single machine with 8 GB of RAM should have no trouble with it. For details on how to load and process the dataset, see this post: http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html |