Following on this question,

I would like any advice on how to create a link map of blogs so as to reflect the "social network" between the bloggers.

Such a scraper/service would take a starting point of a blog or two, and start adding links and mapping the links between them.

What would you recommend for doing that?

(I am sure I once found a service that does that - but can't find it at the moment)

asked Jul 11 '10 at 04:44


Tal Galili

edited Jul 12 '10 at 06:34


Joseph Turian ♦♦

On what basis would you link two blogs? Because one links to the other? Or something less explicit?

(Jul 11 '10 at 08:23) Joseph Turian ♦♦

Hi Joseph. I imagine there are many ways to do so. I wonder what tools are available at the moment :)

(Jul 11 '10 at 14:22) Tal Galili

Have a look at the memetracker dataset; it is a good dataset to begin with.

(Jul 11 '10 at 15:12) DirectedGraph

8 Answers:

You can crawl through the blogs and build your network as you go: scrape a set of blogs (e.g. with BeautifulSoup in Python or RCurl/XML in R), get all the links from them, then scrape those links, and so on. You can use the links to build ties in your network. Or you can use other metadata (such as tags on the blogs) to form the ties. Ultimately, you could also apply a cluster analysis based on blogs that link to each other and have similar tags/keywords.

Another easy thing to use is the "blogroll" that many blogs use: this self-identifies other blogs that are within the particular blogger's network.
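The "get all the links" step might look roughly like the sketch below. It uses only Python's standard library rather than BeautifulSoup, and the names LinkCollector and outbound_hosts, as well as the example.* hosts, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def outbound_hosts(html, base_url):
    """Return the set of other hosts that the page links to."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    own_host = urlparse(base_url).netloc
    hosts = {urlparse(link).netloc for link in parser.links}
    # Drop links without a host (e.g. mailto:) and links back to the same blog.
    return {h for h in hosts if h and h != own_host}


sample = '<a href="/about">About</a> <a href="http://blog-b.example/post">B</a>'
print(outbound_hosts(sample, "http://blog-a.example/"))
```

Each returned host is a candidate tie from the scraped blog to another blog. Restricting the parse to the blogroll section would need site-specific markup rules, which this sketch does not attempt.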

answered Jul 11 '10 at 15:22

Shane


Hi Shane, I was considering simply doing it in R if all else fails (RCurl/XML). But since I have no experience with it, I was wondering whether someone else had tried it before me. Thanks :) Tal

(Jul 11 '10 at 15:33) Tal Galili

There aren't any tools that will do this out of the box unless you know the specific type of relationship you want the edges to capture. Here are a few that spring to mind:

  1. A blogger links to another blog on its "blogroll"; this is a binary attribute reflecting one blogger recommends another.
  2. A blogger links to another blog in actual entries; this is a weighted attribute reflecting one blogger's tendency to comment on another's story.
  3. Link blogs which share the same content. At least two sources of information for this: (1) Blogs share the same external links; they tend to comment on the same events. (2) Blogs share a lot of textual content.

The last one is the only one that requires a lot of heavy ML lifting. Essentially, you have a lot of documents (the textual and link content of each blog) and you want to determine the strongest pairwise similarities, but the naive O(n^2) approach is too slow. There is actually a good paper on how to do this efficiently: Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce. You can take some of the other suggestions and consider a hierarchical clustering approach, but without more details I can't recommend anything much more specific.
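As a rough illustration of how these edge types differ once the links have been collected, here is a small sketch. The input data is invented, and Jaccard overlap is just one assumed way to score shared external links; the binary edges here are derived from post links, whereas a real blogroll edge would come from the blogroll itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one (source_blog, target_blog) pair per link found in a post.
post_links = [
    ("a.example", "b.example"),
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("b.example", "c.example"),
]

# Weighted, directed edges: how often one blog links to another in its entries.
weighted_edges = Counter(post_links)

# Binary, directed edges: only whether a link exists at all.
binary_edges = set(post_links)


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Undirected edges scored by how many external links two blogs share.
external_links = {
    "a.example": {"news.example/1", "news.example/2"},
    "b.example": {"news.example/2", "news.example/3"},
    "c.example": {"other.example/9"},
}
shared_link_edges = {
    (x, y): jaccard(external_links[x], external_links[y])
    for x, y in combinations(sorted(external_links), 2)
}
```

The last dictionary is built with the naive all-pairs loop, which is fine for a few hundred blogs but is exactly the quadratic step that becomes too slow at scale.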

answered Jul 11 '10 at 18:47

aria42


edited Jul 11 '10 at 18:51

I guess a simple way to do what you want is to limit yourself to a few blogging platforms (Blogger, WordPress, Tumblr, TypePad, LiveJournal) and either use a standard crawler "seeded" with a few blogs and limited to URLs on these platforms (think something like wget -rc), or write a simple crawler yourself. Python is a nice language for this, since it has urllib (for fetching) and lxml.html (for parsing, and then iterating over all links). The code could look a bit like this:

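import urllib.request
from urllib.parse import urljoin
import lxml.html

queue = list(seed_urls)  # seed_urls: hypothetical list of starting blog URLs
has_visited = set(queue)
edges = set()  # (source blog, target blog) pairs
while queue:
    url = queue.pop()
    d = domain(url)
    try:
        page = lxml.html.fromstring(urllib.request.urlopen(url).read())
    except Exception:
        continue  # URL / parse error
    for href in page.xpath("//a/@href"):
        link = urljoin(url, href)  # resolve relative links
        if link not in has_visited and valid_domain(link):
            has_visited.add(link)
            queue.append(link)
        if domain(link) != d:
            edges.add((d, domain(link)))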

and leave it running for a while. This should avoid visiting the same page multiple times. I left unspecified functions like "domain()" that returns a canonical name for a blog (which could be its domain name) and "valid_domain()", which checks if a blog is in the list of pages you want to crawl.
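For what it's worth, the two unspecified helpers could be sketched as below. The suffix list is an assumption based on the platforms mentioned above (blogspot.com is used for Blogger), and treating the host name as the blog's canonical name will not work for blogs on custom domains.

```python
from urllib.parse import urlparse

# Assumed whitelist of hosted-blog platforms; adjust as needed.
PLATFORM_SUFFIXES = (
    ".blogspot.com",
    ".wordpress.com",
    ".tumblr.com",
    ".typepad.com",
    ".livejournal.com",
)


def domain(url):
    """Canonical blog name: lowercase host without port or a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def valid_domain(url):
    """True if the URL belongs to one of the whitelisted platforms."""
    return domain(url).endswith(PLATFORM_SUFFIXES)
```

With these definitions, two posts on the same blog map to the same node, and links to sites outside the whitelist still produce edges but are never crawled.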

It's always hard, with this sort of thing, to deal with failures, so it might make sense to back up your queue or have some better way of handling network errors than throwing out the page (you can reinsert it at the end of the queue with a limit on the number of retries, for example).
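A minimal sketch of the "reinsert with a retry limit" idea, assuming a caller-supplied fetch function that raises on failure; the name crawl_with_retries and the default of three attempts are illustrative choices, not part of any library.

```python
from collections import deque


def crawl_with_retries(seeds, fetch, max_retries=3):
    """Process URLs in FIFO order; failed URLs go to the back of the queue.

    `fetch` is any callable that returns page content or raises on failure.
    """
    queue = deque((url, 0) for url in seeds)
    pages, failed = {}, []
    while queue:
        url, attempts = queue.popleft()
        try:
            pages[url] = fetch(url)
        except Exception:
            if attempts + 1 < max_retries:
                queue.append((url, attempts + 1))  # try again later
            else:
                failed.append(url)  # give up after max_retries attempts
    return pages, failed
```

Because a failed URL goes to the back of the queue, a temporarily unreachable host gets some time to recover before the next attempt. Saving the queue to disk between runs is left out here.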

answered Jul 11 '10 at 16:02


Alexandre Passos ♦

There are a number of tools categorized at http://www.mkbergman.com/414/large-scale-rdf-graph-visualization-tools/ that are nice for visualizations. There is also a cool animated version of these finite group graphs at http://www.aharef.info/static/htmlgraph/; it lets you input a website and graphs the relationships dynamically.

Are you thinking of something like this, where the nodes are bloggers? [image: finite group graph example]

answered Jul 12 '10 at 03:39

bvmou



OK, very nice links - thank you!

This is indeed my end goal, but my question is about which tools or methods I can use to produce the data needed to create such graphs. Cheers.

(Jul 12 '10 at 03:45) Tal Galili

OK -- maybe check out chapter 3 of Toby Segaran's Programming Collective Intelligence if you want an introduction to this with very nice Python examples -- pages 30-50 may be available in various previews. Are you interested in using Python for this? I can edit the question with a Python example if you like (though it would likely crib heavily from Segaran ;)

(Jul 12 '10 at 04:08) bvmou

A few datasets to start with:

  • http://www.icwsm.org/data.html [ICWSM dataset of 10 million blog posts]
  • http://www.memetracker.org/data.html [Contains links as well as short text snippets; a huge dataset, more than a few million]

Finally, you can use the LiveJournal network, where each blogger adds other LiveJournal users as friends.

answered Jul 11 '10 at 15:11

DirectedGraph


For this you have to follow a few steps:

  1. Find some seed blog.
  2. Find all neighbors of the seed, and create its ego-net (star graph) with the seed at the center connected to its neighbors.
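The two steps can be sketched as follows, assuming some neighbors_of function that returns the blogs a given blog links to (here faked with a dictionary; in practice it would come from a scraper). The name ego_net is made up for this example.

```python
def ego_net(seed, neighbors_of):
    """Star graph: the seed at the center, one edge per distinct neighbor."""
    return {(seed, n) for n in neighbors_of(seed) if n != seed}


# Hypothetical link data standing in for a real scraper.
links = {"a.example": ["b.example", "c.example", "a.example"]}
edges = ego_net("a.example", lambda blog: links.get(blog, []))
```

Repeating this with each neighbor as the new seed grows the star into a larger network, one hop at a time.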

answered Jan 06 at 23:44

mary


You should try out the Issue Crawler at govcom: http://www.govcom.org/

It is documented and it is already being used by researchers (friends of mine) to do exactly what you describe.

answered Jul 14 '10 at 18:24

alper


edited Jul 14 '10 at 18:25

There isn't code for doing this, per se, but you could link blogs that have similar content. In particular, you can compute the term-document vector for each blog, where the "document" is the aggregate over the blog's individual posts. You then take as "edges" all pairs of blogs whose cosine similarity is above some threshold.

A faster approach (since the naive technique for finding edges is quadratic in the number of blogs) is to use an LSH approach, like min-hash. Any two blogs that fall in the same bucket are linked. One can link more blogs by using different random matrices for LSH. This approach is linear in the number of blogs.
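A small sketch of the thresholded cosine similarity idea, assuming plain word counts as the term vector (no stemming, stop-word removal or TF-IDF weighting) and an arbitrary threshold of 0.5. The function names and sample posts are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(posts):
    """Bag-of-words counts aggregated over all of a blog's posts."""
    return Counter(word for post in posts for word in post.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_edges(blogs, threshold=0.5):
    """Link every pair of blogs whose cosine similarity reaches the threshold."""
    vectors = {name: term_vector(posts) for name, posts in blogs.items()}
    return {
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    }


blogs = {
    "a.example": ["machine learning with python", "python tips"],
    "b.example": ["learning python", "machine learning"],
    "c.example": ["sourdough bread recipes"],
}
print(similarity_edges(blogs))
```

This compares every pair of blogs, so the running time grows quadratically with the number of blogs.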
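Here is one possible sketch of min-hash with banding, under several assumptions: each blog is represented as a set of tokens, seeded SHA-1 hashes stand in for random permutations, and a signature is split into bands of a few rows each so that two blogs share a bucket when they agree on a whole band. The band and row counts are arbitrary; adding bands plays the role of using more random hash functions to link more blogs.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes):
    """One minimum per seeded hash function; similar sets share many minima."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def lsh_edges(blogs, bands=8, rows=2):
    """Link blogs whose signatures agree on all rows of at least one band."""
    buckets = defaultdict(set)
    for name, tokens in blogs.items():
        if not tokens:
            continue  # an empty blog has no signature
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)
    edges = set()
    for members in buckets.values():
        edges.update(combinations(sorted(members), 2))
    return edges
```

Hashing the blogs into buckets is a single pass over the data. Listing the pairs inside each bucket stays cheap as long as buckets are small, which is the usual case when most blogs are dissimilar.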

answered Jul 11 '10 at 14:26


Joseph Turian ♦♦

Thanks Joseph, I'll keep looking :) Cheers.

(Jul 11 '10 at 14:48) Tal Galili


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.