I wish to cluster all the webpages within a domain into natural clusters. Is it possible to use a generic (website independent) approach to solve this?

There's literature available which deals with various types of webpage classification approaches, but I don't want to specify any prior categories.

The website I'm dealing with has thousands of URLs. The naive approach that comes to mind is to scrape the text from every URL and then run clustering or topic modelling on the resulting corpus. Is there a more sophisticated or elegant way that uses meta tags, visual information, link information, etc. (while ignoring advertisements and other irrelevant content)? Or could I start from some seed categories and refine them during learning?
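For reference, the naive text-only baseline could look roughly like the sketch below. It is only an illustration under assumptions: the page texts are already scraped, the weighting is a plain smoothed TF-IDF, and the clustering step is a simple single-pass "leader" clustering standing in for whatever algorithm is actually chosen. All function names and the 0.3 threshold are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF, so a term found in every document keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def leader_cluster(vectors, threshold=0.3):
    """Single-pass clustering: a page joins the first cluster whose leader
    is similar enough, otherwise it becomes the leader of a new cluster."""
    leaders, labels = [], []
    for v in vectors:
        for idx, leader in enumerate(leaders):
            if cosine(v, leader) >= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels
```

Calling `leader_cluster(tfidf_vectors(page_texts))` returns one cluster label per page, without any categories being specified up front. The result depends on page order and on the threshold, which is one reason to treat this as a baseline only.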

Thanks.

asked Apr 03 '12 at 07:10

Ankur Pandey

edited Apr 04 '12 at 02:11


2 Answers:

After some literature survey, I have a rough idea:

I can convert each webpage into a text document, giving extra weight to text in the meta tags, title, headings, etc. (following this, and this). I then cluster the corpus after converting it into a vector space model.
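A minimal sketch of that weighting step, using only Python's standard `html.parser`. The weights in `FIELD_WEIGHTS` are hypothetical placeholders (the papers linked above are the place to look for sensible values), and `WeightedTextParser` and `weighted_terms` are names invented for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical field weights; no values are given in this thread, so tune them.
FIELD_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "h3": 2.0, "meta": 4.0}
BODY_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}


class WeightedTextParser(HTMLParser):
    """Accumulate term weights, boosting terms found in selected fields."""

    def __init__(self):
        super().__init__()
        self.weights = Counter()
        self._stack = []  # currently open weighted or skipped tags

    def _add(self, text, weight):
        for token in re.findall(r"[a-z]+", text.lower()):
            self.weights[token] += weight

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Meta text lives in the "content" attribute, not in element data.
            attr = dict(attrs)
            if (attr.get("name") or "").lower() in ("keywords", "description"):
                self._add(attr.get("content") or "", FIELD_WEIGHTS["meta"])
        elif tag in FIELD_WEIGHTS or tag in SKIPPED_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current in SKIPPED_TAGS:
            return  # ignore JavaScript and CSS
        self._add(data, FIELD_WEIGHTS.get(current, BODY_WEIGHT))


def weighted_terms(html):
    """Return a dict mapping each term to its accumulated field weight."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)
```

The resulting term-to-weight dictionaries can be used in place of raw term counts when building the vector space model, so a word in the title counts for more than the same word in body text.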

Next, I take the neighboring URLs of the target webpage (inlinks and outlinks) and compute the similarity between each neighbor's cluster and the target webpage's cluster. Clusters that are more similar to the target's cluster get a higher similarity weight. Neighbors belonging to the same cluster are grouped; if the sum of similarity weights for a group exceeds a predefined threshold, the target page is assigned to that group's cluster (following this).
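To make the neighbor step concrete, here is a rough sketch of how I read it. It is not taken from the linked paper: the function name, the `cluster_similarity` callback, and the choice to keep the current cluster when no group passes the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def reassign_by_neighbors(target_cluster, neighbor_clusters,
                          cluster_similarity, threshold):
    """Return the cluster for the target page after the neighbor vote.

    target_cluster     -- current cluster label of the target page
    neighbor_clusters  -- cluster labels of its inlink/outlink neighbors
    cluster_similarity -- function (label_a, label_b) -> similarity weight
    threshold          -- minimum summed weight needed to reassign the page
    """
    totals = defaultdict(float)
    for label in neighbor_clusters:
        # Each neighbor votes for its own cluster, weighted by how similar
        # that cluster is to the target page's current cluster.
        totals[label] += cluster_similarity(target_cluster, label)
    if not totals:
        return target_cluster  # no neighbors, nothing to vote on
    best = max(totals, key=totals.get)
    # Assumption: keep the original cluster unless a group clears the bar.
    return best if totals[best] > threshold else target_cluster
```

For example, with three neighbors in cluster "b" (similarity 0.6 each to the target's cluster) and a threshold of 1.5, the target page would move to "b"; with a threshold of 2.0 it would stay where it is.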

Kindly comment on the feasibility of this approach. Suggestions for page scraping and clustering approaches, tools, apps, etc. are also welcome.

answered Apr 04 '12 at 03:23

Ankur Pandey

Are there relationships between the URLs beyond the fact that they are associated with a particular domain, e.g. does one URL contain a link to another URL?

answered Apr 03 '12 at 14:47

Aengus Robinson

Note: Just to avoid any confusion, I'm considering the entire website with all its webpages, not just the URLs.

There is a lot of information in the URLs, and I was thinking of exploiting them first, but they don't carry inlink and outlink information.

(Apr 04 '12 at 02:19) Ankur Pandey

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.