|
Is the data clustering algorithm used by Google publicly known? If so, where can I find it? |
|
You could search the Google publications database for clustering and k-means: http://research.google.com/pubs/papers.html One relevant Googler paper that's not in that list for some reason is "Fast and Accurate k-means For Large Datasets", NIPS 2011. |
|
Since you ask about web-scale clustering of web pages, I would assume they use either a distributed k-means or the following algorithm, which has a Googler as a co-author: Efficient Clustering of Web-Derived Data Sets (Sarmento et al., 2009). I outline the algorithm in the answer to Large-scale clustering. |
Google uses many variants of many clustering algorithms internally. Some of them are known (k-means), some of them are not (their clustering spelling corrector). Can you be more specific?
What is the main algorithm that Google uses to cluster data (mainly web page content)? Which task do they use k-means for? Is it an incremental version? Static clustering (e.g. classical k-means) is not feasible because the data in question is very large, and it changes and grows continuously...
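To make "incremental version" concrete, here is a minimal sketch of sequential (online) k-means, where each arriving point nudges its nearest centroid toward it instead of re-clustering the whole dataset. This is only an illustration of the general technique, not a description of anything Google is known to run; the class name `OnlineKMeans` and the sample points are made up for the example.

```python
from typing import List, Sequence


class OnlineKMeans:
    """Sequential k-means: centroids are updated one point at a time."""

    def __init__(self, initial_centroids: Sequence[Sequence[float]]):
        # Hypothetical starting centroids supplied by the caller.
        self.centroids: List[List[float]] = [list(c) for c in initial_centroids]
        # Number of points assigned to each centroid so far.
        self.counts: List[int] = [0] * len(self.centroids)

    def _nearest(self, point: Sequence[float]) -> int:
        # Index of the centroid with the smallest squared Euclidean distance.
        def sq_dist(c: Sequence[float]) -> float:
            return sum((p - q) ** 2 for p, q in zip(point, c))

        return min(range(len(self.centroids)),
                   key=lambda i: sq_dist(self.centroids[i]))

    def update(self, point: Sequence[float]) -> int:
        """Assign one new point and move its centroid toward it."""
        i = self._nearest(point)
        self.counts[i] += 1
        # A 1/n step keeps each centroid equal to the running mean of the
        # points assigned to it (the first point replaces the initial guess).
        rate = 1.0 / self.counts[i]
        centroid = self.centroids[i]
        for d, p in enumerate(point):
            centroid[d] += rate * (p - centroid[d])
        return i


# Usage: feed points as they arrive; no pass over old data is needed.
model = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
for pt in [[1.0, 1.0], [9.0, 9.0], [0.0, 2.0], [11.0, 9.0]]:
    model.update(pt)
print(model.centroids)
```

The trade-off is that the result depends on the order in which points arrive, and the number of clusters is fixed up front, which is one reason large systems tend to combine several techniques.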
Perhaps you might look into PageRank, which is the algorithm Google is best known for.
this is lol
They don't use k-means; they use keywords, user behavior, and all the ingenuity and technical effort that their hundreds or thousands of search engineers, working with the largest server farm in the world and a dataset larger than the sum total of all text produced by humans before the year 2005, can muster. There is no single algorithm (PageRank is the largest single element, though).