What are the current main methods for flattening hierarchical clusters? Specifically, given a dendrogram Z, deciding where to cut the dendrogram to result in k clusters.

Scipy has the fcluster algorithm, which can use an inconsistency metric to split clusters when the child clusters are different. However there are a lot of different options, and each has the manually chosen t parameter. I'm looking for something more automated, even if it isn't perfect.

asked Dec 22 '10 at 18:56

Robert%20Layton's gravatar image

Robert Layton
1520102337

I've marked this question as answered, but if anyone stumbles across this question and has other suggestions, I'd be happy to hear them.

(Apr 27 '11 at 21:57) Robert Layton

2 Answers:

I have code written against the Python hcluster package, which uses this approach to find the smallest flat-clustering. I can share this code, if you would like. You may not want the smallest flat-clustering, though. Perhaps you want a sample of flat clusters at various thresholds.

Here is how I find the smallest flat-clustering. I do a line-search against the flattening threshold t. The minimum of the interval is t small enough to have every cluster included. The maximum of the interval is t large enough to have only one cluster. You then do a line search to find the value of t that has more than one cluster, but the minimum number of clusters. I do a line search with a maximum of 15 steps.

Here is the code:

def fcluster_with_count(Z, t):
    clusters = hcluster.fcluster(Z, t)
    clustercnt = len(frozenset(clusters))
    return clusters, clustercnt

def narrowest_fcluster(docs, Z):
    """
    Try to find the t that gives the smallest number of flat clusters.
    We do this using a line search.
    """

    tlow = 0.
    thigh = 2.
    cnt = 0

    cnt_to_clusters = {}

    clusters, clustercnt = fcluster_with_count(Z, thigh)
    cnt_to_clusters[clustercnt] = clusters
    print >> sys.stderr, thigh, clustercnt
    assert clustercnt == 1

    clusters, clustercnt = fcluster_with_count(Z, tlow)
    cnt_to_clusters[clustercnt] = clusters
    print >> sys.stderr, tlow, clustercnt
    assert clustercnt > 1

    while cnt < 15:
        tmid = (tlow + thigh) / 2.
        clusters, clustercnt = fcluster_with_count(Z, tmid)
        cnt_to_clusters[clustercnt] = clusters
        print >> sys.stderr, tmid, clustercnt

        if clustercnt == 1:
            thigh = tmid
        else:
            tlow = tmid

        cnt += 1

answered Apr 11 '11 at 05:09

Joseph%20Turian's gravatar image

Joseph Turian ♦♦
467541105126

This is similar to pruning a decision tree, so many of the techniques (e.g., this one, with suitable modifications) for that task are applicable here.

answered Dec 26 '10 at 23:43

Daniel%20Hsu's gravatar image

Daniel Hsu
161

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.