It is debatable whether you can evaluate clustering, pure and simple, using labels alone. While you can easily get performance numbers, most serious practitioners and theorists argue that these numbers mean little, because there is no a priori reason that the grouping you want is the one encoded in the labels you happen to have. For example, when clustering Amazon reviews, one can be interested in grouping by:
- star rating of the review
- perceived helpfulness of the review
- positive/negative sentiment of the review
- product or product category of the review
- author of the review
etc. Any clustering algorithm that finds good clusters according to one of these criteria will probably perform very poorly when measured against any of the others. I, at least, have been interested in each of these groupings in the past, and other people can certainly think of further interesting options.
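To make the point concrete, here is a minimal sketch that scores one clustering against two different label sets with the adjusted Rand index. The eight "reviews", their sentiment and product labels, and the helper name `adjusted_rand_index` are all made up for illustration; only the formula is standard.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(pred, gold):
    """Chance-corrected pair-counting agreement between two labelings (needs >= 2 items)."""
    n = len(pred)
    # Number of item pairs that share a cell / a cluster / a gold class.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(pred, gold)).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    pairs_gold = sum(comb(c, 2) for c in Counter(gold).values())
    expected = pairs_pred * pairs_gold / comb(n, 2)
    maximum = (pairs_pred + pairs_gold) / 2
    if maximum == expected:
        return 1.0  # convention for degenerate cases such as a single cluster on both sides
    return (pairs_joint - expected) / (maximum - expected)

# Hypothetical toy data: one clustering of eight reviews, two candidate "ground truths".
clusters  = [0, 0, 0, 0, 1, 1, 1, 1]
sentiment = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
product   = ["book", "phone", "book", "phone", "book", "phone", "book", "phone"]

ari_sentiment = adjusted_rand_index(clusters, sentiment)  # 1.0: perfect agreement
ari_product = adjusted_rand_index(clusters, product)      # -1/6: slightly worse than chance
print(round(ari_sentiment, 3), round(ari_product, 3))
```

The same clustering is "perfect" or "useless" depending purely on which labels you decide to call ground truth.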
With this caveat in mind, pick any standard dataset close enough to the domain in which you want to claim good performance, and measure performance against its usual gold-standard labels. For clustering text, for example, most people would pick some category structure from the Reuters RCV-1 labels as ground truth, and for vision they would look at one of the many object recognition datasets.
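As a rough sketch of what "measuring against the gold-standard label" can look like, here are two commonly reported scores, purity and normalized mutual information, written with the standard library only. The six gold labels are invented stand-ins for a real category structure, and the function names are my own; in practice you would use whatever metrics the papers you compare against report.

```python
from collections import Counter
from math import log

def purity(pred, gold):
    """Fraction of items carrying the majority gold label of their cluster."""
    by_cluster = {}
    for p, g in zip(pred, gold):
        by_cluster.setdefault(p, Counter())[g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(gold)

def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(pred, gold):
    """Mutual information normalized by the arithmetic mean of the two entropies."""
    n = len(gold)
    joint = Counter(zip(pred, gold))
    p_cnt, g_cnt = Counter(pred), Counter(gold)
    mi = sum(c / n * log(c * n / (p_cnt[p] * g_cnt[g]))
             for (p, g), c in joint.items())
    denom = (entropy(pred) + entropy(gold)) / 2
    return mi / denom if denom > 0 else 1.0  # both labelings trivial

# Hypothetical gold categories and one clustering that misplaces a single item.
gold = ["sports", "sports", "sports", "politics", "politics", "politics"]
pred = [0, 0, 1, 1, 1, 1]

purity_score = purity(pred, gold)  # 5 of 6 items match their cluster's majority label
nmi_score = nmi(pred, gold)
print(purity_score, nmi_score)
```

Note that purity is trivially 1.0 if every item gets its own cluster, which is one reason to report more than one metric.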
What I would advise is to read carefully the papers describing whatever baseline technique you are building upon or comparing against, and to use at least their datasets and performance metrics, possibly adding something new that you find relevant. If you care about scientifically sound conclusions, the only way to evaluate is to use clustering as part of some bigger process with a true loss function, and to measure how that process's performance changes when you switch clustering algorithms. This, however, tells you nothing about how your algorithm will generalize to other end-to-end scenarios.
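A toy sketch of that kind of end-to-end comparison follows. Everything in it is an assumption made for illustration: the "bigger process" predicts a target `y` from the mean training `y` of the cluster that `x` falls into, the loss is held-out squared error, and the two interchangeable clusterers are equal-width binning and a plain 1-D k-means. The data are invented.

```python
from statistics import mean

def fit_equal_width(xs, k):
    """Clusterer A: cut the range of xs into k equal-width bins."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k or 1.0  # avoid a zero width when all xs are equal
    return lambda x: min(k - 1, max(0, int((x - lo) / width)))

def fit_kmeans_1d(xs, k, iters=20):
    """Clusterer B: Lloyd's k-means on scalars, seeded at evenly spaced quantiles."""
    s = sorted(xs)
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda c: abs(x - centers[c]))].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return lambda x: min(range(k), key=lambda c: abs(x - centers[c]))

def downstream_loss(fit, train, test, k):
    """Held-out squared error of the downstream predictor built on a clustering."""
    assign = fit([x for x, _ in train], k)
    by_cluster = {}
    for x, y in train:
        by_cluster.setdefault(assign(x), []).append(y)
    fallback = mean(y for _, y in train)  # used for clusters unseen in training
    preds = [mean(by_cluster[assign(x)]) if assign(x) in by_cluster else fallback
             for x, _ in test]
    return mean((p - y) ** 2 for p, (_, y) in zip(preds, test))

# Hypothetical data: two tight groups with different targets, plus one outlier.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (4.0, 1), (5.0, 1), (6.0, 1), (100.0, 1)]
test = [(1.5, 0), (4.5, 1)]

loss_equal_width = downstream_loss(fit_equal_width, train, test, k=3)  # 0.25
loss_kmeans = downstream_loss(fit_kmeans_1d, train, test, k=3)         # 0
print(loss_equal_width, loss_kmeans)
```

Here k-means wins because the outlier stretches the equal-width bins, but that ranking is a statement about this particular downstream task and data only.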