I would like to know which gives a better approximation of similarity between two vectors: Euclidean distance or cosine similarity. I would like to take the case of word embeddings, i.e. latent feature representations of words. If the cosine similarity (cos_theta) between two word representations is nearly equal to 1, can I assume that they have similar latent variable representations and carry similar syntactic and semantic features? Or do I need to calculate the Euclidean distance between them?
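To make the question concrete, here is a minimal sketch (NumPy, with made-up 2-D vectors) of the two measures I am asking about:

```python
import numpy as np

a = np.array([0.8, 0.6])  # hypothetical embedding of word 1
b = np.array([0.4, 0.3])  # hypothetical embedding of word 2 (same direction, half the length)

cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(cos_theta)  # ~1.0 -- identical direction, so maximal cosine similarity
print(euclidean)  # 0.5  -- yet the two points are not in the same place
```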

asked Sep 27 '12 at 14:48

Kuri_kuri

edited Sep 28 '12 at 03:21


One Answer:

Rare words, even after training, will generally have vectors close to their initialization point. Typically people initialize word representations from a distribution with mean zero that puts the initial vectors pointing in all sorts of directions. This means that if you use cosine distance, rare words will randomly appear to be very close to other (possibly more frequent) words.
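A rough sketch of that effect, assuming a small zero-mean Gaussian initialization (the names and dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

frequent = rng.normal(size=dim)                  # stand-in for a trained, frequent word
rare = rng.normal(scale=0.01, size=(1000, dim))  # 1000 words stuck near initialization

cos = rare @ frequent / (np.linalg.norm(rare, axis=1) * np.linalg.norm(frequent))
print(cos.max())  # some rare word appears fairly close to `frequent` purely by chance
```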

However, if you use Euclidean distance, all rare words will seem similar to one another. I still think Euclidean distance is probably the better choice, since in some sense all the rare words looking the same isn't that bad.
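A companion sketch of the reverse situation, under the same made-up setup: near-initialization vectors have tiny norms, so they sit close together in Euclidean distance regardless of their directions:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

rare1 = rng.normal(scale=0.01, size=dim)  # two words stuck near initialization
rare2 = rng.normal(scale=0.01, size=dim)
frequent = rng.normal(size=dim)           # stand-in for a well-trained word

print(np.linalg.norm(rare1 - rare2))     # tiny: the rare words look alike
print(np.linalg.norm(rare1 - frequent))  # large: dominated by the trained vector's norm
```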

In general, keep in mind that the exact distance between words, however you decide to measure it, will not be very meaningful; only the rank ordering of distances, or some other relative notion of distance, is.
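For example, a hypothetical nearest-neighbour query that uses only the ordering (the embedding matrix here is random, just to show the pattern):

```python
import numpy as np

vocab = ["china", "japan", "japanese", "banana"]            # toy vocabulary
E = np.random.default_rng(0).normal(size=(len(vocab), 50))  # stand-in embeddings

q = E[vocab.index("japan")]
cos = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
for i in np.argsort(-cos):  # descending cosine similarity
    print(vocab[i], round(float(cos[i]), 3))  # read off the ranking, not the magnitudes
```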

Furthermore, I would like to point out that most of the information encoded in word representations learned by a typical neural language model, which looks at short local windows of words, will be syntactic. What I mean is that most ways of learning word representations will learn that "japan" should be much closer to "china" than to "japanese". This might not be what you want.

answered Sep 27 '12 at 22:13

gdahl ♦

edited Sep 27 '12 at 22:14
