If one wants to represent, e.g., words in a vocabulary as vectors, a trivial way to do so is one-hot encoding. This kind of encoding is very high-dimensional and sparse. Another way to represent the words would be to assign a number to each of them and use that number's binary representation. For example, if the vocabulary size were 5, the largest number would be 5, which is [1,0,1] in binary notation. Word number two would then be represented as [0,1,0].
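To make the two schemes concrete, here is a minimal sketch in Python. The vocabulary and the helper names (`one_hot`, `binary_code`) are illustrative, not from any particular library; the bit width for the binary code is just enough to hold the largest word number.

```python
# Hypothetical 5-word vocabulary, words numbered 1..5 as in the example above.
vocab = ["the", "cat", "sat", "on", "mat"]

def one_hot(index, size):
    """One-hot: a sparse vector with a single 1 at the word's index (0-based)."""
    vec = [0] * size
    vec[index] = 1
    return vec

def binary_code(number, width):
    """Binary ID: the word number written in base 2, most significant bit first."""
    return [(number >> bit) & 1 for bit in reversed(range(width))]

# Word number 2 in each scheme:
print(one_hot(1, len(vocab)))   # [0, 1, 0, 0, 0]  -- dimension = vocabulary size
print(binary_code(2, 3))        # [0, 1, 0]        -- dimension = number of bits
print(binary_code(5, 3))        # [1, 0, 1]        -- the largest word number, 5
```

The one-hot vector grows linearly with vocabulary size, while the binary code only needs about log2 of the vocabulary size dimensions, which is the trade-off the question is asking about.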

I wonder if there is any work comparing these two word representations in particular when used as input to methods like neural networks.

I am not looking for more complex representations such as cluster-based or distributed representations.

Is the binary representation an absolute no-go?

Thanks in advance!

asked Oct 12 '12 at 00:08


ogh


One Answer:

The second option doesn't really make much sense -- all information about a word is discarded except for its ID in the vocabulary, which isn't very informative. A single input per word would probably work in certain applications, but there are sparsity issues to consider, and this would result in a rather large input vector.

If you are dealing with a broad-domain natural language text, word clustering approaches or distributed word representations seem to present a better option. This page has been cited here a number of times in conjunction with this topic.

answered Oct 12 '12 at 06:42


Mikhail

Thank you for your answer. I'm aware that both of the options I mentioned (one-hot and the word ID as a binary number) contain no information other than the word ID. However, my question was more: does the binary representation per se ruin performance in neural networks? That is why I was looking for a comparison of different input codings that includes this binary-number way of encoding word IDs.

(Oct 15 '12 at 01:34) ogh


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.