If one wants to represent, e.g., words in a vocabulary with vectors, a trivial way to do that is one-hot encoding. This kind of encoding is very high-dimensional and sparse. Another way to represent the words would be to assign a number to each of them and use that number's binary representation. For example, if the vocabulary size were 5, the largest number would be 5, which is [1,0,1] in binary notation, and word number two would then be represented as [0,1,0]. I wonder if there is any work comparing these two word representations in particular when used as input for methods like neural networks. I am not looking for more complex representations like cluster-based or distributed representations. Is the binary representation an absolute no-go? Thanks in advance!
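To make the two encodings from the question concrete, here is a minimal sketch (the vocabulary words and function names are made up for illustration) that builds both a one-hot vector and a binary-ID vector for a vocabulary of size 5:

```python
import math

# Toy vocabulary of 5 words, with IDs 1..5 as in the question.
vocab = ["the", "cat", "sat", "on", "mat"]
n = len(vocab)
bits = math.ceil(math.log2(n + 1))  # 3 bits suffice for IDs up to 5

def one_hot(word_id, size):
    """Length-`size` vector with a single 1 at position word_id - 1."""
    v = [0] * size
    v[word_id - 1] = 1
    return v

def binary_code(word_id, width):
    """The word ID written in binary, most significant bit first."""
    return [(word_id >> i) & 1 for i in range(width - 1, -1, -1)]

print(one_hot(2, n))         # [0, 1, 0, 0, 0]
print(binary_code(2, bits))  # [0, 1, 0]  -- word number two
print(binary_code(5, bits))  # [1, 0, 1]  -- the largest ID
```

The one-hot vector grows linearly with the vocabulary size, while the binary code grows only logarithmically, which is presumably the appeal of the second scheme.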
The second option doesn't really make much sense -- all the information about a word is discarded except for its ID in the vocabulary, which isn't too informative. A single input per word would probably work in certain applications, but there are sparsity issues to consider, and this would result in a rather large input vector. If you are dealing with broad-domain natural-language text, word-clustering approaches or distributed word representations seem to present a better option. This page has been cited here a number of times in conjunction with this topic.

Thank you for your answer. I'm aware that both of the options I mentioned (one-hot and the word ID as a binary number) don't contain any information other than the word ID. However, my question was more along the lines of: does the previously mentioned binary representation per se ruin performance on neural networks? I was therefore looking for a comparison of different input codings that includes this binary-number way of encoding word IDs.
(Oct 15 '12 at 01:34)
ogh
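One way to see the answer's point that the binary-ID code "isn't too informative" is to look at the geometry it imposes. This sketch (my own illustration, not from the thread) compares Hamming distances: under one-hot encoding all distinct words are equally far apart, while under the binary-ID code some word pairs are close simply because their IDs happen to be numerically adjacent, a structure the network may latch onto even though it carries no linguistic meaning:

```python
def hamming(a, b):
    """Number of positions where the two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# One-hot vectors for word IDs 2, 3, 4 in a size-5 vocabulary.
one_hot_2 = [0, 1, 0, 0, 0]
one_hot_3 = [0, 0, 1, 0, 0]
one_hot_4 = [0, 0, 0, 1, 0]

# 3-bit binary codes for the same IDs.
bin_2, bin_3, bin_4 = [0, 1, 0], [0, 1, 1], [1, 0, 0]

# One-hot: every distinct pair differs in exactly 2 positions.
print(hamming(one_hot_2, one_hot_3), hamming(one_hot_3, one_hot_4))  # 2 2

# Binary IDs: word 3 is one bit away from word 2 but three bits away
# from word 4, although the words themselves are equally unrelated.
print(hamming(bin_2, bin_3), hamming(bin_3, bin_4))  # 1 3
```

This arbitrary similarity structure is one plausible reason the binary encoding could hurt a network relative to one-hot inputs, independent of any published comparison.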