I am looking at Google's word2vec project and notice that they do unigram sampling based on a power law: the sampling probability of a unigram is proportional to its frequency raised to the power 0.75. My question is: what is the significance of the number 0.75 here? The only connection I can think of is the observation that vocabulary size is roughly proportional to corpus size to the power of 0.75. Is there a better explanation? And compared to other sampling schemes, such as sampling by the unigram's raw frequency or by Zipf's law, what is the point of using this power-law sampling? Thanks!
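To make the question concrete, here is a small toy illustration (my own made-up counts, not word2vec code) of how raising counts to the power 0.75 reshapes the distribution compared to raw frequencies:

    #include <math.h>
    #include <stdio.h>

    /* Toy counts (made up for illustration): compare the plain unigram
       distribution with the same counts raised to the power 0.75. */
    int main(void) {
        const char *words[] = {"the", "of", "cat", "tensor"};
        double counts[]     = {1000000.0, 500000.0, 1000.0, 10.0};
        const int n = 4;
        double z1 = 0.0, z75 = 0.0;

        for (int i = 0; i < n; i++) {
            z1  += counts[i];
            z75 += pow(counts[i], 0.75);
        }
        printf("%-8s %12s %14s\n", "word", "P ~ f", "P ~ f^0.75");
        for (int i = 0; i < n; i++)
            printf("%-8s %12.6f %14.6f\n", words[i],
                   counts[i] / z1, pow(counts[i], 0.75) / z75);
        return 0;
    }

Frequent words still dominate under the 0.75 power, but noticeably less so than under raw frequencies, while rare words get a somewhat larger share.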
Where are they doing this? The skip-gram log-linear model doesn't require this at all. The only thing I can think of in their code that is like this is when they throw away frequent words at random from the training data. Is this what you are talking about? I wouldn't describe any part of the skip-gram code as sampling unigrams. The InitUnigramTable code is only called when you provide non-standard options to the program; the basic algorithm doesn't use it at all. When it is enabled, it is probably doing something similar to noise contrastive estimation. However, the exact negative distribution probably doesn't matter that much, and it makes sense to take the 0.75 power to dampen the effect of very frequent words and smooth things out. I expect other powers would also be fine. Here's the code, if it helps:
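(What follows is a sketch of the InitUnigramTable logic rather than a verbatim copy of word2vec.c; the names build_unigram_table and TABLE_SIZE are made up here, and the real table in word2vec is much larger, on the order of 1e8 entries.)

    #include <math.h>
    #include <stdlib.h>

    #define TABLE_SIZE 1000000   /* illustrative; word2vec uses a much larger fixed table */

    /* Fill a lookup table so that word w occupies a fraction of slots
       proportional to pow(counts[w], power).  Sampling a uniform random
       slot then draws a word from the smoothed unigram distribution. */
    int *build_unigram_table(const long long *counts, int vocab_size, double power) {
        int *table = malloc(TABLE_SIZE * sizeof(int));
        double norm = 0.0;
        for (int w = 0; w < vocab_size; w++)
            norm += pow((double)counts[w], power);

        int w = 0;
        double cum = pow((double)counts[0], power) / norm;  /* cumulative share */
        for (int a = 0; a < TABLE_SIZE; a++) {
            table[a] = w;
            /* move on to the next word once its share of the table is filled */
            if ((double)a / TABLE_SIZE > cum && w < vocab_size - 1) {
                w++;
                cum += pow((double)counts[w], power) / norm;
            }
        }
        return table;
    }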
(Sep 13 '13 at 15:17)
Daniel Hammack
Thanks for the answer. Negative sampling is used in training both the 'cbow' and 'skip-gram' architectures - see the code here for details. My understanding of the code is that the Unigram Table precomputes the sampling of the unigrams according to a power-law distribution. Here are some notes that I took on the code. According to the authors of the code, negative sampling has certain advantages over hierarchical softmax when there are enough words in the training set and the feature space to be learned is low-dimensional. But I am still confused about why the sampling of the unigrams should follow such a power law, with the power explicitly set to 0.75.
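If I read the code correctly, drawing a negative sample then amounts to something like the following (a sketch reusing the hypothetical table and TABLE_SIZE from the snippet above; word2vec uses its own fast RNG rather than rand()):

    #include <stdlib.h>

    /* Pick a uniform random slot in the precomputed table; because frequent
       words occupy more slots, this samples word ids in proportion to
       count^0.75. */
    int sample_negative(const int *table, int table_size) {
        return table[rand() % table_size];
    }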
(Sep 14 '13 at 23:57)
dolaameng
Yeah, that makes sense. Actually, the code for training cbow with negative sampling looks very similar to a denoising auto-encoder, where the syn1 representation should be kept from exactly reproducing the sampled words from the training set. As for the 'smoothing' part, I am not quite sure about that yet, because the infrequent words have already been filtered out in earlier steps when the vocabulary is built. Thanks for the alternative explanation!
(Sep 15 '13 at 01:06)
dolaameng
In the paper originally describing negative sampling, the authors say that they tried a number of different choices for the noise distribution and found that the unigram distribution raised to the power 0.75 gave the best performance on a number of different tasks (end of section 2.2). It's not clear whether they tried other values for that power, or only compared it to the unigram and uniform distributions (as stated in the paper).
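To put a number on the "smoothing" effect: under the raw unigram distribution, a word that is 100 times more frequent than another is 100 times more likely to be drawn as a negative; after raising to the power 0.75 it is only 100^0.75 ≈ 31.6 times more likely, while under a uniform distribution both would be equally likely. So 0.75 sits between the unigram (power 1) and uniform (power 0) noise distributions that the paper compares against.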