|
I am reading the Google Correlate white paper, and I am confused about why both vector quantization and k-means are needed to reduce the dimensionality of the data on pages 5 and 6. k-means seems to do much the same job as vector quantization in the first place, so why use two passes of different algorithms? My best guess is that VQ can better take advantage of local substructure in the input (for example, by producing a small encoding for upward trends, downward trends, and peaks), especially in the weekly dataset. |
|
From reading those two pages, it seems there are two reasons they use both vector quantization and k-means. They use VQ for dimensionality reduction, to get an approximate location of where the cluster would be, and then use k-means to find the specific best match. This was likely done because of the size of their search space: given the amount of data they have, comparing against codebook vectors is likely to give a good approximation. At the same time, it does mean that the correlations returned are not always the absolute best matches. |
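
To make the approximate-then-exact idea concrete, here is a minimal pure-Python sketch. It is not taken from the white paper. It assumes one common arrangement in which each series is split into fixed-size chunks, k-means is used only to learn a small codebook per chunk, and vector quantization replaces each chunk with the index of its nearest codebook entry. All function names, the chunk size, the codebook size, and the shortlist size are hypothetical choices made for illustration.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length sequences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's algorithm; returns k centroids (the codebook).
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def split(vec, chunk):
    # Cut a series into consecutive chunks of length `chunk`.
    return [vec[i:i + chunk] for i in range(0, len(vec), chunk)]

def train_codebooks(series, chunk, k):
    # One codebook per chunk position, learned with k-means.
    per_position = zip(*(split(s, chunk) for s in series))
    return [kmeans(list(chunks), k) for chunks in per_position]

def encode(vec, codebooks, chunk):
    # Vector quantization: each chunk becomes the index of its nearest centroid.
    return [min(range(len(cb)), key=lambda j: sq_dist(part, cb[j]))
            for part, cb in zip(split(vec, chunk), codebooks)]

def approx_dist(query, code, codebooks, chunk):
    # Distance from the raw query to the quantized version of a stored series.
    return sum(sq_dist(part, cb[c])
               for part, cb, c in zip(split(query, chunk), codebooks, code))

def search(query, series, codes, codebooks, chunk, shortlist=5):
    # Pass 1: rank everything by the cheap approximate distance.
    ranked = sorted(range(len(series)),
                    key=lambda i: approx_dist(query, codes[i], codebooks, chunk))
    # Pass 2: re-rank only the shortlist using exact distances.
    return min(ranked[:shortlist], key=lambda i: sq_dist(query, series[i]))

# Toy usage: 40 random series of length 12, chunks of 4, 4 centroids per chunk.
rng = random.Random(1)
series = [[rng.random() for _ in range(12)] for _ in range(40)]
codebooks = train_codebooks(series, chunk=4, k=4)
codes = [encode(s, codebooks, 4) for s in series]
best = search(series[7], series, codes, codebooks, chunk=4)
```

Because the first pass only sees the quantized codes, a true nearest neighbour can fall outside the shortlist, which is the same trade-off noted above: the result is a good approximation, not a guaranteed best match. Making the shortlist larger reduces that risk at the cost of more exact comparisons.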