|
In reading Yoshua Bengio's excellent paper "Learning Deep Architectures for AI" I became puzzled about section 4 on Distributed Representations. As I understand it, a distributed representation is the concept in AI / brain research that we can represent knowledge with a grouping of cooperative units rather than a single symbolic unit; for example there is an explanation here. In the paper it is posited that general AI should have certain characteristics, such as deep hierarchies, the ability to operate in unsupervised manner, and others…one being a distributed representation. Now I understand that PCA and ICA are distributed representations, and I also understand why clustering is NOT a distributed representation; clusters are mutually exclusive i.e if you have 4 clusters you represent inputs by the closest one and thus your representation is something like [1 0 0 0] or [0 0 1 0], i.e a 4-element vector that is all 0s with a single 1. But I don’t understand why Gaussian Kernels, or a fuzzy clustering (for example instead of [1 0 0 0] you have something like [0.8 0.1 0.05 0.05]) wouldn’t be a distributed representation. I realize they may not the the most compact representation but aren’t they somewhat distributed? Can anyone clarify this? |
|
This is a very good question, to which I have devoted quite a bit of thought over the past decade. It is really about generalization to new configurations, and the need to learn highly-varying functions. A local representation is one in which the input space is divided into regions, and you need training examples in each region in order to generalize correctly within that region. This is true even if the regions are soft, as in the case of a Gaussian mixture or a Gaussian kernel SVM. The other property of a learner based on a local representation is that within each region the output prediction is very smooth (often constant, e.g. for k-nearest-neighbors or histograms or decision trees). With soft representations, there is typically some kind of interpolation between neighboring regions. In order to produce a highly-varying function with a local-generalization learner, you need enough (a lot of) regions (basically one per intended 'bump' or variation in the target function to be represented), and you basically need a number of training examples proportional to the number of these regions. But it is difficult to generalize far from these regions (except in trivial ways, such as with a constant or linear function extrapolating from the nearest training examples). Instead, with a distributed representation, there is the potential to generalize to a region (i.e. a configuration of the inputs) for which you have not seen any example. This is because such a region corresponds to a configuration of the distributed code associated with that region, and each bit (or real number) in that code corresponds to an aspect of the input, and it has parameters that can be estimated from examples in other regions, possibly quite far away in input space (and sharing similar values of that aspect). This is possible because all examples associated with the similarly-valued aspect have something in common (captured by that one element of the distributed representation), and some parameters are devoted to that aspect. For example, if you have pictures of people (say the input space is pixels), and you have learned a distributed representation that somehow captured features such as eye color, hair color, age, eye shape, eyebrow shape, mouth shape, nose shape, etc, then you may have enough examples to basically capture each of these aspects, without requiring one example for every possible combination of their (possibly discretized) values. This is how a learner based on distributed representations has the potential for non-local generalization. We have shown that this actually happens. Instead, a local representation would need to basically enumerate a set of examples (or templates or prototypes), each implicitly associated to only one configuration of these aspects, and it would not generalize to other configurations except in a very limited sense (for example, attribute the same properties to images that are very similar in surface, e.g., in euclidean distance), i.e., by interpolating to nearby examples (nearby in input space, which typically can become meaningless as soon as you go a bit of distance). In addition to the deep learning review paper, which mentions this issue, have written some papers about 'local vs non-local generalization', which discuss this issue in more detail, with proofs in specific cases, and experimental evidence in other cases: Yoshua Bengio and Yann LeCun, Scaling Learning Algorithms towards AI, in: Large Scale Kernel Machines, MIT Press, 2007. Yoshua Bengio, Olivier Delalleau and Nicolas Le Roux, The Curse of Highly Variable Functions for Local Kernel Machines, in: Advances in Neural Information Processing Systems 18 (NIPS'05), pages 107--114, MIT Press, 2006. Yoshua Bengio, Martin Monperrus and Hugo Larochelle, Nonlocal Estimation of Manifold Structure (2006), in: Neural Computation, 18(2509--2528). Yoshua Bengio, Hugo Larochelle and Pascal Vincent, Non-Local Manifold Parzen Windows, in: Advances in Neural Information Processing Systems 18 (NIPS'05), MIT Press, 2006. Yoshua Bengio, Olivier Delalleau and Clarence Simard, Decision Trees do not Generalize to New Variations (2010), in: Computational Intelligence. -- Yoshua Bengio |