I know that there are a number of intuitive explanations for the meaning of the eigenvalues and eigenvectors of a matrix, but I never quite grasped them. Could someone explain them to me or point me to a good source?
Your question would probably be better asked at mathoverflow.net, but that forum can be a little scary to wade into. I'd suggest posting there anyway, but I'll also give a description, albeit not a rigorous one.

Analyzing the eigenvectors and eigenvalues of a matrix in the simple case of a matrix-vector product can tell you a lot about how that matrix will skew a vector. In a sense, you can imagine each eigenvector grabbing hold of a rubber sheet -- the vector 'x' in the product Ax -- and pulling it in some direction. Each eigenvector pulls in its own direction, and the eigenvalue controls the strength of that pull.

The magnitude of an eigenvector itself can be anything: it's fairly straightforward to show that a factorization PDP^-1 built from scaled eigenvectors equals the factorization QEQ^-1 built from unit eigenvectors (note: PDP^-1 = QEQ^-1 = A), because rescaling the columns of P is cancelled by the corresponding rescaling inside P^-1 and leaves the eigenvalues untouched (D = E). This is at the heart of showing that an eigen decomposition of a matrix (into the form PDP^-1 or QEQ^-1) is a non-unique factorization of A, and it's why it's not at all uncommon to see the eigenvectors expressed as unit vectors.

In the 'best' case from the standpoint of numerical linear algebra, you'd want a full-rank matrix A with orthogonal (perpendicular) or orthonormal (perpendicular unit) eigenvectors. In reality, though, it's common to have a matrix without full rank (which corresponds to one or more eigenvalues being zero), or a matrix with nearly parallel eigenvectors, which makes it nearly singular or nearly non-invertible and corresponds to a high condition number. Matrices with a high condition number are less numerically stable, and when used with iterative methods they have a higher tendency to exhibit pathological behavior in whatever algorithm they're used in. Behavior along the other, well-separated eigenvectors will be less unstable, but in the directions of the nearly parallel eigenvectors things will be far more erratic.

As an example, there are a number of iterative methods for attempting to solve Ax=b without using least squares or directly inverting A. When 'x' points in a direction that is also nearly parallel to those nearly parallel eigenvectors, the algorithm will take more time to converge than if it pointed in a friendlier direction -- if it converges at all.

-Brian
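Here is a rough numpy/scipy sketch of these points, with matrices invented purely for illustration: the eigen decomposition A = PDP^-1, the freedom to rescale eigenvectors without changing the factorization, and the effect of the condition number on an iterative solver (conjugate gradient) applied to Ax = b.

```python
# A rough sketch (assuming numpy and scipy are available; the matrices are
# invented for illustration): the eigen decomposition A = P D P^-1, the freedom
# to rescale eigenvectors, and the effect of conditioning on an iterative solver.
import numpy as np
from scipy.sparse.linalg import cg  # conjugate gradient, one common iterative method

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

vals, P = np.linalg.eig(A)   # columns of P are unit eigenvectors
D = np.diag(vals)
assert np.allclose(A, P @ D @ np.linalg.inv(P))

# Rescaling the eigenvectors changes P but not D or the reconstructed matrix,
# which is why the factorization is non-unique and unit eigenvectors are the convention.
P_scaled = P @ np.diag([5.0, -0.3])
assert np.allclose(A, P_scaled @ D @ np.linalg.inv(P_scaled))

# Conditioning and iterative methods: two symmetric matrices with the same
# eigenvectors but very different eigenvalue spreads, solved with CG.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((200, 200)))  # a random orthonormal basis

def matrix_with_eigenvalues(eigs):
    return Q @ np.diag(eigs) @ Q.T

well = matrix_with_eigenvalues(np.linspace(1.0, 10.0, 200))   # condition number 10
ill = matrix_with_eigenvalues(np.linspace(1e-3, 10.0, 200))   # condition number 1e4
b = rng.standard_normal(200)

class IterationCounter:
    def __init__(self):
        self.n = 0
    def __call__(self, xk):
        self.n += 1

for name, M in [("well-conditioned", well), ("ill-conditioned", ill)]:
    counter = IterationCounter()
    x, info = cg(M, b, callback=counter)
    print(f"{name}: cond = {np.linalg.cond(M):.1e}, CG iterations = {counter.n}")
```

The ill-conditioned system needs many more CG iterations to reach the same default tolerance, which is the convergence slowdown described above.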
Actually, mathoverflow.net is for research-level questions; math.stackexchange.com is better.
(Jan 06 '11 at 12:48)
Yaroslav Bulatov
Nice, I didn't know about that one.
(Jan 06 '11 at 13:09)
Brian Vandenberg
@Yaroslav: Thanks; I was always afraid of posting dumb questions on mathoverflow.
(Jan 06 '11 at 13:17)
Alexandre Passos ♦
Nitpicking, but an important one: eigenvectors are by definition only vectors != the zero vector (A0 = 0 for any matrix A), so your sentence "(which corresponds to one or more eigenvectors being the zero vector)" should read "(which corresponds to one or more eigenvalues being zero)".
(Jan 06 '11 at 16:17)
osdf
Quite true. That should have said eigenvalues.
(Jan 06 '11 at 16:32)
Brian Vandenberg
To complement Brian's answer: the eigenvalues/eigenvectors of some special matrices have special properties, and knowing a few of these can save you a lot of trouble understanding papers.

One that is very common in machine learning is the equilibrium distribution of a Markov chain. A Markov chain over a discrete state space is a probability distribution on sequences of states such that the next state depends only on the current state. Given an initial state distribution, a Markov chain is specified fully by a transition matrix, that is, a matrix M such that M[i,j] = P(i|j), the probability of moving to state i given that the chain is currently in state j. Trivially, if you multiply M by the column vector that is 1 only on the current state, you get the distribution over the states that follow it. Some Markov chains are ergodic (there is always a positive probability of getting from each state to each other state through some path, and the chain does not cycle deterministically). These Markov chains have an equilibrium distribution: after running for some time, regardless of the initial state, the probability of finding the chain in a given state is always the same. Hence, if p is a vector such that p[i] is the stationary probability of M being in state i, then Mp = p; that is, p is an eigenvector of M with eigenvalue 1. This leads to all sorts of interesting algorithms, such as PageRank.

Another very common eigenvalue/eigenvector in machine learning is the one (or ones) used in spectral clustering. There the intuition for why it works is not as clear, but if you build a specific matrix from a graph (called the graph Laplacian matrix), the eigenvectors with the smallest eigenvalues (except for the eigenvector with the smallest eigenvalue itself) can be used to cluster the nodes of the graph in a way that approximates the best normalized cut.

Eigenvalues and eigenvectors are also related to matrix approximation. Since, as Brian said in his answer, the eigenvalues represent how much the matrix "stretches" vectors in the direction of the eigenvectors, a symmetric matrix can be written as V S V^T, where V has one eigenvector per column, S is a diagonal matrix with the corresponding eigenvalues, and V^T is V transposed. If the matrix has a few very big eigenvalues and lots of very small ones, you can keep just the eigenvectors with large eigenvalues in V and obtain a much smaller representation that is essentially equivalent to the original matrix. This is the idea behind PCA, LSI, and many other forms of dimensionality reduction (especially because this truncated approximation is the best possible low-rank approximation of the original matrix).

Last, eigenvalues and eigenvectors can be used to compute some important functions of the matrix: the determinant is the product of the eigenvalues, and the trace is the sum of the eigenvalues.
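A short numpy sketch of two of these points (the matrices below are made up for illustration): the stationary distribution falls out as the eigenvector of the transition matrix with eigenvalue 1, and truncating the eigen decomposition of a symmetric matrix gives the low-rank approximation behind PCA-style methods.

```python
# Stationary distributions and low-rank approximation via eigenvectors;
# the transition matrix and test matrix here are invented for illustration.
import numpy as np

# (1) A 3-state transition matrix, columns summing to 1: M[i, j] = P(next = i | current = j).
M = np.array([[0.90, 0.10, 0.30],
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])

vals, vecs = np.linalg.eig(M)
k = np.argmin(np.abs(vals - 1.0))   # the eigenvalue numerically closest to 1
p = np.real(vecs[:, k])
p = p / p.sum()                     # rescale the eigenvector into a probability distribution
print("stationary distribution:", p)
print("M p == p:", np.allclose(M @ p, p))
print("after 50 steps from state 0:", np.linalg.matrix_power(M, 50) @ np.array([1.0, 0.0, 0.0]))

# (2) Low-rank approximation of a symmetric matrix from its largest eigenvalues.
rng = np.random.default_rng(0)
B = rng.standard_normal((100, 5))
A = B @ B.T + 0.01 * np.eye(100)    # symmetric, essentially rank 5 plus a small bump

eigvals, V = np.linalg.eigh(A)      # eigh: ascending eigenvalues, orthonormal eigenvectors
top = slice(-5, None)               # keep only the 5 largest eigenvalues
A_approx = V[:, top] @ np.diag(eigvals[top]) @ V[:, top].T
print("relative approximation error:",
      np.linalg.norm(A - A_approx) / np.linalg.norm(A))
```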
[A brief redirect approaching off-topic material] I'd say that there are pretty solid intuitions about why spectral clustering works (via showing that the Graph Laplacian can be used as an approximation of MinCut). For more reading: http://www.kyb.mpg.de/publications/attachments/Luxburg07_tutorial_4488[0].pdf
(Jan 06 '11 at 14:24)
Andrew Rosenberg
@Andrew, yes, it does make sense, but I don't think it will be trivially understood by someone inexperienced in linear algebra and graph algorithms, which is why I avoided going into detail (while the Markov chain thing is really obvious any way you look at it)
(Jan 06 '11 at 14:26)
Alexandre Passos ♦
@Alexandre: From what you said about the Markov chain, I have a question. Using dynamical-systems lingo, if p1 is in the basin of attraction of some stable attractor (whether a point, a limit cycle, or just some general region that acts as a stable attractor), then p1 is at least close to some eigenvector of the Markov chain's transition function. If p1, or the region surrounding p1, can be thought of as an eigenvector, would it also be safe to assume there are other 'stable' regions p2, p3, ..., pn that are distinct and separate from p1? For example, if I use Gibbs sampling on an RBM, the point I end up at (near the equilibrium distribution) is in just one of many (as I'm understanding it) equilibrium regions for the transition function. -Brian
(Jan 07 '11 at 02:34)
Brian Vandenberg
@Brian, I think you are perhaps not talking about different equilibrium distributions but about modes of a single equilibrium distribution. If the Markov chain is ergodic (as it should be with Gibbs sampling), it will only converge to a single equilibrium distribution, but in practice it might get stuck in a single mode for a very long time, and that mode might be different every time you run the algorithm. (A small numerical illustration of this is sketched after this comment thread.)
(Jan 07 '11 at 04:39)
Philemon Brakel
@Phil: hm. Thinking of it from the perspective of a standard probability distribution, what you're saying should be plainly obvious; there should be one distribution over the input space. Where my question comes from, and perhaps where I'm confused, is the comparison of a stationary point drawn from the equilibrium distribution to an eigenvector of the transition process. For example, using a simple transition matrix, x_(n) = Ax_(n-1) will be at or near an equilibrium point when x_(n) ~= x_(n-1); but if there is more than one distinct equilibrium point, then each should be an eigenvector of the matrix A. So, training such a process [is? may?] be a matter of altering the eigenvectors of the process to make the equilibrium distribution match the data distribution. I understand a lot of the math behind it, I just haven't studied Markov processes in any serious depth yet, hence the questions/confusion. -Brian
(Jan 07 '11 at 10:58)
Brian Vandenberg
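To illustrate this exchange with a small made-up example: the chain below is ergodic, so it has a single stationary distribution (one eigenvector with eigenvalue 1), but its two groups of states are only weakly connected, so a single run tends to sit in one group, one 'mode', for a long time; the second-largest eigenvalue being close to 1 is exactly what makes the mixing slow.

```python
# Unique stationary distribution vs. slow mixing between two groups of states.
import numpy as np

eps = 1e-4  # small probability of hopping between group {0, 1} and group {2, 3}
M = np.array([[0.5 - eps, 0.5,       eps,       0.0      ],
              [0.5,       0.5 - eps, 0.0,       eps      ],
              [eps,       0.0,       0.5 - eps, 0.5      ],
              [0.0,       eps,       0.5,       0.5 - eps]])
# Columns sum to 1: M[i, j] = P(next = i | current = j); the chain is ergodic.

top_two = np.sort(np.linalg.eigvals(M).real)[::-1][:2]
print("two largest eigenvalues:", top_two)   # 1.0, and a second one very close to 1.0

# The stationary distribution is uniform (the matrix is doubly stochastic), but a
# simulated trajectory usually stays in the group it started in for ~1/eps steps.
rng = np.random.default_rng(1)
state, visits = 0, np.zeros(4)
for _ in range(1000):
    state = rng.choice(4, p=M[:, state])
    visits[state] += 1
print("empirical occupancy after 1000 steps:", visits / visits.sum())
print("stationary distribution:             ", np.full(4, 0.25))
```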
Try this out; it will give you a feel for what the concept means: http://www.physlink.com/education/askexperts/ae520.cfm
A matrix can be viewed as a linear transformation of vectors; eigenvectors are the vectors that stay on the same line after the transformation. You can get some intuition by playing around with eigenvalue demos like http://demonstrations.wolfram.com/LinearTransformationWithGivenEigenvectors/ (also search for "eigenvectors" on that site)
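A tiny numpy check of this picture (the matrix is invented for illustration): applying A to one of its eigenvectors only rescales it by the eigenvalue, while a generic vector is moved onto a different line.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
vals, vecs = np.linalg.eig(A)

v = vecs[:, 0]                        # an eigenvector of A
print(A @ v, "==", vals[0], "*", v)   # A v is just v scaled by its eigenvalue

u = np.array([1.0, 1.0])              # not an eigenvector of this A
print(A @ u, "is not a multiple of", u)
```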
A similar question can be found on a "competitor" site.
Thanks! There are some very nice answers in there.