Hi all, I've been working on calculating different measures of similarity between datasets I have that come in the form of time series, and I want to include a non-linear measure to capture possible non-linear interactions, which led me to mutual information. However, I'm having a tough time figuring out how to calculate it just from the Wikipedia page. How do I calculate the joint and marginal pdfs for my variables?


Also, I found an R package ("entropy" - http://strimmerlab.org/software/entropy/) that calculates it, but it looks like I need to bin the data for this to work.

If any of you could shed a little more light on Mutual Information for me, that would be great. Thanks.

EDIT 7/9/10: Thanks everyone for your responses! I'm still a little bit confused about how to construct P(x), P(y), and P(x,y) given my time series data. The points are perfectly lined up (i.e. the first point in x was taken at the same time as the first point in y).

asked Jul 05 '10 at 10:12


Alex Pelan

edited Jul 09 '10 at 11:25

7 Answers:

Hi, all. Thank you for your answers. For future reference (for anyone who finds this via Google or by searching MetaOptimize in the future), I actually found a Matlab package that no one here had mentioned, since I had trouble getting some of your answers to work. Just pass two vectors to the mutualinfo function and it returns the mutual information:

Matlab Code

For the simple approximation I needed, this was the easiest method.

answered Jul 14 '10 at 09:54


Alex Pelan

Be Careful! This issue can get quite complicated very fast, especially if your time series has a large alphabet.

For a simple answer, these papers are a good place to start. They apply mutual information to clustering of many data types, including financial time series, and they contain supplemental references and example MATLAB code which should be helpful for getting started.


The above work is about a clustering method, but they use mutual information as a measure of similarity! The method used for estimating the mutual information is a slight modification of the method described here: (Entropy and information in neural spike trains. SP Strong, R Koberle, RR de Ruyter van Steveninck & W Bialek, Phys Rev Lett 80, 197-200 (1998)), which is based on coincidence counting. This method is ultimately inspired by methods developed in the statistical physics literature many years ago by S K Ma, which introduce N variable-size bins so that each bin holds the same amount of probability mass. When this is satisfied the entropy is log(N), so all that is needed is to estimate the number of bins. (Think of an inverse birthday paradox: you guess the number of days in the year from the number of pairs in a group sharing a birthday.)
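The equal-probability-mass binning idea can be illustrated with a short sketch (Python here, purely for illustration; the bin count and sample are arbitrary assumptions, not from the papers):

```python
import numpy as np

def equal_mass_entropy(samples, n_bins):
    """Entropy (in nats) of a sample discretized into variable-width
    bins chosen so each bin holds roughly equal probability mass."""
    # Bin edges at empirical quantiles -> near-equal counts per bin.
    edges = np.quantile(samples, np.linspace(0.0, 1.0, n_bins + 1))
    counts, _ = np.histogram(samples, bins=edges)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
# With equal-mass bins the estimate sits at ~log(n_bins) by construction,
# so the hard part reduces to choosing/estimating the number of bins.
print(equal_mass_entropy(x, 32), np.log(32))
```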

There are many well-known difficulties with the estimation of entropy from a finite sample. In many cases you are in the undersampled limit (N << K), where the entropy is very sensitive to the mass in the tails and can therefore require the observation of very rare events, which are unlikely to appear except in a very large sample. Many practical estimators have been developed over the years:

Here is a nice article, which clearly explains, in depth, many of the statistical issues involved, and discusses some estimators and their asymptotic properties:

Liam Paninski, Estimation of entropy and mutual information, Neural Computation 15: 1191-1254

Here are more recent methods due to Nemenman, Bialek et al.:

I Nemenman, F Shafee, and W Bialek. "Entropy and inference, revisited." NIPS 14, (2002)

Entropy and information in neural spike trains: Progress on the sampling problem. I Nemenman, W Bialek & R de Ruyter van Steveninck, Phys Rev E 69, 056111 (2004);

Regarding the application of entropy or mutual information to time series: as another answer mentioned, this depends on the world of models you are willing to consider. If you assume that each observation in the time series is i.i.d., then you can apply the above methods. For many time series, i.i.d. is a terrible model, and you will want a model which can account for the fact that observations at different times are not independent (the present and future depend on the past).

This answer is marked "community wiki".

answered Jul 05 '10 at 14:31

jbowlan


The trick with mutual information is defining your variables and the models P(X), P(Y) and P(Y|X). Remember that the mutual information I(X;Y) is the expected Kullback-Leibler divergence E_{P(X)}[D(P(Y|X) || P(Y))], or equivalently D(P(X,Y) || P(X)P(Y)).

Mutual information I(X;Y) can be interpreted as the increase in accuracy when predicting Y already knowing X. So, in a temporal prediction problem you are comparing the two models P(y_t | y_{t-1}, y_{t-2}, ...) and P(y_t | y_{t-1}, x_{t-1}, y_{t-2}, x_{t-2}, ...) with the K-L divergence. This corresponds to mutual information.
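As a sanity check, mutual information can be computed numerically from a small joint distribution, both directly as D(P(X,Y) || P(X)P(Y)) and as the expected divergence of P(Y|X) from P(Y); the two agree. The 2x2 table below is a hypothetical example, not from the thread:

```python
import numpy as np

# A small (hypothetical) joint distribution P(x, y) over 2x2 outcomes.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)          # marginal P(x)
py = pxy.sum(axis=0)          # marginal P(y)

# I(X;Y) as KL( P(X,Y) || P(X)P(Y) ).
mi_joint = np.sum(pxy * np.log(pxy / np.outer(px, py)))

# Same number as the expected KL divergence E_x[ D( P(Y|X=x) || P(Y) ) ].
mi_cond = sum(px[i] * np.sum((pxy[i] / px[i]) * np.log((pxy[i] / px[i]) / py))
              for i in range(2))

print(mi_joint, mi_cond)
```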

For more information about mutual information, here's my dissertation.

answered Jul 08 '10 at 20:18


Aleks Jakulin

edited Jul 08 '10 at 22:43

This kind of problem is one alluded to in Claude Shannon's original paper "A Mathematical Theory of Communication". However, in his example, he assumed A) that he knew all of the transition probabilities P(x_{t+1}|x_t) and the initial distribution P(x_0) of a Markov chain, B) that the state space was discrete and finite, and C) that the Markov chain was ergodic. If all of this holds, then you can obtain the "stationary distribution" of the Markov chain and use that as the distribution from which to calculate the mutual information.
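For an ergodic chain with known transition probabilities, the stationary distribution can be obtained by simple power iteration; a minimal sketch with a hypothetical 2-state chain (Python, for illustration):

```python
import numpy as np

# Hypothetical ergodic 2-state Markov chain: T[i, j] = P(x_{t+1}=j | x_t=i).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Power iteration: an ergodic chain forgets its start, so any initial
# distribution converges to the unique stationary distribution pi = pi T.
pi = np.array([1.0, 0.0])
for _ in range(200):
    pi = pi @ T

print(pi)  # the stationary distribution
```

For this particular chain the fixed point works out to pi = (0.75, 0.25), which you can verify satisfies pi = pi T.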

answered Jul 09 '10 at 17:30

Daniel%20Duckwoth's gravatar image

Daniel Duckwoth

edited Jul 09 '10 at 17:30

Mutual information is a quantity that you compute based on two random variables for which you know the probability distributions (joint and marginals). If you just have two time series, which are not random variables, you have to make some modeling assumptions. The simplest model you could build is:

  • consider each point of a time series as being an i.i.d. sample coming from a fixed distribution
  • model distributions using histograms (for discrete values) or perhaps some Gaussians or Gaussian mixtures (for the continuous case)
  • compute Mutual Information directly from definition.
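The three steps above can be sketched for the histogram case. This is a minimal Python illustration; the bin count and the example series are assumptions, and the i.i.d. caveat below still applies:

```python
import numpy as np

def plugin_mutual_information(x, y, bins=16):
    """Plug-in MI estimate (in nats): bin both series, build the
    empirical joint histogram, and apply the definition directly.
    Treats each time point as an i.i.d. sample."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                      # empirical joint P(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal P(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal P(y)
    nz = pxy > 0                          # skip log(0) terms
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = x + 0.5 * rng.normal(size=5000)   # strongly dependent on x
z = rng.normal(size=5000)             # independent of x
print(plugin_mutual_information(x, y), plugin_mutual_information(x, z))
```

Note that the independent pair does not come out exactly zero: the plug-in estimator has a well-known positive bias of roughly (bins - 1)^2 / (2N) nats, which connects to the finite-sample issues discussed in the other answers.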


1) for the joint distribution you need to know how the time series are aligned (which should be fine if you have timestamps for each data point)

2) the i.i.d. assumption is probably not a good idea for a time series, however. Using a Markov Model, for example, would allow you to encode some time dependencies.

answered Jul 05 '10 at 11:30


Hugo Penedones

edited Jul 06 '10 at 04:59

MILCA is a toolbox in C and Matlab with a novel approach to estimating mutual information that doesn't require binning. It has a very good implementation here

"MIxnyn" and "MIhigherdim" are functions that can both estimate MI; please see the documentation in the sources for further information. The original paper is: Estimating mutual information, A. Kraskov, H. Stögbauer, and P. Grassberger, Phys. Rev. E 69 (6) 066138, 2004
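For intuition only, the k-nearest-neighbour idea behind the Kraskov et al. estimator can be sketched in a few lines of Python. This is a rough approximation of their first estimator, not the reference MILCA implementation, and the test data below are assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=4):
    """Binning-free MI estimate (nats) in the spirit of the
    Kraskov/Stoegbauer/Grassberger kNN estimator (a sketch)."""
    x = x.reshape(-1, 1)
    y = y.reshape(-1, 1)
    n = len(x)
    joint = np.hstack([x, y])
    # Distance to the k-th neighbour in the joint space (max-norm);
    # k+1 because each point is its own nearest neighbour.
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # Count strictly closer neighbours in each marginal space
    # (subtract the tiny constant to approximate a strict inequality).
    nx = cKDTree(x).query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

rng = np.random.default_rng(2)
x = rng.normal(size=2000)
y = 0.8 * x + 0.6 * rng.normal(size=2000)   # jointly Gaussian, corr 0.8
# True MI for correlation 0.8 is -0.5*log(1 - 0.8**2), about 0.51 nats.
print(ksg_mutual_information(x, y))
```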

you may also visit here

I have used it as a subroutine when implementing feature selection algorithms, mainly mRMR. I have also used it in linear regression with a constraint on the mutual information between predictors and response, a method I invented and am working on myself. Estimating mutual information this way is very fast and accurate.

answered Jul 08 '10 at 03:29


Ehsan Khoddam Mohammadi

edited Jul 08 '10 at 03:43

For comparing time series you may want to check out Approximate Entropy.

The following link has a step-by-step example and a link to C++ and MATLAB code. I'd check the code against the journal articles and the examples, though - I think some of the code on that site has been wrong.
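In the spirit of that advice, here is a hedged Python sketch of ApEn following the standard Pincus definition, to compare against whatever code you download (the example series and parameters are assumptions):

```python
import numpy as np

def approximate_entropy(series, m=2, r=0.2):
    """ApEn(m, r) of a 1-D series: regularity statistic comparing how
    often length-m patterns that match (within tolerance r) continue
    to match at length m+1.  r is often taken as 0.2 * std."""
    u = np.asarray(series, dtype=float)
    n = len(u)

    def phi(m):
        # All overlapping length-m windows of the series.
        w = np.array([u[i:i + m] for i in range(n - m + 1)])
        # Chebyshev (max-coordinate) distance between every window pair.
        dist = np.max(np.abs(w[:, None, :] - w[None, :, :]), axis=2)
        # C_i: fraction of windows within tolerance r (self-match included,
        # as in the original definition).
        c = np.mean(dist <= r, axis=1)
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)

reg = np.tile([0.0, 1.0], 150)                                    # regular
rnd = np.random.default_rng(3).integers(0, 2, 300).astype(float)  # random 0/1
print(approximate_entropy(reg), approximate_entropy(rnd))
```

A perfectly regular series scores near zero, while an unpredictable one scores close to log(2) per step, which is the sanity check worth running against any downloaded implementation.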


answered Jul 07 '10 at 10:23

bluelou


edited Jul 07 '10 at 10:23


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.