I want to know what a good machine learning method for forecasting an infection would be. In particular, my training data consists of multiple time-series indicating whether a set of individuals have been infected or not. The infection and network that it spreads through are both highly structured, so there's not nearly the randomness (I think) that you'd expect from a pathogenic epidemic.

I want to know how to forecast the eventual extent of an infection based on a few initial snapshots of how it has spread so far. What I've seen used before is a combination of expectation-maximization on node-to-node infection likelihood, followed by Monte Carlo simulations on the resulting model. I'd rather steer clear of this, since my particular infection (I hypothesize) spreads in large part based on the global state of the network, and because I'm missing certain data (e.g. important hubs in my network).

So what's an appropriate ML approach for forecasting this kind of thing?

Ideally, whatever algorithm you give will be easy to implement (or already in a good library), be able to take account of network structure without being bound by it, and have a flexible objective (so for instance I could try to predict the eventual percentage infected or try to predict particular nodes).

Edit: I found an interesting paper related to what I'm asking. It's not an answer in and of itself, but it's good guidance and probably interesting to other people as well: "Predicting Popularity of Online Content".

asked Aug 12 '10 at 14:53

Jacob Jensen

edited Aug 18 '10 at 12:49


2 Answers:

Assuming you'd rather not take the infection modeling route, why not model the percentage of infection as a single variable? Construct features from the known information in your network and treat it as a traditional regression problem.

You could then treat the infection of a particular node as a secondary classification task that takes the predicted overall infection rate as input. This approach lets you avoid explicitly modeling each node and instead consider "The Network" as a whole, or "A Particular Node" and "The Rest of The Network", depending on the task.
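To make the regression idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the single feature (`exposed_fraction`, the share of nodes that are infected or adjacent to an infected node in an early snapshot), the dict-of-neighbor-lists graph format, and the one-variable least-squares fit. In practice you would probably build several features and use a regression library.

```python
def exposed_fraction(adjacency, infected):
    """Fraction of nodes that are infected or adjacent to an infected node.

    `adjacency` is assumed to map every node to a list of its neighbors.
    """
    infected = set(infected)
    exposed = set(infected)
    for node in infected:
        exposed.update(adjacency.get(node, ()))
    return len(exposed) / len(adjacency)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("feature has no variance; cannot fit a slope")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def predict_final_fraction(model, adjacency, infected):
    """Predict the eventual infected fraction, clipped to [0, 1]."""
    slope, intercept = model
    raw = slope * exposed_fraction(adjacency, infected) + intercept
    return min(1.0, max(0.0, raw))
```

Training would pair the feature from each historical cascade's early snapshot with that cascade's final infected fraction, call `fit_line` on the two lists, and then apply `predict_final_fraction` to a new snapshot. The predicted fraction could then be fed as one input to a per-node classifier for the secondary task.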

Personally, I'd think that modeling the peer-to-peer infection likelihood and running simulations would be a more effective approach, but this could be another way to address the problem.
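For comparison, the simulation route can be sketched roughly as follows. This assumes an independent-cascade style process with one uniform transmission probability `p` per edge, which is a simplification made up for the example; an EM step would normally estimate a separate likelihood for each node pair. The function names and the graph format (a dict mapping each node to its neighbors) are hypothetical.

```python
import random


def simulate_cascade(adjacency, seeds, p, rng):
    """Run one cascade: each newly infected node gets one chance to
    infect each healthy neighbor with probability p."""
    infected = set(seeds)
    frontier = list(infected)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in adjacency.get(node, ()):
                if neighbor not in infected and rng.random() < p:
                    infected.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return infected


def expected_final_fraction(adjacency, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of the eventual infected fraction."""
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    total = 0
    for _ in range(runs):
        total += len(simulate_cascade(adjacency, seeds, p, rng))
    return total / (runs * len(adjacency))
```

Here `seeds` would be the nodes infected in the latest snapshot. Counting how often each node ends up infected across runs, instead of just the totals, would give per-node predictions from the same simulations.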

answered Aug 17 '10 at 16:03

Andrew Rosenberg

Can you describe your data a bit more? Do you have any explanatory variables apart from the infection?

answered Aug 17 '10 at 06:53

dirknbr

It's the spread of a meme in a blog network. I have blog connection and post information, which is actually more than enough info. The analysis is the part I don't know how to handle.

(Aug 17 '10 at 15:32) Jacob Jensen

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.