I'm giving a talk about how you can improve modeling accuracy simply by adding more data, without changing the model: "The art of predictive analytics: More data, same models"

What techniques do you know about for doing so? I will list some of the techniques I know, but I am looking for more suggestions.

Also, real-world examples are good. I want more specific examples of the techniques I outline in my answer.

asked Jan 26 '12 at 04:01

Joseph Turian ♦♦

edited Jan 31 '12 at 23:14

I don't know if you can help me get in to see your talk; the RSVP list is full. I've followed MetaOptimize for a while.

(Jan 27 '12 at 18:20) Rob Renaud

@Rob: Email me: joseph at metaoptimize dot com.

(Jan 30 '12 at 20:40) Joseph Turian ♦♦

4 Answers:
  • Find more training data on the web that is applicable to the problem.
  • Turk (annotate) more training data.
  • Use distant supervision. For example, in the Twitter sentiment analysis work, the authors used emoticons in tweets as noisy training labels (see the first sketch after this list).
  • Use a self-training / bootstrapping approach (see the second sketch after this list).
  • Use active learning to select the most informative new training examples to annotate.
  • Use better input features:
      • Train an unsupervised model on a lot of unlabeled data, and use its outputs as input features. For example, word embeddings or document clusters.
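Here is a minimal sketch of distant supervision in the emoticon style mentioned above; the emoticon lists, the tweet variable, and the 0/1 label encoding are illustrative assumptions, not taken from the original work:

```python
# Distant supervision: derive noisy labels from a heuristic signal
# (here, emoticons), then train a classifier on the pseudo-labeled data.
POSITIVE = (":)", ":-)", ":D")  # assumed emoticon lists
NEGATIVE = (":(", ":-(")

def distant_label(tweet):
    """Return 1 for positive, 0 for negative, None if no signal."""
    if any(e in tweet for e in POSITIVE):
        return 1
    if any(e in tweet for e in NEGATIVE):
        return 0
    return None  # no emoticon: leave unlabeled
```

And a sketch of self-training, assuming numpy feature matrices and a scikit-learn classifier; the confidence threshold and the number of rounds are placeholder choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    """Repeatedly train, pseudo-label confident unlabeled points, and retrain."""
    X, y = X_labeled.copy(), y_labeled.copy()
    model = LogisticRegression(max_iter=1000).fit(X, y)
    for _ in range(rounds):
        if len(X_unlabeled) == 0:
            break
        proba = model.predict_proba(X_unlabeled)
        confident = proba.max(axis=1) >= threshold  # keep only confident predictions
        if not confident.any():
            break
        # Add the confidently pseudo-labeled points to the training set.
        X = np.vstack([X, X_unlabeled[confident]])
        y = np.concatenate([y, model.classes_[proba[confident].argmax(axis=1)]])
        X_unlabeled = X_unlabeled[~confident]
        model = LogisticRegression(max_iter=1000).fit(X, y)
    return model
```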

answered Jan 26 '12 at 04:04

Joseph Turian ♦♦

Not every model can benefit from more data; only high-variance models do.

answered Jan 29 '12 at 20:35

Melipone Moody

There's a short, fun article in a recent issue of the American Statistician that might be a toy example of what you want:

"Fisher’s Conditionality Principle in Statistical Pattern Recognition", The American Statistician Aug 2011, Vol. 65, No. 3: 167–169

It's essentially a stylized example of how an ancillary statistic can be used to improve classification. Not a real world example, but very simple to state and present.

answered Jan 30 '12 at 22:04

Chris Jordan Squire

I agree with Melipone: not every problem benefits from getting more training data. If you have high variance, you can benefit from more training data; on the other hand, if you have high bias, it does not matter how much training data you add, you'll probably end up with the same error.

Here are Andrew Ng's slides on how to deal with different issues in different ML settings.
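A quick way to run this diagnostic yourself is to plot a learning curve: if training and validation scores converge to a similarly poor value, you are in the high-bias regime and more data won't help; a persistent gap between them suggests high variance, where more data should help. This sketch (not from the slides) uses a synthetic dataset and a placeholder estimator just to show the mechanics:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Mean cross-validated train/validation accuracy at increasing training sizes.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train accuracy={tr:.3f}, validation accuracy={va:.3f}")
```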

On how to add more training data:

Depending on your input, if you need more training data, you can also generate a set of artificial training data, as Bishop describes in the neural networks chapter of Pattern Recognition and Machine Learning. You can basically apply transformations to the data you have. There is a proof that training on transformed data and using tangent propagation to make the model robust against those variations are closely related; see the regularization section at the end of the neural networks chapter.
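Here is a minimal sketch of that idea for image data, assuming inputs are arrays of shape (n, height, width); the one-pixel shifts are an illustrative choice of label-preserving transformation, and the right transformations depend on your domain:

```python
import numpy as np

def augment_with_shifts(images, labels, shifts=(-1, 1)):
    """Return the originals plus copies shifted one pixel along each axis."""
    augmented_x, augmented_y = [images], [labels]
    for s in shifts:
        for axis in (1, 2):  # rows, then columns
            # np.roll wraps pixels around the border; a cheap approximation
            # to a true translation for small shifts.
            augmented_x.append(np.roll(images, s, axis=axis))
            augmented_y.append(labels)
    return np.concatenate(augmented_x), np.concatenate(augmented_y)
```

This turns n labeled examples into 5n, at the cost of some label noise if a transformation is not truly label-preserving.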

answered Jan 30 '12 at 23:34

Leon Palafox ♦
