I am analyzing the effects of noise on my learning algorithm, so I need to add a controlled amount of noise to my feature space. I am working with text data, so my features are numeric (word counts). What is the best or standard way to add noise to this sort of data?

asked Sep 15 '10 at 09:51

priya venkateshan

3 Answers:

I know you asked about noising up the feature space, but perhaps making the document labels noisy will serve your needs? Assuming you are doing supervised modeling, altering a fixed percentage of the training labels is simple.
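
As a rough illustration of the label-noise idea, here is a minimal sketch that changes a fixed fraction of training labels to a different class. The function name `flip_labels`, its parameters, and the toy labels are assumptions made for this example, not part of any library.

```python
import random

def flip_labels(labels, noise_rate, classes, seed=0):
    """Return a copy of `labels` with a fixed fraction changed to another class.

    Hypothetical helper: `noise_rate` is the fraction of labels to alter.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    # Sample distinct positions so exactly n_flip labels are altered.
    for i in rng.sample(range(len(noisy)), n_flip):
        # Pick uniformly among the other classes so the label really changes.
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

labels = ["spam", "ham"] * 50  # toy data, 100 labels
noisy_labels = flip_labels(labels, noise_rate=0.1, classes=["spam", "ham"])
```

Fixing the seed keeps the noise reproducible, so runs at different noise rates can be compared fairly.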

Making the features noisy is harder, especially if you want to preserve the distribution you had before injecting noise. In addition to Alex's suggestions, I would also consider permuting features across documents: at x% noise, randomly pick x% of the words in each document and shuffle them across documents. You may or may not want to constrain this randomization to preserve the original lengths of the documents.
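
One possible reading of the permutation idea, sketched for the length-preserving variant: select a fraction of token positions in every document, pool the selected tokens, shuffle the pool, and write the tokens back into the same positions. Documents are assumed to be lists of tokens; `shuffle_tokens_across_docs` is a hypothetical name chosen for this example.

```python
import random

def shuffle_tokens_across_docs(docs, noise_rate, seed=0):
    """Shuffle a fraction of each document's tokens across the whole corpus.

    Document lengths and corpus-wide word counts are both preserved.
    """
    rng = random.Random(seed)
    noisy = [list(doc) for doc in docs]  # copy so the input is untouched
    slots = []
    for d, doc in enumerate(noisy):
        k = round(noise_rate * len(doc))
        # Choose k distinct token positions in this document.
        slots.extend((d, i) for i in rng.sample(range(len(doc)), k))
    pool = [noisy[d][i] for d, i in slots]
    rng.shuffle(pool)
    # Put the shuffled tokens back into the selected positions.
    for (d, i), token in zip(slots, pool):
        noisy[d][i] = token
    return noisy

docs = [
    ["the", "cat", "sat", "down"],
    ["stocks", "fell", "sharply", "today"],
    ["rain", "is", "expected", "tomorrow"],
]
shuffled = shuffle_tokens_across_docs(docs, noise_rate=0.5)
```

Some shuffled tokens can land back in their original document, so the effective noise is at most x%. Word counts for the bag-of-words model would be recomputed from the shuffled token lists.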

answered Sep 15 '10 at 12:56

Art Munson

I'm assuming you're doing bag-of-words classification. I'd just add some words drawn uniformly at random from the vocabulary to each document, and/or some words drawn with probability proportional to their full-corpus counts.
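
A minimal sketch of both variants, assuming each document is stored as a `collections.Counter` of word counts. The helper `add_random_words` and its `n_extra` and `weighted` parameters are names invented for this example.

```python
import random
from collections import Counter

def add_random_words(doc_counts, corpus_counts, n_extra, weighted=False, seed=0):
    """Add `n_extra` random word occurrences to one document's counts.

    weighted=False draws uniformly from the vocabulary;
    weighted=True draws in proportion to full-corpus counts.
    """
    rng = random.Random(seed)
    vocab = sorted(corpus_counts)  # fixed order for reproducibility
    weights = [corpus_counts[w] for w in vocab] if weighted else None
    noisy = Counter(doc_counts)  # copy so the input is untouched
    noisy.update(rng.choices(vocab, weights=weights, k=n_extra))
    return noisy

corpus = [
    Counter({"the": 3, "cat": 1, "sat": 1}),
    Counter({"the": 2, "stocks": 1, "fell": 1}),
]
corpus_counts = sum(corpus, Counter())
uniform_noisy = add_random_words(corpus[0], corpus_counts, n_extra=4)
weighted_noisy = add_random_words(corpus[0], corpus_counts, n_extra=4, weighted=True)
```

Setting `n_extra` as a fraction of each document's length would make the noise level comparable across documents of different sizes.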

You can also add some noise features; that is, invent a few "words" that show up in x% of the documents, for different values of x. In general, there is no standard way of doing this for bag-of-words classification.
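
The invented-word idea could look roughly like this, again assuming documents are `collections.Counter` objects. The function `add_noise_features` and the `__noise_…__` naming scheme are made up for this sketch.

```python
import random
from collections import Counter

def add_noise_features(docs, rates, seed=0):
    """For each rate x, add one invented word to a random fraction x of the documents."""
    rng = random.Random(seed)
    noisy = [Counter(doc) for doc in docs]  # copy so the input is untouched
    for rate in rates:
        name = f"__noise_{rate:g}__"  # hypothetical token unlikely to clash with real words
        n_hit = round(rate * len(noisy))
        # Choose the documents that receive this feature, independent of the labels.
        for d in rng.sample(range(len(noisy)), n_hit):
            noisy[d][name] += 1
    return noisy

toy_docs = [Counter({"word": 1}) for _ in range(20)]
with_noise = add_noise_features(toy_docs, rates=[0.05, 0.25, 0.5])
```

Because the documents are chosen without looking at the labels, these features carry no signal, which makes them useful for checking how well the learner ignores irrelevant inputs.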

answered Sep 15 '10 at 10:50

Alexandre Passos ♦

The first question I would ask is what sort of noise you expect to encounter in real data.

answered Dec 06 '10 at 13:00

Dave Lewis

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.