|
I am analyzing the effects of noise on my learning algorithm, so I need to add a measured amount of noise to my feature space. I am working with text data, so my features are numeric (word counts). What is the best or standard way to add noise to this sort of data? |
|
I know you asked about noising up the feature space, but perhaps making the document labels noisy will serve your needs? Assuming you are doing supervised modeling, altering a fixed percentage of the training labels is simple. Making the features noisy is harder, especially if you want to preserve the distribution you had before injecting noise. In addition to Alex's suggestions above, I would also consider permuting features across documents. At a noise level of x%, randomly pick x% of the words in each document and shuffle them across documents. You might or might not want to constrain this randomization to preserve the original lengths of the documents. |
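Here is a rough sketch of both ideas, in case it helps. It assumes labels are a plain list and each document is a list of tokens; the function names `flip_labels` and `shuffle_words_across_docs` and the `seed` parameter are made up for illustration. This version of the shuffle keeps every document at its original length, which is only one of the two options mentioned above.

```python
import random


def flip_labels(labels, noise_rate, classes, seed=0):
    """Return a copy of `labels` with a fixed fraction changed to another class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    # Pick distinct positions so exactly n_flip labels end up altered.
    for i in rng.sample(range(len(noisy)), n_flip):
        alternatives = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


def shuffle_words_across_docs(docs, noise_rate, seed=0):
    """Pick a fraction of token positions in each document and permute the
    tokens at those positions across the whole corpus.

    Document lengths are preserved because tokens are swapped in place.
    """
    rng = random.Random(seed)
    noisy = [list(doc) for doc in docs]
    slots = []  # (document index, token position) pairs chosen for shuffling
    for d_idx, doc in enumerate(noisy):
        k = round(noise_rate * len(doc))
        for pos in rng.sample(range(len(doc)), k):
            slots.append((d_idx, pos))
    pool = [noisy[d][p] for d, p in slots]
    rng.shuffle(pool)
    for (d, p), word in zip(slots, pool):
        noisy[d][p] = word
    return noisy
```

Note that a shuffled token can land back in the document it came from, so the effective noise is at most x%, not exactly x%. The corpus-wide word counts are unchanged by the shuffle; only the assignment of words to documents changes.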
|
I'm assuming you're doing bag-of-words classification. I'd just add some uniformly random words to each document and/or add some weighted random words (weighted by full-corpus counts). You can also add some noise features; that is, invent a few "words" that show up in x% of the documents, for different values of x. In general, there is no standard way of doing this for bag-of-words classification.
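Something along these lines would do it. This is only a sketch: it assumes each document is a `collections.Counter` of word counts, and the names `add_random_words`, `add_noise_feature`, and the `__noise__` token in the usage example are invented for illustration.

```python
import random
from collections import Counter


def add_random_words(doc_counts, corpus_counts, n_extra, weighted=True, seed=0):
    """Return a copy of `doc_counts` with `n_extra` random vocabulary words added.

    With weighted=True, words are drawn in proportion to their corpus counts;
    otherwise every vocabulary word is equally likely.
    """
    rng = random.Random(seed)
    vocab = sorted(corpus_counts)
    weights = [corpus_counts[w] for w in vocab] if weighted else None
    noisy = Counter(doc_counts)
    for word in rng.choices(vocab, weights=weights, k=n_extra):
        noisy[word] += 1
    return noisy


def add_noise_feature(docs_counts, name, rate, seed=0):
    """Add an invented word `name` to a fixed fraction `rate` of the documents."""
    rng = random.Random(seed)
    n_hit = round(rate * len(docs_counts))
    hit = set(rng.sample(range(len(docs_counts)), n_hit))
    noisy_docs = []
    for i, doc in enumerate(docs_counts):
        noisy = Counter(doc)
        if i in hit:
            noisy[name] += 1
        noisy_docs.append(noisy)
    return noisy_docs
```

For example, `add_noise_feature(docs, "__noise__", 0.3)` puts the made-up token in 30% of the documents. Because the documents are chosen independently of the labels, the new feature carries no signal by construction.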
|
The first question I would ask is what sort of noise you expect to encounter in real data.