The question is fairly self-explanatory: I have a large number of transcribed phone calls to businesses that I would like to classify. The set of possible classes is small, but might grow later. The transcriptions are poor, since we can't train the acoustic models to individual speakers.

So I guess what I am asking for is some helpful papers, or even just guidelines, on how I might adapt existing, "ordinary" text classifiers to account for the higher level of noise in my data.

asked Jul 23 '10 at 18:45

george s

2 Answers:

It will help if you have a good prior. As a prior, you can use a model trained on clean data (normal text) and regularize your feature weights so that they do not deviate too far from it. Use strong regularization, because you don't want to fit the actual noisy observations too closely. Also, use a combination of classifiers rather than just one, get lots of training data, and consider some heuristic outlier detection to discard outliers or down-weight them appropriately.
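As a rough illustration of regularizing toward a clean-text prior, here is a minimal sketch using plain logistic regression. Everything in it is an assumption made for the example: the function name `train_with_prior`, the penalty form `(lam / 2) * ||w - w_prior||^2`, and the idea that `w_prior` holds weights from a model already trained on clean text with the same feature layout. It is one possible reading of the suggestion, not a prescribed method.

```python
import math

def train_with_prior(X, y, w_prior, lam=1.0, lr=0.1, epochs=200):
    """Binary logistic regression whose weights are pulled toward w_prior.

    Minimizes: mean log-loss + (lam / 2) * ||w - w_prior||^2
    X is a list of feature vectors, y a list of 0/1 labels.
    w_prior is assumed to come from a model trained on clean text.
    """
    w = list(w_prior)  # start from the clean-text model
    n = len(X)
    for _ in range(epochs):
        # Gradient of the penalty: pulls each weight back toward the prior.
        grad = [lam * (wj - pj) for wj, pj in zip(w, w_prior)]
        # Gradient of the mean log-loss on the noisy transcripts.
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def distance(u, v):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

A larger `lam` keeps the learned weights closer to the clean-text model, so the noisy transcripts can only nudge them; a small `lam` lets the noisy data dominate. The value would normally be picked on held-out transcripts.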

answered Jul 23 '10 at 18:57

Frank

One thing you can do is degrade your inputs a bit. For example, if "t"s and "d"s are often confused by your transcription software, replace both with a single arbitrary symbol. In the same way, if a letter is often dropped (say, a silent "g" at the end of a word), you can remove it everywhere else it appears. If some words are commonly mistaken for one another, use a single feature for them, and so on. I'm not sure that transfer learning is the way to go, since you have corrupted features, and you would have to find a larger data set of labeled examples for your specific problem, which is not always available.
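The degrading step could be sketched roughly as follows. The confusion tables here (`CHAR_CLASSES`, `DROPPED_CHARS`, `WORD_CLASSES`) are hypothetical placeholders; real ones would have to come from looking at the errors your own transcription software actually makes.

```python
import re

# Hypothetical confusion rules, for illustration only.
CHAR_CLASSES = {"t": "T", "d": "T"}   # "t" and "d" collapse to one symbol
DROPPED_CHARS = {"g"}                 # a letter the transcriber often drops
WORD_CLASSES = {"their": "THEIR_THERE", "there": "THEIR_THERE"}

def degrade_token(token):
    """Map a token to a coarser form that ignores known confusions."""
    token = token.lower()
    # Whole words that get mistaken for each other share one feature.
    if token in WORD_CLASSES:
        return WORD_CLASSES[token]
    out = []
    for ch in token:
        if ch in DROPPED_CHARS:
            continue  # remove the unreliable letter everywhere
        out.append(CHAR_CLASSES.get(ch, ch))
    return "".join(out)

def degrade(text):
    """Tokenize text and return the list of degraded, non-empty tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    degraded = (degrade_token(tok) for tok in tokens)
    return [tok for tok in degraded if tok]
```

The same mapping has to be applied to both training and test transcripts, so that the classifier only ever sees the degraded vocabulary.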

answered Jul 23 '10 at 20:05

Alexandre Passos ♦

edited Jul 23 '10 at 20:07


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.