Hi,

I would like to find out the syntactic properties of noisy (informal) text. Traditional deep learning syntactic markers like dependency trees and Part-Of-Speech (POS) tags may not work owing to the noisy nature of the data. In addition to simple markers like punctuation, capital words, position of x words, average word length, etc., can we do some deeper analysis on such data to characterize its syntactic properties?
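For concreteness, the simple surface markers listed above can be computed directly from raw text. The following is only a minimal sketch: the function name `surface_features` and the feature names are hypothetical, and the "elongated word" feature (e.g. "sooo") is an illustrative addition that is not mentioned in the question.

```python
import re
import string


def surface_features(text):
    """Compute a few surface markers for one comment.

    The feature set is illustrative (an assumption), not a standard one.
    """
    tokens = text.split()
    # Strip leading/trailing punctuation so "GOOD!!!" counts as the word "GOOD".
    words = [t.strip(string.punctuation) for t in tokens]
    words = [w for w in words if w]
    n_chars = max(len(text), 1)   # avoid division by zero on empty input
    n_words = max(len(words), 1)
    return {
        # Share of characters that are ASCII punctuation.
        "punct_ratio": sum(ch in string.punctuation for ch in text) / n_chars,
        # Share of words written entirely in capitals (length > 1 skips "I").
        "all_caps_ratio": sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        # Average word length in characters.
        "avg_word_len": sum(len(w) for w in words) / n_words,
        # Share of words with a character repeated three or more times.
        "elongated_ratio": sum(bool(re.search(r"(.)\1\1", w)) for w in words) / n_words,
    }


features = surface_features("WOW this is sooo GOOD!!!")
```

Each comment then becomes a small numeric vector that can be appended to bag-of-words features.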

asked Oct 29 '10 at 06:47

mcenley

Speech recognition often has to deal with very informal text. What speech people do, I think (but you should wait until a speech specialist answers), is use models that capture properties similar to those of supervised grammatical models and tune them to improve the system's performance, for example an HMM language model (which can have POS-tagging-like properties).

(Oct 29 '10 at 07:04) Alexandre Passos ♦

Alexandre, Thanks for the response. Waiting for some more answers as you said.

(Oct 29 '10 at 07:45) mcenley

2 Answers:

Could you elaborate a bit on the nature of the data or the task you would like to solve? First, there is little or no syntax in punctuation, capital letters, position of x words, word length, etc. Second, one can by no means use dependency or phrase structure trees, POS tags, etc. to do syntactic analysis on noisy data. As Alexandre mentioned with the speech people, speech recognition and similar tasks may require analysis of really noisy data, with pauses, changes of topic, unfinished utterances, "uh"s, "mmm"s and lots of other sounds, no capital letters or punctuation, and imperfect recognition of the spoken text ... and people still succeed in doing some syntactic analysis. So, what's your problem? What do you want to do?

answered Oct 31 '10 at 17:39

Svetoslav Marinov

Svetoslav, Thanks for the response. The data is textual: I have a collection of text comments/posts from Internet forums such as YouTube. I understand that there's very little syntax in the features I mentioned, but that was precisely my question! I would like to characterize forum comments by genre.

(Nov 01 '10 at 06:20) mcenley

@Denzil Correa: and why doesn't a linear classifier with word features work? If it doesn't, can't you improve it by adding spelling-correction features, word-similarity features, etc., to cover the mistakes the algorithm is making?

(Nov 01 '10 at 06:32) Alexandre Passos ♦

Alexandre, A linear classifier with Bag-Of-Words is a baseline which I have implemented. It doesn't give great results. I would like to improve over the baseline.

(Nov 01 '10 at 07:00) mcenley

A good way to improve performance is to look at the classifier's feature list to see whether it's learning too much noise, inspect the mistakes the classifier makes to figure out what extra information could steer it in the right direction, and then add that information as features.

(Nov 01 '10 at 07:02) Alexandre Passos ♦
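
The inspection loop suggested in the comment above can be sketched in a few lines. This is a minimal illustration, assuming a binary linear bag-of-words model whose weights are available as a plain dictionary; the helper names (`top_features`, `score`, `mistakes`) and the toy weights are hypothetical and not taken from any particular library.

```python
def top_features(weights, k=10):
    """Return the k (feature, weight) pairs with the largest absolute weight."""
    return sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]


def score(weights, tokens):
    """Linear bag-of-words score: sum of the weights of the tokens present."""
    return sum(weights.get(tok, 0.0) for tok in tokens)


def mistakes(weights, examples):
    """Return (tokens, gold, predicted) for every misclassified example.

    Labels are assumed to be +1 / -1.
    """
    errors = []
    for tokens, gold in examples:
        predicted = 1 if score(weights, tokens) >= 0 else -1
        if predicted != gold:
            errors.append((tokens, gold, predicted))
    return errors


# Toy weights and labelled examples, invented purely for illustration.
weights = {"lol": 1.5, "the": -0.2, "xD": 2.0, "report": -1.0}
examples = [(["lol", "the"], 1), (["the", "report", "lol"], -1)]

strongest = top_features(weights, k=2)
errors = mistakes(weights, examples)
```

Reading `strongest` shows whether the model leans on noise tokens, and reading `errors` suggests which extra features might have separated the misclassified comments.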

Alexandre, Thanks. I will try to do that.

(Nov 08 '10 at 08:38) mcenley

Denzil, the syntactic properties of the text/sentences/phrases would not help much. We did some experiments with a dependency-based parser (MaltParser) for text clustering, and we concentrated only on extracting subjects and objects. My advice as a linguist is that syntactic information will give you few clues for distinguishing genres (e.g. what is the difference between News: "A volcano erupted today in the South-East Pacific" vs. Prose: "Her intricate use of metaphors is thrilling and will keep the reader engulfed in the reading."?). Maybe (but I haven't thought much about it) by analyzing some sample data you will see that sentence lengths differ, or find heavy use of adjectives (POS tags), lack of subjects (syntax), or lack of articles (POS tags). Also, as Alexandre suggests, check whether your classifier is learning too much noise (BUT get a clear idea of what counts as noise for your present task). Hope this helps a bit.

(Nov 15 '10 at 10:11) Svetoslav Marinov

As other responders have indicated, it really matters how you want to use this information.

There are probably two main classes of answers here: 1) are you interested in building a system that tries to take advantage of this information, or 2) are you trying to understand something about the syntax of noisy text (comments/chat/tweets/etc.)? (These two are not mutually exclusive, mind you.)

If it's the former, you can use the noisy results from off-the-shelf parsers, taggers, etc. and see what you get. Often, I've found that noisy intermediate processors aren't all that large a problem, because the way that they are noisy is fairly regular. Take this very simple example. I have a low-coverage tagger, and when I encounter an OOV word I tag it with a proper name tag (PPN). Now when I use this crappy tagger and build a linear model or conditional n-gram model over tags or tag sequences, I find that long sequences of PPNs are common. But maybe it turns out that these sequences are highly correlated with the variable of interest -- say, sentiment polarity. It's not actually a sequence of proper names, it's a sequence of UNKs or OOV words, but for whatever reason people use more OOV words when performing one communicative function than another. In this example, I'm able to extract highly discriminative information from the noisy behavior of the automatic intermediate system. In other words, there is signal in the noise.
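
The low-coverage tagger example can be made concrete with a toy sketch. Everything here is an assumption for illustration: the four-word lexicon, the helper names (`tag_tokens`, `tag_ngrams`, `longest_run`), and the choice of tag bigrams and longest-run length as features.

```python
from collections import Counter


def tag_tokens(tokens, lexicon, oov_tag="PPN"):
    """Look each token up in a small lexicon; anything unknown gets oov_tag."""
    return [lexicon.get(tok.lower(), oov_tag) for tok in tokens]


def tag_ngrams(tags, n=2):
    """Count tag n-grams, usable as features for a linear or n-gram model."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def longest_run(tags, target="PPN"):
    """Length of the longest consecutive run of the target tag."""
    best = current = 0
    for tag in tags:
        current = current + 1 if tag == target else 0
        best = max(best, current)
    return best


# Toy lexicon, deliberately tiny so that informal spellings fall out as OOV.
lexicon = {"i": "PRP", "love": "VB", "this": "DT", "song": "NN"}

tags = tag_tokens("omg luv dis song".split(), lexicon)
bigrams = tag_ngrams(tags)
run = longest_run(tags)
```

Here the run of PPN tags is really a run of unknown words, which is exactly the regular kind of tagger noise that a downstream classifier can exploit.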

If it's the latter, it's a different and more difficult problem. You've essentially got to make a parser more robust to irregularities. One recent paper that looked at this was presented at HLT-10 by Jennifer Foster: “cba to check the spelling”: Investigating Parser Performance on Discussion Forum Posts. She found that she could improve automatic parser performance on web forum posts by systematically introducing noise into WSJ training data. But the nature of this noise was carefully selected and specific to the corpus she was investigating. You may also want to look into automatic sentence segmentation tools and partial parsing techniques, which are more robust to sentence fragments. I'm hardly an expert on these, so hopefully someone here will follow up with some pointers.
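
The general idea of systematically introducing noise into clean training text can be sketched as rule-based rewriting. The rules below are hypothetical placeholders and are not the transformations used in the paper, which were chosen for that specific corpus; the name `add_noise` is also an assumption.

```python
import re

# Hypothetical noise rules, applied in order. A real rule set would be
# derived from the errors observed in the target corpus.
NOISE_RULES = [
    (re.compile(r"\byou\b", re.IGNORECASE), "u"),  # texting-style abbreviation
    (re.compile(r"'"), ""),                        # drop apostrophes: "don't" -> "dont"
    (re.compile(r"[.!?]+$"), ""),                  # drop sentence-final punctuation
]


def add_noise(sentence, rules=NOISE_RULES, lowercase=True):
    """Apply deterministic noise rules to one clean sentence."""
    noisy = sentence.lower() if lowercase else sentence
    for pattern, replacement in rules:
        noisy = pattern.sub(replacement, noisy)
    return noisy


noisy = add_noise("You don't know.")
```

For parser training the same rewrites would have to be applied to the tokens of the treebank trees so that the annotations stay aligned, which this string-level sketch does not attempt.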

answered Nov 15 '10 at 17:52

Andrew Rosenberg

edited Nov 15 '10 at 23:28

Andrew, I am trying to build a system which makes use of the syntactic properties of text in the two classes. I guess I understand your point of view of using such taggers. I will soon try to incorporate POS tag related information. I guess that's the least I can do. I am still very unsure about dependency grammars. Unfortunately, Jennifer Foster's method may not be feasible for my task at hand.

(Nov 16 '10 at 07:20) mcenley

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.