I want to detect poorly written English text. Let's say poorly written text is anything that would lead a native speaker to believe that the text was written either hastily or by someone who wasn't fluent. Among other causes, this might be because it lacks capitalization, is misspelled, or is ungrammatical.

Are there any off-the-shelf tools to do this? I guess detecting poor grammar is the hardest part.

Heuristically, my first stab at this might be to build a language model for English Wikipedia and another for Twitter. To the degree that text looks like it came from Twitter rather than Wikipedia, it is bad.
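A cheap way to prototype this is to score text under two simple unigram models and compare. Below is a minimal sketch; the toy corpora, whitespace tokenization, and add-one smoothing are all placeholder assumptions, and a real version would train proper n-gram models on actual Wikipedia and Twitter dumps.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, vocab):
    # add-one smoothed unigram probabilities over a shared vocabulary
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def log_prob(tokens, model, vocab):
    floor = 1.0 / (len(vocab) + 1)  # probability assigned to unseen words
    return sum(math.log(model.get(t, floor)) for t in tokens)

def badness_score(text, wiki_model, twitter_model, vocab):
    """Positive score => the text looks more like Twitter than Wikipedia."""
    tokens = text.lower().split()
    return (log_prob(tokens, twitter_model, vocab)
            - log_prob(tokens, wiki_model, vocab)) / max(len(tokens), 1)

# hypothetical toy corpora standing in for real dumps
wiki_tokens = "the algorithm was proposed in the original paper".split()
tweet_tokens = "lol this is so gr8 u should totally check it out".split()
vocab = set(wiki_tokens) | set(tweet_tokens)

wiki_lm = train_unigram(wiki_tokens, vocab)
twitter_lm = train_unigram(tweet_tokens, vocab)
print(badness_score("u should check the paper lol", wiki_lm, twitter_lm, vocab))
```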

I'd like to do something like this, where researchers found that well-written text increases conversions on Amazon and TripAdvisor, but instead try to find a correlation between well-written text and getting answers on QA sites like MetaOptimize and Stack Overflow.

This question is a slight generalization of this question.

asked Dec 26 '12 at 15:34


Rob Renaud

Hastily written text by an experienced writer will have very little in common with text from a non-fluent speaker. Looking at the kinds and amount of punctuation, the complexity of the words themselves, and the length and number of sentences and paragraphs would go a long way towards categorizing text without the need for digging into the grammatical structure, and will likely be a more accurate measure of facility with the language. Looking at the start of sentences for active-voice writing might also suggest a better writer. All that said, I look forward to hearing any approaches to grammar ranking.

(Dec 26 '12 at 16:01) Casey Basichis

3 Answers:

What you are basically asking is how to handle situations where you have a significantly noisy oracle. Unfortunately, this noise is created by an actual bias (their misunderstanding of the language), rather than just accidental mistakes.

From what I remember reading recently about crowdsourcing, off the top of my head there are two things you could do:

  1. Change the query. Basically, you either adjust how you are asking them or change the format you request their answer to be in.

  2. Learn their bias (build a profile). It may be possible to learn their specific bias with enough samples, especially if you have answers given by many other people in the crowd. This is most useful when there are general types of bias, such as "Polish Native Speaker Level 3". Or at least you can perhaps learn when to ignore their answers; a rough sketch of that idea follows below.
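A minimal sketch of the second option, assuming binary labels and a made-up data layout: each annotator's reliability is estimated from agreement with the crowd majority, which is a crude stand-in for a real bias model (e.g. Dawid-Skene style estimation).

```python
from collections import defaultdict

# hypothetical data layout: labels[item][annotator] = 0 or 1
labels = {
    "item1": {"ann_a": 1, "ann_b": 1, "ann_c": 0},
    "item2": {"ann_a": 0, "ann_b": 0, "ann_c": 0},
    "item3": {"ann_a": 1, "ann_b": 0, "ann_c": 0},
}

def majority_vote(votes):
    return int(sum(votes.values()) > len(votes) / 2)

# estimate per-annotator accuracy against the majority label
agreement = defaultdict(lambda: [0, 0])   # annotator -> [agreements, total]
for item, votes in labels.items():
    maj = majority_vote(votes)
    for ann, lab in votes.items():
        agreement[ann][0] += int(lab == maj)
        agreement[ann][1] += 1

reliability = {ann: ok / tot for ann, (ok, tot) in agreement.items()}
print(reliability)  # annotators near or below chance can be ignored or re-queried
```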

I doubt this helps much for your specific issue, but it might foster some decent ideas. There has been a lot of good work done recently on imperfect oracles in crowdsourcing that might be worth reading.

answered Dec 30 '12 at 01:19


Daniel E Margolis

A formal way to address some of your problems could be the following:

  1. Poor spelling is really easy to detect: as long as you can compute a distance metric between each word and the closest dictionary entry, and that difference reliably flags misspellings, you can aggregate over the text and measure how good the overall spelling is. The problem with this approach is that even native speakers tend to spell poorly in online reviews; just check any YouTube video's comment section.

  2. To detect poor grammar, I often find that looking for specific cases, such as the misuse of "to", "the", and "a", is a good indicator, since these are details non-native speakers tend to get wrong. This is especially true for speakers of Asian languages that either do not use articles or use the same word for all of them. A rough sketch of both heuristics follows below.
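Here is a minimal sketch of both heuristics using only the standard library; the tiny word list, the tokenizer, and the 0.8 similarity cutoff are all assumptions, and a real system would load a full dictionary.

```python
import difflib

# hypothetical tiny word list; a real system would load a full dictionary
DICTIONARY = ["the", "a", "an", "to", "i", "watched", "movie", "yesterday",
              "was", "really", "good", "and", "it", "made", "me", "laugh"]

def tokens_of(text):
    return [t.strip(".,!?").lower() for t in text.split() if t.strip(".,!?")]

def spelling_score(text):
    """Return (fraction of correctly spelled tokens, likely misspellings).

    A token counts as misspelled if it is out of vocabulary but close to a
    dictionary word, so unknown proper nouns are not penalized."""
    toks = tokens_of(text)
    misspelled = [t for t in toks
                  if t not in DICTIONARY
                  and difflib.get_close_matches(t, DICTIONARY, n=1, cutoff=0.8)]
    return 1.0 - len(misspelled) / max(len(toks), 1), misspelled

def article_density(text):
    """Crude proxy for article use; unusually low values can flag writers
    whose first language has no articles."""
    toks = tokens_of(text)
    return sum(t in {"a", "an", "the"} for t in toks) / max(len(toks), 1)

print(spelling_score("i watchd the moovie yesterdy"))               # low score, 3 misspellings
print(article_density("movie was really good and made me laugh"))  # 0.0, no articles
```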

answered Jan 03 '13 at 20:39


Leon Palafox ♦

I think the first question is whether the text is known to belong to a specific domain, because if it is, then you can probably label some data and use word features, stopword n-grams, lists of common misspellings (Wikipedia has a pretty nice one), and probably counts of words in domain-specific lexicons to train a linear model that is expected to generalize well. This problem should be fairly easy.
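A minimal sketch of that setup, assuming a small hand-labeled set of in-domain examples (the texts and labels below are invented) and using only plain word/bigram counts; misspelling counts and stopword n-grams could be appended as extra features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical labeled data: 1 = well written, 0 = poorly written
texts = [
    "The proposed method improves accuracy on the benchmark.",
    "this is gud method i think it work nice",
    "Results are reported with standard error bars.",
    "pls help me i dont no how run the code",
]
labels = [1, 0, 1, 0]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=False),  # keep capitalization cues
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["i has question about the the model"]))
```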

If you want to do this for the whole web then it's much harder. One problem is changing goal-posts: a badly written Wikipedia page is often much better written than a well-written blog post, never mind tweets. Another problem is lack of data: not only will any reasonably sized labeled dataset of the web be too small to be representative, but getting the correct proportions of all the kinds of text you see out there is really hard. If this is your problem, I'd first try to isolate a few domains that should cover a large fraction of the text out there (Wikipedia, blogs, tweets, etc.) and learn a domain-specific approach for each, which you can then average (with weights coming from a domain classifier or something). That said, some simple heuristics, like counting common misspellings, odd punctuation/capitalization, and some known-bad stopword n-grams, might get you good enough accuracy.
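A minimal sketch of the weighted-average idea; the per-domain scorers and the domain classifier below are trivial stand-ins, and real versions would be trained models like the one sketched above.

```python
def wiki_scorer(text):
    # hypothetical Wikipedia-style scorer: only rewards an initial capital
    return float(text[:1].isupper())

def tweet_scorer(text):
    # hypothetical Twitter-style scorer: only penalizes all-caps shouting
    return float(not text.isupper())

def domain_probs(text):
    # stand-in for a trained domain classifier: '#' or '@' hints at Twitter
    p_tweet = 0.8 if ("#" in text or "@" in text) else 0.2
    return {"wikipedia": 1.0 - p_tweet, "twitter": p_tweet}

SCORERS = {"wikipedia": wiki_scorer, "twitter": tweet_scorer}

def quality_score(text):
    # average the per-domain quality scores, weighted by P(domain | text)
    probs = domain_probs(text)
    return sum(probs[d] * scorer(text) for d, scorer in SCORERS.items())

print(quality_score("new paper out today, check it out @someone"))
```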

answered Jan 04 '13 at 07:05


Alexandre Passos ♦
