I want to detect poorly written English text. Let's say poorly written text is anything that would lead a native speaker to believe it was written either hastily or by someone who isn't fluent. Among other causes, this might be because the text lacks capitalization, is misspelled, or is ungrammatical. Are there any off-the-shelf tools to do this? I guess detecting poor grammar is the hardest part.

Heuristically, my first stab might be to build one language model for English Wikipedia and another for Twitter: to the degree that a piece of text looks like it came from Twitter rather than Wikipedia, it is badly written.

I'd like to do something like this, where researchers found that well-written text increases conversions on Amazon and TripAdvisor, but instead look for a correlation between well-written text and getting answers on Q&A sites like MetaOptimize and Stack Overflow. This question is a slight generalization of this question.
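To make the two-language-model idea concrete, here is a rough, self-contained sketch of what I have in mind: a character n-gram model with add-one smoothing trained per corpus, scoring new text by which model assigns it lower perplexity. The tiny training strings are placeholders for real Wikipedia and Twitter dumps, and in practice a proper LM toolkit would replace this toy class.

```python
# Toy sketch of the "Wikipedia LM vs. Twitter LM" heuristic: train a character
# n-gram model with add-one smoothing on each corpus, then score a candidate
# text by which model gives it lower perplexity. Training strings are stand-ins
# for real corpora.
import math
from collections import Counter, defaultdict

class CharNgramLM:
    def __init__(self, n=3):
        self.n = n
        self.context_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, text):
        padded = " " * (self.n - 1) + text
        self.vocab.update(padded)
        for i in range(len(text)):
            context = padded[i:i + self.n - 1]
            char = padded[i + self.n - 1]
            self.context_counts[context][char] += 1

    def log_prob(self, text):
        # Add-one (Laplace) smoothed log-probability of the text.
        padded = " " * (self.n - 1) + text
        V = len(self.vocab) + 1  # +1 to leave mass for unseen characters
        logp = 0.0
        for i in range(len(text)):
            context = padded[i:i + self.n - 1]
            char = padded[i + self.n - 1]
            counts = self.context_counts[context]
            logp += math.log((counts[char] + 1) / (sum(counts.values()) + V))
        return logp

    def perplexity(self, text):
        return math.exp(-self.log_prob(text) / max(len(text), 1))

wiki_lm, twitter_lm = CharNgramLM(), CharNgramLM()
wiki_lm.fit("The cat sat on the mat. It was a sunny day in the park.")
twitter_lm.fit("omg lol the cat iz on teh mat!!! so sunny 2day")

candidate = "teh weather iz nice lol"
score = twitter_lm.perplexity(candidate) - wiki_lm.perplexity(candidate)
print("looks more like Twitter" if score < 0 else "looks more like Wikipedia")
```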
What you are basically asking is how to handle situations where you have a significantly noisy oracle. Unfortunately, this noise is created by an actual bias (the writers' misunderstanding of the language), rather than just accidental mistakes. From what I remember reading recently about crowdsourcing, off the top of my head there are two things you could do:
I doubt this helps much for your specific issue, but it might foster some decent ideas. There has been a lot of good recent work on imperfect oracles in crowdsourcing that might be worth reading.
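For a flavor of the simplest approaches in that literature, here is a toy sketch of agreement-weighted majority voting, a common baseline for aggregating labels from noisy annotators; the annotators, items, and labels below are made up.

```python
# Baseline for a noisy oracle: weight each annotator by how often they agree
# with the plain majority vote, then recompute labels with those weights.
# All data below is illustrative toy data.
from collections import Counter, defaultdict

# item -> {annotator: label}
annotations = {
    "q1": {"ann_a": "bad", "ann_b": "bad", "ann_c": "good"},
    "q2": {"ann_a": "good", "ann_b": "good", "ann_c": "good"},
    "q3": {"ann_a": "bad", "ann_b": "good", "ann_c": "good"},
}

# Step 1: unweighted majority vote per item.
majority = {item: Counter(votes.values()).most_common(1)[0][0]
            for item, votes in annotations.items()}

# Step 2: annotator reliability = fraction of agreement with the majority.
agree, total = defaultdict(int), defaultdict(int)
for item, votes in annotations.items():
    for ann, label in votes.items():
        total[ann] += 1
        agree[ann] += int(label == majority[item])
reliability = {ann: agree[ann] / total[ann] for ann in total}

# Step 3: reliability-weighted vote.
weighted = {}
for item, votes in annotations.items():
    scores = defaultdict(float)
    for ann, label in votes.items():
        scores[label] += reliability[ann]
    weighted[item] = max(scores, key=scores.get)

print(reliability)
print(weighted)
```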
A formal way to address some of your problems could be the following:
I think the first question is whether the text is known to belong to a specific domain. If it is, then you can probably label some data and use word features, stopword n-grams, lists of common misspellings (Wikipedia has a pretty nice one), and counts of words in domain-specific lexicons to train a linear model that should generalize well; that problem should be fairly easy.

If you want to do this for the whole web, it's much harder. One problem is shifting goalposts: a badly written Wikipedia page is often much better written than a well-written blog post, never mind tweets. Another problem is lack of data: not only will any reasonably sized labeled dataset of the web be too small to be representative, but getting the correct proportions of all the kinds of text you see out there is really hard. In that case I'd first isolate a few domains that cover a large fraction of the text out there (Wikipedia, blogs, tweets, etc.), learn a domain-specific approach for each, and then combine them (with weights coming from a domain classifier or something). That said, some simple heuristics, like counting common misspellings, odd punctuation/capitalization usage, and known-bad stopword n-grams, might already get you good enough accuracy.
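As a rough sketch of the first (in-domain) route, here is a toy scikit-learn pipeline over word and character n-grams; the four example texts, their 0/1 "well written" labels, and the particular vectorizer settings are placeholders, and real use would add the misspelling and lexicon counts mentioned above as extra features.

```python
# Toy linear model for in-domain "well written vs. poorly written" classification.
# The dataset and labels are made-up placeholders for real annotated data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union

texts = [
    "The committee reviewed the proposal and approved it unanimously.",
    "We describe a method for detecting poorly written English text.",
    "i dunno wat u mean lol this dont make sense",
    "plz help me asap its urgent!!! thx",
]
labels = [1, 1, 0, 0]  # 1 = well written, 0 = poorly written

features = make_union(
    # Word n-grams capture vocabulary and common stopword patterns.
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    # Character n-grams (case preserved) catch misspellings and odd capitalization.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=False),
)
model = make_pipeline(features, LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["teh results was good and we r happy"]))
print(model.predict(["The results were encouraging, and we are pleased."]))
```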
Text written hastily by an experienced writer will have very little in common with text from a non-fluent speaker. Looking at the kinds and amount of punctuation, the complexity of the words themselves, and the length and number of sentences and paragraphs would go a long way towards categorizing text without digging into grammatical structure, and will likely be a more accurate measure of facility with the language. Checking the starts of sentences for active-voice writing might also indicate a better writer. All that said, I look forward to hearing any approaches to grammar ranking.
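For what it's worth, here is a minimal sketch of some of those surface features; the exact feature set, character classes, and sentence splitting are just illustrative, and active-voice detection is left out since it needs more than surface cues.

```python
# Simple surface features (no grammatical analysis): punctuation usage,
# word complexity, and sentence/paragraph counts and lengths.
import re
import statistics

def surface_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "num_sentences": len(sentences),
        "num_paragraphs": len(paragraphs),
        "avg_sentence_len": statistics.mean(len(s.split()) for s in sentences) if sentences else 0,
        "avg_word_len": statistics.mean(len(w) for w in words) if words else 0,
        "punctuation_ratio": sum(c in ",;:()\"'-" for c in text) / max(len(text), 1),
        "capitalized_sentence_starts": sum(s.strip()[0].isupper() for s in sentences) / max(len(sentences), 1),
    }

print(surface_features("i dunno wat u mean lol this dont make sense"))
print(surface_features("The committee reviewed the proposal. After a brief discussion, it was approved."))
```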