Suppose we have digitized document images. We pass these documents through OCR to get text documents, which we then classify by topic into classes such as payroll, various bills, information letters, and so on.

I wonder whether this classification of scanned documents is useful for any information extraction task?

asked Jan 29 '13 at 09:17

shn

edited Jan 29 '13 at 17:39


2 Answers:

For the general problem of classifying documents by topic, I can see many uses. One example is automatically suggesting tags for MetaOptimize posts. Another is Google News, except that there you also want to group by news story, not just by news topic (politics, football, etc.). It's also useful to classify queries by topic and use the classes to trigger topic-specific behavior in a search engine.
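To make the last use concrete, here is a minimal sketch of classifying a query by topic and using the class to trigger topic-specific behavior. Everything in it is an assumption made for illustration: the keyword lists, the `classify` and `route` helpers, and the action strings are hypothetical, and a real search engine would use a trained classifier rather than keyword counts.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real system would learn a classifier from labeled data.
TOPIC_KEYWORDS = {
    "politics": {"election", "senate", "president", "vote"},
    "football": {"football", "goal", "league", "match"},
}

# Hypothetical topic-specific behaviors, represented here as plain strings.
TOPIC_ACTIONS = {
    "politics": "show news results first",
    "football": "show live scores first",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text):
    """Return the topic whose keywords occur most often, or 'unknown'."""
    counts = Counter(tokenize(text))
    scores = {
        topic: sum(counts[word] for word in words)
        for topic, words in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route(query):
    """Classify a query and pick the behavior triggered by its topic."""
    topic = classify(query)
    return topic, TOPIC_ACTIONS.get(topic, "show default results")


if __name__ == "__main__":
    for query in ["football match tonight", "who won the senate vote", "weather tomorrow"]:
        print(query, "->", route(query))
```

The same pattern (predict a class, then look up what to do for that class) applies to any setting where the downstream behavior depends on the topic.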

I'm not familiar with your specific situation, so I don't know what it could be useful for there.

answered Jan 29 '13 at 20:04

Alexandre Passos ♦

This is going to be long.

The way humans recollect data is nothing if not noisy. For example, politicians' speeches are usually full of rhetoric that does not help at all, so if I asked you what the main topic of Barack Obama's speeches was, you might not know, especially if you do not have the time to hear or read every speech yourself.

The same goes for every other speech or document that has not been correctly tagged. And even those tags may be colored by people's feelings and subjectivity.

Machine learning approaches offer a more quantitative way to do this tagging. With a precisely tagged set that describes speeches, you can basically create a web page where you can search for speeches on the topics that concern you, like immigration or the economy.

Furthermore, there is a whole variety of published books whose only tag is "fiction" or "non-fiction", when the book could be classified in a much richer way.

And once you do this, you can make product suggestions by recommending things that fall into the same category.

A quick example: if you know someone who read Jurassic Park and Brave New World and watched Gattaca, you might be tempted to suggest another random book or film by the same authors or directors. But if you have topics, you would see that all three works are related to genetic manipulation, so maybe you could suggest something in that vein.
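The suggestion step could be sketched roughly as follows. This is only an illustration: the catalog, its topic tags, and the two extra titles are assumptions added to make the example runnable, and in practice the tags would come from a topic model or classifier rather than being written by hand.

```python
# Hypothetical catalog: title -> (creator, topic tags). The tags are illustrative
# assumptions, standing in for the output of a topic classifier.
CATALOG = {
    "Jurassic Park": ("Michael Crichton", {"genetic manipulation", "dinosaurs"}),
    "Brave New World": ("Aldous Huxley", {"genetic manipulation", "dystopia"}),
    "Gattaca": ("Andrew Niccol", {"genetic manipulation", "dystopia"}),
    "Oryx and Crake": ("Margaret Atwood", {"genetic manipulation", "dystopia"}),
    "Congo": ("Michael Crichton", {"adventure"}),
}


def shared_topics(history):
    """Return the topics that every item in the history has in common."""
    tag_sets = [CATALOG[title][1] for title in history]
    return set.intersection(*tag_sets) if tag_sets else set()


def recommend(history):
    """Return unseen titles sharing a topic common to the whole history."""
    common = shared_topics(history)
    return sorted(
        title
        for title, (_creator, tags) in CATALOG.items()
        if title not in history and tags & common
    )


if __name__ == "__main__":
    seen = ["Jurassic Park", "Brave New World", "Gattaca"]
    print(shared_topics(seen))
    print(recommend(seen))
```

With this toy catalog, recommending by shared topic picks a title on genetic manipulation by a different author, while a same-author rule would have picked the unrelated Congo.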

answered Jan 30 '13 at 02:25

Leon Palafox ♦

What about the specific case of scanned documents, for information extraction tasks?

(Jan 30 '13 at 04:18) shn

In this day and age, there is probably no reason to scan documents, as pretty much everything already exists in some electronic version.

I could think of scanning old yearly reports and extracting the topics; that way you can get a broader picture of a company's history. Of course, that would only be applicable to companies with more than 40 years of history.

(Jan 30 '13 at 05:42) Leon Palafox ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.