I am working on a classification task: building models that detect the type of entity present in a span of text (i.e., an annotation). The models are trained on a dataset in which each instance is represented by three independent text variables:

  • pre-context: document text before the annotation.
  • annotation: span of the document where we want to detect the entity type. If no entity exists, all the entity type columns (isOrganization, isPerson, isTime) are marked 0.
  • post-context: document text after the annotation.

Data Set 1: Entity type classification in spans of text.

preContext  | annotation       | postContext | isOrganization | isPerson | isTime 
....        | on July 12, 2011 | ....        | 0              | 0        | 1 
With over 8 | million invested | in Chrysler | 0              | 0        | 0
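
As a rough illustration of the layout above, one row of Data Set 1 could be held in a small record and turned into bag-of-words counts where each token is prefixed with the field it came from, so that the three text variables stay distinct. The names SpanInstance and field_features, the "pre"/"ann"/"post" prefixes, and the whitespace tokenization are assumptions made for this sketch, not part of the original setup.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SpanInstance:
    """One row of Data Set 1 (hypothetical names mirroring the table columns)."""
    pre_context: str
    annotation: str
    post_context: str
    is_organization: int = 0
    is_person: int = 0
    is_time: int = 0


def field_features(inst: SpanInstance) -> Counter:
    """Count lowercased tokens, prefixing each with the field it came from."""
    feats = Counter()
    fields = {
        "pre": inst.pre_context,
        "ann": inst.annotation,
        "post": inst.post_context,
    }
    for name, text in fields.items():
        # Whitespace splitting is an assumption; any tokenizer could be used here.
        for tok in text.lower().split():
            feats[f"{name}={tok}"] += 1
    return feats


# Second row of the table: no entity, so all three labels keep their default of 0.
row = SpanInstance("With over 8", "million invested", "in Chrysler")
features = field_features(row)
```

With this encoding, a token such as "million" inside the annotation becomes a different feature from the same token in the surrounding context, which is one simple way to respect the independence of the three variables.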

Data Set 2: Boundary detection - "start-of-entity"

In the first example, the transition between preContext and text marks the start of an organization-type entity. In the second example, there is no entity present at the transition between preContext and text, therefore all of the dependent variable columns are marked as zero.

preContext          | text                                                                                       | isStartOfOrganization | isStartOfPerson | isStartOfTime
Private equity firm | Westbridge Capital could exit part or all of its stake in Hyderabad-based technology firm. | 1                     | 0               | 0

I have been using basic NLP techniques such as TF-IDF, n-grams, tokenizers, stemmers, POS taggers, and stop lists for the problems above. What I would now like to do is experiment with techniques other than the ones I have already tried, but I have not been able to find any suitable ones. It seems the only way to make significant further gains is to start thinking outside the box. Could you please suggest some new techniques for solving the problems above?

asked May 08 '13 at 05:23

kishore

edited May 09 '13 at 00:56

This seems like named entity recognition. CRFs are popular for that, with various feature sets usually comprising things like "previous word is X" for all words X.

(May 08 '13 at 05:56) larsmans
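
To make the comment above concrete, here is a minimal sketch of the kind of per-token feature dictionary often fed to a CRF, including a "previous word is X" feature. The function names, the feature keys, and the choice of a one-token window are assumptions for illustration; no particular CRF library is implied.

```python
def token_features(tokens, i):
    """Build features for tokens[i], including the neighbouring words."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalization often hints at names
        "word.isdigit": word.isdigit(),
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()  # "previous word is X"
    else:
        feats["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sequence
    return feats


def sentence_features(tokens):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Tokens taken from the Data Set 2 example (whitespace tokenization assumed).
tokens = "Private equity firm Westbridge Capital could exit".split()
X = sentence_features(tokens)
```

In a sequence-labeling setup, each token would also get a tag (for example a start-of-organization marker on "Westbridge"), and the model would learn weights over these features together with transitions between adjacent tags.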

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.