|
I am searching for methodologies for applying text analytics to financial crime detection, using UIMA combined with correlation and deviation statistical analysis. This is a call to anybody who has
This question is marked "community wiki".
|
|
So for (1), de-identification is usually done by using an entity-type tagger to identify all people, places, organizations, and date/time expressions and replacing them with generic tokens (e.g., PERSON). Sometimes a de-identifier will try to replace a person's name consistently with a new one (this requires solving coreference, so that every mention of the same person gets the same replacement). You should check out the biomedical domain for de-identifier papers, since they have to do this a lot.
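To make the replacement step concrete, here is a minimal sketch in Python. It assumes the entity tagger has already run and handed back character offsets; the `(start, end, type)` tuple format and the `deidentify` function are made up for illustration and are not part of UIMA or any particular tagger. The "consistent" mode below uses exact string matching as a crude stand-in for real coreference resolution.

```python
def deidentify(text, entities, consistent=False):
    """Replace tagged entity spans with generic tokens.

    entities: list of (start, end, entity_type) character spans, assumed
    non-overlapping and produced by some upstream tagger (hypothetical format).
    consistent: if True, the same surface string always maps to the same
    numbered token (PERSON_1, PERSON_2, ...). This is only a string-match
    approximation of coreference, not a real coreference resolver.
    """
    ids = {}            # (entity_type, lowercased surface) -> running number
    replacements = []
    for start, end, etype in sorted(entities):
        if consistent:
            key = (etype, text[start:end].lower())
            if key not in ids:
                # Number entities per type, in order of first appearance.
                ids[key] = sum(1 for k in ids if k[0] == etype) + 1
            token = f"{etype}_{ids[key]}"
        else:
            token = etype
        replacements.append((start, end, token))

    # Apply replacements right to left so earlier offsets stay valid.
    for start, end, token in reversed(replacements):
        text = text[:start] + token + text[end:]
    return text
```

With `consistent=False` every person collapses to the same `PERSON` token; with `consistent=True` you keep track of who is who within a document, which matters if you later want to correlate activity per individual.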
This answer is marked "community wiki".
|
|
The Enron email dataset might be of interest to you. UIMA can be used to tag semantically similar words-of-interest in the email body text (e.g. Jeff Skilling's emails).
This answer is marked "community wiki".
|
I haven't done 1), 2), or 3), but if you give us more details about the task, we can give you some ideas.
I attended a talk on fraud detection by an auditing firm. Even they have a data sparsity problem. There just aren't enough positive examples, and you can't be sure the negative examples are really negative.
One possible approach would be to extract any numbers that appear in the text and tag them by type from the context. Then you could apply Benford's Law or some other statistical method.
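A rough sketch of the Benford's Law part, in Python. Benford's Law says the leading digit d of many naturally occurring numbers appears with probability log10(1 + 1/d). The regular expression and the function names here are my own assumptions for illustration; it pulls every number out of the text and does not do the "tag them by type from the context" step, which you would want before comparing like with like.

```python
import math
import re
from collections import Counter

# Expected Benford probability of each leading digit d: log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Simple pattern for numbers such as 950, 1,200.50 or 0.034 (an assumption;
# real financial text will need something more careful).
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def leading_digits(text):
    """Return the first non-zero digit of every number found in text."""
    digits = []
    for match in NUMBER_RE.findall(text):
        stripped = match.replace(",", "").replace(".", "").lstrip("0")
        if stripped:  # skip numbers that are all zeros
            digits.append(int(stripped[0]))
    return digits

def benford_chi_square(digits):
    """Pearson chi-square statistic of observed leading digits vs. Benford."""
    n = len(digits)
    if n == 0:
        raise ValueError("no digits to test")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD.items()
    )
```

The statistic has 8 degrees of freedom, so a value above roughly 15.5 would be a deviation at the 5% level. Treat that only as a flag for a closer look: the test needs a reasonably large sample, and Benford's Law only tends to hold for numbers spanning several orders of magnitude (amounts, not phone numbers or dates).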