|
Hi everyone,

I am working on a project that searches a large database of radiology reports to identify all reports that have a specific finding. We are initially focused on the adrenal gland and want to select all reports that mention adrenal mass(es), nodule(s), or lesion(s).

Three things make this difficult. First, each radiologist uses a different set of phrases when dictating reports. Second, they often phrase negative findings in ways like: “No distinct adrenal nodules are seen”, “There are no suspicious adrenal gland masses”, and “without a discrete nodule”. Third, a lot of the time there is a finding of mass/lesion/nodule in the sentence before the "adrenal" sentence, for example: "....the liver contains a large mass. The adrenal glands are normal..."

My initial approach was to use proximity searches combining adrenal + mass/nodule/lesion. This gave decent results, but I know there are better methods (and this seems like a great place to find some).

Another way to ask the question: once you have a database of reports that contain the word "adrenal", how would you separate them into the following three categories?

Positive/True findings:
“A right adrenal nodule measures 1.5 x 1.1 cm”
“A stable nodule in left adrenal gland which is unchanged for almost 2 years is likely benign.”
“Bilateral adrenal nodules are noted. The nodule on the left measures 1.2 x 1.2 cm. The lesion on the right measures 1.4 x 1.6 cm.”

Negative findings:
“No distinct adrenal nodules are seen”
“There are no suspicious adrenal gland masses.”
“There is persistent mild thickening of the left adrenal gland without a discrete nodule.”
“There is mild thickening of the adrenal glands bilaterally without focal mass”

False positive findings (lesion/mass/nodule in the sentence before or after the adrenal sentence):
“likely occluded by the adjacent tumor mass. The left adrenal gland is slightly thickened but otherwise normal.”
“The liver is otherwise unremarkable with no evidence of laceration or focal mass. The gallbladder pancreas adrenal glands and spleen are unremarkable.”
“The gallbladder spleen pancreas and adrenal glands are unremarkable. There are multiple hypodense lesions which are too small to characterize in the bilateral kidneys”

Thanks for your help! |
|
Dependency parses will allow you to identify the sentences you're looking for with high precision, because they are exactly those in which "adrenal" is an adjective modifying mass(es), nodule(s), or lesion(s). You could download the Stanford Parser (very well documented, in Java), which will give you dependency parses of sentences, and much more besides (the nltk book's chapter 8 has a decent explanation of what parsing is). You would then dependency-parse each sentence in your reports and just choose those in which the appropriate dependency relationship ("amod", adjectival modifier) exists between "adrenal" and nodule(s), lesion(s), or mass(es). How much data, time and computational power do you have? This approach is very exact, but will take a lot more time than just a keyword search. The parser does about 5 sentences a second on a 4-core, 16 GB machine, but you only need to parse the collection once. More generally, you might be interested in different approaches to extracting relationships and entities from text; try reading some information extraction papers from the Etzioni clan to get an idea of what's possible. |
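To make the "amod" filter concrete, here is a minimal Python sketch. It does not call the Stanford Parser; it assumes each sentence has already been parsed and converted into (relation, head, dependent) triples, which is a layout chosen for illustration and not the parser's native output. The names `FINDING_HEADS`, `NEGATION_RELATIONS`, and `classify` are hypothetical. The negation check is an added assumption on top of the answer: a negated sentence such as “No distinct adrenal nodules are seen” contains the same amod relation, and depending on the parser version the "no" may be labelled `det` or `neg`.

```python
from typing import Iterable, Tuple

# (relation, head_word, dependent_word); an assumed layout for illustration,
# not the Stanford Parser's native output format.
Dependency = Tuple[str, str, str]

FINDING_HEADS = {"mass", "masses", "nodule", "nodules", "lesion", "lesions"}
# Assumption: a negating "no" attaches to the finding noun as det or neg.
NEGATION_RELATIONS = {"det", "neg"}
NEGATION_WORDS = {"no"}


def classify(deps: Iterable[Dependency]) -> str:
    """Return 'positive', 'negative', or 'other' for one parsed sentence."""
    deps = list(deps)
    # Finding nouns that "adrenal" modifies as an adjective (amod).
    heads = {
        head.lower()
        for relation, head, dependent in deps
        if relation == "amod"
        and dependent.lower() == "adrenal"
        and head.lower() in FINDING_HEADS
    }
    if not heads:
        return "other"
    negated = any(
        relation in NEGATION_RELATIONS
        and dependent.lower() in NEGATION_WORDS
        and head.lower() in heads
        for relation, head, dependent in deps
    )
    return "negative" if negated else "positive"


# Hand-written triples for three sentences from the question (not parser output).
right_nodule = [("det", "nodule", "A"), ("amod", "nodule", "right"),
                ("amod", "nodule", "adrenal"), ("nsubj", "measures", "nodule")]
no_nodules = [("det", "nodules", "No"), ("amod", "nodules", "distinct"),
              ("amod", "nodules", "adrenal"), ("nsubjpass", "seen", "nodules")]
normal_glands = [("det", "glands", "The"), ("amod", "glands", "adrenal"),
                 ("nsubj", "normal", "glands"), ("cop", "normal", "are")]

for sentence in (right_nodule, no_nodules, normal_glands):
    print(classify(sentence))
```

Note that a sentence like “A stable nodule in left adrenal gland ...” has "adrenal" modifying "gland" rather than "nodule", so this exact check would miss it; a real rule set would probably need a few more relations than amod alone.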
|
Adding to aditi's answer, I'd suggest using a CRF on the dependency tree, much in the style of this paper. Or maybe even scrap the dependency parse altogether and just use a CRF with BIO encoding to divide the mentions into the classes you want.
I think I know the paper you mean - it had to do with sentiment classification using dependency trees and CRFs. [pdf] The use of CRFs seemed like overkill there, especially because it was just to handle polarity reversal. I didn't quite believe that their few-percent gains from using the CRF were worth the extra effort. CRFs with BIO encoding would also require a ton of manual annotation, so that's probably not the best way to go.
(Jul 08 '10 at 21:03)
aditi
Yes, true enough. Although it's not completely clear to me that a couple of simple heuristics in a dependency parse can solve the problem. Then again, I've never really studied information extraction.
(Jul 08 '10 at 21:28)
Alexandre Passos ♦
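For readers unfamiliar with the BIO encoding mentioned above, here is a minimal Python sketch of what the manual annotation would have to produce before any CRF is trained. It only converts labelled token spans into per-token tags; it does not train or run a CRF. The function name `bio_encode` and the labels `POS` and `NEG` are hypothetical, chosen to mirror the positive/negative categories in the question.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label); an assumed annotation format.
Span = Tuple[int, int, str]


def bio_encode(tokens: List[str], spans: List[Span]) -> List[str]:
    """Tag each token as B-<label> (begin), I-<label> (inside), or O (outside)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is outside the sentence")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# "adrenal nodules" (tokens 2-3) is annotated as a negated mention.
tokens = "No distinct adrenal nodules are seen".split()
tags = bio_encode(tokens, [(2, 4, "NEG")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Every sentence in the training set needs spans like `(2, 4, "NEG")` marked by hand, which is the annotation cost raised in the comments.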
|
I would encourage you to put more content in your title: "Finding adrenal mass diagnosis in free-form medical reports", perhaps, rather than "how would you approach this problem?".
Sorry about that - I didn't know what to title it.