
Given a document (say a scientific paper) and a set of other source documents that you suspect influenced it (the set of papers it cites), what NLP and Machine Learning techniques exist that can give an idea of how much each source influenced the main document?

I thought of using term frequencies and finding a linear combination that works best, but this seems pretty crude.
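To make that concrete, here is a minimal sketch of the kind of baseline I mean (scikit-learn is just one way to do it, and all the names are placeholders):

    # A minimal sketch of the term-frequency baseline: score each cited source
    # by tf-idf cosine similarity to the target paper. Assumes every document
    # is available as a plain-text string.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def source_similarities(target_text, source_texts):
        """Return (source_index, similarity) pairs, most similar first."""
        vectorizer = TfidfVectorizer(stop_words="english")
        matrix = vectorizer.fit_transform([target_text] + list(source_texts))
        sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
        return sorted(enumerate(sims), key=lambda p: p[1], reverse=True)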

What Machine Learning and NLP techniques would work for this problem?

asked Mar 29 '11 at 19:31


Jacob Jensen


3 Answers:

LSA or LDA might be a better choice than straight term frequencies for measuring article similarity. I'd also recommend using noun phrases to deal with jargon: long n-grams can leave you overly sensitive to wording coincidences, but a lot of scholarly topics are strings of nouns/adjectives ("Hierarchical Hidden Markov Model").
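As a rough sketch of the LSA route (assuming plain-text documents; the dimensionality, bigram range and names are arbitrary placeholders, and an LDA model or a noun-phrase tokenizer could be dropped in the same way):

    # Fold the tf-idf vectors into a low-dimensional latent space before
    # comparing the target paper to its cited sources.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    def lsa_similarities(target_text, source_texts, n_components=100):
        tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
        X = tfidf.fit_transform([target_text] + list(source_texts))
        k = max(2, min(n_components, X.shape[0] - 1, X.shape[1] - 1))
        Z = TruncatedSVD(n_components=k).fit_transform(X)  # docs in latent space
        sims = cosine_similarity(Z[:1], Z[1:]).ravel()
        return sorted(enumerate(sims), key=lambda p: p[1], reverse=True)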

The citation tree will give information about indirect influence. If the target paper cites 11 articles, 10 of which also cite the 11th, that implies the 11th was a seminal paper in the field and had a second-order impact on the target paper, shaping how the rest of the research flowed. I suspect you'll see better results if you classify the influencers of each article in temporal sequence, and try to model the flow of influence through intermediate documents.
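A minimal sketch of that second-order idea, assuming you have a map from each paper id to the set of ids it cites (all names are placeholders):

    # Among the papers the target cites, count how often each one is itself
    # cited by the others. A reference that most of its fellow references also
    # cite is likely the seminal, indirectly influential one.
    def indirect_influence(target_id, citations):
        cited = citations.get(target_id, set())
        scores = {c: sum(1 for other in cited
                         if other != c and c in citations.get(other, set()))
                  for c in cited}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)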

Finally, there's going to be an issue with domain sensitivity. For example, articles that use a specific test set are likely to use similar language even if they approach the problem divergently. Or consider a paper that applies a technique from a neighboring field to a common problem: the biggest influence may be a paper in that neighboring field, but the language may be more in line with the primary field. A simple hack might be to give additional weight (or ideally, make that a learned feature in a model) to "unexpected" citations that connect disparate sections of the citation network. Or even better, intersecting all the citers of an article can tell you what specific topics that article tends to add to its citers, especially if you cluster articles in your citation network: what phrases do papers that cite X tend to use that similar papers that don't cite X don't use? The centrality of those phrases to your target article then suggests whether X really influenced the writer, or whether the author was just rounding out the citation list.
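A rough sketch of that last idea, assuming you have tokenized texts for papers that do and don't cite X; the names and the crude smoothing are placeholders rather than a tuned method:

    # Find terms overrepresented among citers of X, then measure how central
    # those terms are to the target paper.
    from collections import Counter

    def phrases_added_by(citer_texts, nonciter_texts, top_k=50):
        """Terms far more frequent among citers of X than among non-citers."""
        citers = Counter(t for doc in citer_texts for t in doc)
        others = Counter(t for doc in nonciter_texts for t in doc)
        n_c, n_o = sum(citers.values()) or 1, sum(others.values()) or 1
        ratio = {t: (citers[t] / n_c) / ((others[t] + 1) / n_o) for t in citers}
        return sorted(ratio, key=ratio.get, reverse=True)[:top_k]

    def influence_score(target_tokens, distinctive_phrases):
        """How central X's characteristic phrases are to the target paper."""
        counts = Counter(target_tokens)
        return sum(counts[p] for p in distinctive_phrases) / (sum(counts.values()) or 1)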

answered Mar 30 '11 at 12:34


Paul Barba

Seems like a good answer. At this stage, I'm thinking an iterative approach would be best, progressing through stages: 1) use citation data only; 2) add tf-idf similarity data, with unigrams, bigrams or noun phrases; 3) use a moderately more sophisticated NLP technique, e.g. LDA, on the features of (2); 4) use everything in (3) but with some parameter learning; 5) do everything in (4) plus explore more advanced NLP, biological (gene-analogy) and ML techniques.
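As a very rough sketch of what the parameter learning in stage 4 could look like, assuming one feature vector per citation from the earlier stages and a handful of hand-labeled examples of influential vs. incidental citations (all names are placeholders):

    # Stack the per-source features (tf-idf similarity, co-citation count,
    # topic similarity, ...) and learn their weights from labeled examples.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_influence_model(feature_rows, labels):
        """feature_rows: one feature vector per citation; labels: 1 = influential."""
        model = LogisticRegression()
        model.fit(np.asarray(feature_rows), np.asarray(labels))
        return model

    def rank_sources(model, feature_rows):
        probs = model.predict_proba(np.asarray(feature_rows))[:, 1]
        return sorted(enumerate(probs), key=lambda p: p[1], reverse=True)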

(Mar 31 '11 at 00:47) Jacob Jensen

There is a topic model approach by Sean Gerrish and David Blei, "A Language-based Approach to Measuring Scholarly Impact", that solves the more general problem of finding the influential documents for a whole corpus rather than for a single document, but you can probably solve your problem with a variant of it.

Just as a note: you don't just want to find documents that share the words of the query document; you want to find the first documents that introduced the important words of the query document.
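A minimal sketch of that note, assuming each cited source comes with a year and a token set and that the query's important terms have already been selected (e.g. by tf-idf); all names are placeholders:

    # Credit the earliest cited source that used each of the query's important
    # terms, rather than any source that merely shares words with the query.
    def first_introducers(important_terms, sources):
        credit = {}
        for term in important_terms:
            users = [(year, i) for i, (year, tokens) in enumerate(sources)
                     if term in tokens]
            if users:
                _, first = min(users)  # earliest source that used the term
                credit.setdefault(first, []).append(term)
        # sources that introduced more of the query's important terms rank higher
        return sorted(credit.items(), key=lambda kv: len(kv[1]), reverse=True)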

answered Mar 29 '11 at 19:53


Alexandre Passos ♦


