I'd like to build a Phrase Graph as described in Experiments in Microblog Summarizaion. The idea is to build a graph that counts the number of sentences have a particular phrase, where phrases are composed of sequences of nodes in the graph. For example, if I have these two sentences

#archery, by the Republic of Korea - and the guy is legally blind. Awesome. 
#archery by the Republic of Korea in archery - by a guy who is legally blind.

I'd want to build a graph that looks like

Sample Phrase Graph with key structures.

This is highly related to a Prefix Tree except common nodes in the tree are merged together. And writing the prefix tree version is really straight forward, but i'm not sure how to code up this phrase graph, nor can I think of a more proper name for this type of graph, which I know exists and have seen before. Does anyone have experience with building these graphs? And would you be able to link me to any algorithms for building them?

Thanks!

EDIT: It looks like Wikipedia already has an answer. They link to Incremental construction of minimal acyclic finite-state automata, which is precisely what I want. It's 12 years old however, so If you know of any newer/faster ways or code packages that already do this, I'd definitely still be interested.

asked Aug 09 '12 at 01:27

Keith%20Stevens's gravatar image

Keith Stevens
62161327

edited Aug 09 '12 at 01:37

Can a phrase graph encode more paths than there are examples sentences?

(Oct 15 '12 at 23:12) PaulDixon

4 Answers:

brics.automaton has a makeStringUnion function which should cover your use case.

answered Aug 09 '12 at 17:24

Daniel%20V's gravatar image

Daniel V
312

A sweetness! That's a great example of what I was looking for. Though I'm probably going to have to write my own code using the algorithm in the paper I linked to since I need some extra features that I don't think I can hack into the brics' code like frequency counts of the edges.

(Aug 09 '12 at 21:47) Keith Stevens

I believe the "special case" of automata you're looking for is called a directed acyclic word graph (DAWG) ==> coolest data structure name ever.

The easiest way to create a graph is to have your script write a file with the graphviz encoding of your graph. Given this file graphviz optimize a layout and render it in a variety of image formats.

answered Aug 26 '12 at 15:09

timv's gravatar image

timv
162

http://code.google.com/p/febird/ has a extremely fast implementation of DAWG and incrementally construction of Minimal Acyclic DFA, it is more than 10x faster and 10x small then Jan Daciuk's implementation.

answered May 29 '13 at 22:54

Peng%20Rockeet%20Lei's gravatar image

Peng Rockeet Lei
1

http://code.google.com/p/febird/ has a extremely fast implementation of DAWG and incrementally construction of Minimal Acyclic DFA, it is more than 10x faster and 10x small then Jan Daciuk's implementation.

answered May 29 '13 at 22:54

Peng%20Rockeet%20Lei's gravatar image

Peng Rockeet Lei
1

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.