|
I'd like to build a Phrase Graph as described in Experiments in Microblog Summarizaion. The idea is to build a graph that counts the number of sentences have a particular phrase, where phrases are composed of sequences of nodes in the graph. For example, if I have these two sentences
I'd want to build a graph that looks like
This is highly related to a Prefix Tree except common nodes in the tree are merged together. And writing the prefix tree version is really straight forward, but i'm not sure how to code up this phrase graph, nor can I think of a more proper name for this type of graph, which I know exists and have seen before. Does anyone have experience with building these graphs? And would you be able to link me to any algorithms for building them? Thanks! EDIT: It looks like Wikipedia already has an answer. They link to Incremental construction of minimal acyclic finite-state automata, which is precisely what I want. It's 12 years old however, so If you know of any newer/faster ways or code packages that already do this, I'd definitely still be interested. |
|
brics.automaton has a makeStringUnion function which should cover your use case. A sweetness! That's a great example of what I was looking for. Though I'm probably going to have to write my own code using the algorithm in the paper I linked to since I need some extra features that I don't think I can hack into the brics' code like frequency counts of the edges.
(Aug 09 '12 at 21:47)
Keith Stevens
|
|
I believe the "special case" of automata you're looking for is called a directed acyclic word graph (DAWG) ==> coolest data structure name ever. The easiest way to create a graph is to have your script write a file with the graphviz encoding of your graph. Given this file graphviz optimize a layout and render it in a variety of image formats. |
|
http://code.google.com/p/febird/ has a extremely fast implementation of DAWG and incrementally construction of Minimal Acyclic DFA, it is more than 10x faster and 10x small then Jan Daciuk's implementation. |
|
http://code.google.com/p/febird/ has a extremely fast implementation of DAWG and incrementally construction of Minimal Acyclic DFA, it is more than 10x faster and 10x small then Jan Daciuk's implementation. |

Can a phrase graph encode more paths than there are examples sentences?