Aria Haghighi is teaching about structured prediction and was asking for good references on history-based models. What does one need to know about history-based models? What are the seminal works on history-based models?
Everything you wanted to know about history-based models, but were afraid to ask

Summary

Because they offer arbitrary modeling power, I believe that history-based models are necessary to achieve human-level performance on NLP tasks. Restrictive independence assumptions (like those of most dynamic-programming parsers) could force us to underfit the underlying linguistic criteria of interest. However, there has been insufficient work on history-based models to support this claim.

Background

History-based models are those in which arbitrary information from the prediction history can be used to evaluate the possible decisions at each choice point in the search space. There are not necessarily any independence assumptions, which means that arbitrary features can be used in the modeling. But independence assumptions allow one to collapse the size of the search space, and thus perform inference using dynamic programming. History-based models involve search for the minimum-loss structure, and people in NLP are scared of search. They would prefer an inference technique with a guaranteed upper bound on inference time, even if that upper bound is implausibly large, like O(n^5). For this reason, history-based methods have not been common since the 90's. Results with history-based models have been less competitive than those using dynamic programming, but I believe this is mainly because history-based models receive less scrutiny. So there is a negative reinforcement cycle.

Notable previous work

The early thread of work on history-based methods starts with the IBM models of Black et al. in the early 90's, and was adopted by Magerman in his NL parser (see my thesis, chapter 1, page 9, for references). Ratnaparkhi's (1999) work on NL parsing using maxent methods is one of the most well known, but I find the treatment to be slightly confusing. Henderson had a handful of history-based neural network parsing papers in the early 00's, but these works are also confusing, and I think you need to read several of them to understand what is going on (see my thesis, chapter 1, page 12, for references).

In the 00's, members of the dependency parsing community began to revisit the idea of history-based parsing. McDonald + Nivre (2007) refer to these parsers as "transition-based" parsers (compare to "graph-based" parsers, which use MST-style inference). This term emphasizes that the model is applied at every state transition. They cite the following works as examples of transition-based dependency parsers, which also stand out in my mind: Yamada and Matsumoto (2003), Nivre et al. (2004), and Attardi (2006). Note that these works generally refer to themselves as "deterministic" parsers, insofar as they use greedy search as inference. Greedy parsers are a subset of history-based parsers. I suspect that these parsers were all greedy because the authors were not sure of a sensible way to combine the scores of independent decisions into the score of an entire derivation path. For that reason, they could not compare different derivation paths, there was no sensible way to define the agenda priority for search, and greedy parsing was the only option.

NB: In my thesis (section 2.2, page 16), I use the term "deterministic" in a different sense than the "deterministic dependency parsers" do, and in the same sense as Marcus (1980). By "deterministic", I mean that each state in the search space can be reached by only one path, i.e. there is a one-to-one mapping between partial parses and their derivations. Non-determinism corresponds to what is termed "spurious ambiguity" by, for example, Clark + Curran (2007).
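To make the "decision at each choice point" picture concrete, here is a minimal runnable sketch of a greedy history-based (transition-based) parsing loop. It is a generic arc-standard-style toy, not any particular published parser; the constant scorer stands in for a learned model, which could condition on arbitrary features of the full history.

    # Toy greedy transition-based (history-based) dependency parsing loop.
    # The transition system is arc-standard-like; `score` is a stand-in for a
    # learned model, which could use arbitrary features of the full history.

    def legal_actions(stack, buffer):
        acts = []
        if buffer:
            acts.append("SHIFT")
        if len(stack) >= 2:
            acts.extend(["LEFT-ARC", "RIGHT-ARC"])
        return acts

    def score(history, stack, buffer, action):
        # Placeholder scorer; a real model would featurize history/stack/buffer.
        return {"SHIFT": 1.0, "RIGHT-ARC": 0.5, "LEFT-ARC": 0.0}[action]

    def greedy_parse(words):
        stack, buffer, arcs, history = [], list(range(len(words))), [], []
        while buffer or len(stack) > 1:
            best = max(legal_actions(stack, buffer),
                       key=lambda a: score(history, stack, buffer, a))
            if best == "SHIFT":
                stack.append(buffer.pop(0))
            elif best == "LEFT-ARC":      # top of stack heads the second-from-top
                dep = stack.pop(-2)
                arcs.append((stack[-1], dep))
            else:                         # RIGHT-ARC: second-from-top heads the top
                dep = stack.pop()
                arcs.append((stack[-1], dep))
            history.append(best)          # the decision history keeps growing
        return arcs                       # (head index, dependent index) pairs

    print(greedy_parse("the dog barks".split()))   # -> [(1, 2), (0, 1)]

Greedy search simply commits to the highest-scoring action at each step; the rest of the answer is about what else one can do once the per-decision scores combine sensibly.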
General treatments on history-based models

IMNSHO, one good treatment of history-based models is in my thesis (sections 2.4 + 2.5, pages 18-20). In particular, with a probabilistic history-based model, multiplication can be used to combine the probabilities of the individual parse decisions into the probability of a partial parse. I argue against using locally-normalized models, as that can cause label bias (Lafferty et al. 2001 and LeCun et al. 2007, Section 6.1). Ratnaparkhi (1997, 1999), McCallum et al. (2000), Henderson (2004), and Titov and Henderson (2007a,b) all use locally-normalized models, and hence should be subject to label bias (section 7.4, page 94 of my thesis). The Related Work chapter of my thesis (chapter 7) gives a high-level overview of the different design decisions possible in a structured predictor. Another good overview is in Hal Daume's invited ACL 2010 talk. That link is just to the summary; you should email him and ask for his slides.

Inference in history-based models

I argue that a history-based model can be smart enough that inference is actually fast. The number of states in the search space is exponential in the size of the input, and one might worry about search errors. However, in my thesis experiments, the inference evaluation function was learned accurately enough to guide the parser to the optimal parse reasonably quickly without pruning, and thus without search errors. Our best model has such low perplexity that, on 80.5% of sentences, its completely greedy solution is optimal. (The completely greedy solution is the one found using breadth-first search with a beam width of 1.) This is in comparison to previous non-DP-based discriminative parsers, which all used best-first search with pruning (Ratnaparkhi, 1999; Henderson, 2004; Collins and Roark, 2004; Titov and Henderson, 2007a,b) or no search at all (the greedy dependency parsers).

With an appropriate choice of parsing logic, best-case and observed history-based parsing time can be linear. The parsing logic (Melamed and Wang, 2005) is a set of constraints that structures the search space for an arbitrary input sentence. For example, Ratnaparkhi (1999) gives a logic such that a greedy parse requires time linear in the size of the sentence. In other words, best-case parsing time is linear. Compare this to the best-case parsing time of McDonald's first-order non-projective MST parser, which is quadratic. Ratnaparkhi also observes linear-time parsing in practice. My thesis argues that a refined history-based model will have little perplexity, and hence observed inference complexity will match best-case inference complexity.

I find it strange that a lot of parsers make independence assumptions to improve inference, but then prune, so they don't end up getting the optimal result anyway. The Collins model (Collins, 1997, 1999, 2003; Bikel, 2004a) is O(n^5) without pruning. Second-order non-projective dependency parsing is NP-hard, but even an approximate second-order approach is more accurate than the exact first-order approach (McDonald and Pereira, 2006). This supports my argument that stronger models are more important than tractable inference (which is rarely tractable in practice, anyhow). Koo and Collins (2010) demonstrate inference algorithms for third-order dependency parsers. My feeling is that we have developed enough specialized inference algorithms for different models. Just stop inventing new inference algorithms, and use search over a history-based model in which individual scores can be sensibly combined.
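As a small illustration of what "sensibly combined" scores could look like (a toy sketch, assuming a probabilistic history-based model; the conditional distribution below is made up and stands in for a learned P(decision | history)): the score of a partial parse is the product of its per-decision conditional probabilities, and the negated log-probability can serve directly as an agenda priority for search.

    import math

    # Sketch: combining per-decision probabilities into a partial-parse score.
    # `p_decision` is a made-up conditional distribution standing in for a
    # learned model P(decision | history).

    def p_decision(history, decision):
        return {"SHIFT": 0.6, "LEFT-ARC": 0.1, "RIGHT-ARC": 0.3}[decision]

    def partial_parse_logprob(decisions):
        # log P(partial parse) = sum_i log P(d_i | d_1 .. d_{i-1})
        return sum(math.log(p_decision(decisions[:i], d))
                   for i, d in enumerate(decisions))

    derivation = ["SHIFT", "SHIFT", "RIGHT-ARC"]
    cost = -partial_parse_logprob(derivation)   # usable as an agenda priority
    print(cost)                                 # -(log 0.6 + log 0.6 + log 0.3)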
Training a history-based model

There are three main training techniques for history-based models; see section 7.2 of my thesis.
My arguments, recapped
Hi Joseph, that's a great comment, it's my mistake. I should have said that they often use greedy search. Here's the relevant text in the paper: "Transition-based parsers do not impose factorizations, so they can define arbitrary features on the tree as it is being built. As a result, however, they rely on greedy or approximate search algorithms to solve Eq. 1 [the search for the highest-scoring parse]." What I was trying to get at there is that transition-based parsers have the opportunity to extract non-local features by examining the stack, but they are intractable when they exploit such non-local features (hence the greedy search). So transition-based parsing versus graph-based parsing is apples and oranges.
(Sep 22 '10 at 17:47)
Terry Koo
Terry: Actually, I was mistaken, I interpreted your claim too strongly. I have updated my answer. I don't necessarily believe transition-based parsing versus graph-based parsing is as cut-and-dry as apples and oranges. An nth-order parsing model converges to a transition-based model that uses the entire history as n increases. The difference, then, is the choice of inference algorithm.
(Sep 22 '10 at 17:52)
Joseph Turian ♦♦
Joseph, interesting analysis, I'm just adding some data points you seem to be missing: Kenji Sagae (ACL 2006) did search over a history-based model for dependency parsing. Liang Huang and Kenji Sagae (ACL 2010) have dynamic programming together with a history-based model, with very good results. My own easy-first parser (NAACL 2010) is also doing search for a history-based parser, although the search is integrated into the learning (so inference is greedy), and it's not a "transition-based" parser. All of these "history-based" models have a very restricted kind of history, though (in principle we could condition on anything; in practice we only look at local stuff anyway). Also, their search is structured in such a way that it is very difficult for later decisions to override earlier ones. These "history-based" models are not really stronger than a fourth-order parsing model. Eugene Charniak demonstrated (EMNLP 2010) that by looking at a much larger history (he used random forests for that) one can improve results by a lot.
(Sep 22 '10 at 18:37)
yoavg
So a history-based model is any model that can "condition" on any part of the search history (that is, non-Viterbi) when predicting the structure? If so, all search in such models is either exponential in the worst case or greedy in some sense (since you shouldn't really be able to tell, when making a local decision, which effect it will have in the distant future). I mention this because I think many people's resistance to this sort of model (and hence insistence on nth-order models) is due to abandoning the guarantees of optimality. I was recently thinking that this does not make much sense in the positive case (i.e., if you see an approximate algorithm performing better than the state of the art, you're done and you have a good result), but when experimental results are "negative" it gets harder to see whether it's a model issue or an approximation issue. How do most techniques for history-based methods deal with the issue of negative results? (Of course, most of what is done in ML is approximate with little to no guarantees about negative results; Bayesian methods and neural networks being prime examples.)
(Sep 22 '10 at 19:59)
Alexandre Passos ♦
@Yoav, perhaps I'm missing something, but I think that history-based = transition-based. Each transition is a decision at a choice point in the search, i.e. an edge in the search graph, i.e. monotonically adding some assertion (item) about the final parse to the current state (set of assertions). We are not talking about a transition in a dependency parse, but rather a transition in the state space that structures the prediction.
(Sep 22 '10 at 22:00)
Joseph Turian ♦♦
@Alexandre: Not exactly; the issue of search can make this trickier to understand. Consider that the output is a set of assertions over the input (e.g. there is an NP bracket from the first through the third word). A partial parse or state is a set of assertions. Each inference or decision adds an item to a state. A history-based model can condition on the entire state. The search space is a DAG. The root of the DAG is the initial state. Each edge in the search graph is an inference. My argument is that, even when people make independence assumptions and use DP, they still often end up with inference that is too expensive: O(n^5) in the Collins parser, for example. So pruning is introduced into the beam search. And once you prune, you can never be sure you get the optimal answer. In my history-based parser, the model has little perplexity about the decisions at each choice point, so I can get the optimal parse and don't need to prune.
(Sep 22 '10 at 22:05)
Joseph Turian ♦♦
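A minimal sketch of the state space described in the comment above (the assertion tuples are invented purely for illustration): states are sets of assertions, an inference monotonically adds one assertion, and because two different decision orders can build the same set, the same state is reachable along multiple paths, i.e. the search space is a DAG rather than a tree (the "spurious ambiguity" mentioned in the answer).

    # Sketch: states as sets of assertions; each inference adds one assertion.
    # Assertion tuples like ("NP", 0, 3) ("an NP bracket over words 0..3") are
    # invented for illustration.

    State = frozenset    # hashable, so states reached by different paths merge

    def expand(state, assertion):
        # One inference: monotonically add an assertion to the state.
        return State(state | {assertion})

    root = State()
    a, b = ("NP", 0, 3), ("VP", 3, 5)

    # Two different derivation orders reach the SAME state, so the search
    # space is a DAG, not a tree:
    print(expand(expand(root, a), b) == expand(expand(root, b), a))   # True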
1) The term "history-based" goes back to Black et al (ACL 1993) and David Magerman's thesis (1995). 2) "I find it strange that a lot of parsers make independence assumptions to improve inference, but then prune, so they don't end up getting the optimal results anyway. The Collins model (Collins, 1997, 1999, 2003; Bikel, 2004a) is O(n^5) without pruning." The point of these models (and Charniak's generative ones, and earlier IBM work) is that training is trivial -- it just counts observable events. The independence assumptions are not there to improve inference, they are there to speed up learning. 3) There is an important difference between "history-based" and "transition-based". Transition-based parsers use a classifier as the controller for a suitable pushdown automaton. A history-based parser does not assume a PDA. Abney et al (ACL 99) show that there are subtle differences between these two views. 4) "I suspect that these parsers were all greedy because the authors were not sure of a sensible way to combine the scores from independent decisions into the score for an entire derivation path." I don't share this suspicion. Deterministic transition-based parsers, especially Nivre's, are by far the fastest for a given level of accuracy. For a good example, see JMLR 11(Apr):1471−1490, 2010. 5) "This supports my argument that stronger models are more important [than] tractable inference." It doesn't, really. The problem is that matching approximate inference with discriminative learning is a black art (Kulesza & Pereira NIPS 2008). There's hope (Weiss & Taskar, AIStats 2010) that we'll finally understand how pruning and learning interact, but until now it's all been trial and error.
(Sep 22 '10 at 23:42)
Fernando Pereira
@Joseph, by your definition (transitions are search-space-transitions) just about any parser is transition-based.
(Sep 23 '10 at 02:05)
yoavg
Very nice overview, Joseph. Just a couple of comments: Your conjecture that early transition-based dependency parsers all used greedy search because it is difficult to score complete transition sequences is only partially true. Personally, I was just intrigued to find out just how far we could push such a simple-minded approach before it broke down. I guess I am still pushing ... :) In addition, I think I was inspired by the evidence from psycholinguistics that human sentence processing is deterministic or near-deterministic. On the topic of training, there is at least one transition-based dependency parser that uses online learning with global optimization, that of Yue Zhang and Steve Clark presented at EMNLP 2008 in a very nice paper entitled "A Tale of Two Parsers". Finally, on the topic of transition-based vs. graph-based dependency parsing, that distinction has nothing to do with search in itself, but only with the parameterization of the parsing problem. In our 2007 EMNLP paper, Ryan and I were careful to characterize the two approaches contrasted (that of MSTParser and MaltParser, respectively) as global, exhaustive, graph-based parsing vs. local, greedy, transition-based parsing, where global vs. local is about learning, exhaustive vs. greedy is about inference, and graph-based vs. transition-based is about parameterization. Of course, these are long and cumbersome descriptions, so somewhere along the line they got shortened to just graph-based vs. transition-based. But Zhang and Clark's parser is a good example of a globally trained transition-based parser. In addition, it uses beam search, so it is no longer strictly greedy.
(Sep 23 '10 at 02:15)
Joakim Nivre
Fernando, thanks for reminding me that efficiency was also an important reason to go deterministic. :)
(Sep 23 '10 at 02:20)
Joakim Nivre
Fernando and Joakim, thank you for your comments. I wanted to comment further on the idea of determinism. Kudo + Matsumoto was one of the first deterministic dependency parsers, and my suspicion is that they didn't do search because it's not clear whether the different SVM scores could be combined sensibly. But this is mere speculation. I agree that the speed of greedy parsing is definitely a draw. But I do believe there is an interesting question of how often backtracking is necessary. For example, if doing full search increases the parsing time by only a factor of 2x, is that so bad? In my thesis parser, I found that our cost function is so refined that, on 80% of sentences, the greedy solution is also the optimal one. So backtracking in search might not be necessary often, and might not cause a big speed drop. One way to explore this question is with an anytime search algorithm, one that begins with the greedy parse and then keeps searching for better solutions, but always has a complete parse if the search is terminated before the agenda is exhausted. I used a search algorithm that I call "greedy completion" (chapter 3 of my thesis), which does just that. Traditional greedy search pursues the lowest-cost expansion at each step; after a solution is found, greedy search terminates. In greedy completion, traditional greedy search is the inner loop, and all the expanded partial parses considered but not greedily pursued go on the agenda. When traditional greedy search finds a solution, or cannot proceed because it cannot beat the current best solution, the inner loop terminates, leaving a frontier of expanded partial parses not explored during previous greedy searches. These frontier partial parses form the agenda. Greedy completion picks the lowest-cost partial parse in the frontier and starts a new inner loop (iteration of greedy search) beginning at this partial parse. Another perspective is that greedy completion is agenda-based search with a particular priority: the priority of a partial parse is its negative cost, except that we always prefer any partial parse that was a consequent of the last partial parse popped. I was not able to find any other example of this in the literature, even though it seems pretty obvious in retrospect.
(Sep 23 '10 at 16:50)
Joseph Turian ♦♦
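A rough runnable sketch of the greedy-completion procedure described in the comment above (the 3-symbol toy problem, cost table, and helper names are invented so the example runs end to end; a real parser's parsing logic and learned cost function would replace them):

    import heapq

    COSTS = {"a": 1.0, "b": 2.0, "c": 0.5}      # per-decision costs (toy stand-in)

    def is_complete(state):
        return len(state) == 3                   # a "parse" = any 3-decision sequence

    def expansions(cost, state):
        return [(cost + COSTS[s], state + (s,)) for s in COSTS]

    def greedy_completion(max_pops=100):
        best = None                              # best complete solution so far
        frontier = [(0.0, ())]                   # agenda of (cost, partial parse)
        pops = 0
        while frontier and pops < max_pops:
            cost, state = heapq.heappop(frontier)    # lowest-cost frontier item
            pops += 1
            # Inner loop: traditional greedy search starting from this item.
            while not is_complete(state):
                if best is not None and cost >= best[0]:
                    break                        # cannot beat the current best
                kids = sorted(expansions(cost, state))
                cost, state = kids[0]            # greedily pursue the cheapest child
                for kid in kids[1:]:             # ...and put the rest on the agenda
                    heapq.heappush(frontier, kid)
            if is_complete(state) and (best is None or cost < best[0]):
                best = (cost, state)             # anytime: a full parse is always held
        return best

    print(greedy_completion())                   # -> (1.5, ('c', 'c', 'c'))

The inner loop implements the "always prefer a consequent of the last popped partial parse" priority; the outer loop makes the search anytime, since the best complete solution so far can be returned whenever the search is stopped.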
@Yoavg, I would argue that a parser is no longer transition-based if the search space is structured as a non-degenerate hypergraph. But I think the most useful distinction to be made is in terms of what restrictions are placed on features of the history.
(Sep 28 '10 at 14:14)
Joseph Turian ♦♦
Thanks for a nice overview. Still, I cannot agree with your comment that our models (SSN/ISBNs) are susceptible to label bias. Though they use local normalization, they are generative: they estimate the joint probability of a sentence and its syntactic structure. Namely, in addition to computing the probability of structural decisions (e.g., creating an arc in dependency parsing), they compute the probability P(next_word | parse_history). So the score of a partial parse will be dampened if the next word is unexpected in this (partial) syntactic context, unlike for MEMM-type models. However, given that we (have to) use a form of beam search, all the partial analyses may be discarded, but that is a different story.
(Sep 28 '10 at 14:31)
Ivan Titov
Since they are locally normalized, however, shouldn't it be very hard to detect that the next word is bad if you're in a bad state already? Label bias arises from good states and bad states always passing exactly the same mass forward.
(Sep 28 '10 at 21:29)
Alexandre Passos ♦
Alexandre,
No, I do not think so. This does not happen here, exactly because bad states do not pass the same mass forward on these word-prediction operations. Also, if a state is bad (say, very different from anything observed in the training set), then the word-prediction distribution is going to be flat over the vocabulary (roughly, it will correspond to the unigram word probability) and therefore the score will be quite low too (many outgoing states). Note also that the state here corresponds to a derivation prefix (not including the non-processed words), and this derivation prefix is either "not bad" (similar derivations were observed in training) or will already have a low score because of the preceding word predictions. These word predictions are "hard" for a different reason, though: they involve normalization over the vocabulary (which cannot be precomputed, unlike standard PCFG production probs), and therefore having a large vocabulary is problematic. Still, in practice, the results are quite good even with a relatively weakly lexicalized model.
(Sep 29 '10 at 05:21)
Ivan Titov
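A toy numeric sketch of the argument above (invented numbers and function names, not the actual SSN/ISBN models): even with local normalization at each step, the generative word-prediction term penalizes an implausible history, because its word distribution is roughly flat over a large vocabulary.

    import math

    # Toy numbers only: a generative history-based model also scores
    # P(next_word | parse_history), so an implausible history is penalized when
    # the observed word is unlikely under its flat, unigram-like distribution,
    # even though every local distribution still sums to one.

    VOCAB_SIZE = 10000

    def p_word_given_history(history_is_plausible, word):
        return 0.2 if history_is_plausible else 1.0 / VOCAB_SIZE

    def step_logscore(history_is_plausible, p_decision, word):
        # locally normalized decision prob * generative word prob
        return math.log(p_decision) + \
               math.log(p_word_given_history(history_is_plausible, word))

    print(step_logscore(True, 0.7, "bank"))    # ~ -2.0
    print(step_logscore(False, 0.7, "bank"))   # ~ -9.6: the bad history is punished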
@Ivan Titov: Thanks, I understand now.
(Sep 29 '10 at 05:35)
Alexandre Passos ♦
@Ivan: So the Pr(next word | history) is combined with the structure decisions during search? I think I see your argument. Although in principle there could still be label bias, it will be very very weak, because the vocabulary is so large. Correct? BTW, you should consider the stochastic sampling technique used by Collobert + Weston as a possible solution to the "normalizing over the whole vocabulary" issue.
(Sep 29 '10 at 12:43)
Joseph Turian ♦♦
@Joseph: yes, essentially correct. Normalization issue: we were thinking more about some form of hierarchical word clustering and predicting a path in this graph (i.e. decomposing P(next word | history) into a product of simpler probabilities). Anyway, we haven't observed much of an improvement from increasing the vocabulary beyond a cut-off of 5 on WSJ (which we can handle in a straightforward way), so there was little incentive to study it too closely.
(Oct 05 '10 at 13:02)
Ivan Titov
Okay. The idea of predicting the word through a tree-structured output is also good. Mnih did it for his hierarchical language models (HLBL) to speed up training.
(Oct 06 '10 at 11:27)
Joseph Turian ♦♦
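For concreteness, a generic sketch of that hierarchical decomposition (a toy construction, not Mnih's HLBL nor the Titov & Henderson proposal; the tree, word codes, and constant node probability are invented): P(next word | history) becomes a product of binary decisions along the word's root-to-leaf path, so each local normalization is over two outcomes instead of the whole vocabulary.

    import math

    # Toy binary tree over a 4-word vocabulary: each word's code is its
    # root-to-leaf path, and P(word | history) is the product of the binary
    # branch probabilities along that path.

    WORD_CODE = {"the": "00", "dog": "01", "barks": "10", "loudly": "11"}

    def p_go_right(history, node_path):
        # Stand-in for a learned binary decision at tree node `node_path`,
        # conditioned on the history; here just a constant.
        return 0.4

    def p_word(history, word):
        logp, path = 0.0, ""
        for bit in WORD_CODE[word]:
            pr = p_go_right(history, path)
            logp += math.log(pr if bit == "1" else 1.0 - pr)
            path += bit
        return math.exp(logp)

    print(p_word([], "dog"))                         # 0.6 * 0.4 = 0.24
    print(sum(p_word([], w) for w in WORD_CODE))     # sums to 1 over the vocabulary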