Discriminative Lexical Semantic Segmentation with Gaps: Running the MWE Gamut

We present a novel representation, evaluation measure, and supervised models for the task of identifying the multiword expressions (MWEs) in a sentence, resulting in a lexical semantic segmentation. Our approach generalizes a standard chunking representation to encode MWEs containing gaps, thereby enabling efficient sequence tagging algorithms for feature-rich discriminative models. Experiments on a new dataset of English web text offer the first linguistically-driven evaluation of MWE identification with truly heterogeneous expression types. Our statistical sequence model greatly outperforms a lookup-based segmentation procedure, achieving nearly 60% F1 for MWE identification.


Introduction
Language has a knack for defying expectations when put under the microscope. For example, there is the notion (sometimes referred to as compositionality) that words will behave in predictable ways, with individual meanings that combine to form complex meanings according to general grammatical principles. Yet language is awash with examples to the contrary: in particular, idiomatic expressions such as awash with NP, have a knack for VP-ing, to the contrary, and defy expectations. Thanks to processes like metaphor and grammaticalization, these are (to various degrees) semantically opaque, structurally fossilized, and/or statistically idiosyncratic. In other words, idiomatic expressions may be exceptional in form, function, or distribution. They are so diverse, so unruly, so difficult to circumscribe, that entire theories of syntax are predicated on the notion that constructions with idiosyncratic form-meaning mappings (Fillmore et al., 1988; Goldberg, 1995) or statistical properties (Goldberg, 2006) offer crucial evidence about the grammatical organization of language.
Here we focus on multiword expressions (MWEs): lexicalized combinations of two or more words that are exceptional enough to be considered as single units in the lexicon. As figure 1 illustrates, MWEs occupy diverse syntactic and semantic functions. Within MWEs, we distinguish (a) proper names and (b) lexical idioms. The latter have proved themselves a "pain in the neck for NLP" (Sag et al., 2002). Automatic and efficient detection of MWEs, though far from solved, would have diverse applications including machine translation (Carpuat and Diab, 2010), information retrieval (Newman et al., 2012), opinion mining (Berend, 2011), and second language learning (Ellis et al., 2008).
It is difficult to establish any comprehensive taxonomy of multiword idioms, let alone develop linguistic criteria and corpus resources that cut across these types. Consequently, the voluminous literature on MWEs in computational linguistics (see §7, Baldwin and Kim (2010), and Ramisch (2012) for surveys) has been fragmented, looking (for example) at subclasses of phrasal verbs or nominal compounds in isolation. To the extent that MWEs have been annotated in existing corpora, it has usually been as a secondary aspect of some other scheme. Traditionally, such resources have prioritized certain kinds of MWEs to the exclusion of others, so they are not appropriate for evaluating general-purpose identification systems.
In this article, we briefly review a shallow form of analysis for MWEs that is neutral to expression type, and that facilitates free text annotation without requiring a prespecified MWE lexicon (§2). The scheme applies to gappy (discontinuous) as well as contiguous expressions, and allows for a qualitative distinction of association strengths. In Schneider et al. (2014) we have applied this scheme to fully annotate a 55,000-word corpus of English web reviews (Bies et al., 2012a), a conversational genre in which colloquial idioms are highly salient. This article's main contribution is to show that the representation, constrained according to linguistically motivated assumptions (§3), can be transformed into a sequence tagging scheme that resembles standard approaches in named entity recognition and other text chunking tasks (§4). Along these lines, we develop a discriminative, structured model of MWEs in context (§5) and train, evaluate, and examine it on the annotated corpus (§6). Finally, in §7 and §8 we comment on related work and future directions.

Annotated Corpus
To build and evaluate a multiword expression analyzer, we use the MWE-annotated corpus of Schneider et al. (2014). It consists of informal English web text that has been specifically and completely annotated for MWEs, without reference to any particular lexicon. To the best of our knowledge, this corpus is the first to be freely annotated for many kinds of MWEs (without reference to a lexicon), and is also the first dataset of social media text with MWE annotations beyond named entities. This section gives a synopsis of the annotation conventions used to develop that resource, as they are important to understanding our models and evaluation.
Rationale. The multiword expressions community has lacked a canonical corpus resource comparable to benchmark datasets used for problems such as NER and parsing. Consequently, the MWE literature has been driven by lexicography: typically, the goal is to acquire an MWE lexicon with little or no supervision, or to apply such a lexicon to corpus data. Studies of MWEs in context have focused on various subclasses of constructions in isolation, necessitating special-purpose datasets and evaluation schemes. By contrast, Schneider et al.'s (2014) corpus creates an opportunity to tackle general-purpose MWE identification, such as would be desirable for use by high-coverage downstream NLP systems. It is used to train and evaluate our models below. The corpus is publicly available as a benchmark for further research.
Data. The documents in the corpus are online user reviews of restaurants, medical providers, retailers, automotive services, pet care services, etc. Marked by conversational and opinionated language, this genre is fertile ground for colloquial idioms (Nunberg et al., 1994; Moon, 1998). The 723 reviews (55,000 words, 3,800 sentences) in the English Web Treebank (WTB; Bies et al., 2012b) were collected by Google, tokenized, and annotated with phrase structure trees in the style of the Penn Treebank (Marcus et al., 1993). MWE annotators used the sentence and word tokenizations supplied by the treebank.
Annotation scheme. The annotation scheme itself was designed to be as simple as possible. It consists of grouping together the tokens in each sentence that belong to the same MWE instance. While annotation guidelines provide examples of MWE groupings in a wide range of constructions, the annotator is not [...]
Gaps. There are, broadly speaking, three reasons to group together tokens that are not fully contiguous. Most commonly, gaps contain internal modifiers, such as good in make good decisions. Syntactic constructions such as the passive can result in gaps that might not otherwise be present: in good decisions were made, there is instead a gap filled by the passive auxiliary. Finally, some MWEs may take internal arguments: they gave me a break. Figure 1 has additional examples. Multiple gaps can occur even within the same expression, though it is rare: they agreed to give Bob a well-deserved break.
Strength. The annotation scheme has two "strength" levels for MWEs. Clearly idiomatic expressions are marked as strong MWEs, while mostly compositional but especially frequent collocations/phrases (e.g., abundantly clear and patently obvious) are marked as weak MWEs. Weak multiword groups are allowed to include strong MWEs as constituents (but not vice versa). Strong groups are required to cohere when used inside weak groups: that is, a weak group cannot include only part of a strong group. For purposes of annotation, there were no constraints hinging on the ordering of tokens in the sentence.
Process. MWE annotation proceeded one sentence at a time. The 6 annotators referred to and improved the guidelines document on an ongoing basis. Every sentence was seen independently by at least 2 annotators, and differences of opinion were discussed and resolved (often by marking a weak MWE as a compromise). See Schneider et al. (2014) for details.
Statistics. The annotated corpus consists of 723 documents (3,812 sentences). MWEs are frequent in this domain: 57% of sentences (72% of sentences over 10 words long) and 88% of documents contain at least one MWE. Of the 55,579 tokens, 8,060 (15%) belong to an MWE; in total, there are 3,483 MWE instances. Of these, 544 (16%) are strong MWEs containing a gold-tagged proper noun; most are proper names. A breakdown appears in table 1.

Representation and Task Definition
We define a lexical segmentation of a sentence as a partitioning of its tokens into segments such that each segment represents a single unit of lexical meaning. A multiword lexical expression may contain gaps, i.e. interruptions by other segments. We impose two restrictions on gaps that appear to be well-motivated linguistically:
• Projectivity: Every expression filling a gap must be completely contained within that gap; gappy expressions may not interleave.
• No nested gaps: A gap in an expression may be filled by other single- or multiword expressions, so long as those do not themselves contain gaps.
Formal grammar. Our scheme corresponds to the following extended CFG (Thatcher, 1967), where S is the full sentence and terminals w are word tokens:
$S \to X^{+}$
$X \to w^{+} \, (Y^{+} \, w^{+})^{*}$
$Y \to w^{+}$
Each expression X or Y is lexicalized by the words in one or more of the $w^{+}$ variables on the right-hand side. An X constituent may optionally contain one or more gaps filled by Y constituents, which must not contain gaps themselves. [3]
Denoting multiword groupings with subscripts, My wife had taken_1 her '07_2 Ford_2 Fusion_2 in_1 for a routine oil_3 change_3 contains 3 multiword groups ({taken, in}, {'07, Ford, Fusion}, {oil, change}) and 7 single-word groups. The first MWE is gappy; a single word and a contiguous multiword group fall within the gap. The projectivity constraint forbids an analysis like taken_1 her '07_2 Ford_1 Fusion_2, while the gap nesting constraint forbids taken_1 her_2 '07 Ford_2 Fusion_2 in_1.
[3] MWEs with multiple gaps are rare but attested in data: e.g., putting me at my ease. We encountered one violation of the gap nesting constraint in the reviews data: I have^2_1 nothing^2_1 but^2_1 fantastic things^2 to^2_1 say^2_1. Additionally, the interrupted phrase [...]
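The two gap constraints can be checked mechanically. The following is a minimal sketch, assuming each multiword group is given as a list of token indices; the function names `gaps` and `check_segmentation` are illustrative and are not taken from the released software.

```python
def gaps(group):
    """Return the (first, last) index ranges lying strictly between
    consecutive tokens of a group; empty for a contiguous group."""
    g = sorted(group)
    return [(a + 1, b - 1) for a, b in zip(g, g[1:]) if b - a > 1]


def check_segmentation(groups):
    """Return the sorted names of the constraints violated by `groups`.

    Assumption: groups are non-empty, pairwise disjoint lists of token
    indices; single-word expressions are simply not listed.
    """
    problems = set()
    for outer in groups:
        for lo, hi in gaps(outer):
            for inner in groups:
                if inner is outer:
                    continue
                inside = [lo <= t <= hi for t in inner]
                if any(inside) and not all(inside):
                    # inner straddles the gap boundary: interleaving
                    problems.add("projectivity")
                elif all(inside) and gaps(inner):
                    # inner sits in the gap but is itself gappy
                    problems.add("nested gap")
    return sorted(problems)


# Token indices for: My wife had taken her '07 Ford Fusion in for a routine oil change
valid = [[3, 8], [5, 6, 7], [12, 13]]
interleaved = [[3, 6], [5, 7]]        # taken_1 her '07_2 Ford_1 Fusion_2
nested = [[3, 8], [4, 6, 7]]          # taken_1 her_2 '07 Ford_2 Fusion_2 in_1
print(check_segmentation(valid), check_segmentation(interleaved), check_segmentation(nested))
```

The three inputs correspond to the running example and to the two forbidden analyses described in the text.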

Two-level Scheme: Strong vs. Weak MWEs
Our annotated data distinguish two strengths of MWEs as discussed in §2. Augmenting the grammar of the previous section, we therefore designate nonterminals as strong (X, Y) or weak (X̃, Ỹ): [...] Weak expressions may be lexicalized by single words and/or strong multiwords. Strong multiwords cannot contain weak multiwords except in gaps. Further, the contents of a gap cannot be part of any multiword that extends outside the gap. [4]
For example, consider the segmentation: he was willing to budge_1 a_2 little_2 on_1 the price which means^4 a^4_3 lot^4_3 to^4 me^4. Subscripts denote strong MW groups and superscripts weak MW groups; unmarked tokens serve as single-word expressions. The MW groups are thus {budge, on}, {a, little}, {a, lot}, and {means, {a, lot}, to, me}. As should be evident from the grammar, the projectivity and gap-nesting constraints apply here just as in the 1-level scheme.

Evaluation
Matching criteria. Given that most tokens do not belong to an MWE, to evaluate MWE identification we adopt a precision/recall-based measure from the coreference resolution literature. The MUC criterion (Vilain et al., 1995) measures precision and recall of links in terms of groups (units) implied by the transitive closure over those links. [5] It can be defined as follows: Let $a \sim b$ denote a link between two elements in the gold standard, and $a \mathrel{\hat{\sim}} b$ denote a link in the system prediction. Let the $*$ operator denote the transitive closure over all links, such that $[\![a \sim^{*} b]\!]$ is 1 if a and b belong to the same (gold) set, and 0 otherwise. Assuming there are no redundant [6] links within any annotation (which in our case is guaranteed by linking consecutive words in each MWE), we can write the MUC precision and recall measures as:
$$\text{Precision} = \frac{\sum_{a,b:\, a \mathrel{\hat{\sim}} b} [\![a \sim^{*} b]\!]}{\sum_{a,b:\, a \mathrel{\hat{\sim}} b} 1} \qquad \text{Recall} = \frac{\sum_{a,b:\, a \sim b} [\![a \mathrel{\hat{\sim}^{*}} b]\!]}{\sum_{a,b:\, a \sim b} 1}$$
This awards partial credit when predicted and gold expressions overlap in part. Requiring full MWEs to match exactly would arguably be too stringent, overpenalizing larger MWEs for minor disagreements. We combine precision and recall using the standard $F_1$ measure of their harmonic mean. This is the link-based evaluation used for most of our experiments. For comparison, we also report some results with a more stringent exact match evaluation where the span of the predicted MWE must be identical to the span of the gold MWE for it to count as correct.
Strength averaging. Recall that the 2-level scheme (§3.1) distinguishes strong vs. weak links/groups, where the latter category applies to reasonably compositional collocations as well as ambiguous or difficult cases. Where one annotation uses a weak link and the other has a strong link or no link at all, we want to penalize the disagreement less than if one had a strong link and the other had no link. To accommodate the 2-level scheme, we therefore average $F_1^{\uparrow}$, in which all weak links have been converted to strong links, and $F_1^{\downarrow}$, in which they have been removed:
$$F_1 = \tfrac{1}{2}\left(F_1^{\uparrow} + F_1^{\downarrow}\right)$$ [7]
If neither annotation contains any weak links, this equals the MUC score, since then $F_1^{\uparrow} = F_1^{\downarrow}$. This method applies to both the link-based and exact match evaluation criteria.
[3, continued] great gateways never_1 before_1, so^2_3 far^2_3 as^2_3 Hudson knew^2, seen_1 by Europeans was annotated in another corpus.
[4] This was violated 6 times in our annotated data: modifiers within gaps are sometimes collocated with the gappy expression, as in on^1_2 a^1_2 tight^1 budget^1_2 and have^1_2 little^1 doubt^1_2.
[5] As a criterion for coreference resolution, the MUC measure has perceived shortcomings which have prompted several other measures (see Recasens and Hovy, 2011 for a review). It is not clear, however, whether any of these criticisms are relevant to MWE identification.
[6] A link between a and b is redundant if the other links already imply that a and b belong to the same set. A set of N elements is expressed non-redundantly with exactly N − 1 links.
[7] Overall precision and recall are likewise computed by averaging "strengthened" and "weakened" measurements.
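To make the link-based measure and the strength averaging concrete, here is a minimal sketch. It assumes that each annotation is given as a list of strong groups and a list of weak groups (sets of token indices), that each weak group lists all of its tokens including those of any strong group nested inside it, and that groups are disjoint once weak links are strengthened. The function names are illustrative, not the authors' evaluation script.

```python
def links(groups):
    """Non-redundant links: join consecutive tokens of each group."""
    return [(a, b) for g in groups for a, b in zip(sorted(g), sorted(g)[1:])]


def same_set(groups, a, b):
    """True if a and b fall in one group (closure over that annotation's links)."""
    return any(a in g and b in g for g in groups)


def link_prf(gold, pred):
    """Link-based precision, recall, and F1 for two lists of groups."""
    pl, gl = links(pred), links(gold)
    p = sum(same_set(gold, a, b) for a, b in pl) / len(pl) if pl else 0.0
    r = sum(same_set(pred, a, b) for a, b in gl) / len(gl) if gl else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def averaged_prf(gold_strong, gold_weak, pred_strong, pred_weak):
    """Average the 'strengthened' (weak -> strong) and 'weakened' (weak removed) scores."""
    def strengthen(strong, weak):
        # a strong group nested in a weak group is absorbed by it
        return list(weak) + [s for s in strong if not any(s <= w for w in weak)]
    up = link_prf(strengthen(gold_strong, gold_weak), strengthen(pred_strong, pred_weak))
    down = link_prf(gold_strong, pred_strong)
    return tuple((u + d) / 2 for u, d in zip(up, down))


# Gold: budge..on, a little, and weak {means, {a, lot}, to, me}; token indices are 0-based.
gold_strong = [{4, 7}, {5, 6}, {12, 13}]
gold_weak = [{11, 12, 13, 14, 15}]
# Hypothetical prediction that finds only the two contiguous strong MWEs.
pred_strong = [{5, 6}, {12, 13}]
pred_weak = []
p, r, f = averaged_prf(gold_strong, gold_weak, pred_strong, pred_weak)
```

In this hypothetical case the strengthened recall is 2/6 and the weakened recall is 2/3, so missing the weak group costs less than missing a strong group of the same size would.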

Tagging Schemes
Following Ramshaw and Marcus (1995), shallow analysis is often modeled as a sequence-chunking task, with tags containing chunk-positional information. The BIO scheme and variants (e.g., BILOU; Ratinov and Roth, 2009) are standard for tasks like named entity recognition, supersense tagging, and shallow parsing.
The language of derivations licensed by the grammars in §3 allows for a tag-based encoding of MWE analyses with only bigram constraints. We describe 4 tagging schemes for MWE identification, starting with BIO and working up to more expressive variants. They are depicted in figure 2.
No gaps, 1-level (3 tags). This is the standard contiguous chunking representation from Ramshaw and Marcus (1995) using the tags {O, B, I}. O is for tokens outside any chunk; B marks tokens beginning a chunk; and I marks other tokens inside a chunk. Multiword chunks will thus start with B and then I. B must always be followed by I; I is not allowed at the beginning of the sentence or following O.
No gaps, 2-level (4 tags). We can distinguish strength levels by splitting I into two tags: Ī for strong expressions and Ĩ for weak expressions. To express strong and weak contiguous chunks requires 4 tags: {O, B, Ī, Ĩ}. (Marking B with a strength as well would be redundant because MWEs are never length-one chunks.) The constraints on Ī and Ĩ are the same as the constraints on I in the previous scheme. If Ī and Ĩ occur next to each other, the strong attachment will receive higher precedence, resulting in analysis of strong MWEs as nested within weak MWEs.
Gappy, 1-level (6 tags). Because gaps cannot themselves contain gappy expressions (we do not support full recursivity), a finite number of additional tags are sufficient to encode gappy chunks. We therefore add lowercase tag variants representing tokens within a gap: {O, o, B, b, I, i}. In addition to the constraints stated above, no within-gap tag may occur at the beginning or end of the sentence or immediately following or preceding O. Within a gap, b, i, and o behave like their out-of-gap counterparts.
Gappy, 2-level (8 tags). 8 tags are required to encode the 2-level scheme with gaps: {O, o, B, b, Ī, ī, Ĩ, ĩ}. Variants of the inside tag are marked for strength of the incoming link; this applies gap-externally (capitalized tags) and gap-internally (lowercase tags). If Ī or Ĩ immediately follows a gap, its diacritic reflects the strength of the gappy expression, not the gap's contents.
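As an illustration of the gappy 1-level encoding, the sketch below converts multiword groups into the six tags and checks a tag string for well-formedness. The regular expression is an assumption reconstructed from the constraints stated in this section (it is not copied from figure 2), and the function name is hypothetical.

```python
import re

# Assumed well-formedness pattern for the 6-tag scheme: a sentence is a
# sequence of O tokens and chunks; a chunk starts with B, may contain gap
# material (o, or a contiguous in-gap chunk "b i+"), and must end in I.
WELL_FORMED = re.compile(r"(O|B(o|bi+|I)*I+)+")


def encode_gappy_1level(n_tokens, groups):
    """Map groups of token indices to tags from {O, o, B, b, I, i}.

    Assumes the groups satisfy the projectivity and no-nested-gaps constraints.
    """
    tags = ["O"] * n_tokens
    for g in groups:
        g = sorted(g)
        tags[g[0]] = "B"
        for t in g[1:]:
            tags[t] = "I"
    # Everything strictly between consecutive tokens of a group is in a gap.
    for g in groups:
        g = sorted(g)
        for a, b in zip(g, g[1:]):
            for t in range(a + 1, b):
                tags[t] = tags[t].lower()
    return tags


# My wife had taken her '07 Ford Fusion in for a routine oil change
tags = encode_gappy_1level(14, [[3, 8], [5, 6, 7], [12, 13]])
print(" ".join(tags))
```

Because gap contents never contain gaps of their own, a single lowercase copy of the tagset is enough, which is what keeps the scheme finite.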

Model
With the above representations we model MWE identification as sequence tagging, one of the paradigms that has been used previously for identifying contiguous MWEs (Constant and Sigogne, 2011; see §7). Constraints on legal tag bigrams are sufficient to ensure the full tagging is well-formed subject to the regular expressions in figure 2; we enforce these constraints in our experiments. [9] In NLP, conditional random fields (Lafferty et al., 2001) and the structured perceptron (Collins, 2002) are popular techniques for discriminative sequence modeling with a convex loss function. We choose the second approach for its speed: learning and inference depend mainly on the runtime of the Viterbi algorithm, whose asymptotic complexity is linear in the length of the input and (with a first-order Markov assumption) quadratic in the number of tags. Below, we review the structured perceptron and discuss our cost function, features, and experimental setup.
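The role of the bigram constraints in decoding can be sketched as follows. This is a generic first-order Viterbi routine that skips illegal tag bigrams, shown on the 3-tag scheme with hypothetical scores; it is an illustration under those assumptions rather than the decoder used in the experiments.

```python
import math


def viterbi(emit, trans, legal, start_ok, end_ok):
    """Best tag sequence under per-token scores and legal-bigram constraints.

    emit:  list of dicts tag -> score, one per token
    trans: dict (prev_tag, tag) -> score (missing entries count as 0)
    legal: set of allowed (prev_tag, tag) bigrams
    Runs in O(n * |tags|^2) time.
    """
    tags = list(emit[0])
    best = [{t: (emit[0][t] if t in start_ok else -math.inf) for t in tags}]
    back = []
    for scores in emit[1:]:
        row, ptr = {}, {}
        for t in tags:
            cands = [(best[-1][p] + trans.get((p, t), 0.0), p)
                     for p in tags if (p, t) in legal]
            score, ptr[t] = max(cands, default=(-math.inf, None))
            row[t] = score + scores[t]
        best.append(row)
        back.append(ptr)
    last = max((t for t in tags if t in end_ok), key=lambda t: best[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# 3-tag scheme: B must be followed by I; I may not start a sentence or follow O.
LEGAL = {("O", "O"), ("O", "B"), ("B", "I"), ("I", "I"), ("I", "O"), ("I", "B")}
# Hypothetical scores whose unconstrained optimum, O I O, is ill-formed.
emit = [{"O": 1.0, "B": 0.0, "I": 0.0},
        {"O": 0.0, "B": 0.0, "I": 2.0},
        {"O": 1.0, "B": 0.0, "I": 0.0}]
path = viterbi(emit, {}, LEGAL, start_ok={"O", "B"}, end_ok={"O", "I"})
print(path)
```

With the constraints enforced, the decoder returns the best well-formed sequence (here B I O) instead of the locally preferred but illegal O I O.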

Cost-Augmented Structured Perceptron
The structured perceptron's (Collins, 2002) learning procedure, algorithm 1, generalizes the classic perceptron algorithm (Freund and Schapire, 1999) to incorporate a structured decoding step (for sequences, the Viterbi algorithm) in the inner loop. Thus, training requires only max inference, which is fast with a first-order Markov assumption. In training, features are adjusted where a tagging error is made; the procedure can be viewed as optimizing the structured hinge loss. The output of learning is a weight vector that parametrizes a feature-rich scoring function over candidate labelings of a sequence.
To better align the learning algorithm with our F-score-based MWE evaluation (§3.2), we use a cost-augmented version of the structured perceptron that is sensitive to different kinds of errors during training. When recall is the bigger obstacle, we can adopt the following cost function: given a sentence x, its gold labeling y*, and a candidate labeling y′,
$$\mathrm{cost}(y^{*}, y', x) = \sum_{j=1}^{|y^{*}|} c(y^{*}_j, y'_j), \qquad c(y^{*}, y') = [\![y^{*} \neq y']\!] + \rho\,[\![y^{*} \in \{B, b\} \wedge y' \in \{O, o\}]\!]$$
A single nonnegative hyperparameter, ρ, controls the tradeoff between recall and accuracy; higher ρ biases the model in favor of recall (possibly hurting accuracy and precision). This is a slight variant of the recall-oriented cost function of Mohit et al. (2012). The difference is that we only penalize beginning-of-expression recall errors. Preliminary experiments showed that a cost function penalizing all recall errors (i.e., with ρ⟦y* ≠ O ∧ y′ = O⟧ as the second term, as in Mohit et al.) tended to append additional tokens to high-confidence MWEs (such as proper names) rather than encourage new MWEs, which would require positing at least two new non-outside tags.
Algorithm 1: Training with the averaged perceptron. (Adapted from Daumé, 2006, p. 19.)
[9] The 8-tag scheme licenses 42 tag bigrams: sequences such as B O and o ī are prohibited. There are also constraints on the allowed tags at the beginning and end of the sequence.
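A small sketch of the recall-oriented cost may help. It assumes the cost is a per-token Hamming term plus ρ for every gold expression beginning (B or b) that the candidate labels as outside (O or o), which matches the description above; the function names are hypothetical. Because the cost decomposes over tokens, it can be folded into the local scores before running any first-order decoder during training.

```python
BEGIN = {"B", "b"}
OUTSIDE = {"O", "o"}


def local_cost(gold_tag, cand_tag, rho):
    """Hamming cost plus rho for a missed beginning-of-expression tag (assumed form)."""
    return (gold_tag != cand_tag) + rho * (gold_tag in BEGIN and cand_tag in OUTSIDE)


def cost(gold, cand, rho):
    """Total cost of a candidate tagging against the gold tagging."""
    return sum(local_cost(g, c, rho) for g, c in zip(gold, cand))


def cost_augmented(emit, gold, rho):
    """Add the local cost of every candidate tag to its per-token score,
    so that max-scoring decoding prefers high-cost (i.e., instructive) errors."""
    return [{t: s + local_cost(g, t, rho) for t, s in scores.items()}
            for scores, g in zip(emit, gold)]


gold = ["B", "I", "O", "O"]
missed = ["O", "O", "O", "O"]
print(cost(gold, missed, rho=2.0))
```

In this example the missed B costs 1 + ρ while the missed I costs only 1, so raising ρ specifically discourages failing to open an expression.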

Features
Basic features. These are largely based on those of Constant et al. (2012): they look at word unigrams and bigrams, character prefixes and suffixes, and POS tags, as well as lexicon entries that match lemmas of multiple words in the sentence. Appendix A lists the basic features in detail.
Some of the basic features make use of lexicons. We use or construct 10 lists of English MWEs: all multiword entries in WordNet (Fellbaum, 1998); all multiword chunks in SemCor (Miller et al., 1993); all multiword entries in English Wiktionary; the WikiMwe dataset mined from English Wikipedia (Hartmann et al., 2012); [...] the verb-particle constructions (VPCs) dataset of Baldwin (2008); a list of light verb constructions (LVCs) provided by Claire Bonial; and two idioms websites. [12] After preprocessing, each lexical entry consists of an ordered sequence of word lemmas, some of which may be variables like <something>. Given a sentence and one or more of the lexicons, lookup proceeds as follows: we enumerate entries whose lemma sequences match a sequence of lemmatized tokens, and build a lattice of possible analyses over the sentence. We find the shortest path (i.e., using as few expressions as possible) with dynamic programming, allowing gaps of up to length 2. [13]
[Table 2 caption] Top: Comparison of preexisting lexicons. "6 lexicons" refers to WordNet and SemCor plus SAID, WikiMwe, Phrases.net, and English Wiktionary; "10 lexicons" adds MWEs from CEDT, VNC, LVC, and Oyz. (In these lookup-based configurations, allowing gappy MWEs never helps performance.) Bottom: Combining preexisting lexicons with a lexicon derived from MWEs annotated in the training portion of each cross-validation fold at least once (lookup) or twice (model). All precision, recall, and F1 percentages are averaged across 8 folds of cross-validation on train; standard deviations are shown for the F1 score. In each column, the highest value using only preexisting lexicons is underlined, and the highest overall value is bolded. The boxed row indicates the configuration used as the basis for subsequent experiments.
Unsupervised word clusters. Distributional clustering on large (unlabeled) corpora can produce lexical generalizations that are useful for syntactic and semantic analysis tasks (e.g., Miller et al., 2004; Koo et al., 2008; Turian et al., 2010; Owoputi et al., 2013; Grave et al., 2013). We were interested to see whether a similar pattern would hold for MWE identification, given that MWEs are concerned with what is lexically idiosyncratic (i.e., backing off from specific lexemes to word classes may lose the MWE-relevant information). Brown clustering [14] (Brown et al., 1992) on the 21-million-word Yelp Academic Dataset (which is similar in genre to the annotated web reviews data) gives us a hard clustering of word types. To our tagger, we add features mapping the previous, current, and next token to Brown cluster IDs. The feature for the current token conjoins the word lemma with the cluster ID.
[12] http://www.phrases.net/ and http://home.postech.ac.kr/~oyz/doc/idiom.html
[13] Each top-level lexical expression (single- or multiword) incurs a cost of 1; each expression within a gap has cost 1.25.
[14] With Liang's (2005) implementation: https://github.com/percyliang/brown-cluster. We obtain 1,000 clusters [...]
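The cluster features described above might be extracted along the following lines. This is only a sketch: the feature-name strings, the `UNK` and `PAD` placeholders, the choice to key the cluster map by lemma, and the cluster IDs in the example are all assumptions made for illustration.

```python
def cluster_features(lemmas, j, cluster_of):
    """Cluster features for token j given a hard word -> cluster-ID map.

    Words missing from the map get the placeholder 'UNK'; positions beyond
    the sentence boundary get 'PAD' (both placeholders are assumptions).
    """
    def cid(k):
        if 0 <= k < len(lemmas):
            return cluster_of.get(lemmas[k], "UNK")
        return "PAD"

    return {
        f"prev_cluster={cid(j - 1)}",
        # the current-token feature conjoins the lemma with its cluster ID
        f"cur_cluster={cid(j)}|lemma={lemmas[j]}",
        f"next_cluster={cid(j + 1)}",
    }


# Hypothetical Brown cluster IDs (bit strings) for two word types.
cluster_of = {"no": "0110", "worry": "1011"}
feats = cluster_features(["no", "worry", "here"], 1, cluster_of)
print(sorted(feats))
```

Since the clustering is hard, each token contributes exactly one cluster ID per position, so these features add only three active indicators per token.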

Experimental Setup
The corpus of web reviews described in §2 is used for training and evaluation. 101 arbitrarily chosen documents (500 sentences, 7,171 words) were held out [...]
In learning with the structured perceptron (algorithm 1), we employ two well-known techniques that can both be viewed as regularization. First, we use the average of parameters over all timesteps of learning. Second, within each cross-validation fold, we determine the number of training iterations (epochs) M by early stopping: after each iteration, we use the model to decode the held-out data, and when that accuracy ceases to improve, we use the previous model. The two hyperparameters are the number of iterations and the value of the recall cost hyperparameter (ρ). Both are tuned via cross-validation on train; we use the multiple of 50 that maximizes average link-based F1. The chosen values are shown in table 3. Experiments were managed with the ducttape tool. [18]
[14, continued] from words appearing at least 25 times.

Results
We experimentally address the following questions to probe and justify our modeling approach.

Is supervised learning necessary?
Previous MWE identification studies have found benefit to statistical learning over heuristic lexicon lookup (Constant and Sigogne, 2011; Green et al., 2012). Our first experiment tests whether this holds for comprehensive MWE identification: it compares our supervised tagging approach with baselines of heuristic lookup on preexisting lexicons. The baselines construct a lattice for each sentence using the same method as lexicon-based model features (§5.2). If multiple lexicons are used, the union of their entries is used to construct the lattice. The resulting segmentation, which does not encode a strength distinction, is evaluated against the gold standard. Table 2 shows the results. Even with just the labeled training set as input, the supervised approach beats the strongest heuristic baseline (that incorporates in-domain lexicon entries extracted from the training data) by 30 precision points, while achieving comparable recall. For example, the baseline (but not the statistical model) incorrectly predicts an MWE in places to eat in Baltimore (because eat in, meaning 'eat at home,' is listed in WordNet). The supervised approach has learned not to trust WordNet too much due to this sort of ambiguity. Downstream applications that currently use lexicon matching for MWE identification (e.g., Ghoneim and Diab, 2013) likely stand to benefit from our statistical approach.
[18] https://github.com/jhclark/ducttape/
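The lookup baseline can be illustrated with a simplified sketch. It handles contiguous matches only (the gap handling, variable slots, and the 1 vs. 1.25 cost weighting are omitted), treats every word as a possible single-word segment, and picks the segmentation with the fewest segments by dynamic programming. The function name and the `max_len` parameter are assumptions.

```python
def lookup_segment(lemmas, lexicon, max_len=5):
    """Fewest-segments segmentation; multiword segments must be lexicon entries.

    lexicon: set of lemma tuples. Simplification: contiguous matches only.
    """
    n = len(lemmas)
    best = [0] + [None] * n     # best[i]: fewest segments covering lemmas[:i]
    back = [0] * (n + 1)        # back[i]: start index of the last segment
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            if k == 1 or tuple(lemmas[i - k:i]) in lexicon:
                c = best[i - k] + 1
                if best[i] is None or c < best[i]:
                    best[i], back[i] = c, i - k
    segments, i = [], n
    while i > 0:
        segments.append(lemmas[back[i]:i])
        i = back[i]
    return segments[::-1]


lexicon = {("eat", "in")}
segments = lookup_segment(["place", "to", "eat", "in", "Baltimore"], lexicon)
print(segments)
```

On this input the sketch groups eat in, reproducing the kind of false positive discussed above: the shortest-path criterion has no access to context, so any matching lexicon entry is taken.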

How best to exploit MWE lexicons (type-level information)?
For statistical tagging (right portion of table 2), using more preexisting (out-of-domain) lexicons generally improves recall; precision also improves a bit. A lexicon of MWEs occurring in the non-held-out training data at least twice (table 2, bottom right) is marginally worse (better precision/worse recall) than the best result using only preexisting lexicons.

Variations on the base model
We experiment with some of the modeling alternatives discussed in §5. Results appear in table 3 under both the link-based and exact match evaluation criteria. We note that the exact match scores are (as expected) several points lower.
Recall-oriented cost. The recall-oriented cost adds about 1 link-based F1 point, sacrificing precision in favor of recall.
Unsupervised word clusters. When combined with the recall-oriented cost, these produce a slight improvement to precision/degradation to recall, improving exact match F1 but not affecting link-based F1. Only a few clusters receive high positive weight; one of these consists of matter, joke, biggie, pun, avail, clue, corkage, frills, worries, etc. These words are diverse semantically, but all occur in collocations with no, which is what makes the cluster coherent and useful to the MWE model.
Oracle part-of-speech tags. Using human-annotated rather than automatic POS tags improves MWE identification by about 3 F1 points on test (similar differences were observed in development).

What are the highest-weighted features?
An advantage of the linear modeling framework is that we can examine learned feature weights to gain some insight into the model's behavior.
In general, the highest-weighted features are the lexicon matching features and features indicative of proper names (POS tag of proper noun, capitalized word not at the beginning of the sentence, etc.).
Despite the occasional cluster capturing collocational or idiomatic groupings, as described in the previous section, the clusters appear to be mostly useful for identifying words that tend to belong (or not) to proper names. For example, the cluster with street, road, freeway, highway, airport, etc., as well as words outside of the cluster vocabulary, weigh in favor of an MWE. A cluster with everyday destinations (neighborhood, doctor, hotel, bank, dentist) prefers non-MWEs, presumably because these words are not typically part of proper names in this corpus. This was from the best model using non-oracle POS tags, so the clusters are perhaps useful in correcting for proper nouns that were mistakenly tagged as common nouns. One caveat, though, is that it is hard to discern the impact of these specific features where others may be capturing essentially the same information.
[...] weak). There are 298 unique MWE types. Organizing the predicted MWEs by their coarse POS sequence reveals that the model is not too prejudiced in the kinds of expressions it recognizes: the 298 types fall under 89 unique POS+strength patterns. Table 4 shows the 14 POS sequences predicted 5 or more times as strong MWEs. Some of the examples (major award, a deal, tip on) are false positives, but most are correct. Singleton patterns include PROPN VERB (god forbid), PREP DET (at that), ADJ PRON (worth it), and PREP VERB PREP (to die for).
True positive MWEs mostly consist of (a) named entities, and (b) lexical idioms seen in training and/or listed in one of the lexicons. Occasionally the system correctly guesses an unseen and OOV idiom based on features such as hyphenation (walk -in) and capitalization/OOV words (Chili Relleno, BIG MISTAKE). On test, 244 gold MWE types were unseen in training; the system found 93 true positives (where the type was predicted at least once), 109 false positives, and 151 false negatives, for an unseen type recall rate of 38%. Removing types that occurred in lexicons leaves 35 true positives, 61 false positives, and 111 false negatives, for an unseen and OOV type recall rate of 24%.
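The two type-level recall rates follow directly from the reported counts, as this short check shows (the helper name is illustrative):

```python
def type_recall(true_pos, false_neg):
    """Recall = TP / (TP + FN), computed over MWE types."""
    return true_pos / (true_pos + false_neg)


unseen = type_recall(93, 151)        # 93 of the 244 unseen gold types were found
unseen_oov = type_recall(35, 111)    # after removing types listed in a lexicon
print(round(100 * unseen), round(100 * unseen_oov))
```

Note that 93 + 151 = 244, matching the number of gold types unseen in training.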

What kinds of mismatches occur?
Inspection of the output turns up false positives due to ambiguity (e.g., Spongy and sweet bread); false negatives (top to bottom); and overlap (get high quality service, gold get high quality service; live up to, gold live up to). A number of the mismatches turn out to be problems with the gold standard, like having our water shut off (gold having our water shut off ). This suggests that even noisy automatic taggers might help identify annotation inconsistencies and errors for manual correction.
Are gappiness and the strength distinction learned in practice?
Three quarters of MWEs are strong and contain no gaps. To see whether our model is actually sensitive to the phenomena of gappiness and strength, we train on data simplified to remove one or both distinctions (as in the first 3 labelings in figure 2) and evaluate against the full 8-tag scheme. For the model with the recall cost, clusters, and oracle POS tags, we evaluate each of these simplifications of the training data in table 5. The gold standard for evaluation remains the same across all conditions. If the model were unable to recover gappy expressions or the strong/weak distinction, we would expect it to do no better when trained with the full tagset than with the simplified tagset. However, there is some loss in performance as the tagset for learning is simplified, which suggests that gappiness and strength are being learned to an extent.

Related Work
In terms of modeling, the use of machine learning classification (Hashimoto and Kawahara, 2008; Shigeto et al., 2013) and specifically BIO sequence tagging (Diab and Bhutada, 2009; Constant and Sigogne, 2011; Constant et al., 2012; Vincze et al., 2013) for contextual recognition of MWEs is not new. Lexical semantic classification tasks like named entity recognition (e.g., Ratinov and Roth, 2009), supersense tagging (Ciaramita and Altun, 2006; Paaß and Reichartz, 2009), and index term identification (Newman et al., 2012) also involve chunking of certain MWEs. But our discriminative models, facilitated by the new corpus, broaden the scope of the MWE identification task to include many varieties of MWEs at once, including explicit marking of gaps and a strength distinction. By contrast, the aforementioned identification systems, as well as some MWE-enhanced syntactic parsers (e.g., Green et al., 2012), have been restricted to contiguous MWEs. However, Green et al. (2011) allow gaps to be described as constituents in a syntax tree. Gimpel and Smith's (2011) shallow, gappy language model allows arbitrary token groupings within a sentence, whereas our model imposes projectivity and nesting constraints (§3). Blunsom and Baldwin (2006) present a sequence model for HPSG supertagging, and evaluate performance on discontinuous MWEs, though the sequence model treats the non-adjacent component supertags like other labels: it cannot enforce that they mutually require one another, as we do via the gappy tagging scheme (§3.1). The lexicon lookup procedures of Bejček et al. (2013) can match gappy MWEs, but are nonstatistical and extremely error-prone when tuned for high oracle recall.
Another major thread of research has pursued unsupervised discovery of multiword types from raw corpora, such as with statistical association measures (Church et al., 1991; Pecina, 2010; Ramisch et al., 2012, inter alia), parallel corpora (Melamed, 1997; Moirón and Tiedemann, 2006; Tsvetkov and Wintner, 2010), or a combination thereof (Tsvetkov and Wintner, 2011); this may be followed by a lookup-and-classify approach to contextual identification (Ramisch et al., 2010). Though preliminary experiments with our models did not show benefit to incorporating such automatically constructed lexicons, we hope these two perspectives can be brought together in future work.

Conclusion
This article has presented the first supervised model for identifying heterogeneous multiword expressions in English text. Our feature-rich discriminative sequence tagger performs shallow chunking with a novel scheme that allows for MWEs containing gaps, and includes a strength distinction to separate highly idiomatic expressions from collocations. It is trained and evaluated on a corpus of English web reviews that are comprehensively annotated for multiword expressions. Beyond the training data, its features incorporate evidence from external resources (several lexicons as well as unsupervised word clusters); we show experimentally that this statistical approach is far superior to identifying MWEs by heuristic lexicon lookup alone. Future extensions might integrate additional features (e.g., exploiting statistical association measures computed over large corpora), enhance the lexical representation (e.g., by adding semantic tags), improve the expressiveness of the model (e.g., with higher-order features and inference), or integrate the model with other tasks (such as parsing and translation).
Our data and open source software are released at http://www.ark.cs.cmu.edu/LexSem/.

Acknowledgments
This research was supported in part by NSF CAREER grant IIS-1054319, Google through the Reading is Believing project at CMU, and DARPA grant FA8750-12-2-0342 funded under the DEFT program.
We are grateful to Kevin Knight, Martha Palmer, Claire Bonial, Lori Levin, Ed Hovy, Tim Baldwin, Omri Abend, members of JHU CLSP, the NLP group at Berkeley, and the Noah's ARK group at CMU, and anonymous reviewers for valuable feedback.

A Basic Features
All are conjoined with the current label, y_i.

Lexicon Features (unlexicalized)
WordNet only
17. OOV: λ_i is not in WordNet as a unigram lemma ∧ pos_i
18. compound: non-punctuation lemma λ_i and the {previous, next} lemma in the sentence (if it is non-punctuation; an intervening hyphen is allowed) form an entry in WordNet, possibly separated by a hyphen or space
19. compound-hyphen: pos_i = HYPH ∧ previous and next tokens form an entry in WordNet, possibly separated by a hyphen or space
20. ambiguity class: if content word unigram λ_i is in WordNet, the set of POS categories it can belong to; else pos_i if not a content POS ∧ the POS of the longest MW match to which λ_i belongs (if any) ∧ the position in that match (B or I)

For each multiword lexicon
21. lexicon name ∧ status of token i in the shortest path segmentation (O, B, or I) ∧ subcategory of lexical entry whose match includes token i, if matched ∧ whether the match is gappy
22. the above ∧ POS tags of the first and last matched tokens in the expression

Over all multiword lexicons
23. at least k lexicons contain a match that includes this token (if n ≥ 1 matches, n active features)
24. at least k lexicons contain a match that includes this token, starts with a given POS, and ends with a given POS
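
The thresholded encoding in feature 23 can be illustrated with a short sketch: if n lexicons contain a match covering the token, one feature fires for each k from 1 to n. This is only an illustration under stated assumptions. The function name `at_least_k_features`, the feature-name strings, the placeholder lexicon names, and the representation of matches as sets of covered token indices are all hypothetical and are not taken from the released software.

```python
def at_least_k_features(token_index, lexicon_matches):
    """Return the 'at least k lexicons match' features active for one token.

    lexicon_matches maps a lexicon name to the set of token indices covered
    by that lexicon's matches in the sentence (a hypothetical data layout).
    If n lexicons cover the token, features for k = 1..n are active.
    """
    n = sum(1 for covered in lexicon_matches.values() if token_index in covered)
    return [f"at_least_{k}_lexicons" for k in range(1, n + 1)]


# Placeholder lexicon names; the covered indices are made up for the example.
matches = {
    "lexA": {0, 1},
    "lexB": {1, 2},
    "lexC": {4},
}
features_for_token_1 = at_least_k_features(1, matches)
features_for_token_3 = at_least_k_features(3, matches)
```

Encoding the count as n cumulative indicators, rather than as a single feature per exact count, lets a linear model assign additional weight for each further lexicon that agrees. In the full model each such feature would also be conjoined with the current label, as noted at the start of this appendix.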