Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice

We present a new model for acquiring comprehensive multiword lexicons from large corpora based on competition among n-gram candidates. In contrast to the standard approach of simple ranking by association measure, in our model n-grams are arranged in a lattice structure based on subsumption and overlap relationships, with nodes inhibiting other nodes in their vicinity when they are selected as a lexical item. We show how the configuration of such a lattice can be optimized tractably, and demonstrate using annotations of sampled n-grams that our method consistently outperforms alternatives by at least 0.05 F-score across several corpora and languages.


Introduction
Despite over 25 years of research in computational linguistics aimed at acquiring multiword lexicons using corpora statistics, and growing evidence that speakers process language primarily in terms of memorized sequences (Wray, 2008), the individual word nonetheless stubbornly remains the de facto standard processing unit for most research in modern NLP. The potential of multiword knowledge to improve both the automatic processing of language as well as offer new understanding of human acquisition and usage of language is the primary motivator of this work. Here, we present an effective, expandable, and tractable new approach to comprehensive multiword lexicon acquisition. Our aim is to find a middle ground between standard MWE acquisition approaches based on association measures (Ramisch, 2014) and more sophisticated statistical models (Newman et al., 2012) that do not scale to large corpora, the main source of the distributional information in modern NLP systems.
A central challenge in building comprehensive multiword lexicons is paring down the huge space of possibilities without imposing restrictions which disregard a major portion of the multiword vocabulary of a language: allowing for diversity creates significant redundancy among statistically promising candidates. The lattice model proposed here addresses this primarily by having the candidates (contiguous and non-contiguous n-gram types) compete with each other based on subsumption and overlap relations to be selected as the best (i.e., most parsimonious) explanation for statistical irregularities. We test this approach across four large corpora in three languages, including two relatively free-word-order languages (Croatian and Japanese), and find that this approach consistently outperforms alternatives, offering scalability and many avenues for future enhancement.

Background and Related Work
In this paper we will refer to the targets of our lexicon creation efforts as formulaic sequences, following the terminology of Wray (2002;, wherein a formulaic sequence (FS) is defined as "a sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar." That is, an FS shows signs of being part of a mental lexicon. 1 As noted by Wray (2008), formulaic sequence theory is compatible with other highly multiword, lexicalized approaches to language structure, in particular Pattern Grammar (Hunston and Francis, ) and Construction Grammar (Goldberg, 1995); an important distinction, though, is that these sorts of theories often posit entirely abstract grammatical constructions/patterns/frames which do not fit well into the FS framework. Nevertheless, since many such constructions are composed of sequences of specific words, the FS inventory of a language includes many flexible constructions (e.g., ask * for) along with entirely fixed combinations (e.g., rhetorical question) not typically of interest to grammarians. Note that the FS framework allows for individual morphemes to be part of a formulaic sequence, but for practical reasons we focus primarily on lemmatized words as the unit out of which FS are built.
In computational linguistics, the most common term used to describe multiword lexical units is multiword expression ("MWE": Sag et al. (2002), Baldwin and Kim (2010)), but here we wish to make a principled distinction between at least somewhat non-compositional, strongly lexicalized MWEs and FS, a near superset which includes many MWEs but also compositional linguistic formulas. This distinction is not a new one; it exists, for example, in the original paper of Sag et al. (2002) in the distinction between lexicalized and institutionalized phrases, and also to some extent in the MWE annotation of Schneider et al. (2014b), who distinguish between weak (collocational) 2 and strong (non-compositional) MWEs. It is our contention, however, that separate, precise terminology is useful for research targeted at either class: we need not strain the concept of MWE to include items which do not require special semantics, nor are we inclined to disregard the larger formulaicity of language simply because it is not the dominant focus of MWE research. Many MWE researchers might defensibly balk at including in their MWE lexicons and corpus annotations (English) FS such as there is something going on, it is more important than ever to ..., ... do not know what it is like to ..., there is no shortage of ..., the rise and fall of ..., now is not the time to ..., etc., as well as tens of thousands of other such phrases which, along with less compositional MWEs like be worth ...'s weight in gold, fall under the FS umbrella. Another reason to introduce a different terminology is that there are classes of phrases which are typically considered MWEs that do not fit well into an FS framework, for instance novel compound nouns whose semantics are accessible by analogy (e.g., glass limb, analogous to wooden leg). We also exclude from the definition of both FS and MWE those named entities which refer to people or places which are little-known and/or whose surface form appears derived (e.g., Mrs. Barbara W. Smith or Smith Garden Supplies Ltd). Figure 1 shows the conception of the relationship between FS, (multiword) constructions, MWE, and (multiword) named entities that we assume for this paper.
1 Though by this definition individuals or small groups may have their own FS, here we are only interested in FS that are shared by a recognizable language community.
2 Here we avoid the term collocation entirely due to confusion with respect to its interpretation. Though some define it similarly to our definition of FS, it can be applied to any words that show a statistical tendency to appear in the vicinity of one another for any reason: for instance, the pair of words doctor/nurse might be considered a collocation (Ramisch, 2014).
From a practical perspective, the starting point for multiword lexicon creation has typically been lexical association measures (Church and Hanks, 1990; Dunning, 1993; Schone and Jurafsky, 2001; Evert, 2004; Pecina, 2010; Araujo et al., 2011; Kulkarni and Finlayson, 2011; Ramisch, 2014). When these methods are used to build a lexicon, particular binary syntactic patterns are typically chosen. Only some of these measures generalize tractably beyond two words, for example PMI (Church and Hanks, 1990), i.e., the log ratio of the joint probability to the product of the marginal probabilities of the individual words. Another measure which addresses sequences of longer than two words is the C-value (Frantzi et al., 2000), which weights term frequency by the log length of the n-gram while penalizing n-grams that appear in frequent larger ones. Mutual expectation (Dias et al., 1999) involves deriving a normalized statistic that reflects the extent to which a phrase resists the omission of any constituent word. Similarly, the lexical predictability ratio (LPR) of Brooke et al. (2015) is an association measure applicable to any possible syntactic pattern, which is calculated by discounting syntactic predictability from the overall conditional probability for each word given the other words in the phrase. Though most association measures involve only usage statistics of the phrase and its subparts, the DRUID measure (Riedl and Biemann, 2015) is an exception which uses distributional semantics around the phrase to identify how easily an n-gram could be replaced by a single word.
Typically multiword lexicons are created by ranking n-grams according to an association measure and applying a threshold. The algorithm of da Silva and Lopes (1999) is somewhat more sophisticated, in that it identifies the local maxima of association measures across subsuming n-grams within a sentence to identify MWEs of unrestricted length and syntactic composition; its effectiveness beyond noun phrases, however, seems relatively limited (Ramisch et al., 2012). Brooke et al. (2014; developed a heuristic method intended for general FS extraction in larger corpora, first using conditional probabilities to do an initial (single pass) coarse-grained segmentation of the corpus, followed by a pass through the resulting vocabulary, breaking larger units into smaller ones based on a tradeoff between marginal and conditional statistics. The work of Newman et al. (2012) is an example of an unsupervised approach which does not use association measures: it extends the Bayesian word segmentation approach of Goldwater et al. (2009) to multiword tokenization, applying a generative Dirichlet Process model which jointly constructs a segmentation of the corpus and a corresponding multiword vocabulary.
Other research in MWEs has tended to be rather focused on particular syntactic patterns such as verb-noun combinations (Fazly et al., 2009). The system of Schneider et al. (2014a) distinguishes a full range of MWE sequences in the English Web Treebank, including gapped expressions, using a supervised sequence tagging model. Though, in theory, automatic lexical resources could be a useful addition to the Schneider et al. model, which uses only manual lexical resources, attempts to do so have achieved mixed success (Riedl and Biemann, 2016).
The motivations for building lexicons of FS naturally overlap with those for MWE: models of distributional semantics, in particular, can benefit from sensitivity to multiword units (Cohen and Widdows, 2009), as can parsing (Constant and Nivre, 2016) and topic models (Lau et al., 2013). One major motivation for looking beyond MWEs is the ability to carry out broader linguistic analyses. Within corpus linguistics, multiword sequences have been studied in the form of lexical bundles (Biber et al., 2004), which are simply n-grams that occur above a certain frequency threshold. Like FS, lexical bundles generally involve larger phrasal chunks that would be missed by traditional MWE extraction, and so research in this area has tended to focus on how particular formulaic phrases (e.g., if you look at) are indicative of particular genres (e.g., university lectures). Lexical bundles have been applied, in particular, to learner language: for example, Chen and Baker (2010) show that non-native student writers use a severely restricted range of lexical bundle types, and tend to overuse those types, while Granger and Bestgen (2014) investigate the role of proficiency, demonstrating that intermediate learners underuse lower-frequency bigrams and overuse high-frequency bigrams relative to advanced learners. Sakaguchi et al. (2016) demonstrate that improving fluency (closely linked to the use of linguistic formulas) is more important than improving strict grammaticality with respect to native speaker judgments of non-native productions; Brooke et al. (2015) explicitly argue for FS lexicons as a way to identify, track, and improve learner proficiency.

Method
Our approach to FS identification involves optimization of the total explanatory power of a lattice, where each node corresponds to an n-gram type. The explanatory power of the whole lattice is defined simply as a product of the explainedness of the individual nodes. Each node can be considered either "on" (is an FS) or "off" (is not an FS). The basis of the calculation of explainedness is the syntax-sensitive LPR association measure of Brooke et al. (2015), but it is calculated differently depending on the on/off status of the node as well as the status of the nodes in its vicinity. Nodes are linked based on n-gram subsumption and corpus overlap relationships (see Figure 2), with "on" nodes typically explaining other nodes. Given these relationships, we iterate over the nodes and greedily optimize the on/off choice relative to explainedness in the local neighborhood of each node, until convergence.

Collecting statistics
The first step in the process is to derive a set of n-grams and related statistics from a large, unlabeled corpus of text. Since our primary association measure is an adaptation of LPR, our approach in this section mostly follows Brooke et al. (2015) up until the last stage. An initial requirement of any such method is an n-gram frequency threshold, which we set to 1 instance per 10 million words, following Brooke et al. (2015). We include gapped or non-contiguous n-grams in our analysis, in acknowledgment of the fact that many languages have MWEs where the components can be "separated", including verb particle constructions in English (Dehé, 2002), and noun-verb idioms in Japanese. Having said this, there are generally strong syntactic and length restrictions on what can constitute a gap (Wasow, 2002), which we capture in the form of a language-specific POS-based regular expression (see Section 4 for details). This greatly lowers the number of potentially gapped n-gram types, increasing precision and efficiency for negligible loss of recall. We also exclude punctuation and lemmatize the corpus, and enforce an n-gram count threshold. As long as the count threshold is substantially above 1, efficient extraction of all n-grams can be done iteratively: in iteration i, i-grams are filtered by the frequency threshold, and then pairs of instances of these i-grams with (i − 1) words of overlap are found, which derives a set of (i + 1)-grams which necessarily includes all those over the frequency threshold.
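The iterative extraction described above can be sketched as follows. This is a minimal illustration for contiguous n-grams only, under our own assumptions: the function name `extract_ngrams`, the `max_n` cap, and the representation of the corpus as lists of lemmatized tokens are all hypothetical, and gapped n-grams and POS filtering are omitted.

```python
from collections import Counter

def extract_ngrams(sentences, min_count, max_n=8):
    """Return {n-gram tuple: count} for contiguous n-grams with count >= min_count.

    Hypothetical helper: `sentences` is a list of token lists, and `max_n`
    is an illustrative cap on n-gram length.
    """
    kept = {}
    # Iteration 1: count unigrams and filter them by the threshold.
    counts = Counter((tok,) for sent in sentences for tok in sent)
    frequent = {g for g, c in counts.items() if c >= min_count}
    n = 1
    while frequent and n <= max_n:
        kept.update({g: counts[g] for g in frequent})
        # Candidate (n+1)-grams are built only from pairs of frequent
        # n-gram instances that overlap on (n - 1) words.
        counts = Counter()
        for sent in sentences:
            for i in range(len(sent) - n):
                left = tuple(sent[i:i + n])
                right = tuple(sent[i + 1:i + n + 1])
                if left in frequent and right in frequent:
                    counts[left + right[-1:]] += 1
        frequent = {g for g, c in counts.items() if c >= min_count}
        n += 1
    return kept
```

Because every frequent (n + 1)-gram must contain two frequent n-grams, no (n + 1)-gram above the threshold is missed, while most rare candidates are never counted at all.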
Once a set of relevant n-grams is identified and counted, other statistics required to calculate the Lexical Predictability Ratio ("LPR") for each word in the n-gram are collected. LPR is a measure of how predictable a word is in a lexical context, as compared to how predictable it is given only syntactic context (over the same span of words). Formally, the LPR for word $w_i$ in the context of a word sequence $w_1, \ldots, w_i, \ldots, w_n$ with POS tag sequence $t_1, \ldots, t_n$ is given by:

$$\mathrm{LPR}(w_i, w_{1,n}) = \max_{1 \le j \le i \le k \le n} \frac{p(w_i \mid w_{j,k})}{p(w_i \mid t_{j,k})}$$

where $w_{j,k}$ denotes the word sequence $w_j, \ldots, w_{i-1}, w_{i+1}, \ldots, w_k$ excluding $w_i$ (similarly for $t_{j,k}$). Note that the lower bound of LPR is 1, since the ratio for a word with no context is trivially 1. We use the same equation for gapped n-grams, with the caveat that quantities involving sequences which include the location where the gap occurs are derived from special gapped n-gram statistics. Note that the identification of the best ratio across all possible choices of context, not just the largest, is important for longer FS, where the entire POS context alone might uniquely identify the phrase, resulting in the minimum LPR of 1 even for entirely formulaic sequences, an undesirable result.
In the segmentation approach of Brooke et al. (2015), LPR for an entire span is calculated as a product of the individual LPRs, but here we will use the minimum LPR across the words in the sequence:

$$\mathrm{minLPR}(w_{1,n}) = \min_{1 \le i \le n} \mathrm{LPR}(w_i, w_{1,n})$$

Here, minLPR for a particular n-gram does not reflect the overall degree to which it holds together, but rather focuses on the word which is its weakest link. For example, in the case of be keep * under wraps (Figure 2), a general statistical metric might assign it a high score due to the strong association between keep and under or under and wraps, but minLPR is focused on the weaker relationship between be and the rest of the phrase. This makes it particularly suited to use in a lattice model of competing n-grams, where the choice of be keep * under wraps versus keep * under wraps should be based exactly on the extent to which be is an essential part of the phrase; the other affinities are, in effect, irrelevant, because they occur in the smaller n-gram as well.

Figure 2: A portion of an n-gram lattice. Solid lines indicate subsumption, dotted lines overlaps.
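To make the weakest-link behaviour concrete, the following sketch computes a per-word LPR as the best lexical-to-syntactic probability ratio over context spans, and then takes the minimum over words. The function names, the dictionary layout keyed by context span, and all probability values are our own illustrative assumptions, not statistics from any corpus.

```python
def lpr(lexical_prob, syntactic_prob):
    """Best lexical/syntactic predictability ratio over all context spans.

    Both arguments map a context span (j, k) to the conditional probability
    of the target word given that context (hypothetical data layout).
    """
    ratios = [lexical_prob[span] / syntactic_prob[span] for span in lexical_prob]
    # The empty context always yields a ratio of 1, the lower bound of LPR.
    return max(ratios + [1.0])

def min_lpr(word_lprs):
    """minLPR of an n-gram: the LPR of its least predictable word."""
    return min(word_lprs)

# Hypothetical per-word LPRs for "be keep * under wraps".
word_lprs = {"be": 1.5, "keep": 20.0, "under": 30.0, "wraps": 40.0}
weakest = min_lpr(word_lprs.values())
```

With these made-up values the product of the LPRs would be very large, but `weakest` is 1.5, reflecting only how essential be is to the phrase.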

Node interactions
The n-gram nodes in the lattice are directionally connected to nodes consisting of (n + 1)-grams which subsume them and (n − 1)-grams which they subsume. For example, as detailed in Figure 2, the (gapped) n-gram keep * under wraps would be connected "upwards" to the node keep everything under wraps and connected "downwards" to under wraps. These directional relationships allow for two basic interactions between nodes in the lattice when a node is turned on: covering, which inhibits nodes below (subsumed by) a turned-on node (e.g., if keep * under wraps is on, the model will tend not to choose under wraps as an FS); and clearing, which inhibits nodes above a turned-on node (e.g., if keep * under wraps is on, the model would avoid selecting keep everything under wraps as an FS). A third, undirected mechanism is overlapping, where nodes inhibit each other due to overlaps in the corpus (e.g., having both keep * under wraps and be keep * under as FS will be avoided).

Covering
The most important node interaction is covering, which corresponds to discounting or entirely excluding a node due to a node higher in the lattice. Our model includes two types of covering: hard and soft.
Hard covering is based on the idea that, due to very similar counts, we can reasonably conclude that the presence of an n-gram in our statistics is a direct result of a subsuming (n+i)-gram. In Figure 2, e.g., if we have 143 counts of keep * under wraps and 152 counts of under wraps, the presence of keep * under wraps almost completely explains under wraps, and we should consider these two n-grams as one. We do this by permanently disabling any hard covered node, and setting the minLPR of the covering node to the maximum minLPR among all the nodes it covers (including itself); this means that longer n-grams with function words (which often have lower minLPR) can benefit from the strong statistical relationships between open-class lexical features in n-grams that they cover. This is done as a preprocessing step, and greatly improves the tractability of the iterative optimization of the lattice. Of course, a threshold for hard covering must be chosen: during development we found that a ratio of 2/3 (corresponding to a significant majority of the counts of a lower node corresponding to the higher node) worked well. We also use the concept of hard covering to address the issue of pronouns, based on the observation that specific pronouns often have high LPR values due to pragmatic biases (Brooke et al., 2015); for instance, private state verbs like feel tend to have first person singular subjects. In the lattice, n-grams with pronouns are considered covered (inactive) unless they cover at least one other node which does not have a pronoun, which allows us to limit FS with pronouns without excluding them entirely: they are included only in cases where they are definitively formulaic.
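A minimal sketch of the hard covering preprocessing step is given below. It assumes that the 2/3 ratio is the count of the subsuming node divided by the count of the subsumed node, and that shorter n-grams are processed first so that minLPR values propagate up chains of covered nodes; the function name, the `above` mapping, the pronoun rule being omitted, and all minLPR values are our own illustrative assumptions.

```python
def hard_cover(counts, min_lpr, above, threshold=2 / 3):
    """Disable hard-covered nodes and propagate minLPR to covering nodes.

    counts:  {node: corpus count}
    min_lpr: {node: minLPR}
    above:   {node: list of nodes that subsume it} (hypothetical layout)
    Returns (set of disabled nodes, updated minLPR dict).
    """
    disabled = set()
    merged = dict(min_lpr)
    # Shorter n-grams first, so chained covering carries the maximum upward.
    for node in sorted(counts, key=len):
        for upper in above.get(node, ()):
            if counts[upper] / counts[node] >= threshold:
                disabled.add(node)
                merged[upper] = max(merged[upper], merged[node])
    return disabled, merged

uw = ("under", "wraps")
ku = ("keep", "*", "under")
kuw = ("keep", "*", "under", "wraps")
counts = {uw: 152, kuw: 143, ku: 400}   # 400 is a made-up count
lprs = {uw: 12.0, kuw: 5.0, ku: 3.0}    # made-up minLPR values
disabled, merged = hard_cover(counts, lprs, {uw: [kuw], ku: [kuw]})
```

Here under wraps is disabled (143/152 is well above 2/3) and its minLPR is inherited by keep * under wraps, whereas keep * under stays available for soft covering.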
Soft covering is used in cases when a single n-gram does not entirely account for another, but a turned-on n-gram to some extent may explain some of the statistical irregularity of one lower in the lattice. For instance, in Figure 2 keep * under is not hard-covered by keep * under wraps (since there are FS such as keep * under surveillance and keep it under your hat), but if keep * under wraps is tagged as an FS, we nevertheless want to discount the portion of the keep * under counts that correspond to occurrences of keep * under wraps, with the idea that these occurrences have already been explained by the longer n-gram. If enough subsuming n-grams are on, then the shorter n-gram will be discounted to the extent that it will be turned off, preventing redundancy. This effect is accomplished by increasing the turned-off explainedness of keep * under (and thus making turning on less desirable) in the following manner: let $c(\cdot)$ be the count function, $y_i$ the current FS status for node $x_i$ (0 if off, 1 if on) and $ab(x)$ a function which produces the set of indices of all nodes above node $x$ in the lattice. Then, the $\mathrm{cover}(x_t)$ score for a covered node $x_t$ is:

$$\mathrm{cover}(x_t) = \max\left(0, \frac{c(x_t) - \sum_{i \in ab(x_t)} y_i \cdot c(x_i)}{c(x_t)}\right)$$

When applied as an exponent to a minLPR score, it serves as a simple, quick-to-calculate approximation of a new minLPR with the counts corresponding to the covering nodes removed from the calculation. The cover score takes on values in the range 0 to 1, with 1 being the default when no covering occurs.
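As an illustration, the sketch below computes a cover score under the assumption that it is the fraction of a node's count left over after subtracting the counts of all turned-on nodes above it, floored at zero. The function name and the count of 400 for keep * under are hypothetical; 143 is the count of keep * under wraps used in the hard covering example.

```python
def cover(count, covering_counts_on):
    """Fraction of `count` not explained by turned-on covering nodes.

    covering_counts_on: counts of the nodes above this one that are ON.
    Returns a value in [0, 1]; 1 means no covering occurs.
    """
    remaining = count - sum(covering_counts_on)
    return max(0.0, remaining / count)

# keep * under (made-up count 400) covered by keep * under wraps (143).
score = cover(400, [143])
discounted = 3.0 ** score   # made-up minLPR of 3.0, discounted by the exponent
```

A score of about 0.64 shrinks the effective minLPR of the shorter n-gram, and once the covering counts reach the node's own count the score bottoms out at 0, so the effective minLPR falls to 1.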

Clearing
In general, covering prefers turning on longer, covering n-grams since doing so explains nodes lower in the lattice. Not surprisingly, it is generally desirable to have a mechanism working in opposition, i.e., one which views shorter FS as helping to explain the presence of longer n-grams which contain them, beyond the FS-neutral syntactic explanation provided by minLPR. Clearing does this by increasing the explainedness of nodes higher in the lattice when a lower node is turned on. The basic mechanism is similar to covering, except that counts cannot be made use of in the same way: whereas it makes sense to explain covered nodes in proportion to the counts of their covering nodes (since the counts of the covered n-grams can be directly attributed to the covering n-gram), in the reverse direction this logic fails.
A simple but effective solution which avoids extra hyperparameters is to make use of the minLPR values of the relevant nodes. In the most common two-node situation, we increase the explainedness of the cleared node based on the ratio of the minLPR of two nodes, though only if the minLPR of the lower node is higher. Generalized to the (rare) case of multiple clearing nodes, we define $\mathrm{clear}(x_t)$ as:

$$\mathrm{clear}(x_t) = \prod_{i \in bl(x_t)} \min\left(1, \frac{\mathrm{minLPR}(x_t)}{\mathrm{minLPR}(x_i)}\right)^{y_i}$$

where $bl(x_t)$ produces a set of indices of nodes below $x_t$ in the lattice. We refer to this mechanism as "clearing" because it tends to clear away a variety of trivial uses of common FS which may have higher LPR due to the lexical and syntactic specificity of the FS. For instance, in Figure 2 if the node keep * under wraps is turned on and has a minLPR of 8, then, if the minLPR of a node such as keep * under wraps for is 4, $\mathrm{clear}(x_t)$ will be 0.5. Like cover, clear takes on values in the range 0 to 1, with 1 being the default when no clearing occurs. Note that one major advantage with this particular formulation of clearing is that low-LPR nodes will be unable to clear higher LPR nodes above them in the lattice; otherwise, bad FS like of the might be selected as FS based purely to increase the explainedness of the many n-grams they appear in.
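The clearing score can be sketched in a few lines, assuming it is the product, over turned-on nodes below the target, of the ratio of the target's minLPR to the lower node's minLPR, capped at 1. The function name and argument layout are our own; the values 8 and 4 are those of the worked example in the text.

```python
def clear(min_lpr_target, min_lprs_below_on):
    """Clearing score in [0, 1] for a node, given turned-on nodes below it.

    min_lprs_below_on: minLPR values of nodes below the target that are ON.
    A lower node only clears the target if its minLPR is higher.
    """
    score = 1.0
    for lower in min_lprs_below_on:
        score *= min(1.0, min_lpr_target / lower)
    return score

# keep * under wraps for (minLPR 4) cleared by keep * under wraps (minLPR 8).
example = clear(4.0, [8.0])
```

The cap at 1 is what prevents a low-LPR node such as of the from clearing stronger nodes above it: `clear(8.0, [4.0])` simply returns 1.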

Overlap
The third mechanism of node interaction involves n-grams which overlap in the corpus. In general, independent FS do not consistently overlap. For example, given that be keep * under and keep * under wraps often appear together (overlapping on the tokens keep * under), we do not want both being selected as an FS, even in the case that both have high minLPR. To address this problem, rather than increasing the explainedness of turned-off nodes, we decrease the explainedness of the overlapping turned-on nodes: a penalty rather than an incentive, which expresses the model's confusion at having overlapping FS. For non-subsuming nodes $x_i$ and $x_j$, let $oc(x_i, x_j)$ be the count of instances of $x_i$ which contain at least one non-gap token of a corresponding instance of $x_j$. For subsuming nodes, though, overlap is treated asymmetrically, with $oc(x_i, x_j)$ equal to $c(x_j)$ (the lower count) if $j \in ab(x_i)$, but zero if $j \in bl(x_i)$. Given this definition of $oc$, we define $\mathrm{overlap}(x_t)$ as:

$$\mathrm{overlap}(x_t) = \prod_{i=1}^{N} \left(\frac{c(x_t)}{c(x_t) - oc(x_t, x_i)}\right)^{y_i}$$

Overlap takes on values in the range 1 to $+\infty$, also defaulting to 1 when no overlaps exist. The effect of overlap is hyperbolic: small amounts of overlap have little effect, but nodes with significant overlap will effectively be forced to turn off.

Explainedness
The objective function maximized by the model is then the explainedness (expl) across all the nodes of the lattice $X$, $x_1, \ldots, x_N$, which can be defined in terms of minLPR, the node interaction functions, and the FS status $y_i$ of each node in the lattice:

$$\mathrm{expl}(X) = \prod_{i=1}^{N} \left[ y_i \cdot \big(C \cdot \mathrm{overlap}(x_i)\big)^{-1} + (1 - y_i) \cdot \mathrm{minLPR}(x_i)^{-\mathrm{cover}(x_i) \cdot \mathrm{clear}(x_i)} \right]$$

When a node is off, its explainedness is the inverse of its minLPR, except if there are covering or clearing nodes which explain it by pushing the exponent of minLPR towards zero. When the node is on, its explainedness is the inverse of a fixed cost hyperparameter $C$, though this cost is increased if it overlaps with other active nodes. All else being equal, when $\mathrm{minLPR}(x_t) > C$, a node will be selected as an FS, and so, independent of the node interactions, $C$ can be viewed as the threshold for the minLPR association measure under a traditional approach to MWE identification.
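The trade-off between the on and off states can be illustrated with a short sketch. It assumes one particular reading of the description above: an off node scores minLPR raised to the power of minus cover times clear, and an on node scores the inverse of the cost C multiplied by its overlap penalty. The function names and keyword layout are hypothetical, and C = 4 is the value reported later in the evaluation settings.

```python
import math

def node_expl(is_on, min_lpr, cost=4.0, cover=1.0, clear=1.0, overlap=1.0):
    """Explainedness of one node under the assumed form described above."""
    if is_on:
        # ON: inverse of the fixed cost, inflated by any overlap penalty.
        return 1.0 / (cost * overlap)
    # OFF: inverse minLPR, with covering/clearing shrinking the exponent.
    return min_lpr ** -(cover * clear)

def lattice_expl(nodes):
    """Product of node explainedness; `nodes` is a list of keyword dicts."""
    return math.prod(node_expl(**n) for n in nodes)

on_score = node_expl(True, min_lpr=8.0)                 # 1 / C
off_score = node_expl(False, min_lpr=8.0)               # 1 / minLPR
covered_off = node_expl(False, min_lpr=8.0, cover=0.5)  # half the counts explained
```

With no interactions, a node with minLPR 8 is better on than off because 8 exceeds C; with half of its counts covered by a turned-on longer n-gram, staying off becomes the better choice.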

Optimization
The dependence of the explainedness of nodes on their neighbors effectively prohibits a global optimization of the lattice. Fortunately, though most of the nodes in the lattice are part of a single connected graph, most of the effects of nodes on each other are relatively local, and effective local optimizations can be made tractable by applying some simple restrictions. The main optimization loop consists of iterations over the lattice until complete convergence (no changes in the final iteration). For each iteration over the main loop, each potentially active node is examined in order to evaluate whether its current status is optimal given the current state of the lattice. The order in which nodes are examined has an effect on the result: among the obvious options (LPR, n-gram length), in development good results were obtained through ordering nodes by frequency, which gives an implicit advantage to relatively common n-grams.
Given the relationships between nodes, it is obviously not sufficient to consider switching only the present node. If, for instance, one or more of be keep * under wraps, under wraps, or be keep * under has been turned on, the covering, clearing, or overlapping effects of these other nodes will likely prevent a competing node like keep * under wraps from being correctly activated. Instead, the algorithm identifies a small set of "relevant" nodes which are the most important to the status of the node under consideration. Since turned-off nodes have no direct effect on each other, only turned-on nodes above, below, or overlapping with the current node in the lattice need be considered. Once the relevant nodes have been identified, all nodes (including turned-off nodes) whose explainedness is affected by one or more of the relevant nodes are identified. Next, a search is carried out for the optimal configuration of the relevant nodes, starting from an 'all-on' state and iteratively considering new states with one relevant node turned off; the search continues as long as there is an improvement in explainedness. Since the node interactions are roughly cumulative in their effects, this approach will generally identify the optimal state without the need for an exhaustive search. See Algorithm 1 for details.

Algorithm 1: Optimization algorithm. X is an ordered list of the nodes in the lattice. Nodes (designated by x) contain pointers to the nodes immediately linked to them in the lattice. States (designated by Y) indicate whether each node is ON or OFF. Explainedness values are indicated by e. rev = relevant, aff = affected, curr = current.
function LOCALOPT(Y_start, x, X_rev, X_aff)
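The local search over relevant nodes can be sketched as a simple greedy descent from the all-on state. This is our own minimal illustration, not a reproduction of Algorithm 1: the function name `local_opt`, the use of a caller-supplied `score` callable over sets of turned-on nodes, and the toy gains and penalties below are all hypothetical stand-ins for the explainedness of the affected nodes.

```python
def local_opt(relevant, score):
    """Greedy search: start all-on, repeatedly turn off the single node
    whose removal most improves the score, and stop when nothing improves.

    relevant: iterable of node ids; score: callable on a frozenset of ON nodes.
    """
    on = frozenset(relevant)
    best = score(on)
    improved = True
    while improved and on:
        improved = False
        # sorted() makes tie-breaking deterministic.
        candidates = [(score(on - {x}), x) for x in sorted(on)]
        cand_score, cand = max(candidates, key=lambda pair: pair[0])
        if cand_score > best:
            on, best, improved = on - {cand}, cand_score, True
    return on, best

# Toy objective: each ON node has a gain, and conflicting pairs are penalized.
gains = {"keep * under wraps": 5, "under wraps": 1, "be keep * under": 2}
penalties = {
    frozenset({"keep * under wraps", "under wraps"}): 2,      # covering-like
    frozenset({"keep * under wraps", "be keep * under"}): 4,  # overlap-like
}

def toy_score(on):
    total = sum(gains[x] for x in on)
    return total - sum(p for pair, p in penalties.items() if pair <= on)

state, value = local_opt(gains, toy_score)
```

In this toy setting the search first drops be keep * under, then under wraps, and settles on keep * under wraps alone, having evaluated only a handful of the eight possible configurations.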
Omitted from Algorithm 1 for clarity are various low-level efficiencies which prevent the algorithm from reconsidering states already checked or from recalculating the explainedness of nodes when unnecessary. We also apply the following efficiency restrictions, which significantly reduce the runtime of the algorithm. In each case, more extreme (less efficient) values were individually tested using a development set and found to provide no benefit in terms of the quality of the output lexicon:
• We limit the total number of relevant nodes to 5. When there are more than 5 nodes turned on in the vicinity of the target node, the most relevant nodes are selected by ranking candidates by the absolute difference in explainedness across possible configurations of the target and candidate node considered in isolation;
• To avoid having to deal with storing and processing trivial overlaps, we exclude overlaps with a count of less than 5 from our lattice;
• Many nodes have a minLPR which is slightly larger than 1 (the lowest possible value). There is very little chance these nodes will be activated by the algorithm, and so after applying hard covering, we do not consider activating nodes with minLPR < 2.

Evaluation
We evaluate our approach across three different languages, including evaluation sets derived from four different corpora selected for their size and linguistic diversity. In English, we follow Brooke et al. (2015) in using a 890M token filtered portion of the ICWSM blog corpus (Burton et al., 2009) tagged with the Tree Tagger (Schmid, 1995). To facilitate a comparison with Newman et al. (2012), which does not scale up to a corpus as large as the ICWSM, we also build a lexicon using the 100M token British National Corpus (Burnard, 2000), using the standard CLAWS-derived POS tags for the corpus. Lemmatization included removing all inflectional marking from both words and POS tags. For English, gaps are identified using the same POS regex used in Brooke et al. (2015), which includes simple nouns and portions thereof, up to a maximum of 4 words. The other two languages we include in our evaluation are Croatian and Japanese. Relative to English, both languages have freer word order: we were interested in probing the challenges associated with using an n-gram approach to FS identification in such languages. For Croatian, we used the 1.2-billion-token fhrWaC corpus (Šnajder et al., 2013), a filtered version of the Croatian web corpus hrWaC (Ljubešić and Klubička, 2014), which is POS-tagged and lemmatized using the tools of . Similar to English, the POS regex for Croatian includes simple nouns, adjectives and pronouns, but also other elements that regularly appear inside FS, including both adverbs and copulas. For Japanese, we used a subset of the 100M-page web corpus of Shinzato et al. (2008), which was roughly the same token length as the English corpus. We segmented and POS-tagged the corpus with MeCab (Kudo, 2008) using the UNIDIC morphological dictionary (Den et al., 2007). The POS regex for Japanese covers the same basic nominal structures as English, but also includes case markers and adverbials. 
Though our processing of Japanese includes basic lemmatization related to superficial elements like the choice of writing script and politeness markers, many elements (such as case marking) which are removed by lemmatization in Croatian are segmented into independent morphological units in the MeCab output, making the task somewhat different for the two languages. Brooke et al. (2015) introduced a method for evaluating FS extraction without a reference lexicon or direct annotation of the output of a model. Instead, n-grams are sampled after applying the frequency threshold and then annotated as being either an FS or not. Benefits of this style of evaluation include replicability, the diversity of FS, and the ability to calculate a true F-score. We use the annotation of 2000 n-grams in the ICWSM corpus from that earlier work, and applied the same annotation methodology to the other three corpora: after training and based on written guidelines derived from the definitions of Wray (2008), three native-speaker, educated annotators judged 500 contiguous n-grams and another 500 gapped n-grams for each corpus.
Other than the inclusion of new languages, our test sets differ from Brooke et al. (2015) in two ways. One advantage of a type-based annotation approach, particularly with regard to annotation with a known subjective component, is that it is quite sensible to simply discard borderline cases, improving reliability at the cost of some representativeness. As such, we entirely excluded from our test set n-grams which just one annotator marked as FS. Table 1 contains the counts for the four test sets after this filtering step and Fleiss' Kappa scores before ("Pre") and after ("Post"). The second change is that for the main evaluation we collapsed gapped and contiguous n-grams into a single test set. The rationale is that the number of positive gapped examples is too low to provide a reliable independent F-score.
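The filtering step and the agreement statistic described above can be illustrated with a minimal sketch. The function names and the toy votes are our own assumptions for illustration, not the released annotation data or tooling; each n-gram carries three binary judgments (1 = FS), and n-grams that exactly one annotator marked as FS are discarded before Fleiss' Kappa is recomputed.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of items, each a list of labels from the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()   # how often each label was assigned overall
    p_items = []         # per-item observed agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        p_items.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_bar = sum(p_items) / n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)  # undefined if all ratings use a single label

def filter_borderline(annotations):
    """Drop n-grams that exactly one annotator marked as FS (hypothetical helper)."""
    return {ng: votes for ng, votes in annotations.items() if sum(votes) != 1}

# Invented toy votes, for illustration only.
annotations = {
    "of course": [1, 1, 1],
    "the of": [0, 0, 0],
    "go into": [1, 0, 0],       # borderline: only one annotator said FS
    "point * finger": [1, 1, 0],
}
kept = filter_borderline(annotations)
kappa_pre = fleiss_kappa(list(annotations.values()))
kappa_post = fleiss_kappa(list(kept.values()))
```

On this toy data, removing the single borderline item raises the agreement score, which is the kind of effect the "Pre" and "Post" columns of Table 1 report.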
Our primary comparison is with the heuristic LPR model of Brooke et al. (2015), which is scalable to large corpora and includes gapped n-grams. For the BNC, we also benchmark against the DP-seg model of Newman et al. (2012) with recommended settings, and the LocalMaxs algorithm of da Silva and Lopes (1999) using SCP; neither of these methods scales to the larger corpora. Because these other approaches only generate sequential multiword units, we use only the sequential part of the BNC test set for this evaluation. All comparison approaches have themselves been previously compared against a wide range of association measures. As such, we do not repeat all these comparisons here, but we do consider a lexicon built from ranking n-grams according to the measure used in our lattice (minLPR) as well as PMI and raw frequency. For each of these association measures we rank all n-grams above the frequency threshold and build a lexicon equal to the size of the lexicon produced by our model.
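The size-matched ranking baselines and the sample-based scoring can be sketched as follows. This is an illustration under stated assumptions: `rank_lexicon` and `precision_recall_f` are hypothetical names, the scores stand in for any association measure (minLPR, PMI, or raw frequency), and the toy test set is invented.

```python
def rank_lexicon(scores, size):
    """Top-`size` n-grams by association score; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda ng: (-scores[ng], ng))
    return set(ranked[:size])

def precision_recall_f(lexicon, test_set):
    """Score a lexicon against sampled n-grams; test_set maps n-gram -> True if annotated as FS."""
    tp = sum(1 for ng, is_fs in test_set.items() if is_fs and ng in lexicon)
    fp = sum(1 for ng, is_fs in test_set.items() if not is_fs and ng in lexicon)
    fn = sum(1 for ng, is_fs in test_set.items() if is_fs and ng not in lexicon)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Invented scores for n-grams above the frequency threshold.
scores = {"of course": 5.0, "in the": 4.0, "go into decline": 3.0, "the of": 0.5}
test_set = {"of course": True, "go into decline": True, "the of": False, "in the": False}

model_lexicon_size = 2  # assumed size of the lexicon produced by the model under comparison
baseline = rank_lexicon(scores, model_lexicon_size)
p, r, f = precision_recall_f(baseline, test_set)
```

Because the test set is a sample of all n-grams above the threshold rather than a sample of a system's output, both false positives and false negatives are observable, which is what allows a true F-score to be computed.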
We created small development sets for each corpus and used them to do a thorough testing of parameter settings. Although it is generally possible to increase precision by increasing C, we found that across corpora we always obtained near-optimal results with C = 4, so to demonstrate the usefulness of the lattice technique as an entirely off-the-shelf tool, we present the results using identical settings for all four corpora. We treat covering as a fundamental part of the Lattice model, but to investigate the efficacy of other node interactions within the model we present results with overlap and clearing node interactions turned off.

Results
The main results for FS acquisition across the four corpora are shown in Table 2. As noted in Section 2, simple statistical association measures like PMI do poorly when faced with syntactically-unrestricted n-grams of variable length: minLPR is clearly a much better statistic for this purpose. The LPRseg method of Brooke et al. (2015) consistently outperforms simple ranking, and the lattice method proposed here does better still, with a margin that is fairly consistent across the languages. Generally, clearing and overlap node interactions provide a relatively large increase in precision at the cost of a smaller drop in recall, though the change is fairly symmetrical in Croatian. When only covering is used, the results are fairly similar to Brooke et al. (2015), which is unsurprising given the extent to which decomposition and covering are related. The Japanese and ICWSM corpora have relatively high precision and low recall, whereas both the BNC and Croatian corpora have low precision and high recall.
In the contiguous FS test set for the BNC (Table 3), we found that both the LocalMaxs algorithm and the DP-seg method of Newman et al. (2012) were able to beat our other baseline methods with roughly similar F-scores, though both are well below our Lattice method. Some of the difference seems attributable to fairly severe precision/recall imbalance, though we were unable to improve the F-score by changing the parameters from recommended settings for either model.
Table 2: Results of FS identification in various test sets (English ICWSM, English BNC, Croatian, Japanese): Countrank = ranking with frequency; PMIrank = PMI-based ranking; minLPRrank = ranking with minLPR; LPRseg = the method of Brooke et al. (2015); "−cl" = no clearing; "−ovr" = no penalization of overlaps; "P" = Precision; "R" = Recall; and "F" = F-score. Bold is best in a given column. The performance difference of the Lattice model relative to the best baseline for all test sets considered together is significant at p < 0.01 (based on the permutation test: Yeh (2000)).
Table 3: [...] (1999); DP-seg = method of Newman et al. (2012)
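The significance claim in the Table 2 caption rests on a permutation test. The sketch below shows one common form of such a test, a paired approximate randomization over per-item decisions. The exact procedure used for the reported figure is not specified here, so the swapping scheme, the number of trials, and the function names are assumptions.

```python
import random

def f_score(preds, gold):
    """F-score from parallel lists of boolean predictions and gold labels."""
    tp = sum(1 for p, g in zip(preds, gold) if p and g)
    fp = sum(1 for p, g in zip(preds, gold) if p and not g)
    fn = sum(1 for p, g in zip(preds, gold) if g and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def permutation_test(preds_a, preds_b, gold, trials=10000, seed=0):
    """Estimate how often a random swap of the two systems' outputs yields an
    F-score gap at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(f_score(preds_a, gold) - f_score(preds_b, gold))
    at_least = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(preds_a, preds_b):
            if rng.random() < 0.5:  # swap the two systems' decisions on this item
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        gap = abs(f_score(shuffled_a, gold) - f_score(shuffled_b, gold))
        if gap >= observed:
            at_least += 1
    return (at_least + 1) / (trials + 1)  # add-one smoothing keeps p above zero

# Invented decisions for two systems on six test items.
gold = [True, True, True, False, False, False]
system_a = [True, True, True, False, False, True]
system_b = [True, False, False, True, False, True]
p_value = permutation_test(system_a, system_b, gold, trials=2000)
```

With so few items the toy p-value is not meaningful; the point is only the shape of the procedure, which makes no distributional assumption about the F-score.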

Discussion
Though the results across the four corpora are reasonably similar with respect to overall F-score, there are some discrepancies. By using the standard UNIDIC morpheme representation as the base unit for Japanese, the model ends up doing an extra layer of FS identification, one which is provided by word boundaries in the other languages. The result is that there are relatively more FS for Japanese: precision is high, and recall is comparatively low. Importantly, the initial n-gram statistics actually reflect that Japanese is different: the number of n-gram types over length 4 is almost twice the number in the ICWSM corpus. One idea for future work is to automatically adapt to the input language/corpus in order to ensure a good balance between precision and recall. At the opposite extreme, the low precision of the BNC is almost certainly due to its relatively small size: whereas the n-gram threshold we used here results in minimum counts of roughly 100 for the other three corpora, the BNC statistics include n-grams with counts of less than 10. At such low counts, LPR is less reliable and more noise gets into the lexicon: the first column of Table 4 shows that the BNC lexicon is noticeably larger than the other lexicons, and the higher numbers in columns 2 and 3 (number of POS types and percentage of gapped expressions, resp.) are also indicative of increased noise. This could be resolved by increasing the n-gram threshold. It might also make sense to simply avoid smaller corpora, though for some applications a smaller corpus may be unavoidable. One idea we are pursuing is modifying the calculation of the LPR metric to use a more conservative probability estimate than maximum likelihood in the case of low counts.
Table 4: Statistics for the lexicons created by our lattice method
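To make the idea of a more conservative estimate concrete, the sketch below contrasts the maximum likelihood estimate of a proportion with the lower bound of a Wilson score interval. The Wilson bound is our own choice for illustration; it is not the estimate under development for LPR, and how any such estimate would enter the LPR calculation is left open.

```python
import math

def mle(count, total):
    """Maximum likelihood estimate of a probability from raw counts."""
    return count / total

def wilson_lower_bound(count, total, z=1.96):
    """Lower end of the Wilson score interval (z = 1.96 gives roughly 95% confidence)."""
    if total == 0:
        return 0.0
    p = count / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

# Same observed ratio, very different amounts of evidence.
low_count = wilson_lower_bound(8, 10)
high_count = wilson_lower_bound(800, 1000)
```

Both cases have a maximum likelihood estimate of 0.8, but the bound for the low-count case is pulled far below it, while the high-count case stays close. An estimate with this behaviour would discount n-grams seen fewer than 10 times, as in the BNC, without changing much for the larger corpora.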
We were interested in Croatian and Japanese in part because of their relatively free word order, and whether the handling of gaps would help with identifying FS in these languages. We discovered, however, that free word order actually results in more of a tendency towards contiguous FS, not less, a fact that is reflected in our test sets (Table 1) as well as the lexicons themselves (Table 4). Strikingly rare in Croatian, in particular, are expressions where the content of a gap is an argument which must be filled to syntactically complete an expression: it is English whose fixed-word-order constraints often keep elements of an FS distant from each other. The gaps that do happen in Croatian are mostly prosody-driven insertions of other elements into already complete FS. This phenomenon highlights a problem with the current model, in that gapped and contiguous versions of the same n-gram sequence (e.g., take away and take * away) are, at present, considered entirely independently. Alternatives for dealing with this include collapsing statistics to create a single node in the lattice, creating a promoting link between contiguous and gapped versions of the same n-gram sequence in the lattice model, or switching to a dependency representation (which, we note, requires very little change to the basic model presented here, but would narrow its applicability). The statistics in Table 4 otherwise reflect the quantity and diversity of FS across the corpora, particularly in terms of the number of POS patterns represented in the lexicon. Looking at the most common POS patterns across languages, only noun-noun and adjective-noun combinations ever account for more than 5% of all word types in any of the lexicons. Though some of the diversity can of course be attributed to noise, it is safe to say that most FS do not fall into the standard two-word syntactic categories used in MWE work, and therefore identifying them requires a much more general approach like the one presented here.
Table 5 contains 10 randomly selected examples from each of the lexicons produced by our method. Among the English examples, most of the clear errors are bigrams that reflect particular biases of their respective corpora: the phrase via slashdot comes from boilerplate text identifying the source of an article, whereas Maureen (from Maureen says) is a character in one of the novels included in the BNC. The longer FS mostly seem sensible, in that they are plausible lexicalized constructions, though the BNC entry be open to all * in the seems too long and is likely the result of noise due to insufficient examples. Some FS are dialectal variants; for instance, license endorsed refers to British traffic violations. More generally, the FS lexicons created by these two corpora are quite distinct, sharing less than 50% of their entries.
One striking thing about the non-English FS is how poorly they translate: many good FS in these languages become extremely awkward when translated into English. This is expected, of course, for idioms like biti general poslije bitke "be the general after the battle" (i.e., "hindsight is 20/20"), but it extends to other relatively compositional constructions like こう 言う * が 続く "repeat occurrences of * like this" and 前期 比 "first half comparison". This highlights the potential importance of focusing on FS when learning a language. Though some of the errors seem to be the result of extra material added to a good FS, for instance promet teretnih vozila "good vehicle traffic", most, again, are somewhat inexplicable artifacts of the corpus they were built from, like austrijski investitor "Austrian investor".
Table 5:
English (ICWSM): heart ache, so * have some time, part of the blame, via slashdot, any more questions, protein expression, work in * bank, al-qaeda terrorist, continue discussions, speak about * issue
English (BNC): go into decline, Maureen says, be open to all * in the, Peggy Sue, square * shoulders, delivery system for, this * also includes, license endorsed, point * finger, highly * asset
Croatian: negativno utjecati na "negatively affects on", jedan od dobrih poznavatelja "one of the best connoisseurs of", jasno * je da "it is clear to * that", promet teretnih vozila "good vehicle traffic", odvratiti pozornost "divert attention", biti general poslije bitke "be the general after the battle", popularni internetski "popular internet", izazvati kaos "cause chaos", austrijski investitor "Austrian investor", ideja o gradnji "the idea of building"
Japanese: 高速 道路 整備 "highway construction", 年次 後期 "the second half of the fiscal year", 労 働 者 派遣 事業 "temporary labor agency", こう 言う * が 続く "repeat occurrences of * like this", 風邪 っ 匹 "cold sufferer", ＤＨＣＰ サーバー "DHCP server", 前期 比 "first half comparison", 経営 事項 審査 "examination of administrative affairs", 自分 の 文章 "own writing", 深い 味わい "deep flavor"
Since Zipfian frequency curves naturally extend to multiword vocabulary, our lexicons (and type-based evaluation of them) are of course dominated by rarer terms. This is not, we would argue, a serious drawback, since in practical terms there is very little value in focusing on common FS like of course which manually-built lexicons already contain; most of the potential in automatic extraction comes from the long tail. However, we did investigate the other end of the Zipfian curve by extracting the 20 most common MWEs (including both strong and weak) from the Schneider et al. (2014b) corpus. In the ICWSM lexicon, our recall for these common terms was fairly high (0.7), with errors mostly resulting from longer phrases containing these terms "winning out" (in the lattice) over shorter phrases, which have relatively low LPR due to extremely common constituent words; for example, we missed on time, but had 19 FS which contain it (e.g. right on time, show up on time, and start on time). In one case which showed this same problem, waste * time, the lexicon did have its ungapped version, highlighting the potential for improved handling of this issue.
In Section 2, we noted that FS is generally a much broader category than MWE, which we take as referring to terms which carry significant noncompositional meaning. We decided to investigate the distinction at a practical level by annotating the positive examples in the ICWSM test set for being MWE or non-MWE FS. First, we note that only 28% of our FS types were labeled MWE; this is in contrast to, for instance, the annotation of Schneider et al. (2014b) where "weak" MWE make up a small fraction of MWE types. Even without any explicit representation of compositionality, our model did much better at identifying MWE FS than non-MWE FS: 0.7 versus 0.32 recall. This may simply reflect, however, the fact that a disproportionate number of MWEs were noun-noun compounds, which are fairly easy for the model to identify.
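The two recall analyses above (common MWEs, and MWE versus non-MWE FS) share a simple shape, sketched below. The helper names, the tuple-of-tokens representation, and the toy labels are assumptions for illustration. The second function lists longer lexicon entries that contain a missed term, the pattern observed for on time.

```python
def recall_by_group(lexicon, labelled):
    """Recall per group; `labelled` maps each positive n-gram (tuple of tokens) to a group label."""
    hits, totals = {}, {}
    for ngram, group in labelled.items():
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (ngram in lexicon)
    return {group: hits[group] / totals[group] for group in totals}

def subsuming_entries(lexicon, ngram):
    """Longer lexicon entries that contain `ngram` as a contiguous subsequence."""
    m = len(ngram)
    return sorted(
        entry for entry in lexicon
        if len(entry) > m
        and any(entry[i:i + m] == ngram for i in range(len(entry) - m + 1))
    )

# Invented lexicon and labels, for illustration only.
lexicon = {
    ("right", "on", "time"),
    ("show", "up", "on", "time"),
    ("heart", "ache"),
}
labelled = {
    ("heart", "ache"): "MWE",
    ("on", "time"): "MWE",
    ("any", "more", "questions"): "non-MWE",
}
recalls = recall_by_group(lexicon, labelled)
winners = subsuming_entries(lexicon, ("on", "time"))
```

Here on time is missed even though two longer entries containing it were selected, mirroring the way longer phrases can win out over a shorter phrase in the lattice.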
Due to the lack of spaces between words and an agglutinative morphology, the standard approach to tokenization and lemmatization in Japanese involves morphological rather than word segmentation. In terms of the content of the resulting lexicon we believe the effect of this difference on FS extraction is modest, since many of the extra FS in Japanese would simply be single words in other languages (and considered trivially part of the FS lexicon). However, from a theoretical perspective we might very much prefer to build FS for all languages starting from morphemes rather than words. Such a framework could, for instance, capture inflectional flexibility versus fixedness directly in the model, with fixed inflectional morphemes included as a distinct element of the FS and flexible morphemes becoming gaps. However, for many languages this would result in a huge blow-up in complexity with only modest increases in the scope of FS identification. Though it is indisputable that inflectional fixedness is part of the lexical information contained in an FS, in practice this sort of information can be efficiently derived post hoc from the corpus statistics.
Though we have demonstrated that competition within a lattice is a powerful method for the production of multiword lexicons, its usefulness derives less from the specific choices we have made in this instantiation of the model, and more from the flexibility that such a model provides for future research. Not only do alternatives like DP-seg and LocalMaxs fail to scale up to large corpora, there are few obvious ways to improve on their simple underlying algorithms without compromising their elegance and worsening tractability. Fast and functional, the LPR decomp approach is nevertheless algorithmically ungainly, involving multiple layers of heuristic-driven filtering with no possibility of correcting errors. Our lattice method is aimed at something between these extremes: a practical, optimizable model, but with various component heuristics that can be improved upon. For instance, though the current version of clearing is effective and has practical advantages relative to simpler options that we tested, it could be enhanced by more careful investigation of the statistical properties of n-grams which contain FS.
We can also consider adding new terms to the exponents of the two parts of our objective function, analogous to the cover, clear, and overlap functions, based on other relationships between nodes in the lattice. One which we have considered is creating new connections between identical or similar syntactic patterns, which could serve to encourage the model to generalize. In English, for instance, it might learn that verb-particle combinations are generally likely to be FS, whereas verb-determiner combinations are not. Our initial investigations suggest, however, that it may be difficult to apply this idea without merely amplifying existing undesirable biases in the LPR measure. Bringing in other information such as simple distributional statistics might help the model identify non-compositional semantics, and could, in combination with the existing lattice competition, focus the model on MWEs which could provide a reliable basis for generalization.
For all four corpora, the lattice optimization algorithm converged within 10 iterations. Although the optimization of the lattice is several orders of magnitude more complex than the decomposition heuristics of Brooke et al. (2015), the time needed to build and optimize the lattice is a fraction of the time required to collect the statistics for LPR calculation, and so the end-to-end runtimes of the two methods are comparable. In the BNC, the full lattice method was much faster than LocalMaxs and DP-seg, though direct runtime comparisons to these methods are of modest value due to differences in both scope and implementation.
Finally, though the model was designed specifically for FS extraction, we note that it could be useful for related tasks such as unsupervised learning of morphological lexicons, particularly for agglutinative languages. Character or phoneme n-grams could compete in an identically structured lattice to be chosen as the best morphemes for the language, with LPR adapted to use phonological predictability (i.e., based on vowel/consonant "tags") instead of syntactic predictability. It is likely, though, that further algorithmic modifications would be necessary to target morphological phenomena well, which we leave for future work.

Conclusion
We have presented here a new methodology for acquiring comprehensive multiword lexicons from large corpora, using competition in an n-gram lattice. Our evaluation using annotations of sampled n-grams shows that it consistently outperforms alternatives across several corpora and languages. A tool which implements the method, as well as the acquired lexicons, annotation guidelines, and test sets, have been made available.