Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).


Introduction
Language models (LMs) are fundamental to many NLP tasks, including machine translation and speech recognition. Statistical LMs are probabilistic models that assign a probability to a sequence of words w_1^N, indicating how likely the sequence is in the language. m-gram LMs are popular, and prove to be accurate when estimated using large corpora. In these LMs, the probabilities of m-grams are often precomputed and stored explicitly.
Although widely successful, current m-gram LM approaches are impractical for learning high-order LMs on large corpora, due to their poor scaling properties in both the training and query phases. Prevailing methods (Heafield, 2011; Stolcke et al., 2011) precompute all m-gram probabilities, and consequently need to store and access as many as hundreds of billions of m-grams for a typical moderate-order LM.
Recent research has attempted to tackle these scalability issues through the use of efficient data structures such as tries and hash tables (Heafield, 2011; Stolcke et al., 2011), lossy compression (Talbot and Osborne, 2007; Levenberg and Osborne, 2009; Guthrie and Hepple, 2010; Pauls and Klein, 2011; Church et al., 2007), compact data structures (Germann et al., 2009; Watanabe et al., 2009; Sorensen and Allauzen, 2011), and distributed computation (Heafield et al., 2013; Brants et al., 2007). Fundamental to all the widely used methods is the precomputation of all probabilities; hence they do not provide an adequate trade-off between space and time for high m, both during training and querying. Exceptions are Kennington et al. (2012) and Zhang and Vogel (2006), who use a suffix tree or suffix array over the text for computing the sufficient statistics on-the-fly.
In our previous work (Shareghi et al., 2015), we extended this line of research using a Compressed Suffix Tree (CST) (Ohlebusch et al., 2010), which provides a considerably more compact searchable means of storing the corpus than an uncompressed suffix array or suffix tree. This approach showed favourable scaling properties with m and had only a modest memory requirement. However, the method only supported Kneser-Ney smoothing, not its modified variant (Chen and Goodman, 1999), which performs better overall and has become the de facto standard. Additionally, querying was significantly slower than for leading LM toolkits, making the method impractical for widespread use.
In this paper we extend Shareghi et al. (2015) to support modified Kneser-Ney smoothing, and present new optimisation methods for fast construction and querying.¹ Critical to our approach are:

• Precomputation of several modified counts, which would be very expensive to compute at query time. To orchestrate this, a subset of the CST nodes is selected based on the cost of computing their modified counts (which relates to the branching factor of a node).
The precomputed counts are then stored in a compressed data structure supporting efficient memory usage and lookup.
• Re-use of CST nodes within the m-gram probability computation as a sentence gets scored left-to-right, thus saving many expensive lookups.
Empirical comparison against our earlier work (Shareghi et al., 2015) shows the significance of each of these optimisations. The strengths of our method are apparent when applied to very large training datasets (≥ 16 GiB) and high-order models, m ≥ 5. In this setting, our approach is more memory efficient than the leading KenLM model in both the construction (training) and querying (testing) phases, while remaining highly competitive in the runtimes of both. When memory is a limiting factor at query time, our approach is orders of magnitude faster than the state of the art. Moreover, our method allows for efficient querying with an unlimited Markov order, m → ∞, without resorting to approximations or heuristics.

Modified Kneser-Ney Language Model
In an m-gram language model, the probability of a sentence w_1^N is decomposed as P(w_1^N) = ∏_{i=1}^{N} P(w_i | w_{i−m+1}^{i−1}), where P(w_i | w_{i−m+1}^{i−1}) is the conditional probability of the next word given its finite history. Smoothing techniques are employed to deal with sparsity when estimating the parameters of P(w_i | w_{i−m+1}^{i−1}). A comprehensive comparison of different smoothing techniques is provided in Chen and Goodman (1999). We focus on interpolated modified Kneser-Ney (MKN) smoothing, which is widely regarded as a state-of-the-art technique and is supported by popular language modelling toolkits, e.g. SRILM (Stolcke, 2002) and KenLM (Heafield, 2011).
¹ https://github.com/eehsan/cstlm

MKN is a recursive smoothing technique which uses lower-order k-gram language models to smooth higher-order models. Figure 1 describes the recursive smoothing formula employed in MKN. It is distinguished from Kneser-Ney (KN) smoothing by its use of adaptive discount parameters (denoted D_k(j) in Figure 1) based on the k-gram counts. Importantly, MKN is based not just on m-gram frequencies, c(α), but also on several modified counts based on numbers of unique contexts: N_{i+}(•α) and N_{i+}(α•) are the number of words with frequency at least i that come before and after a pattern α, respectively; N_{i+}(•α•) is the number of word pairs with frequency at least i which surround α; and N′_{i+}(α•) is the number of words coming after α to form a pattern αw for which the number of unique left contexts is at least i. The latter is specific to MKN and not needed in KN. Table 1 lists the different types of quantities required for computing an example 4-gram MKN probability. Efficient computation of these quantities is challenging with limited memory and time resources, particularly when the order of the language model m is high and/or the training corpus is large. In this paper, we make use of advanced data structures to efficiently obtain the required quantities to answer probabilistic queries as they arrive. Our solution involves precomputing and caching expensive quantities such as N_{1+}(•α•) and N′_{1,2,3+}(α•), which we explain in §4. We start in §3 by reviewing the approach of Shareghi et al. (2015), upon which we base our work.
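As a concrete reference for these definitions, the following brute-force sketch computes the i = 1 variants of the modified counts on a toy corpus. The function names are our own and a real implementation would never enumerate m-grams like this; the sketch only pins down what is being counted.

```python
# Brute-force reference for the MKN modified counts (i = 1 case):
# distinct neighbours are exactly the words/pairs with frequency >= 1.
def ngrams(tokens, k):
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def n1plus_left(tokens, alpha):
    # N_{1+}(. alpha): distinct words preceding pattern alpha
    k = len(alpha)
    return len({g[0] for g in ngrams(tokens, k + 1) if tuple(g[1:]) == alpha})

def n1plus_right(tokens, alpha):
    # N_{1+}(alpha .): distinct words following alpha
    k = len(alpha)
    return len({g[-1] for g in ngrams(tokens, k + 1) if tuple(g[:-1]) == alpha})

def n1plus_both(tokens, alpha):
    # N_{1+}(. alpha .): distinct word pairs surrounding alpha
    k = len(alpha)
    return len({(g[0], g[-1]) for g in ngrams(tokens, k + 2)
                if tuple(g[1:-1]) == alpha})

corpus = "a b a c a b d a b".split()
```

For instance, in this corpus "a" is followed by {b, c} and preceded by {b, c, d}.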

KN with Compressed Suffix Trees
Compressed Data Structures

Shareghi et al. (2015) proposed a method for Kneser-Ney (KN) language modelling based on on-the-fly probability computation from a compressed suffix tree (CST) (Ohlebusch et al., 2010). The CST emulates the functionality of the Suffix Tree (ST) (Weiner, 1973) using substantially less space. The suffix tree is a classical search index consisting of a rooted labelled search tree constructed from a text T of length n drawn from an alphabet of size σ. Each root-to-leaf path in the suffix tree corresponds to a suffix of T. The leaves, considered in left-to-right order, define the suffix array (SA) (Manber and Myers, 1993). Searching for a pattern α of length m in T can be achieved by finding the "highest" node v in the ST such that the path from the root to v is prefixed by α. All leaf nodes in the subtree rooted at v correspond to the locations of α in T. This translates to finding the specific range SA[lb, rb] whose suffixes are exactly those prefixed by α, as illustrated in the ST and SA of Figure 2 (left).
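The [lb, rb] range can be made concrete with a naive, uncompressed suffix array; this sketch (ours, assuming ASCII text with a unique smallest terminator) only illustrates the search semantics, not the compressed machinery the paper uses:

```python
# Naive suffix array plus binary search for the SA range of a pattern.
import bisect

def suffix_array(text):
    # indices of all suffixes, sorted lexicographically (O(n^2 log n) toy)
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_range(text, sa, pattern):
    # lb: first suffix >= pattern; rb: last suffix prefixed by pattern
    # ("\xff" acts as a sentinel larger than any ASCII symbol)
    suffixes = [text[i:] for i in sa]
    lb = bisect.bisect_left(suffixes, pattern)
    rb = bisect.bisect_right(suffixes, pattern + "\xff") - 1
    return lb, rb

text = "abcabd$"
sa = suffix_array(text)
lb, rb = sa_range(text, sa, "ab")
occurrences = sorted(sa[i] for i in range(lb, rb + 1))  # positions of "ab"
```

Every leaf in SA[lb, rb] is an occurrence of the pattern, exactly as in the ST view above.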
While searching using the ST or the SA is efficient in theory, it requires large amounts of main memory. A CST reduces the space requirements of the ST by exploiting the compressibility of the Burrows-Wheeler transform (BWT) (Burrows and Wheeler, 1994). The BWT is a reversible permutation of the text, used in data compression tools such as BZIP2 to increase the compressibility of the input. The transform is defined as

BWT[i] = T[(SA[i] − 1) mod n]    (1)

and is the core component of the FM-Index (Ferragina and Manzini, 2000), which is a subcomponent of a CST providing efficient search for locating arbitrary-length patterns (m-grams), determining occurrence frequencies, etc. The key functionality provided by the FM-Index is the ability to efficiently determine the range SA[lb, rb] matching a given pattern α, as described above, without the need to store the ST or SA explicitly. This is achieved by iteratively processing α in reverse order using the BWT, which is usually referred to as backward-search.
The backward-search procedure utilizes the duality between the BWT and SA to iteratively determine SA[lb, rb] for suffixes of α. Suppose SA[sp_j, ep_j] corresponds to all suffixes in T matching the length-j suffix of α. Prepending the next symbol c of α (moving right to left) refines the range to

sp_{j+1} = C[c] + RANK(BWT, sp_j, c)
ep_{j+1} = C[c] + RANK(BWT, ep_j + 1, c) − 1

where C[c] refers to the starting position of all suffixes prefixed by c in SA, and RANK(BWT, sp_j, c) determines the number of occurrences of symbol c in BWT[0, sp_j].
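The loop can be sketched end-to-end in a few lines. In this sketch (ours), C[] and RANK are computed by naive scans, where the real structure answers both via the wavelet tree; the BWT is built from sorted rotations, which matches the SA-based definition when the text ends in a unique smallest terminator:

```python
# Backward search over the BWT: narrows the SA range for suffixes of the
# pattern, processed right to left, without consulting an explicit ST or SA.
def bwt_of(text):
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(bwt, pattern):
    # C[c] = number of symbols in the text strictly smaller than c
    C = {c: sum(bwt.count(d) for d in set(bwt) if d < c) for c in set(bwt)}
    rank = lambda i, c: bwt[:i].count(c)  # occurrences of c in BWT[0, i)
    lb, rb = 0, len(bwt) - 1
    for c in reversed(pattern):
        if c not in C:
            return None               # symbol absent from the text
        lb = C[c] + rank(lb, c)
        rb = C[c] + rank(rb + 1, c) - 1
        if lb > rb:
            return None               # pattern does not occur
    return lb, rb

bwt = bwt_of("abcabd$")               # -> "d$caabb"
```

Searching "ab" this way yields the same SA range [1, 2] as the explicit suffix-array search.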
The operation RANK(BWT, i, c) (and its inverse operation SELECT(BWT, i, c)²) can be performed efficiently using a wavelet tree (Grossi et al., 2003) representation of the BWT. A wavelet tree is a versatile, space-efficient representation of a sequence which can efficiently support a variety of operations (Navarro, 2014). The structure of the wavelet tree is derived by recursively decomposing the alphabet into subsets: at each level the alphabet is split into two subsets, which determine whether each symbol in the sequence is assigned to the left or right child node. Using compressed bitvector representations and Huffman codes to define the alphabet partitioning, the space usage of the wavelet tree and associated RANK structures of the BWT is bounded by nH_k(T) + o(n log σ) bits (Grossi et al., 2003); thus the space usage is proportional to the order-k entropy H_k(T) of the text. Figure 2 (right) shows a sample wavelet tree representation. Using the wavelet tree structure, RANK over a sequence drawn from an alphabet of size σ reduces to log σ binary RANK operations, each of which can be answered in constant time (Jacobson, 1989). The range SA[lb, rb] corresponding to a pattern α can therefore be determined in O(m log σ) time using a wavelet tree of the BWT.
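The reduction of general-alphabet RANK to per-level binary ranks can be sketched with a toy balanced wavelet tree (ours; uncompressed, and balanced rather than Huffman-shaped as in the text):

```python
# Toy balanced wavelet tree: rank(i, c) recurses one level per bit of the
# symbol's code, doing one binary rank (here a naive sum) at each level.
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            left_set = set(self.alphabet[:mid])
            # 0-bit: symbol routed to left child, 1-bit: right child
            self.bits = [0 if c in left_set else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in left_set],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in left_set],
                                     self.alphabet[mid:])

    def rank(self, i, c):
        # occurrences of symbol c in seq[0:i]
        if len(self.alphabet) == 1:
            return i
        mid = len(self.alphabet) // 2
        ones = sum(self.bits[:i])         # naive binary rank
        if c in self.alphabet[:mid]:
            return self.left.rank(i - ones, c)
        return self.right.rank(ones, c)

wt = WaveletTree("d$caabb")               # the BWT of "abcabd$"
```

For example, wt.rank(5, 'a') counts the a symbols in the first five BWT positions with two binary ranks, one per tree level.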
In addition to the FM-index, a CST efficiently stores the tree topology of the ST so that tree operations can be emulated efficiently (Ohlebusch et al., 2010).

Shareghi et al. (2015) showed how the requisite counts for a KN-LM, namely c(α), N_{1+}(•α), N_{1+}(α•) and N_{1+}(•α•), can be computed directly from the CST. Consider the example in Figure 2: the number of occurrences of b corresponds to the number of leaves, size(v), in the subtree rooted at the node v matching b. This can be computed in O(1) time from the size of the range [lb, rb] implicitly associated with each node. The number of unique right contexts of b can be determined using degree(v) (again O(1), but requiring bit operations on the succinct tree representation of the ST).

Determining the number of left contexts and surrounding contexts is more involved. Computing N_{1+}(•α) relies on the BWT. Recall that BWT[i] corresponds to the symbol preceding the suffix starting at SA[i]. Computing N_{1+}(•b) first requires finding the interval of suffixes starting with b in the SA, namely lb = 6 and rb = 10, and then counting the number of unique symbols in BWT[6, 10] = {d, b, a, a, a}, i.e., 3. Determining all unique symbols in BWT[i, j] can be performed efficiently (independently of the size of the range) using the wavelet tree encoding of the BWT. The set of symbols preceding pattern α, denoted P(α), can be computed in O(|P(α)| log σ) time by visiting all unique leaves in the wavelet tree which correspond to symbols in BWT[i, j]. This is usually referred to as the interval-symbols procedure (Schnattinger et al., 2010), and uses RANK operations to find the set of symbols s ∈ P(α) and the corresponding ranges for sα in the SA. In the above example, identifying the SA range of ab requires finding lb, rb in the SA for suffixes starting with a (SA[3, 5]) and then adjusting the bounds to cover only the suffixes starting with ab. This last step is done by computing the rank of the three a symbols in BWT[8, 10] using the wavelet tree; see Figure 2 (right) for RANK(BWT, 8, a). As illustrated, answering RANK(BWT, 8, a) corresponds to processing the first digit of the code word at the root level, which translates into RANK(WT_root, 8, 0) = 4, followed by RANK(WT_1, 4, 1) = 1 on the left branch. Once the ranks are computed, lb and rb are refined
accordingly to SA[3 + (1 − 1), 3 + (3 − 1)]. Finally, for N_{1+}(•α•), all patterns which can follow α are enumerated, and for each of these extended patterns the number of preceding symbols is computed using interval-symbols. This proved to be the most expensive operation in their approach.
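Given a BWT range in hand, N_{1+}(•α) is just the number of distinct symbols in that range. A lazy stand-in (ours) uses a set over the range, where interval-symbols instead walks the wavelet tree and never scans the range:

```python
# N_{1+}(. alpha) from a BWT range: count distinct preceding symbols.
def n1plus_left_bwt(bwt, lb, rb):
    # naive O(rb - lb) scan; interval-symbols achieves this in
    # O(|P(alpha)| log sigma) via the wavelet tree
    return len(set(bwt[lb:rb + 1]))
```

With the range {d, b, a, a, a} from the running example this gives 3, and for the pattern "a" in the toy text "abcabd$" (BWT "d$caabb", SA range [1, 2]) it gives 2.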
Given these quantities, Shareghi et al. (2015) show how m-gram probabilities can be computed on demand, using an iterative algorithm to search for the suffix tree nodes matching the required k-gram (k ≤ m) patterns in the numerator and denominator of the KN recursive equations, from which the probabilities are then computed. We refer the reader to Shareghi et al. (2015) for further details. Overall their approach showed promise, in that it allowed unlimited-order KN-LMs to be evaluated with a modest memory footprint; however, even for smaller m it was many orders of magnitude slower than leading LM toolkits.
To illustrate the cost of querying, Figure 3 (top) shows per-sentence query time for KN, based on the approach of Shareghi et al. (2015) (also shown is MKN, through an extension of their method as described in §4). It is clear that the runtime for KN is much too slow for practical use: about 5 seconds per sentence, with the majority of this time spent computing N_{1+}(•α•). Clearly optimisation is warranted, and the gains from doing so are spectacular (see Figure 3 bottom, which uses the precomputation method described in §4.2).
Extending to MKN

Computing MKN modified counts
A central requirement for extending Shareghi et al. (2015) to support MKN are algorithms for computing N_{1,2,3+}(α•) and N′_{1,2,3+}(α•), which we now expound upon. Algorithm 1 computes both of these quantities, taking as input a CST t, a node v matching the pattern α, the pattern itself, and a flag is-prime denoting which of the N and N′ variants is required. The method enumerates the children of the node (line 3) and calculates either the frequency of each child (line 7) or the modified count N_{1+}(•αx) for each child u, where x is the first symbol on the edge vu (line 5). Lines 8 and 9 accumulate the number of these values equal to one or two, and finally, in line 10, N_{3+} is computed as the difference between N_{1+}(α•) = degree(v) and the already counted events N_1 + N_2.
While roughly similar in approach, computing N′_{1,2,3+}(α•) is in practice slower than N_{1,2,3+}(α•), since it requires calling interval-symbols for each child instead of the constant-time size operation. This gives rise to a runtime that grows with d, the number of children of v.
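The bucketing at the heart of Algorithm 1 can be sketched as follows, where child_values holds either the child frequencies (for the N variant) or the per-child N_{1+}(•αx) values (for N′); the function name is ours:

```python
# Bucket per-child values into the ==1, ==2 and >=3 classes, deriving the
# last by subtraction from the child count, as in line 10 of Algorithm 1.
def n123(child_values):
    n1 = sum(1 for v in child_values if v == 1)
    n2 = sum(1 for v in child_values if v == 2)
    n3plus = len(child_values) - n1 - n2   # degree(v) - N1 - N2
    return n1, n2, n3plus
```

Only a single pass over the children is needed, and the ≥ 3 bucket costs nothing extra.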
As illustrated in Figure 3 (top), the modified counts (§2) are together responsible for 99% of the query time. Moreover, the already expensive runtime of KN is considerably worsened in MKN due to the additional counts required. These facts motivate optimisation, which we achieve by precomputing values, resulting in a 2500× speed-up in query time, as shown in Figure 3 (bottom).

Efficient Precomputation
Language modelling toolkits such as KenLM and SRILM precompute real-valued probabilities and backoff weights at training time, such that querying becomes largely a problem of retrieval. We might consider taking a similar route in optimising our language model; however, we would face the problem that floating point numbers cannot be compressed very effectively. Even with quantisation, which can have a detrimental effect on model perplexity, we would not expect good compression, and thus this technique would limit the scaling potential of our approach.
For these reasons, we instead store the most expensive count data, targeting those counts which have the greatest effect on runtime (see Figure 3 top). We expect these integer values to compress well: as highlighted by Figure 4, most counts have low values, and accordingly a variable-byte compression scheme will be able to realise high compression rates. Removing the need for computing these values at query time leaves only pattern search and a few floating point operations in order to compute language model probabilities (see §4.3), which can be done cheaply.³

Our first consideration is how to structure the cache. Given that each precomputed value is computed using a CST node v (with the pattern as its path label), we structure the cache as a mapping between unique node identifiers id(v) and the precomputed values.⁴ Next we consider which values to cache: while it is possible to precompute values for every node in the CST, many nodes are unlikely to be accessed at query time. Moreover, these rare patterns are likely to be cheap to process using the on-the-fly methods, given that they occur in few contexts. Consequently, precomputing these values would bring minimal speed benefit, while still incurring a memory cost. For this reason we precompute the values only for nodes corresponding to k-grams up to length m̂ (for our word-level experiments m̂ = 10), which are most likely to be accessed at runtime.
The precomputation method is outlined in Algorithm 2, showing how a compressed cache is created for the expensive quantities identified above. The algorithm visits the suffix tree nodes in depth-first-search (DFS) order, and selects a subset of nodes for precomputation (line 7), such that the remaining nodes are either rare or trivial to handle on-the-fly (i.e., leaf nodes).

Algorithm 2: Precomputing expensive counts.

A node included in the cache is marked by storing a 1 in the bit vector bv (lines 8-9) at index l, where l is the node identifier. For each selected node we precompute the expensive counts in lines 10-12, which are stored in integer vectors i(x) for each count type x (line 13). The integer vectors are streamed to disk and then compressed (lines 15-17) in order to limit memory usage. The final steps in lines 15 and 16 compress the integer and bit vectors. The integer vectors i(x) are compressed using a variable-length encoding, namely Directly Addressable Variable-Length Codes (DAC; Brisaboa et al. (2009)), which allows for efficient storage of integers while providing efficient random access. As the overwhelming majority of our precomputed values are small (see Figure 4 left), this gives rise to a dramatic compression rate of only ≈ 5.2 bits per integer. The bit vector bv, of size O(n) where n is the number of nodes in the suffix tree, is compressed using the scheme of Raman et al. (2002), which supports constant-time RANK over very large bit vectors. This encoding allows for efficient retrieval of the precomputed counts at query time. The compressed vectors are loaded into memory, and when an expensive count is required for node v, the precomputed quantities can be fetched in constant time via LOOKUP(v, bv, i(x)) = i(x)[RANK(bv, id(v), 1)]. We use RANK to determine the number of 1s preceding v's position in the bit vector bv; this corresponds to v's index in the compressed integer vectors i(x), from which its precomputed count can be fetched. This strategy only applies to precomputed nodes; for other nodes, the values are computed on-the-fly.

⁴ Each node can be uniquely identified by the order in which it is visited in a DFS traversal of the suffix tree. This corresponds to the RANK of the opening parenthesis of the node in the balanced parenthesis representation of the tree topology of the CST, which can be determined in O(1) time (Ohlebusch et al., 2010).
⁵ We did not test other selection criteria. Other methods may be more effective, such as selecting nodes for precomputation based on the frequency of their corresponding patterns in the training set.
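The retrieval path can be sketched with plain Python lists standing in for the RRR-compressed bit vector and the DAC-compressed integer vector (a naive rank scan here, where the real structure answers RANK in constant time):

```python
# Cache lookup via bit-vector rank: bv marks precomputed node ids, and
# rank over bv gives the index into the packed value array, mirroring
# LOOKUP(v, bv, i) = i[RANK(bv, id(v), 1)].
class PrecomputedCache:
    def __init__(self, bv, values):
        self.bv = bv              # bv[id] == 1 iff node id was precomputed
        self.values = values      # packed values, in node-id order

    def lookup(self, node_id):
        if not self.bv[node_id]:
            return None           # fall back to on-the-fly computation
        rank = sum(self.bv[:node_id])   # number of 1s before node_id
        return self.values[rank]

cache = PrecomputedCache([1, 0, 1, 1, 0], [10, 20, 30])
```

Only the marked nodes consume space in the value array; unmarked nodes cost a single bit.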
Figure 3 compares the query-time breakdown for on-the-fly count computation (top) versus precomputation (bottom), for both KN and MKN and with different Markov orders m. Note that query speed improves dramatically, by a factor of about 2500×, for the precomputed cases. This improvement comes at a modest cost in construction space: precomputing for CST nodes corresponding to k-grams with k ≤ 10 resulted in 20% of the nodes being selected for precomputation, and the space used by the precomputed values accounts for 20% of the total space usage (see Figure 4 right). Index construction time increased by 70%.

Computing MKN Probability
Having established a means of computing the requisite counts for MKN and an efficient precomputation strategy, we now turn to the algorithm for computing the language model probability. This is presented in Algorithm 3, which is based on Shareghi et al. (2015)'s single-CST approach for computing the KN probability (reported in their paper as Algorithm 4). Similar to their method, our approach implements the recursive m-gram probability formulation as an iterative loop (here using MKN). At the core of the algorithm are the two nodes v_full and v, which correspond to the nodes matching the full k-gram and its (k − 1)-gram context, respectively.
Although similar to Shareghi et al. (2015)'s method, which also features right-to-left pattern lookup, we additionally optimise the computation of a full sentence probability by sliding a window of width m over the sequence from left to right, adding one new word at a time.⁷ This allows for the re-use of the nodes matching the full k-grams in one window, v_full, as the nodes matching the context in the subsequent window, denoted v. For example, in the sentence "The Force is strong with this one.", computing the 4-gram probability of "The Force is strong" requires matches into the CST for "strong", "is strong", etc. As illustrated in Table 1, for the next 4-gram, resulting from sliding the window to include "with", the denominator terms require exactly these nodes; see Figure 5. Practically, this is achieved by storing the matching v_full nodes computed in line 8, and passing this vector as the input argument [v_k]_{k=0}^{m−1} to the next call to PROBMKN (line 1). This saves half the calls to backward-search, which, as shown in Figure 3, represent a significant fraction of the querying cost, resulting in a 30% improvement in query runtime.
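A rough, self-contained accounting (ours, not a measurement from the paper) of why this reuse saves about half the backward-search calls:

```python
# Count backward-search calls when scoring a sentence with an m-gram model.
def searches_without_reuse(n_words, m):
    # at each position, search both the full k-grams (k clipped at the
    # sentence start) and separately their (k-1)-gram contexts
    return sum(min(i + 1, m) + min(i, m - 1) for i in range(n_words))

def searches_with_reuse(n_words, m):
    # context nodes are inherited from the previous position's v_full
    # matches, so only the full k-grams are searched
    return sum(min(i + 1, m) for i in range(n_words))
```

For a 20-word sentence with m = 4 this gives 128 versus 74 searches, approaching a factor of two as sentences grow.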
The algorithm starts by considering the unigram probability, and grows the context to its left by one word at a time until the m-gram is fully covered (line 5). This best suits the use of backward-search in a CST, which proceeds right-to-left over the search pattern.

Figure 5: Example MKN probability computation for a 4-gram LM applied to "The Force is strong with" (each word abbreviated to its first character), showing in the two left columns the suffix matches required for the 4-gram FISW and the elements which can be reused from the previous 4-gram computation (grey shading), TFIS. Elements on the right denote the count and occurrence statistics derived from the suffix matches, as linked by blue lines.

At each stage the search for
v_full_k uses the span from the previous match, v_full_{k−1}, along with the BWT, to efficiently locate the matching node. Once the nodes matching the full sequence and its context are retrieved, the procedure is fairly straightforward: the discounts are loaded on line 9 and applied in lines 18-21, while the numerator, denominator and smoothing quantities required for computing P and P̄ are calculated in lines 10-13 and 15-17, respectively.⁸ Note that the calls to the functions N123PFRONT, N1PBACK1 and N1PFRONTBACK1 are avoided if the corresponding node is amongst the nodes selected in the precomputation step; instead the LOOKUP function is called. Finally, the smoothing weight γ is computed in line 22 and the conditional probability on line 23. The loop terminates when we reach the length limit k = m or we cannot match the context, i.e., w_{i−k}^{i−1} is not in the training corpus, in which case the probability value p for the longest match is returned.
We now turn to the discount parameters D_k(j), k ≤ m̂, j ∈ {1, 2, 3+}, which are functions of the corpus statistics, as outlined in Figure 1. While these could be computed from raw m-gram statistics, this approach is very inefficient for large m ≥ 5; instead, these values can be computed efficiently from the compressed data structures. Algorithm 4 outlines how the D_k(j) values can be computed directly from the CST.

Algorithm 4: Computing the discount parameters.

This method iterates over the nodes in the suffix tree, and for each node considers the k-grams encoded in the edge label, where each k-gram is taken to start at the root node. (To avoid duplicate counting, we consider only k-grams contained on the given edge but not in the parent edges, i.e., by bounding k based on the string depths of the parent and current nodes.) For each k-gram we record its count, i (line 8), and the number of unique symbols to its left, c (line 9), which are accumulated in an array for each k-gram size, for values between 1 and 4 (lines 13-14 and 15-16, respectively). We also record the number of unique bigrams by incrementing a counter during the traversal (lines 11-12). Special care is required to exclude edge labels that span sentence boundaries, by detecting the special sentinel symbols (line 8) that separate each sentence or conclude the corpus. This check could be done by repeatedly calling edge(v, k) to find the k-th symbol on the given edge and checking for sentinels; however, this is a slow operation, as it requires multiple backward-search calls. Instead we precalculate a bit vector, bv′, of size equal to the number of tokens in the corpus, n, in which sentinel locations in the text are marked by 1 bits. Coupled with this, we use the suffix array SA, such that depth-next-sentinel(SA, bv′, ℓ) = SELECT(bv′, RANK(bv′, SA[ℓ], 1) + 1, 1) − SA[ℓ], where SA[ℓ] returns the offset
into the text for index ℓ, and the SA is stored uncompressed to avoid the expensive cost of recovering these values.⁹ This function can be understood as finding the first occurrence of the pattern in the text (using SA[ℓ]), then finding the location of the next 1 in the bit vector using constant-time RANK and SELECT operations. This locates the next sentinel in the text, after which the distance to the start of the pattern is computed. Using this method in place of explicit edge calls improved the training runtime substantially, by up to 41×.
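With naive rank/select stand-ins (ours; the real bit vector answers both in constant time), the next-sentinel computation looks as follows:

```python
# Distance from a text offset p to the next sentinel, computed as
# select(rank(p) + 1) - p over a bit vector marking sentinel positions.
def rank1(bv, i):
    return sum(bv[:i])                # number of 1s in bv[0, i)

def select1(bv, j):
    # position of the j-th 1 (1-based)
    seen = 0
    for pos, bit in enumerate(bv):
        seen += bit
        if seen == j:
            return pos
    raise ValueError("fewer than j ones in bit vector")

def depth_next_sentinel(bv, p):
    return select1(bv, rank1(bv, p) + 1) - p

bv = [0, 0, 1, 0, 0, 0, 1, 0]         # sentinels at offsets 2 and 6
```

For example, from offset 3 the next sentinel (at offset 6) is 3 symbols away, so any k-gram on that edge longer than 3 would cross a sentence boundary.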
We precompute the discount values for k-grams with k ≤ m̂. For querying with m > m̂ (including ∞) we reuse the discounts computed for the largest order, m̂.¹⁰
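For reference, the per-order discounts in Figure 1 follow the standard Chen and Goodman (1999) estimates from the count-of-counts n_1, ..., n_4 that Algorithm 4 accumulates for each k-gram size (function name ours):

```python
# Chen & Goodman estimates for the three MKN discounts at one order,
# from n_j = number of k-grams occurring exactly j times.
def mkn_discounts(n1, n2, n3, n4):
    Y = n1 / (n1 + 2 * n2)
    D1 = 1 - 2 * Y * n2 / n1
    D2 = 2 - 3 * Y * n3 / n2
    D3plus = 3 - 4 * Y * n4 / n3
    return D1, D2, D3plus
```

A few floating point operations per order, once the count-of-counts are in hand, which is why the CST traversal dominates discount estimation.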

Experiments
To evaluate our approach we measure memory and time usage, along with the predictive perplexity of word-level LMs, on a number of different corpora varying in size and domain. For all of our word-level LMs, we use m, m̂ ≤ 10. We also demonstrate the positive impact on character-level LM perplexity of increasing the limit m̂ from 10 to 50. The SDSL library (Gog et al., 2014) is used to implement our data structures. The benchmarking experiments were run on a single core of an Intel Xeon E5-2687 v3 3.10GHz server with 500 GiB of RAM.
In our word-level experiments, we use the German subset of Europarl (Koehn, 2005) as a small corpus, which is 382 MiB in size measured as raw uncompressed text. We also evaluate on much larger corpora, training on 32 GiB subsets of the deduplicated English, Spanish, German, and French Common Crawl corpora (Buck et al., 2014). As test sets, we used newstest-2014 for all languages except Spanish, for which we used newstest-2013. For our benchmarking experiments we used the bottom 1M sentences (not used in training) of the German Common Crawl corpus. We used the preprocessing script of Buck et al. (2014), then removed sentences with ≤ 2 words, and replaced rare words¹² (c ≤ 9) in the training data with a special token. In our character-level experiments, we used the training and test data of the benchmark 1-billion-words corpus (Chelba et al., 2013).
Small data: German Europarl First, we compare the time and memory consumption of the SRILM and KenLM toolkits and the CST on the small German corpus. Figure 6 shows that the memory usage for construction and querying with the CST-based methods without precomputation is independent of m, whereas it grows substantially with m for the SRILM and KenLM benchmarks. To make our results comparable to those reported in Shareghi et al. (2015), for query time measurements we report the loading and query time combined. The construction cost is modest, requiring less memory than the benchmark systems for m ≥ 3, and running in a similar time¹³ (despite our method supporting queries of unlimited size). Precomputation adds to the construction time, which rose from 173 to 299 seconds, but yielded speed improvements of several orders of magnitude for querying (218k seconds down to 98 seconds for the 10-gram). In querying, the CST-precompute method is 2-4× slower than both SRILM and KenLM for large m ≥ 5, with the exception of m = 10, where it outperforms SRILM. A substantial fraction of the query time is loading the structures from disk; when this cost is excluded, our approach is between 8-13× slower than the benchmark toolkits. Note that the perplexity computed by the CST closely matched KenLM (differences ≤ 0.1).

¹² Running with the full vocabulary increased the memory requirement by 40% for construction and 5% for querying with our model, and 10% and 30%, respectively, for KenLM. Construction times for both approaches were 15% slower, but query runtime was 20% slower for our model versus 80% for KenLM.
Big Data: Common Crawl Table 2 reports the perplexity results for training on 32 GiB subsets of the English, Spanish, French, and German Common Crawl corpora. Note that with such large datasets, perplexity improves with increasing m, with substantial gains available when moving above the widely used m = 5. This highlights the importance of our approach being independent of m, in that we can evaluate for any m, including ∞, at low cost.
Heterogeneous Data To illustrate the effects of domain shift, corpus size and language model capacity on modelling accuracy, we now evaluate the system using a variety of different training corpora. Table 3 reports the perplexity for German when training over datasets ranging from the small Europarl up to 32 GiB of the Common Crawl corpus. Note that the test set is from the same domain as News Crawl, which explains the vast difference in perplexities. The domain effect is strong enough to eliminate the impact of using much larger corpora: compare the 10-gram perplexities for training on the smaller News Crawl 2007 corpus versus Europarl. However, 'big data' is still useful: in all cases the perplexity improves as we provide more data from the same source. Moreover, the magnitude of the gain in perplexity when increasing m is influenced by the data size: with more training data, higher-order m-grams provide richer models; therefore, the scalability of our method to large datasets is crucially important.

¹³ For all timings reported in the paper we manually flushed the system cache between each operation (both for construction and querying) to remove the effect of caching on runtime. To query KenLM, we used the speed-optimised populate method. (We also compare the memory-optimised lazy method in Figure 7.) To train and query SRILM we used the default method, which is optimised for speed but had slightly worse memory usage than the compact method.
Benchmarking against KenLM Next we compare our model against the state-of-the-art method, KenLM trie. The perplexity difference between CST and KenLM was less than 0.003 in all experiments.
Construction Cost. Figure 7a compares the peak memory usage of our CST models and KenLM, where KenLM is given a target memory usage equal to the peak usage of our CST models. The construction phase for the CST required more time for lower order models (see Figure 7c) but was comparable for larger m, roughly matching KenLM for m = 10. For the 32GiB dataset, the CST model took 14 hours to build, compared to KenLM's 13.5 and 4 hours for the 10-gram and 5-gram models, respectively.
Footnote 14: Using the memory budget option, -S. Note that KenLM often used more memory than specified. Allowing KenLM use of 80% of the available RAM reduced training time by a factor of between 2 and 4.
Query Cost. As shown in Figure 7b, the memory requirements for querying with the CST method were consistently lower than KenLM for m ≥ 4: for m = 10 the memory consumption of KenLM was 277GiB compared to our 27GiB, a 10× improvement. This closely matches the file sizes of the stored models on disk. Figure 7d reports the query runtimes, showing that KenLM slows substantially with increasing dataset size and increasing language model order. In contrast, the runtime of our CST approach is much less affected by data size or model order. Our approach is faster than KenLM with the memory-optimised lazy option for m ≥ 3, often by several orders of magnitude. Against the faster KenLM populate option, our model is still highly competitive, growing to 4× faster for the largest data size. The loading time is still a significant part of the runtime; without this cost, our model is 5× slower than KenLM populate for m = 10 on the largest dataset. Running our model with m = ∞ on the largest data size did not change the memory usage and had only a minor effect on runtime, taking 645s.
Character-level modelling To demonstrate the full potential of our approach, we now consider character-based language modelling, evaluated on the large benchmark 1-billion-words language modelling corpus, a 3.9GiB (training) dataset with 768M words and 4 billion characters. Table 4 shows the test perplexity results for our models, using the full training vocabulary. Note that perplexity improves with m for the character-based model, but plateaus at m = 10 for the word-based model; one reason for this is that the discount computation for the word model is limited to orders up to 10, which may not be a good parameterisation for higher orders.
Despite the character-based model (implicitly) having a massive parameter space, estimating this model was tractable with our approach: the construction time was a modest 5 hours (and 2.3 hours for the word-based model). For the same dataset, Chelba et al. (2013) report that training an MKN 5-gram model took 3 hours using a cluster of 100 CPUs; our algorithm is faster than this, despite using only a single CPU core. Queries were also fast: 0.72-0.87ms and 15ms per sentence for the word and character based models, respectively.

Conclusions
We proposed a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting the queries needed to compute language model probabilities on the fly. We presented several optimisations to accelerate this process, with only a modest increase in construction time and memory usage, yet improving query runtimes by up to 2500×. In benchmarking against the state-of-the-art KenLM package on large corpora, our method has superior memory usage and highly competitive runtimes for both querying and training. Our approach allows easy experimentation with high-order language models, and our results provide evidence that such high orders are most useful when using large training sets.
Footnote 18: Chelba et al. (2013) report a better perplexity of 67.6, but they pruned the training vocabulary, whereas we did not. We also use a stringent treatment of OOV, following Heafield (2013).
We posit that further perplexity gains can be realised using richer smoothing techniques, such as a non-parametric Bayesian prior (Teh, 2006; Wood et al., 2011). Our ongoing work will explore this avenue, as well as integrating our language model into the Moses machine translation system, and reducing query time by caching the lower order probabilities (e.g., for m < 4), which we believe can improve query time substantially while maintaining a modest memory footprint.

Figure 1 :
Figure 1: The quantities and formula needed for modified Kneser-Ney smoothing, where x is a k-gram, u and w are words, and [a]_+ := max{0, a}. We use m to refer to the order of the language model, and k ∈ [1, m] to the level of smoothing. The recursion stops at the unigram level P_0(w | ε), where the probability is smoothed by the uniform distribution over the vocabulary, 1/σ.
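The recursion described in the caption can be illustrated with a toy implementation over explicit count tables. This is a sketch only: the paper's method derives all counts on-the-fly from the CST and uses the modified (multi-valued) discounts, whereas here a single discount D is a simplifying assumption, and all function names are ours.

```python
from collections import defaultdict

def build_tables(tokens, m):
    """Raw k-gram counts c(x), for k up to m, plus per context u the set
    of distinct continuations {w : c(uw) > 0}."""
    count = defaultdict(int)
    followers = defaultdict(set)
    for k in range(1, m + 1):
        for i in range(len(tokens) - k + 1):
            g = tuple(tokens[i:i + k])
            count[g] += 1
            followers[g[:-1]].add(g[-1])
    return count, followers

def kn_prob(w, context, count, followers, vocab, D=0.75):
    """Interpolated Kneser-Ney with a single discount D. As in Figure 1,
    the recursion bottoms out at the unigram level, which is smoothed by
    the uniform distribution 1/|vocab|."""
    if not context:
        total = sum(count[(v,)] for v in vocab)
        gamma = D * len(followers[()]) / total      # mass reserved for backoff
        return max(count[(w,)] - D, 0.0) / total + gamma / len(vocab)
    denom = count[context]                          # ≈ c(context ·)
    if denom == 0:                                  # unseen context: back off
        return kn_prob(w, context[1:], count, followers, vocab, D)
    gamma = D * len(followers[context]) / denom
    return (max(count[context + (w,)] - D, 0.0) / denom
            + gamma * kn_prob(w, context[1:], count, followers, vocab, D))
```

On a toy corpus the conditional distribution sums to one for any observed context (up to the boundary effect of the final token, where the k-gram count c(context) can exceed c(context ·)).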

Figure 3 :
Figure 3: Breakdown of average query time per sentence, shown without runtime precomputation of expensive contextual counts (above) vs. with precomputation (below). The left and right bars in each group denote KN and MKN, respectively. Trained on the German portion of the Europarl corpus and tested over the first 10K sentences from the News Commentary corpus.
Figure 2, which again enumerates the child nodes (whose path labels start with the symbols b, c and d) and computes the number of preceding symbols for each extended pattern. Accordingly, N1(b•) = 2, N2(b•) = 1 and N3+(b•) = 0.
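Even without the figure, the computation can be sketched: for a pattern such as b, enumerate each extension symbol w for which bw occurs in the text, count how many distinct symbols precede bw, and tally the extensions by that count. A naive scan standing in for the CST child-node enumeration (the function name is ours, and this assumes the continuation-count reading of Ni described above):

```python
from collections import defaultdict

def n_counts(text, x):
    """N1(x·), N2(x·), N3+(x·): the number of extension symbols w such
    that the pattern xw is preceded by exactly one, exactly two, or
    three or more distinct symbols in the text."""
    preceders = defaultdict(set)
    k = len(x)
    for i in range(len(text) - k):
        if text[i:i + k] == x:
            w = text[i + k]              # extension symbol after x
            if i > 0:                    # record the symbol preceding xw
                preceders[w].add(text[i - 1])
    sizes = [len(s) for s in preceders.values()]
    return (sum(n == 1 for n in sizes),
            sum(n == 2 for n in sizes),
            sum(n >= 3 for n in sizes))
```

In the CST this enumeration touches only the children of the node for x, rather than scanning the whole text as this sketch does.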

Figure 4 :
Figure 4: Left: distribution of values prestored for Europarl German; Right: space usage of prestored values relative to total index size for Europarl German, for different storage thresholds (m).

Figure 6 :
Figure 6: Memory consumption and total runtime for the CST with and without precomputation, KenLM (trie), and SRILM (default), with m ∈ [2, 10]; we also include m = ∞ for the CST methods. Trained on the Europarl German corpus and tested over the bottom 1M sentences of the German Common Crawl corpus.

Figure 7 :
Figure 7: Memory and runtime statistics for CST and KenLM for construction and querying, with different amounts of German Common Crawl training data and different Markov orders m. We compare the query runtimes against the versions of KenLM optimised for memory (lazy) and for speed (populate). For clarity, the figure only shows CST numbers for m = 10; the results for other settings of m are very similar. KenLM was trained to match the construction memory requirements of the CST-precompute method.

Table 1 :
The main quantities required for computing P (with|Force, is, strong) under MKN.

Table 2 :
Perplexities on the English, French, and German newstest 2014 and the Spanish newstest 2013, when trained on 32GiB chunks of the English, Spanish, French, and German Common Crawl corpus.

Table 4 :
Perplexity results for the 1-billion-word benchmark corpus, showing word-based and character-based MKN models for different m. Timings and peak memory usage are reported for construction. The word model computed discounts and precomputed counts up to a threshold of m = 10, while the character model used a threshold of m = 50. Timings were measured on a single core.