Universal Word Segmentation: Implementation and Interpretation

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively correlated with the presence of word boundary markers and negatively correlated with the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew, when compared to previous work.


Introduction
Word segmentation is the initial step for most higher level natural language processing tasks, such as part-of-speech tagging (POS), parsing and machine translation. It can be regarded as the problem of correctly identifying word forms from a character string.
Word segmentation can be very challenging, especially for languages without explicit word boundary delimiters, such as Chinese, Japanese and Vietnamese. Even for space-delimited languages like English or Russian, relying on white space alone generally does not result in adequate segmentation, as at least punctuation should usually be separated from the attached words. For some languages, the space-delimited units in the surface form are too coarse-grained and therefore often further analysed, as in the cases of Arabic and Hebrew. Even though language-specific word segmentation systems are near-perfect for some languages, it is still useful to have a single system that performs reasonably with no or minimal language-specific adaptations.
Word segmentation standards vary substantially with different definitions of the concept of a word. In this paper, we follow the terminology of Universal Dependencies (UD), where words are defined as basic syntactic units that do not always coincide with phonological or orthographic words. Some orthographic tokens, known in UD as multiword tokens, therefore need to be broken into smaller units that cannot always be obtained by splitting the input character sequence. To perform word segmentation in the UD framework, neither rule-based tokenisers that rely on white space nor the naive character-level sequence tagging model proposed previously (Xue, 2003) are ideal. In this paper, we present an enriched sequence labelling model for universal word segmentation. It is capable of segmenting languages in very diverse written forms. Furthermore, it simultaneously identifies the multiword tokens defined by the UD framework that cannot be resolved simply by splitting the input character sequence. We adapt a regular sequence tagging model, namely bidirectional recurrent neural networks with a conditional random fields (CRF) (Lafferty et al., 2001) inference layer (BiRNN-CRF), as the fundamental framework for word segmentation.
The main contributions of this work include:
1. We propose a sequence tagging model for word segmentation, both for general purposes (mere splitting) and full UD processing (splitting plus occasional transduction).
2. We investigate the correlation between segmentation accuracy and properties of languages and writing systems, which is helpful in interpreting the gaps between segmentation accuracies across different languages as well as selecting language-specific settings for the model.
3. Our segmentation system achieves state-of-the-art accuracy on the UD datasets and improves on previous work (Straka and Straková, 2017), especially for the most challenging languages.
4. We provide an open source implementation. 2

Word Segmentation in UD
The UD scheme for cross-linguistically consistent morphosyntactic annotation defines words as syntactic units that have a unique part-of-speech tag and enter into syntactic relations with other words (Nivre et al., 2016). For languages that use whitespace as boundary markers, there is often a mismatch between orthographic words, called tokens in the UD terminology, and syntactic words. Typical examples are clitics, like Spanish dámelo = da me lo (1 token, 3 words), and contractions, like French du = de le (1 token, 2 words). Tokens that need to be split into multiple words are called multiword tokens and can be further subdivided into those that can be handled by simple segmentation, like English cannot = can not, and those that require a more complex transduction, like French du = de le. We call the latter non-segmental multiword tokens. In addition to multiword tokens, the UD scheme also allows multitoken words, that is, words consisting of multiple tokens, such as numerical expressions like 20 000.

2 https://github.com/yanshao9798/segmenter

Word Segmentation and Typological Factors
We begin with the analysis of the difficulty of word segmentation. Word segmentation is fundamentally more difficult for languages like Chinese and Japanese because there are no explicit word boundary markers in the surface form (Xue, 2003). For Vietnamese, the space-segmented units are syllables that roughly correspond to Chinese characters rather than words. To characterise the challenges of word segmentation posed by different languages, we examine several factors that vary depending on language and writing system. We refer to these as typological factors, although most of them are only indirectly related to the traditional notion of linguistic typology and depend more on the writing system.
• Character Set Size (CS) is the number of unique characters, which is related to how informative the characters are to word segmentation. Each character contains relatively more information if the character set size is larger.
• Lexicon Size (LS) is the number of unique word forms in a dataset, which indicates how many unique word forms have to be identified by the segmentation system. Lexicon size increases as the dataset grows in size.
• Average Word Length (AL) is calculated by dividing the total character count by the word count. It is negatively correlated with the density of word boundaries. If the average word length is smaller, there are more word boundaries to be predicted.
• Segmentation Frequency (SF) denotes how likely it is that space-delimited units are further segmented. It is calculated by dividing the word count by the space-segment count. Languages like Chinese and Japanese have much higher segmentation frequencies than space-delimited languages.
• Multiword Token Portion (MP) is the percentage of multiword tokens that are non-segmental.
• Multiword Token Set Size (MS) is the number of unique non-segmental multiword tokens.
The last two factors are specific to the UD scheme but can have a significant impact on word segmentation accuracy.

Table 1: Pearson product-moment correlation coefficients between dataset size and the statistical factors.
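The factors above can be computed directly from a tokenised corpus. The following sketch assumes a simple data layout (each sentence as a list of gold words, plus the raw text with its original spacing); the function and variable names are illustrative, not from the paper's released code.

```python
def typological_factors(sentences, raw_texts):
    """Compute CS, LS, AL and SF from parallel lists of tokenised
    sentences and their raw surface forms (illustrative sketch)."""
    chars = set()
    lexicon = set()
    word_count = 0
    char_count = 0
    space_segments = 0
    for words, raw in zip(sentences, raw_texts):
        lexicon.update(words)
        word_count += len(words)
        for w in words:
            chars.update(w)
            char_count += len(w)
        space_segments += len(raw.split())  # space-delimited units
    return {
        "CS": len(chars),                   # character set size
        "LS": len(lexicon),                 # lexicon size
        "AL": char_count / word_count,      # average word length
        "SF": word_count / space_segments,  # segmentation frequency
    }
```

MP and MS would additionally require the UD multiword-token annotations, which are omitted here for brevity.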
All the languages in the UD dataset are characterised and grouped by the typological factors in Figure 1. We standardise the statistics x of the proposed factors on the UD datasets with the arithmetic mean µ and the standard deviation σ as (x − µ)/σ. We use them as features and apply K-Means clustering (K = 6) to group the languages. Principal component analysis (PCA) (Abdi and Williams, 2010) is used for dimensionality reduction and visualisation.
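The z-scoring step can be sketched as follows; K-Means and PCA are then applied to the standardised feature vectors (e.g. via scikit-learn). The data layout is an assumption for illustration.

```python
import statistics

def standardise(columns):
    """Rescale each factor x to (x - mu) / sigma across languages.
    `columns` maps a factor name to its list of per-language values."""
    out = {}
    for name, values in columns.items():
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)  # population std for z-scoring
        out[name] = [(v - mu) / sigma for v in values]
    return out
```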
The majority of the languages in UD are space-delimited with few or no multiword tokens and they are grouped at the bottom left of Figure 1. They are statistically similar from the perspective of word segmentation. The Semitic languages Arabic and Hebrew, with rich non-segmental multiword tokens, are positioned at the top. In addition, languages with large character sets and high segmentation frequencies, such as Chinese, Japanese and Vietnamese, are clustered together. Korean is distanced from the other space-delimited languages as it contains white-space delimiters but has a comparatively large character set. Overall, the x-axis of Figure 1 is primarily related to character set size and segmentation frequency, while the y-axis is mostly associated with multiword tokens. Dataset sizes for different languages in UD vary substantially. Table 1 shows the correlation coefficients between the dataset size in sentence number and the six typological factors. Apart from the lexicon size, all the other factors, including multiword token set size, have no strong correlations with dataset size.

Figure 2: Tags employed for word segmentation. Char.: On considère qu'environ 50 000 Allemands du Wartheland ont péri pendant la période. Tags: BEXBIIIIIIIEXBIEBIIIIIEXBIIIIEXBIIIIIIIEXBEXBIIIIIIIIEXBIEXBIIEXBIIIIIEXBEXBIIIIIES. 50 000 is a multitoken word, while qu'environ and du are multiword tokens that should be processed differently.

From Table 2, we can see that the factors, except for lexicon size, are relatively stable across different UD treebanks for the same language, which indicates that they do capture properties of these languages, although some variation inevitably occurs due to corpus properties like genre.
In this paper, we thoroughly investigate the correlations between the proposed statistical factors and segmentation accuracy. Moreover, we aim to find specific settings that can be applied to improve segmentation accuracy for each language group.

Sequence Tagging Model
Word segmentation can be modelled as a character-level sequence labelling task (Xue, 2003; Chen et al., 2015). Characters as basic input units are passed into a sequence labelling model and a sequence of tags associated with word boundaries is predicted. In this section, we introduce the boundary tags adopted in this paper.
Theoretically, binary classification is sufficient to indicate whether a character is the end of a word for segmentation. In practice, more fine-grained tagsets result in higher segmentation accuracy (Zhao et al., 2006). Following the work of , we employ a baseline tagset consisting of four tags: B, I, E, and S, to indicate a character positioned at the beginning (B), inside (I), or at the end (E) of a word, or occurring as a single-character word (S).
The baseline tagset can be applied to word segmentation of Chinese and Japanese without further modification. For space-delimited languages, we add an extra tag X to mark the characters, mostly spaces, that do not belong to any words or tokens. As illustrated in Figure 2, regular spaces are marked with X, while the space in a multitoken word like 50 000 is disambiguated with I.
To enable the model to simultaneously identify non-segmental multiword tokens for languages like Spanish and Arabic in the UD framework, we extend the tagset by adding four tags that correspond to B, I, E, S to mark the corresponding positions in non-segmental multiword tokens and to indicate their occurrences. As shown in Figure 2, the multiword token qu'environ is split into qu' and environ and therefore the corresponding tags are BIEBIIIIIE. This contrasts with du, which should be transduced into de and le. Moreover, the extra tags disambiguate whether a multiword token should be split or transduced according to the context. For instance, (wamimma) in Arabic is occasionally split into (wa) and (mimma) but more frequently transduced into (wa), (min) and (ma). The corresponding tags are SBIE and BIIE, respectively. The transduction of the identified multiword tokens is described in detail in the following section.
The complete tagset is summarised in Table 3. The proposed sequence model can easily be extended to perform joint sentence segmentation by adding two more tags to mark the last character of a sentence (de Lhoneux et al., 2017). T is used if the character is a single-character word and U otherwise. T and U can be used together with B, I, E, S, X for general segmentation, or with B, I, E, S additionally for full UD processing. Joint sentence segmentation is not addressed any further in this paper.
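The general-purpose tagging scheme above can be sketched as a function that converts a raw string and its gold word sequence into one tag per character. This is a minimal sketch assuming the gold words appear in the raw string in order; the function name is illustrative.

```python
def boundary_tags(raw, words):
    """Emit one tag per character of `raw`: B/I/E/S for positions inside
    a word, X for characters (mostly spaces) outside any word."""
    tags = []
    i = 0
    for w in words:
        j = raw.index(w, i)
        tags.extend("X" * (j - i))  # separators before the word
        if len(w) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("I" * (len(w) - 2))
            tags.append("E")
        i = j + len(w)
    tags.extend("X" * (len(raw) - i))  # trailing separators
    return "".join(tags)
```

Note how a multitoken word keeps its internal space tagged as I, matching the 50 000 example from Figure 2: boundary_tags("50 000", ["50 000"]) yields "BIIIIE".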

Main network
The main network for regular segmentation as well as non-segmental multiword token identification is an adaptation of the BiRNN-CRF model (see Figure 3). The input characters can be represented as conventional character embeddings. Alternatively, we employ the concatenated 3-gram model introduced by . In this representation (Figure 4), the pivot character in a given context is represented as the concatenation of the character vector representation along with the local bigram and trigram vectors. The concatenated n-grams encode rich local information, as the same character has different yet closely related vector representations in different contexts. For each n-gram order, we use a single vector to represent the terms that appear only once in the training set while training. These vectors are later used as the representations for unknown characters and n-grams in the development and test sets. All the embedding vectors are initialised randomly.
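A minimal sketch of how the n-gram keys for each pivot character might be extracted before embedding lookup and concatenation. The padding symbols and the exact n-gram windows are assumptions for illustration, not the paper's exact scheme.

```python
def concat_ngrams(sent, i):
    """Return the unigram, bigram and trigram keys attached to the pivot
    character sent[i]; their embeddings are concatenated downstream."""
    padded = "<" + sent + ">"  # hypothetical boundary padding
    p = i + 1                  # pivot index in the padded string
    return (padded[p], padded[p - 1:p + 1], padded[p - 1:p + 2])
```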
The character vectors are passed to the forward and backward recurrent layers. Gated recurrent units (GRU) (Cho et al., 2014) are employed as the basic recurrent cell to capture long term dependencies and sentence-level information. Dropout (Srivastava et al., 2014) is applied to both the inputs and the outputs of the bidirectional recurrent layers.
Figure 4: Concatenated 3-gram model. The third character is the pivot character in the given context.
A first-order chain CRF layer is added on top of the recurrent layers to incorporate transition information between consecutive tags, which ensures that the optimal sequence of tags over the entire sentence is obtained. The optimal sequence can be computed efficiently via the Viterbi algorithm.
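The Viterbi decoding over a first-order CRF can be sketched as follows, with per-position emission scores from the recurrent layers and a learned tag-transition matrix. This is a generic textbook implementation, not the paper's TensorFlow code.

```python
def viterbi(emissions, transitions):
    """emissions: [T][K] per-position tag scores; transitions: [K][K]
    scores for moving from tag i to tag j. Returns the highest-scoring
    tag sequence for the whole sentence."""
    T, K = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = []
    for t in range(1, T):
        prev, score, ptr = score, [], []
        for j in range(K):
            best_i = max(range(K), key=lambda i: prev[i] + transitions[i][j])
            score.append(prev[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        back.append(ptr)
    # backtrack from the best final tag
    best = max(range(K), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```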

Transduction
The non-segmental multiword tokens identified by the main network are transduced into their corresponding components in an additional step. According to statistics over the entire UD training sets, 98.3% of the multiword tokens to be transduced have only one possible transduction, which indicates that the main ambiguity of non-segmental multiword tokens comes with identification, not transduction. We therefore transduce the identified non-segmental multiword tokens in a context-free fashion. For multiword tokens with two or more valid transductions, we only adopt the most frequent one.
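Building the context-free transduction table from training data amounts to keeping the most frequent transduction per token. A minimal sketch, with an assumed (token, components) pair format:

```python
from collections import Counter, defaultdict

def build_transduction_dict(training_pairs):
    """training_pairs: iterable of (multiword_token, tuple_of_components)
    extracted from the training set. Keeps only the most frequent
    transduction of each token."""
    counts = defaultdict(Counter)
    for token, components in training_pairs:
        counts[token][components] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}
```

At test time, an identified token not found in this table is passed on to the encoder-decoder described below.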
In most languages that have multiword tokens, the number of unique non-segmental multiword tokens is rather limited, such as in Spanish, French and Italian. For these languages, we build dictionaries from the training data to look up the multiword tokens. However, in some languages like Arabic and Hebrew, multiword tokens are very productive and therefore cannot be well covered by dictionaries generated from training data. Some of the available external dictionary resources with larger coverage, for instance the MILA lexicon (Itai and Wintner, 2008), do not follow the UD standards.
In this paper, we propose a generalising approach to processing non-segmental multiword tokens. If there are more than 200 unique multiword tokens in the training set for a language, we train an attention-based encoder-decoder (Bahdanau et al., 2015) equipped with shared long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). At test time, identified non-segmental multiword tokens are first queried in the dictionary. If not found, the segmented components are generated with the encoder-decoder as character-level transduction. Overall, we utilise rich context to identify non-segmental multiword tokens, and then apply a combination of dictionary lookup and sequence-to-sequence encoder-decoder to transduce them.

Implementation
Our universal word segmenter is implemented using the TensorFlow library (Abadi et al., 2016). Sentences with similar lengths are grouped into the same bucket and padded to the same length. We construct sub-computational graphs for each bucket so that sentences of different lengths are processed more efficiently. Table 4 shows the hyper-parameters adopted for the neural networks. We use one set of parameters for all the experiments as we aim for a simple universal model, although fine-tuning the hyper-parameters on individual languages might result in additional improvements. The encoder-decoder is trained prior to the main network. The weights of the neural networks, including the embeddings, are initialised using the scheme introduced in Glorot and Bengio (2010). The network is trained using back-propagation. All the random embeddings are fine-tuned during training by backpropagating gradients. Adagrad (Duchi et al., 2011) with mini-batches is employed for optimization. The initial learning rate η 0 is updated with a decay rate ρ.
The encoder-decoder is trained with the unique non-segmental multiword tokens extracted from the training set, of which 5% are held out for validation. The model is trained for 50 epochs, and the proportion of outputs that exactly match the references is used for selecting the weights.
For the main network, word-level F1-score is used to measure the performance of the model after each epoch on the development set. The network is trained for 30 epochs and the weights of the best epoch are selected.
To increase efficiency and reduce memory demand both for training and decoding, we truncate sentences longer than 300 characters. At decoding time, the truncated sentences are reassembled at the recorded cut-off points in a post-processing step.
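The truncation and reassembly step can be sketched as follows. Fixed-length cut-offs are an assumption for illustration; the actual system records the cut-off points of each sentence for the post-processing step.

```python
def truncate(sent, limit=300):
    """Split an over-long sentence into chunks of at most `limit`
    characters for training and decoding."""
    return [sent[i:i + limit] for i in range(0, len(sent), limit)]

def reassemble(tagged_chunks):
    """Concatenate per-chunk tag sequences back into one sequence
    at the recorded cut-off points."""
    return "".join(tagged_chunks)
```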

Datasets and Evaluation
Datasets from Universal Dependencies 2.0 (Nivre et al., 2016) are used for all the word segmentation experiments. 3 In total, there are 81 datasets in 49 languages that vary substantially in size. Training sets are available for 45 languages. We follow the standard splits of the datasets. If no development set is available, 10% of the training set is set aside for this purpose.
We adopt word-level precision, recall and F1-score as the evaluation metrics. The candidate and the reference word sequences in our experiments may not share the same underlying characters due to the transduction of non-segmental multiword tokens. The alignment between the candidate words and the references then becomes unclear, and it is therefore difficult to compute the associated scores. To resolve this issue, we use the longest common subsequence algorithm to align the candidate and the reference words. The matched words are compared and the evaluation scores are computed accordingly:

P = |c ∩ r| / |c|,  R = |c ∩ r| / |r|,  F1 = 2PR / (P + R)

where c and r denote the sequences of candidate words and reference words, |c| and |r| are their lengths, and |c ∩ r| is the number of candidate words that are aligned to reference words by the longest common subsequence algorithm. The word-level evaluation metrics adopted in this paper are different from the boundary-based alternatives (Palmer and Burger, 1997). We adapt the evaluation script from the CoNLL 2017 shared task (Zeman et al., 2017) to calculate the scores. In the following experiments, we only report the F1-score.
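A minimal sketch of the metric, using difflib's SequenceMatcher to approximate the longest-common-subsequence alignment between the two word sequences. This is an illustration of the scoring scheme, not the CoNLL 2017 evaluation script itself.

```python
from difflib import SequenceMatcher

def word_f1(candidate, reference):
    """Word-level F1 from the number of aligned words between the
    candidate and reference word sequences."""
    matcher = SequenceMatcher(a=candidate, b=reference, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    p = matched / len(candidate)   # precision
    r = matched / len(reference)   # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```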
In the following sections, we thoroughly investigate correlations between several language-specific characteristics and segmentation accuracy. All the experimental results in Section 6.2 are obtained on the development sets. The test sets are reserved for final evaluation, reported in Section 6.3.

Word-Internal Spaces
For Vietnamese and other languages with similar historical backgrounds, such as Zhuang and Hmongic languages (Zhou, 1991), the space-delimited syllables containing no punctuation are never segmented but joined into words with word-internal spaces instead. The space-delimited units can therefore be used as the basic elements for tag prediction if we pre-split punctuation. Word segmentation for these languages thus becomes practically the same as for Chinese and Japanese. Table 5 shows that a substantial improvement can be achieved if we use space-delimited syllables as the basic elements for word segmentation for Vietnamese. It also drastically increases both training and decoding speed, as the sequence of tags to be predicted becomes much shorter.
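The preprocessing can be sketched as follows: punctuation is split off first, and the remaining space-delimited syllables become the basic elements for tag prediction. The punctuation pattern here is an assumption for illustration.

```python
import re

def syllable_units(raw):
    """Pre-split punctuation, then treat space-delimited syllables
    (rather than characters) as the basic tagging units."""
    spaced = re.sub(r'([.,!?;:"()])', r' \1 ', raw)
    return spaced.split()
```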

Character Representation
We apply regular character embeddings and concatenated 3-gram vectors introduced in Section 5.1 to the input characters and test their performances respectively. First, the experiments are extensively conducted on all the languages with the full training sets. The results show that the concatenated 3-gram model is substantially better than the regular character embeddings on Chinese, Japanese and Vietnamese, but notably worse on Spanish and Catalan. For all the other languages, the differences are marginal.
To gain more insights, we select six languages, namely Arabic, Catalan, Chinese, Japanese, English and Spanish for more detailed analysis via learning curve experiments. The training sets are gradually extended by 300 sentences at a time. The results are shown in Figure 5. Regardless of the amounts of training data and the other typological factors, concatenated 3-grams are better on Chinese and Japanese and worse on Spanish and Catalan. We expect the concatenated 3-gram representation to outperform simple character embeddings on all languages with a large character set but no space delimiters.
Since adopting the concatenated 3-gram model drastically enlarges the embedding space, in the following experiments, including the final testing phase, concatenated 3-grams are only applied to Chinese, Japanese and Vietnamese.

Space Delimiters
Chinese and Japanese are not delimited by spaces. Additionally, continuous writing without spaces (scriptio continua) is evidenced in most Classical Greek and Latin manuscripts. We perform two sets of learning curve experiments to investigate the impact of white space on word segmentation. In the first set, we keep the datasets in their original forms. In the second set, we omit all white space. The experimental results are presented in Figure 6.
In general, there are huge discrepancies between the accuracies with and without spaces, showing that white space is a crucial word boundary indicator. Retaining the original forms of the space-delimited languages, very high accuracies can be achieved even with small amounts of training data, as the model quickly learns that space is a reliable word boundary indicator. Moreover, when space is ignored, we obtain relatively lower scores on space-delimited languages than on Chinese with comparable amounts of training data, which shows that Chinese characters are more informative for word boundary prediction, due to the large character set size.

Non-Segmental Multiword Tokens
The concept of multiword tokens is specific to UD. To explore how the non-segmental multiword tokens, as opposed to pure segmentation, influence segmentation accuracy, we conduct relevant experiments on selected languages. Similarly to the previous section, two sets of learning curve experiments are performed. In the second set, all the multiword tokens that require transduction are regarded as single words without being processed. The results are presented in Figure 7.
Word segmentation with full UD processing is notably more challenging for Arabic and Hebrew. Table 6 shows the evaluation of the encoder-decoder as the transducer for non-segmental multiword tokens on Arabic and Hebrew. The evaluation metrics ACC and MF-score (MFS) are adapted from the metrics used for machine transliteration evaluation (Li et al., 2009). ACC is exact match and MFS is based on edit distance. The transducer yields relatively higher scores on Hebrew, while Arabic is more challenging to process. In addition, different approaches to transducing the non-segmental multiword tokens are evaluated in Table 7. In the condition None, the identified non-segmental multiword tokens remain unprocessed. In Dictionary, they are mapped via the dictionary derived from the training data if found there. In Transducer, they are all transduced by the attention-based encoder-decoder. In Mix, in addition to utilising the mapping dictionary, the non-segmental terms not found in the dictionary are transduced with the encoder-decoder. The results show that when the encoder-decoder is applied alone, it is worse than only using the dictionaries, but additional improvements can be obtained by combining both of them.
The accuracy differences associated with non-segmental multiword tokens are nonetheless marginal on the other languages, as shown in Figure 7. Despite their frequent occurrence, multiword tokens are easy to process in general when the set of unique non-segmental multiword tokens is small.

Correlations with Accuracy
We investigate the correlations between the proposed typological factors in Section 3 and segmentation accuracy using linear regression with Huber loss (Huber, 1964). The factors are used in addition to training set size as the features to predict the segmentation accuracies in F1-score. To collect more data samples, apart from experimenting with the full training data for each set, we also use smaller sets of 500, 1,000 and 2,000 training instances to train the models respectively if the training set is large enough. The features are standardised with the arithmetic mean and the standard deviation before fitting the linear regression model.
The correlation coefficients of the linear regression model are presented in Figure 8. We can see that segmentation frequency and multiword token set size are negatively correlated with segmentation accuracy. Overall, the UD datasets are strongly biased towards space-delimited languages. Training set size is therefore not a strong factor as high accuracies can be obtained with small amounts of training data, which is consistent with the results of all the learning curve experiments. The other typological factors such as average word length and lexicon size are less relevant to segmentation accuracy. Referring back to Figure 1, segmentation frequency and multiword token set size as the most influential factors, are also the primary principal components that categorise the UD languages into different groups.

Language-Specific Settings
Our model obtains competitive results with only a minimal number of straightforward language-specific settings. Based on the previous analysis of segmentation accuracy and typological factors, referring back to Figure 1, we apply the following settings, targeting specific language groups, to the segmentation system on the final test sets. The language-specific settings can be applied to new languages beyond the UD datasets based on an analysis of the typological factors.

1. For languages with word-internal spaces like Vietnamese, we first separate punctuation and then use space-delimited syllables for boundary prediction.
2. For languages with large character sets and no space delimiters, like Chinese and Japanese, we use concatenated 3-gram representations.
3. For languages with more than 200 unique non-segmental multiword tokens, like Arabic and Hebrew, we use the encoder-decoder model for transduction.
4. For other languages, the universal model is sufficient without any specific adaptation.
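The dispatch over language groups can be sketched as a function of the typological factors. The character-set-size and segmentation-frequency thresholds below are illustrative assumptions; the 200-token threshold for the encoder-decoder is the one stated above.

```python
def language_settings(cs, sf, ms, word_internal_spaces):
    """Pick settings from character set size (cs), segmentation
    frequency (sf) and the number of unique non-segmental multiword
    tokens (ms). Sketch only; cs/sf thresholds are hypothetical."""
    settings = []
    if word_internal_spaces:                         # e.g. Vietnamese
        settings.append("syllable_units")
    if (cs > 1000 and sf > 1.5) or word_internal_spaces:
        settings.append("concat_3grams")             # e.g. Chinese, Japanese
    if ms > 200:                                     # e.g. Arabic, Hebrew
        settings.append("encoder_decoder")
    return settings or ["default"]
```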

Final Results
We compare our segmentation model to UDPipe (Straka and Straková, 2017) on the test sets. UDPipe contains word segmentation, POS tagging, morphological analysis and dependency parsing models in a pipeline. The word segmentation model in UDPipe is also based on RNNs with GRUs. For efficiency, UDPipe has a smaller character embedding size and no CRF interface. It also relies heavily on white space and uses specific configurations for languages in which word-internal spaces are allowed. Automatically generated suffix rules are applied jointly with a dictionary query to handle multiword tokens. Moreover, UDPipe uses language-specific hyper-parameters for Chinese and Japanese. We employ UDPipe 1.2 with the publicly available UD 2.0 models. 4 The presegmented option is enabled as we assume the input text to be presegmented into sentences, so that only word segmentation is evaluated. In addition, the CoNLL shared task involved some test sets for which no specific training data were available. These included a number of parallel test sets of known languages, for which we apply the models trained on the standard treebanks, as well as four surprise languages, namely Buryat, Kurmanji, North Sami and Upper Sorbian, for which we use the small annotated data samples provided in addition to the test sets by the shared task to build models and evaluate on those languages.
The main evaluation results are shown in Table 9. We also report the Macro Average F1-scores. The scores of the surprise languages are excluded and presented separately as no corresponding UDPipe models are available.
Our system obtains higher segmentation accuracy overall. It achieves substantially better accuracies on languages that are challenging to segment, namely Chinese, Japanese, Vietnamese, Arabic and Hebrew. The two systems yield very similar scores when these languages are excluded, as shown in Table 8, in which the two systems are also compared with two rule-based baselines, a simple space-based tokeniser and the tokenisation model for English in NLTK (Loper and Bird, 2002). The NLTK model obtains relatively high accuracy, while the space-based baseline substantially underperforms, which indicates that relying on white space alone is insufficient for word segmentation in general. On the majority of the space-delimited languages without productive non-segmental multiword tokens, both UDPipe and our segmentation system yield near-perfect scores in Table 9. In general, referring back to Figure 1, languages that are clustered in the bottom-left corner are relatively trivial to segment.
The evaluation scores are notably lower on Semitic languages as well as languages without word delimiters. Nonetheless, our system obtains substantially higher scores on the languages that are more challenging to process.
For Chinese, Japanese and Vietnamese, our system benefits substantially from the concatenated 3-gram character representation, as demonstrated in Section 6.2.2. In addition, we employ a more fine-grained tagset with CRF loss instead of the binary tags adopted in UDPipe. As presented in Zhao et al. (2006), more fine-grained tagging schemes outperform binary tags, which is supported by the experimental results on morpheme segmentation reported in Ruokolainen et al. (2013).
We further investigate the merits of the fine-grained tags over the binary tags as well as the effectiveness of the CRF interface through the experiments presented in Table 10, using variants of our segmentation system. The fine-grained tags denote the boundary tags introduced earlier in this paper. Benefiting from the fine-grained tagset and the CRF interface, our model is better at handling non-segmental multiword tokens (Table 11). The attention-based encoder-decoder as the transducer is much more powerful in processing the non-segmental multiword tokens that are not covered by the dictionary than the suffix rules for analysing multiword tokens in UDPipe. UDPipe obtains higher scores on a few datasets. Our model overfits the small training data of Uyghur, as it yields a 100.0 F1-score on the development set. For a few parallel test sets, there are punctuation marks not found in the training data that cannot be correctly analysed by our system, as it is fully data-driven without any heuristic rules for unknown characters.
The evaluation results on the surprise languages are presented in Table 13. In addition to the segmentation models proposed in this paper, we present the evaluation scores of a space-based tokeniser as well as the NLTK model for English. As shown by the previous learning curve experiments in Section 6.2, very high accuracies can be obtained on the space-delimited languages with only small amounts of training data. However, in case of extreme data sparseness (less than 20 training sentences), such as for the four surprise languages in Table 13 and Kazakh in Table 9, the segmentation results are drastically lower even though the surprise languages are all space-delimited.
For the surprise languages, we find that applying segmentation models trained on a different language with more training data yields better results than relying on the small annotated samples of the target language. Since the segmentation model is fully character-based, we simply select the model of the language that shares the most characters with the surprise language as its segmentation model. No annotated data of the surprise language are used for model selection. As shown in Table 13, this transfer approach achieves segmentation accuracies comparable to NLTK. For space-delimited languages with insufficient training data, it may thus be beneficial to employ a well-designed rule-based word segmenter, as NLTK occasionally outperforms the data-driven approach.
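The cross-lingual model selection described above reduces to a character-overlap heuristic. The sketch below assumes overlap is measured as the number of shared unique characters; the paper does not specify the exact measure, so the function and corpus names here are illustrative.

```python
def char_overlap(corpus_a, corpus_b):
    """Number of unique characters shared by two raw text corpora."""
    return len(set("".join(corpus_a)) & set("".join(corpus_b)))

def select_source_model(surprise_corpus, candidate_corpora):
    """Pick the candidate language whose training data shares the most
    characters with the (unannotated) surprise-language text."""
    return max(candidate_corpora,
               key=lambda lang: char_overlap(surprise_corpus,
                                             candidate_corpora[lang]))

candidates = {"cs": ["Dobrý den světe"], "zh": ["你好世界"]}
select_source_model(["Ahoj svete"], candidates)  # 'cs'
```

Because the selection uses only raw text, it requires no annotated data in the surprise language, consistent with the setup described above.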
As a form of extrinsic evaluation, we test the segmenter in a dependency parsing setup on the datasets where we obtained substantial improvements over UDPipe. We present results for the transition-based parsing model in UDPipe 1.2 and for the graph-based parser by Dozat et al. (2017). The experimental results are shown in Table 12. We can see that word segmentation accuracy has a great impact on parsing accuracy, as segmentation errors propagate. A more accurate word segmentation model is therefore very beneficial for achieving higher parsing accuracy.

Related Work
The BiRNN-CRF architecture, proposed in prior work on sequence labelling, has been applied to a number of tasks such as part-of-speech tagging, chunking and named entity recognition.
Table 14: Comparison between the universal model and the language-specific models.

Our universal word segmenter is a major extension of the joint word segmentation and POS tagging system described by Shao et al. (2017). The original model is developed specifically for Chinese and is only applicable to Chinese and Japanese. Apart from being language-independent, the model proposed in this paper employs an extended tagset and a complementary sequence transduction component to fully process non-segmental multiword tokens, which are present in a substantial number of languages, Arabic and Hebrew in particular. It is a generalised segmentation and transduction framework. Our universal model is compared with the language-specific model of Shao et al. (2017) in Table 14. With pretrained character embeddings, ensemble decoding and joint POS tag prediction as introduced in Shao et al. (2017), considerable improvements over the universal model presented in this paper can be obtained. However, the joint POS tagging system is difficult to generalise, as single characters in space-delimited languages are usually not informative for POS tagging. Additionally, compared to Chinese, sentences in space-delimited languages contain far more characters on average. Combining the POS tags with the segmentation tags drastically enlarges the search space, making the model extremely inefficient both for training and for tagging. The joint POS tagging model is nonetheless applicable to Japanese and Vietnamese.

Monroe et al. (2014) present a data-driven word segmentation system for Arabic based on a sequence labelling framework. An extended tagset is designed for Arabic-specific orthographic rules and applied together with hand-crafted features in a CRF framework. It obtains a 98.23 F1-score on the newswire Arabic Treebank (LDC2010T13, LDC2011T09, LDC2010T08), 97.61 on the Broadcast News Treebank (LDC2012T07), and 92.10 on the Egyptian Arabic dataset (LDC2012E93/98/89/99/107/125, LDC2013E12/21). For Hebrew, Goldberg and Elhadad (2013) perform word segmentation jointly with syntactic disambiguation using lattice parsing.
Each lattice arc corresponds to a word and its corresponding POS tag, and a path through the lattice corresponds to a specific word segmentation and POS tagging of the sentence. The proposed model is evaluated on the Hebrew Treebank (Guthmann et al., 2009). The joint word segmentation and parsing F1-score (76.95) is reported and compared against the parsing score (85.70) with gold word segmentation. The evaluation scores reported in both Monroe et al. (2014) and Goldberg and Elhadad (2013) are not directly comparable to the evaluation scores on Arabic and Hebrew in this paper, as they are obtained on different datasets.
For universal word segmentation, apart from UDPipe described in Section 6.3, there are several systems developed for specific language groups. Che et al. (2017) build a similar Bi-LSTM word segmentation model targeting languages without space delimiters, like Chinese and Japanese. The proposed model incorporates rich statistics-based features gathered from large-scale unlabelled data, such as character unigram embeddings, character bigram embeddings and the pointwise mutual information of adjacent characters. Björkelund et al. (2017) use a CRF-based tagger for languages rich in multiword tokens, like Arabic and Hebrew. A predicted Levenshtein edit script is employed to transform the multiword tokens into their components. The evaluation scores on a selected set of languages reported in Che et al. (2017) and Björkelund et al. (2017) are also included in Table 14. More et al. (2018) adapt existing morphological analysers for Arabic, Hebrew and Turkish and present ambiguous word segmentation possibilities for these languages in a lattice format (CoNLL-UL) that is compatible with UD. The CoNLL-UL datasets can be applied as external resources for processing non-segmental multiword tokens.
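The edit-script idea used for multiword tokens can be illustrated with a toy rewrite rule. The sketch below uses a simple "drop k trailing characters, append a string" rule rather than a full Levenshtein edit script; the function name and rule representation are assumptions for illustration, not Björkelund et al.'s actual encoding.

```python
def apply_suffix_rule(token, cut, append):
    """Apply a simple rewrite rule to a surface multiword token:
    drop `cut` characters from the end and append the string `append`.
    A toy stand-in for a predicted Levenshtein edit script."""
    base = token[:len(token) - cut] if cut else token
    # Splitting on spaces in the rewritten string yields the components.
    return (base + append).split(" ")

# Spanish contraction "del" -> "de el": drop 1 character, append " el".
apply_suffix_rule("del", 1, " el")  # ['de', 'el']
```

In a tagger-based setup, a classifier would predict one such rule per token, so unseen multiword tokens can still be split as long as their rule was observed in training.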

Conclusion
We propose a sequence tagging model and apply it to universal word segmentation. BiRNN-CRF is adopted as the fundamental segmentation framework, complemented by an attention-based sequence-to-sequence transducer for non-segmental multiword tokens. We propose six typological factors to characterise the difficulty of word segmentation across different languages. The experimental results show that segmentation accuracy is primarily correlated with segmentation frequency as well as with the set of non-segmental multiword tokens. Whitespace delimiters are crucial to word segmentation, even when the correspondence between orthographic tokens and words is imperfect. For space-delimited languages, very high accuracy can be obtained even with relatively small training sets, whereas languages without spaces require more training data to reach high segmentation accuracy. Based on the analysis, we apply a minimal number of language-specific settings to substantially improve the segmentation accuracy for languages that are fundamentally more difficult to process.
The segmenter is extensively evaluated on the UD datasets in various languages and compared with UDPipe. Apart from obtaining nearly perfect segmentation on most of the space-delimited languages, our system achieves high accuracies on languages without space delimiters, such as Chinese and Japanese, as well as on Semitic languages with abundant multiword tokens, like Arabic and Hebrew.