Nonparametric Bayesian Semi-supervised Word Segmentation

This paper presents a novel hybrid generative/discriminative model of word segmentation based on nonparametric Bayesian methods. Unlike ordinary discriminative word segmentation, which relies only on labeled data, our semi-supervised model also leverages a huge amount of unlabeled text to automatically learn new "words", and further constrains them with labeled data so that it can segment non-standard texts such as those found in social networking services. Specifically, our hybrid model combines a discriminative classifier (CRF; Lafferty et al. (2001)) and unsupervised word segmentation (NPYLM; Mochihashi et al. (2009)), with a transparent exchange of information between these two model structures within a semi-supervised framework (JESS-CM; Suzuki and Isozaki (2008)). We confirmed that it can appropriately segment non-standard texts like those on Twitter and Weibo, and that it achieves nearly state-of-the-art accuracy on standard datasets in Japanese, Chinese, and Thai.


Introduction
For any unsegmented language, especially East Asian languages such as Chinese, Japanese and Thai, word segmentation is almost an inevitable first step in natural language processing. In fact, it is becoming increasingly important lately because of the growing interest in processing user-generated media, such as Twitter and blogs. Texts in such media are often written in a colloquial style that contains many new words and expressions that are not present in any existing dictionaries. Since such words are theoretically infinite in number, we need to leverage unsupervised learning to automatically identify them in corpora.
For this purpose, ordinary supervised learning is clearly unsatisfactory; even hand-crafted dictionaries will not suffice, because functional expressions more complex than simple nouns must be recognized through their relationships with other words in the text, which themselves might be unknown in advance. Previous studies of this issue used character and word information in the framework of supervised learning (Kruengkrai et al., 2009; Sun et al., 2009; Sun and Xu, 2011). However, they either (1) did not explicitly model new words, or (2) did not combine seamlessly with discriminative classifiers (e.g., they just used a threshold to discriminate between known and unknown words).
In contrast, unsupervised word segmentation methods (Goldwater et al., 2006; Mochihashi et al., 2009) use nonparametric Bayesian generative models of word generation to infer the "words" from observations of raw input strings alone. These methods work quite well and have been used not only for tokenization but also for machine translation (Nguyen et al., 2010), speech recognition (Lee and Glass, 2012; Heymann et al., 2014), and even robotics (Nakamura et al., 2014).
However, from a practical point of view, such purely unsupervised approaches do not suffice. Since they only aim to maximize the probability of the language model on the observed set of strings, they sometimes yield word segmentations that do not conform to human standards.

Figure 1: Excerpt of Weibo tweets. It contains many "unknown" words, such as novel proper nouns and terms from local dialects, that cannot be covered by ordinary labeled data or dictionaries.

To solve this problem, this paper describes a novel combination of a nonparametric Bayesian generative model (NPYLM; Mochihashi et al. (2009)) and a discriminative classifier (CRF; Lafferty et al. (2001)). This combination is based on a semi-supervised framework called JESS-CM (Suzuki and Isozaki, 2008), and it requires a nontrivial exchange of information between the two models. In this approach, the generative and discriminative models "teach each other" and yield a novel log-linear model for word segmentation.
Experiments on standard datasets of Chinese, Japanese, and Thai indicate that this hybrid model achieves nearly state-of-the-art accuracy on standard corpora and, thanks to our nonparametric Bayesian model of infinite vocabulary, can accurately segment non-standard texts like those in Twitter and Weibo (the Chinese equivalent of Twitter) without any human intervention.
This paper is organized as follows. Section 2 introduces NPYLM, which will be leveraged within the framework of JESS-CM, described in Section 3. Section 4 introduces our model, NPYCRF, and the necessary exchange of information, while Section 5 is devoted to experiments on datasets in Chinese, Japanese, and Thai. We analyze the results and discuss future directions of research on semi-supervised learning in Section 6 and conclude in Section 7.

Unsupervised Word Segmentation
To acquire new words from an observation consisting of raw strings, a generative model of words can be extremely useful for word segmentation. Goldwater et al. (2006) showed that a bigram hierarchical Dirichlet process (HDP) model based on Gibbs sampling can effectively find "words" in small corpora. Extending this work, Mochihashi et al. (2009) proposed the nested Pitman-Yor language model (NPYLM), a hierarchical Bayesian language model in which character n-grams (actually, ∞-grams (Mochihashi and Sumita, 2008)) are embedded in word n-grams, and for which an efficient dynamic programming algorithm for inference exists.

Figure 2: The structure of NPYLM in a Chinese Restaurant Process representation (replicated from Mochihashi et al. (2009)). The word and character HPYLMs are drawn as suffix trees; the character HPYLM is a base measure for the word HPYLM, and the two are learned as a single model. Each black customer is a count in the HPYLM, and each white customer is a latent proxy customer initiated from a black customer; see Teh (2006) for details.

Conceptually, NPYLM posits that an infinite number of spellings,
i.e., "words", are probabilistically generated from character n-grams, and a word unigram is drawn using the character n-grams as the base measure. Then bigram and trigram distributions are hierarchically generated and the final string is yielded from the "word" n-grams, as shown in Figure 2.
Practically, NPYLM can be considered a hierarchical smoothing of the Bayesian n-gram language model HPYLM (Teh, 2006). In HPYLM, the predictive distribution of a word w = w_t given a history h = w_{t-(n-1)} \cdots w_{t-1} is expressed as

p(w|h) = \frac{c(w|h) - d\, t_{hw}}{\theta + c(h)} + \frac{\theta + d\, t_{h\cdot}}{\theta + c(h)}\, p(w|h'),

where c(w|h) denotes the observed counts and c(h) = \sum_w c(w|h), \theta and d are model parameters, and t_{hw} and t_{h\cdot} = \sum_w t_{hw} are latent variables estimated in the model. The probability of w given h is thus recursively interpolated using the shorter history h' = w_{t-(n-2)} \cdots w_{t-1}. If h is already empty at the unigram level, NPYLM employs a back-off distribution computed from character n-grams for p(w|h'). In this way, NPYLM can assign appropriate probabilities to every possible segmentation and learn the word and character n-grams at the same time within a single generative model (Mochihashi et al., 2009).

Semi-Markov view of NPYLM

NPYLM formulates unsupervised word segmentation as learning with a semi-Markov model (Figure 3). Here, each node corresponds to an inside probability \alpha[t][k] that equals the probability of the substring c_1^t = c_1 \cdots c_t with its last k characters c_{t-k+1}^t being a word. This inside probability can be computed recursively as follows:

\alpha[t][k] = \sum_{j=1}^{\min(t-k,\, L)} p(c_{t-k+1}^t \mid c_{t-k-j+1}^{t-k})\, \alpha[t-k][j] .   (4)

Here, L is the maximum allowed length of a word. With these inside probabilities, we can use a Markov chain Monte Carlo (MCMC) method with an efficient forward filtering-backward sampling algorithm (Scott, 2002), namely a "stochastic Viterbi" algorithm, to iteratively sample "words" from raw strings in a completely unsupervised fashion while avoiding local minima.
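To make the recursion concrete, here is a minimal sketch of forward filtering-backward sampling for a toy unigram version of the model. The lexicon, its probabilities, and the crude character-level base measure are our own stand-ins for NPYLM, which in reality conditions on the previous word and updates its restaurant counts after every sample:

```python
import random

# Toy stand-in for NPYLM's word distribution: a tiny "learned" lexicon
# interpolated with a uniform character-level spelling model (base measure).
LEXICON = {"the": 0.2, "cat": 0.1, "sat": 0.1, "on": 0.1, "mat": 0.1}

def word_prob(w):
    base = (1.0 / 52) ** len(w)          # crude spelling probability
    return 0.5 * LEXICON.get(w, 0.0) + 0.5 * base

def sample_segmentation(s, L=5, rng=random):
    """Forward filtering-backward sampling for a unigram word model."""
    T = len(s)
    # alpha[t][k] = p(c_1..c_t, with the last k characters being a word)
    alpha = [[0.0] * (L + 1) for _ in range(T + 1)]
    for t in range(1, T + 1):
        for k in range(1, min(L, t) + 1):
            inner = 1.0 if t == k else sum(alpha[t - k][1:min(L, t - k) + 1])
            alpha[t][k] = word_prob(s[t - k:t]) * inner
    # Backward sampling: draw word lengths from the exact posterior.
    words, t = [], T
    while t > 0:
        ks = list(range(1, min(L, t) + 1))
        k = rng.choices(ks, weights=[alpha[t][kk] for kk in ks])[0]
        words.append(s[t - k:t])
        t -= k
    return list(reversed(words))

print(sample_segmentation("thecatsatonthemat", rng=random.Random(0)))
```

Because the lexicon words have much higher probability than the base measure, repeated sampling concentrates on segmentations built from them.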
Problems and Beyond Unsupervised word segmentation with NPYLM works surprisingly well for many languages (Mochihashi et al., 2009); however, it has certain issues. First, since it optimizes the performance of the language model, its segmentation does not always conform to human standards and depends on subtle modeling decisions. For example, NPYLM often separates an inflectional suffix of a Japanese verb from the rest of the verb, even though it is actually part of the verb itself. Second, it can produce deficient segmentations for low-frequency words and at the beginning or end of a string, where the available information comes from only one direction. These issues can be alleviated by the naïve semi-supervised learning method of Mochihashi et al. (2009), which simply adds n-gram counts from supervised segmentations in advance. However, this solution is not perfect: the supervised counts will eventually be overwhelmed by the unsupervised counts, because the overall objective function remains unsupervised.
To resolve this issue, we must resort to an explicit semi-supervised learning framework that combines both discriminative and generative models. We used JESS-CM (Suzuki and Isozaki, 2008), currently the best such framework for this purpose, which we will briefly introduce below.

JESS-CM

JESS-CM (Joint probability model Embedding style Semi-Supervised Conditional Model) is a semi-supervised learning framework that outperforms other generative and log-linear models (Druck and McCallum, 2010). In JESS-CM, the probability of a label sequence y given an input sequence x is written as

p(y|x; \Lambda, \Theta) \propto p_{\mathrm{DISC}}(y|x; \Lambda)\; p_{\mathrm{GEN}}(x, y; \Theta)^{\lambda_0},   (5)

where p_DISC and p_GEN are the discriminative and generative models, respectively, and Λ and Θ are their corresponding parameters. Equation (5) is a product of experts, in which each expert works as a "constraint" on the other with relative geometric interpolation weights 1 : λ_0. If we take p_DISC to be a log-linear model like a CRF (Lafferty et al., 2001),

p_{\mathrm{DISC}}(y|x; \Lambda) \propto \exp\Big( \sum_{k=1}^{K} \lambda_k f_k(y, x) \Big),   (6)

then (5) can also be expressed as a log-linear model with a new "feature function" log p_GEN(y, x):

p(y|x; \Lambda, \Theta) = \frac{1}{Z(x)} \exp\Big( \lambda_0 \log p_{\mathrm{GEN}}(y, x; \Theta) + \sum_{k=1}^{K} \lambda_k f_k(y, x) \Big) = \frac{1}{Z(x)} \exp\big( \Lambda \cdot F(y, x) \big).   (7)

Here, the parameter vector Λ = (λ_0, λ_1, ···, λ_K) includes the interpolation weight λ_0, and F(y, x) = (log p_GEN(y, x), f_1(y, x), ···, f_K(y, x)). JESS-CM interleaves the optimization of Λ and Θ to maximize the objective function

L(\Lambda, \Theta) = \log p(Y_l | X_l; \Lambda, \Theta) + \log p(X_u; \Theta),   (8)

where (X_l, Y_l) is the labeled dataset, X_u is the unlabeled dataset, and p(X_u; \Theta) = \prod_{x \in X_u} \sum_y p_{\mathrm{GEN}}(x, y; \Theta). Suzuki and Isozaki (2008) conducted semi-supervised learning on a combination of a CRF and an HMM, as shown in Figure 4. Since the CRF and HMM have the same Markov model structure, they interpolate the two weights on each corresponding path, alternately

• fixing Θ and optimizing Λ of the CRF on X_l, Y_l, and
• fixing Λ and optimizing Θ of the HMM on X_u

until convergence, thereby iteratively maximizing the two terms in (8).
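In other words, each candidate analysis y is scored by λ_0 log p_GEN plus the CRF feature score, and the scores are normalized over candidates. A minimal numeric sketch (the candidate scores below are made up for illustration):

```python
import math

def jess_cm_posterior(candidates, lam0):
    """candidates: list of (log_p_gen, crf_score) pairs, one per candidate y.
    Returns the normalized posterior p(y|x) of the product-of-experts model."""
    # score(y) = lam0 * log p_GEN(y, x) + sum_k lam_k f_k(y, x)
    scores = [lam0 * lp + f for lp, f in candidates]
    m = max(scores)                              # log-sum-exp for stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]

# Three hypothetical segmentations of one sentence.
cands = [(-10.0, 2.0), (-12.0, 3.5), (-15.0, 1.0)]
print(jess_cm_posterior(cands, lam0=0.5))
```

Raising lam0 shifts probability mass toward the candidates favored by the generative model.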
Through this optimization, p_DISC and p_GEN "teach each other": the feature log p_GEN becomes more accurate, and it is further rectified by p_DISC with respect to the labeled data. Note that the interpolation weight λ_0 is computed automatically through this process.

Connecting Two Worlds: NPYCRF
We wish to integrate NPYLM and CRF, applying semi-supervised learning via JESS-CM. Note that Suzuki and Isozaki (2008) implicitly assumed that the discriminative and generative models have the same structure as shown in Figure 4. Since NPYLM is a semi-Markov model as described in Section 2, a naïve approach would be to combine it with a semi-Markov CRF (Sarawagi and Cohen, 2005) as the discriminative model.
However, this strategy does not work well, for two reasons. First, since a semi-Markov CRF is a model of transitions between segments, it cannot deal with character-level transitions and thus performs suboptimally on its own; in fact, our preliminary supervised word segmentation experiments showed an F_1 measure of around 95%, whereas a character-wise Markov CRF achieves over 99%. Second, the semi-Markov CRF was originally designed for chunks of at most a few words (Sarawagi and Cohen, 2005), whereas in word segmentation of Japanese, for example, we often encounter long proper nouns or Katakana sequences of more than ten characters, which require a huge amount of memory even for a small dataset.
In this paper, we instead transparently exchange information between the Markov model (CRF) over characters and the semi-Markov model (NPYLM) over words to perform semi-supervised learning on different model structures. Called NPYCRF, this unified statistical model makes good use of both the discriminative model (CRF) trained on the labeled data and the generative model (NPYLM) trained on the unlabeled data.

CRF→NPYLM
To convey information from the CRF to NPYLM, we can easily translate Markov potentials into semi-Markov potentials, as shown by Andrew (2006) for the supervised learning case.
Consider the situation depicted in Figure 5. Here we can see that the potential of the substring " " (Tokyo prefecture) in the semi-Markov model (left) corresponds to the sum of the potentials in the Markov model (right) along the path shown in bold. Here, we introduce binary hidden states in the Markov model for each character, similarly to the BI tags used in supervised learning, where state 1 represents the beginning of a word and state 0 represents a continuation of the word.
Mathematically, we define γ[a, b) as the sum of the potentials along a U-shaped path over the interval [a, b) (a < b), as shown in Figure 5: the path begins with state 1 at position a, passes through state 0 at positions a+1, ..., b−1, and ends just before (but does not include) the next state 1 at position b.

Figure 6: Substring transitions for marginalization.
Using this notation, the potential that corresponds to a word spanning c_{t-k+1}^t is \exp(\gamma[t-k+1, t+1)), and thus the forward recursion for the inside probability α[t][k] that incorporates the information from the CRF can be written as follows, instead of (4):

\alpha[t][k] = \sum_{j=1}^{\min(t-k,\, L)} p(c_{t-k+1}^t \mid c_{t-k-j+1}^{t-k})^{\lambda_0}\, \exp\big(\gamma[t-k+1,\, t+1)\big)\, \alpha[t-k][j] .   (11)

Backward sampling can be performed in a similar fashion. In this way, we can incorporate information from the character-wise discriminative model (CRF) into the language model segmentation of NPYLM.
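A sketch of this augmented recursion in log space, under simplifying assumptions of ours (a unigram word model in place of NPYLM's word bigram, and precomputed CRF node and edge potentials):

```python
import math

def gamma(node, edge, a, b):
    """Sum of CRF potentials along the U-shaped path z_a = 1,
    z_{a+1} = ... = z_{b-1} = 0.  node[i][z] is the potential of label z at
    character i, and edge[i][(z_prev, z)] that of the transition into i."""
    g = node[a][1]
    for i in range(a + 1, b):
        g += edge[i][(1 if i == a + 1 else 0, 0)] + node[i][0]
    return g

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward(s, log_word_prob, node, edge, lam0, L):
    """alpha[t][k]: log probability that c_1..c_t ends in the word
    c_{t-k+1}..c_t, combining the lam0-weighted LM probability with the
    CRF potentials gamma over the word's span."""
    T = len(s)
    alpha = [[-math.inf] * (L + 1) for _ in range(T + 1)]
    for t in range(1, T + 1):
        for k in range(1, min(L, t) + 1):
            w = lam0 * log_word_prob(s[t - k:t]) + gamma(node, edge, t - k + 1, t + 1)
            if t == k:
                alpha[t][k] = w
            else:
                alpha[t][k] = w + logsumexp(alpha[t - k][1:min(L, t - k) + 1])
    return alpha
```

With all CRF potentials set to zero and lam0 = 1, this reduces exactly to the unsupervised recursion, which gives a convenient sanity check.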

NPYLM→CRF
On the other hand, translating the information from the semi-Markov to Markov model, i.e., translating a potential from the word-based language model into the character-wise discriminative classifier, is not trivial. However, as we describe below, it is actually possible to do so by extending the technique proposed in Andrew (2006).
Note that for the inference of the CRF, the standard theory of log-linear models tells us that we only have to compute the gradient, which requires the expectation of each feature under the current model. This reduces the problem to computing the marginal probability of each path, which can be derived within the framework of semi-Markov models as follows.

Semi-Markov feature λ_0. Following the argument in Section 4.1, the marginal probability of the word transition c_{t-k-j+1}^{t-k} → c_{t-k+1}^t shown in Figure 6 can be expressed with the standard forward-backward formula:

p(c_{t-k-j+1}^{t-k} \rightarrow c_{t-k+1}^{t} \mid s) = \frac{1}{Z(s)}\, \alpha[t-k][j]\; p(c_{t-k+1}^{t} \mid c_{t-k-j+1}^{t-k})^{\lambda_0}\, \exp\big(\gamma[t-k+1,\, t+1)\big)\; \beta[t][k] .   (12)

Here, Z(s) is a normalizing constant associated with each input string s, and β[t][k] is a backward probability, computed similarly to (11) using the Markov features λ_1, ···, λ_K:

\beta[t][k] = \sum_{j=1}^{\min(T-t,\, L)} p(c_{t+1}^{t+j} \mid c_{t-k+1}^{t})^{\lambda_0}\, \exp\big(\gamma[t+1,\, t+j+1)\big)\, \beta[t+j][j] ,   (13)

where T is the length of s. Note that the features associated with label bigrams in our binary CRF can be divided into four types, 1-1, 1-0, 0-1, and 0-0, as shown in Figure 7.
Case 1-1: As shown in Figure 8(a), this case means that a word of length 1 begins at time t, which is equivalent to the probability of the substring c_t^t being a word:

p(z_t = 1, z_{t+1} = 1 \mid s) = p(c_t^t \mid s) .   (14)

Here, p(c_\ell^k \mid s) is the marginal probability of the substring c_\ell \cdots c_k being a word, which can be derived from equation (12) by summing over the possible previous words:

p(c_\ell^k \mid s) = \sum_{j} p(c_{\ell-j}^{\ell-1} \rightarrow c_\ell^k \mid s) .   (15)

Case 1-0: As shown in Figure 8(b), this case means that a word begins at time t and has length at least 2. Since we do not know the endpoint of this word, we obtain the probability p(z_t = 1, z_{t+1} = 0) by marginalizing over the word length j:

p(z_t = 1, z_{t+1} = 0 \mid s) = \sum_{j=2}^{L} p(c_t^{t+j-1} \mid s) ,   (16)

where p(c_t^{t+j-1} \mid s) is obtained from (15).
Case 0-1: Similarly, as shown in Figure 8(c), this case means that a word of length at least 2 begins before time t and ends at time t. Therefore, we can marginalize over the start point of the possible word to obtain the marginal probability:

p(z_t = 0, z_{t+1} = 1 \mid s) = \sum_{j=2}^{L} p(c_{t-j+1}^{t} \mid s) .   (17)

Case 0-0: In principle, this case means that a word begins before time t and ends at or after time t+1. Therefore, we could marginalize over both the start and end points of the possible word spanning [t, t+1]:

p(z_t = 0, z_{t+1} = 0 \mid s) = \sum_{i \ge 2} \sum_{j \ge 1,\, i+j \le L} p(c_{t-i+1}^{t+j} \mid s) .   (19)

However, in fact we can avoid this nested computation, because the probabilities p(z_t, z_{t+1}) over the possible values of z_t and z_{t+1} must sum to 1. We can therefore simply calculate it as follows (Andrew, 2006):

p(z_t = 0, z_{t+1} = 0 \mid s) = 1 - p(1, 1) - p(1, 0) - p(0, 1) ,   (20)

where p(x, y) denotes p(z_t = x, z_{t+1} = y \mid s).
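These four cases can be checked numerically. The sketch below obtains exact span marginals for a toy string by brute-force enumeration (standing in for the forward-backward computation) and then assembles the label-bigram marginals exactly as above:

```python
from itertools import product
from collections import defaultdict

def segmentations(T):
    """All segmentations of positions 1..T, as lists of (start, end) spans."""
    for cuts in product([0, 1], repeat=T - 1):   # cut after position i?
        spans, start = [], 1
        for i, c in enumerate(cuts, start=1):
            if c:
                spans.append((start, i))
                start = i + 1
        spans.append((start, T))
        yield spans

def span_marginals(T, word_prob=0.1):
    """Exact p(c_l..c_k is a word | s) for a uniform unigram word model."""
    marg, z = defaultdict(float), 0.0
    for spans in segmentations(T):
        p = word_prob ** len(spans)
        z += p
        for sp in spans:
            marg[sp] += p
    return {sp: p / z for sp, p in marg.items()}

def bigram_marginals(marg, T, t):
    """p(z_t, z_{t+1} | s) assembled from span marginals (cases 1-1 .. 0-0)."""
    p11 = marg.get((t, t), 0.0)
    p10 = sum(marg.get((t, e), 0.0) for e in range(t + 1, T + 1))
    p01 = sum(marg.get((b, t), 0.0) for b in range(1, t))
    p00 = 1.0 - p11 - p10 - p01
    return p11, p10, p01, p00

marg = span_marginals(5)
print(bigram_marginals(marg, 5, t=2))
```

The 0-0 value computed by subtraction agrees with the direct double marginalization over spans covering both t and t+1, which confirms that the four cases partition the event space.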

Inference
Finally, we obtain the inference algorithm for NPYCRF as a variant of the MCMC-EM algorithm (Wei and Tanner, 1990), shown in Figure 9. When learning the NPYLM, we add the CRF potentials as described in Section 4.1 and sample a possible segmentation from the posterior through forward filtering-backward sampling to update the model parameters. On the basis of this improved language model, the CRF weights are then optimized by incorporating the language model feature, as explained in Section 4.2. We iterate this process until convergence.
Note that we first train the CRF on the labeled data in Step 2, before the unsupervised learning begins. Since our inference algorithm includes an optimization of the CRF and is thus not a true MCMC, word segmentations learned before the supervised information is introduced would be severely constrained and likely to get stuck in local minima.
In practice, we found that the EM-style batch learning of the CRF described above often fails because our objective function is non-convex. Therefore, we switched to ADF (Sun et al., 2014), an adaptive stochastic gradient descent method that yields state-of-the-art accuracies on natural language processing problems, including word segmentation. In this case, Λ in Figure 9 was optimized on each minibatch drawn from the labeled data X_l, Y_l, while incorporating information from the unlabeled data X_u through the language model.

1: Add Y_l, X_l to NPYLM.
2: Optimize Λ on Y_l, X_l. (pure CRF)
3: for j = 1 ··· M do
4:   for i = randperm(1 ··· N) do
5:     if j > 1 then
6:       Remove customers of X_u^(i) from NPYLM
7:     end if
8:     Sample a segmentation of X_u^(i) using NPYLM and the CRF
9:     Add customers of the sampled segmentation to NPYLM
10:  end for
11:  Optimize Λ of NPYCRF on Y_l, X_l.
12: end for

Figure 9: Basic learning algorithm for NPYCRF. X_u^(i) denotes the i-th sentence in the unlabeled data X_u. We can also iterate steps 4 to 10 several times until Θ approximately converges before updating Λ.

Table 1: Datasets used in our experiments (numbers of sentences).

Language   Dataset     Label     Unlabel   Test
Chinese    MSR         86,924    865,679   3,985
Chinese    Weibo       10K-40K   880,920   30,000
Japanese   Twitter     59,931    600,000   444
Thai       InterBEST   10,000    30,133    10,000

Because of its heavy computational demands, we parallelized the NPYLM sampling over several processors, and because of the possible correlation between segmentations within the samples, we used the Metropolis-Hastings algorithm to correct them. The acceptance rate in our experiments was over 99%. For decoding, we can simply find a Viterbi path in the integrated semi-Markov model while fixing all the sampled segmentations on the unlabeled data.
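The skeleton of Figure 9 can be sketched as follows with a toy unigram "NPYLM" that maintains word counts in place of restaurant customers; the class names and the stubbed-out CRF step (line 11) are our simplifications, not the paper's implementation:

```python
import random
from collections import Counter

class ToyNPYLM:
    """Unigram stand-in for NPYLM: word counts plus a character base measure."""
    def __init__(self, alpha=1.0):
        self.counts, self.total, self.alpha = Counter(), 0, alpha
    def base(self, w):
        return (1.0 / 60) ** len(w)              # crude spelling probability
    def prob(self, w):
        return (self.counts[w] + self.alpha * self.base(w)) / (self.total + self.alpha)
    def add(self, words):
        for w in words:
            self.counts[w] += 1
            self.total += 1
    def remove(self, words):
        for w in words:
            self.counts[w] -= 1
            self.total -= 1

def sample_words(s, lm, L, rng):
    """Forward filtering-backward sampling with the current counts."""
    T = len(s)
    alpha = [[0.0] * (L + 1) for _ in range(T + 1)]
    for t in range(1, T + 1):
        for k in range(1, min(L, t) + 1):
            inner = 1.0 if t == k else sum(alpha[t - k][1:min(L, t - k) + 1])
            alpha[t][k] = lm.prob(s[t - k:t]) * inner
    words, t = [], T
    while t > 0:
        ks = list(range(1, min(L, t) + 1))
        k = rng.choices(ks, weights=[alpha[t][kk] for kk in ks])[0]
        words.append(s[t - k:t])
        t -= k
    return list(reversed(words))

def train(labeled, unlabeled, iters=3, L=5, rng=random.Random(0)):
    lm = ToyNPYLM()
    for words in labeled:                        # Step 1: add labeled data
        lm.add(words)
    seg = {}                                     # current segmentations of X_u
    for _ in range(iters):                       # Steps 3-12
        for i in rng.sample(range(len(unlabeled)), len(unlabeled)):
            if i in seg:                         # Steps 5-7
                lm.remove(seg[i])
            seg[i] = sample_words(unlabeled[i], lm, L, rng)
            lm.add(seg[i])                       # Steps 8-9
        # Step 11: the real model re-optimizes the CRF weights here.
    return lm, seg

lm, seg = train([["the", "cat"], ["the", "mat"]], ["thecat", "thematcat"])
print(seg)
```

Removing a sentence's own counts before resampling it is the essential blocked-Gibbs step: each sentence is segmented under a model estimated from all the other data.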

Experiments
We conducted experiments on several corpora of unsegmented languages: Japanese, Chinese, and Thai. These included standard corpora as well as text from Twitter and its Chinese equivalent, Weibo.

Data
Chinese For Chinese, we first used a standard dataset from the SIGHAN Bakeoff 2005 (Emerson, 2005) for the labeled and test data, and Chinese Gigaword version 2 (LDC2009T14) for the unlabeled data. We chose the MSR subset of the SIGHAN Bakeoff, written in simplified Chinese, together with the provided training and test splits, which contain about 87K and 4K sentences, respectively. For the unlabeled data, i.e., a collection of raw strings, we used a random subset of 880K sentences from Chinese Gigaword with all spaces removed. We chose this size to be about 10 times larger than the labeled data, considering current computational requirements. We used the part from the Xinhua news agency in 2004 and split the data into sentences at the end-of-sentence character.
Table 3: Accuracies on the Leiden Weibo corpus in Chinese. "Label" and "Unlabel" are the amounts of labeled and unlabeled data, respectively, "Topline" is an ideal situation of complete supervision, and K = 10^3 sentences.

Because the MSR and Xinhua datasets were compiled from newspapers, to meet our objective of handling informal text we conducted further experiments using
the Leiden Weibo corpus from Weibo, a Twitter equivalent in China. From this dataset, we used the sentences that have an exact correspondence between the provided segmented and unsegmented versions, yielding about 880K sentences. Since we did not know how much supervision would be necessary for decent performance, we conducted experiments with different amounts of labeled data: 10K, 20K, 40K, and 880K (all). Note that the final case amounts to complete supervision, an ideal situation that is not likely in practice.
Japanese Word segmentation accuracies around 99% have already been reported for the newspaper domain in Japanese (Kudo et al., 2004). Therefore, we only conducted experiments on segmenting Twitter text. In addition to our own random Twitter crawl from April 2014, we used a corpus of Japanese Twitter text compiled by Tokyo Metropolitan University. This corpus is very small (944 sentences); it mainly targets transfer learning and is segmented according to the BCCWJ (Balanced Corpus of Contemporary Written Japanese) standard from the National Institute for Japanese Language and Linguistics (Maekawa, 2007). Therefore, for the labeled data we used the "core" subset of BCCWJ, consisting of about 59K sentences, plus 500 random sentences from the Twitter dataset. We used the remaining 444 sentences for testing. For the unlabeled data, we used a random crawl of 600K Japanese sentences collected from Twitter in March and April 2014.
Thai Unsegmented languages such as Thai, Lao, Burmese, and Khmer are also prevalent in Southeast Asia and are becoming increasingly important targets of natural language processing. Thus we also conducted an experiment on Thai, using the standard InterBEST 2009 dataset (Kosawat, 2009). Since the "novel" subset of InterBEST is reported to have relatively low precision, we used this part with a random split of 10K sentences for supervised learning, 30K sentences for unsupervised learning, and a further 10K sentences for testing.

Training Settings
Because Sun et al. (2012) report increased accuracy with the three tags {B, I, E}, which mark the beginning, internal part, and end of a word, respectively, we also tried these tags in place of the binary tags described in Section 4.2. This modification results in 6 possible transitions out of the 3^2 = 9 label bigrams, and the computation follows straightforwardly from the binary case in Section 4.2. We used truncated normal priors N(1, σ^2) and N(0, σ^2) for λ_0 and λ_1, ···, λ_K, respectively, and fixed the CRF regularization parameter C to 1.0 and σ to 1.0 based on preliminary experiments on the same data.
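The six valid transitions can be enumerated mechanically. The convention below (with invalid bigrams listed explicitly) is one common choice and is our own illustration; the exact convention in Sun et al. (2012) may differ:

```python
from itertools import product

TAGS = ("B", "I", "E")
# I and E must continue a word already begun (so neither can follow E),
# and a new word cannot restart mid-word (so B cannot follow I).
INVALID = {("I", "B"), ("E", "I"), ("E", "E")}

valid = [bigram for bigram in product(TAGS, repeat=2) if bigram not in INVALID]
print(len(valid), valid)
```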
For the feature templates, we followed Sun et al. (2012). In addition to those templates, we used character-type bigrams, where the character type was defined by Unicode blocks (such as Hiragana or CJK Unified Ideographs for Japanese and Chinese) or Unicode character categories (for Thai).
To reduce computation by restricting the search space appropriately, we employed a negative binomial generalized linear model over string features (Uchiumi et al., 2015) to predict the maximum length of a possible word at each character position in the training data. The upper limit L in (11) and (13) was therefore replaced by a position-dependent limit L_t, obtained from this statistical model trained on the labeled segmentations. We observed that this prediction made the computation several times faster than, for example, using a fixed threshold in Japanese, where quite long words are occasionally encountered.

Figure 10: New words acquired by NPYCRF: (a) MSR (Simplified Chinese); (b) Twitter (Japanese). In each figure, the left column shows words that did not appear in the provided labeled data, and the right column shows the frequencies with which NPYCRF recognized them in the test data. In Chinese, we found many proper names, including company and person names; in Japanese, we found many novel slang words and proper names.

Experimental results
Chinese Tables 2 and 3 show IV (in-vocabulary) and OOV (out-of-vocabulary) precision and F-measure, computed against segmented tokens. The results on standard newspaper text indicate that NPYCRF is basically comparable in performance to state-of-the-art supervised neural networks (Chen et al., 2015; Zhang et al., 2016), which require hand tuning of hyperparameters or model architectures. Figure 10 shows some of the learned words in the test set of the Bakeoff MSR corpus. As shown in Table 3, NPYCRF also yields higher precision than supervised learning on non-standard text like Weibo, which is the main objective of this study. In contrast to ordinary supervised learning, NPYCRF effectively learns many "new words" from the large amount of unlabeled data thanks to the generative model, while observing human standards of segmentation thanks to the discriminative model. Note that in Weibo segmentation, complete supervision is not available in practice. In fact, we realized that the Weibo segmentations were given automatically by an existing classifier and contain many inappropriate segmentations, while NPYCRF finds much "better" segmentations.

Figure 11: Example segmentations of the SIGHAN Bakeoff MSR dataset made with the supervised (CRF), unsupervised (NPYLM), and semi-supervised (NPYCRF) models, compared with the gold segmentations (Gold). " " is a proverb and " " is the full name of a person.

Figure 11 compares the results of CRF, NPYLM, and NPYCRF with the gold segmentation. While proverbs like " " (wide vision without action) are correctly captured from the unlabeled data by NPYLM, they are sometimes broken by the CRF through the integration. In another case, the name of a person is properly connected thanks to the information provided by the CRF. This comparison shows that there is still room for improvement in NPYCRF; Section 6 discusses future research directions for such improvements.
Japanese and Thai Figure 12 shows an example of the analysis of Japanese Twitter text. Shaded words are those that are not contained in the labeled data (BCCWJ core) but were found by NPYCRF. Many segmentations, including new words, are correct. We expect that NPYCRF would perform even better with more unlabeled data, which is easily obtained.
Figure 12: Samples of NPYCRF segmentation of Twitter text in Japanese that is difficult to analyze with ordinary supervised segmentation. It contains many novel words, emoticons, and colloquial expressions that are not contained in the BCCWJ core text (shaded).

Tables 4 and 5 show the segmentation accuracies on the Twitter data in Japanese and the novel data in
Thai. While there are no publicly available results for these data (the InterBEST test set was closed during the competition), NPYCRF achieved better accuracy than vanilla supervised segmentation based on a CRF. Considering that many new words were found in Figure 12, for example, we believe NPYCRF is quite competitive thanks to its ability to learn an infinite vocabulary, which it inherits from NPYLM.

Analysis
As shown in Figure 11, NPYCRF makes good use of NPYLM but sometimes ignores its predictions by falling back to the CRF, yielding suboptimal performance. This is mainly because the geometric interpolation weight λ_0 is constant and does not vary with the input. For example, even if the substring to segment is very rare in the labeled data, NPYCRF trusts the supervised classifier (CRF) at a constant rate of 1/(1 + λ_0) in the log-probability domain. To alleviate this problem, it is necessary to make λ_0 depend on the input string within the log-linear framework. While this might be achieved through the density ratio estimation framework (Sugiyama et al., 2012; Tsuboi et al., 2009), we believe it is a general problem of semi-supervised learning and is beyond the scope of this paper.
This issue also affects the estimation of λ_0 as a scalar: we found that λ_0 often fluctuates during training, because Λ (which includes λ_0) is estimated using only the limited labeled data X_l, Y_l. In practice, we therefore terminated the EM algorithm in Figure 9 after a few iterations. With a more adaptive semi-supervised learning framework, we expect that NPYCRF would achieve even higher accuracy than its current performance.

Conclusion
In this paper, we presented a hybrid generative/discriminative model of word segmentation that leverages a nonparametric Bayesian model for unsupervised segmentation. By combining a CRF and NPYLM within the semi-supervised framework of JESS-CM, our NPYCRF not only works as well as state-of-the-art neural segmentation on standard corpora without hand tuning of hyperparameters, but also appropriately segments non-standard texts found in Twitter and Weibo, for example, by automatically finding "new words" thanks to a nonparametric model of infinite vocabulary.
We believe that our model lays the foundation for a methodology that combines nonparametric Bayesian models with discriminative classifiers, and that it provides an example of semi-supervised learning on different model structures, i.e., Markov and semi-Markov models, for word segmentation.