Unsupervised Learning of Morphological Forests

This paper focuses on unsupervised modeling of morphological families, collectively comprising a forest over the language vocabulary. This formulation enables us to capture edge-wise properties reflecting single-step morphological derivations, along with global distributional properties of the entire forest. These global properties constrain the size of the affix set and encourage formation of tight morphological families. The resulting objective is solved using Integer Linear Programming (ILP) paired with contrastive estimation. We train the model by alternating between optimizing the local log-linear model and the global ILP objective. We evaluate our system on three tasks: root detection, clustering of morphological families, and segmentation. Our experiments demonstrate that our model yields consistent gains in all three tasks compared with the best published results.


Introduction
The morphological study of a language inherently draws upon the existence of families of related words. All words within a family can be derived from a common root via a series of transformations, whether inflectional or derivational. Figure 1 depicts one such family, originating from the word faith. This representation can benefit a range of applications, including segmentation, root detection, and clustering of morphological families. Code is available at https://github.com/j-luo93/MorphForest.

Figure 1: An illustration of a single tree in a morphological forest. pre and suf represent prefixation and suffixation. Each edge has an associated probability for the morphological change.
Using graph terminology, a full morphological assignment of the words in a language can be represented as a forest. Valid forests of morphological families exhibit a number of well-known regularities. At the global level, the number of roots is limited, constituting only a small fraction of the vocabulary. A similar constraint applies to the number of possible affixes, which are shared across families. At the local edge level, we prefer derivations that follow regular orthographic patterns and preserve semantic relatedness. We hypothesize that enforcing these constraints as part of the forest induction process will allow us to accurately learn morphological structures in an unsupervised fashion.
To test this hypothesis, we define an objective over the entire forest representation. The proposed objective is designed to maximize the likelihood of local derivations, while constraining the overall number of affixes and encouraging tighter morphological families. We optimize this objective using integer linear programming (ILP), which is commonly employed to handle global constraints. While prior work has typically employed ILP in supervised settings, we explore its effectiveness in unsupervised learning. We induce a forest by alternating between learning local edge probabilities using a log-linear model, and enforcing global constraints with the ILP-based decoder. With each iteration, the model progresses towards more consistent forests. We evaluate our model on three tasks: root detection, clustering of morphologically related families, and segmentation. The last task has been extensively studied in the recent literature, providing us with the opportunity to compare the model with multiple unsupervised techniques. On benchmark datasets representing four languages, our model outperforms the baselines, yielding new state-of-the-art results. For instance, we improve segmentation performance on Turkish by 4.4% and on English by 3.7%, relative to the best published results (Narasimhan et al., 2015). Similarly, our model exhibits superior performance on the other two tasks. We also provide an analysis of the model behavior which reveals that most of the gain comes from enforcing global constraints on the number of unique affixes.

Related Work
Unsupervised morphological segmentation Most top performing algorithms for unsupervised segmentation today center around modeling single-step derivations (Poon et al., 2009; Naradowsky and Toutanova, 2011; Narasimhan et al., 2015). A commonly used log-linear formulation enables these models to consider a rich set of features, ranging from orthographic patterns to semantic relatedness.
However, these models generally bypass global constraints (Narasimhan et al., 2015) or require performing inference over very large spaces (Poon et al., 2009). As we show in our analysis (Section 5), this omission negatively affects model performance.
In contrast, earlier work focuses on modeling global morphological assignment, using generative probabilistic models (Creutz and Lagus, 2007;Snyder and Barzilay, 2008;Goldwater et al., 2009;Sirts and Goldwater, 2013). These models are inherently limited in their ability to incorporate diverse features that are effectively utilized by local discriminative models.
Our proposed approach attempts to combine the advantages of both approaches, by defining an objective that incorporates both levels of linguistic properties over the entire forest representation, and adopting an alternating training regime for optimization.
Graph-based representations in computational morphology Variants of a graph-based representation have been used to model various morphological phenomena (Dreyer and Eisner, 2009;Peng et al., 2015;Soricut and Och, 2015;Faruqui et al., 2016). The graph induction methods vary widely depending on the task and the available supervision. The distinctive feature of our work is the use of global constraints to guide the learning of local, edge-level derivations.
ILP for capturing global properties Integer Linear Programming has been successfully employed to capture global constraints across multiple applications such as information extraction (Roth and Yih, 2001), sentence compression (Clarke and Lapata, 2008), and textual entailment (Berant et al., 2011). In all of these applications, the ILP formulation is used with a supervised classifier. Our work demonstrates that this framework continues to be effective in an unsupervised setting, providing strong guidance for a local, unsupervised classifier.

Model
Our model considers a full morphological assignment for all the words in a language, representing it as a forest. Let F = (V, E) be a directed graph where each word corresponds to a node v ∈ V. A directed edge e = (v_c, v_p) ∈ E encodes a single morphological derivation from a parent word v_p to a child word v_c. Edges also reflect the type of the underlying derivation (e.g., prefixation) and carry an associated probability Pr(e). Note that the root of a tree is always marked with a self-directed edge (i.e., v_c = v_p) associated with the label stop. Figure 1 illustrates a single tree in the forest.
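As a concrete (purely illustrative) sketch of this representation, a tree like the one in Figure 1 can be stored as a map from each word to its parent edge; the words and probabilities below are made up for illustration and are not taken from the trained model.

```python
# A toy sketch of the forest representation: each word maps to
# (parent, edge label, edge probability). A root carries a
# self-directed edge labeled "stop". All values are illustrative.

forest = {
    "faith":      ("faith",    "stop", 0.91),  # root: self-directed edge
    "faithful":   ("faith",    "suf",  0.85),  # faith + -ful
    "faithless":  ("faith",    "suf",  0.80),  # faith + -less
    "unfaithful": ("faithful", "pre",  0.77),  # un- + faithful
}

def root_of(word, forest):
    """Follow parent edges until the self-edge (stop case) is reached."""
    while forest[word][0] != word:
        word = forest[word][0]
    return word
```

With this encoding, `root_of("unfaithful", forest)` walks unfaithful → faithful → faith and returns the root faith.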

Inducing morphological forests
We postulate that a valid assignment yields forests with the following properties:

1. Increased edge weights Edge weights reflect probabilities of single-step derivations based on local features, including orthographic patterns and semantic relatedness. This local information helps identify that the edge (painter, paint) should be preferred over (painter, pain), because -er is a valid suffix and paint is semantically closer to painter.
2. Minimized number of affixes Prior research has shown that local models tend to greatly overestimate the number of suffixes. For instance, the model of Narasimhan et al. (2015) produces 617 unique affixes when segmenting 10000 English words. Thus, we explicitly steer the model towards assignments with the smallest number of affixes.
3. Minimized number of roots relative to vocabulary size Similarly, the number of roots, and consequently the number of morphological families, is markedly smaller than the size of the vocabulary.
The first property is local in nature, while the last two are global and embody the principle of Minimum Description Length (MDL). Based on these properties, we formulate an objective function S(F) over a forest F:

S(F) = - Σ_{e ∈ E} log Pr(e) + α |Affix| + β |F| / |V|    (1)

where |·| denotes set cardinality, Affix = {a_k}, k = 1, ..., K, is the set of all affixes, and |F| is the number of trees in F. |E| and |V| are the sizes of the edge set and the vocabulary, respectively. The hyperparameters α and β capture the relative importance of the three terms.
By minimizing this objective, we encourage assignments with high edge probabilities (first term), while limiting the number of affixes and morphological families (second and third terms, respectively). This objective can also be viewed as a simple log-likelihood objective regularized by the last two terms in Equation (1).

Figure 2: Illustration of two chosen forest representations. The top forest has only one affix -s, but two roots {pain, paint}. As shown in the bottom forest, choosing the edge (paint, pain) instead of the self-edge (paint, paint) introduces another affix -t, while reducing the set of roots to just {pain}.
To illustrate the interaction between local and global constraints in this objective, consider the example in Figure 2. If the model selects a different edge, e.g., (paint, pain) instead, all the terms in Equation (1) will be affected.

Computing local probabilities
We now describe how to parameterize Pr(e), which captures the likelihood of a single-step morphological derivation between two words. Following prior work (Narasimhan et al., 2015), we model this probability using a log-linear model:

Pr(z | w) = exp(θ · φ(w, z)) / Σ_{z'} exp(θ · φ(w, z'))    (2)

where θ is the set of parameters to be learned, and φ(w, z) is the feature vector extracted from w and z. Each candidate z is a tuple (string, label), where label refers to the label of the potential edge. As a result, the marginal probability of a word w is

Pr(w) = Σ_{z ∈ C(w)} exp(θ · φ(w, z)) / Σ_{w' ∈ Σ*} Σ_{z ∈ C(w')} exp(θ · φ(w', z))    (3)

where C(w) is the set of candidates for w, and Σ* is the set of all possible strings. Computing the sum in the denominator is infeasible. Instead, we make use of contrastive estimation (Smith and Eisner, 2005), substituting Σ* with a (limited) set of neighbor strings N(w) that are orthographically close to w. This technique distributes the probability mass among neighboring words and forces the model to identify meaningful discriminative features. We obtain N(w) by transposing characters in w, following the method described in Narasimhan et al. (2015). For the forest over the set of nodes V, the log-likelihood loss function is then

L(θ) = - Σ_{w ∈ V} log Pr(w)

This objective can be minimized by gradient descent.
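The contrastive normalization above can be sketched in a few lines. This is a minimal illustration, not the released implementation: `score`, `log_prob_word`, and the toy feature function in the usage below are hypothetical names, and real feature extraction is far richer.

```python
import math

def score(theta, features):
    """Unnormalized log-linear score theta · phi(w, z)."""
    return sum(theta.get(f, 0.0) * v for f, v in features.items())

def log_prob_word(theta, w, candidates, neighbors, phi):
    """Contrastive estimate of log Pr(w): the candidate scores of w are
    normalized against those of w plus its orthographic neighbors N(w),
    instead of against all strings in Sigma*."""
    num = math.log(sum(math.exp(score(theta, phi(w, z)))
                       for z in candidates(w)))
    den = math.log(sum(math.exp(score(theta, phi(x, z)))
                       for x in neighbors(w) + [w] for z in candidates(x)))
    return num - den
```

For example, with a single suffix feature and the transposed neighbor "pianter" for "painter", the estimate assigns "painter" exactly half of the probability mass shared with its equally-scored neighbor.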

Space of Possible Candidates
We only consider assignments where the parent word is strictly shorter than the child to prevent cycles of length two or more. In addition to suffixation and prefixation, we also consider three types of transformations introduced in Goldwater and Johnson (2004): repetition, deletion, and modification. We also handle compounding, where two stems are combined to form a new word (e.g., football). One of these stems carries the main semantic meaning of the compound and is considered to be the parent of the word. Note that stems are not considered affixes, so this does not affect the affix list.
We allow parents to be words outside V , since many legitimate word forms might never appear in the corpus. For instance, if we have V = {painter, paints}, the optimal solution would add an unseen word paint to the forest, and choose edges (painter, paint) and (paints, paint).
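A candidate generator for the two most common transformation types might look like the sketch below. The function name is hypothetical; it covers only suffixation and prefixation, whereas the full model also handles repetition, deletion, modification, and compounding.

```python
def candidate_parents(word, suffixes, prefixes):
    """Enumerate (parent, label) candidates for a word. The parent is
    always strictly shorter than the child, which rules out cycles, and
    it need not appear in the vocabulary. Repetition, deletion,
    modification, and compounding would contribute further cases."""
    cands = [(word, "stop")]  # every word may be a root
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            cands.append((word[:-len(s)], "suf"))
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            cands.append((word[len(p):], "pre"))
    return cands
```

For "painter" with suffixes {-er, -s} and prefix {un-}, this yields the stop candidate plus (paint, suf), even if paint itself is unobserved.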
Features We use the same set of features shown to be effective in prior work (Narasimhan et al., 2015), including word vector similarity, beginning and ending character bigrams, word frequencies and affixes. Affix features are automatically extracted from the corpus based on string difference and are thresholded based on frequency. We also include an additional sibling feature that counts how many words are siblings of word w in its tree. Siblings are words that are derived from the same parent, e.g., faithful and faithless, both from the word faith.

ILP formulation
Minimizing the objective in Equation (1) is challenging because the second and third terms capture discrete global properties of the forest, which prevents us from performing gradient descent directly. Instead, we formulate this optimization problem as Integer Linear Programming (ILP), where these two terms can be cast as constraints.

For each child word v_i ∈ V, we have a bounded set of its candidate outgoing edges C(v_i) = {z_i^j}, the same set as defined in Section 3.2. Each edge is associated with a weight p_ij, computed as log Pr(z_i^j | v_i). Let x_ij be a binary variable that has value 1 if and only if z_i^j is chosen to be in the forest. Without loss of generality, we assume the first candidate edge is always the self-edge (or stop case), i.e., z_i^1 = (v_i, stop). We also use a set of binary variables {y_k} to indicate whether affix a_k is used at least once in F (i.e., required to explain a morphological change).

Now let us consider how to derive our ILP formulation using the notation above. Note that |F| is equal to the number of self-edges Σ_i x_i1, and that a valid forest satisfies |V| = |E|. Combining these pieces, we can rewrite the objective in Equation (1) and arrive at the following ILP formulation:

minimize    - Σ_{i,j} p_ij x_ij + α Σ_k y_k + (β / |V|) Σ_i x_i1
subject to  Σ_j x_ij = 1  for every v_i ∈ V    (4)
            x_ij ≤ y_k  whenever candidate z_i^j involves affix a_k    (5)
            x_ij ∈ {0, 1},  y_k ∈ {0, 1}

Constraint (4) states that exactly one of the candidate edges should be chosen for each word. Constraint (5) implies that we can only consider a candidate (and construct the corresponding edge) when the involved affix is used at least once in the forest representation.
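On a tiny vocabulary, the effect of this formulation can be illustrated with an exhaustive search over edge choices; a real system would hand the problem to an ILP solver such as Gurobi. The function below is a hypothetical stand-in, not the paper's decoder.

```python
import itertools

def best_forest(cands, alpha, beta):
    """Brute-force stand-in for the ILP decoder: choose exactly one
    candidate edge per word (constraint 4), minimizing
    -sum(edge log-probs) + alpha * |affixes used| + beta * |roots| / |V|.
    Each candidate is (parent, affix-or-None, log-prob); a self-parent
    encodes the stop case."""
    words = list(cands)
    best, best_cost = None, float("inf")
    for choice in itertools.product(*(cands[w] for w in words)):
        logp = sum(c[2] for c in choice)
        affixes = {c[1] for c in choice if c[1] is not None}
        roots = sum(1 for w, c in zip(words, choice) if c[0] == w)
        cost = -logp + alpha * len(affixes) + beta * roots / len(words)
        if cost < best_cost:
            best, best_cost = dict(zip(words, choice)), cost
    return best
```

On the {pain, paint, paints} example of Figure 2, the affix penalty makes the decoder attach paints to paint via -s while stopping at paint, rather than paying for an extra -t affix.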

Alternating training
The objective function contains two sets of parameters: a continuous weight vector θ that parameterizes edge probabilities, and the binary ILP variables {x_ij} and {y_k}. Because the objective mixes continuous and discrete variables, we optimize it in an alternating manner. Algorithm 1 details the training procedure. After automatically extracting affixes from the corpus, we alternate between learning the local edge probabilities (line 3) and solving the ILP (line 4).
The feedback from solving the ILP with the global constraints can help us refine the learning of local probabilities by removing incorrect affixes (line 5). For instance, automatic extraction based on frequencies can include -ers as an English suffix. This is likely to be eliminated by the ILP, since all occurrences of -ers can be explained away without adding a new affix by concatenating -er and -s, two very common suffixes. After refining the affix set, we remove all candidates that involve any affix discarded by the ILP. This corresponds to reducing the size of C(w) in Equation (3). We then train the log-linear model again using the newly-pruned candidate set. By doing so, we force the model to learn from better contrastive signals, and focus on affixes of higher quality, resulting in a new set of probabilities {p_ij}. This procedure is repeated until no more affixes are rejected.
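The alternating loop can be sketched as follows, under the assumption that the two optimization steps are provided as callables; `fit_theta` and `solve_ilp` are hypothetical placeholders for gradient descent on the log-linear model and the ILP decoder, respectively.

```python
def train(V, affixes, fit_theta, solve_ilp, n_iters=10):
    """Sketch of the alternating regime: fit the local log-linear model,
    solve the global ILP, then discard affixes the ILP never used and
    retrain on the pruned candidate space. Stops once no affix is
    rejected (or after n_iters rounds)."""
    forest, theta = None, None
    for _ in range(n_iters):
        theta = fit_theta(V, affixes)                # local step
        forest, used = solve_ilp(V, affixes, theta)  # global step
        if used == affixes:                          # nothing rejected
            break
        affixes = used                               # prune and repeat
    return forest, theta, affixes
```

In the -ers example above, a first round would drop -ers from the affix set, and a second round would retrain on the pruned candidates and then terminate.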

Experiments
We evaluate our model on three tasks: segmentation, morphological family clustering, and root detection. While the first task has been extensively studied in the prior literature, we consider two additional tasks to assess the flexibility of the derived representation.

Morphological segmentation
Data We choose four languages with distinct morphological properties: English, Turkish, Arabic, and German. Our training data consists of standard datasets used in prior work. Statistics for all datasets are summarized in Table 1. Note that for the Arabic test set, we filtered out duplicate words, and we reran the baselines to obtain comparable results.
Following Narasimhan et al. (2015), we reduce noise by truncating the training word list to the top K frequent words. In addition, we train word vectors (Mikolov et al., 2013) to obtain cosine similarity features.
Baselines We compare our approach against the state-of-the-art unsupervised method of Narasimhan et al. (2015), which outperforms a number of alternative approaches (Creutz and Lagus, 2005; Virpioja et al., 2013; Sirts and Goldwater, 2013; Lee et al., 2011; Stallard et al., 2012; Poon et al., 2009). For this baseline, we report the results of the publicly available implementation of the technique (NBJ'15), as well as our own improved reimplementation (NBJ-Imp). Specifically, in NBJ-Imp we expanded the original algorithm to handle compounding, along with the sibling features described in Section 3.2, making it essentially an ablation of our model without ILP and alternating training. We employ grid search to find the optimal hyperparameter setting.

Algorithm 1 Morphological Forest Induction. Input: wordlist V. Output: forest representation of V. The algorithm alternates between training the log-linear model, solving the ILP to obtain indicators for the affixes and the forest (cf. Section 3.3), and pruning the affix set using the ILP output (PruneAffixSet, cf. Section 3.4), finally returning the forest F_T.

Table 1: Dataset statistics. Turkish data from Sak et al. (2008); Gigaword = Arabic Gigaword corpus (Parker et al., 2011); ATB = Arabic Treebank (Maamouri et al., 2003). Duplicates in the Arabic test set are filtered. Dsolve is the dataset released by Würzner and Jurish (2015); for training German vectors, we use the pre-processed Wikipedia dump from Al-Rfou et al. (2013).
We also include a supervised counterpart, which uses the same set of features as NBJ-Imp but has access to gold segmentation during training (we perform 5-fold cross-validation using the same data). We obtain the gold standard parent-child pairs required for training from the segmented words in a straightforward fashion.
Evaluation metric Following prior work (Virpioja et al., 2011), we evaluate all models using the standard boundary precision and recall (BPR). This measure assesses the accuracy of individual segmentation points, producing IR-style precision, recall, and F1 scores.

Training For unsupervised training, we use the gradient descent method ADAM (Kingma and Ba, 2014) and optimize over the whole batch of training words. We use a Gurobi solver for the ILP.
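As an illustration of the BPR metric, boundary positions can be compared directly between predicted and gold segmentations; the function below is a straightforward sketch of this standard measure, not the official evaluation script.

```python
def boundary_prf(pred_segs, gold_segs):
    """Boundary precision/recall/F1 (BPR): compare internal boundary
    positions of predicted vs. gold segmentations, aggregated over
    words. A segmentation is a list of morphs, e.g. ["paint", "er"]."""
    def boundaries(morphs):
        pos, out = 0, set()
        for m in morphs[:-1]:   # no boundary after the last morph
            pos += len(m)
            out.add(pos)
        return out
    tp = fp = fn = 0
    for pred, gold in zip(pred_segs, gold_segs):
        p, g = boundaries(pred), boundaries(gold)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For example, predicting paint-er and pain-ts against gold paint-er and paint-s yields one correct boundary, one spurious one, and one missed one, i.e., P = R = F1 = 0.5.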

Morphological family clustering
Morphological family clustering is the task of clustering morphologically related word forms. For instance, we want to group paint, paints and pain into two clusters: {paint, paints} and {pain}. To derive clusters from the forest representation, we assume that all the words in the same tree form a cluster.
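Reading clusters off the forest amounts to grouping words by their root; the sketch below assumes the parent-map encoding used earlier (a root points to itself) and is illustrative rather than the released code.

```python
def clusters_from_forest(parent):
    """Read morphological families off a forest: all words that share a
    root (i.e., belong to the same tree) form one cluster. `parent`
    maps each word to its parent, with parent[w] == w at a root."""
    def root(w):
        while parent[w] != w:
            w = parent[w]
        return w
    families = {}
    for w in parent:
        families.setdefault(root(w), set()).add(w)
    return list(families.values())
```

On the running example, {paint, paints} and {pain} come out as the two families.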
Data To obtain gold information about morphological clusters, we use CELEX (Baayen et al., 1993). Data statistics are summarized in Table 2. We remove words without stems from CELEX.

Baseline We compare our model against NBJ-Imp, described above. We select the best variant of our model and the base model based on their respective performance on the segmentation task.
Evaluation We use the metrics proposed by Schone and Jurafsky (2000). Specifically, let X_w and Y_w be the clusters for word w in our predictions and the gold standard, respectively. From the overlap between X_w and Y_w we compute the number of correct (C), inserted (I), and deleted (D) words, from which precision (C / (C + I)) and recall (C / (C + D)) follow.

Root detection
In addition, we evaluate how accurately our model can predict the root of any given word.
Data We report results on the Chipmunk dataset, which has been used for evaluating supervised models for root detection. Since our model is unsupervised, we report performance both on the test set only and on the entire dataset, combining the train/test split. Statistics for the dataset are shown in Table 3.

Results
In the following subsections, we report model performance on each of the three evaluation tasks. We used cosine similarity features in all experiments; however, the root forms of German verbs are rarely used outside imperative sentences, so they barely have trained word vectors, which contributes to the low recall value. We suspect that better treatment of word vectors can further improve the results.

Table 4: Segmentation results for the supervised model and three unsupervised models: the state-of-the-art system NBJ'15 (Narasimhan et al., 2015), our improved implementation of their system NBJ-Imp, and our model. For our model, we also report results with different feature combinations. +Sibl and +Comp refer to the addition of sibling and compounding features, respectively. Best hyperparameter values for the unsupervised baselines (NBJ'15, NBJ-Imp) are chosen via grid search, while for our model we use 10K words and the top 500 affixes throughout. * implies statistical significance with p < 0.05 against the NBJ-Imp model using the sign test (http://www.mathcracker.com/sign-test).

Segmentation
From Table 4, we observe that our model consistently outperforms the baselines on all four languages. Compared to NBJ'15, our model has a higher F1 score by 3.7%, 4.4%, 2.9% and 27.7% on English, Turkish, Arabic and German, respectively. While the improved implementation NBJ-Imp benefits from the addition of compounding and sibling features, our model still delivers an absolute increase in F1 score, ranging from 1.8% to 7.7% over NBJ-Imp. Note that our model achieves higher scores even without tuning the threshold K or the number of affixes, whereas the baselines use optimal hyperparameter settings found via grid search.
To understand the importance of global constraints (the last two terms of equation 1), we analyze our model's performance with different values of α and β (see Figure 3). The first constraint, which controls the size of the affix set, plays a more dominant role than the second. By setting α = 0.0, the model scores at best 75.7% on English and 63.2% on Turkish, lower than the baseline. While the value of β also affects the F1 score, its role is secondary in achieving optimal performance.
The results also demonstrate that language properties can greatly affect the choice of feature set. For fusional languages such as English, computing sibling features is unreliable. For example, two descendants of the same parent spot, namely spotless and spotty, may not necessarily be identified as such by a simple sibling computation algorithm, since they undergo different changes. In contrast, Turkish is highly agglutinative, with minimal (if any) transformations, but each word can have up to hundreds of related forms. Consequently, sibling features have different effects on English and Turkish, leading to changes of −0.3% and +2.1% in F1 score, respectively.
Understanding model behavior We find that much of the gain in model performance comes from the first two rounds of training. As Figure 4 shows, the improvement mainly stems from solving the ILP in the first round, followed by training the log-linear model in the second round after removing affixes and pruning candidate sets. This is exactly what we expect from the ILP formulation: to globally adjust the forest by reducing the number of unique affixes. We find this to be quite effective: in English, out of 500 prefixes, only 6 remain (de-, dis-, im-, in-, re-, and un-). Similarly, only 72 out of 500 suffixes survive.

Robustness We also investigate how robust our model is to the choice of hyperparameters. Figure 3 illustrates that we can obtain a sizable boost over the baseline by choosing α and β within a fairly wide region. Note that α takes on a much smaller value than β, to keep the two global terms (|Affix| and |F|/|V|) at comparable magnitudes. Narasimhan et al. (2015) observe that after including more than K = 10000 words, the performance of the unsupervised model drops noticeably. In contrast, our model handles training noise more robustly, with a steady boost, or at worst a modest drop, in performance as the training size increases (Figure 5). In fact, it scores 83.0% with K = 40000 on English, a 6.0% absolute increase over the baseline.

Table 5 shows examples of English words that our model segments correctly while NBJ'15 fails on them. We present them in three categories (top to bottom), based on the component of our model that contributes to the successful segmentation. The first category benefits from the refinement of the affix set, by removing noisy affixes such as -nce, -ch, and k-. This leads to correct stopping, as in the case of knuckle, or induction of the right suffix, as in divergence. Further, a smaller affix set also leads to more concentrated weights for the remaining affixes. For example, the feature weight for -ive jumps from 0.06 to 0.25, so that the derivation negative → negate is favored, as shown in the second category. Finally, the last category lists some compound words that our model successfully segments.

Morphological family clustering
We show the results for morphological family clustering in Table 6. For both languages, our model increases precision by a wide margin, with a modest boost for recall as well. This corroborates our findings in the segmentation task: our model can effectively remove incorrect affixes while still encouraging words to form tight, cohesive families.

Figure 5: Performance using bigger training sets. +Comp for English and +Sibl for Turkish. Dashed lines represent the best results for NBJ-Imp (with smaller training sets).

Table 5: Example segmentations, NBJ-Imp vs. our model: diverge-nce vs. diverg-ence; lur-ch vs. lurch; k-nuckle vs. knuckle; negative vs. negat-ive; junks vs. junk-s; unreserved vs. un-reserv-ed; gaslight-s vs. gas-light-s; watercourse-s vs. water-course-s; expressway vs. express-way.

Table 6: Results for morphological family clustering. P = precision, R = recall.

Root detection

Table 7 summarizes the results for the root detection task. Our model shows consistent improvements over the baseline on all three languages. We also include the results on the test set of two supervised systems: Morfette (Chrupala et al., 2008) and Chipmunk. Morfette is a string transducer, while Chipmunk is a segmenter. Both systems have access to morphologically annotated corpora.

Our model is quite competitive against Morfette; in fact, it achieves higher accuracy for English and Turkish. Compared with Chipmunk, our model scores 0.65 versus 0.70 on English, significantly bridging the gap. However, Chipmunk's markedly higher accuracy on morphologically complex languages such as Turkish and German suggests that unsupervised root detection remains a hard task.

Conclusions
In this work, we focus on unsupervised modeling of morphological families, collectively defining a forest over the language vocabulary. This formulation enables us to incorporate both local and global properties of morphological assignment. The resulting objective is solved using Integer Linear Programming (ILP) paired with contrastive estimation. Our experiments demonstrate that our model yields consistent gains on three morphological tasks compared with the best published results.
Acknowledgments

feedback. We are also grateful to anonymous reviewers for their insightful comments.