Modeling Word Forms Using Latent Underlying Morphs and Phonology

The observed pronunciations or spellings of words are often explained as arising from the “underlying forms” of their morphemes. These forms are latent strings that linguists try to reconstruct by hand. We propose to reconstruct them automatically at scale, enabling generalization to new words. Given some surface word types of a concatenative language along with the abstract morpheme sequences that they express, we show how to recover consistent underlying forms for these morphemes, together with the (stochastic) phonology that maps each concatenation of underlying forms to a surface form. Our technique involves loopy belief propagation in a natural directed graphical model whose variables are unknown strings and whose conditional distributions are encoded as finite-state machines with trainable weights. We define training and evaluation paradigms for the task of surface word prediction, and report results on subsets of 7 languages.

Fortunately, languages are systematic. The realization of a morpheme may vary by context but is largely predictable from context, in a way that generalizes across morphemes. In fact, generative linguists traditionally posit that each morpheme of a language has a single representation shared across all contexts (Jakobson, 1948;Kenstowicz and Kisseberth, 1979, chapter 6). However, this string is a latent variable that is never observed. Variation appears when the phonology of the language maps these underlying representations (URs)-in context-to surface representations (SRs) that may be easier to pronounce. The phonology is usually described by a grammar that may consist of either rewrite rules (Chomsky and Halle, 1968) or ranked constraints (Prince and Smolensky, 2004).
We will review this framework in section 2. The upshot is that the observed words in a language are supposed to be explainable in terms of a smaller underlying lexicon of morphemes, plus a phonology. Our goal in this paper is to recover the lexicon and phonology (enabling generalization to new words). This is difficult even when we are told which morphemes are expressed by each word, because the unknown underlying forms of the morphemes must cooperate properly with one another and with the unknown phonological rules to produce the observed results. Because of these interactions, we must reconstruct everything jointly. We regard this as a problem of inference in a directed graphical model, as sketched in Figure 1. This is a natural problem for computational linguistics. Phonology students are trained to puzzle out solutions for small datasets by hand. Children apparently solve it at the scale of an entire language. Phonologists would like to have grammars for many languages, not just to study each language but also to understand universal principles and differences among related languages. Automatic procedures would recover such grammars. They would also allow comprehensive evaluation and comparison of different phonological theories (i.e., what inductive biases are useful?), and would suggest models of human language learning.
Solving this problem is also practically important for NLP. What we recover is a model that can generate and help analyze novel word forms, 1 which abound in morphologically complex languages. Our approach is designed to model surface pronunciations (as needed for text-to-speech and ASR). It might also be applied in practice

4) Word Observations
Figure 1: Our model as a Bayesian network, in which surface forms arise from applying phonology to a concatenation of underlying forms. Shaded nodes show the observed surface forms for four words: resignation, resigns, damns, and damnation.
The graphical model encodes their morphological relationships using latent forms. Each morpheme UR at layer 1 is generated by the lexicon model Mφ (a probabilistic finite-state automaton). These are concatenated into various word URs at layer 2. Each SR at layer 3 is generated using the phonology model S θ (a probabilistic finite-state transducer). Layer 4 derives observable phonetic forms from layer 3. This deletes unpronounced symbols such as syllable boundaries, and translates the phonemes into an observed phonetic, articulatory, or acoustic representation. However, our present paper simply merges layers 3 and 4: our layer 3 does not currently make use of any unpronounced symbols (e.g., syllable boundaries) and we observe it directly.
to model surface spellings (as needed for MT on text). Good morphological analysis has been used to improve NLP tasks such as machine translation, parsing, and NER (Fraser et al., 2012;Hohensee and Bender, 2012;Yeniterzi, 2011). Using loopy belief propagation, this paper attacks larger-scale learning problems than prior work on this task (section 8). We also develop a new evaluation paradigm that examines how well an inferred grammar predicts held-out SRs. Unlike previous algorithms, we do not pre-restrict the possible URs for each morpheme to a small or structured finite set, but use weighted finite-state machines to reason about the infinite space of all strings. Our graphical model captures the standard assumption that each morpheme has a single UR, unlike some probabilistic learners. However, we do not try to learn traditional ordered rules or constraint rankings like previous methods. We just search directly for a probabilistic finite-state transducer that captures likely UR-to-SR mappings.

Formal Framework
We urge the reader to begin by examining Figure 1, which summarizes our modeling approach through an example. The upcoming sections then give a formal treatment with details and discussion. Section 2 describes the random variables in Figure 1's Bayesian network, while section 3 describes its conditional probability distributions. Sections 4-5 give inference and learning methods.
A morpheme is a lexical entry that pairs form with content (Saussure, 1916). Its form is a morph-a string of phonemes. Its content is a bundle of syntactic and/or semantic properties. 2 Note that in this paper, we are nonstandardly using "morph" to denote an underlying form. We assume that all underlying and surface representations can be encoded as strings, over respective alphabets Σ u and Σ s . This would be possible even for autosegmental representations (Kornai, 1995).
A language's phonological system thus consists of the following components. We denote each important set by a calligraphic letter. We use the corresponding uppercase letter to denote a function to that set, the corresponding lowercase letter as a variable that ranges over the set's elements, and a distinguished typeface for specific elements.
• A is a set of abstract morphemes such as quiz and plur$al. These are atoms, not strings. • M = Σ * u is the space of possible morphs: concrete UR strings such as /kwIz/ or /z/.
• M : A → M is the lexicon that maps each morpheme a to an underlying morph m = M (a). We will find M (a) for each a.
representations for words, such as /kwIz#z/. • U : M * → U combines morphs. A word is specified by a sequence of morphemes a = a 1 , a 2 , . . ., with concrete forms m i = M (a i ). That word's underlying form is then u = U (m 1 , m 2 , . . .) ∈ U.
• S = Σ * s is the space of surface representations for words, such as [kwIzIz]. • S : U → S is the phonology. It maps an underlying form u to its surface form s. We will find this function S along with M .
We assume in this paper that U simply concatenates the sequence of morphs, separating them by the morph boundary symbol #: u = U (m 1 , m 2 , . . .) = m 1 #m 2 # · · · . However, see section 4.3 for generalizations.
The overall system serves to map an (abstract) morpheme sequence a ∈ A * to a surface word s ∈ S. Crucially, S acts on the underlying form u of the entire word, not one morph at a time. Hence its effect on a morph may depend on context, as we saw for English pluralization. For example, S(/kwIz#s/) = [kwIzIz]-or if we were to apply our model to orthography, S(/quiz#s/) = [quizzes]. S produces a single well-formed surface form, which is not arbitrarily segmented as [quiz-zes] or [quizz-es] or [quizze-s].

Probability Model
Our goal is to reconstruct the lexicon M and morphophonology S for a given language. We therefore define prior probability distributions over them. (We assume Σ u , Σ s , A, U are given.) For each morpheme a ∈ A, we model the morph M (a) ∈ M as an IID sample from a probability distribution M φ (m). 3 This model describes what sort of underlying forms appear in the language's lexicon.
The phonology is probabilistic in a similar way. For a word with underlying form u ∈ U, we presume that the surface form S(u) is a sample from a conditional distribution S θ (s | u). This single sample appears in the lexical entry of the word type and is reused for all tokens of that word.
The parameter vectors φ and θ are specific to the language being generated. Thus, under our generative story, a language is created as follows: 1. Sample φ and θ from priors (see section 3.4). 2. For each a ∈ A, sample M (a) ∼ M φ . 3. Whenever a new abstract word a = a 1 , a 2 · · · must be pronounced for the first time, construct u as described in section 2, and sample S(u) ∼ S θ (· | u). Reuse this S(u) in future.
Note that we have not specified a probability distribution over abstract words a, since in this 3 See section 3.3 for a generalization to M φ (m | a). paper, these sequences will always be observed. Such a distribution might be influenced by the semantic and syntactic content of the morphemes. We would need it to recover the abstract words if they were unobserved, e.g., when analyzing novel word forms or attempting unsupervised training.

Discussion: Why probability?
A language's lexicon M and morphophonology S are deterministic, in that each morpheme has a single underlying form and each word has a single surface form. The point of the language-specific distributions M φ and S θ is to aid recovery of these forms by capturing regularities in M and S.
In particular, S θ constitutes a theory of the regular phonology of the language. Its high-probability sound changes are the "regular" ones, while irregularities and exceptions can be explained as occasional lower-probability choices. We prefer a theory S θ that has high likelihood, i.e., it assigns high probability (≈ 1) to each observed form s given its underlying u. In linguistic terms, we prefer predictive theories that require few exceptions.
In the linguistic community, the primary motivation for probabilistic models of phonology (Pierrehumbert, 2003) has been to explain "soft" phenomena: synchronic variation (Sankoff, 1978;Boersma and Hayes, 2001) or graded acceptability judgments on novel surface forms (Hayes and Wilson, 2008). These applications are orthogonal to our motivation, as we do not observe any variation or gradience in our present experiments. Fundamentally, we use probabilities to measure irregularity-which simply means unpredictability and is a matter of degree. Our objective function will quantitatively favor explanations that show greater regularity (Eisner, 2002b).
A probabilistic treatment also allows relatively simple learning methods (e.g., Boersma and Hayes (2001)) since inference never has to backtrack from a contradiction. Our method searches a continuous space of phonologies S θ , all of which are consistent with every mapping S. That is, we always have S θ (s | u) > 0 for all u, s, so our current guess of S θ is always capable of explaining the observed words, albeit perhaps with low probability. Our EM learner tunes S θ (and M φ ) so as to raise the probability of the observed surface forms, marginalizing over the reconstructed lexicon M of underlying forms. We do warn that EM can get stuck at a local optimum; random restarts and simulated annealing are ways to es- cape such low-likelihood solutions, much as backtracking escapes zero-likelihood solutions.

Mapping URs to SRs: The phonology S θ
We currently model S θ (s | u) as the probability that a left-to-right stochastic contextual edit process ( Figure 2) would edit u into s. This probability is a sum over all edit sequences that produce s from u-that is, all s-to-u alignments. Stochastic contextual edit processes were described by Cotterell et al. (2014). Such a process writes surface string s ∈ Σ * s while reading the underlying string u ∈ Σ * u . If the process has so far consumed some prefix of the input and produced some prefix of the output, it will next make a stochastic choice among 2|Σ s | + 1 possible edits. Edits of the form SUBST(c) or INSERT(c) (for c ∈ Σ s ) append c to the output string. Edits of the form SUBST(c) or DELETE will (also) consume the next input phoneme; if no input phonemes remain, the only possible edits are INSERT(c) or HALT. The stochastic choice of edit, given context, is governed by a conditional log-linear distribution with feature weight vector θ. The feature functions may look at a bounded amount of left and right input context, as well as left output context. Our feature functions are described in section 6.
Our normalized probabilities S θ (s | u) can be computed by a weighted finite-state transducer, a crucial computational property that we will exploit in section 4.2. As Cotterell et al. (2014) explain, the price is that our model is left/rightasymmetric. The inability to condition directly on the right output context arises from local normalization, just like "label bias" in maximum entropy Markov models (McCallum et al., 2000). With certain fancier approaches to modeling S θ , which we leave to future work, this effect could be mitigated while preserving the transducer property.

Generating URs: The lexicon model M φ
In our present experiments, we use a very simple lexicon model M φ , so that the burden falls on the phonology S θ to account for any language-specific regularities in surface forms. This corresponds to the "Richness of the Base" principle advocated by some phonologists (Prince and Smolensky, 2004), and seems to yield good generalization for us. Specifically, we say all URs of the same length have the same probability, and the length is geometrically distributed with mean (1/φ)−1. This is a 0-gram model with a single parameter φ ∈ (0, 1], It would be straightforward to experiment with other divisions of labor between the lexicon model and phonology model. A 1-gram model for M φ would also model which underlying phonemes are common in the lexicon. A 2-gram model would model the "underlying phonotactics" of morphs, though phonological processes would still be needed at morph boundaries. Such models are the probabilistic analogue of morpheme structure constraints. We could further generalize from M φ (m) to M φ (m | a), to allow the shape of the morph m to be influenced by a's content. For example, M φ (m | a) for English might describe how nouns tend to have underlying stress on the first syllable; similarly, M φ (m | a) for Arabic might capture the fact that underlying stems tend to consist of 3 consonants; and across languages, M φ (m | a) would prefer affixes to be short.
Note that we will always learn a language's M φ jointly with its actual lexicon M . Loosely speaking, the parameter vector φ is found from easily reconstructed URs in M ; then M φ serves as a prior that can help us reconstruct more difficult URs.

Prior Over the Parameters
For φ, which is a scalar under our 0-gram model, our prior is uniform over (0, 1]. We place a spherical Gaussian prior on the vector θ, with mean 0 and a variance σ 2 tuned by coarse grid search on dev data (see captions of Figures 3-4).
The Gaussian favors phonologies that are simple in the sense that they have few strongly weighted features. A grammar that refers once to the natural class of voiced consonants (section 6), which captures a generalization, is preferred to an equally descriptive grammar that refers separately to several specific voiced consonants. If it is hard to tell whether a change applies to round or back vowels (because these properties are strongly correlated in the training data), then the prior resists grammars that make an arbitrary choice. It prefers to "spread the blame" by giving half the weight to each feature. The change is still probable for round back vowels, and moderately probable for other vowels that are either round or back.

Inference
We are given a training set of surface word forms s that realize known abstract words a. We aim to reconstruct the underlying morphs m and words u, and predict new surface word forms s.

A Bayesian network
For fixed θ and φ, this task can be regarded as marginal inference in a Bayesian network (Pearl, 1988). Figure 1 displays part of a network that encodes the modeling assumptions of section 3. The nodes at layers 1, 2, and 3 of this network represent string-valued random variables in M, U, and S respectively. Each variable's distribution is conditioned on the values of its parents, if any.
Layer 1 represents the unknown M (a) for various a. Notice that each M (a) is softly constrained by the prior M φ , and also by its need to help produce various observed surface words via S θ . Each underlying word u at level 2 is a concatenation of its underlying morphs M (a i ) at level 1. Thus, the topology at levels 1-2 is given by supervision. We would have to learn this topology if the word's morphemes a i were not known.
Our approach captures the unbounded generative capacity of language. In contrast to Dreyer and Eisner (2009) (see section 8), we have defined a directed graphical model. Hence new unobserved descendants can be added without changing the posterior distribution over the existing variables. So our finite network can be viewed as a subgraph of an infinite graph. That is, we make no closed-vocabulary assumption, but implicitly include (and predict the surface forms of) any unobserved words that could result from combining morphemes, even morphemes not in our dataset.
While the present paper focuses on word types, we could extend the model to consider tokens as well. In Figure 1, each phonological surface type at layer 3 could be observed to generate zero or more noisy phonetic tokens at layer 4, in contexts that call for the morphemes expressed by that type.

Loopy belief propagation
The top two layers of Figure 1 include a long undirected cycle (involving all 8 nodes and all 8 edges shown). On such "loopy" graphical models, exact inference is in general uncomputable when the random variables are string-valued. However, Dreyer and Eisner (2009) showed how to substitute a popular approximate joint inference method, loopy belief propagation (Murphy et al., 1999).
Qualitatively, what does this do on Figure 1? 4 Let u denote the leftmost layer-2 node. Midway through loopy BP, u is not yet sure of its value, but is receiving suggestions from its neighbors. The stem UR immediately above u would like u to start with something like /rizajgn#/. 5 Meanwhile, the word SR immediately below u encourages u to be any UR that would have a high probability (under S θ ) of surfacing as [rEzIgn#eIS@n]. So u tries to meet both requirements, guessing that its value might be something like /rizajgn#eIS@n/ (the product of this string's scores under the two messages to u is relatively high). Now, for U to have produced something like /rizajgn#eIS@n/ by stemsuffix concatenation, the suffix's UR must have been something like /eIS@n/. u sends a message saying so to the third node in layer 1. This induces that node (the suffix UR) to inform the rightmost layer-2 node that it probably ends in /#eIS@n/ as well-and so forth, iterating until convergence.
Formally, the loopy BP algorithm iteratively updates messages and beliefs. Each is a function that scores possible strings (or string tuples). Dreyer and Eisner (2009)'s key insight is that these messages and beliefs can be represented using weighted finite-state machines (WFSMs), and furthermore, loopy BP can compute all of its updates using standard polytime finite-state constructions.

Discussion: The finite-state requirement
The above results hold when the "factors" that define the graphical model are themselves expressed as WFSMs. This is true in our model. The factors of section 4.1 correspond to the conditional distributions M φ , U , and S θ that respectively select values for nodes at layers 1, 2, and 3 given the values at their parents. As section 3 models these, for any φ and θ, we can represent M φ as a 1-tape WFSM (acceptor), U as a multi-tape WFSM, and S θ as a 2-tape WFSM (transducer). 6 Any other WFSMs could be substituted. We are on rather firm ground in restricting to finite-state (regular) models of S θ . The apparent regularity of natural-language phonology was first observed by Johnson (1972), so computational phonology has generally preferred grammar formalisms that compile into (unweighted) finite-state machines, whether the formalism is based on rewrite rules (Kaplan and Kay, 1994) or constraints (Eisner, 2002a;Riggle, 2004).
Similarly, U could be any finite-state relation, 7 not just concatenation as in section 2. Thus our framework could handle templatic morphology (Hulden, 2009), infixation, or circumfixation.
Although only regular factors are allowed in our graphical model, a loopy graphical model with multiple such factors can actually capture nonregular phenomena, for example by using auxiliary variables (Dreyer and Eisner, 2009, §3.4). Approximate inference then proceeds by loopy BP on this model. In particular, reduplication is not regular if unbounded, but we can adopt morphological doubling theory (Inkelas and Zoll, 2005) and model it by having U concatenate two copies of the same morph. During inference of URs, this morph exchanges messages with two substrings of the underlying word.
Similarly, we can manipulate the graphical model structure to encode cyclic phonology-i.e., concatenating a word SR with a derivational affix 6 M φ has a single state, with halt probability φ and the remaining probability 1 − φ divided among self-loop arcs labeled with the phonemes in Σu. U must concatenate k morphs by copying all of tape 1, then tape 2, etc., to tape k + 1: this is easily done using k + 1 states, and arcs of probability 1. Sθ is constructed as in Cotterell et al. (2014). 7 In general, a U factor enforces u = U (m1, . . . , m k ), so it is a degree-(k + 1) factor, represented by a (k + 1)-tape WFSM connecting these variables (Dreyer and Eisner, 2009). If one's finite-state library is limited to 2-tape WFSMs, then one can simulate any such U factor using (1) an auxiliary string variable π encoding the path through U , (2) a unary factor weighting π according to U , (3) a set of k + 1 binary factors relating π to each of u, m1, . . . , m k . It is even easier to handle the particular U used in this paper, which enforces u = m1# . . . #m k . Given this factor U 's incoming messages µ·→U , each being a 1-tape WFSM, compute its loopy BP outgoing messages µU→u = µm 1 →U # · · · #µm k →U and (e.g.) µU→m 2 = range(µu→U • ((µm 1 →U # × ) Σ * u (#µm 3 →U # · · · #µm k →U × ))).
UR and passing the result through S θ once again. An alternative is to encode this hierarchical structure into the word UR u, by encoding level-1 and level-2 boundaries with different symbols. A single application of S θ can treat these boundaries differently: for example, by implementing cyclic phonology as a composition of two transductions.

Loopy BP implementation details
Each loopy BP message to or from a random variable is a 1-tape WFSM (acceptor) that scores all possible values of that variable (given by the set M, U, or S: see section 2). We initialized each message to the uniform distribution. 8 We then updated the messages serially, alternating between upward and downward sweeps through the Bayesian network. After 10 iterations we stopped and computed the final belief at each variable. A complication is that a popular affix such as plur$al (/z/ in layer 1) receives messages from hundreds of words that realize that affix. Loopy BP obtains that affix's belief and outgoing messages by intersecting these WFSMs-which can lead to astronomically large results and runtimes. We address this with a simple pruning approximation where at each variable m, we dynamically restrict to a finite support set of plausible values for m.
We take this to be the union of the 20-best lists of all messages sent to m. 9 We then prune those messages so that they give weight 0 to all strings outside m's support set. As a result, m's outgoing messages and belief are also confined to its support set. Note that the support set is not handspecified, but determined automatically by taking the top hypotheses under the probability model. Improved approaches with no pruning are possible. After submitting this paper, we developed a penalized expectation propagation method . It dynamically approximates messages using log-linear functions (based on variable-order n-gram features) whose support is the entire space Σ * . We also developed a dual decomposition method (Peng et al., 2015), which if it converges, exactly recovers the single most 8 This is standard-although the uniform distribution over the space of strings is actually an improper distribution. It is expressed by a single-state WFSM whose arcs have weight 1. It can be shown that the beliefs are proper distributions after one iteration, though the upward messages may not be. 9 In general, we should update this support set dynamically as inference and learning improve the messages. But in our present experiments, that appears unnecessary, since the initial support set always appears to contain the "correct" UR. probable explanation of the data 10 given φ and θ.

Parameter Learning
We employ MAP-EM as the learning algorithm. The E-step is approximated by the loopy BP algorithm of section 4. The M-step takes the resulting beliefs, together with the prior of section 3.4, and uses them to reestimate the parameters θ and φ.
If we knew the true UR u k for each observed word type s k , we would just do supervised training of θ, using L-BFGS (Liu and Nocedal, 1989) to locally maximize θ's posterior log-probability ( k log S θ (s k | u k )) + log p prior (θ) Cotterell et al. (2014) give the natural dynamic programming algorithm to compute each summand and its gradient w.r.t. θ. The gradient is the difference between observed and expected feature vectors of the contextual edits (section 3.2), averaged over edit contexts in proportion to how many times those contexts were likely encountered. The latent alignment makes the objective non-concave.
In our EM setting, u k is not known. So our Mstep replaces log S θ (s k | u k ) with its expectation, where b k is the normalized belief about u k computed by the previous E-step. Since b k and S θ are both represented by WFSMs (with 1 and 2 tapes respectively), it is possible to compute this quantity and its gradient exactly, using finite-state composition in a secondorder expectation semiring (Li and Eisner, 2009). For speed, however, we currently prune b k back to the 5-best values of u k . This lets us use a simpler and faster approach: a weighted average over 5 runs of the Cotterell et al. (2014) algorithm.
Our asymptotic runtime benefits from the fact that our graphical model is directed (so our objective does not have to contrast with all other values of u k ) and the fact that S θ is locally normalized (so our objective does not have to contrast with all other values of s k for each u k ). In practice we are far faster than Dreyer and Eisner (2009).
We initialized the parameter vector θ to 0, except for setting the weight of the COPY feature (section 6) such that the probability of a COPY edit is 0.99 in every context other than end-of-string. This encourages URs to resemble their SRs.
To reestimate φ, the M-step does not need to use L-BFGS, for section 3.3's simple model of M φ BIGRAM(strident,strident) adjacent surface stridents BIGRAM ( ,uvular) surface uvular EDIT ([s] and uniform prior over φ ∈ (0, 1]. It simply sets φ = 1/( + 1) where is the average expected length of a UR according to the previous E-step. The expected length of each u k is extracted from the WFSM for the belief b k , using dynamic programming (Li and Eisner, 2009). We initialized φ to 0.1; experiments on development data suggested that the choice of initializer had little effect.

Features of the Phonology Model
Our stochastic edit process S θ (s | u) assigns a probability to each possible u-to-s edit sequence. This edit sequence corresponds to a character-wise alignment of u to s. Our features for modeling the contextual probability of each edit are loosely inspired by constraints from Harmonic Grammar and Optimality Theory (Smolensky and Legendre, 2006). Such constraints similarly evaluate a u-tos alignment (or "correspondence"). They are traditionally divided into markedness constraints that encourage a well-formed s, and faithfulness constraints that encourage phonemes of s to resemble their aligned phonemes in u. Our EDIT faithfulness features evaluate an edit's (input, output) phoneme pair. Our BIGRAM markedness features evaluate an edit that emits a new phoneme of s. They evaluate the surface bigram it forms with the previous output phoneme. 11 Table 1 shows example features. Notice that these features back off to various natural classes of phonemes (Clements and Hume, 1995).
These features of an edit need to examine at most (0,1,1) phonemes of (left input, right input, left output) context respectively (see Figure 2). So the PFST that implements S θ should be able to use what Cotterell et al. (2014) calls a (0,1,1) topology. However, we actually used a (0,2,1) topology, to allow features that also look at the "upcoming" input phoneme that immediately follows the edit's input (/@/ in Figure 2). Specifically, for each natural class, we also included contextual versions of each EDIT or BIGRAM feature, which fired only if the "upcoming" input phoneme fell in that natural class. Contextual BIGRAM features are our approximation to surface trigram features that look at the edit's output phoneme together with the previous and next output phonemes. (A PFST cannot condition its edit probabilities on the next output phoneme because that has not been generated yet-see section 3.2-so we are using the upcoming input phoneme as a proxy.) Contextual EDIT features were cheap to add once we were using a (0,2,1) topology, and in fact they turned out to be helpful for capturing processes such as Catalan's deletion of the underlyingly final consonant.
Finally, we included a COPY feature that fires on any edit where surface and underlying phonemes are exactly equal. (This feature resembles Optimality Theory's IDENT-IO constraint, and ends up getting the strongest weight.) In total, our model has roughly 50,000 binary features.
Many improvements to this basic feature set would be possible in future. We cannot currently express implications such as "adjacent obstruents must also agree in voicing," "a vowel that surfaces must preserve its height," or "successive vowels must also agree in height." We also have not yet designed features that are sensitive to surface prosodic boundaries or underlying morph boundaries. (Prosodic structure and autosegmental tiers are absent from our current representations, and we currently simplify the stochastic edit process's feature set by having S θ erase the morph boundaries # before applying that process.) Our standard prior over θ (section 3.4) resists overfitting in a generic way, by favoring phonologies that are "simple to describe." Linguistic improvements are possible here as well. The prior should arguably discourage positive weights more than negative ones, since most of our features detect constraint violations that ordinarily reduce probability. It should also be adjusted to mitigate the current structural bias against deletion edits, which arises because the single deletion possible in a context must compete on equal footing with |Σ s | insertions and |Σ s | − 1 substitutions. More ambitiously, a linguistically plausible prior should prefer phonologies that are conservative (s ≈ u) and have low conditional entropies H(s | u), H(u | s) to facilitate communication.

Experimental Design
We objectively evaluate our learner on its ability to predict held-out surface forms. This blind testing differs from traditional practice by linguists, who evaluate a manual or automatic analysis (= URs + phonology) on whether it describes the full dataset in a "natural" way that captures "appropriate" generalizations. We avoid such theory-internal evaluation by simply quantifying whether the learner's analysis does generalize (Eisner, 2015).
To avoid tailoring to our training/test data, we developed our method, code, features, and hyperparameters using only two development languages, English and German. Thus, our learner was not engineered to do well on the other 5 languages below: the graphs below show its first attempt to learn those languages. We do also evaluate our learners on English and German, using separate training/test data.
We provide all our data (including citations, development data, training-test splits, and natural classes) at http://hubal.cs.jhu.edu/ tacl2015/, along with brief sketches of the phonological phenomena in the datasets, the "gold" stem URs we assumed for evaluation, and our learner's predictions and error patterns.

Evaluation methodology
Given a probability distribution p over surface word types of a language, we sample a training set of N types without replacement. This simulates reading text until we have seen N distinct types. For each of these frequent words, we observe the SR s and the morpheme sequence a.
After training our model, we evaluate its beliefs b about the SRs s on a disjoint set of test words whose a are observed. To improve interpretability of the results, we limit the test words to those whose morphemes have all appeared at least once in the training set. (Any method would presumably get other words badly wrong, just as it would tend to get the training words right; we exclude both.) To evaluate our belief b about the SR of a test word ( a, s * ), we use three measures for which "smaller is better." First, 0-1 loss asks whether s * = argmax s b(s). This could be compared with non-probabilistic predictors. Second, the surprisal − log 2 b(s * ) is low if the model finds it plausible that s * realizes a. If so, this holds out promise for future work on analyzing or learning from unannotated tokens of s * . Third, we evaluate the whole We take the average of each measure over test words, weighting those words according to p. This yields our three reported metrics: 1-best error rate, cross-entropy, and expected edit distance. Each metric is the expected value of some measure on a random test token.
These metrics are actually random variables, since they depend on the randomly sampled training set and the resulting test distribution. We report the expectations of these random variables by running many training-test splits (see section 7.2).

Datasets
To test discovery of interesting patterns from limited data, we ran our learner on 5 "exercises" drawn from phonology textbooks (102 English nouns, 68 Maori verbs, 72 Catalan adjectives, 55 Tangale nouns, 44 Indonesian nouns), exhibiting a range of phenomena. In each case we took p to be the uniform distribution over the provided word types. We took N to be one less than the number of provided types. So to report our expected metrics, we ran all N + 1 experiments where we trained jointly on N forms and tested on the 1 remaining form. This is close to linguists' practice of fitting an analysis on the entire dataset, yet it is a fair test. There is no sampling error in these reported results, hence no need for error bars.
To test on larger, naturally occurring datasets, we ran our learner on subsets of the CELEX database (Baayen et al., 1995), which provides surface phonological forms and token counts for German, Dutch, and English words. For each language, we constructed a coherent subcorpus of 1000 nouns and verbs, focusing on inflections with common phonological phenomena. These turned out to involve mainly voicing: final obstru-ent devoicing (German 2nd-person present indicative verbs, German nominative singular nouns, Dutch infinitive verbs, Dutch singular nouns) and voicing assimilation (English past tense verbs, English plural nouns). We were restricted to relatively simple phenomena because our current representations are simple segmental strings that lack prosodic and autosegmental structure. In future we plan to consider stress, vowel harmony, and templatic morphology.
We constructed the distribution p in proportion to CELEX's token counts. In each language, we trained on N = 200, 400, 600, or 800 forms sampled from p. To estimate the expectation of each metric over all training sets of size N , we report the sample mean and bootstrap standard error over 10 random training sets of size N .
Except in Indonesian, every word happens to consist of ≤ 2 morphemes (a stem plus a possibly empty suffix). In all cases, we take the phoneme inventories Σ u and Σ s to be given as the set of all surface phonemes that appear in the full dataset.

Comparison systems
There do not appear to be previous systems that perform our generalization task. Therefore, we compared our own system against variants.
We performed an ablation study to determine whether the learned phonology was helpful. We substituted a simplified phonology model where S θ (s | u) just decays exponentially with the edit distance between s and u; the decay rate was learned by EM as usual. That is, this model uses only the COPY feature of section 6. This baseline system treats phonology as "noisy concatenation" of learned URs, not trying to model its regularity.
We considered an additional ablation study to determine whether the learned URs were helpful. However, we did not come up with a plausible ∈ training whose morphemes ∈ training. Meanwhile, the connected points permit comparison across the 4 values of N , by evaluating only on a common test set found by intersecting the 4 unconnected test sets. Each point estimates the metric's expectation over all ways of sampling the 4 training sets; specifically, we plot the sample mean from 10 such runs, with error bars showing a bootstrap estimate of the standard error of the mean. Non-overlapping error bars at a given N always happen to imply that the difference in the two methods' sample means is too extreme to be likely to have arisen by chance (paired permutation test, p < 0.05). Each time we evaluated some training-test split on some metric, we first tuned σ 2 (section 3.4) by a coarse grid search where we trained on the first 90% of the training set and evaluated on the remaining 10%.
heuristic for identifying URs in some simpler way. Thus, instead we asked whether the learned URs were as good as hand-constructed URs. Our "oracle" system was allowed to observe gold-standard URs for stems instead of inferring them. This system is still fallible: it must still infer the affix URs by belief propagation, and it must still use MAP-EM to estimate a phonology within our current model family S θ . Even with supervision, this family will still struggle to model many types of phonology, e.g., ablaut patterns (in Germanic strong verbs) and many stress-related phenomena.

Results
We graph our results in Figures 3 and 4. When given enough evidence, our method works quite well across the 7 datasets. For 94-98% of held-out words on the CELEX languages (when N = 800), and 77-100% on the phonological exercises, our method's top pick is the correct surface form. Further, the other metrics show that it places most of  Table 2: Percent of training words, weighted by the distribution p, whose 1-best recovered UR (including the boundary #) exactly matches the manual "gold" analysis. Results are averages over all runs (with N = 800 for the CELEX datasets).
its probability mass on that form, 12 and the rest on highly similar forms. Notably, our method's predictions are nearly as good as if gold stem URs had been supplied (the "oracle" condition). Indeed, it does tend to recover those gold URs (Table 2). Yet there are some residual errors in predicting the SRs. Our phonological learner cannot perfectly learn the UR-to-SR mapping even from many well-supervised pairs (the oracle condition). In the CELEX and Tangale datasets, this is partly due to irregularity in the language itself. However, error analysis suggests we also miss some generalizations due to the imperfections of our current S θ model (as discussed in sections 3.2 and 6).
When given less evidence, our method's performance is more sensitive to the training sample and is worse on average. This is expected: e.g., a stem's final consonant cannot be reconstructed if it was devoiced (German) or deleted (Maori) in all the training SRs. However, a contributing factor may be the increased error rate of the phonological learner, visible even with oracle data. Thus, we suspect that a S θ model with better generalization would improve our results at all training sizes. Note that weakening S θ -allowing only "noisy concatenation"-clearly harms the method, proving the need for true phonological modeling.

Related Work
We must impute the inputs to the phonological noisy channel S θ (URs) because we observe only the outputs (SRs). Other NLP problems of this form include unsupervised text normalization (Yang and Eisenstein, 2013), unsupervised training of HMMs (Christodoulopoulos et al., 2010), and particularly unsupervised lexicon acquisition from phonological data (Elsner et al., 2012). However, unlike these studies, we currently use some indirect supervision-we know each SR's morpheme sequence, though not the actual morphs. Jarosz (2013, §2) and Tesar (2014, chapters 5-6) review work on learning the phonology S θ . Phonologists pioneered stochastic-gradient and passive-aggressive training methods-the Gradual Learning Algorithm (Boersma, 1998) and Error-Driven Constraint Demotion (Tesar and Smolensky, 1998)-for structured prediction of the surface word s from the underlying word u. If s is not fully observed during training (layer 4 of Figure 1 is observed, not layer 3), then it can be imputed, a step known as Robust Interpretive Parsing.
Recent papers consider our setting where u = m 1 #m 2 # · · · is not observed either. The contrast analysis method (Tesar, 2004;Merchant, 2008) in effect uses constraint propagation (Dechter, 2003). That is, it serially eliminates variable values (describing aspects of the URs or the constraint ranking) that are provably incompatible with the data. Constraint propagation is an incomplete method that is not guaranteed to make all logical deductions. We use its probabilistic generalization, loopy belief propagation (Dechter et al., 2010)which is still approximate but can deal with noise and stochastic irregularity. A further improvement is that we work with string-valued variables, representing uncertainty using WFSMs; this lets us reason about URs of unknown length and unknown alignment to the SRs. (Tesar and Merchant instead used binary variables, one for each segmental feature in each UR-requiring the simplifying assumption that the URs are known except for their segmental features. They assume that SRs are annotated with morph boundaries and that the phonology only changes segmental features, never inserting or deleting segments.) On the other hand, Tesar and Merchant reason globally about the constraint ranking, whereas in this paper, we only locally improve the phonology-we use EM, rather than the full Bayesian approach that treats the parameters θ as variables within BP. Jarosz (2006) is closest to our work in that she uses EM, just as we do, to maximize the probability of observed surface forms whose constituent morphemes (but not morphs) are known. 13 Her model is a probabilistic analogue of Apoussidou (2006), who uses a latent-variable structured perceptron. A non-standard aspect of this model (defended by Pater et al. (2012)) is that a morpheme a can stochastically choose different morphs M (a) when it appears in different words. To obtain a single shared morph, one could penalize this distribution's entropy, driving it toward 0 as learning proceeds. Such an approach-which builds on a suggestion by Eisenstat (2009, §5.4)-would loosely resemble dual decomposition (Peng et al., 2015). Unlike our BP approach, it would maximize rather than marginalize over possible underlying morphs.
Our work has focused on scaling up inference. For the phonology S, the above papers learn the weights or rankings of just a few plausible constraints (or Jarosz (2006) learns a discrete distribution over all 5! = 120 rankings of 5 constraints), whereas we use S θ with roughly 50,000 constraints (features) to enable learning of unknown languages. Our S also allows exceptions. The above papers also consider only very restricted sets of morphs, either identifying a small set of plausible morphs or prohibiting segmental insertion/deletion. We use finite-state methods so that it is possible to consider the space Σ * u of all strings.