Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models

We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call anchor HMMs, assume that each tag is associated with at least one word that can have no other tag, which is a relatively benign condition for POS tagging (e.g., “the” is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong baselines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010). Furthermore, it produces an interpretable model in which hidden states are automatically lexicalized by words.


Introduction
Part-of-speech (POS) tagging without supervision is a quintessential problem in unsupervised learning for natural language processing (NLP). A major application of this task is reducing annotation cost: for instance, it can be used to produce rough syntactic annotations for a new language that has no labeled data, which can be subsequently refined by human annotators.
Hidden Markov models (HMMs) are a natural choice of model and have been a workhorse for this problem. Early works estimated vanilla HMMs
with standard unsupervised learning methods such as the expectation-maximization (EM) algorithm, but it quickly became clear that they performed very poorly in inducing POS tags (Merialdo, 1994). Later works improved upon vanilla HMMs by incorporating specific structures that are well-suited for the task, such as a sparse prior (Johnson, 2007) or a hard-clustering assumption (Brown et al., 1992).
In this work, we tackle unsupervised POS tagging with HMMs whose structure is deliberately suitable for POS tagging. These HMMs impose an assumption that each hidden state is associated with an observation state ("anchor word") that can appear under no other state. For this reason, we denote this class of restricted HMMs by anchor HMMs. Such an assumption is relatively benign for POS tagging; it is reasonable to assume that each POS tag has at least one word that occurs only under that tag. For example, in English, "the" is an anchor word for the determiner tag; "laughed" is an anchor word for the verb tag.
We build on the non-negative matrix factorization (NMF) framework of Arora et al. (2013) to derive a consistent estimator for anchor HMMs. We make several new contributions in the process. First, to our knowledge, there is no previous work directly building on this framework to address unsupervised sequence labeling. Second, we generalize the NMF-based learning algorithm to obtain extensions that are important for empirical performance (Table 1). Third, we perform extensive experiments on unsupervised POS tagging and report competitive results against strong baselines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010).
One characteristic of the approach is the immediate interpretability of inferred hidden states. Because each hidden state is associated with an observation, we can examine the set of such anchor observations to qualitatively evaluate the learned model. In our experiments on POS tagging, we find that anchor observations correspond to possible POS tags across different languages (Table 3). This property can be useful when we wish to develop a tagger for a new language that has no labeled data; we can label only the anchor words to achieve a complete labeling of the data.

This paper is structured as follows. In Section 2, we establish the notation we use throughout. In Section 3, we define the model family of anchor HMMs. In Section 4, we derive a matrix decomposition algorithm for estimating the parameters of an anchor HMM. In Section 5, we present our experiments on unsupervised POS tagging. In Section 6, we discuss related works.

Notation
We use [n] to denote the set of integers {1, . . . , n}. We use E[X] to denote the expected value of a random variable X. We define Δ^{m−1} := {v ∈ R^m : v_i ≥ 0 ∀i, Σ_i v_i = 1}, i.e., the (m−1)-dimensional probability simplex. Given a vector v ∈ R^m, we use diag(v) ∈ R^{m×m} to denote the diagonal matrix with v_1 . . . v_m on the main diagonal. Given a matrix M ∈ R^{n×m}, we write M_i ∈ R^m to denote the i-th row of M (as a column vector).

The Anchor Hidden Markov Model
Definition 3.1. An anchor HMM (A-HMM) is a 6-tuple (n, m, π, t, o, A) for positive integers n, m and functions π, t, o, A where
• [n] is a set of observation states.
• [m] is a set of hidden states.
• π(h) is the probability of generating h ∈ [m] in the first position of a sequence.
• t(h'|h) is the probability of generating h' ∈ [m] given h ∈ [m].
• o(x|h) is the probability of generating x ∈ [n] given h ∈ [m].
• A(h) := {x ∈ [n] : o(x|h) > 0 and o(x|h') = 0 ∀h' ≠ h} is non-empty for each h ∈ [m].
In other words, an A-HMM is an HMM in which each hidden state h is associated with at least one "anchor" observation state that can be generated by, and only by, h. Note that the anchor condition implies n ≥ m.
An equivalent definition of an A-HMM is given by organizing the parameters in matrix form. Under this definition, an A-HMM has parameters (π, T, O) where π ∈ R^m is a vector and T ∈ R^{m×m}, O ∈ R^{n×m} are matrices whose entries are set to:
• π_h = π(h) for h ∈ [m]
• T_{h',h} = t(h'|h) for h, h' ∈ [m]
• O_{x,h} = o(x|h) for x ∈ [n], h ∈ [m]
The anchor condition implies that rank(O) = m. To see this, consider the rows O_{a_1} . . . O_{a_m} where a_h ∈ A(h). Since each O_{a_h} has a single non-zero entry at the h-th index, these rows form a diagonal-like submatrix with positive entries, so the columns of O are linearly independent. We also assume rank(T) = m.
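As a quick sanity check of the matrix-form definition, the following numpy sketch (all numbers hypothetical) builds (π, T, O) for a toy A-HMM with n = 4 and m = 2 and verifies that the anchor rows force rank(O) = m:

```python
import numpy as np

# Toy A-HMM in matrix form (hypothetical numbers). Observation 0 is an
# anchor for hidden state 0 and observation 2 for hidden state 1: each
# anchor row of O has a single nonzero entry.
O = np.array([
    [0.6, 0.0],   # anchor for hidden state 0
    [0.3, 0.1],
    [0.0, 0.7],   # anchor for hidden state 1
    [0.1, 0.2],
])
T = np.array([[0.7, 0.4],
              [0.3, 0.6]])       # T[h', h] = t(h'|h); columns sum to 1
pi = np.array([0.5, 0.5])

# Columns of O and T are probability distributions.
assert np.allclose(O.sum(axis=0), 1.0) and np.allclose(T.sum(axis=0), 1.0)

# The anchor rows alone already span R^m, hence rank(O) = m.
anchors = [0, 2]
print(np.linalg.matrix_rank(O), np.linalg.matrix_rank(O[anchors]))  # 2 2
```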

Parameter Estimation for A-HMMs
We now derive an algorithm for learning A-HMMs. The algorithm reduces the learning problem to an instance of NMF from which the model parameters can be computed in closed-form.

NMF
We give a brief review of the NMF method of Arora et al. (2013). (Exact) NMF is the following problem: given an n × d matrix A = BC where B ∈ R^{n×m} and C ∈ R^{m×d} have non-negativity constraints, recover B and C. This problem is NP-hard in general (Vavasis, 2009), but Arora et al. (2013) provide an exact and efficient method when A has the following special structure:

Condition 4.1. A matrix A ∈ R^{n×d} satisfies this condition if A = BC for B ∈ R^{n×m} and C ∈ R^{m×d} where
1. For each x ∈ [n], B_x ∈ Δ^{m−1}. That is, each row of B defines a probability distribution over [m].
2. For each h ∈ [m], there is some a_h ∈ [n] such that B_{a_h,h} = 1 and B_{a_h,h'} = 0 for all h' ≠ h.
3. rank(C) = m.

Figure 1: Anchor-NMF. Input: A ∈ R^{n×d} satisfying Condition 4.1 with A = BC for some B ∈ R^{n×m} and C ∈ R^{m×d}, value m. Output: estimates of B and C whose rows match B and C up to a permutation ρ on [m].
Since rank(B) = rank(C) = m (by properties 2 and 3), the matrix A must have rank m. Note that by property 1, each row of A is given by a convex combination of the rows of C: for x ∈ [n],

A_x = Σ_{h ∈ [m]} B_{x,h} C_h

Furthermore, by property 2 each h ∈ [m] has an associated row a_h ∈ [n] such that A_{a_h} = C_h. These properties can be exploited to recover B and C. A concrete algorithm for factorizing a matrix satisfying Condition 4.1 is given in Figure 1 (Arora et al., 2013). It first identifies a_1 . . . a_m (up to some permutation) by greedily locating the row of A furthest away from the subspace spanned by the vertices selected so far. Then it recovers each B_x as the convex coefficients required to combine A_{a_1} . . . A_{a_m} to yield A_x. The latter computation (1) can be achieved with any constrained optimization method; we use the Frank-Wolfe algorithm (Frank and Wolfe, 1956). See Arora et al. (2013) for a proof of the correctness of this algorithm.
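The greedy vertex-selection step of Anchor-NMF can be sketched in numpy as follows. This is an illustrative implementation of the "row furthest from the span of the vertices chosen so far" idea, not the authors' exact code:

```python
import numpy as np

def find_anchors(A, m):
    """Greedy vertex selection (a sketch of the first phase of Anchor-NMF):
    repeatedly pick the row of A furthest from the linear span of the
    rows chosen so far, tracked via orthogonal residuals."""
    anchors = []
    R = A.astype(float).copy()          # residuals of rows w.r.t. the span
    for _ in range(m):
        norms = np.linalg.norm(R, axis=1)
        a = int(np.argmax(norms))       # furthest remaining row
        anchors.append(a)
        q = R[a] / norms[a]             # new orthonormal direction
        R = R - np.outer(R @ q, q)      # project all rows off this direction
    return anchors

# Demo: rows of C are two vertices; rows of A are convex combinations,
# with rows 0 and 3 equal to the vertices themselves.
C = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
B = np.array([[1.0, 0.0], [0.7, 0.3], [0.4, 0.6], [0.0, 1.0]])
A = B @ C
print(sorted(find_anchors(A, 2)))  # [0, 3]
```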

Random Variables
To derive our algorithm, we make use of certain random variables under the A-HMM. Let (X_1, . . . , X_N) ∈ [n]^N be a random sequence of N ≥ 2 observations drawn from the model, along with the corresponding hidden state sequence (H_1, . . . , H_N) ∈ [m]^N; independently, pick a position I ∈ [N − 1] uniformly at random. Let Y_I ∈ R^d be a d-dimensional vector which is conditionally independent of X_I given H_I, i.e., P(Y_I | H_I, X_I) = P(Y_I | H_I). We will show how to define such a variable in Section 4.4.1: Y_I will be a function of (X_1, . . . , X_N) serving as a "context" representation of X_I revealing the hidden state H_I.
Define unigram probabilities u^∞, u^1 ∈ R^n and bigram probabilities B ∈ R^{n×n} where

u^∞_x = P(X_I = x),  u^1_x = P(X_1 = x),  B_{x,x'} = P(X_I = x, X_{I+1} = x')

for x, x' ∈ [n]. Additionally, define π̄ ∈ R^m where

π̄_h = P(H_I = h) for h ∈ [m]

We assume π̄_h > 0 for all h ∈ [m].

Derivation of a Learning Algorithm
The following proposition provides a way to use the NMF algorithm in Figure 1 to recover the emission parameters O up to scaling.
Proposition 4.1. Let X_I ∈ [n] and Y_I ∈ R^d be respectively an observation and a context vector drawn from the random process described in Section 4.2. Define a matrix Ω ∈ R^{n×d} with rows

Ω_x = E[Y_I | X_I = x] for x ∈ [n]   (2)

If rank(Ω) = m, then Ω satisfies Condition 4.1 with the factorization

Ω = ÕΘ   (3)

where Õ_{x,h} = P(H_I = h | X_I = x) and Θ_h = E[Y_I | H_I = h].

Proof. For any x ∈ [n],

Ω_x = Σ_{h ∈ [m]} P(H_I = h | X_I = x) E[Y_I | H_I = h, X_I = x] = Σ_{h ∈ [m]} P(H_I = h | X_I = x) E[Y_I | H_I = h]

The last equality follows by the conditional independence of Y_I and X_I given H_I. This shows Ω = ÕΘ. Each row of Õ is a distribution over [m]; by the anchor assumption of the A-HMM, each h ∈ [m] has at least one x ∈ A(h) such that P(H_I = h | X_I = x) = 1; and rank(Ω) = m forces rank(Θ) = m. Thus Ω satisfies Condition 4.1.
A useful interpretation of Ω in Proposition 4.1 is that its rows Ω_1 . . . Ω_n are d-dimensional vector representations of observation states forming a convex hull in R^d. This convex hull has m vertices Ω_{a_1} . . . Ω_{a_m} corresponding to anchors a_h ∈ A(h), which can be convexly combined to realize all of Ω_1 . . . Ω_n.
Given Õ, we can recover the A-HMM parameters as follows. First, we recover the stationary state distribution π̄ as:

π̄_h = Σ_{x ∈ [n]} Õ_{x,h} · u^∞_x   (4)

The emission parameters O are given by Bayes' theorem:

O_{x,h} = Õ_{x,h} · u^∞_x / π̄_h   (5)

Using the fact that the emission probabilities are position-independent, we see that the initial state distribution π satisfies u^1 = Oπ. Thus π can be recovered as:

π = O⁺ u^1   (6)

Finally, the bigram probabilities satisfy B = O diag(π̄) T^⊤ O^⊤. Since all the involved matrices have rank m, we can directly solve for T:

T = O⁺ B^⊤ (O^⊤)⁺ diag(π̄)^{−1}   (7)

where O⁺ denotes the Moore-Penrose pseudoinverse of O.

Figure 2: Learn-Anchor-HMM. Input: Ω in Proposition 4.1, number of hidden states m, bigram probabilities B, unigram probabilities u^∞, u^1. The algorithm decomposes Ω with Anchor-NMF (Figure 1) and recovers the parameters via (4-7). Output: A-HMM parameters (π, T, O).

Figure 2 shows the complete algorithm. As input, it receives a matrix Ω satisfying Proposition 4.1, the number of hidden states, and the probabilities of observed unigrams and bigrams. It first decomposes Ω using the NMF algorithm in Figure 1. Then it computes the A-HMM parameters, whose solution is given analytically.
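The closed-form recovery above can be sketched in numpy. The toy parameters below are hypothetical, and the algorithm is fed exact population moments (rather than sample estimates) so that recovery is exact:

```python
import numpy as np

# Closed-form A-HMM parameter recovery (a sketch of eqs. 4-7) on exact
# population moments of a toy model. All numbers are hypothetical.
O_true = np.array([[0.6, 0.0], [0.3, 0.1], [0.0, 0.7], [0.1, 0.2]])
T_true = np.array([[0.7, 0.4], [0.3, 0.6]])          # T[h', h] = t(h'|h)

# Stationary state distribution: eigenvector of T with eigenvalue 1.
w, V = np.linalg.eig(T_true)
pi_bar_true = np.real(V[:, np.argmax(np.real(w))])
pi_bar_true /= pi_bar_true.sum()

u_inf = O_true @ pi_bar_true                  # u_inf[x] = P(X_I = x)
u_1 = O_true @ pi_bar_true                    # start at stationarity
Bmat = O_true @ np.diag(pi_bar_true) @ T_true.T @ O_true.T   # bigrams

# O_tilde[x, h] = P(H_I = h | X_I = x): the factor Anchor-NMF would output.
O_tilde = O_true * pi_bar_true[None, :] / u_inf[:, None]

# (4) pi_bar, (5) O by Bayes' theorem, (6) pi, (7) T.
pi_bar = O_tilde.T @ u_inf
O = O_tilde * u_inf[:, None] / pi_bar[None, :]
O_pinv = np.linalg.pinv(O)
pi = O_pinv @ u_1
T = O_pinv @ Bmat.T @ np.linalg.pinv(O.T) @ np.diag(1.0 / pi_bar)

assert np.allclose(O, O_true) and np.allclose(T, T_true)
```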
The following theorem guarantees the consistency of the algorithm.

Theorem 4.1. Let (π, T, O) be an A-HMM such that rank(T) = m and π̄_h > 0 for all h ∈ [m]. Given Ω in Proposition 4.1, the algorithm Learn-Anchor-HMM in Figure 2 outputs (π, T, O) up to a permutation on hidden states.

Proof. By Proposition 4.1, Ω satisfies Condition 4.1 with Ω = ÕΘ, thus Õ can be recovered up to a permutation on columns with the algorithm Anchor-NMF. The consistency of the recovered parameters follows from the correctness of (4-7) under the rank conditions.

Constrained Optimization for π and T
Note that (6) and (7) require computing the pseudoinverse of the estimated O, which can be expensive and vulnerable to sampling errors in practice. To make our parameter estimation more robust, we can explicitly impose probability constraints. We recover π by solving:

π = arg min_{π' ∈ Δ^{m−1}} ||u^1 − Oπ'||   (8)

which can again be done with algorithms such as Frank-Wolfe. We recover T by maximizing the log likelihood of observation bigrams:

Σ_{x,x'} B_{x,x'} log ( Σ_{h,h' ∈ [m]} π̄_h O_{x,h} T_{h',h} O_{x',h'} )   (9)

subject to the constraint that each column of T lies in Δ^{m−1}. Since (9) is concave in T with the other parameters O and π̄ fixed, we can use EM to find the global optimum.
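A minimal Frank-Wolfe sketch for the simplex-constrained recovery of π in (8); the toy data and the standard 2/(t+2) step size are illustrative choices:

```python
import numpy as np

def simplex_least_squares(O, u1, iters=500):
    """Frank-Wolfe sketch for (8): minimize ||u1 - O pi||^2 over the
    probability simplex. Each step moves toward the coordinate vertex
    that minimizes the linearized objective."""
    m = O.shape[1]
    pi = np.full(m, 1.0 / m)                  # start at the simplex center
    for t in range(iters):
        grad = 2.0 * O.T @ (O @ pi - u1)
        h = int(np.argmin(grad))              # best simplex vertex e_h
        step = 2.0 / (t + 2)
        pi = (1 - step) * pi                  # convex combination keeps
        pi[h] += step                         # pi on the simplex exactly
    return pi

# Toy emission matrix (hypothetical); recover pi from u1 = O @ pi.
O = np.array([[0.6, 0.0], [0.3, 0.1], [0.0, 0.7], [0.1, 0.2]])
pi_true = np.array([0.3, 0.7])
pi_hat = simplex_least_squares(O, O @ pi_true)
print(np.round(pi_hat, 3))
```

Because every update is a convex combination with a vertex, non-negativity and the sum-to-one constraint hold exactly at every iteration, with no projection step needed.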

Construction of the Convex Hull Ω
In this section, we provide several ways to construct a convex hull Ω satisfying Proposition 4.1.

Choice of the Context Y I
In order to satisfy Proposition 4.1, we need to define the context variable Y_I ∈ R^d with two properties:
1. Y_I is conditionally independent of X_I given H_I.
2. The matrix Θ ∈ R^{m×d} with rows Θ_h = E[Y_I | H_I = h] has rank m.
A simple construction (Arora et al., 2013) is given by defining Y_I ∈ R^n to be an indicator vector for the next observation:

[Y_I]_x = 1(X_{I+1} = x) for x ∈ [n]   (10)

The first condition is satisfied since X_{I+1} does not depend on X_I given H_I. For the second condition, observe that Ω_{x,x'} = P(X_{I+1} = x' | X_I = x), or in matrix form:

Ω = diag(u^∞)^{−1} B   (11)

Under the rank conditions in Theorem 4.1, (11) has rank m. More generally, we can let Y_I be an observation (encoded as an indicator vector as in (10)) randomly drawn from a window of L ∈ N nearby observations. We can either use only the identity of the chosen observation (in which case Y_I ∈ R^n) or additionally indicate its relative position in the window (in which case Y_I ∈ R^{nL}). It is straightforward to verify that the above two conditions are satisfied under these definitions. Clearly, (11) is a special case with L = 1.
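The bigram-based construction (11) amounts to row-normalizing an estimated bigram probability matrix. A sketch on a hypothetical integer-coded corpus:

```python
import numpy as np

# Estimate Omega = diag(u_inf)^{-1} B from bigram counts: the x-th row of
# Omega is the conditional distribution of the next observation given x.
# The corpus below is a tiny hypothetical set of integer-coded sequences.
corpus = [[0, 1, 2, 1], [2, 1, 0, 3], [0, 3, 2, 1]]
n = 4

bigram = np.zeros((n, n))
for seq in corpus:
    for x, x_next in zip(seq, seq[1:]):
        bigram[x, x_next] += 1

B = bigram / bigram.sum()       # B[x, x'] ~ P(X_I = x, X_{I+1} = x')
u_inf = B.sum(axis=1)           # u_inf[x] ~ P(X_I = x)
Omega = B / u_inf[:, None]      # Omega = diag(u_inf)^{-1} B

assert np.allclose(Omega.sum(axis=1), 1.0)   # each row is a distribution
```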

Reducing the Dimension of Ω x
With the definition of Ω in the previous section, the dimension of Ω_x is d = O(n), which can be difficult to work with when n ≫ m. Proposition 4.1 allows us to reduce the dimension as long as the final matrix retains the form in (3) and has rank m. In particular, we can multiply Ω by any rank-m projection matrix Π ∈ R^{d×m} on the right side: if Ω satisfies the properties in Proposition 4.1, then so does ΩΠ, with m-dimensional rows (ΩΠ)_x = Π^⊤ Ω_x. Since rank(Ω) = m, a natural choice of Π is the projection onto the best-fit m-dimensional subspace of the row space of Ω.
We mention that previous works on the NMF-learning framework have employed various projection methods, but they do not examine the relative merits of their choices. For instance, Arora et al. (2013) simply use random projection, which is convenient for theoretical analysis. Cohen and Collins (2014) use a projection based on canonical correlation analysis (CCA) without further exploration. In contrast, we give a full comparison of valid construction methods and find that the choice of Ω is crucial in practice.

Construction of Ω for the Brown Model
We can formulate an alternative way to construct a valid Ω when the model is further restricted to be a Brown model. Since every observation is an anchor, O_x ∈ R^m has a single nonzero entry for every x. Thus the rows defined by Ω_x = O_x / ||O_x|| (an indicator vector for the unique hidden state of x) form a trivial convex hull in which every point is a vertex. This corresponds to choosing an oracle context Y_I ∈ R^m where

[Y_I]_h = 1(H_I = h) for h ∈ [m]   (12)

It is possible to recover the Brown model parameters O up to element-wise scaling and rotation of rows using the algorithm of Stratos et al. (2015). More specifically, let f(O) ∈ R^{n×m} denote the output of their algorithm. Then they show that for some vector s ∈ R^m with strictly positive entries and an orthogonal matrix Q ∈ R^{m×m}:

f(O) = O^{1/4} diag(s) Q^⊤

where O^{1/4} is an element-wise exponentiation of O by 1/4. Since the rows of f(O) are simply some scaling and rotation of the rows of O^{1/4}, using Ω_x = f(O)_x / ||f(O)_x|| is a valid construction.

While we need to impose an additional assumption (the Brown model restriction) in order to justify this choice of Ω, we find in our experiments that it performs better than other alternatives. We speculate that this is because a Brown model is rather appropriate for the POS tagging task; many words are indeed unambiguous with respect to POS tags (Table 5). Also, the general effectiveness of f(O) for representational purposes has been demonstrated in previous works (Stratos et al., 2014; Stratos et al., 2015). By restricting the A-HMM to be a Brown model, we can piggyback on the proven effectiveness of f(O).

Figure 3: Algorithm for constructing a valid Ω with different construction methods, given bigram probabilities B, unigram probabilities u^∞, and the number of hidden states m. For simplicity, we only show the bigram construction (context size L = 1), but an extension for larger context (L > 1) is straightforward.

Figure 3 shows an algorithm for constructing Ω with these different construction methods. The construction methods random (random projection), best-fit (projection to the best-fit subspace), and cca (CCA projection) all compute (11) and differ only in how the dimension is reduced. The construction method brown computes the transformed Brown parameters f(O) as the left singular vectors of a scaled covariance matrix and then normalizes its rows. We direct the reader to Stratos et al. (2015) for a derivation of this calculation.
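A rough numpy sketch of the brown construction: scale the bigram matrix by its marginals, take the top-m left singular vectors, and normalize the rows. The exact scaling used in Figure 3 should be taken from Stratos et al. (2015); the version below is an illustrative assumption:

```python
import numpy as np

def omega_brown(B, m):
    """Sketch of the brown construction: scale the bigram probability
    matrix by its row/column marginals, take the top-m left singular
    vectors (the transformed Brown parameters up to scaling and
    rotation), and normalize each row to unit length."""
    u = B.sum(axis=1)                      # u[x] ~ P(X_I = x)
    v = B.sum(axis=0)                      # v[x] ~ P(X_{I+1} = x)
    scaled = B / np.sqrt(np.outer(u, v))   # element-wise scaling
    U, _, _ = np.linalg.svd(scaled)
    f_O = U[:, :m]
    return f_O / np.linalg.norm(f_O, axis=1, keepdims=True)

# Toy bigram matrix (hypothetical counts, normalized to probabilities).
counts = np.array([[0, 4, 1, 0], [3, 0, 0, 2],
                   [1, 0, 0, 4], [0, 3, 2, 0]], dtype=float)
B = counts / counts.sum()
Omega = omega_brown(B, 2)
assert np.allclose(np.linalg.norm(Omega, axis=1), 1.0)
```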

Ω with Feature Augmentation
The x-th row of Ω is a d-dimensional vector representation of x lying in a convex set with m vertices. This suggests a natural way to incorporate domainspecific features: we can add additional dimensions that provide information about hidden states from the surface form of x.
For instance, consider the POS tagging task. In the simple construction (11), the representation of word x is defined in terms of neighboring words x':

[Ω_x]_{x'} = P(X_{I+1} = x' | X_I = x) for x' ∈ [n]

We can augment this vector with s additional dimensions indicating the spelling features of x. For instance, the (n+1)-th dimension may be defined as:

[Ω_x]_{n+1} = 1(x ends in "ed")

where 1(·) ∈ {0, 1} is the indicator function. This value will generally be large for verbs and small for non-verbs, nudging verbs closer together and away from non-verbs. The modified (n+s)-dimensional representation is followed by the usual dimension reduction. Note that the spelling features are a deterministic function of a word, and we are implicitly assuming that they are independent of the word given its tag. While this is of course not true in practice, we find that these features can significantly boost the tagging performance.
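A sketch of the augmentation step: append weighted spelling-feature columns to the context representation before dimension reduction. The vocabulary, the feature set, and the 0.1 weight below are illustrative choices, not the exact configuration used in the experiments:

```python
import numpy as np

# Append spelling-feature dimensions to a context representation Omega,
# down-weighted relative to the original dimensions. All inputs are toy.
vocab = ["the", "laughed", "closed", "loss", "on"]
rng = np.random.default_rng(0)
Omega = rng.random((len(vocab), 5))        # stand-in for the rows in (11)

def spelling_features(word):
    """Hypothetical binary spelling features of a word."""
    return [word.endswith("ed"),
            word[0].isupper(),
            any(c.isdigit() for c in word)]

F = np.array([spelling_features(w) for w in vocab], dtype=float)

# Scale the extra columns so their norm is ~0.1 of the original norm.
weight = 0.1 * np.linalg.norm(Omega) / max(np.linalg.norm(F), 1e-12)
Omega_aug = np.hstack([Omega, weight * F])   # (n, 5 + s) representation

assert Omega_aug.shape == (5, 8)
```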

Experiments
We evaluate our A-HMM learning algorithm on the task of unsupervised POS tagging. The goal of this task is to induce the correct sequence of POS tags (hidden states) given a sequence of words (observation states). The anchor condition corresponds to assuming that each POS tag has at least one word that occurs only under that tag.

Background on Unsupervised POS Tagging
Unsupervised POS tagging has long been an active area of research (Smith and Eisner, 2005a;Johnson, 2007;Toutanova and Johnson, 2007;Haghighi and Klein, 2006;Berg-Kirkpatrick et al., 2010), but results on this task are complicated by varying assumptions and unclear evaluation metrics (Christodoulopoulos et al., 2010). Rather than addressing multiple alternatives for evaluating unsupervised POS tagging, we focus on a simple and widely used metric: many-to-one accuracy (i.e., we map each hidden state to the most frequently coinciding POS tag in the labeled data and compute the resulting accuracy).
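Many-to-one accuracy is straightforward to compute; a self-contained sketch with hypothetical predictions:

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(pred_states, gold_tags):
    """Map each hidden state to the gold POS tag it most frequently
    coincides with, then score the induced labeling."""
    by_state = defaultdict(Counter)
    for s, t in zip(pred_states, gold_tags):
        by_state[s][t] += 1
    mapping = {s: c.most_common(1)[0][0] for s, c in by_state.items()}
    correct = sum(mapping[s] == t for s, t in zip(pred_states, gold_tags))
    return correct / len(gold_tags)

# Hypothetical example: state 0 mostly aligns with DET, state 1 with NOUN.
pred = [0, 1, 1, 0, 1, 0]
gold = ["DET", "NOUN", "NOUN", "DET", "VERB", "DET"]
print(many_to_one_accuracy(pred, gold))  # 5/6 ≈ 0.833
```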

Better Model vs. Better Learning
Vanilla HMMs are notorious for their mediocre performance on this task, and it is well known that they perform poorly largely because of model misspecification, not because of suboptimal parameter estimation (e.g., because EM gets stuck in local optima). More generally, a large body of work points to the inappropriateness of simple generative models for unsupervised induction of linguistic structure (Merialdo, 1994;Smith and Eisner, 2005b;Liang and Klein, 2008).
Consequently, many works focus on using more expressive models such as log-linear models (Smith and Eisner, 2005a;Berg-Kirkpatrick et al., 2010) and Markov random fields (MRF) (Haghighi and Klein, 2006). These models are shown to deliver good performance even though learning is approximate. Thus one may question the value of having a consistent estimator for A-HMMs and Brown models in this work: if the model is wrong, what is the point of learning it accurately?
However, there is also ample evidence that HMMs are competitive for unsupervised POS induction when they incorporate domain-specific structures. Johnson (2007) is able to outperform the sophisticated MRF model of Haghighi and Klein (2006) on one-to-one accuracy by using a sparse prior in HMM estimation. The clustering method of Brown et al. (1992) which is based on optimizing the likelihood under the Brown model (a special case of HMM) remains a baseline difficult to outperform (Christodoulopoulos et al., 2010).
We add to this evidence by demonstrating the effectiveness of A-HMMs on this task. We also check the anchor assumption on data and show that the A-HMM model structure is in fact appropriate for the problem (Table 5).

Experimental Setting
We use the universal treebank dataset (version 2.0) which contains sentences annotated with 12 POS tag types for 10 languages (McDonald et al., 2013). All models are trained with 12 hidden states. We use the English portion to experiment with different hyperparameter configurations. At test time, we fix a configuration (based on the English portion) and apply it across all languages.
The list of compared methods is given below: BW The Baum-Welch algorithm, an EM algorithm for HMMs (Baum and Petrie, 1966).
CLUSTER A parameter estimation scheme for HMMs based on Brown clustering (Brown et al., 1992). We run the Brown clustering algorithm to obtain 12 word clusters C_1 . . . C_12. Then we set the emission parameters o(x|h), transition parameters t(h'|h), and prior π(h) to be the maximum-likelihood estimates under the fixed clusters.
ANCHOR Our algorithm Learn-Anchor-HMM in Figure 2, but with the constrained optimization (8) and (9) for estimating π and T.

ANCHOR-FEATURES Same as ANCHOR but employs the feature augmentation scheme described in Section 4.4.4.
LOG-LINEAR The unsupervised log-linear model described in Berg-Kirkpatrick et al. (2010). Instead of emission parameters o(x|h), the model maintains a miniature log-linear model with a weight vector w and a feature function φ. The probability of a word x given tag h is computed as

o(x|h) = exp(w^⊤ φ(x, h)) / Σ_{x' ∈ [n]} exp(w^⊤ φ(x', h))

The model can be trained by maximizing the likelihood of observed sequences. We use L-BFGS to directly optimize this objective. This approach obtains the current state-of-the-art accuracy on the fine-grained (45 tags) English WSJ dataset.
We use maximum marginal decoding for HMM predictions: i.e., at each position, we predict the most likely tag given the entire sentence.
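Maximum marginal decoding can be sketched with the standard forward-backward recursions. The toy parameters are hypothetical, and the sketch omits the per-position rescaling one would use for long sentences:

```python
import numpy as np

def marginal_decode(obs, pi, T, O):
    """Maximum marginal decoding sketch: run forward-backward and, at
    each position, predict the state with the highest posterior marginal.
    Conventions match the matrix-form A-HMM: T[h', h] = t(h'|h) and
    O[x, h] = o(x|h)."""
    N, m = len(obs), len(pi)
    alpha = np.zeros((N, m))
    beta = np.ones((N, m))
    alpha[0] = pi * O[obs[0]]
    for i in range(1, N):                       # forward pass
        alpha[i] = O[obs[i]] * (T @ alpha[i - 1])
    for i in range(N - 2, -1, -1):              # backward pass
        beta[i] = T.T @ (O[obs[i + 1]] * beta[i + 1])
    post = alpha * beta                         # unnormalized posteriors
    return [int(h) for h in np.argmax(post, axis=1)]

# Toy model: observations 0 and 2 are anchors for states 0 and 1.
O = np.array([[0.6, 0.0], [0.3, 0.1], [0.0, 0.7], [0.1, 0.2]])
T = np.array([[0.7, 0.4], [0.3, 0.6]])
pi = np.array([0.5, 0.5])
print(marginal_decode([0, 2, 1], pi, T, O))  # [0, 1, 0]
```

Note how the anchor observations pin down their positions exactly: the posterior at an anchor position puts all mass on the anchor's state.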

Practical Issues with the Anchor Algorithm
In our experiments, we find that Anchor-NMF (Figure 1) tends to propose extremely rare words as anchors. A simple fix is to search for anchors only among relatively frequent words. We find that any reasonable frequency threshold works well; we use the 300 most frequent words. Note that this is not a problem if these 300 words include anchor words corresponding to all the 12 tags.
We must define the context for constructing Ω. We use the previous and next words (i.e., context size L = 2) marked with relative positions. Thus Ω has 2n columns before dimension reduction.

Table 1: Many-to-one accuracy on the English data with different choices of the convex hull Ω (Figure 3). These results do not use spelling features.
Table 1 shows the many-to-one accuracy on the English data with different choices of the convex hull Ω. The Brown construction (τ = brown in Figure 3) clearly performs the best: essentially, the anchor algorithm is used to extract the HMM parameters from the CCA-based word embeddings of Stratos et al. (2015).

We also explore the feature augmentation discussed in Section 4.4.4. For comparison, we employ the same word features used by Berg-Kirkpatrick et al. (2010):
• Indicators for whether a word is capitalized, contains a hyphen, or contains a digit
• Suffixes of length 1, 2, and 3
We weigh the l2 norm of these extra dimensions in relation to the original dimensions: we find that a small weight (e.g., 0.1 of the norm of the original dimensions) works well. We also find that these features can sometimes significantly improve the performance. For instance, the accuracy on the English portion can be improved from 66.1% to 71.4% with feature augmentation.

Another natural experiment is to refine the HMM parameters obtained from the anchor algorithm (or Brown clusters) with a few iterations of the Baum-Welch algorithm. In our experiments, however, it did not significantly improve the tagging performance, so we omit this result.

Table 2 shows the many-to-one accuracy on all languages in the dataset. For the Baum-Welch algorithm and the unsupervised log-linear models, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations.

Tagging Accuracy
Both ANCHOR and ANCHOR-FEATURES compete favorably. On 5 out of 10 languages, ANCHOR-FEATURES achieves the highest accuracy, often closely followed by ANCHOR. The Brown clustering estimation is also competitive and has the highest accuracy on 3 languages. Not surprisingly, vanilla HMMs trained with BW perform the worst (see Section 5.1.1 for a discussion).

Table 2: Many-to-one accuracy on each language using 12 universal tags. The first four models are HMMs estimated with the Baum-Welch algorithm (BW), the clustering algorithm of Brown et al. (1992), and the anchor algorithm without (ANCHOR) and with (ANCHOR-FEATURES) feature augmentation. LOG-LINEAR is the model of Berg-Kirkpatrick et al. (2010) trained with the direct-gradient method using L-BFGS. For BW and LOG-LINEAR, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations.
LOG-LINEAR is a robust baseline and performs the best on the remaining 2 languages. It performs especially strongly on the Japanese and Korean datasets, in which poorly segmented strings such as "1950年11月5日には" (on November 5, 1950) and "40.3%로" (by 40.3%) abound. In these datasets, it is crucial to make effective use of morphological features.

A-HMM Parameters
An A-HMM can be easily interpreted since each hidden state is marked with an anchor observation. Table 3 shows the 12 anchors found in each language. Note that these anchor words generally have a wide coverage of possible POS tags.
We also experimented with using true anchor words (obtained from labeled data), but they did not improve performance over automatically induced anchors. Since anchor discovery is inherently tied to parameter estimation, it is better to obtain anchors in a data-driven manner. In particular, certain POS tags (e.g., X) appear quite infrequently, and the model is worse off by being forced to allocate a hidden state for such a tag. Table 4 shows words with highest emission probabilities o(x|h) under each anchor. We observe that an anchor is representative of a certain group of words. For instance, the state "loss" represents noun-like words, "1" represents numbers, "on" represents preposition-like words, "one" represents determiner-like words, and "closed" represents verb-like words. The conditional distribution is peaked for anchors that represent function tags (e.g., determiners, punctuation) and flat for anchors that represent content tags (e.g., nouns). Occasionally, an anchor assigns high probabilities to words that do not seem to belong to the corresponding POS tag. But this is to be expected since o(x|h) ∝ P (X I = x) is generally larger for frequent words.

Model Assumptions on Data
Table 5 checks the assumptions in A-HMMs and Brown models on the universal treebank dataset. The anchor assumption is indeed satisfied with 12 universal tags: in every language, each tag has at least one word uniquely associated with the tag. The Brown assumption (each word has exactly one possible tag) is of course not satisfied, since some words are genuinely ambiguous with respect to their POS tags. However, the percentage of unambiguous words is very high (well over 90%). This analysis supports that the model assumptions made by A-HMMs and Brown models are appropriate for POS tagging.

Table 6 reports the log likelihood (normalized by the number of words) on the English portion under different estimation methods for HMMs. BW and CLUSTER obtain higher likelihood than the anchor algorithm, but this is expected given that both EM and the Brown clustering algorithm directly optimize likelihood, whereas the anchor algorithm does not.

Related Work

Anandkumar et al. (2014) propose an exact tensor decomposition method for learning a wide class of latent variable models with similar non-degeneracy conditions. Arora et al. (2013) derive a provably correct learning algorithm for topic models with a certain parameter structure.
The anchor-based framework has been originally formulated for learning topic models (Arora et al., 2013). It has been subsequently adopted to learn other models such as latent-variable probabilistic context-free grammars (Cohen and Collins, 2014). In our work, we have extended this framework to address unsupervised sequence labeling.
Zhou et al. (2014) also extend Arora et al. (2013)'s framework to learn various models including HMMs, but they address a more general problem. Consequently, their algorithm draws from Anandkumar et al. (2012) and is substantially different from ours.

Unsupervised POS Tagging
Unsupervised POS tagging is a classic problem in unsupervised learning that has been tackled with various approaches. Johnson (2007) observes that EM performs poorly in this task because it induces flat distributions; this is not the case with our algorithm, as seen in the peaked distributions in Table 4. Haghighi and Klein (2006) assume a set of prototypical words for each tag and report high accuracy. In contrast, our algorithm automatically finds such prototypes in a subroutine.
Berg-Kirkpatrick et al. (2010) achieve the state-of-the-art result in unsupervised fine-grained POS tagging (mid-70%). As described in Section 5.2, their model is an HMM in which probabilities are given by log-linear models. Table 7 provides a point of reference comparing our work with Berg-Kirkpatrick et al. (2010) in their setting: models are trained and tested on the entire 45-tag WSJ dataset. Their model outperforms our approach in this setting: with fine-grained tags, spelling features become more important, for instance to distinguish "played" (VBD) from "plays" (VBZ). Nonetheless, we have shown that our approach is competitive when universal tags are used (Table 2).
Many past works on POS induction predate the introduction of the universal tagset by Petrov et al. (2012) and thus report results with fine-grained tags. More recent works adopt the universal tagset, but they leverage additional resources. For instance, Das and Petrov (2011) and Täckström et al. (2013) use parallel data to project POS tags from a supervised source language. Li et al. (2012) use tag dictionaries built from Wiktionary. Thus their results are not directly comparable to ours.

Table 7: Many-to-one accuracy on the English data with 45 original tags. We use the same setting as in Table 2. For BW and LOG-LINEAR, we report the mean and the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations.

Conclusion
We have presented an exact estimation method for learning anchor HMMs from unlabeled data. There are several directions for future work. An important direction is to extend the method to a richer family of models such as log-linear models or neural networks. Another direction is to further generalize the method to handle a wider class of HMMs by relaxing the anchor condition (Condition 4.1). This will require a significant extension of the NMF algorithm in Figure 1.