Unsupervised Grammar Induction with Depth-bounded PCFG

There has been recent interest in applying cognitively- or empirically-motivated bounds on recursion depth to limit the search space of grammar induction models (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016). This work extends this depth-bounding approach to probabilistic context-free grammar induction (DB-PCFG), which has a smaller parameter space than hierarchical sequence models, and therefore more fully exploits the space reductions of depth-bounding. Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.


Introduction
Grammar acquisition or grammar induction (Carroll and Charniak, 1992) has been of interest to linguists and cognitive scientists for decades.
This task is interesting because a well-performing acquisition model can serve as a good baseline for examining factors of grounding (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2010), or as a piece of evidence (Clark, 2001; Zuidema, 2003) about the Distributional Hypothesis (Harris, 1954) against the poverty of the stimulus (Chomsky, 1965). Unfortunately, previous attempts at inducing unbounded context-free grammars (Johnson et al., 2007; Liang et al., 2009) converged to weak modes of a very multimodal distribution of grammars. There has been recent interest in applying cognitively or empirically motivated bounds on recursion depth to limit the search space of grammar induction models (Ponvert et al., 2011; Noji and Johnson, 2016; Shain et al., 2016). Ponvert et al. (2011) and Shain et al. (2016) in particular report benefits for depth bounds on grammar acquisition using hierarchical sequence models, but either without the capacity to learn full grammar rules (e.g. that a noun phrase may consist of a noun phrase followed by a prepositional phrase), or with a very large parameter space that may offset the gains of depth-bounding. This work extends the depth-bounding approach to directly induce probabilistic context-free grammars, which have a smaller parameter space than hierarchical sequence models, and therefore arguably make better use of the space reductions of depth-bounding.
This approach employs a procedure for deriving a sequence model from a PCFG (van Schijndel et al., 2013), developed in the context of a supervised learning model, and adapts it to an unsupervised setting.
Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, as shown in a noun phrase discovery task, something which has not been demonstrated by other acquisition models.

Related work
This paper describes a Bayesian Dirichlet model of depth-bounded probabilistic context-free grammar (PCFG) induction. Bayesian Dirichlet models have been applied to the related area of latent variable PCFG induction (Johnson et al., 2007; Liang et al., 2009), in which subtypes of categories like noun phrases and verb phrases are induced on a given tree structure. The model described in this paper is given only words, and induces not only categories for constituents but also tree structures.
There are a wide variety of approaches to grammar induction outside the Bayesian modeling paradigm.
The CCL system (Seginer, 2007a) uses deterministic scoring systems to generate bracketed output of raw text. UPPARSE (Ponvert et al., 2011) uses a cascade of HMM chunkers to produce syntactic structures. BMMM+DMV (Christodoulopoulos et al., 2012) combines an unsupervised part-of-speech (POS) tagger, BMMM, and an unsupervised dependency grammar inducer, DMV (Klein and Manning, 2004). The BMMM+DMV system alternates between phases of inducing POS tags and inducing dependency structures.
A large amount of work (Klein and Manning, 2002; Klein and Manning, 2004; Bod, 2006; Berg-Kirkpatrick et al., 2010; Gillenwater et al., 2011; Headden et al., 2009; Bisk and Hockenmaier, 2013; Scicluna and de la Higuera, 2014; Jiang et al., 2016; Han et al., 2017) addresses grammar induction with input annotated with POS tags, mostly for dependency grammar induction. Although POS tags can also be induced, this separate induction has been criticized (Pate and Johnson, 2016) for missing an opportunity to leverage information learned in grammar induction to estimate POS tags. Moreover, most of these models explore a search space that includes syntactic analyses that may be extensively center-embedded and are therefore unlikely to be produced by human speakers. Unlike most of these approaches, the model described in this paper uses cognitively motivated bounds on the depth of human recursive processing to constrain its search of possible trees for input sentences.
Some previous work uses depth bounds in the form of sequence models (Ponvert et al., 2011; Shain et al., 2016), but these either do not produce complete phrase structure grammars (Ponvert et al., 2011) or do so at the expense of large parameter sets (Shain et al., 2016). Other work implements depth bounds on left-corner configurations of dependency grammars (Noji and Johnson, 2016), but the use of a dependency grammar makes the system impractical for addressing questions of how category types such as noun phrases may be learned. Unlike these, the model described in this paper induces a PCFG directly and then bounds it with a model-to-model transform, which yields a smaller space of learnable parameters and directly models the acquisition of category types as labels.
Some induction models learn semantic grammars from text annotated with semantic predicates (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2012). There is evidence that humans use semantic bootstrapping during grammar acquisition (Naigles, 1990), but these models typically rely on a set of pre-defined universals, such as combinators (Steedman, 2000), which simplify the induction task. In order to help address the question of whether such universals are indeed necessary for grammar induction, the model described in this paper does not assume any strong universals except independently motivated limits on working memory.

Like Noji and Johnson (2016) and Shain et al. (2016), the model described in this paper defines bounding depth in terms of the memory elements required in a left-corner parse. A left-corner parser (Rosenkrantz and Lewis, 1970; Johnson-Laird, 1983; Abney and Johnson, 1991; Resnik, 1992) uses a stack of memory elements to store derivation fragments during incremental processing. Each derivation fragment represents a disjoint connected component of phrase structure a/b, consisting of a top sign a lacking a bottom sign b yet to come. For example, Figure 1 shows the derivation fragments in a left-corner traversal of the phrase structure tree for the sentence The cart the horse the man bought pulled broke. Immediately before processing the word man, the traversal has recognized three fragments of tree structure: two from category NP to category RC (covering the cart and the horse) and one from category NP to category N (covering the). Derivation fragments at every time step are numbered top-down by depth d, to a maximum depth of D. A left-corner parser requires more derivation fragments, and thus more memory, to process center-embedded constructions than to process left- or right-embedded constructions, consistent with observations that center embedding is more difficult for humans to process (Chomsky and Miller, 1963; Miller and Isard, 1964).
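The relationship between center embedding and memory demand can be sketched in code. The following is an illustrative toy (not from the paper): it counts the derivation fragments a binary tree requires under the simplifying assumption that a new fragment opens whenever a complex left child is entered below a right child, i.e. at each center embedding. The tree encoding and the `np_` helper are hypothetical.

```python
# Illustrative sketch (not the paper's parser): estimate how many left-corner
# memory elements (derivation fragments) a binary tree needs, assuming a new
# fragment opens whenever a non-lexical left child sits below a right child.

def memory_depth(tree):
    """tree: a terminal string, or a (label, left_child, right_child) tuple."""
    def walk(node, depth, via_right):
        if not isinstance(node, tuple):
            return depth
        _, left, right = node
        # a non-lexical left child below a right child opens a new fragment
        inc = 1 if (via_right and isinstance(left, tuple)) else 0
        return max(walk(left, depth + inc, False), walk(right, depth, True))
    return walk(tree, 1, False)

def np_(noun):  # hypothetical helper: a minimal noun phrase
    return ('NP', 'the', noun)

# "The cart the horse the man bought pulled broke" (doubly center-embedded)
center = ('S', ('NP', np_('cart'),
                ('RC', ('NP', np_('horse'),
                        ('RC', np_('man'), 'bought')),
                 'pulled')),
          'broke')
right_branching = ('X', 'a', ('X', 'b', ('X', 'c', 'd')))
left_branching = ('X', ('X', ('X', 'a', 'b'), 'c'), 'd')

print(memory_depth(center))           # 3
print(memory_depth(right_branching))  # 1
print(memory_depth(left_branching))   # 1
```

The doubly center-embedded sentence requires three fragments, matching the three fragments described above, while purely left- or right-branching trees need only one.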
Grammar acquisition models (Noji and Johnson, 2016;Shain et al., 2016) then restrict this memory to some low bound: e.g. two derivation fragments.
For sequences of observed word tokens w_t for time steps t ∈ {1..T}, sequence models like Ponvert et al. (2011) and Shain et al. (2016) hypothesize sequences of hidden states q_t. Models like Shain et al. (2016) implement bounded grammar rules as depth bounds on a hierarchical sequence model implementation of a left-corner parser, using random variables within each hidden state q_t for: 1. preterminal labels p_t and labels of top and bottom signs, a_t^d and b_t^d, of derivation fragments at each depth level d (which correspond to left and right children in tree structure), and 2. boolean variables for decisions to 'fork out' f_t and 'join in' j_t derivation fragments (in Johnson-Laird (1983) terms, to shift with or without match and to predict with or without match).
Probabilities from these distributions are then multiplied together to define a transition model M over hidden states. For example, just after the word horse is recognized in Figure 1, the parser store contains two derivation fragments yielding the cart and the horse, both with top category NP and bottom category RC. The parser then decides to fork out the next word the based on the bottom category RC of the last derivation fragment on the store. Then the parser generates a preterminal category D for this word based on this fork decision and the bottom category of the last derivation fragment on the store. Then the parser decides not to join the resulting D directly to the RC above it, based on these fork and preterminal decisions and the bottom category of the store. Finally the parser generates NP and N as the top and bottom categories of a new derivation fragment yielding just the new word the based on all these previous decisions, resulting in the store state shown in the figure.

The model over the fork decision (shift with or without match) is defined in terms of a depth-specific sub-model θ_{F,d}, where ⊥ is an empty derivation fragment and d is the depth of the deepest nonempty derivation fragment at time step t-1. The model over the preterminal category label is then conditioned on this fork decision. When there is no fork, the preterminal category label is deterministically linked to the category label of the bottom sign of the deepest derivation fragment at the previous time step (using an indicator function equal to one when its condition is true and zero otherwise). When there is a fork, the preterminal category label is defined in terms of a depth-specific sub-model θ_{P,d}. The model over the join decision (predict with or without match) is likewise defined in terms of a depth-specific sub-model θ_{J,d}, with parameters depending on the outcome of the fork decision.

Decisions about the top categories of derivation fragments a_t^{1..D} (which correspond to left siblings in tree structures) are decomposed into fork- and join-specific cases. When there is a join, the top category of the deepest derivation fragment deterministically depends on the corresponding value at the previous time step. When there is no join, the top category is defined in terms of a depth-specific sub-model. Decisions about the bottom categories b_t^{1..D} (which correspond to right children in tree structures) also depend on the outcome of the fork and join variables, but are defined in terms of a side- and depth-specific sub-model in every case. In a sequence model inducer like Shain et al. (2016), these depth-specific models are assumed to be independent of each other and fit with a Gibbs sampler: backward sampling hidden variable sequences from forward distributions using this compiled transition model M (Carter and Kohn, 1996), then counting individual sub-model outcomes from sampled hidden variable sequences, then resampling each sub-model using these counts with Dirichlet priors over the a, b, and p models and Beta priors over the f and j models, then re-compiling these resampled models into a new M.
However, note that with K category labels this model contains DK^2 + 3DK^3 separate parameters for preterminal categories and top and bottom categories of derivation fragments at every depth level, each of which can be independently learned by the Gibbs sampler. Although this allows the hierarchical sequence model to learn grammars that are more expressive than PCFGs, the search space is several times larger than the K^3 space of PCFG nonterminal expansions. The model described in this paper instead induces a PCFG and derives sequence model distributions from the PCFG, which has fewer parameters, and thus strictly reduces the search space of the model.
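The gap between the two parameter counts is easy to make concrete. The following snippet evaluates the formulas quoted above for an illustrative setting (D=2, K=15, values that also appear in the paper's experiments):

```python
# Parameter-space comparison quoted in the text: a depth-bounded hierarchical
# sequence model has D*K^2 + 3*D*K^3 depth-specific category parameters,
# while a PCFG's nonterminal expansions occupy only a K^3 space.

def seq_model_params(D, K):
    return D * K**2 + 3 * D * K**3

def pcfg_params(K):
    return K**3

print(seq_model_params(2, 15))  # 20700
print(pcfg_params(15))          # 3375
```

At D=2 and K=15 the sequence model's category parameter space is roughly six times larger than the PCFG's nonterminal expansion space.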

The DB-PCFG Model
Unlike Shain et al. (2016), the depth-bounded probabilistic context-free grammar (DB-PCFG) model described in this paper directly induces a PCFG and then deterministically derives the parameters of a probabilistic left-corner parser from this single source. This derivation is based on an existing derivation of probabilistic left-corner parser models from PCFGs (van Schijndel et al., 2013), which was developed for a supervised parsing model and is here adapted to run more efficiently within a larger unsupervised grammar induction model. (More specifically, the derivation differs from that of van Schijndel et al. (2013) in that it removes terminal symbols from the conditional dependencies of the models over fork and join decisions and top and bottom category labels, substantially reducing the size of the derived model that must be run during induction.)

A PCFG can be defined in Chomsky normal form as a matrix G of binary rule probabilities, with one row for each of K parent symbols c and one column for each of K^2 + W combinations of left and right child symbols a and b, which can be pairs of nonterminals or observed words from vocabulary W followed by null symbols ⊥. A depth-bounded grammar G_D is a set of side- and depth-specific distributions. The posterior probability of a depth-bounded model G_D given a corpus (sequence) of words w_{1..T} is proportional to the product of a likelihood and a prior. The likelihood is defined as a marginal, over bounded PCFG trees τ, of the probability of each tree given the grammar times the product of the probability of the word at each time step (token index) t given that tree, and the probability of each tree is defined to be the product of the probabilities of each of its branches. The probability P(G_D) is itself an integral over the product of a deterministic transform φ from an unbounded grammar to a bounded grammar and a prior over unbounded grammars P(G). Distributions P(G) for each nonterminal symbol (rows of G) can then be sampled from a Dirichlet distribution with a symmetric parameter β, which yields a corresponding transformed sample in P(G_D) for corresponding nonterminals.

This definition assumes a Kronecker delta δ_i, defined as a vector with value one at index i and zeros everywhere else, and a Kronecker product M ⊗ N over matrices M and N, which tiles copies of N weighted by the values in M. The Kronecker product specializes to vectors as single-column matrices, generating vectors that contain the products of all combinations of elements in the operand vectors.
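The Kronecker operations just described can be checked directly with numpy, whose `np.kron` implements the same tiling product (the small matrices below are arbitrary examples):

```python
import numpy as np

# Kronecker delta: a one-hot vector with a one at index i.
def delta(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# The Kronecker product tiles copies of N weighted by the values in M ...
M = np.array([[1, 2]])
N = np.array([[10, 20]])
print(np.kron(M, N))  # [[10 20 20 40]]

# ... and specializes to vectors, giving all pairwise products of elements.
print(np.kron(np.array([1, 2]), np.array([3, 4])))  # [3 4 6 8]
print(delta(1, 3))  # [0. 1. 0.]
```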
Note that this model differs from that of Shain et al. (2016), who induce a hierarchical HMM directly.
A depth-bounded grammar G_D is (deterministically) derived from G via transform φ, with probabilities for expansions constrained to and renormalized over only those outcomes that yield terminals within a particular depth bound D. This depth-bounded grammar is then used to derive left-corner expectations (anticipated counts of categories appearing as left descendants of other categories), and ultimately the parameters of the depth-bounded left-corner parser defined in Section 3. Counts for G are then obtained from sampled hidden state sequences, and rows of G are then directly sampled from the posterior updated by these counts.
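The per-row resampling step at the end of this cycle can be sketched with numpy's Dirichlet sampler. The shapes and counts below are illustrative, not the paper's actual category inventory:

```python
import numpy as np

# Sketch of the per-nonterminal resampling step: each row of G is drawn from
# a Dirichlet whose parameters are the symmetric prior beta plus the rule
# counts C gathered from sampled trees. (Shapes here are illustrative.)
rng = np.random.default_rng(0)
beta = 0.2
C = np.array([[10.0, 2.0, 0.0],   # counts for nonterminal 1's expansions
              [0.0, 5.0, 5.0]])   # counts for nonterminal 2's expansions
G = np.stack([rng.dirichlet(beta + row) for row in C])
print(G.shape)        # (2, 3)
print(G.sum(axis=1))  # each row is a proper distribution summing to 1
```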

Depth-bounded grammar
In order to ensure the bounded version of G is a consistent probability model, it must be renormalized in transform φ to assign a probability of zero to any derivation that exceeds its depth bound D. For example, if D = 2, then it is not possible to expand a left sibling at depth 2 to anything other than a lexical item, so the probability of any non-lexical expansion must be removed from the depth-bounded model, and the probabilities of all remaining outcomes must be renormalized to a new total without this probability. Following van Schijndel et al. (2013), this can be done by iteratively defining a side- and depth-specific containment likelihood h^(i)_{s,d} for left- or right-side siblings s ∈ {L, R} at depth d ∈ {1..D} at each iteration i ∈ {1..I}, as a vector with one row for each nonterminal or terminal symbol (or null symbol ⊥) in G, containing the probability of each symbol generating a complete yield within depth d as an s-side sibling, where 'T' is a top-level category label at depth zero.
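A toy version of this fixed-point computation is sketched below for a one-category grammar. The depth bookkeeping is an illustrative assumption rather than the paper's exact recursion: a left child of a right sibling is treated as one level deeper, and beyond depth D only lexical expansions survive.

```python
# Toy fixed-point computation of containment likelihoods for the grammar
# S -> S S (0.5) | 'x' (0.5), bounded at depth D. h[(s, d)] approximates the
# probability that an s-side sibling at depth d completes its yield within
# the bound. Depth propagation here is a simplifying assumption.

P_BRANCH, P_LEX, D, I = 0.5, 0.5, 1, 50

h = {('L', d): 0.0 for d in range(1, D + 2)}
h.update({('R', d): 0.0 for d in range(1, D + 1)})
for _ in range(I):
    new = {}
    for d in range(1, D + 1):
        # left sibling: children stay at the same depth
        new[('L', d)] = P_LEX + P_BRANCH * h[('L', d)] * h[('R', d)]
        # right sibling: its left child opens a fragment one level deeper
        new[('R', d)] = P_LEX + P_BRANCH * h[('L', d + 1)] * h[('R', d)]
    new[('L', D + 1)] = P_LEX  # deepest left siblings must be lexical
    h = new

print(round(h[('L', 1)], 4), round(h[('R', 1)], 4))  # 0.75 0.6667
```

Under these assumptions the iteration converges quickly: the depth-1 right-sibling likelihood settles at 2/3 and the left-sibling likelihood at 3/4, reflecting the probability mass lost to derivations that would exceed the bound.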
A depth-bounded grammar G_{s,d} can then be defined as the original grammar G reweighted and renormalized by this containment likelihood. (Experiments described in this article use I = 20, following observations of convergence at this point in supervised parsing.) This renormalization ensures the depth-bounded model is consistent. Moreover, this distinction between a learned unbounded grammar G and a derived bounded grammar G_{s,d}, which is used to derive a parsing model, may be regarded as an instance of Chomsky's (1965) distinction between linguistic competence and performance.
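The reweight-and-renormalize step can be sketched in a simplified, side- and depth-free form (the grammar and likelihood values below are toy numbers, not derived from data):

```python
# Simplified sketch of the bounding transform: multiply each rule's
# probability by the containment likelihoods of its children, then
# renormalize each parent's distribution over the surviving outcomes.

def bound_grammar(rules, h):
    """rules: {parent: {(left, right): prob}}; h: containment likelihood per symbol."""
    bounded = {}
    for parent, expansions in rules.items():
        weighted = {kids: p * h[kids[0]] * h[kids[1]]
                    for kids, p in expansions.items()}
        z = sum(weighted.values())
        bounded[parent] = {kids: w / z for kids, w in weighted.items()}
    return bounded

rules = {'S': {('S', 'S'): 0.5, ('x', '⊥'): 0.5}}  # '⊥' is the null right child
h = {'S': 0.5, 'x': 1.0, '⊥': 1.0}                 # toy containment likelihoods
print(bound_grammar(rules, h)['S'])  # {('S', 'S'): 0.2, ('x', '⊥'): 0.8}
```

Probability mass shifts from the recursive expansion toward the lexical one, exactly the effect the bound is meant to have.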
The side- and depth-specific grammar can then be used to define expected counts of categories occurring as left descendants (or 'left corners') of right-sibling ancestors. This left-corner expectation will be used to estimate the marginalized probability over all grammar rule expansions between derivation fragments, which must traverse an unknown number of left children of some right-sibling ancestor.
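Marginalizing over "an unknown number of left children" amounts to summing a geometric series of a left-child matrix. The sketch below is an illustrative reconstruction, not the paper's exact side- and depth-specific matrices:

```python
import numpy as np

# Left-corner expectations as a geometric series over a left-child matrix L,
# where L[b, c] is the probability that c appears as the immediate left child
# of b (illustrative reconstruction):
#   E+ = L + L^2 + L^3 + ... = L (I - L)^{-1}, when the series converges.

def left_corner_expectations(L):
    eye = np.eye(L.shape[0])
    return L @ np.linalg.inv(eye - L)

# One category S with S -> S S (prob 0.5): S is its own left child half the
# time, so its expected count as a left descendant is 0.5 / (1 - 0.5) = 1.
L = np.array([[0.5]])
print(left_corner_expectations(L))  # [[1.]]
```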

Depth-bounded parsing
Again following van Schijndel et al. (2013), the fork and join decision sub-models and the preterminal, top and bottom category label sub-models described in Section 3 can now be defined in terms of these side- and depth-specific grammars G_{s,d} and depth-specific left-corner expectations E+_d. First, probabilities for no-fork and yes-fork outcomes below some bottom sign of category b at depth d are defined as the normalized probabilities, respectively, of any lexical expansion of a right sibling b at depth d, and of any lexical expansion following any number of left child expansions from b at depth d. The probability of a preterminal p given a bottom category b is simply a normalized left-corner expected count of p under b. Yes-join and no-join probabilities below bottom sign b and above top sign a at depth d are then defined similarly to the fork probabilities, as the normalized probabilities, respectively, of an expansion to left child a of a right sibling b at depth d, and of an expansion to left child a following any number of left child expansions from b at depth d. The distribution over category labels for top signs a above some top sign of category c and below a bottom sign of category b at depth d is defined as the normalized distribution over category labels following a chain of left children expanding from b which then expands to have a left child of category c. The distribution over category labels for bottom signs b below some sign a and sibling of top sign c is then defined as the normalized distribution over right children of grammar rules expanding from a to c followed by b. Finally, a lexical observation model L is defined as a matrix of unary rule probabilities, with one row for each combination of store state and preterminal symbol and one column for each observation symbol.
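The simplest of these derived sub-models, the preterminal distribution, is just a row-normalization of left-corner expected counts. A minimal sketch, with an illustrative 2x2 expectation matrix rather than the paper's actual categories:

```python
import numpy as np

# Sketch: the preterminal sub-model normalizes left-corner expected counts,
# P(p | b) = E+[b, p] / sum over p' of E+[b, p'], restricted here to
# preterminal columns (toy 2x2 setup).
E_plus = np.array([[0.6, 0.2],   # expected left-corner counts under b1
                   [0.1, 0.3]])  # expected left-corner counts under b2
P = E_plus / E_plus.sum(axis=1, keepdims=True)
print(P)  # [[0.75 0.25], [0.25 0.75]] -- each row sums to 1
```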

Gibbs sampling
Grammar induction in this model then follows a forward-filtering backward-sampling algorithm (Carter and Kohn, 1996). This algorithm first computes a forward distribution v_t over hidden states at each time step t from an initial value ⊥. The algorithm then samples hidden states backward from a multinomial distribution given the previously sampled state q_{t+1} at time step t+1 (assuming the input parameters to the multinomial are normalized). Grammar rule applications C are then counted from these sampled sequences, and a new grammar G is sampled from a Dirichlet distribution with counts C and a symmetric hyper-parameter β as parameters. This grammar is then used to define transition and lexical models M and L as defined in Sections 3 through 4.2, completing the cycle.
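A generic forward-filtering backward-sampling pass can be sketched as follows. The transition and emission matrices here are small illustrative stand-ins for the compiled left-corner transition model M and lexical model L, and the uniform initial distribution is an assumption:

```python
import numpy as np

# Generic forward-filtering backward-sampling (Carter and Kohn, 1996) for a
# hidden-state sequence model with transition matrix M, emission matrix E,
# and observed symbol indices obs.

def ffbs(M, E, obs, rng):
    K, T = M.shape[0], len(obs)
    v = np.zeros((T, K))
    v[0] = E[:, obs[0]]                      # uniform initial distribution
    v[0] /= v[0].sum()
    for t in range(1, T):                    # forward filtering
        v[t] = (v[t - 1] @ M) * E[:, obs[t]]
        v[t] /= v[t].sum()
    q = np.zeros(T, dtype=int)               # backward sampling
    q[T - 1] = rng.choice(K, p=v[T - 1])
    for t in range(T - 2, -1, -1):
        w = v[t] * M[:, q[t + 1]]            # filter times transition to q_{t+1}
        q[t] = rng.choice(K, p=w / w.sum())
    return q

M = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.8, 0.2], [0.3, 0.7]])
obs = [0, 0, 1, 1, 1]
q = ffbs(M, E, obs, np.random.default_rng(0))
print(len(q))  # 5, one sampled hidden state per observation
```

Each backward step draws q_t in proportion to the forward filter v_t weighted by the transition probability into the already-sampled q_{t+1}, yielding an exact joint posterior sample of the state sequence.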

Model hyper-parameters and priors
There are three hyper-parameters in the model: K is the number of non-terminal categories in the grammar G, D is the maximum depth, and β is the parameter for the symmetric Dirichlet prior over multinomial distributions in the grammar G.
As seen in the previous subsection, the prior is over all possible rules in an unbounded PCFG. Because the number of non-terminal categories of the unbounded PCFG is given as a hyper-parameter, the number of rules in the grammar is always known. It would be possible to use non-parametric priors over the number of non-terminal categories; however, due to the need to dynamically mitigate the computational complexity of filtering and sampling with arbitrarily large category sets, this is left for future work.

Evaluation
The DB-PCFG model described in Section 4 is evaluated first on synthetic data, to determine whether it can reliably learn a recursive grammar from data with a known optimum solution and to determine the hyper-parameter value of β for doing so. Two experiments on natural data are then carried out. First, the model is run on natural data from the Adam and Eve parts of the CHILDES corpus (MacWhinney, 1992) to compare with other grammar induction systems on a human-like acquisition task. Then data from the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) is used for further comparison in a domain for which competing systems are optimized. The competing systems include UPPARSE (Ponvert et al., 2011), CCL (Seginer, 2007a), BMMM+DMV with undirected dependency features (Christodoulopoulos et al., 2012), and UHHMM (Shain et al., 2016). For the natural language datasets, the variously parametrized DB-PCFG systems are first validated on a development set, and the optimal system is then run until convergence with the chosen hyper-parameters on the test set. In development experiments, the log-likelihood of the dataset usually plateaus after 500 iterations. The system is therefore run at least 500 iterations in all test set experiments, with one iteration being a full cycle of Gibbs sampling. The system is then checked to see whether the log-likelihood has plateaued, and halted if it has.
The DB-PCFG model assigns trees sampled from conditional posteriors to all sentences in a dataset in every iteration as part of the inference.The system is further allowed to run at least 250 iterations after convergence and proposed parses are chosen from the iteration with the greatest log-likelihood after convergence.However, once the system reaches convergence, the evaluation scores of parses from different iterations post-convergence appear to differ very little.

Following Liang et al. (2009) and Scicluna and de la Higuera (2014), an initial set of experiments on synthetic data are used to investigate basic properties of the model, in particular: 1. whether the model is balanced or biased in favor of left- or right-branching solutions, 2. whether the model is able to posit recursive structure in appropriate places, and 3. what hyper-parameters enable the model to find optimal modes more quickly.
The risk of bias in branching structure is important because it might unfairly inflate induction results on languages like English, which are heavily right-branching. In order to assess its bias, the model is evaluated on two synthetic datasets, each consisting of 200 sentences. The first dataset is a left-branching corpus, which consists of 100 sentences of the form a b and 100 sentences of the form a b b, with optimal tree structures as shown in Figure 2 (a) and (b). The second dataset is an analogous right-branching corpus. Finally, as a gauge of the complexity of this task, results of the model described in this paper are compared with those of other grammar induction models on the center-embedding dataset. (Here, in order to more closely resemble natural language input, tokens a, b, and c are randomly chosen uniformly from {a_1, ..., a_50}, {b_1, ..., b_50} and {c_1, ..., c_50}, respectively.) In this experiment, all models are assigned hyper-parameters matching the optimal solution: the DB-PCFG is run with K=5, D=2 and β=0.2 for all priors, the BMMM+DMV (Christodoulopoulos et al., 2012) is run with 3 preterminal categories, and the UHHMM model is run with 2 active states, 4 awaited states and 3 parts of speech. Table 1 shows the PARSEVAL scores for trees parsed with the learned grammar from each unsupervised system. Only the DB-PCFG model is able to recognize the correct tree structures and the correct category labels on this dataset, showing the task is indeed a robust challenge. This suggests that hyper-parameters optimized on this dataset may be portable to natural data.

Child-directed speech corpus
After setting the β hyper-parameter on synthetic datasets, the DB-PCFG model is evaluated on 14,251 sentences of transcribed child-directed speech from the Eve section of the Brown corpus of CHILDES (MacWhinney, 1992). Model performance is evaluated against Penn Treebank style annotations of both the Adam and Eve corpora (Pearl and Sprouse, 2013). Table 2 shows the PARSEVAL scores of the DB-PCFG system with different hyper-parameters on the Adam corpus, used for development. The simplest configuration, D1K15 (depth 1 only with 15 non-terminal categories), obtains the best score, so this setting is applied to the test corpus, Eve. Results of the D=1, K=15 DB-PCFG model on Eve are then compared against those of other grammar induction systems which use only raw text as input on the same corpus. Following Shain et al. (2016), the BMMM+DMV system is run for 10 iterations with 45 categories and its output is converted from dependency graphs to constituent trees (Collins et al., 1999). The UHHMM system is run on the Eve corpus using the settings in Shain et al. (2016), which also include a post-process option to flatten trees (reported here as UHHMM-F).
Table 3 shows the PARSEVAL scores for all the competing systems on the Eve dataset. The right-branching baseline is still the most accurate in terms of PARSEVAL scores, presumably because of the highly right-branching structure of child-directed speech in English. The DB-PCFG system with only one memory depth and 15 non-terminal categories achieves the best performance in terms of F1 score and recall among all the competing systems, significantly outperforming the other systems (p < 0.0001, permutation test). The Eve corpus has about 5,000 sentences with more than one depth level, so one might expect a depth-two model to perform better than a depth-one model, but this is not the case if only PARSEVAL scores are considered. This issue is revisited in the following section with the noun phrase discovery task.

NP discovery on child-directed speech
When humans acquire grammar, they learn not only tree structures but also category types: noun phrases, verb phrases, prepositional phrases, and where each type can and cannot occur.
Some of these category types, noun phrases in particular, are fairly universal across languages, and may be useful in downstream tasks such as (unsupervised) named entity recognition. The DB-PCFG and other models that can be made to produce category types are therefore evaluated on a noun phrase discovery task.
Two metrics are used for this evaluation. First, the evaluation counts all constituents proposed by the candidate systems and calculates recall against the gold annotation of noun phrases. This metric is not affected by which branching paradigm a system uses, and reveals more about the systems' performance. It differs from the metric used by Ponvert et al. (2011) in that it takes NPs at all levels of the gold annotation into account, not just base NPs. The second metric, for systems that produce category labels, calculates F1 scores of induced categories that can be mapped to noun phrases. The first 4,000 sentences are used as a development set for learning mappings from induced category labels to phrase types. The evaluation calculates precision, recall and F1 of all spans of proposed categories against the gold annotations of noun phrases in the development set, and aggregates the categories, ranked by their precision scores, so that the F1 score of the aggregated category is highest on the development set. The evaluation then calculates the F1 score of this aggregated category on the remainder of the dataset, excluding the development set. (Resulting scores are better when applying Shain et al. (2016) flattening to output binary-branching trees: for the D=1, K=15 model, precision and F1 can be raised to 70.31% and 74.33%. However, since the flattening is a heuristic which may not apply in all cases, these scores are not treated as comparable results.)
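The label-aggregation step can be sketched as a greedy search over precision-ranked prefixes. The per-category counts below are invented for illustration; the procedure is a plausible reading of the description above, not the paper's exact script:

```python
# Sketch of the NP-label aggregation metric: rank induced categories by NP
# precision on the development set, then keep the precision-ranked prefix
# whose aggregated F1 against gold NPs is highest. cat_counts maps each
# induced category to (true_positive_spans, proposed_spans).

def best_aggregate(cat_counts, n_gold):
    ranked = sorted(cat_counts.items(),
                    key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    best, tp, prop, chosen = (0.0, []), 0, 0, []
    for cat, (c_tp, c_prop) in ranked:
        tp, prop = tp + c_tp, prop + c_prop
        chosen = chosen + [cat]
        p, r = tp / prop, tp / n_gold
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        if f1 > best[0]:
            best = (f1, chosen)
    return best

counts = {'C1': (80, 100), 'C2': (30, 60), 'C3': (5, 50)}
f1, cats = best_aggregate(counts, n_gold=150)
print(cats)  # ['C1', 'C2'] -- adding low-precision C3 hurts aggregate F1
```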
The UHHMM system is the only competing system that is natively able to produce labels for proposed constituents.BMMM+DMV does not produce constituents with labels by default, but can be evaluated using this metric by converting dependency graphs into constituent trees, then labeling each constituent with the part-of-speech tag of the head.For CCL and UPPARSE, the NP agg F1 scores are not reported because they do not produce labeled constituents.
Table 4 shows the scores of all systems on the Eve dataset, and of four runs of the DB-PCFG system, on these two evaluation metrics. Surprisingly, the D=2, K=15 model, which has the lowest PARSEVAL scores, is the most accurate at discovering noun phrases: it has the highest scores on both evaluation metrics. The best model in terms of PARSEVAL scores, the D=1, K=15 DB-PCFG model, performs poorly among the DB-PCFG models, although its NP recall is still higher than that of the competing systems. The low NP agg F1 score of the D1K15 DB-PCFG model indicates a diffusion of induced syntactic categories as the model tries to balance labeling and branching decisions. The UPPARSE system, which is proposed as a base NP chunker, is relatively poor at NP recall by this definition.
The right-branching baseline does not perform well in terms of NP recall. This is mainly because noun phrases are often left children of some other constituent, and the right-branching model is unable to incorporate them into the syntactic structures of whole sentences. Therefore, although the right-branching model is the best model in terms of PARSEVAL scores, it is not helpful for finding noun phrases.

Penn Treebank
To further facilitate direct comparison to previous work, we run experiments on sentences from the Penn Treebank (Marcus et al., 1993). The first experiment uses the sentences from the Wall Street Journal part of the Penn Treebank with at most 20 words (WSJ20). The first half of the WSJ20 dataset is used as a development set (WSJ20dev) and the second half is used as a test set (WSJ20test). We also extract the sentences in WSJ20test with at most 10 words from the proposed parses from all systems and report results on them (WSJ10test). WSJ20dev is used for finding the optimal hyper-parameters for both the DB-PCFG and BMMM+DMV systems. Table 5 shows the PARSEVAL scores of all systems. The right-branching baseline is relatively weak on these two datasets, mainly because formal writing is more complex and uses more non-right-branching structures (e.g., subjects with modifiers or parentheticals) than child-directed speech. For WSJ10test, both the DB-PCFG system and CCL are able to outperform the right-branching baseline. The F1 difference between the best-performing previous-work system, CCL, and DB-PCFG is highly significant. For WSJ20test, again both CCL and DB-PCFG are above the right-branching baseline. The difference between the F scores of CCL and DB-PCFG is much smaller than on WSJ10test, but it is also significant.
It is possible that the DB-PCFG is being penalized for inducing fully binarized parse trees. The accuracy of the DB-PCFG model is dominated by recall rather than precision, whereas CCL and other systems are more balanced. This is an important distinction if it is assumed that phrase structure is binary (Kayne, 1981; Larson, 1988), in which case precision merely scores non-linguistic decisions about whether to suppress annotation of non-maximal projections. However, since other systems are not optimized for recall, it would not be fair to use recall alone as a comparison metric in this study.

22 Although UHHMM also needs tuning, in practice we find that this system is too inefficient to be tuned on a development set, and it requires too many resources when the hyperparameters are set larger than those used in previous work. We believe that further increasing the hyperparameters of UHHMM might improve its performance, but the released version does not scale to larger values of these settings. We also do not report UHHMM results on WSJ20test, for the same scalability reason.
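The binarization penalty can be illustrated with a minimal sketch of unlabeled PARSEVAL scoring (the function and toy spans are illustrative assumptions, not the evaluation tool used in the experiments): a fully binarized parse proposes more brackets than a flatter gold tree, so unmatched intermediate brackets lower precision even when every gold bracket is recovered.

```python
def parseval(gold_spans, pred_spans):
    """Unlabeled PARSEVAL precision, recall and F1 over constituent spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    matched = len(gold & pred)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# A flat gold tree over 4 words with brackets (0,4) and (0,2), versus a
# binarized parse that recovers both but adds an intermediate bracket (2,4).
gold = [(0, 4), (0, 2)]
binarized = [(0, 4), (0, 2), (2, 4)]
print(parseval(gold, binarized))  # precision 2/3, recall 1.0, F1 0.8
```

Under this scoring, a binarizing parser can reach perfect recall while its precision is capped by the ratio of gold brackets to proposed brackets, which is the asymmetry observed for the DB-PCFG.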
Finally, Table 6 shows the published results of different systems on WSJ. The CCL results come from Seginer (2007b), where the CCL system is trained on all sentences from WSJ and evaluated on sentences with 40 words or fewer (WSJ40) as well as on WSJ10. The UPPARSE results come from Ponvert et al. (2011), where the UPPARSE system is trained on sections 00-21 of WSJ and evaluated on section 23 and on the WSJ10 subset of section 23. The DB-PCFG system uses hyperparameters optimized on the WSJ20dev set and is evaluated on WSJ40 and WSJ10, both excluding WSJ20dev. The results are therefore not directly comparable, but the DB-PCFG system is competitive with the other systems and numerically achieves the best recall scores.

Conclusion
This paper describes a Bayesian Dirichlet model of depth-bounded PCFG induction. Unlike earlier work, this model implements depth bounds directly on PCFGs by derivation, reducing the search space of possible trees for input words without exploding the search space of parameters with multiple side- and depth-specific copies of each rule. Results for this model on grammar acquisition from transcribed child-directed speech and newswire text exceed or are competitive with those of other models when evaluated on parse accuracy. Moreover, grammars acquired with this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.
In addition to its practical merits, this model may offer some theoretical insight for linguists and other cognitive scientists. First, the model does not assume any universals except independently motivated limits on working memory, which may help address the question of whether such universals are indeed necessary for grammar induction. Second, the distinction this model draws between its learned unbounded grammar G and its derived bounded grammar G_D seems to align with Chomsky's (1965) distinction between competence and performance, and has the potential to offer some formal guidance to linguistic inquiry about both kinds of models.

Figure 1 :
Figure 1: Derivation fragments before the word man in a left-corner traversal of the sentence The cart the horse the man bought pulled broke.

Figure 3 :
Figure 3: Synthetic center-embedding structure. Note that tree structures (b) and (d) have depth 2 because they have complex sub-trees, spanning a b and a b b respectively, embedded in the center of the yield of their roots.

Table 1 :
Unlabeled parse evaluation scores of different systems on synthetic data.

Table 2 :
PARSEVAL results of different hyperparameter settings for the DB-PCFG system on the Adam dataset. Hyperparameter D is the number of possible depths, and K is the number of non-terminals.

Table 3 :
PARSEVAL scores on the Eve dataset for all competing systems. These are unlabeled precision, recall and F1 scores on constituent trees without punctuation. Both the right-branching baseline and the best-performing system are in bold. (**: p < 0.0001, permutation test)

D and K are set to optimize performance on the Adam section of the Brown Corpus of CHILDES, which is about twice as long as Eve. Following Seginer (2007a), Ponvert et al. (2011) and Shain et al. (2016), these experiments leave all punctuation in the input for learning, then remove it in all evaluations on development and test data.

Table 4 :
Performance of different systems on noun phrase recall and aggregated F1 scores on the Eve dataset. Ponvert et al. (2011) define base NPs as NPs with no NP descendants, a restriction motivated by their particular task (chunking).

Table 5 :
PARSEVAL scores for all competing systems on the WSJ10 and WSJ20 test sets. These are unlabeled precision, recall and F1 scores on constituent trees without punctuation (**: p < 0.0001, permutation test). The WSJ10test results for UHHMM come from a model induced with all WSJ10 sentences.

Table 6 :
Published PARSEVAL results for competing systems. See the text for details, as the systems are trained and evaluated differently.