Learning Tier-based Strictly 2-Local Languages

The Tier-based Strictly 2-Local (TSL2) languages are a class of formal languages which have been shown to model long-distance phonotactic generalizations in natural language (Heinz et al., 2011). This paper introduces the Tier-based Strictly 2-Local Inference Algorithm (2TSLIA), the first nonenumerative learner for the TSL2 languages. We prove the 2TSLIA is guaranteed to converge in polynomial time on a data sample whose size is bounded by a constant.


Introduction
This work presents the Tier-based Strictly 2-Local Inference Algorithm (2TSLIA), an efficient learning algorithm for a class of Tier-based Strictly Local (TSL) formal languages (Heinz et al., 2011). A TSL class is determined by two parameters: the tier, or subset of the alphabet, and the permissible tier k-factors, which are the legal sequences of length k allowed in a string once all non-tier symbols have been removed. The Tier-based Strictly 2-Local (TSL2) languages are those in which k = 2.
As will be discussed below, the TSL languages are of interest to phonology because they can model a wide variety of long-distance phonotactic patterns found in natural language (Heinz et al., 2011; McMullin and Hansson, forthcoming). One example is derived from Latin liquid dissimilation, in which two ls cannot appear in a word unless there is an r intervening, regardless of distance. For example, floralis 'floral' is well-formed but not *militalis (cf. militaris 'military'). As explained in sections 2 and 4, this can be modeled with permissible 2-factors over a tier consisting of the liquids {l, r}.
For long-distance phonotactics, k can be fixed to 2, but it does not appear that the tier can be fixed, since languages employ a variety of different tiers. This presents an interesting learning problem: given a fixed k, how can an algorithm induce both a tier and a set of permissible tier k-factors from positive data?
There is some related work which addresses this question. Goldsmith and Riggle (2012), building on work by Goldsmith and Xanthos (2009), present a method based on mutual information for learning tiers and subsequently learning harmony patterns. This paper differs in that its methods are rooted firmly in grammatical inference and formal language theory (de la Higuera, 2010). For instance, in contrast to the results presented there, we prove exactly which kinds of patterns the 2TSLIA succeeds on and what kind of data is sufficient for it to do so.
Nonetheless, there is relevant work in computational learning theory: Gold (1967) proved that any finite class of languages is identifiable in the limit via an enumeration method. Given a fixed alphabet and a fixed k, the number of possible tiers and permissible tier k-factors is finite, and so the TSL languages are learnable in this way. However, such learners are grossly inefficient. No provably correct, non-enumerative, efficient learner for both the tier and permissible tier k-factor parameters has previously been proposed. This work fills this gap with an algorithm which learns these parameters when k = 2 from positive data in time polynomial in the size of the data.
Finally, Jardine (2016) presents a simplified version of the 2TSLIA and reports the results of some simulations. Unlike that paper, the present work provides the full mathematical details and proofs. The simulations are discussed in the discussion section. This paper is structured as follows. §2 motivates the work with examples from natural language phonology. §3 outlines the basic concepts and definitions to be used throughout the paper. §4 defines the TSL languages and discusses their properties. §5 details the 2TSLIA and proves that it learns the TSL2 class in polynomial time and data. §6 discusses future work, and §7 concludes.

Linguistic motivation
The primary motivation for studying the TSL languages and their learnability comes from their relevance to phonotactic (word well-formedness) patterns in natural language phonology. Many phonotactic patterns belong to either the Strictly Local (SL) languages (McNaughton and Papert, 1971) or the Strictly Piecewise (SP) languages (Rogers et al., 2010). An example of phonotactic knowledge which is SL is Chomsky and Halle's (1965) observation that English speakers will classify blick as a possible word of English while rejecting bnick as impossible. This is SL because it can be described as *bn being unacceptable as a word-initial sequence. SP languages can describe long-distance dependencies based on precedence relationships between sounds (such as consonant harmony in Sarcee, in which s may not follow a ʃ anywhere in a word, but may precede one (Cook, 1984)) (Heinz, 2010a). Furthermore, the SL and SP languages are efficiently learnable (García et al., 1990; Heinz, 2010a; Heinz, 2010b; Heinz and Rogers, 2013).
However, there are some long-distance patterns which cannot be described purely with precedence relationships. One example is a pattern from Latin, in which in certain cases an l cannot follow another l unless an r intervenes, no matter the distance between them (Jensen, 1974; Odden, 1994; Heinz, 2010a). This can be seen in the -alis adjectival suffix (Example 1), which appears as -aris if the word it attaches to already contains an l ((d) through (f) in Example 1), except in cases where there is an intervening r, in which case it appears again as -alis ((g) through (i) in Example 1). In the examples, the sounds in question are underlined for emphasis (data from Heinz (2010a)), and for (d) through (f), illicit forms are given, marked with a star (*). This non-local alternating pattern is not SP because SP languages cannot express blocking effects (Heinz, 2010a). However, it can be described with a TSL grammar in which the tier is {r, l} and the permissible tier 2-factors do not include *ll. This yields exactly the set of strings in which an l is never immediately (disregarding all sounds besides {r, l}) followed by another l.
Formal learning algorithms for SL and SP languages can provide a model for human learning of SL and SP sound patterns (Heinz, 2010a). TSL languages are also similarly learnable, given the stipulation that both the tier and k are fixed. For natural language, the value for k never seems to go above 2 (Heinz et al., 2011). However, tiers vary in human language: TSL patterns occur both with different kinds of vowels (Nevins, 2010) and consonants (Suzuki, 1998; Bennett, 2013). For example, in Turkish the tier is the entire vowel inventory (Clements and Sezer, 1982), while in Finnish it is the vowels except /i, e/ (Ringen, 1975). In Samala consonant harmony, the tier is the sibilants (Rose and Walker, 2004), whereas in Koorete, the tier is the sibilants and {b, r, g, d} (McMullin and Hansson, forthcoming). Thus, it is of interest to understand how both the tier and the permissible tier 2-factors for TSL2 grammars might be learned efficiently.

Preliminaries

For a set S, let P(S) denote the powerset of S and P_fin(S) denote the set of all finite subsets of S.
Let Σ denote a set of symbols, referred to as the alphabet, and let a string over Σ be a finite sequence of symbols from that alphabet. The length of a string w will be denoted |w|. Let λ denote the empty string; |λ| = 0. Let Σ* (Kleene star) represent all strings over this alphabet, and Σ^k represent all strings of length k. Concatenation of two strings w and v (or symbols σ_1 and σ_2) will be written wv (σ_1σ_2). Special beginning (⋊) and end (⋉) symbols (⋊, ⋉ ∉ Σ) will often be used to mark the beginning and end of words; the alphabet Σ augmented with these, Σ ∪ {⋊, ⋉}, will be denoted Σ_⋊⋉.
A string u ∈ Σ* is said to be a factor or substring of another string w if there are two other strings v, v′ ∈ Σ* such that w = vuv′. We call u a k-factor of w if it is a factor of w and |u| = k. Let fac_k : Σ* → P(Σ^≤k) be a function mapping strings to their k-factors, where fac_k(w) equals {u | u is a k-factor of w} if |w| > k and equals {w} otherwise. For example, fac_2(aab) = {aa, ab} and fac_8(aab) = {aab}. We extend the k-factor function to languages: fac_k(L) = ∪_{w∈L} fac_k(w).
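As a concrete illustration, the k-factor function can be sketched in a few lines of Python (the function name fac_k is ours):

```python
def fac_k(w, k):
    """k-factors of w: its length-k substrings, or {w} itself if w is shorter than k."""
    if len(w) <= k:
        return {w}
    return {w[i:i + k] for i in range(len(w) - k + 1)}
```

For instance, fac_k("aab", 2) returns {"aa", "ab"} and fac_k("aab", 8) returns {"aab"}, matching the examples above.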

Grammars, languages, and learning
A language (or stringset) L is a subset of Σ*. If L is finite, let |L| denote the cardinality of L, and let ||L|| denote the size of L, which is defined to be ∑_{w∈L} |w|. Let L_1 · L_2 denote the concatenation of the languages L_1 and L_2, i.e., the pairwise concatenation of each word in L_1 to each word in L_2. For notational simplicity, the concatenation of a singleton language {w} to another language L_2 (or L_1 to {w}) will be written wL_2 (L_1w).
A grammar is a finite representation of a possibly infinite language.A class L of languages is represented by a class R of representations if every r ∈ R is of finite size and there is a naming function L : R → L which is both total and surjective.
The learning paradigm used in this paper is the identification in the limit learning paradigm (Gold, 1967), with polynomial bounds on time and data (de la Higuera, 1997). This paradigm has two complementary aspects. First, it requires that the information which distinguishes a learning target from other potential targets be present in the input for algorithms to successfully learn. Second, it requires successful algorithms to return a hypothesis in time polynomial in the size of the sample, and that the size of the sample itself be polynomial in the size of the grammar.
The definition of the learning paradigm (Definition 3) depends on some preliminary notions.

Definition 1. Let L be a class of languages represented by some class R of representations.

1. An input sample I for a language L ∈ L is a finite set of data consistent with L, that is to say, I ⊆ L.

2. An (L, R)-learning algorithm A is a program that takes as input a sample for a language L ∈ L and outputs a representation from R.
The notion of a characteristic sample is integral.

Definition 2 (Characteristic sample). For an (L, R)-learning algorithm A, a sample CS is a characteristic sample of a language L ∈ L if for all samples I for L such that CS ⊆ I, A returns a representation r such that L(r) = L.
Now the learning paradigm can be defined.

Definition 3 (Identification in polynomial time and data). A class L of languages is identifiable in polynomial time and data if there exists an (L, R)-learning algorithm A and two polynomials p() and q() such that:

1. For any sample I of size m for L ∈ L, A returns a hypothesis r ∈ R in O(p(m)) time.

2. For each representation r ∈ R of size n, there exists a characteristic sample of r for A of size at most O(q(n)).

Tier-based Strictly Local Languages
This section introduces the Tier-based Strictly Local (TSL) languages (Heinz et al., 2011). The TSL languages generalize the SL languages (McNaughton and Papert, 1971; García et al., 1990; Caron, 2000), and as such the SL languages will briefly be discussed first.

The Strictly Local Languages
The SL languages can be defined as follows (Heinz et al., 2011):

Definition 4 (SL languages). A language L is Strictly k-Local iff there exists a finite set S ⊆ fac_k(⋊Σ*⋉) such that L = {w ∈ Σ* | fac_k(⋊w⋉) ⊆ S}.

Such a set S is sometimes referred to as the permissible k-factors or permissible substrings of L. For example, let L = {λ, ab, abab, ababab, ...}. This L can be described with a set of permissible 2-factors S = {⋊⋉, ⋊a, ab, ba, b⋉}, because every 2-factor of ⋊w⋉, for each word w in L, is in S; thus, L is Strictly 2-Local (abbreviated SL2).
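To make the definition concrete, here is a minimal Python sketch of SL2 membership checking, using '<' and '>' as stand-ins for the boundary symbols ⋊ and ⋉ (the function name sl2_ok is ours):

```python
def sl2_ok(w, S):
    """SL2 membership: every 2-factor of the boundary-marked word is in S."""
    m = "<" + w + ">"
    return all(m[i:i + 2] in S for i in range(len(m) - 1))

# Permissible 2-factors for L = {λ, ab, abab, ...}
S = {"<>", "<a", "ab", "ba", "b>"}
```

Here sl2_ok("abab", S) is True, while sl2_ok("aab", S) is False, since aa is not a permissible 2-factor.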
As a set S of permissible k-factors is finite, it can also be viewed as an SL grammar, where the language of the grammar is L(S) = {w ∈ Σ* | fac_k(⋊w⋉) ⊆ S}. A canonical SL grammar contains no useless elements (i.e., s ∈ S iff s ∈ fac_k(⋊w⋉) for some w ∈ L(S)).
In the example above, aa ∉ S and aa ∉ fac_2(⋊w⋉) for any w ∈ L. Such a string is referred to as a forbidden k-factor or a restriction on L. The set of forbidden k-factors R is fac_k(⋊Σ*⋉) − S. Thinking about the grammar in terms of S or in terms of R is equivalent, but in some cases it is simpler to refer to one rather than the other, so we shall use both.
Any SL_k class of languages is learnable with polynomial bounds on time and data if k is known in advance (García et al., 1990; Heinz, 2010b).
The class of SL_k languages (for each k) belongs to a collection of language classes called string extension language classes (Heinz, 2010b). The discussion above presents SL_k languages from this perspective. These language classes have many desirable learning properties, due to their underlying lattice structure (Heinz et al., 2012).

The TSL languages
The TSL languages can be thought of as a further parameterization of the k-factor function in which only a certain subset of the alphabet takes part in the grammar and all other symbols in the alphabet are ignored. This special subset is referred to as a tier T ⊆ Σ. Symbols not on the tier are removed from consideration by an erasing function E_T : Σ* → T*, which deletes every symbol not in T. For example, if Σ = {a, b, c} and T = {a, c}, then E_T(bbabbcbba) = aca. We can then define a tier version of the k-factor function: fac_{T-k}(w) = fac_k(⋊E_T(w)⋉). Here, ⋊ and ⋉ are built into the function, as they are always treated as part of the tier. Continuing the example from above, fac_{T-2}(bbabbcbba) = {⋊a, ac, ca, a⋉}. fac_{T-k} can be extended to languages as with fac_k above.
The TSL languages can now be defined parallel to the SL languages (the following definition is equivalent to the one in Heinz et al. (2011)):

Definition 5 (TSL languages). A language L is Tier-based Strictly k-Local iff there exists a subset T ⊆ Σ of the alphabet and a finite set S ⊆ fac_{T-k}(Σ*) such that L = {w ∈ Σ* | fac_{T-k}(w) ⊆ S}.

Parallel to SL grammars above, ⟨T, S⟩ can be thought of as a TSL grammar of L. Likewise, the set of forbidden tier substrings (or tier restrictions) R is simply fac_{T-k}(Σ*) − S. Finally, ⟨T, S⟩ is canonical if S contains no useless elements (i.e., s ∈ S ⇔ s ∈ fac_{T-k}(L(⟨T, S⟩))) and R is nonempty (this second restriction is explained below).
For example, let Σ = {a, b, c} and T = {a, c} as above, and let S = {⋊⋉, ⋊a, ⋊c, ac, a⋉, ca, cc, c⋉}. Plugging these into Definition 5, we obtain a language L which contains exactly those strings without the forbidden 2-factor aa on tier T. These are words which may contain bs interspersed with as and cs, provided that no a precedes another a without an intervening c. For example, bbabbcbba ∈ L but bbabbbbabb ∉ L, because E_T(bbabbbbabb) = aa and aa ∈ fac_2(⋊aa⋉) but aa ∉ S.
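The erasing function and the tier-based membership check can be sketched in Python; '<' and '>' again stand in for ⋊ and ⋉, and the function names are ours:

```python
def erase(w, tier):
    """E_T: delete every symbol of w that is not on the tier."""
    return "".join(c for c in w if c in tier)

def tsl2_ok(w, tier, S):
    """TSL2 membership: every 2-factor of the boundary-marked tier projection is in S."""
    m = "<" + erase(w, tier) + ">"
    return all(m[i:i + 2] in S for i in range(len(m) - 1))

# The example grammar: T = {a, c}, with aa forbidden on the tier
T = {"a", "c"}
S = {"<>", "<a", "<c", "ac", "a>", "ca", "cc", "c>"}
```

On this grammar, tsl2_ok("bbabbcbba", T, S) is True while tsl2_ok("bbabbbbabb", T, S) is False, as in the text.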
Like the class of SL_k languages, the class of TSL languages (for fixed T and k) is a string extension language class. The relationship of the TSL languages to other sub-regular language classes (McNaughton and Papert, 1971; Rogers et al., 2013) is studied in Heinz et al. (2011).
Given a fixed k and T, the set S is easily learnable in the same way as the SL languages (Heinz et al., 2011). However, as discussed above, in the case of natural language phonology, it is not clear that information about T is available a priori. Learning both T and S simultaneously is thus an interesting problem.
This problem admits a technical, but unsatisfying, solution. The number of subsets T such that T ⊆ Σ is finite, so the number of TSL_k languages given a fixed k is finite. It is already known that any finite class of languages which can be enumerated can be identified in the limit by an algorithm which checks through the enumeration (Gold, 1967). However, given the cognitive relevance of the TSL languages, it is of interest to pursue a smarter, computationally efficient method for learning them.
What are the consequences of varying T? When T = Σ, the result is an SL_k language, because the erasing function operates vacuously (Heinz et al., 2011). Conversely, when T = ∅, by Definition 5, S is either ∅ or {⋊⋉}. The former obtains the empty language while the latter obtains Σ*. By Definition 5, Σ* can also be described with any T ⊆ Σ as long as S = fac_2(⋊T*⋉). In such a grammar, none of the members of T serve any purpose; thus we stipulate that for a canonical grammar R is nonempty.
Importantly, a member of T may fail to appear in any string in R and still serve a purpose. For example, let Σ = {a, b, c}, T = {a, c}, and S = fac_2(⋊T*⋉) − {aa}. Because c appears in no forbidden tier substrings in R, it is freely distributed in L = L(⟨T, S⟩). However, it makes a difference in the language, because aca ∈ L but aba ∉ L. If c (and the relevant tier substrings in S) were missing from the tier, neither aba nor aca would be in L. This can be thought of as a 'blocking' function of c, because it allows sequences a...a even though aa ∈ R.
We may now return to the Latin example from §2 in a little more detail. Recall the Latin pattern, in which an l may not follow another l, regardless of other intervening sounds, unless an r comes between them. This can be captured by a TSL grammar in which T = {l, r} and S = {⋊⋉, ⋊l, ⋊r, lr, rl, rr, r⋉, l⋉}. This captures the generalization that, ignoring all sounds besides l and r, l and l are never allowed to be adjacent. The remainder of the paper discusses how such a grammar, including T, may be induced from positive data.
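The Latin grammar can be checked the same way; this minimal Python sketch (function name ours) projects a word onto the liquid tier and rejects any word in which two ls are tier-adjacent:

```python
def latin_ok(word):
    """Latin liquid dissimilation: no l may follow another l on the {l, r} tier."""
    liquids = "".join(c for c in word if c in "lr")
    return "ll" not in liquids
```

Here latin_ok("floralis") and latin_ok("militaris") are True, while latin_ok("militalis") is False, matching the data in §2.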

Algorithm
This section introduces the Tier-based Strictly 2-Local Inference Algorithm (2TSLIA), which induces both a tier and a set of permissible tier 2-factors from positive data. First, in §5.1, the concept of a path, which is crucial to the learner, is defined. §5.2 defines terminology useful for describing the algorithm's operation. §5.3 introduces and describes the algorithm, and §5.4 defines a distinguishing sample and proves that it is a characteristic sample on which the algorithm is guaranteed to converge. Time and data complexity for the algorithm are also discussed there.

Paths
First, we define the concept of 2-paths. How this concept might be generalized to k is discussed in §6. Paths denote precedence relations between symbols in a string, but they are further augmented with sets of intervening symbols. Formally, a 2-path is a 3-tuple ⟨x, Z, y⟩ where x and y are symbols in Σ ∪ {⋊, ⋉} and Z is a subset of Σ. The 2-paths of a string w = σ_0σ_1...σ_n, denoted paths_2(w), are

paths_2(w) = {⟨σ_i, X, σ_j⟩ | 0 ≤ i < j ≤ n and X = {σ_h | i < h < j}}.

Essentially, the 2-paths are pairs of symbols in a string ordered by precedence and interpolated with the sets of symbols intervening between them. For example, ⟨a, {b, c}, d⟩ is in paths_2(abbcd) because a precedes d and {b, c} is the set of symbols that come between them. Intuitively, a path gives the set of symbols one must 'travel over' in order to get from one symbol to another in a string. As shall be seen shortly, this information is essential to how the algorithm works. Let paths_2() be extended to languages such that paths_2(L) = ∪_{w∈L} paths_2(w).
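Computed directly from the definition, a Python sketch of the 2-path function (name ours) is:

```python
from itertools import combinations

def paths2(w):
    """All 2-paths <x, Z, y> of w: ordered symbol pairs with their intervening sets."""
    return {(w[i], frozenset(w[i + 1:j]), w[j])
            for i, j in combinations(range(len(w)), 2)}
```

For example, ("a", frozenset({"b", "c"}), "d") is in paths2("abbcd"). Frozensets are used so that paths are hashable and can themselves be collected into a set.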
As we are here only concerned with 2-paths, we henceforth simply refer to them as 'paths'.
Remark 1. The paths of a string w = σ_1...σ_n can be calculated in time at most quadratic in the size of w. To see why, consider an n×n table and consider i, j such that 1 ≤ i < j ≤ n. The idea is that each nonempty cell in the table contains the set of intervening symbols between σ_i and σ_j. Let p(x, y) denote the entry in the x-th row and the y-th column. The following hold: p(i, i+1) = ∅; p(i, i+2) = {σ_{i+1}}; and p(i, j) = ∪_{i≤x≤j−2} p(x, x+2) for any j ≥ i+3. Since each of these operations is linear time or less, the size of the table, which equals n², provides an upper bound on the time complexity of paths_2(w). (This bound is not tight since half the table is empty.)

Terminology
Before introducing the algorithm, we define some terms which are useful for understanding its operation. For a TSL2 grammar G = ⟨T, S⟩, T ⊆ Σ will be referred to as the tier, S the allowed tier substrings, and R = fac_{T-2}(Σ*) − S the forbidden tier substrings. Let H = Σ − T be the non-tier elements of G. For any elements σ_i, σ_j ∈ T, their tier adjacency with respect to a set L of strings refers to whether or not they may appear adjacent on the tier; formally, σ_i and σ_j are tier-adjacent in L iff σ_iσ_j ∈ fac_{T-2}(w) for some w ∈ L.
An exclusive blocker is a σ ∈ T which does not appear in R; that is, ∀w ∈ R, σ ∉ fac_1(w). An exclusive blocker is thus not restricted in its distribution but may intervene between other elements on the tier. The free elements of a grammar G will refer to the union of the set of non-tier elements of G and the exclusive blockers given G.
Given an order σ_0, σ_1, ..., σ_n on Σ, let Σ_i = {σ_h | h < i} refer to the elements of Σ less than σ_i and J_i = Σ − Σ_i be the elements σ_i and greater. Note that Σ_0 = ∅ and J_0 = Σ. Let H_i = H ∩ Σ_i be the non-tier elements less than σ_i and T_i = (T ∩ Σ_i) ∪ J_i be the expanded tier given σ_i, or the tier plus the elements from J_i. Note that H_i and T_i are complements of each other with respect to Σ. In the context of the algorithm, T_i will represent the algorithm's current hypothesis for the tier. When referring not to the order on Σ but to the index of positions in a string or path, τ_1, τ_2, ... ∈ Σ ∪ {⋊, ⋉} shall be used.
For a path ⟨τ_1, X, τ_2⟩, we refer to τ_1 and τ_2 as the symbols on the path, and X as the intervening set. For a set of paths P, the term tier paths will refer to the paths whose symbols are on T. Formally, the tier paths are P_T = {⟨τ_1, X, τ_2⟩ ∈ P | τ_1, τ_2 ∈ T ∪ {⋊, ⋉}, X ⊆ Σ}. Note that P_T is restricted only by the symbols on the path; the intervening set X may be any subset of Σ.

Algorithm
The 2TSLIA (Algorithm 1) takes an ordered alphabet Σ and a set of input strings I and returns a TSL grammar ⟨T, S⟩. It has two main functions: get_tier, which calculates T, and main, which calls get_tier and uses the resulting tier to determine S.
First, get_tier takes as arguments an index i, an expanded tier T_i, and a set of paths P. The expanded tier is the algorithm's hypothesis for the tier at stage i. It is first called with T_i = Σ (the most conservative hypothesis for the tier) and i = 0. The goal of get_tier is to whittle T_i, which is the set of elements known to be in T plus the set of elements whose membership in T has not yet been determined, down to only elements known to be in T.
The get_tier function recursively iterates through T_i, starting with σ_0, to determine which members of T_i should be in the final hypothesis T. Two other pieces of data are important for this: P_{T_i}, the tier paths whose symbols are in T_i ∪ {⋊, ⋉}, and H_i, the set of non-tier elements less than σ_i. The set H_i is the algorithm's largest safe hypothesis for the non-tier elements, and the algorithm will reason about restrictions on σ_i using paths from P_{T_i}.
Elements of Σ are checked for membership on the tier in order; thus, when checking σ_i, get_tier has already checked every σ_h for h < i. For each σ_i, membership in T is decided by two conditions, labeled (a) and (b) in the if-then statement of get_tier. Condition (a) tests whether σ_i is a free element, and condition (b) further tests whether σ_i is an exclusive blocker. If σ_i is found to be a free element and not an exclusive blocker, then it is a non-tier element (see §5.2) and should be removed from T. In detail:

Condition (a). To test whether σ_i is a free element, this condition checks whether there are any restrictions on σ_i given P_{T_i} and H_i. It searches through P_{T_i} for ⟨⋊, X, σ_i⟩, ⟨σ_i, X′, ⋉⟩, and, for all σ ∈ T_i, ⟨σ_i, Y, σ⟩ and ⟨σ, Y′, σ_i⟩, where the intervening sets X, X′, Y, Y′ are all subsets of H_i. If all of these appear in P_{T_i}, it means that σ_i is a free element with respect to T_i.

Condition (b). To test whether σ_i is an exclusive blocker, this condition examines every path ⟨τ_1, X, τ_2⟩ ∈ P_{T_i} in which σ_i ∈ X and X − {σ_i} ⊆ H_i, and searches P_{T_i} for a corresponding path ⟨τ_1, X′, τ_2⟩ where X′ ⊆ H_i. If some pair τ_1, τ_2 appears only with σ_i intervening, then σ_i is an exclusive blocker and is kept on the tier; otherwise it is removed.

Once get_tier returns the final tier T, main computes S by collecting the pairs τ_1τ_2 for every path ⟨τ_1, X, τ_2⟩ ∈ P whose symbols are in T ∪ {⋊, ⋉} and whose intervening set X is a subset of the non-tier elements Σ − T. Such τ_1τ_2 pairs are thus tier-adjacent in I. The resulting grammar ⟨T, S⟩ is then returned.
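Since Algorithm 1 itself is not reproduced here, the following Python sketch (all names ours) shows one way the description above can be realized; it is an illustration of the logic, not the authors' exact pseudocode. '<' and '>' stand in for ⋊ and ⋉:

```python
from itertools import combinations

LB, RB = "<", ">"  # stand-ins for the boundary symbols

def paths2(w):
    """2-paths of the boundary-marked string <w>."""
    s = LB + w + RB
    return {(s[i], frozenset(s[i + 1:j]), s[j])
            for i, j in combinations(range(len(s)), 2)}

def tsl2ia(sigma, sample):
    """Sketch of the 2TSLIA: induce (tier, allowed tier 2-factors) from strings."""
    P = set().union(*(paths2(w) for w in sample))
    tier = list(sigma)          # most conservative hypothesis: T = Sigma
    removed = set()             # non-tier elements identified so far (H_i)
    for s in sigma:             # check each symbol in the fixed order
        on_tier = set(tier) | {LB, RB}
        Pt = {p for p in P if p[0] in on_tier and p[2] in on_tier}
        def seen(x, y):         # is <x, X, y> attested with X a subset of removed?
            return any(p[0] == x and p[2] == y and p[1] <= removed for p in Pt)
        # Condition (a): s is a free element w.r.t. the current tier hypothesis
        free = (seen(LB, s) and seen(s, RB)
                and all(seen(s, t) and seen(t, s) for t in tier))
        # Condition (b): s is NOT an exclusive blocker -- every pair attested
        # with s (plus known non-tier symbols) intervening is also attested
        # with only known non-tier symbols intervening
        expendable = all(
            seen(p[0], p[2])
            for p in Pt if s in p[1] and p[1] - {s} <= removed)
        if free and expendable:
            tier.remove(s)
            removed.add(s)
    S = {(x, y) for (x, X, y) in P
         if x in set(tier) | {LB, RB} and y in set(tier) | {LB, RB}
         and X <= removed}
    return set(tier), S

# A sample rich enough for the running example T = {a, c}, R = {aa}
sample = ["", "a", "b", "c", "ab", "ba", "bb", "bc", "cb", "ac", "ca", "cc", "aca"]
T, S = tsl2ia(["a", "b", "c"], sample)
```

On this sample the sketch returns T = {a, c} and an S that excludes aa, matching the running example: b is removed as a free non-blocker, while c satisfies condition (a) but fails condition (b) (the pair a...a is attested only with c intervening) and so is retained as an exclusive blocker.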

Identification in polynomial time and data
Here we establish the main result of the paper: that the 2TSLIA identifies the TSL2 class in polynomial time and data. As is typical, the proof relies on establishing a characteristic sample for the 2TSLIA.
Lemma 1. Given an input sample I of size n, the 2TSLIA outputs a grammar in time polynomial in n.
Proof. The paths are calculated for each word once at the beginning of main. This takes time quadratic in the size of the sample (Remark 1), so O(n²). Call this set of paths P (note that P is also bounded in size by n²). Additionally, the loop in get_tier is executed exactly |Σ| times. Checking condition (a) in get_tier requires a single pass through the paths. On the other hand, condition (b) requires, in the worst case, one search through P for every path p ∈ P, which is O(n⁴). Thus the time complexity of get_tier is O(|Σ|(n² + n⁴)) = O(n⁴), since |Σ| is a constant.
Lastly, finding the permissible substrings on the tier also requires a single pass through P, which again takes O(n²). Altogether, then, an upper bound on the time complexity of the 2TSLIA is given by O(n² + n⁴ + n²) = O(n⁴), which is polynomial.
Next we define a distinguishing sample of a target language for the 2TSLIA. We first show that it is polynomial in the size of the target grammar. We then show that it is a characteristic sample.
Definition 6 (Distinguishing sample). Given an alphabet Σ with some order σ_1, σ_2, ..., σ_n on Σ, a target language L ∈ TSL2, and the canonical grammar G = ⟨T, S⟩ for L, a distinguishing sample D for G is a set meeting the following conditions. Recall that H = Σ − T is the set of non-tier elements and H_i refers to the non-tier elements less than σ_i.

1. The non-tier element condition. For each non-tier element σ_i ∈ H: (i) there are words in D witnessing ⟨⋊, X, σ_i⟩ and ⟨σ_i, X′, ⋉⟩ and, for each σ ∈ T_i, ⟨σ_i, Y, σ⟩ and ⟨σ, Y′, σ_i⟩, where the intervening sets X, X′, Y, Y′ are all subsets of H_i; and (ii) for each τ_1τ_2 ∈ S, there is some w ∈ D such that ⟨τ_1, X, τ_2⟩ ∈ paths_2(w) where X ⊆ H_i.

2. The exclusive blocker condition. For each exclusive blocker σ_i ∈ T and each τ_1τ_2 ∈ R, there is some w ∈ D such that ⟨τ_1, X, τ_2⟩ ∈ paths_2(w) where σ_i ∈ X and X − {σ_i} ⊆ H_i.

3. The allowed tier substring condition. For each τ_1τ_2 ∈ S, there is some w ∈ D such that ⟨τ_1, X, τ_2⟩ ∈ paths_2(w) where X ⊆ H.

Essentially, item (1) ensures that every symbol σ_i not on the target T will meet conditions (a) and (b) in the for loop and be removed from the algorithm's hypothesis for T. Item (2) ensures that, in the situation where an exclusive blocker σ_i ∈ T meets condition (a) for removal from the tier hypothesis, it will not meet condition (b). Item (3) ensures that the sample witnesses every τ_1τ_2 in S. These points will be discussed in more detail in the proof that the distinguishing sample is a characteristic sample.
Lemma 2. Given a grammar G = ⟨T, S⟩ whose size is |T| + ||S||, the size of a distinguishing sample D for G is polynomial in the size of G.
Proof. Recall that T ⊆ Σ and S ⊆ (Σ ∪ {⋊, ⋉})², and that H = Σ − T and R = (Σ ∪ {⋊, ⋉})² − S. The non-tier element condition requires that for each σ ∈ H and σ′ ∈ T, the sample minimally contains the words σ, σ′σ, and σσ′, whose total length is at most |H| + 4|H||T|. The exclusive blocker condition requires, for each exclusive blocker σ ∈ T and each τ_1τ_2 ∈ R, that minimally τ_1στ_2 is contained in the sample. Letting B denote the set of exclusive blockers, the total length of these words is 3|B||R|. Finally, the allowed tier substring condition requires, for each τ_1τ_2 ∈ S, that minimally τ_1τ_2 is contained in the sample; hence, the total length of these words is 2|S|.
Altogether, this means there is a distinguishing sample D whose total size is the sum of the above quantities, and hence polynomial in the size of G.

Next we prove Lemma 3, which shows that the tier conjectured by the 2TSLIA at step i is correct for all symbols up to σ_i. Thus, once all symbols have been treated by the 2TSLIA, its conjecture for the tier is correct. Let T′_i denote the algorithm's current tier hypothesis when σ_i is being checked in the for loop of the get_tier function, let H′_i = Σ − T′_i be the algorithm's hypothesis for the set of non-tier elements less than σ_i, and let P_{T′_i} be the set of paths under consideration (i.e., the set of tier paths from the initialization before the for loop). As above, τ_0, τ_1, ..., τ_m index positions in a string or path.
Lemma 3. Let Σ = {σ_0, ..., σ_n} and consider any G = ⟨T, S⟩. Given any finite input sample I which contains a distinguishing sample D for G, it is the case that for all i, T′_i = T_i.

Proof. The proof is by recursion on i. The base case is when i = 0. By definition, T_0 = Σ. The algorithm starts with T′_0 = Σ, so T′_0 = T_0. Next we assume the recursive hypothesis (RH): for some i ∈ N, T′_i = T_i. We prove that T′_{i+1} = T_{i+1}. Specifically, we show that if RH is true for i, then (Case 1) if σ_i ∈ H then σ_i ∈ H′_{i+1}, and (Case 2) if σ_i ∈ T then σ_i ∉ H′_{i+1}.

Case 1. This is the case in which σ_i is a non-tier element. The non-tier element condition in Definition 6 for D ensures that the data in I will meet both conditions (a) and (b) in the for loop of get_tier for removing σ_i from the tier hypothesis.
Condition (a) requires that ⟨⋊, X, σ_i⟩ ∈ P_{T′_i} and ⟨σ_i, X′, ⋉⟩ ∈ P_{T′_i} for some X, X′ ⊆ H′_i, and that, ∀σ ∈ T′_i, ⟨σ_i, Y, σ⟩ ∈ P_{T′_i} and ⟨σ, Y′, σ_i⟩ ∈ P_{T′_i} for some Y, Y′ ⊆ H′_i. Part (i) of the non-tier element condition in Definition 6 ensures that for σ_i there are words w_1, w_2 ∈ I such that ⟨⋊, X, σ_i⟩ ∈ paths_2(w_1) and ⟨σ_i, X′, ⋉⟩ ∈ paths_2(w_2), and, for all σ ∈ T_i, words v_1, v_2 ∈ I such that ⟨σ_i, Y, σ⟩ ∈ paths_2(v_1) and ⟨σ, Y′, σ_i⟩ ∈ paths_2(v_2), where the intervening sets X, X′, Y, Y′ are all subsets of H_i. Because by RH H′_i = H_i (and T′_i = T_i), condition (a) for removing σ_i from the tier hypothesis is satisfied.
For σ_i to satisfy condition (b), for any path ⟨τ_1, X, τ_2⟩ ∈ P_{T′_i} such that σ_i ∈ X and X − {σ_i} ⊆ H′_i, there must be another path ⟨τ_1, X′, τ_2⟩ ∈ P_{T′_i} where X′ ⊆ H′_i. If τ_1τ_2 ∈ R, then such a ⟨τ_1, X, τ_2⟩ is guaranteed not to exist in P_{T′_i}, because τ_1 and τ_2 will, by the definition of R, not appear in the data with only non-tier elements between them. For τ_1τ_2 ∉ R, part (ii) of the non-tier element condition in Definition 6 ensures that some ⟨τ_1, Y, τ_2⟩ where Y ⊆ H_i exists in P_{T′_i}, as it requires that for each such τ_1, τ_2 there is some w ∈ I such that ⟨τ_1, X, τ_2⟩ ∈ paths_2(w) where X ⊆ H_i. By hypothesis H′_i = H_i, and so there is always guaranteed to be some ⟨τ_1, X′, τ_2⟩ ∈ P_{T′_i} where X′ ⊆ H′_i. Thus, condition (b) will always be satisfied for σ_i.
Thus, assuming the RH, a σ_i ∈ H is guaranteed to satisfy conditions (a) and (b) and be removed from the algorithm's hypothesis for the tier, so σ_i ∈ H′_{i+1}.
Case 2. This is the case in which σ_i ∈ T. There are two mutually exclusive possibilities. The first is that σ_i is not a free element. Here, σ_i is guaranteed not to be taken off of the tier hypothesis, because condition (a) for removing a symbol from the tier requires that for each σ_j ∈ T′_i there exist paths ⟨σ_j, X, σ_i⟩ and ⟨σ_i, X′, σ_j⟩ where X, X′ ⊆ H′_i. Because σ_i is not free, there is some σ_j such that σ_jσ_i ∈ R or σ_iσ_j ∈ R, and, from the definition of a TSL grammar, there will exist no ⟨σ_j, Y, σ_i⟩ (respectively, ⟨σ_i, Y′, σ_j⟩), where Y, Y′ ⊆ H, in the paths of I. Because H_i ⊆ H and by hypothesis H′_i = H_i, we have H′_i ⊆ H, so the algorithm will correctly register that σ_jσ_i ∈ R or σ_iσ_j ∈ R, and σ_i will remain on the tier hypothesis. Thus σ_i ∉ H′_{i+1}.
The other possibility is that σ_i is an exclusive blocker. If so, σ_i may satisfy condition (a) for tier removal (as discussed above for the case in which σ_i ∈ H). However, the exclusive blocker condition in Definition 6 for D guarantees that σ_i will not meet condition (b). As discussed above, condition (b) fails if there is a ⟨τ_1, X, τ_2⟩ ∈ P_{T′_i} such that X includes σ_i and zero or more elements of H′_i, and no other path ⟨τ_1, X′, τ_2⟩ ∈ P_{T′_i} where X′ only includes elements of H′_i. The exclusive blocker condition ensures that there is some τ_1τ_2 ∈ R and a word w ∈ I such that ⟨τ_1, Y, τ_2⟩ ∈ paths_2(w), where σ_i ∈ Y and Y − {σ_i} ⊆ H_i. From the definition of a TSL grammar, no w′ such that ⟨τ_1, Y′, τ_2⟩ ∈ paths_2(w′) where Y′ ⊆ H_i will appear in I, because τ_1τ_2 ∈ R and H_i ⊆ H. Because by hypothesis H′_i = H_i, the algorithm will correctly find ⟨τ_1, Y, τ_2⟩ and also find that there is no such ⟨τ_1, Y′, τ_2⟩. Thus, when σ_i is an exclusive blocker, it will not be removed from the tier hypothesis, and σ_i ∉ H′_{i+1}.
We now know that both Cases (1) and (2) hold; thus, assuming the RH, H′_{i+1} = H_{i+1}. Thus, by induction, (∀i)[H′_i = H_i]. Because H′_i and H_i are the complements of T′_i and T_i, respectively, (∀i)[T′_i = T_i].
Lemma 4 (Distinguishing sample = characteristic sample). A distinguishing sample for a TSL2 language, as defined in Definition 6, is a characteristic sample for that language for the 2TSLIA.
Proof. As above, let G′ = ⟨T′, S′⟩ be the output of the algorithm. From Lemma 3, we know that for any language L(G) with G = ⟨T, S⟩ and a distinguishing sample D for G, given any sample I of L such that D ⊆ I, (∀i)[T′_i = T_i]. That T′ = T immediately follows from this fact.
That S′ = S follows directly from T′ = T and the allowed tier substring condition of Definition 6 for D. The allowed tier substring condition states that for all τ_1τ_2 ∈ S, the distinguishing sample will contain some w such that ⟨τ_1, X, τ_2⟩ ∈ paths_2(w) where X ⊆ H. Because T′ = T, the for loop of main will correctly find all such τ_1τ_2. Thus, S′ = S, and G′ = G.
Theorem 1 (Identification in the limit in polynomial time and data). The 2TSLIA identifies the TSL2 languages in polynomial time and data.
We note further that in the worst case the time complexity is polynomial of degree four (Lemma 1), and the data complexity is constant, since for a fixed alphabet the size of any TSL2 grammar, and hence of its distinguishing sample, is bounded by a constant (Lemma 2).

Discussion
This algorithm opens up multiple paths for future work. The most immediate theoretical question is whether the algorithm presented here can be (efficiently) generalized to any TSL_k class, provided the learner knows k a priori. We believe that it can. The notion of 2-paths can be straightforwardly extended to k such that a k-path is a (2k−1)-tuple of the form ⟨σ_1, X_1, σ_2, X_2, ..., X_{k−1}, σ_k⟩, where each set X_i represents the symbols between σ_i and σ_{i+1}. The algorithm presented here can then be modified to check a set of such k-paths. We believe such an algorithm could be shown to be provably correct using a proof of similar structure to the one here, although time and data complexity will likely increase.
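A Python sketch of this k-path generalization (function name ours) follows directly from the definition:

```python
from itertools import combinations

def paths_k(w, k):
    """k-paths of w: k symbols in order, interleaved with intervening-symbol sets."""
    result = set()
    for idx in combinations(range(len(w)), k):
        path = [w[idx[0]]]
        for a, b in zip(idx, idx[1:]):
            path.append(frozenset(w[a + 1:b]))
            path.append(w[b])
        result.add(tuple(path))
    return result
```

Here paths_k(w, 2) coincides with the 2-paths of §5.1, and a 3-path such as ⟨a, {b}, c, ∅, d⟩ is recovered from the string abcd.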
However, in terms of applying these algorithms to natural language phonology, it is likely that a k value of 2 is sufficient. Heinz et al. (2011) argue that TSL2 can describe both long-distance dissimilation and assimilation patterns. One potential exception to this claim comes from Sundanese, where whether liquids must agree or disagree partly depends on the syllable structure (Bennett, 2015).
Additional issues arise with natural language data. One is that natural language corpora often include some exceptions to phonotactic generalizations. Algorithms which take such noisy data as input and output grammars that are guaranteed to represent languages 'close' to the target language have been obtained and studied in the PAC learning paradigm (Angluin and Laird, 1988). It would be interesting to apply such techniques and other similar ones to adapt the 2TSLIA into an algorithm that remains effective despite noisy input data.
Another area of future research is how to generalize over multiple tiers. Jardine (2016), in running versions of the 2TSLIA on natural language corpora, shows that it fails because local dependencies (which again can be modeled with T = Σ) prevent crucial information in the characteristic sample from appearing in the data. Furthermore, natural languages can have separate long-distance phonotactics which hold over distinct tiers. For example, KiYaka has both a vowel harmony pattern (Hyman, 1998) and a liquid-nasal harmony pattern over the tier {l, m, n, ŋ} (Hyman, 1995). Thus, words in KiYaka exhibit a pattern corresponding to the intersection of two TSL2 grammars, one with a vowel tier and one with a nasal-liquid tier. The problem of learning a finite intersection of TSL2 languages is thus another relevant learning problem.
One final way this result can be extended is to study the nature of long-distance processes in phonology. Chandlee (2014) extends the notion of SL languages to the Input- and Output-Strictly Local string functions, which are sufficient to model local phonological processes. Subsequent work (Chandlee et al., 2014; Chandlee et al., 2015; Jardine et al., 2014) has shown how these classes of functions can be efficiently learned, building on ideas from the learning of SL functions. An open question, then, is how these ideas can be used to develop a functional version of the TSL languages to model long-distance processes. The central result in this paper may then help in understanding how the tiers over which these processes apply can be learned.

Conclusion
This paper has presented an algorithm which can learn a grammar for any TSL2 language in time polynomial in the size of the input sample, given a characteristic sample whose size is bounded by a constant. As the TSL2 languages can model long-distance phonotactics in natural language, this represents a step towards understanding how humans internalize such patterns.