How Data Drive Early Word Learning: A Cross-Linguistic Waiting Time Analysis

The extent to which word learning is delayed by maturation as opposed to accumulating data is a longstanding question in language acquisition. Further, the precise way in which data influence learning on a large scale is unknown: experimental results reveal that children can rapidly learn words from single instances as well as by aggregating ambiguous information across multiple situations. We analyze Wordbank, a large cross-linguistic dataset of word acquisition norms, using a statistical waiting time model to quantify the role of data in early language learning, building on Hidaka (2013). We find that the model both fits and accurately predicts the shape of children's growth curves. Further analyses of model parameters suggest a primarily data-driven account of early word learning. The parameters of the model directly characterize both the amount of data required and the rate at which informative data occur. With high statistical certainty, words require on the order of ∼10 learning instances, which occur on average once every two months. Our method is extremely simple, statistically principled, and broadly applicable to modeling data-driven learning effects in development.

The first year of life is an incredibly productive time for language learners. Babies discover which sounds are in their language (Eimas, Siqueland, Jusczyk, & Vigorito, 1971; Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992), how speech is segmented (Saffran, Aslin, & Newport, 1996), what common words refer to (Bergelson & Swingley, 2012), and, toward the end of the first year, how to produce their first word (Brown, 1973; Schneider, Daniel, & Frank, 2015). This growth is a complex endeavor that requires relying on abilities in many domains: social and pragmatic understanding, conceptual representation, joint attention, and acoustic and motor systems. However, little is known about how the development of nonlinguistic factors influences language growth. For instance, is the timing of language growth locked to factors like the maturation of cognitive and motor systems (e.g., memory and attention), or to the growth of children's conceptual repertoire? Or, alternatively, is early language learning primarily limited by the amount of data that children receive about language itself?
Evidence for a data-driven view of the timing of language learning comes from studies showing the importance of linguistic input for early learning (Hoff, 2003; Huttenlocher, Haight, Bryk, Seltzer, & Lyons, 1991; Shneidman, Arroyo, Levine, & Goldin-Meadow, 2013; Weisleder & Fernald, 2013). However, there are complications for the view that data are all that matters. Maturational constraints are often thought to play an important role in language learning (Borer & Wexler, 1987; Newport, 1990). Many words like function words (e.g., "the") and number words (e.g., "two") are learned surprisingly late for their frequency, suggesting that the number of times a word is heard by a child is not a definitive predictor of learning. This fact has motivated hypothetical processes, including maturational constraints on function words or syntax (Borer & Wexler, 1987; Modyanova & Wexler, 2007) and conceptual or linguistic constraints in the case of number words (Carey, 2009).
At the heart of data-driven accounts is an ambiguity about how much data are required. Experimental studies of word learning have revealed children's ability to acquire word meanings from single instances (Carey & Bartlett, 1978; Heibeck & Markman, 1987; Markson & Bloom, 1997; Spiegel & Halberda, 2011), as well as from the aggregation of word usage across multiple contexts (Smith & Yu, 2008). It is not known which of these regimes governs the majority of lexical acquisition: Are most words learned by aggregation of tens, hundreds, or thousands of examples, or from a single informative instance?
Here, we develop a novel data analysis of word learning across 13 languages in order to address two questions about early word learning: When does it begin and how much data does it require? These questions turn out to be interrelated: they are coupled together by quantitative predictions that they make about the distribution of ages at which children learn a word. To illustrate this, consider a simplified picture of learning: Suppose that a word is learned by age 2. This could occur under many different situations. Three illustrative examples are: (a) the child could start accumulating data at birth, require about 24 cross-situational examples of the word, and receive them about once a month; (b) the child could start accumulating data at birth, require 4 examples, and receive them on average once every 6 months; (c) the child could start accumulating data at 12 months, require 12 cross-situational examples, and receive them once a month.
The central idea of our approach is that although (a), (b), and (c) predict the same mean age of learning, they critically predict different distributions of ages at which acquisition succeeds due to the statistics of waiting for data (see Figure 1). Empirical measurement of the distribution shape could in principle distinguish these hypotheses, informing us about how data influence the process of word learning. For instance, if the distribution supported (b), we might infer that there are few early constraints on learning since data accumulation begins at birth, and that learning required few examples. If the data supported (c), we might infer that cognitive or maturational constraints delayed the accumulation of data substantially, and that word learning required aggregating information across contexts.
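The contrast between the three scenarios can be made concrete with a small calculation. The sketch below assumes, as the model described later in the paper does, that the waiting time for k instances arriving at rate λ follows a Gamma distribution, so that the mean age of acquisition is s + k/λ and its standard deviation is √k/λ. The function and variable names are illustrative only.

```python
import math

# Scenario parameters taken from examples (a)-(c) in the text:
# s = start age (months), k = instances required, lam = instances per month.
scenarios = {
    "a": {"s": 0.0, "k": 24, "lam": 1.0},
    "b": {"s": 0.0, "k": 4, "lam": 1.0 / 6.0},
    "c": {"s": 12.0, "k": 12, "lam": 1.0},
}

def aoa_mean_sd(s, k, lam):
    """Mean and SD of age of acquisition when the wait is Gamma(k, lam)."""
    mean = s + k / lam       # start age plus expected waiting time
    sd = math.sqrt(k) / lam  # the start age shifts the mean but adds no spread
    return mean, sd

summary = {name: aoa_mean_sd(**p) for name, p in scenarios.items()}
for name, (mean, sd) in summary.items():
    print(f"({name}) mean = {mean:.1f} months, SD = {sd:.1f} months")
```

All three scenarios share a mean of 24 months, but under these assumptions scenario (b) is much more spread out than (a) or (c), which is the kind of difference in shape the analysis exploits.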
The logic of our approach is to formalize the process of learning by accumulating data. Following Hidaka (2013), we assume that learners successfully acquire a word after k effective learning instances (ELIs), that is, instances of the word that contribute to the learner's accumulating information about the word, and we assume that ELIs arrive with an average frequency of λ per month. However, unlike previous work, we also infer the age s at which data accumulation begins and implement our analyses in a Bayesian data analysis that is capable of inferring the likely ranges of parameter values from children's data. This Bayesian approach comes with several distinct advantages (Kruschke, 2010; Wagenmakers, Lee, Lodewyckx, & Iverson, 2008), including the ability to determine all three variables simultaneously, with our uncertainty in each correctly influenced by uncertainty in the others. Thus, our inferences about the amount of data required to learn a word are statistically adjusted for our uncertainty over when learning that word began, and vice versa. The analysis also has the potential to reveal that the data are not informative about these variables, in which case we would find high uncertainty in the parameters given children's data. The advantage of our analysis compared to Hidaka's (2013) model comparisons is that we can confidently focus on interpreting the parameter estimates.

PROBABILISTIC ASSUMPTIONS
Our model requires three primary assumptions: (i) age of acquisition (AoA) consists of two periods of time: a start time s before learning a word begins and an accumulation time t, during which children are waiting for data; (ii) children learn a word after observing a number k of ELIs of the word; and (iii) these ELIs occur stochastically, but at a fixed rate λ (measured here in ELIs per month). For instance, s = 0, k = 24, and λ = 1 in example (a) above. Note that the model infers these parameters from learning curves, not from counting putative ELIs in child-directed data. It is likely that a constellation of factors is involved in determining whether any given instance contributes to learning (counts as an ELI). Similarly, start time s could reflect several processes, including when children develop the ability to track and remember the data that they need to learn a word, or when their conceptual repertoire is ready to begin learning a word.
When data are observed stochastically with a rate λ that is uniform in time, the number of ELIs actually received in a month will follow a Poisson distribution with rate λ. Under these assumptions, the distribution of times t children must wait before receiving k ELIs follows a Gamma distribution Γ(k, λ) with density
$$ f(t; k, \lambda) = \frac{\lambda^{k}}{\Gamma(k)}\, t^{k-1} e^{-\lambda t}. \tag{1} $$
Thus, f describes the distribution of time children must wait before observing enough data to learn a word. The curves in Figure 1 are Gamma distributions with the appropriate values for k and λ. Note that in a Gamma, the mean (k/λ) scales linearly in the variance (k/λ²), meaning that if acquisition is driven by accumulating data, children's variance in learning times should scale with their mean learning time. Gamma-shaped learning time distributions should be taken as a hallmark of data-driven, constructivist accounts of learning (Xu, 2007; Xu & Kushnir, 2012); this hallmark applies to any theory of development in which accumulating data is the primary force advancing learners' knowledge.
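The link between Poisson-rate arrivals and Gamma-distributed waiting times can be checked by simulation. The sketch below is an illustration, not the authors' code: it assumes an integer k (the Erlang special case of the Gamma), draws the gaps between successive ELIs from an exponential distribution with rate λ, and compares the simulated mean and variance of the total wait against k/λ and k/λ². The values k = 10 and λ = 0.5 are chosen only for illustration.

```python
import random
import statistics

def simulate_wait(k, lam, rng):
    """Time until the k-th event when gaps between events are Exponential(lam)."""
    return sum(rng.expovariate(lam) for _ in range(k))

rng = random.Random(0)   # fixed seed so the run is repeatable
k, lam = 10, 0.5         # illustrative values only
waits = [simulate_wait(k, lam, rng) for _ in range(20000)]

sim_mean = statistics.fmean(waits)
sim_var = statistics.variance(waits)
print(f"simulated mean {sim_mean:.2f} vs k/lam = {k / lam:.2f}")
print(f"simulated variance {sim_var:.2f} vs k/lam^2 = {k / lam**2:.2f}")
```

With these settings the simulated moments should land close to 20 months and 40 months², up to sampling noise.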

THE DATA ANALYSIS MODEL
Our data analysis model uses Bayesian techniques to recover k, λ, and s from empirically measured learning curves. To do this, we require one data-analysis assumption: that the population of children studied is relatively homogeneous, meaning that we may extend a word's single s, k, and λ across children. In this case, the proportion of children who know a word at accumulation time T will approximate the cumulative distribution function of (1) at time T,
$$ F(T; k, \lambda) = \int_{0}^{T} f(t; k, \lambda)\, dt = \frac{\gamma(k, \lambda T)}{\Gamma(k)}, \tag{2} $$
where γ is the lower incomplete gamma function.
Figure 2 shows a graphical model of the relationships between these variables and the observed data. At each age a, N_a children were measured and x_a of them reported having learned the word to either production or comprehension. We model the number of children producing/comprehending the word x_a as being drawn from a binomial distribution with N_a trials and a probability of success equal to the proportion of children who know the word given by (2) at time t = a − s:
$$ x_a \sim \mathrm{Binomial}\big(N_a,\; F(a - s; k, \lambda)\big). \tag{3} $$
We assume uniform priors on these variables: k ∼ Uniform(0, 10,000) ELIs, λ ∼ Uniform(0, 10,000) ELIs/month, and s ∼ Uniform(0, 1,000) months. Bayesian inference in this generative model allows us to take the empirical acquisition curves and determine posterior distributions for k, λ, and s for each word in each language.
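To make the likelihood at the core of this model concrete, here is a minimal sketch of the binomial log-likelihood for one word, where the probability of knowing the word at age a is the Gamma cumulative distribution function evaluated at a − s. Everything in it is an assumption made for illustration: the function names, the series used to evaluate the Gamma CDF, the clamping constant, and the made-up counts. It does not reproduce the authors' inference procedure, which the text does not specify beyond the priors.

```python
import math

def gamma_cdf(t, k, lam, tol=1e-12, max_terms=10000):
    """Regularized lower incomplete gamma P(k, lam*t), via its power series."""
    x = lam * t
    if x <= 0.0:
        return 0.0  # no accumulation time has elapsed yet
    term = 1.0 / k
    total = term
    for n in range(1, max_terms):
        term *= x / (k + n)
        total += term
        if term < total * tol:
            break
    log_prefix = k * math.log(x) - x - math.lgamma(k)
    return min(1.0, total * math.exp(log_prefix))

def log_likelihood(ages, n_children, n_know, k, lam, s, eps=1e-9):
    """Sum of Binomial log-probabilities of the observed counts across ages."""
    total = 0.0
    for a, n, x in zip(ages, n_children, n_know):
        p = gamma_cdf(a - s, k, lam)
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
        total += log_choose + x * math.log(p) + (n - x) * math.log(1.0 - p)
    return total

# Made-up data: 100 children per age, counts set to the model's expectation.
ages = list(range(8, 31))
true = {"k": 10.0, "lam": 0.5, "s": 2.0}
n_children = [100] * len(ages)
n_know = [round(100 * gamma_cdf(a - true["s"], true["k"], true["lam"])) for a in ages]

ll_true = log_likelihood(ages, n_children, n_know, **true)
ll_other = log_likelihood(ages, n_children, n_know, k=1.0, lam=0.05, s=0.0)
print(ll_true, ll_other)
```

Because the priors are uniform, the posterior is proportional to this likelihood inside the prior bounds, so a function like this could be handed to any standard sampler or grid search. The comparison at the end shows that a parameter setting with a similar mean age but a different shape (k = 1) is penalized heavily.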

The Cumulative Gamma Matches Observed Word Learning Curves
Figure 3 shows a general visualization of the model fit across a variety of English words. Despite its simplicity, the model closely accounts for the empirical learning trajectories across word types for both comprehension and production. Quantitatively, correlations between predicted values and the behavioral data are near 1.0 for each language (see Supplemental Figure S1 in our Supplemental Materials [Mollica & Piantadosi, 2017]), meaning that the model is able to capture the overall shape of acquisition across languages. More importantly, the model is able to more successfully predict learning than more standard alternatives: a probit (McMurray, 2007) and a logistic model. To test this, we divided the learning curve for each word into two halves, where we fit k, λ, and s for each word on the first half and then computed the correlation between model and human data across words and ages on the full curves. The Gamma distribution fit quantitatively outperforms either the probit or the logit across most languages (see Figure 4).

On the Order of 10 ELIs Are Needed to Learn a Word
The orders of magnitude of the estimated parameters are informative about the underlying mechanisms of learning, as they characterize when learning starts (s), how many ELIs are needed (k), and how frequently they occur (λ). Figure 5 shows the mean values of k, λ, and s for each language. The box plots for English further broken down based on MacArthur-Bates Communicative Development Inventory (MCDI) semantic category are similar (see Supplemental Figure S2 in our Supplemental Materials [Mollica & Piantadosi, 2017]).
Figures 5a and 5d show that, across languages, the order of magnitude of k is around 10 for production, with slightly lower values for comprehension. It is important to focus on the order of magnitude, not the exact numerical values, because the order of magnitude of our parameter estimates is robust to noise (see Appendix B of the Supplemental Materials). The important issues in language development can still be distinguished based on order of magnitude. We primarily interpret Figure 5 as showing that languages agree in order of magnitude of their estimates. Thus, children do not require hundreds or thousands of instances of a word to learn, even for words that may be very frequent, nor do they learn from a single instance. Instead, learning is likely focused around ten critically informative learning instances. These findings demonstrate the importance of cross-situational statistics over single examples and are consistent with the finding that children do not retain fast-mapped meanings (Horst & Samuelson, 2008).

ELIs of a Word Occur Roughly Every Two Months
The variable λ characterizes the estimated rate at which ELIs of a word occur. Figures 5b and 5e show that ELIs of a word occur once every two months (λ ≈ 0.5), indicating that ELIs are relatively infrequent for an individual word. However, because children learn many words simultaneously, ELIs of any word may in fact be quite frequent. For instance, if children track statistics on 1,000 early words, and observe an ELI for each word on average once every two months, they will receive around 17 ELIs per day.
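The figure of about 17 ELIs per day follows from simple arithmetic, sketched here. The 30-day month is an assumption made for a round figure, and the variable names are illustrative.

```python
n_words = 1000        # words tracked at once (the text's example)
lam = 0.5             # ELIs per word per month (once every two months)
days_per_month = 30   # assumption: a 30-day month, for a round figure

elis_per_day = n_words * lam / days_per_month
print(f"{elis_per_day:.1f} ELIs per day")  # prints "16.7 ELIs per day"
```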

Data Accumulation Starts Around Two Months
The start times in Figures 5c and 5f show that learning begins early: approximately by two months in the case of comprehension measures. The starting age is somewhat later when curves are fit to production measures, possibly because production may require motor and speech systems to be working before production can progress. This may indicate that although maturational factors play little role in learning as measured by comprehension, production depends on the development of other cognitive or motor systems.

Early Word Learning Is Primarily Data-Driven
The model assumes that AoA is the sum of two time periods: start time s and accumulation time t. There are two measures we derive from these parameters to quantify the extent to which early word learning is data-driven: the percent of total AoA time spent accumulating data, and the percent of variance in AoA explained by variance in accumulation times. If early word learning is primarily constrained by maturation, the majority of acquisition time should not be spent accumulating data and the majority of the variance in acquisition times should be explained by the variance in start times s. On the other hand, a data-driven account of early word learning would expect the majority of acquisition time to be spent accumulating data and the majority of the variance in acquisition times to be explained by variance in accumulation times t. Figure 6 shows the proportion of total acquisition time and the variance in acquisition times that is due to t (accumulating data) rather than s (start times). We find that generally the majority of acquisition time is spent accumulating data and the variance in accumulation times explains the majority of the variance in acquisition times. Taken together, this indicates that data-driven factors are the primary drivers of early word learning.
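As a rough illustration of these two measures, the sketch below computes them for three hypothetical words. The parameter values are invented, not taken from the paper. The variance measure is implemented here as the squared Pearson correlation between expected accumulation time and expected AoA across words, which is only one plausible reading of "variance explained"; the paper's exact computation is described in its Methods section and may differ.

```python
import math

# Hypothetical posterior means for three words: (k, lam, s). Not values from the paper.
words = {
    "word_1": (8.0, 0.5, 2.0),
    "word_2": (10.0, 0.5, 3.0),
    "word_3": (12.0, 0.4, 1.0),
}

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

accum = [k / lam for k, lam, s in words.values()]    # expected accumulation time t
aoa = [s + k / lam for k, lam, s in words.values()]  # expected AoA = s + t

time_share = sum(accum) / sum(aoa)      # share of AoA time spent accumulating data
var_share = pearson_r(accum, aoa) ** 2  # share of AoA variance tracked by t (assumed measure)
print(f"time share = {time_share:.2f}, variance share = {var_share:.2f}")
```

For these invented values both shares are above 0.9, the pattern a data-driven account would predict; a maturation-driven account would instead predict low shares for both.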

Learning Instances Are Weakly Correlated With Log Frequency
Under a simple view that most usages of a word are informative about its meaning, our estimates of k and λ should be surprising; word frequencies vary over several orders of magnitude (Zipf, 1949), yet the inferred k and λ values do not. This means that ELIs cannot be very strongly correlated with frequency. Most of the time a frequent word is used, it is not an ELI. One possibility is that a single ELI for a word like tiger might be an entire visit to the zoo.
To investigate the relationship further, we computed the correlation between the estimated k, λ, and s values for each word in English and the log frequency as measured in CHILDES (MacWhinney, 2000). For comprehension, there is only a small correlation between the estimated k parameter and frequency (k: r = −.14, p = .01). For production, there is a modest correlation (k: r = .19, p < .001; λ: r = .32, p < .001; s: r = −.22, p < .001) as observed by Hidaka (2013). But what is notable is the weakness of the correlation (see Figure 7): it is not as though doubling the quantity of input will double the number of ELIs. This finding is compatible with findings of frequency effects in word learning (Ambridge, Kidd, Rowland, & Theakston, 2015; Hoff, 2003; Huttenlocher et al., 1991; Shneidman et al., 2013; Weisleder & Fernald, 2013), but suggests that a word's raw frequency will be less important than the frequency of its ELIs (see also Hoff, 2003).

Figure 7. Correlation between CHILDES frequency for words in English and estimated parameter values. Top row: For comprehension, there is a small correlation between frequency and k and no correlation between frequency and λ and frequency and s. Bottom row: For production, the correlations between frequency and k, frequency and λ, and frequency and s are very weak and only significant when frequency is log transformed.

DISCUSSION
We view the Gamma model not as a mechanistic learning account, but instead as a scientific tool for understanding the basic forces in early language acquisition. Unlike characterizations in terms of mean acquisition ages, the parameters s, k, and λ are psychologically meaningful in terms of a causal process that likely supports part of word learning: data accumulation (Hidaka, 2013). Our analysis of empirical learning curves strongly suggests that data accumulation begins very early, that production may be delayed due to maturational factors, and that typical words take on the order of ∼10 ELIs to learn, not hundreds of occurrences and not a single occurrence or two. The model also suggests that the informative data points for word learning occur relatively infrequently, about once every two months, and that these occurrences are not strongly related to a word's overall frequency. Moreover, the mechanisms of data accumulation not only provide the best quantitative fit to learning curves, they explain nearly all of the variance in when children learn a word.
This analysis has capitalized on the existence of large corpora of acquisition trajectories across children. In particular, the key variables of interest (data amounts, data rates, and the time at which data are first considered) are discovered entirely from children's acquisition trajectories, not from recordings of children's input. While it may seem tempting to address these questions of acquisition with an intensive home recording study (Roy et al., 2006) or an evaluation of child-parent interactions (MacWhinney, 2000), these approaches come with the challenge of delineating which instances of a word concretely contributed to learning. For example, a word use might only aid acquisition if the child is attentive and receptive, and the referent is clear, which might not be observable in those datasets. Given that we have found that overall frequency is a weak predictor of the rate of ELIs, the detailed measurement of just parental productions will not fully clarify the relevant data sources for learning. Instead, our work takes a different tack, looking to find evidence of data-driven effects writ large in the distribution of learning times for words.
This work leaves open a central question: what makes a usage of a word an ELI? The weak correlation between the parameters and word frequency suggests that ELIs are rare, and perhaps even intentional. It is likely that children actively decide what stimuli they engage and deeply process (Kidd, Piantadosi, & Aslin, 2012, 2014), which could place an internal yoke on the rate of ELIs. Extrinsic factors probably also play a role though, as seen by the correlations with frequency. Analogously, these analyses raise the question of what determines differences in k and λ across words and languages. Future research should attempt to characterize the impact of external factors, such as semantic content (Jones, Johns, & Recchia, 2012) and phonotactic probability (Storkel, 2001), on k and λ. Our framework provides an initial step toward connecting such factors to the data accumulation process that implicitly supports all existing models of word learning.
It is also important to note the limitations of the MCDI data and our model. First, we restrict all of our conclusions to the early learned words covered by the MCDI. It will be important to extend this model beyond the age range of the existing MCDI. Children are flexible learners and it is probable that an older child adopts a variety of strategies, which may influence

Figure 1. Example acquisition ages under three example assumptions: (a) children receive learning instances once a month from birth and require 24 total, (b) children require 4 examples and receive one every 6 months on average, (c) children require 12 instances, coming once every month, but only begin accumulating data at 12 months. Each predicts the same mean of 24 months (dotted line), but different shapes and variances in the timing of acquisition.

Figure 2. Graphical model notation for our model. Nodes denote variables of interest. Shaded nodes are observed variables. Plates denote groups of variables over age (A) and words (W). In the text, we provide equations for a single word and omit the subscript w.

Figure 3. Points show the proportion of English-speaking children (y-axis) who know a word at each age (x-axis) as measured by comprehension (blue) and production (green). Lines show the posterior mean parameters in the model (2), and X and O show the posterior mean start time of data accumulation for each word. This generally shows good model fits, early start times for comprehension, and somewhat later times for production.

Figure 4. Model comparison of the Logit, Probit, and Gamma models when trained on the first half of comprehension and production learning curves and tested on the full trajectory.Across words and languages, the correlations between observed data and model predictions for the full curve are close to 1 with the Gamma model showing the best fit.

Figure 5. Box plots of the distribution of k, λ, and s across words in each language.

Figure 6. The bar plot shows the percent of the variance in age of acquisition (AoA) times explained by accumulation time (suggesting data-driven learning). The triangular points show the percent of AoA time spent accumulating data. Error bars and point ranges represent bootstrapped 95% confidence intervals. Outliers (< 2.5% of the data) were removed for this analysis (see Methods section).