Partial Truths: Adults Choose to Mention Agents and Patients in Proportion to Informativity, Even If It Doesn’t Fully Disambiguate the Message

How do we decide what to say to ensure our meanings will be understood? The Rational Speech Act model (RSA; Frank & Goodman, 2012) asserts that speakers plan what to say by comparing the informativity of words in a particular context. We present the first example of an RSA model of sentence-level (who-did-what-to-whom) meanings. In these contexts, the set of possible messages must be abstracted from entities in common ground (people and objects) to possible events (Jane eats the apple, Marco peels the banana), with each word contributing unique semantic content. How do speakers accomplish the transformation from context to compositional, informative messages? In a communication game, participants described transitive events (e.g., Jane pets the dog), with only two words, in contexts where two words either were or were not enough to uniquely identify an event. Adults chose utterances matching the predictions of the RSA even when there was no possible fully “successful” utterance. Thus we show that adults’ communicative behavior can be described by a model that accommodates informativity in context, beyond the set of possible entities in common ground. This study provides the first evidence that adults’ language production is affected, at the level of argument structure, by the graded informativity of possible utterances in context, and suggests that full-blown natural speech may result from speakers who model and adapt to the listener’s needs.


INTRODUCTION
Communication requires continually making decisions about what information to include and exclude. It is not always necessary to fully describe an event: If someone asks, What are you doing? then I'm eating might be sufficient, and possibly preferable to longer alternatives like I'm eating a sandwich or I'm eating a grilled cheese sandwich. For a speaker to successfully communicate with a listener in this way, the two need to implicitly agree on some shared principles of communication. Grice (1975) codified these conversational assumptions as a series of "maxims," including the maxims of Quantity ("give as much information as is needed, but no more") and Relevance ("say something that furthers the goal of the conversation"). Thus a speaker can refer to a sandwich alone if the alternative is a salad, but should refer to a grilled cheese sandwich if the alternative is peanut butter and jelly.
As listeners, adults understand language in part by using statistical information to predict upcoming words and structures (Altmann & Kamide, 1999; Levy, 2008; MacDonald, 2013; MacDonald, Pearlmutter, & Seidenberg, 1994; Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995; cf. Kuperberg & Jaeger, 2016, for a recent review). How does this predicting listener operate? Listeners could simply expect similar language to occur in similar contexts, without regard to the speaker's motives. But more specifically, they could expect speakers to behave predictably because they expect them to behave helpfully. A recent formalization of this latter hypothesis is the Rational Speech Act model (RSA), which is based around a cooperative speaker-listener pair (Frank & Goodman, 2012). Speakers attempt to maximize the information transferred to the listener, and listeners succeed by assuming that the speaker is doing this. Rational Speech Act models successfully predict a variety of phenomena in pragmatics including the interpretation of scalar implicatures, hyperbole, and metaphor (Goodman & Frank, 2016; Goodman & Stuhlmüller, 2013; Kao, Levy, & Goodman, 2013; Kao, Wu, Bergen, & Goodman, 2014). Are listeners warranted in making these generous assumptions about speakers? Many features of language production seem to be shaped to improve the chances of successful communication. Formal approaches based on information theory (Shannon, 1949) have been used to successfully explain reduction and omission phenomena in natural language production including phonological reduction, lexical choice (e.g., math/mathematics), and inclusion of optional arguments (Aylett & Turk, 2004; Jaeger, 2010; Mahowald, Fedorenko, Piantadosi, & Gibson, 2013; Resnik, 1996; van Son & van Santen, 2005; though see Keysar, Barr, & Horton, 1998).
If production is driven by the value to the listener rather than the costs to the speaker, then the speaker should flexibly adapt when the (linguistic or nonlinguistic) context changes. For the specific case of referring expressions (that, that big sandwich), there is a large body of work showing that speakers' choices are related to available nonlinguistic information (e.g., Brennan & Clark, 1996; Brown-Schmidt & Tanenhaus, 2008; Nadig & Sedivy, 2002; Pogue, Kurumada, & Tanenhaus, 2016; Sedivy, 2003). This is taken as evidence of an awareness of listeners' needs because the language production cost of that big sandwich is presumably the same across contexts, while the benefit to the listener is considerable when there are many sandwiches, but null if the listener can already pick out the lone sandwich. Speakers do this even when a listener would need to make inferences about a speaker's intention to succeed: In a context with a blue circle as the target and a blue square and a green square as distractors, adults limited to a single word produce CIRCLE to identify the target object, not BLUE. Although BLUE is a good description of the target in isolation, it could also refer to the blue square (Frank & Goodman, 2012).
But human language goes beyond referring expressions for objects: Sentences express entire propositions about the world (Ben is eating my grilled cheese sandwich). Deriving the set of possible propositions (not just possible object referents) would seem to require an extensive understanding of both world knowledge and the ways that conversations tend to unfold (cf. Ginzburg, 1996). Even once a particular proposition has been chosen, we have many choices about how to encode it in a sentence. We make choices about argument structure and verb identity (he ate it/he put it in his mouth), and language provides many ways to omit or limit how much we say in conveying a proposition, including pronouns (Ben/he ate the sandwich), ellipsis (Ben ate the sandwich, and then a cookie), passive constructions (The sandwich was eaten), and optional arguments (Ben ate the sandwich [with a fork and knife]). These options are used pragmatically: Speakers tend to (a) omit or reduce information the listener can retrieve from linguistic context, (b) converge with dialog partners on syntactic alternations, and (c) include optional material when listeners might otherwise go astray (Brennan & Hanna, 2009; Galati & Brennan, 2010; Horton, 2005; Kurumada & Jaeger, 2015; Pickering & Garrod, 2004).
Relatively little attention has been paid to how speakers use nonlinguistic information to produce informative sentences. Do we attempt to communicate sentence-level meaning using something like the rational speaker model, tailoring what we say to the surrounding context? At least one study suggests this may be the case: Lockridge and Brennan (2002) had participants describe scenes with either typical or atypical instruments (He stabbed him with a knife/an icepick) to a naïve listener. In an unconstrained storytelling task, speakers were more likely to mention atypical than typical instruments, especially when the listener could not see the event. However, understanding event descriptions is challenging exactly because events are transitory: They don't "stick around" in the context like objects do, and references to events often occur when the event itself is in the past or future (Gleitman, 1990). Thus, while this study suggests speakers are sensitive to how world knowledge impacts linguistic informativity, it does not address the fit between production and particular nonlinguistic contexts: The contrast in that study is between not seeing the event (the usual scenario) and seeing the event as it is being described (which listeners usually cannot do).
To begin understanding how speakers use nonlinguistic context to decide what to say about an event, we focus on a single class of basic propositions: transitive sentences like John feeds the dog. While this construction is used for many classes of verbs, we consider prototypical cases in which the basic meaning involves an agent performing some action on a patient. Even assuming the speaker is referring to an event that might occur (or has recently occurred) in the immediate context, the set of possible messages is potentially infinite (Quine, 1960). Here, we leave aside the question of word-to-concept mapping and focus on the question of possible events given a set of possible participants. A speaker trying to design an informative event utterance must consider not only the possible verbs, but also what referent could correspond to each argument position. We can represent the number of possible events as the product of the possible verbs, agents, and patients:
$$|E| = |V| \times |A| \times |P| \quad (1)$$
where $E$ is the set of possible events and $V$, $A$, and $P$ are the sets of possible verbs, agents, and patients. We use this logic to create "toy" worlds in which there are always exactly seven entities (people and objects), and the messages to be communicated are interactions between these entities (e.g., John feeds the dog).
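The counting argument above can be made concrete with a short enumeration. The following is only an illustrative sketch: the names, the animals, and the three-verb set are hypothetical stand-ins rather than the study's stimuli, and `possible_events` is a helper invented for this example.

```python
from itertools import product

def possible_events(agents, verbs, patients):
    """Return every (agent, verb, patient) triple the context allows."""
    return list(product(agents, verbs, patients))

# Hypothetical seven-entity context: five people and two animals.
agents = ["John", "Mary", "Sue", "Bill", "Ann"]
patients = ["dog", "duck"]
# Assumed verb set; the text does not fix which verbs a speaker considers.
verbs = ["feed", "pet", "wash"]

events = possible_events(agents, verbs, patients)
print(len(events))  # 5 agents * 3 verbs * 2 patients = 30
```

Even in this small world the message space (30 events) is much larger than the number of visible entities (7), which is the gap between object reference and event description that the toy worlds are meant to expose.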
In natural speech, both agents and patients can sometimes be omitted from transitive descriptions. Many transitive verbs can be used intransitively, for example, We'll eat in the kitchen, and many languages also allow noun phrases in subject position to be omitted relatively freely (e.g., in Spanish Comió bocadillos, [He] ate sandwiches). In English, these kinds of subject omissions require specific discourse context (e.g., a command, Don't eat in the kitchen). We therefore use a production task that restricts the producer to exactly two words, forcing participants to make the choice to omit at least one element (agent, patient, or verb). In most object reference studies (cf. Brennan & Clark, 1996; Brown-Schmidt & Tanenhaus, 2008; Nadig & Sedivy, 2002; Pogue et al., 2016; Sedivy, 2003), a noun phrase like my sandwich or my grilled-cheese sandwich is assumed to be informative when it uniquely identifies one out of several referents in the context, underinformative if it could apply to more than one object (two such sandwiches), and overinformative if it includes additional modifiers (my grilled cheese sandwich when there is only one sandwich). Rational Speech Act models assume a richer sense of "informativity" in which words are informative to the extent that they reduce the number of possible interpretations by any amount (Frank & Goodman, 2012). Thus, we can vary the informativity of these utterances by varying the possible events that might have occurred in the local context, specifically by manipulating the set of possible agents and patients. We can then ask whether speakers choose informative utterances, even in cases where a listener would be unable to identify the entire event meaning. Figure 1 shows a possible event (JOHN FEEDS THE DOG) and six sets of entities that could participate in the event to be named.
Each context set is made up of people (canonical agents) and either animals or inanimate objects (both of which are more likely than humans to appear as patients). Critically, we manipulate the communicative context (and therefore the informativity of potential utterances) by altering the set of seven entities that appear in the context picture. If the context is Figure 1a, the utterance FEED DOG fails to resolve the ambiguity (anyone could have done it); on the other hand, the utterance JOHN FEED specifies the agent and relies on an intelligent listener to identify the unique patient in context. For Figure 1f, the reverse is true: FEED DOG resolves the ambiguity. In the intermediate cases (Figures 1b-1e), there is no two-word utterance that can fully disambiguate the intended meaning: There are multiple options for both agent and patient, and the verb cannot be uniquely inferred from the context images.
Our critical hypothesis has to do with how people will behave in the four intermediate arrays. In these conditions, different words reduce ambiguity to different degrees: In Figure 1e, mentioning John (and the verb) narrows down the possible events to just two alternatives (he feeds the dog or duck) rather than five (somebody feeds the dog). If the RSA model extends to descriptions of argument structure relations, adults should still be able to select informative utterances: When there are more agents than patients, participants should be more likely to mention subjects, even if ambiguity between multiple messages remains. However, if participants use a simpler strategy of determining just whether or not a given utterance successfully conveys the intended event, then they should still choose informative arguments in the deterministic cases, but perform at chance (or otherwise not differentiate the intermediate conditions) when both arguments remain ambiguous.
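The Figure 1e reasoning (two remaining alternatives after JOHN FEED versus five after FEED DOG) can be checked by filtering an enumerated event set. This is a sketch under stated assumptions: the entity names other than John, dog, and duck are invented, the verb set is hypothetical, and `consistent_events` is a helper defined here for illustration.

```python
from itertools import product

ROLES = ("agent", "verb", "patient")

def consistent_events(events, mentioned):
    """Keep the events compatible with every mentioned role -> word pair."""
    return [
        e for e in events
        if all(e[ROLES.index(role)] == word for role, word in mentioned.items())
    ]

# Hypothetical [5:2] context in the spirit of Figure 1e.
agents = ["John", "Mary", "Sue", "Bill", "Ann"]
patients = ["dog", "duck"]
verbs = ["feed", "pet"]  # assumed verb set
events = list(product(agents, verbs, patients))

after_john_feed = consistent_events(events, {"agent": "John", "verb": "feed"})
after_feed_dog = consistent_events(events, {"verb": "feed", "patient": "dog"})
print(len(after_john_feed))  # 2: John feeds the dog or the duck
print(len(after_feed_dog))   # 5: any of the five people feeds the dog
```

Neither utterance leaves a single event, yet one leaves fewer candidates than the other; a graded notion of informativity predicts a preference here, while a pure succeed-or-fail criterion does not.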

Participants
Ninety-one English-speaking adults participated on Amazon's Mechanical Turk (AMT). Participants were screened to be located in the United States and self-reporting English as their first language (an additional 21 participants were excluded who did not meet these criteria). No other demographic information was collected. The task took approximately 13 minutes to complete and participants were paid $1.00. This pay rate was based on an anticipated study length of 10 minutes, following the 10¢/minute rule of thumb used for AMT studies in the lab at the time these data were collected. All participants gave informed consent in accordance with the requirements of the Massachusetts Institute of Technology's institutional review board.

Stimuli
We created cartoon stimulus sets for each of 12 verbs (eat, feed, hold, drink, kick, drop, wash, pour, throw, touch, read, and roll). Each set consisted of an action picture and six "context" pictures showing possible agents and patients who might participate in the event. The people were generated using a character-creation website (Brooks et al., 2007) with distinct features and names on their shirts. The objects were chosen from a category (e.g., various foods) appropriate for each verb. The total number of agents and patients in each context sums to 7, yielding six variations (i.e., [6:1] to [1:6]) for each of the 12 stimulus sets. All stimuli, code, and analyses are available in the Supplemental Materials for this article (Kline, Schulz, & Gibson, 2017).

OPEN MIND: Discoveries in Cognitive Science
Partial Truths Kline, Schulz, Gibson

Procedure
Stimuli were presented using Python and the EconWillow package (Weel, 2008), accessed through AMT. Participants were told that they were providing descriptions for another (sham) participant. Participants saw the trials in a random order, with two items presented at each context type. On each trial, they saw the context picture for ten seconds, read a sentence describing the action they would see (e.g., "John feeds the dog"), and then saw the action picture for ten seconds. Finally, the context picture reappeared and participants were given two separate text boxes to enter their description; if they entered more than two words (screened by checking for spaces, e.g., "baby rolls" in one box), they were told to try again. To encourage participants to answer quickly, their response speed in seconds was shown after every trial.
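The two-word screening step can be illustrated with a small check. This is a hypothetical sketch, not the experiment's code: the function name is invented, and rejecting empty boxes is an added assumption (the description above mentions only the check for spaces).

```python
def is_valid_response(box1: str, box2: str) -> bool:
    """Accept a response only if each text box holds exactly one word.

    A box fails if it is empty (an assumption added here) or if it still
    contains whitespace after trimming, e.g. "baby rolls" typed in one box.
    """
    for text in (box1, box2):
        word = text.strip()
        if not word or any(ch.isspace() for ch in word):
            return False
    return True

print(is_valid_response("John", "feed"))        # True
print(is_valid_response("baby rolls", "ball"))  # False
```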

Data Coding
A total of 1,092 responses were collected from the 91 participants, 182 responses in each condition. Responses were first checked for minor variations such as capitalization and verb form (e.g., "Eaten" was coded as "eat"). The majority of these responses (84%) consisted of two of the possible three content words in the sentence (e.g., JOHN FEED, FEED DOG, or JOHN DOG). In the remaining responses, participants deviated from these exact lexical items; in these cases we checked if the word used could refer to a unique entity (e.g., she in an array with a female agent among only male distractors). A full record of this coding is available in the Supplemental Materials (Kline, Schulz, & Gibson, 2017); just 20 responses (1.8%) consisted of two unclassified words and thus were excluded from analyses. Because not every response contained two codable words, we present analyses below for the presence of agents, patients, and verbs in each response.
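A first automatic pass over the responses might look like the sketch below. It is illustrative only: `code_response` and the list of verb forms are hypothetical, and it covers just the normalization of capitalization and verb form. The case-by-case judgments described above (such as deciding whether she picks out a unique entity) would still need manual coding.

```python
def code_response(words, agent, verb_forms, patient):
    """Return which of A, V, P a response mentions, e.g. "AV", "VP", or "AP".

    words      -- the strings typed by the participant
    agent      -- the agent label for this trial
    verb_forms -- lowercase inflections counted as the verb (assumed list)
    patient    -- the patient label for this trial
    Words that match none of the three are ignored.
    """
    mentioned = set()
    for raw in words:
        word = raw.strip().lower()
        if word == agent.lower():
            mentioned.add("A")
        elif word in verb_forms:
            mentioned.add("V")
        elif word == patient.lower():
            mentioned.add("P")
    return "".join(role for role in "AVP" if role in mentioned)

eat_forms = {"eat", "eats", "ate", "eaten", "eating"}
print(code_response(["Eaten", "apple"], "Jane", eat_forms, "apple"))  # VP
print(code_response(["jane", "Apple"], "Jane", eat_forms, "apple"))   # AP
```

Returning the set of mentioned elements, rather than forcing every response into one of three bins, matches the choice to analyze the presence of agents, patients, and verbs separately.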

RESULTS
We code the main effect of interest numerically, representing the key condition of context type in the model as the number of potential agents in the context image (recall that the number of agents and patients in these context images are inversely related, always summing to seven total). The effect of the number of agents vs. patients on whether participants mentioned the agent in their response was highly significant by a mixed-effect logistic regression with random slopes and intercepts for both item and participant (β = 0.55, SE = 0.15, Z = 3.79, p < .001; LRT: χ² = 10.7, df = 8, p < .005). The same was true for patients (β = −0.55, SE = 0.13, Z = −4.38, p < .001; LRT: χ² = 17.2, df = 8, p < .001). These patterns are as predicted: As more agent distractors (and thus fewer patient distractors) were present, participants were more likely to mention the agent and less likely to mention the patient (Figure 2). We also found that participants overall were somewhat more likely to mention patients than agents: On the subset of trials (74%) where participants mentioned only one of the two, there were significantly more patients than agents (p < .001, binomial test).
To test whether participants gave graded responses to the intermediate arrays (e.g., [2:5]), we also examined the effects of array type after removing trials for which a "deterministic" answer could be given ([6:1], [1:6]). The effects of array type on agent mention and on patient mention were both significant when evaluating only these intermediate cases.

MODEL COMPARISONS
To evaluate how human performance might reflect pragmatic choices, we compared three computational models (with two additional variations shown in the Supplemental Materials [Kline, Schulz, & Gibson, 2017]). Each of these models generates (unordered) two-word utterances ("AV" for agent and verb, "VP" for verb and patient, or "AP" for agent and patient) at each of the conditions in the experiment; we compare model predictions to participants' responses of these types (omitting the ∼15% of responses that included some other word). Below, we describe the common assumptions the models share, define the particulars of each model, and then compare them to human performance.
In all models, the shared context is the set of possible events that might occur given the set of agents, patients, and plausible verbs (we assume the prior probability of picking any particular event e in E is uniform). We assume that each object/person in the scene is classified unambiguously as an agent or patient (wrong in general, but true in our experimental context). For the verbs, we assume participants are considering some set of possible interactions between the agents and patients (e.g., petting, feeding). In principle, the notion of "possible verb set" could be estimated empirically by asking naïve participants to list possible actions between the agent-patient stimulus sets directly. Here, we simply assume the set is relatively small and does not vary with the number of agents and patients. Thus, the shared context of possible events E is
$$E = \{(a_1, v_1, p_1), (a_1, v_1, p_2), (a_1, v_2, p_1), \ldots\} \quad (2)$$
with the number of possible events e in E calculated as follows:
$$|E| = |A| \times |V| \times |P| \quad (3)$$
where $A$, $V$, and $P$ are the sets of possible agents, verbs, and patients.
Next, we consider a set of possible descriptions (D) that could be used for some target event e. Each of these descriptions might also apply to other events in the possible set; the number of events some description d can refer to is notated as |d|. We assume that there is a single, unambiguous label available for each agent, patient, and verb. This means that for a single-word description, |d| can be defined easily: For instance, the word referring to the agent can refer to any of the events in E that include that agent plus some patient and verb:
$$|d_{A}| = |V| \times |P| \quad (4)$$
For these models, we consider the set of utterances D = ["AV," "VP," "AP"] (i.e., two words, produced in any order).
A two-word description like "AV" can refer to any event in E that contains that agent, that verb, and then some patient:
$$|d_{AV}| = |P| \quad (5)$$
All of the following models use these same assumptions about the relationship between a particular context (containing possible agents, patients, and verbs) and the number of events a particular utterance could refer to; they differ in how utterances are produced given this information. The outputs of the models are shown in Figure 3.
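Because each description leaves exactly the unmentioned elements free, |d| has a simple closed form for every one- and two-word description. The sketch below states those counts and checks them against a brute-force enumeration; the function names and the choice of three verbs are assumptions made for illustration.

```python
from itertools import product

def extension_sizes(n_agents, n_verbs, n_patients):
    """Closed-form |d| for each description of one target event."""
    return {
        "A": n_verbs * n_patients,   # agent fixed; verb and patient free
        "V": n_agents * n_patients,
        "P": n_agents * n_verbs,
        "AV": n_patients,            # only the patient is left open
        "VP": n_agents,
        "AP": n_verbs,
    }

def brute_force_size(description, n_agents, n_verbs, n_patients):
    """Count events matching the target (index 0 in each set) on every mentioned role."""
    roles = {"A": 0, "V": 1, "P": 2}
    events = product(range(n_agents), range(n_verbs), range(n_patients))
    return sum(all(e[roles[r]] == 0 for r in description) for e in events)

sizes = extension_sizes(5, 3, 2)  # hypothetical [5:2] context with 3 verbs
print(sizes["AV"], sizes["VP"], sizes["AP"])  # 2 5 3
```

Note that only |AP| depends on the size of the verb set, so the assumed number of verbs affects how attractive the verbless utterance looks but not the ordering of "AV" and "VP".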

Nonpragmatic "Cost only" model ($p_{NP}$)
We begin with a baseline model that does not take any aspect of the context in which an utterance was produced into account. In our dataset, we found an overall difference in the frequency with which certain utterances are produced (in particular, "AV" sequences are less frequent than "VP" sequences), averaging across contexts. We can consider this as reflecting a differential cost to the speaker of producing each of these utterances. The corresponding model just produces utterances at this base rate ($p_{human}$), with no effects of the context in which the utterance is produced. Because the human participants sometimes produced a word that did not refer to the agent, verb, or patient, we estimate these probabilities for the model by renormalizing over only the "standard" responses (~84% of all responses). The probability of producing each two-word description d for this model is simply this global likelihood:
$$p_{NP}(d) = p_{human}(d) \quad (6)$$
To avoid overfitting, we evaluate this model by randomly splitting the human data in half to calculate these parameters from the data, and evaluating each set of predictions against the other half of the human dataset.
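The base-rate estimate and the split-half procedure can be sketched as follows. Everything here is hypothetical: the response list is toy data invented for the example, not the study's data, and `base_rates` and `split_half` are names introduced for illustration.

```python
import random
from collections import Counter

UTTERANCES = ("AV", "VP", "AP")

def base_rates(responses):
    """Proportion of each standard utterance, renormalized over standard responses only."""
    counts = Counter(r for r in responses if r in UTTERANCES)
    total = sum(counts.values())
    return {u: counts[u] / total for u in UTTERANCES}

def split_half(responses, seed=0):
    """Randomly split the responses into two halves (seeded for repeatability)."""
    shuffled = list(responses)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Toy coded responses; "other" marks a response containing a nonstandard word.
toy = ["VP"] * 6 + ["AV"] * 3 + ["AP"] * 1 + ["other"] * 2
print(base_rates(toy))  # {'AV': 0.3, 'VP': 0.6, 'AP': 0.1}

train, held_out = split_half(toy)
predicted = base_rates(train)    # parameters estimated on one half
observed = base_rates(held_out)  # compared against the other half
```

Since the prediction is the same in every context, this baseline can capture an overall preference for one utterance type but not any change in that preference across arrays.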

Pragmatic "succeed/fail" heuristic ($p_{SF}$)
Many common-ground-type experiments (e.g., Sedivy, 2003) tacitly assume that an utterance is pragmatically helpful when it uniquely identifies the target referent. For this model, we assume the verb is always mentioned because (unlike the possible agents and patients) there is no direct information about the verb present in the context image, and therefore this word will always be highly informative. We thus assume that the possible utterances are simply (unordered) "AV" or "VP," with the probabilities of producing the two utterances summing to one. The probability of each utterance is given by:
$$p_{SF}(d) = \frac{s(d)}{\sum_{d' \in \{AV, VP\}} s(d')}, \qquad s(d) = \begin{cases} 1 & \text{if } |d| = 1 \\ \varepsilon & \text{otherwise} \end{cases} \quad (7)$$

That is, out of the available descriptions, the model will consider only the ones that succeed in uniquely identifying the target event. The ε symbol represents a small number arbitrarily close to zero, and indicates that this model is extremely unlikely to choose the uninformative utterance if any informative ones are available. As shown in Equations 8-9, its exact value does not impact the quantitative predictions the model makes. If there is a single informative choice (|d| = 1), the model will select it approximately deterministically:
$$p_{SF}(d) = \frac{1}{1 + \varepsilon} \approx 1 \quad (8)$$

But if neither utterance is fully informative (that is, if |d| > 1), the two utterances are produced at chance:
$$p_{SF}(d) = \frac{\varepsilon}{\varepsilon + \varepsilon} = \frac{1}{2} \quad (9)$$
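The succeed/fail heuristic is small enough to write out directly. The sketch below is one possible implementation under the assumptions stated in this section (the verb is always mentioned, so only "AV" and "VP" compete); the function name and the particular value of `eps` are choices made for illustration.

```python
def succeed_fail_speaker(n_agents, n_patients, eps=1e-9):
    """Utterance probabilities under the succeed/fail heuristic.

    "AV" leaves n_patients candidate events and "VP" leaves n_agents.
    An utterance scores 1 if it leaves exactly one event, eps otherwise.
    """
    extension = {"AV": n_patients, "VP": n_agents}
    score = {d: (1.0 if size == 1 else eps) for d, size in extension.items()}
    total = sum(score.values())
    return {d: s / total for d, s in score.items()}

print(succeed_fail_speaker(3, 4))  # {'AV': 0.5, 'VP': 0.5}
p = succeed_fail_speaker(6, 1)     # [6:1]: only "AV" pins down the event
print(round(p["AV"], 6))           # 1.0
```

All four intermediate arrays ([5:2] through [2:5]) receive the same 50/50 prediction, which is the property that separates this heuristic from a graded account.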

Rational speaker ($p_{RS}$)
We implement Frank and Goodman's RSA model, which states that a description d will be chosen in inverse proportion to how many events that description can apply to. Thus the probability for each of the three possible two-word descriptions is:
$$p_{RS}(d) = \frac{|d|^{-1}}{\sum_{d' \in D} |d'|^{-1}} \quad (10)$$
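A speaker who chooses each description in inverse proportion to |d| can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the function name is invented, and the verb-set size of 3 is a placeholder for the "relatively small" set assumed above.

```python
def rational_speaker(n_agents, n_verbs, n_patients):
    """Probability of each two-word description, proportional to 1 / |d|."""
    extension = {"AV": n_patients, "VP": n_agents, "AP": n_verbs}
    weight = {d: 1.0 / size for d, size in extension.items()}
    total = sum(weight.values())
    return {d: w / total for d, w in weight.items()}

# Agent-verb probability across the six arrays, with an assumed 3-verb set.
for n_agents in range(6, 0, -1):
    probs = rational_speaker(n_agents, 3, 7 - n_agents)
    print(f"[{n_agents}:{7 - n_agents}] AV = {probs['AV']:.3f}")
```

Unlike the succeed/fail heuristic, the probability of "AV" changes at every step from [6:1] to [1:6], so the intermediate arrays are predicted to differ from one another.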

Model comparison
To facilitate comparison with the human results (Figure 3), we plot the probabilities that a word for a particular element (A, V, P) is included in the utterances generated by each model. The "nonpragmatic" model that considers only base rate performs relatively poorly at matching human performance, r(36) = .63 (this and all model comparison p values are < .0001; we randomly divided the human data into two halves to avoid overfitting to parameters estimated from the data). The succeed/fail model is somewhat better, r(36) = .75, and the rational speaker model better still, r(36) = .81. In the Supplemental Materials (Kline, Schulz, & Gibson, 2017), we compare versions of the latter two models that also incorporate information about the base rate of each word; again, the rational speaker model is a closer fit to human performance than the equivalent succeed/fail model.
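Two steps in this comparison lend themselves to a short sketch: turning utterance probabilities into per-element mention probabilities, and correlating model values with human values. The helper names and the example numbers below are hypothetical; the vectors shown are not the study's data.

```python
from math import sqrt

def element_probabilities(utterance_probs):
    """P(element is mentioned): sum over the utterances that contain it."""
    return {
        el: sum(p for utt, p in utterance_probs.items() if el in utt)
        for el in "AVP"
    }

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical utterance distribution for one condition.
mentions = element_probabilities({"AV": 0.5, "VP": 0.3, "AP": 0.2})
# A is in "AV" and "AP", V is in "AV" and "VP", P is in "VP" and "AP".

# Hypothetical model vs. human mention rates across a few conditions.
model = [0.9, 0.7, 0.5, 0.3]
human = [0.8, 0.7, 0.4, 0.35]
r = pearson_r(model, human)
```

Because every utterance contains two of the three elements, the three mention probabilities always sum to 2, so they are not independent observations when interpreting the correlation.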

DISCUSSION
As predicted by the RSA, when participants described events after seeing arrays of possible agents and patients, their two-word answers reflected the degree to which a given word could convey new information about the event. Participants were more likely to mention the agent of the event when the agent was more ambiguous, and more likely to mention the patient when the patient was more ambiguous. This was not limited to cases where an event could be uniquely identified: Even for the intermediate cases where there were multiple agents and multiple patients in the array, participants still chose the two-word sequence that reduced uncertainty the most. Quantitative comparison to the RSA reveals a close fit to human data, with a baseline-adjusted version of the RSA performing best.
While understanding language appears to involve assuming that we are listening to rational speakers, our own speech also involves messy, sometimes under- or overinformative utterances. Nevertheless, we mainly succeed in getting our meanings across, and it is clear that at least some aspects of adult speech are well designed for robustly transferring information. While there is a rich literature on how speakers accommodate nonlinguistic context when describing individual objects (cf. Brennan & Clark, 1996; Brown-Schmidt & Tanenhaus, 2008; Nadig & Sedivy, 2002; Pogue et al., 2016; Sedivy, 2003), this study provides the first evidence that adults' language production is affected, at the level of argument structure, by the graded informativity of possible utterances in context. Although the two kinds of shortened sentences (Agent-Verb, e.g., GIRL READ, and Verb-Patient, e.g., READ BOOK) are on average equal in length and express the same amount of information, participants recognize that informativity depends on the set of possible alternative events. This holds even when either utterance will leave some ambiguity, suggesting that RSA-type listeners are correct: Their speaker partners are choosing what to say and what to omit in a way that can maximally reduce their uncertainty.
Understanding how listeners and speakers represent contexts and possible messages for verbs and events is a puzzling problem. In noun-referent studies, participants (listeners or speakers) need simply note how many possible referents there are and what features differ between them (e.g., Stiller, Goodman, & Frank, 2015). For sentence-level meanings, the set of possible messages is much larger than the number of visible referents. When there are three potential agents and four patients, there are 12 possible combinations, and there may often be multiple verbs under consideration. Beyond this, the listener might have to guess at likely events, as well as multiple ways of referring to that event: Beyond just relations between a girl and an apple, speakers and listeners must consider the many different propositions or perspectives that can be used to refer to the same event (e.g., a girl swinging a bat and hitting a ball toward the outfielder can describe the very same event; cf. Gleitman, 1990; Kline, Snedeker, & Schulz, 2017). These perspectives might differ in argument structure, so that a listener might need to consider multiple argument sets: an agent and patient, an agent, theme, and recipient, and so on. Furthermore, in the real world many referents, especially humans, can play many roles (e.g., agent and patient of hugging), and some possible referent pairs will permit different interactions due to either selectional restrictions or real-world knowledge. We may be able to use the current paradigm to address features of argument structure communication like these: If a speaker learns that wugging can be performed by animals but not people, will he or she take this information into account when designing utterances for a partner who does or doesn't know this restriction? How far do parallels between messages about object identity and propositions about the world (event descriptions) extend?
Which of the complexities of sentence-level predictability do speakers and listeners fold into their models of communicative context? Understanding the dynamics of utterance production in these contexts will further our understanding of how adults calculate and use informativity to accomplish our communicative goals.

ACKNOWLEDGMENTS
We would like to thank the members of the Schulz and Gibson labs for their helpful feedback; Audra Podany, Olivia Murton, and Dmetri Hayes for assistance in creating stimuli and data annotation; and all of the participating AMT workers for their involvement in the study. This work was funded by grants from the National Science Foundation to Edward Gibson and Melissa Kline.

AUTHOR CONTRIBUTIONS
MK, LS, and EG conceived of and planned the experiment. MK implemented and carried out the experiment and performed all analyses with input from LS and EG. MK planned and carried out the computational modeling, with EG providing feedback on implementation and interpretation of the models. MK took the lead in writing the manuscript, and all authors provided critical feedback and input to the interpretation of the results and revision of the manuscript.