Ordinal Common-sense Inference

Humans have the capacity to draw common-sense inferences from natural language: various things that are likely but not certain to hold based on established discourse, and are rarely stated explicitly. We propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses on the subjective likelihood of an inference holding in a given context. We describe a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task. We train a neural sequence-to-sequence model on this dataset, which we use to score and generate possible inferences. Further, we annotate subsets of previously established datasets via our ordinal annotation protocol in order to then analyze the distinctions between these and what we have constructed.


Introduction
We use words to talk about the world. Therefore, to understand what words mean, we must have a prior explication of how we view the world. - Hobbs (1987)

Researchers in Artificial Intelligence and (Computational) Linguistics have long cited the requirement of common-sense knowledge in language understanding. 1 This knowledge is viewed as a key component in filling in the gaps between the telegraphic style of natural language statements: we are able to convey considerable information in a relatively sparse channel, presumably owing to a partially shared model at the start of any discourse. 2

[Fig 1: example context-hypothesis pairs. Sam bought a new clock / The clock runs. Dave found an axe in his garage / A car is parked in the garage. Tom was accidentally shot by his teammate in the army / The teammate dies. Two friends were in a heated game of checkers / A person shoots the checkers. My friends and I decided to go swimming in the ocean / The ocean is carbonated.]

Common-sense inference (inference based on common-sense knowledge) is possibilistic: things everyone more or less would expect to hold in a given context, but without the necessary strength of logical entailment. 3 Because natural language corpora exhibit human reporting bias (Gordon and Van Durme, 2013), systems that derive knowledge exclusively from such corpora may be more accurately considered models of language, rather than of the world (Rudinger et al., 2015). Facts such as "A person walking into a room is very likely to be blinking and breathing" are usually unstated in text, so their real-world likelihoods do not align to language model probabilities. 4 We would like to have systems capable of, e.g., reading a sentence that describes a real-world situation and inferring how likely other statements about that situation are to hold true in the real world. This capability is subtly but crucially distinct from the ability to predict other sentences reported in the same text, as a language model may be trained to do.

2 McCarthy (1959): a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.
3 E.g., many of the bridging inferences of Clark (1975) make use of common-sense knowledge, such as the following example of "Probable part": I walked into the room. The windows looked out to the bay. To resolve the definite reference the windows, one needs to know that it is probable that rooms have windows.
We therefore propose a model of knowledge acquisition based on first deriving possibilistic statements from text: as the relative frequency of these statements suffers from the reporting bias mentioned above, we then follow up with human annotation of derived examples. Since we are initially uncertain about the real-world likelihood of the derived common-sense knowledge holding in any particular context, we pair it with various grounded contexts and present the pairs to humans for their own assessment. As these examples vary in assessed plausibility, we propose the task of ordinal common-sense inference, which embraces a wider set of natural conclusions arising from language comprehension (see Fig 1).
In what follows, we describe prior efforts in common-sense and textual inference (§2). We then state our position on how ordinal common-sense inference should be defined (§3), and detail our own framework for large-scale extraction and abstraction, along with a crowdsourcing protocol for assessment (§4). This includes a novel neural model for forward generation of textual inference statements. Together these methods are applied to contexts derived from various prior textual inference resources, resulting in the JHU Ordinal Common-sense Inference (JOCI) corpus, a large collection of diverse common-sense inference examples, judged to hold with varying levels of subjective likelihood (§5). We provide baseline results (§6) for prediction on the JOCI corpus. 5

4 For further background see discussions by Van Durme (2010), Gordon and Van Durme (2013), Rudinger et al. (2015) and Misra et al. (2016).
5 The JOCI corpus is released freely at: http://decomp.net/.

Background
Mining Common Sense: Building large collections of common-sense knowledge can be done manually via professionals (Hobbs and Navarretta, 1993), but at considerable cost in terms of time and expense (Miller, 1995; Lenat, 1995; Baker et al., 1998; Friedland et al., 2004). Efforts have pursued volunteers (Singh, 2002; Havasi et al., 2007) and games with a purpose (Chklovski, 2003), but are still left fully reliant on human labor. Many have pursued automating the process, such as in expanding lexical hierarchies (Hearst, 1992; Snow et al., 2006), constructing inference patterns (Lin and Pantel, 2001; Berant et al., 2011), reading reference materials (Richardson et al., 1998; Suchanek et al., 2007), mining search engine query logs (Paşca and Van Durme, 2007), and, most relevant here, abstracting from instance-level predications discovered in descriptive texts (Schubert, 2002; Liakata and Pulman, 2002; Clark et al., 2003; Banko and Etzioni, 2007). In this article we are concerned with knowledge mining for purposes of seeding a text generation process (constructing common-sense inference examples).
Common-sense Tasks: Many textual inference tasks have been designed to require some degree of common-sense knowledge, e.g., the Winograd Schema Challenge discussed by Levesque et al. (2011). The data for these tasks are either smaller, carefully constructed evaluation sets by professionals, following efforts like the FRACAS test suite (Cooper et al., 1996), or they rely on crowdsourced elicitation (Bowman et al., 2015). Crowdsourcing is scalable, but elicitation protocols can lead to biased responses unlikely to contain a wide range of possible common-sense inferences: humans can generally agree on the plausibility of a wide range of possible inference pairs, but they are not likely to generate them from an initial prompt. 6 The construction of SICK (Sentences Involving Compositional Knowledge) made use of existing paraphrastic sentence pairs (descriptions by different people of the same image), which were modified through a series of rule-based transformations and then judged by humans (Marelli et al., 2014). As with SICK, we rely on humans only for judging provided examples, rather than elicitation of text. Unlike SICK, our generation is based on a process targeted specifically at common sense (see §4.1.1).
Plausibility: Researchers in psycholinguistics have explored a notion of plausibility in human sentence processing, where, for instance, arguments to predicates are intuitively more or less "plausible" as fillers to different thematic roles, as reflected in human reading times. For example, McRae et al. (1998) looked at manipulations such as:
(a) The boss hired by the corporation was perfect for the job.
(b) The applicant hired by the corporation was perfect for the job.
where the plausibility of a boss being the agent (as compared to patient) of the predicate hired might be measured by looking at delays in reading time in the words following the predicate; this measurement is then contrasted with the timing observed in the same positions in (b). 7 Rather than measuring according to predictions such as human reading times, here we ask annotators explicitly to judge plausibility on a 5-point ordinal scale (see §3). Further, our effort might be described in this setting as conditional plausibility, 8 where plausibility judgments for a given sentence are expected to be dependent on preceding context. Further exploration of conditional plausibility is an interesting avenue of potential future work, perhaps through the measurement of human reading times when using prompts derived from our ordinal common-sense inference examples. Computational modeling of (unconditional) semantic plausibility has been explored by those such as Padó et al. (2009), Erk et al. (2010) and Sayeed et al. (2015).
Textual Entailment: A multi-year source of textual inference examples was generated under the Recognizing Textual Entailment (RTE) Challenges, introduced by Dagan et al. (2006):
We say that T entails H if, typically, a human reading T would infer that H is most likely true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge.
This definition strayed from the stricter notion of entailment used by linguistic semanticists, such as those involved with FRACAS. While Giampiccolo et al. (2008) extended binary RTE with an "unknown" category, the entailment community has primarily focused on issues such as paraphrase and monotonicity, as captured by the Natural Logic implementation of MacCartney and Manning (2007).
Language understanding in context is not only understanding the entailments of a sentence, but also the plausible inferences of the sentence, i.e. the new posterior on the world after reading the sentence. A new sentence in a discourse is almost never entailed by another sentence in the discourse, because such a sentence would add no new information. In order to successfully process a discourse, there needs to be some understanding of what new information can be possibly or plausibly added to the discourse. Collecting sentence pairs with ordinal entailment connections is potentially useful for improving and testing these language understanding capabilities that would be needed by algorithms for applications like storytelling.

Garrette et al. (2011) and Beltagy et al. (2016) treated textual entailment as probabilistic logical inference in Markov Logic Networks (Richardson and Domingos, 2006). But the notion of probability in their entailment task has a subtle distinction from our problem of common-sense inference. The probability of being an entailment given by a probabilistic model trained for a binary classification (being an entailment or not) is not necessarily the same as the likelihood of an inference being true. For example:
T: A person flips a coin.
H: That flip comes up heads.
No human reading T should infer that H is true. A model trained to make ordinal predictions should say: "plausible, with probability 1.0", whereas a model trained to make binary entailed/not-entailed predictions should say: "not entailed, with probability 1.0". The following example exhibits the same property:
T: An animal eats food.
H: A person eats food.
Again, with high confidence, H is plausible; and, with high confidence, it is also not entailed.
Non-entailing Inference: Of the various non-"entailment" textual inference tasks, a few are most salient here. Agirre et al. (2012) piloted a Textual Similarity evaluation which has been refined in subsequent years: systems produce scalar values corresponding to predictions of how similar the meaning is between two provided sentences. E.g., the following pair from SICK was judged very similar (4.2 out of 5), while also being a contradiction: "There is no biker jumping in the air" and "A lone biker is jumping in the air". The ordinal approach we advocate for relies on a graded notion, like textual similarity.
The Choice of Plausible Alternative (COPA) task (Roemmele et al., 2011) was a reaction to RTE, similarly motivated to probe a system's ability to understand inferences that are not strictly entailed: a single context was provided, with two alternative inferences, and a system had to judge which was more plausible. The COPA dataset was manually elicited, and is not large: we discuss this data further in §5.
The Narrative Cloze task (Chambers and Jurafsky, 2008) requires a system to score candidate inferences as to how likely they are to appear in a document that also included the provided context. Many such inferences are then not strictly entailed by the context. Further, the cloze task gives the benefit of being able to generate very large numbers of examples automatically by simply occluding parts of existing documents and asking a system to predict what is missing. The LAMBADA dataset (Paperno et al., 2016) is akin to our strategy for automatic generation followed by human filtering, but for cloze examples. As our concern is with inferences that are often true but never stated in a document, this approach is not viable here. The ROCStories corpus (Mostafazadeh et al., 2016) elicited a more "plausible" collection of documents in order to retain the narrative cloze in the context of common-sense inference. The ROCStories corpus can be viewed as an extension of the idea behind the COPA corpus, done at a larger scale with crowdsourcing, and with multi-sentence contexts; we consider this dataset in §5.
Alongside the narrative cloze, Pichotta and Mooney (2016) made use of a 5-point Likert scale (very likely to very unlikely) as a secondary evaluation of various script induction techniques. While they were concerned with measuring their ability to generate very likely inferences, here we are interested in generating a wide swath of inference candidates, including those that are impossible.

Ordinal Common-sense Inference
Our goal is a system that can perform speculative, common-sense inference as part of understanding language. Based on the observed shortfalls of prior work, we propose the notion of Ordinal Common-sense Inference (OCI). OCI embraces the notion of Dagan et al. (2006), in that we are concerned with human judgements of epistemic modality. 9

As agreed by many linguists, modality in natural language is a continuous category, but speakers are able to map areas of this axis into discrete values (Lyons, 1977; Horn, 1989; de Haan, 1997) - Saurí and Pustejovsky (2009)

According to Horn (1989), there are two scales of epistemic modality which differ in polarity (positive vs. negative polarity): ⟨certain, likely, possible⟩ and ⟨impossible, unlikely, uncertain⟩. The Square of Opposition (SO) (Fig 2) illustrates the logical relations holding between values in the two scales. Based on their logical relations, we can make a set of exhaustive epistemic modals: ⟨very likely, likely, possible, impossible⟩, where ⟨very likely, likely, possible⟩ lie on a single, positive Horn scale, and impossible, a complementary concept from the corresponding negative Horn scale, completes the set. In this paper, we further replace the value possible by the more fine-grained values (technically possible and plausible). This results in a 5-point scale of likelihood: ⟨very likely, likely, plausible, technically possible, impossible⟩. The OCI task definition directly embraces subjective likelihood on such an ordinal scale. Humans are presented with a context C and asked whether a provided hypothesis H is very likely, likely, plausible, technically possible, or impossible. Furthermore, an important part of this process is the generation of H by automatic methods, which seeks to avoid the elicitation bias of many prior works.

Framework for collecting OCI corpus
We now describe our framework for collecting ordinal common-sense inference examples. It is natural to collect this data in two stages. In the first stage (§4.1), we automatically generate inference candidates given some context. We propose two broad approaches using either general world knowledge or neural methods. In the second stage (§4.2), we annotate these candidates with ordinal labels.

Generation based on World Knowledge
Our motivation for this approach was first introduced by Schubert (2002):
There is a largely untapped source of general knowledge in texts, lying at a level beneath the explicit assertional content. This knowledge consists of relationships implied to be possible in the world, or, under certain conditions, implied to be normal or commonplace in the world.
Following Schubert (2002) and Van Durme and Schubert (2008), we define an approach for abstracting over explicit assertions derived from corpora, leading to a large-scale collection of general possibilistic statements. As shown in Fig 3, the approach proceeds in steps (a) through (d).

(a) Extracting propositions: We use the Gigaword corpus (Parker et al., 2011) for extracting propositions as it is a comprehensive text archive. There exists a version containing automatically generated syntactic annotation (Ferraro et al., 2014), which bootstraps large-scale knowledge extraction. We use PyStanfordDependencies 12 to convert constituency parses to dependency parses, from which we extract structured propositions.

(b) Abstracting propositions: In this step, we abstract the propositions into a more general form. This involves lemmatization, stripping inessential modifiers and conjuncts, and replacing specific arguments with generic types. 13 This method of abstraction often yields general presumptions about the world. To reduce noise from predicate-argument extraction, we only keep 1-place and 2-place predicates after abstraction.
We further generalize individual arguments to concepts by attaching semantic-class labels to them. Here we choose WordNet (Miller, 1995) noun synsets 14 as the semantic-class set. When selecting the correct sense for an argument, we adopt a fast and relatively accurate method: always taking the first sense, which is usually the most commonly used sense (Suchanek et al., 2007; Paşca, 2008). By doing so, we attach senses to 84 million abstracted propositions, covering 43.7% (35,811/81,861) of WordNet noun senses.
Each of these WordNet senses, then, is associated with a set of abstracted propositions. The abstracted propositions are turned into templates by replacing the sense's corresponding argument with a placeholder, similar to Van Durme et al. (2009) (see Fig 3 (b)). We remove any template associated with a sense if it occurs fewer than two times for that sense, leaving 38 million unique templates.

(c) Deriving properties via WordNet: At this step, we want to associate with each WordNet sense a set of possible properties. We employ three strategies.

The first strategy is to use a decision tree to pick out highly discriminative properties for each WordNet sense. Specifically, for each set of cohyponyms, 15 we train a decision tree using the associated templates as features. For example, in Fig 3 (c), we train a decision tree over the cohyponyms of publication.n.01. Then the template "person subscribe to ____" would be selected as a property of magazine.n.01, and the template "person borrow ____ from library" for book.n.01. The second strategy selects the most frequent templates associated with each sense as properties of that sense. The third strategy uses WordNet ISA relations to derive new properties of senses. E.g., for the sense book.n.01 and its hypernym publication.n.01, we generate a property "____ be publication".

(d) Generating hypotheses: As shown in Fig 3 (d), given a discourse context (Tanenhaus and Seidenberg, 1980), we first extract an argument of the context, then select the derived properties for the argument. Since we don't assume any specific sense for the argument, these properties could come from any of its candidate senses. We generate hypotheses by replacing the placeholder in the selected properties with the argument, and verbalizing the properties. 16
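The template, property, and hypothesis steps described above can be illustrated with a small sketch. It is not the authors' pipeline: the `____` placeholder string, the function names, and the toy propositions are assumptions made for illustration, the decision-tree strategy is omitted, and the final verbalization step (e.g. inflecting "be") is not attempted.

```python
from collections import Counter

PLACEHOLDER = "____"  # assumed placeholder symbol, for illustration only


def make_template(proposition, argument):
    """Replace every token equal to `argument` with the placeholder."""
    return " ".join(PLACEHOLDER if tok == argument else tok
                    for tok in proposition.split())


def templates_for_sense(propositions, argument, min_count=2):
    """Count templates for one sense; drop those seen fewer than `min_count` times."""
    counts = Counter(make_template(p, argument) for p in propositions)
    return {t: c for t, c in counts.items() if c >= min_count}


def most_frequent_properties(template_counts, k=2):
    """Frequency strategy: the k most frequent templates of a sense."""
    return [t for t, _ in Counter(template_counts).most_common(k)]


def isa_property(hypernym_lemma):
    """ISA strategy: a copular template built from a hypernym."""
    return f"{PLACEHOLDER} be {hypernym_lemma}"


def generate_hypotheses(argument, properties):
    """Fill the placeholder of each selected property with the context argument."""
    return [p.replace(PLACEHOLDER, argument) for p in properties]


# Toy abstracted propositions attached to a hypothetical sense of "book".
props = (["person read book"] * 3
         + ["person borrow book from library"] * 2
         + ["book fall"])
counts = templates_for_sense(props, "book")
properties = most_frequent_properties(counts) + [isa_property("publication")]
hypotheses = generate_hypotheses("book", properties)
print(hypotheses)
```

The singleton template "____ fall" is filtered out by the minimum-count rule, so only the two frequent templates and the ISA property are turned into hypotheses.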

Generation via Neural Methods
In addition to the knowledge-based methods described above, we also adapt a neural sequence-to-sequence model (Vinyals et al., 2015; Bahdanau et al., 2014) to generate inference candidates given contexts. The model is trained on sentence pairs labeled "entailment" from the SNLI corpus (Bowman et al., 2015) (train). Here, the SNLI "premise" is the input (context C), and the SNLI "hypothesis" is the output (hypothesis H).
We employ two different strategies for forward generation of inference candidates given any context. The sentence-prompt strategy uses the entire sentence in the context as an input, and generates output using greedy decoding. The word-prompt strategy differs by using only a single word from the context as input. This word is chosen in the same fashion as in step (d) of generation based on world knowledge, i.e. it is an argument of the context. The second approach is motivated by our hypothesis that providing only a single-word context will force the model to generate a hypothesis that generalizes over the many contexts in which that word was seen, resulting in more common-sense-like hypotheses, as in Fig 4. We later present the full context and decoded hypotheses to crowdsource workers for annotation.
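[Fig 4: examples of sequence-to-sequence hypothesis generation. Word prompt: dustpan. Decoded hypothesis: a person is cleaning. Sentence prompt: a boy in blue and white shorts is sweeping with a broom and dustpan. Decoded hypothesis: a young man is holding a broom.]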
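Greedy decoding under either prompt strategy can be sketched as follows. This is only an illustration: `toy_next_token_probs` is a hypothetical stand-in for the trained decoder (it ignores the prompt entirely), and the end-of-sequence symbol `</s>` and the length cap are assumptions.

```python
def greedy_decode(next_token_probs, prompt, max_len=20, eos="</s>"):
    """Repeatedly pick the single most probable next token until EOS."""
    output = []
    for _ in range(max_len):
        probs = next_token_probs(prompt, output)
        token = max(probs, key=probs.get)  # highest-probability word
        if token == eos:
            break
        output.append(token)
    return output


# Hypothetical next-token table standing in for a trained model.
TOY_TABLE = {
    (): {"a": 0.9, "person": 0.1},
    ("a",): {"person": 0.7, "broom": 0.3},
    ("a", "person"): {"is": 0.8, "</s>": 0.2},
    ("a", "person", "is"): {"cleaning": 0.6, "</s>": 0.4},
    ("a", "person", "is", "cleaning"): {"</s>": 1.0},
}


def toy_next_token_probs(prompt, prefix):
    """Toy model: looks up the distribution by decoded prefix only."""
    return TOY_TABLE[tuple(prefix)]


# Word-prompt strategy: a single argument of the context is the input.
word_prompt = ["dustpan"]
# Sentence-prompt strategy: the full context sentence is the input.
sentence_prompt = "a boy is sweeping with a broom and dustpan".split()

print(" ".join(greedy_decode(toy_next_token_probs, word_prompt)))
```

With a real model the two prompts would yield different decodings; the only difference between the strategies in this sketch is what is passed as `prompt`.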

Neural Sequence-to-Sequence Model
Neural sequence-to-sequence models learn to map variable-length input sequences to variable-length output sequences, as a conditional probability of output given input. For our purposes, we want to learn the conditional probability of a hypothesis sentence, H, given a context sentence, C, i.e., P(H|C).
The sequence-to-sequence architecture consists of two components: an encoder and a decoder. The encoder is a recurrent neural network (RNN) iterating over input tokens (i.e., words in C), and the decoder is another RNN iterating over output tokens (words in H). The final state of the encoder, $h_C$, is passed to the decoder as its initial state. We use a three-layer stacked LSTM (state size 512) for both the encoder and decoder RNN cells, with independent parameters for each. We use the LSTM formulation of Hochreiter and Schmidhuber (1997) as summarized in Vinyals et al. (2015).
The network computes P(H|C):

$$P(H \mid C) = \prod_{t=1}^{\mathrm{len}(H)} p(w_t \mid w_{<t}, C) \qquad (1)$$

where $w_t$ are the words in H. At each time step, t, the successive conditional probability is computed from the LSTM's current hidden state:

$$p(w_t \mid w_{<t}, C) \propto \exp(v_{w_t} \cdot h_t) \qquad (2)$$

where $v_{w_t}$ is the embedding of word $w_t$ from its corresponding row in the output vocabulary matrix, V (a learnable parameter of the network), and $h_t$ is the hidden state of the decoder RNN at time t. In our implementation, we set the vocabulary to be all words that appear in the training data at least twice, resulting in a vocabulary of size 24,322.

This model also makes use of an attention mechanism. 17 An attention vector, $attn_t$, is concatenated with the LSTM hidden state at time t to form the hidden state, $h_t$, from which output probabilities are computed (Eqn. 2). This attention vector is a weighted average of the hidden states of the encoder, $h_{1 \le i \le \mathrm{len}(C)}$:

$$u_i^t = v^\top \tanh(W_1 h_i + W_2 h_t)$$
$$a_i^t = \mathrm{softmax}(u_i^t)$$
$$attn_t = \sum_{i=1}^{\mathrm{len}(C)} a_i^t h_i$$

where vector v and matrices $W_1$, $W_2$ are parameters.

The network is trained via backpropagation on the cross-entropy loss of the observed sequences in training. A sampled softmax is used to compute the loss during training, while a full softmax is used after training to score unseen (C, H) pairs, or generate an H given a C. Generation is performed via beam search with a beam size of 1; the highest probability word is decoded at each time step and fed as input to the decoder at the next time step until an end-of-sequence token is decoded.
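As a rough illustration of the scoring described above, the sketch below computes log P(H|C) as a sum of per-step log-softmax values over a toy output vocabulary matrix, and an attention vector as a softmax-weighted average of encoder states. The decoder states and raw attention scores are supplied by hand here (in the model they come from the LSTM and the learned attention parameters), and all function names are assumptions.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sequence_log_prob(hidden_states, word_ids, V):
    """Sum over t of log p(w_t | w_<t, C), with p proportional to exp(v_w . h_t).

    hidden_states[t] is the decoder state h_t, word_ids[t] indexes w_t,
    and V holds one output embedding (row) per vocabulary word.
    """
    total = 0.0
    for h_t, w_t in zip(hidden_states, word_ids):
        scores = [dot(v_w, h_t) for v_w in V]
        total += log_softmax(scores)[w_t]
    return total


def attention_vector(encoder_states, raw_scores):
    """Softmax-weighted average of encoder hidden states."""
    weights = [math.exp(x) for x in log_softmax(raw_scores)]
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states))
            for d in range(dim)]


# Toy values: a 2-word vocabulary and 2-dimensional states.
V = [[1.0, 0.0], [0.0, 1.0]]
decoder_states = [[0.0, 0.0], [0.0, 0.0]]  # uninformative states
print(sequence_log_prob(decoder_states, [0, 1], V))
print(attention_vector([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0]))
```

With uninformative decoder states each word receives probability 0.5, so the two-word sequence scores 2 log 0.5; with equal attention scores the attention vector is the plain mean of the encoder states.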

Ordinal Label Annotation
In this stage, we turn to human efforts to annotate common-sense inference candidates with ordinal labels. The annotator is given a context, and then is asked to assess the likelihood of the hypotheses being true. These context-hypothesis pairs are annotated with one of the five labels: very likely, likely, plausible, technically possible, and impossible, corresponding to the ordinal values of {5,4,3,2,1} respectively.
In the case that the hypotheses in the inference candidates do not make sense, or have grammatical errors, judges can provide an additional label, NA, so that we can filter these candidates in post-processing. The combination of generation of common-sense inference candidates with human filtering seeks to avoid the problem of elicitation bias.
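The label scheme above maps naturally onto an ordered enumeration. The following is a minimal sketch, assuming a hypothetical record layout in which an NA judgment is stored as `None`; how NA judgments from several annotators are combined is not specified here and is left out.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """The five ordinal labels and their values {5,4,3,2,1}."""
    IMPOSSIBLE = 1
    TECHNICALLY_POSSIBLE = 2
    PLAUSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def drop_na(examples):
    """Post-processing filter: keep examples whose label is not NA (None)."""
    return [ex for ex in examples if ex["label"] is not None]


# Hypothetical annotated records; the field names are illustrative.
examples = [
    {"context": "Sam bought a new clock", "hypothesis": "The clock runs",
     "label": Likelihood.LIKELY},
    {"context": "Sam bought a new clock", "hypothesis": "clock the a runs",
     "label": None},  # marked NA: does not make sense
]
kept = drop_na(examples)
print([ex["hypothesis"] for ex in kept])
```

Using `IntEnum` keeps the labels comparable and sortable, which is convenient for ordinal statistics such as medians.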

JOCI Corpus
We now describe in depth how we created the JHU Ordinal Common-sense Inference (JOCI) corpus. The main part of the corpus consists of contexts chosen from SNLI (Bowman et al., 2015) and ROCStories (Mostafazadeh et al., 2016), paired with hypotheses generated via methods described in §4.1. These pairs are then annotated with ordinal labels using crowdsourcing (§4.2). We also include context-hypothesis pairs directly taken from SNLI and other corpora (e.g., as premise-hypothesis pairs), and re-annotate them with ordinal labels.

Data sources for Context-Hypothesis Pairs
In order to compare with existing inference corpora, we choose contexts from two resources: (1) the first sentence in the sentence pairs of the SNLI corpus which are captions from the Flickr30k corpus (Young et al., 2014), and (2) the first sentence in the stories of the ROCStories corpus.
We then collect candidates of automatically generated common-sense inferences (AGCI) against these contexts. Specifically, in the SNLI train set, there are over 150K different first sentences, involving 7,414 different arguments according to predicate-argument extraction. We randomly choose 4,600 arguments. For each argument, we sample one first sentence that has the argument, and collect candidates of AGCI against this as context. We do the same generation for the SNLI development set and test set. We also collect candidates of AGCI against randomly sampled first sentences in the ROCStories corpus. Collectively, these pairs and their ordinal labels (to be described in §5.2) make up the main part of the JOCI corpus. The statistics of this subset are shown in Table 1 (first five rows).
For comprehensiveness, we also produced ordinal labels on (C, H) pairs directly drawn from existing corpora. For SNLI, we randomly select 1000 contexts (premises) from the SNLI train set. Then, the corresponding hypothesis is one of the entailment, neutral, or contradiction hypotheses taken from SNLI. For ROCStories, we defined C as the first sentence of the story, and H as the second or third sentence. For COPA, (C, H) corresponds to premise-effect. The statistics are shown in the bottom rows of Table 1.

Crowdsourced Ordinal Label Annotation
We use Amazon Mechanical Turk to annotate the hypotheses with ordinal labels. In each HIT (Human Intelligence Task), a worker is presented with one context and one or two hypotheses, as shown in Fig 5. First, the annotator sees an "Initial Sentence" (context), e.g. "John's goal was to learn how to draw well.", and is then asked about the plausibility of the hypothesis, e.g. "A person accomplishes the goal". In particular, we ask the annotator how plausible it is that the hypothesis is true during or shortly after the context, because without this constraint, most sentences are technically plausible in some imaginary world. If the hypothesis does not make sense, 18 the workers can check the box under the question and skip the ordinal annotation. In the annotation, about 25% of hypotheses are marked as not making sense, and are removed from our data.
[Table: example context-hypothesis pairs with the ordinal labels given by three workers. (The upper 5 rows are samples from AGCI-WK. The lower 5 rows are samples from AGCI-NN.)]
Labels | Context | Hypothesis
[3,3,3] | Two dogs fighting, one is black, the other beige. | The dogs are playing.
[2,2,3] | A bare headed man wearing a dark blue cassock, sandals, and dark blue socks mounts the stone steps leading into a weathered old building | A man is in the middle of home building.
[1,1,1] | A skydiver hangs from the undercarriage of an airplane or some sort of air gliding device | A camera is using an object.

[Fig 5: the annotation interface. Initial Sentence: John's goal was to learn how to draw well. For each hypothesis (here "A person accomplishes the goal." and "The goal is a content."), the worker judges how likely the statement is to be true during or shortly after the context of the initial sentence, or checks the box "This statement does not make sense."]

With the sampled contexts and the auto-generated hypotheses, we prepare 50K common-sense inference examples for crowdsourced annotation in bulk. In order to guarantee the quality of annotation, we have each example annotated by three workers. We take the median of the three as the gold label. To make sure non-expert workers have a correct understanding of our task, before launching the later tasks in bulk, we run two pilots to create a pool of qualified workers. In the first pilot, we publish 100 examples. Each example is annotated by five workers. From this pilot, we collect a set of "good" examples which have 100% annotation agreement among workers. The ordinal labels chosen by the workers are regarded as the gold labels. In the second pilot, we randomly select two "good" (high-agreement) examples for each ordinal label and publish a HIT with these examples. To measure workers' agreement, we calculate the average of quadratic weighted Cohen's κ scores between workers' annotations. By setting a threshold of 0.7 on the average κ score, we are able to create a pool that has over 150 qualified workers.
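The two aggregation steps above (a median gold label, and averaged quadratic weighted Cohen's κ between workers) can be sketched in plain Python. This is an illustrative implementation of the standard quadratic weighted κ, not the authors' script; the function names and the convention of returning 1.0 when both annotators use one identical constant label are assumptions.

```python
from itertools import combinations
from statistics import median


def gold_label(three_labels):
    """Gold label = median of the three workers' ordinal labels."""
    return int(median(three_labels))


def quadratic_weighted_kappa(a, b, num_labels=5):
    """Cohen's kappa with quadratic disagreement weights; labels are 1..num_labels."""
    n = len(a)
    k = num_labels
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - 1][y - 1] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2       # quadratic weight
            num += w * observed[i][j]              # observed disagreement
            den += w * hist_a[i] * hist_b[j] / n   # chance disagreement
    if den == 0.0:  # both annotators used one identical constant label
        return 1.0
    return 1.0 - num / den


def average_pairwise_kappa(annotations):
    """Mean kappa over all pairs of workers (one label list per worker)."""
    pairs = list(combinations(annotations, 2))
    return sum(quadratic_weighted_kappa(x, y) for x, y in pairs) / len(pairs)


print(gold_label([2, 2, 3]))
print(quadratic_weighted_kappa([1, 1, 5, 5], [5, 5, 1, 1]))
```

A worker could then be admitted to the qualified pool when the average κ against the other annotations reaches the chosen threshold (0.7 in the text).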

Corpus Characteristics
We want a corpus with reliable inter-annotator agreement. Additionally, in order to evaluate or train a common-sense inference system, we ideally need a corpus that provides as many inference examples as possible for every ordinal likelihood value. In this section, we investigate the characteristics of the JOCI corpus. We also compare JOCI with related resources under our annotation protocol.

Quality: We measure the quality of each pair by calculating Cohen's κ of workers' annotations. The average κ of the JOCI corpus is 0.54. Fig 7 shows the growth of the size of JOCI as we decrease the threshold of the averaged κ to filter pairs. Even if we place a relatively strict threshold (>0.6), we still get a large subset of JOCI with over 20K pairs.

Label Distribution: We believe datasets with wide support of label distribution are important in training and evaluating systems to recognize ordinal scale inferences. Fig 6a shows the normalized label distribution of JOCI vs. SNLI. As desired, JOCI covers a wide range of ordinal likelihoods, with many samples in each ordinal scale. Note also how traditional RTE labels are related to ordinal labels, although many inferences in SNLI require no common-sense knowledge (e.g. paraphrases). As expected, entailments are mostly considered very likely; neutral inferences mostly plausible; and contradictions likely to be either impossible or technically possible.
Fig 6b shows the normalized distributions of JOCI and ROCStories. Compared with ROCStories, JOCI still covers a wider range of ordinal likelihood. We observe in ROCStories that, while 2nd sentences are in general more likely to be true than 3rd, a large proportion of both 2nd and 3rd sentences are plausible, as compared to likely or very likely. This matches intuition: pragmatics dictates that subsequent sentences in a standard narrative carry new information.19 That our protocol picks this up is an encouraging sign for the protocol, and it suggests that the makeup of the elicited ROCStories collection is indeed "story like."
For the COPA dataset, we make use only of the pairs in which the alternatives are plausible effects (rather than causes) of the premise, as our protocol more easily accommodates these pairs.20 Annotating this section of COPA with ordinal labels provides an enlightening and validating view of the dataset. Fig 6c shows the normalized distribution of COPA next to that of JOCI. (COPA-1 alternatives are marked as most plausible; COPA-0 are not.) True to its name, the majority of COPA alternatives are labeled as either plausible or likely; almost none are impossible. This is consistent with the idea that the COPA task is to determine which of two possible options is the more plausible. Fig 8 shows the joint distribution of ordinal labels on (COPA-0, COPA-1) pairs. As expected, the densest areas of the heatmap lie above the diagonal, indicating that in almost every pair, COPA-1 received a higher likelihood judgement than COPA-0.
Automatic Generation Comparisons: We compare the label distributions of different methods for automatic generation of common-sense inference (AGCI) in Fig 9. Among AGCI-WK (generation based on world knowledge) methods, the ISA strategy yields a bimodal distribution, with the majority of inferences labeled impossible or very likely. This is likely because most copular statements generated with the ISA strategy will be either categorically true or categorically false. In contrast, the decision tree and frequency based strategies generate many more hypotheses with intermediate ordinal labels. This suggests the propositional templates (learned from text) capture many "possibilistic" hypotheses, which is our aim. The two AGCI-NN (generation via neural methods) strategies show interesting differences in label distribution as well. Sequence-to-sequence decodings with full-sentence prompts lead to more very likely labels than single-word prompts. The reason may be that the model behaves more similarly to SNLI entailments when it has access to all the information in the context. When combined, the five AGCI strategies (three AGCI-WK and two AGCI-NN) provide reasonable coverage over all five categories, as can be seen in Fig 6.
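The normalized label distributions compared in this section can be sketched in a few lines. The category names follow the five ordinal labels used in this paper. The function name `normalized_distribution` and the toy label list are assumptions for illustration only.

```python
from collections import Counter

# The five ordinal categories, ordered from least to most likely.
CATEGORIES = [
    "impossible",
    "technically possible",
    "plausible",
    "likely",
    "very likely",
]


def normalized_distribution(labels, categories=CATEGORIES):
    """Return the fraction of labels falling in each ordinal category."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    # Categories that never occur are kept with a fraction of 0.0, so
    # that distributions from different corpora stay directly comparable.
    return {c: counts[c] / total for c in categories}


# Toy labels (hypothetical, not taken from any corpus).
toy_labels = ["plausible", "plausible", "likely", "impossible"]
dist = normalized_distribution(toy_labels)
```

Keeping the zero entries makes a bimodal distribution, such as the one reported for the ISA strategy, easy to spot when two distributions are placed side by side.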

Predicting Ordinal Judgments
We want to be able to predict ordinal judgments of the kind presented in this corpus. Our goal in this section is to establish baseline results and explore what kinds of features are useful for predicting ordinal common-sense inference. To do so, we train and test a logistic ordinal regression model g_θ(φ(C, H)), which outputs ordinal labels using features φ defined on context-inference pairs. Here, g_θ(·) is a regression model with θ as trained parameters; we train using the margin-based method of Rennie and Srebro (2005), implemented in Pedregosa-Izquierdo (2015),21 with the following features:
Bag of words features (BOW): We compute (1) "BOW overlap" (size of word overlap in C and H), and (2) BOW overlap divided by the length of H.

Similarity features (SIM): Using Google's word2vec vectors trained on 100 billion tokens of GoogleNews,22 we (1) sum the vectors in both the context and hypothesis and compute the cosine similarity of the resulting two vectors ("similarity of average"), and (2) compute the cosine similarity of all word pairs across the context and inference, then average those similarities ("average of similarity").
Seq2seq score features (S2S): We compute the log probability log P(H|C) under the sequence-to-sequence model described in §4.1.2. There are five variants: (1) Seq2seq trained on SNLI "entailment" pairs only, (2) "neutral" pairs only, (3) "contradiction" pairs only, (4) "neutral" and "contradiction" pairs, and (5) SNLI pairs (any label) with the context (premise) replaced by an empty string.
Seq2seq binary features (S2S-BIN): Binary indicator features for each of the five seq2seq model variants, each indicating whether that model achieved the lowest score on the context-hypothesis pair.
Length features (LEN): This set comprises three features: the length of the context (in tokens), the difference in length between the context and hypothesis, and a binary feature indicating if the hypothesis is longer than the context.
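The BOW, SIM, S2S-BIN and LEN features listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: "word overlap" is taken to be the number of distinct shared tokens, hypotheses are assumed non-empty, words without a vector are skipped, and `toy_vectors` is a hypothetical two-dimensional stand-in for the word2vec vectors. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def bow_features(context, hypothesis):
    # Assumption: overlap counts distinct tokens shared by C and H.
    overlap = len(set(context) & set(hypothesis))
    return [overlap, overlap / len(hypothesis)]


def sim_features(context, hypothesis, vectors):
    # Assumption: tokens without a vector are ignored.
    c_vecs = [vectors[w] for w in context if w in vectors]
    h_vecs = [vectors[w] for w in hypothesis if w in vectors]
    if not c_vecs or not h_vecs:
        return [0.0, 0.0]
    # "Similarity of average": cosine of the two summed vectors.
    c_sum = [sum(dims) for dims in zip(*c_vecs)]
    h_sum = [sum(dims) for dims in zip(*h_vecs)]
    sim_of_avg = cosine(c_sum, h_sum)
    # "Average of similarity": mean cosine over all cross word pairs.
    pair_sims = [cosine(u, v) for u in c_vecs for v in h_vecs]
    avg_of_sim = sum(pair_sims) / len(pair_sims)
    return [sim_of_avg, avg_of_sim]


def s2s_bin_features(scores):
    # One indicator per seq2seq variant: 1 if it has the lowest score.
    lowest = min(scores)
    return [int(s == lowest) for s in scores]


def len_features(context, hypothesis):
    return [
        len(context),                         # context length in tokens
        len(context) - len(hypothesis),       # length difference
        int(len(hypothesis) > len(context)),  # is the hypothesis longer?
    ]


# Hypothetical toy vectors and a tokenized context-hypothesis pair.
toy_vectors = {"new": [1.0, 1.0], "clock": [1.0, 0.0], "runs": [0.0, 1.0]}
context = "sam bought a new clock".split()
hypothesis = "the clock runs".split()
features = (
    bow_features(context, hypothesis)
    + sim_features(context, hypothesis, toy_vectors)
    + len_features(context, hypothesis)
)
```

The seq2seq log probabilities themselves come from trained models and are not sketched here; `s2s_bin_features` only shows how five such scores would be turned into indicators.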

Analysis
We train and test our regression model on two subsets of the JOCI corpus, which, for brevity, we call "A" and "B." "A" consists of 2,976 sentence pairs (i.e., context-hypothesis pairs) from SNLI-train annotated with ordinal labels. This corresponds to the three rows labeled SNLI in Table 1 (993 + 988 + 995 = 2,976 pairs), and can be viewed as a textual entailment dataset re-labeled with ordinal judgments. "B" consists of 6,375 context-inference pairs, in which the contexts are the same 2,976 SNLI-train premises as in "A", and the hypotheses are generated based on world knowledge (§4.1.1); these pairs are also annotated with ordinal labels. This corresponds to a subset of the row labeled AGCI in Table 1. A key difference between "A" and "B" is that the hypotheses in "A" are human-elicited, while those in "B" are auto-generated; we are interested in seeing whether this affects the task's difficulty.
Tables 4 and 5 show each model's performance (mean squared error and Spearman's ρ, respectively) in predicting ordinal labels.24 We compare our ordinal regression model g_θ(·) with these baselines:
Most Frequent: Select the ordinal class appearing most often in train.
Frequency Sampling: Select an ordinal label according to the label distribution in train.
Rounded Average: Average over all labels from train, rounded to the nearest ordinal.
One-vs-All: Train one SVM classifier per ordinal class and select the class label with the largest corresponding margin. We train this model with the same set of features as the ordinal regression model.
Overall, the regression model achieves the lowest MSE and highest ρ, implying that this dataset is learnable and tractable. Naturally, we would desire a model that achieves MSE under 1.0, and we hope that the release of our dataset will encourage more concerted effort in this common-sense inference task. Importantly, note that performance on A-test is better than on B-test. We believe "B" is a more challenging dataset because auto-generation of hypotheses leads to wider variety than elicitation.
We also run a feature ablation test. Table 6 shows that the most useful features differ for A-test and B-test. On A-test, where the inferences are elicited from humans, removal of similarity- and BOW-based features together results in the largest performance drop. On B-test, by contrast, removing similarity and BOW features results in a performance drop comparable to that of removing seq2seq features. These observations point to statistical differences between human-elicited and auto-generated hypotheses, a motivating point of the JOCI corpus.
23 Details of the data split are reported in the dataset release.
24 MSE and Spearman's ρ are both commonly used evaluations in ordinal prediction tasks (Baccianella et al., 2009; Bennett and Lanning, 2007; Gaudette and Japkowicz, 2009; Agresti, 2003; Popescu and Dinu, 2009; Liu et al., 2015; Gella et al., 2013).
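The three training-set baselines and the two evaluation metrics used in this analysis can be sketched in plain Python. The sketch assumes ordinal labels are encoded as integers 1 to 5 (an assumption, as are all function names and the toy data), rounds halves upward in Rounded Average, and computes Spearman's ρ as the Pearson correlation of average ranks so that ties are handled.

```python
import math
import random
from collections import Counter


def most_frequent(train_labels):
    """Most Frequent: the class that occurs most often in train."""
    return Counter(train_labels).most_common(1)[0][0]


def frequency_sampling(train_labels, rng):
    """Frequency Sampling: draw a label from the empirical train distribution."""
    return rng.choice(train_labels)


def rounded_average(train_labels):
    """Rounded Average: mean train label, rounded to the nearest ordinal."""
    mean = sum(train_labels) / len(train_labels)
    return int(math.floor(mean + 0.5))  # assumption: halves round upward


def mse(gold, pred):
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical labels on a 1-5 scale).
train = [1, 2, 2, 3, 3, 3, 4, 5]
gold = [1, 2, 3]
constant_pred = [most_frequent(train)] * len(gold)
baseline_mse = mse(gold, constant_pred)
sampled = frequency_sampling(train, random.Random(0))
```

Note that a constant baseline such as Most Frequent or Rounded Average has an undefined ρ, because its predictions do not vary; the sketch returns NaN in that case.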

Conclusions and Future Work
In motivating the need for automatically building collections of common-sense knowledge, Clark et al. (2003) wrote:
"China launched a meteorological satellite into orbit Wednesday." suggests to a human reader that (among other things) there was a rocket launch; China probably owns the satellite; the satellite is for monitoring weather; the orbit is around Earth; etc.
The use of "etc" summarizes an infinite number of other statements that a human reader would find to be very likely, likely, technically plausible, or impossible, given the provided context.
Preferably we could build systems that would automatically learn common-sense exclusively from available corpora; extracting not just statements about what is possible, but also the associated probabilities of how likely certain things are to obtain in any given context. We are unaware of existing work that has demonstrated this to be feasible.
We have thus described a multi-stage approach to common-sense textual inference: we first extract large numbers of possible statements from a corpus, and use those statements to generate contextually grounded context-hypothesis pairs. These are presented to humans for direct assessment of subjective likelihood, rather than relying on corpus data alone. As the data is automatically generated, we seek to bypass issues in human elicitation bias. Further, since subjective likelihood judgments are not difficult for humans, our crowdsourcing technique is both inexpensive and scalable.
Future work will extend our techniques for forward inference generation, further scale up the annotation of additional examples, and explore the use of larger, more complex contexts. The resulting JOCI corpus will be used to improve algorithms for natural language inference tasks such as storytelling and story understanding.

Figure 1: Examples of common-sense inference ranging from very likely, likely, plausible, technically possible, to impossible.

Figure 4: Examples of sequence-to-sequence hypothesis generation from single-word and full-sentence inputs.

Figure 5: The annotation interface, in which a drop-down list provides the ordinal labels to select.

Figure 6: Comparison of normalized distributions between JOCI and other corpora.

Table 1: JOCI corpus statistics, where each subset consists of different sources for context-and-hypothesis pairs, each annotated with common-sense ordinal labels. AGCI-WK represents candidates generated based on world knowledge. AGCI-NN represents candidates generated via neural methods.

Table 2: Examples of context-and-hypothesis pairs with ordinal judgements and median value.
Table 3 shows the statistics of the crowdsourced efforts.

Table 3: Statistics of the crowdsourced efforts.
Table 2 contains pairs randomly sampled from this subset,

Table 6: Ablation results for ordinal regression model on A-test and B-test. (* p-value < .01 for ρ)