From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

We propose to use the visual denotations of linguistic expressions (i.e. the set of images they describe) to define novel denotational similarity metrics, which we show to be at least as beneficial as distributional similarities for two tasks that require semantic inference. To compute these denotational similarities, we construct a denotation graph, i.e. a subsumption hierarchy over constituents and their denotations, based on a large corpus of 30K images and 150K descriptive captions.


Introduction
The ability to draw inferences from text is a prerequisite for language understanding. These inferences are what make it possible for even brief descriptions of everyday scenes to evoke rich mental images. For example, we would expect an image of people shopping in a supermarket to depict aisles of produce or other goods, and we would expect most of these people to be customers who are either standing or walking around. But such inferences require a great deal of commonsense world knowledge. Standard distributional approaches to lexical similarity (Section 2.1) are very effective at identifying which words are related to the same topic, and can provide useful features for systems that perform semantic inferences (Mirkin et al., 2009), but are not suited to capturing precise entailments between complex expressions. In this paper, we propose a novel approach for the automatic acquisition of denotational similarities between descriptions of everyday situations (Section 2). We define the (visual) denotation of a linguistic expression as the set of images it describes. We create a corpus of images of everyday activities (each paired with multiple captions; Section 3) to construct a large-scale visual denotation graph which associates image descriptions with their denotations (Section 4). The algorithm that constructs the denotation graph uses purely syntactic and lexical rules to produce simpler captions (which have a larger denotation). But since each image is originally associated with several captions, the graph can also capture similarities between syntactically and lexically unrelated descriptions.
We apply these similarities to two different tasks (Sections 6 and 7): an approximate entailment recognition task for our domain, where the goal is to decide whether the hypothesis (a brief image caption) refers to the same image as the premises (four longer captions), and the recently introduced Semantic Textual Similarity task (Agirre et al., 2012), which can be viewed as a graded (rather than binary) version of paraphrase detection. Both tasks require semantic inference, and our results indicate that denotational similarities are at least as effective as standard approaches to similarity. Our code and data set, as well as the denotation graph itself and the lexical similarities we define over it, are available for research purposes at http://nlp.cs.illinois.edu/Denotation.html.

Distributional Similarities
Figure 1 (captions of two example images; images not reproduced):
Image 1: Gray haired man in black suit and yellow tie working in a financial environment. A graying man in a suit is perplexed at a business meeting. A businessman in a yellow tie gives a frustrated look. A man in a yellow tie is rubbing the back of his neck. A man with a yellow tie looks concerned.
Image 2: A butcher cutting an animal to sell. A green-shirted man with a butcher's apron uses a knife to carve out the hanging carcass of a cow. A man at work, butchering a cow. A man in a green t-shirt and long tan apron hacks apart the carcass of a cow while another man hoses away the blood. Two men work in a butcher shop; one cuts the meat from a butchered cow, while the other hoses the floor.

The distributional hypothesis posits that linguistic expressions that appear in similar contexts have a similar meaning (Harris, 1954). This has led to the definition of vector-based distributional similarities, which represent each word $w$ as a vector $\mathbf{w}$ derived from counts of $w$'s co-occurrence with other words. These vectors can be used directly to compute the lexical similarities of words, either via the cosine of the angle between them, or via other, more complex metrics (Lin, 1998). More recently, asymmetric similarities have been proposed as more suitable for semantic inference tasks such as entailment (Weeds and Weir, 2003; Szpektor and Dagan, 2008; Clarke, 2009; Kotlerman et al., 2010). Distributional word vectors can also be used to define the compositional similarity of longer strings (Mitchell and Lapata, 2010). To compute the similarity of two strings, the lexical vectors of the words in each string are first combined into a single vector (e.g. by element-wise addition or multiplication), and then an appropriate vector similarity (e.g. cosine) is applied to the resulting pair of vectors.
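The composition step can be illustrated with a minimal sketch. The three-dimensional toy vectors and the function names `cosine` and `compose` are assumptions made for illustration; they are not taken from our implementation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def compose(word_vectors, mode="add"):
    """Combine the word vectors of a string element-wise, by addition or multiplication."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    result = list(word_vectors[0])
    for vec in word_vectors[1:]:
        if mode == "add":
            result = [r + x for r, x in zip(result, vec)]
        elif mode == "mul":
            result = [r * x for r, x in zip(result, vec)]
        else:
            raise ValueError("mode must be 'add' or 'mul'")
    return result

# Hypothetical co-occurrence vectors over three context words.
vectors = {
    "dog": [2.0, 0.0, 1.0],
    "run": [1.0, 1.0, 0.0],
    "poodle": [2.0, 0.0, 2.0],
}

s1 = compose([vectors["dog"], vectors["run"]], mode="add")
s2 = compose([vectors["poodle"], vectors["run"]], mode="add")
similarity = cosine(s1, s2)  # additive compositional similarity of "dog run" and "poodle run"
```

Multiplicative composition (`mode="mul"`) zeroes out every dimension on which any word has a zero count, which is why it behaves like a conjunction of the word contexts.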

Visual Denotations
Our approach is inspired by truth-conditional semantic theories in which the denotation of a declarative sentence is assumed to be the set of all situations or possible worlds in which the sentence is true (Montague, 1974; Dowty et al., 1981; Barwise and Perry, 1980). Restricting our attention to visually descriptive sentences, i.e. non-negative, episodic (Carlson, 2005) sentences that can be used to describe an image (Figure 1), we propose to instantiate the abstract notions of possible worlds or situations with concrete sets of images. The interpretation function $\llbracket \cdot \rrbracket$ maps each sentence $s$ to its visual denotation $\llbracket s \rrbracket$, which is the set of images $i \in U_s \subseteq U$ in a 'universe' of images $U$ that $s$ describes:

$$\llbracket s \rrbracket = \{\, i \in U \mid s \text{ describes } i \,\} \qquad (1)$$

Similarly, we map nouns and noun phrases to the set of images that depict the objects they describe, and verbs and verb phrases to the set of images that depict the events they describe.

Denotation Graphs
Denotations induce a partial ordering over descriptions: if $s$ (e.g. "a poodle runs on the beach") entails a description $s'$ (e.g. "a dog runs"), its denotation is a subset of the denotation of $s'$ ($\llbracket s \rrbracket \subseteq \llbracket s' \rrbracket$), and we say that $s'$ subsumes the more specific $s$ ($s' \sqsubseteq s$). In our domain of descriptive sentences, we can obtain more generic descriptions by simple syntactic and lexical operations $\omega \in O \subset S \times S$ that preserve upward entailment, so that if $\omega(s) = s'$, then $\llbracket s \rrbracket \subseteq \llbracket s' \rrbracket$. We consider three types of operations: the removal of optional material (e.g. PPs like on the beach), the extraction of simpler constituents (NPs, VPs, or simple Ss), and lexical substitutions of nouns by their hypernyms (poodle → dog). These operations are akin to the atomic edits of MacCartney and Manning (2008)'s NatLog system, and allow us to construct large subsumption hierarchies over image descriptions, which we call denotation graphs. Given a set of (upward entailment-preserving) operations $O \subset S \times S$, the denotation graph $DG = \langle E, V \rangle$ of a set of images $I$ and a set of strings $S$ represents a subsumption hierarchy in which each node $v = \langle s, \llbracket s \rrbracket \rangle \in V$ corresponds to a string $s \in S$ and its denotation $\llbracket s \rrbracket \subseteq I$. Directed edges $e = (s, s') \in E \subseteq V \times V$ indicate a subsumption relation $s \sqsubseteq s'$ between a more generic expression $s$ and its child $s'$. An edge from $s$ to $s'$ exists if there is an operation $\omega \in O$ that reduces the string $s'$ to $s$ (i.e. $\omega(s') = s$) and its inverse $\omega^{-1}$ expands the string $s$ to $s'$ (i.e. $\omega^{-1}(s) = s'$).
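To make the definition concrete, the following sketch stores a tiny denotation graph as a dictionary of image sets plus a list of (parent, child) edges. The strings, image IDs, and helper names (`is_consistent`, `subsumes`) are invented for illustration and are not part of the released graph.

```python
# Toy denotations: each string maps to the set of image IDs it describes (made-up values).
denotation = {
    "dog run": {1, 2, 3},
    "poodle run": {1, 2},
    "dog run on beach": {2, 3},
    "poodle run on beach": {2},
}

# Edges point from a more generic string to a child that one operation reduces to it.
edges = [
    ("dog run", "poodle run"),                    # hypernym substitution: poodle -> dog
    ("dog run", "dog run on beach"),              # dropped PP "on beach"
    ("poodle run", "poodle run on beach"),        # dropped PP "on beach"
    ("dog run on beach", "poodle run on beach"),  # hypernym substitution: poodle -> dog
]

def is_consistent(edges, denotation):
    """Check that every child's denotation is a subset of its parent's denotation."""
    return all(denotation[child] <= denotation[parent] for parent, child in edges)

def subsumes(generic, specific, edges):
    """True if `specific` equals `generic` or is reachable from it by following edges downwards."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    stack, seen = [generic], set()
    while stack:
        node = stack.pop()
        if node == specific:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return False
```

Under these toy values, "dog run" subsumes "poodle run on beach" along two different paths, while "poodle run" and "dog run on beach" are unordered with respect to each other.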

Denotational Similarities
Given a denotation graph over $N$ images, we estimate the denotational probability of an expression $s$ with a denotation of size $|\llbracket s \rrbracket|$ as $P(s) = |\llbracket s \rrbracket| / N$, and the joint probability of two expressions analogously as $P(s, s') = |\llbracket s \rrbracket \cap \llbracket s' \rrbracket| / N$. The conditional probability $P(s \mid s')$ indicates how likely $s$ is to be true when $s'$ holds, and yields a simple directed denotational similarity. The (normalized) pointwise mutual information (PMI) (Church and Hanks, 1990) defines a symmetric similarity:

$$\mathrm{nPMI}(s, s') = \frac{\log \dfrac{P(s, s')}{P(s)\,P(s')}}{-\log P(s, s')}$$

We set $P(s \mid s) = \mathrm{nPMI}(s, s) = 1$, and, if $s$ or $s'$ is not in the denotation graph, $\mathrm{nPMI}(s, s') = P(s, s') = 0$.
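These estimates are simple set operations, as the sketch below shows. The expressions, image IDs, and universe size are toy values. Returning -1 for disjoint denotations (the limit of nPMI as the joint probability goes to zero), returning 0 when both denotations cover the whole universe, and returning 0 for the conditional probability of an unknown expression are assumptions of this sketch that the text above does not specify.

```python
import math

N_IMAGES = 10  # size of the toy image universe
DENOTATIONS = {
    "eat lunch": {1, 2, 3, 4},
    "sit": {2, 3, 4, 5, 6, 7},
    "ride bike": {8, 9},
}

def prob(s):
    """P(s) = |[[s]]| / N."""
    return len(DENOTATIONS[s]) / N_IMAGES

def joint_prob(s, t):
    """P(s, t) = |[[s]] & [[t]]| / N; 0 if either expression is not in the graph."""
    if s not in DENOTATIONS or t not in DENOTATIONS:
        return 0.0
    return len(DENOTATIONS[s] & DENOTATIONS[t]) / N_IMAGES

def cond_prob(s, t):
    """P(s | t): fraction of the images described by t that are also described by s."""
    if s == t:
        return 1.0
    if s not in DENOTATIONS or t not in DENOTATIONS or not DENOTATIONS[t]:
        return 0.0  # assumption: unknown expressions get probability 0
    return len(DENOTATIONS[s] & DENOTATIONS[t]) / len(DENOTATIONS[t])

def npmi(s, t):
    """Normalized pointwise mutual information of two expressions."""
    if s == t:
        return 1.0
    if s not in DENOTATIONS or t not in DENOTATIONS:
        return 0.0
    joint = joint_prob(s, t)
    if joint == 0.0:
        return -1.0  # assumption: limit value for disjoint denotations
    if joint == 1.0:
        return 0.0   # assumption: both expressions describe every image
    pmi = math.log(joint / (prob(s) * prob(t)))
    return pmi / -math.log(joint)
```

In this toy universe, `cond_prob("sit", "eat lunch")` is 0.75 while `cond_prob("eat lunch", "sit")` is 0.5, which illustrates why the conditional probability is a directed similarity and nPMI is not.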

Our Data Set
Our data set (Figure 1) consists of 31,783 photographs of everyday activities, events and scenes (all harvested from Flickr) and 158,915 captions (obtained via crowdsourcing). It contains and extends Hodosh et al. (2013)'s corpus of 8,092 images. We followed Hodosh et al. (2013)'s approach to collect images. We also use their annotation guidelines, and use similar quality controls to correct spelling mistakes and eliminate ungrammatical or non-descriptive sentences. Almost all of the images that we add to those collected by Hodosh et al. (2013) have been made available under a Creative Commons license. Each image is described independently by five annotators who are not familiar with the specific entities and circumstances depicted in it, resulting in captions such as "Three people setting up a tent", rather than the kind of captions people provide for their own images ("Our trip to the Olympic Peninsula"). Moreover, different annotators use different levels of specificity, from describing the overall situation (performing a musical piece) to specific actions (bowing on a violin). This variety of descriptions associated with the same image is what allows us to induce denotational similarities between expressions that are not trivially related by syntactic rewrite rules.

Constructing the Denotation Graph
The construction of the denotation graph consists of the following steps: preprocessing and linguistic analysis of the captions, identification of applicable transformations, and generation of the graph itself.
Preprocessing and Linguistic Analysis We use the Linux spell checker, the OpenNLP tokenizer, POS tagger and chunker (http://opennlp.apache.org), and the Malt parser (Nivre et al., 2006) to analyze the captions. Since the vocabulary of our corpus differs significantly from the data these tools are trained on, we resort to a number of heuristics to improve the analyses they provide. Since some heuristics require us to identify different entity types, we developed a lexicon of the most common entity types in our domain (people, clothing, bodily appearance (e.g. hair or body parts), containers of liquids, food items and vehicles).
After spell-checking, we normalize certain words and compounds with several spelling variations, e.g. barbecue (barbeque, BBQ), gray (grey), waterski (water ski), brown-haired (brown haired), and tokenize the captions using the OpenNLP tokenizer. The OpenNLP POS tagger makes a number of systematic errors on our corpus (e.g. mistagging main verbs as nouns). Since these errors are highly systematic, we are able to correct them automatically by applying deterministic rules (e.g. climbs is never a noun in our corpus, stand is a noun if it is preceded by vegetable but a verb when preceded by a noun that refers to people). These fixes apply to 27,784 (17%) of the 158,915 image captions. Next, we use the OpenNLP chunker to create a shallow parse. Fixing its (systematic) errors affects 28,587 captions. We then analyze the structure of each NP chunk to identify heads, determiners and prenominal modifiers. The head may include more than a single token if WordNet (or our hypernym lexicon, described below) contains a corresponding entry (e.g. little girl). Determiners include phrases such as a couple or a few. Although we use the Malt parser (Nivre et al., 2006) to identify subject-verb-object dependencies, we have found it more accurate to develop deterministic heuristics and lexical rules to identify the boundaries of complex (e.g. conjoined) NPs, allowing us to treat "a man with red shoes and a white hat" as an NP followed by a single PP, but "a man with red shoes and a white-haired woman" as two NPs, and to transform e.g. "standing by a man and a woman" into "standing" and not "standing and a woman" when dropping the PP.
Hypernym Lexicon We use our corpus and WordNet to construct a hypernym lexicon that allows us to replace head nouns with more generic terms. We only consider hypernyms that occur themselves with sufficient frequency in the original captions (replacing "adult" with "person", but not with "organism"). Since the language in our corpus is very concrete, each noun tends to have a single sense, allowing us to always replace it with the same hypernyms.¹ But since WordNet provides us with multiple senses for most nouns, we first have to identify which sense is used in our corpus. To do this, we use the heuristic cross-caption coreference algorithm of Hodosh et al. (2010) to identify coreferent NP chunks among the original five captions of each image.² For each ambiguous head noun, we consider every non-singleton coreference chain it appears in, and reduce its synsets to those that stand in a hypernym-hyponym relation with at least one other head noun in the chain. Finally, we apply a greedy majority voting algorithm to iteratively narrow down each term's senses to a single synset that is compatible with the largest number of coreference chains it occurs in.
Caption Normalization In order to increase the recall of the denotations we capture, we drop all punctuation marks, and lemmatize nouns, verbs, and adjectives that end in "-ed" or "-ing" before generating the denotation graph. In order to distinguish between frequently occurring homonyms where the noun is unrelated to the verb, we change all forms of the verb dress to dressed, all forms of the verb stand to standing and all forms of the verb park to parking. Finally, we drop sentence-initial there/here/this is/are (as in there is a dog splashing in the water), and normalize the expressions "in X" and "dressed (up) in X" (where X is an article of clothing or a color) to "wear X". We reduce plural determiners to {two, three, some}, and drop singular determiners except for no.
¹ Descriptions of people that refer to both age and gender (e.g. "man") can have multiple distinct hypernyms ("adult"/"male"). Because our annotators never describe young children or babies as "persons", we only allow terms that are likely to describe adults or teenagers (including occupations) to be replaced by the term "person". This means that the term "girl" has two senses: a female child (the default) or a younger woman. We distinguish the two senses in a preprocessing step: if the other captions of the same image do not mention children, but refer to teenaged or adult women, we assign girl the woman-sense. Some nouns that end in -er (e.g. "diner", "pitcher") also violate our monosemy assumption.
² Coreference resolution has also been used for word sense disambiguation by Preiss (2001) and Hu and Liu (2011).

Rule Templates
The denotation graph contains a directed edge from $s$ to $s'$ if there is a rule $\omega$ that reduces $s'$ to $s$, with an inverse $\omega^{-1}$ that expands $s$ to $s'$. Reduction rules can drop optional material, extract simpler constituents, or perform lexical substitutions.
Drop Pre-Nominal Modifiers: "red shirt" → "shirt" In an NP of the form "X Y Z", where X and Y both modify the head Z, we only allow X and Y to be dropped separately if "X Z" and "Y Z" both occur elsewhere in the corpus. Since "white building" and "stone building" occur elsewhere in the corpus, we generate both "white building" and "stone building" from the NP "white stone building". But since "ice player" is not used, we replace "ice hockey player" only with "hockey player" (which does occur) and then "player".
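The attestation check for pre-nominal modifiers can be sketched as follows. The function name `drop_modifier_steps` and the set `attested` are hypothetical. The fallback of dropping only the leftmost modifier when some "modifier head" pair is unattested is an assumption generalized from the "ice hockey player" example, not a rule stated in the text.

```python
def drop_modifier_steps(modifiers, head, attested):
    """Return the NPs obtained from `modifiers + [head]` by dropping one pre-nominal modifier.

    `modifiers` is a list of modifier strings in surface order, `head` is the head noun,
    and `attested` is the set of "modifier head" strings that occur elsewhere in the corpus.
    """
    if not modifiers:
        return set()
    if len(modifiers) == 1:
        return {head}
    # Modifiers may be dropped independently only if each one is attested with the head alone.
    if all(f"{m} {head}" in attested for m in modifiers):
        return {
            " ".join(modifiers[:i] + modifiers[i + 1:] + [head])
            for i in range(len(modifiers))
        }
    # Assumption: otherwise only the leftmost modifier is dropped.
    return {" ".join(modifiers[1:] + [head])}

# Hypothetical set of attested two-word NPs.
attested = {"white building", "stone building", "hockey player"}

both = drop_modifier_steps(["white", "stone"], "building", attested)
leftmost_only = drop_modifier_steps(["ice", "hockey"], "player", attested)
```

Applying the function again to each result continues the reduction, so "ice hockey player" reaches "player" only via "hockey player".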
Drop Other Modifiers: "run quickly" → "run" We drop ADVP chunks and adverbs in VP chunks. We also allow a prepositional phrase (a preposition followed by a possibly conjoined NP chunk) to be dropped if the preposition is locational ("in", "on", "above", etc.), directional ("towards", "through", "across", etc.), or instrumental ("by", "for", "with"). Similarly, we also allow the dropping of all "wear NP" constructions. Since the distinction between particles and prepositions is often difficult, we also use a predefined list of phrasal verbs that commonly occur in our corpus to identify constructions such as "climb up a mountain", which is transformed into "climb a mountain", or "walk down a street", which is transformed into "walk".
Replace Nouns by Hypernyms: "red shirt" → "red clothing" We iteratively use our hypernym lexicon to make head nouns more generic. We only allow head nouns to be replaced by their hypernyms if any age-based modifiers have already been removed: "toddler" can be replaced with "child", but not "older toddler" with "older child".
Figure 2: Generating the graph
Handle Partitive NPs: "cup of tea" → "cup", "tea" In most partitive NP1-of-NP2 constructions ("cup of tea", "a team of football players") the corresponding entity can be referred to by either the first or the second NP. Exceptions include the phrase "body of water", and expressions such as "a kind/type/sort of", which we treat similarly to determiners.
Handle VP1-to-VP2 Cases Depending on the first verb, we replace VPs of the form "X to Y" with both X and Y if X is a verb of movement or posture (jump to catch, etc.). Otherwise we distinguish between cases we can only replace with X (wait to jump) and those we can only replace with Y (seem to jump).
Extract Simpler Constituents Any noun phrase or verb phrase can also be used as a node in the graph and simplified further. We use the Malt dependencies (and the person terms in the entity type lexicon) to identify and extract subject-verb-object chunks which correspond to simpler sentences that we would otherwise not be able to obtain: from "man laugh(s) while drink(ing)", we extract "man laugh" and "man drink", and then further split those into "man", "laugh(s)", and "drink".

Graph Generation
The naive approach to graph generation would be to generate all possible strings for each caption. However, this would produce far more strings than can be processed in a reasonable amount of time, and most of these strings would have uninformative denotations, consisting of only a single image. To make graph generation tractable, we use a top-down algorithm which generates the graph from the most generic (root) nodes, and stops at nodes that have a singleton denotation (Figure 2). We first identify the set of rules that can apply to each original caption (GenerateRules). These rules are then used to reduce each caption as much as possible. The resulting (maximally generic) strings are added as root nodes to the graph (RootNodes), and added to the queue $Q$. $Q$ keeps track of all currently possible node expansions. It contains items $\langle c, s \rangle$, which pair the ID of an original caption and its image ($c$) with a string ($s$) that corresponds to an existing node in the graph and can be derived from $c$'s caption. When $\langle c, s \rangle$ is processed, we check how many captions have generated $s$ so far (Captions($s$)). If $s$ has more than a single caption, we use each of the applicable rewrite rules of $c$'s caption to create new strings $s'$ that correspond to the children of $s$ in the graph, and push all resulting $\langle c, s' \rangle$ onto $Q$. If $c$ is the second caption of $s$, we also use all of the applicable rewrite rules from the first caption $c'$ to create its children.
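The following is a simplified sketch of this top-down strategy, not the implementation behind Figure 2. It assumes that each caption is a list of units of which some are optional, so that the only reduction rule is "drop an optional unit" and its inverse is "restore it". The queue holds node strings rather than full caption-string items, and the names `render` and `build_graph` are hypothetical.

```python
from collections import defaultdict, deque

def render(units, optional, kept):
    """String obtained by keeping all obligatory units plus the optional positions in `kept`."""
    return " ".join(u for i, u in enumerate(units) if i not in optional or i in kept)

def build_graph(captions):
    """captions: dict caption_id -> (image_id, list of units, frozenset of optional positions)."""
    node_captions = defaultdict(set)  # Captions(s): (caption_id, kept positions) pairs per node
    edges = set()                     # (parent string, child string)
    expanded = set()                  # (caption_id, kept positions) pairs already expanded
    queue = deque()

    def add_node(cid, kept):
        _, units, optional = captions[cid]
        s = render(units, optional, kept)
        node_captions[s].add((cid, kept))
        queue.append(s)
        return s

    # Root nodes: every caption reduced as far as possible (no optional unit kept).
    for cid in captions:
        add_node(cid, frozenset())

    while queue:
        s = queue.popleft()
        members = list(node_captions[s])
        if len({cid for cid, _ in members}) < 2:
            continue  # generated by a single caption so far: do not expand
        for cid, kept in members:
            if (cid, kept) in expanded:
                continue
            expanded.add((cid, kept))
            _, units, optional = captions[cid]
            for pos in sorted(optional - kept):
                # Inverse rule: restore one dropped unit to obtain a child of s.
                edges.add((s, add_node(cid, kept | {pos})))

    denotation = {
        s: {captions[cid][0] for cid, _ in members}
        for s, members in node_captions.items()
    }
    return denotation, edges

# Toy input: four captions of four different images.
captions = {
    "c1": (1, ["dog", "run", "on beach"], frozenset({2})),
    "c2": (2, ["dog", "run", "quickly"], frozenset({2})),
    "c3": (3, ["dog", "run", "on beach"], frozenset({2})),
    "c4": (4, ["cat", "sleep"], frozenset()),
}
denotation, edges = build_graph(captions)
```

Expanding every not-yet-expanded caption of a node as soon as the node has two captions covers the case in the text where the arrival of the second caption also triggers the expansion of the first.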
A post-processing step (not shown in Figure 2) attaches each original caption to all leaf nodes of the graph to which it can be reduced. Finally, we obtain the denotation of each node s from the set of images whose captions are in Captions(s).
[...] 119 images. We have not yet attempted to identify variants in word order ("stick tongue out" vs. "stick out tongue") or equivalent choices of preposition ("look into mirror" vs. "look in mirror"). Despite this brittleness, the current graph already gives us a large number of semantic associations.

Size of denotations
Denotational Similarities The following examples of the similarities found by nPMI and P show that denotational similarities do not simply find topically related events, but instead find events that are related by entailment: If someone is eating lunch, it is likely that they are sitting, and people who sit in a classroom are likely to be listening to somebody. These entailments can be very precise: "walk up stair" entails "ascend", but not "descend"; the reverse is true for "walk down stair".
nPMI captures paraphrases as well as closely related events: people look in a mirror when shaving their face, and baseball players may try to tag someone who is sliding into base.
Comparing the expressions that are most similar to "play baseball" or "play football" according to the denotational nPMI and the compositional Σ similarities reveals that the denotational similarity finds a number of actions that are part of the particular sport, while the compositional similarity finds events that are similar to playing baseball (football).

Task 1: Approximate Entailment
A caption never provides a complete description of the depicted scene, but commonsense knowledge often allows us to draw implicit inferences: when somebody mentions a bride, it is quite likely that the picture shows a woman in a wedding dress; a picture of a parent most likely also has a child or baby, etc. In order to compare the utility of denotational and distributional similarities for drawing these inferences, we apply them to an approximate entailment task, which is loosely modeled after the Recognizing Textual Entailment problem (Dagan et al., 2006), and consists of deciding whether a brief caption $h$ (the hypothesis) can describe the same image as a set of captions $P = \{p_1, ..., p_N\}$ known to describe the same image (the premises).
Data We generate positive and negative items $\langle P, h, \pm \rangle$ (Figure 3) as follows: Given an image, any subset of four of its captions forms a set of premises. A hypothesis is either a short verb phrase or a sentence that corresponds to a node in the denotation graph. By focusing on short hypotheses, we minimize the possibility that they contain extraneous details that cannot be inferred from the premises. Positive examples are generated by choosing a node $h$ as hypothesis and an image $i \in \llbracket h \rrbracket$ such that exactly one caption of $i$ generates $h$ and the other four captions of $i$ are not descendants of $h$ and hence do not trivially entail $h$ (which would give an unfair advantage to denotational approaches). Negative examples are generated by choosing a node $h$ as hypothesis and selecting four of the captions of an image $i \notin \llbracket h \rrbracket$.

Premises:
A woman with dark hair in bending, open mouthed, towards the back of a dark headed toddler's head.
A dark-haired woman has her mouth open and is hugging a little girl while sitting on a red blanket.
A grown lady is snuggling on the couch with a young girl and the lady has a frightened look.
A mom holding her child on a red sofa while they are both having fun.
VP Hypothesis: make face

Premises:
A man editing a black and white photo at a computer with a pencil in his ear.
A man in a white shirt is working at a computer.
A guy in white t-shirt on a mac computer.
A young main is using an Apple computer.
S Hypothesis: man sit

Figure 3: Positive examples from the Approximate Entailment tasks.
Since our items are created automatically, a positive hypothesis is not necessarily logically entailed by its premises. We have performed a small-scale human evaluation on 300 items (200 positive, 100 negative), each judged independently by the same three judges (inter-annotator agreement: Fleiss' κ = 0.74). Our results indicate that over half (55%) of the positive hypotheses can be inferred from their premises alone without looking at the original image, while almost all of the negative hypotheses (100% for sentences, 96% for verb phrases) cannot be inferred from their premises. The training items are generated from the captions of 25,000 images, and the test items are generated from a disjoint set of 3,000 images. The VP data set consists of 290,000 training items and 16,000 test items, while the S data set consists of 400,000 training items and 22,000 test items. Half of the items in each set are positive, and the other half are negative.
Models All of our models are binary MaxEnt classifiers, trained using MALLET (McCallum, 2002). We have two baseline models: a plain bag-of-words model (BOW) and a bag-of-words model where we add all hypernyms in our lexicon to the captions before computing their overlap (BOW-H). This is intended to minimize the advantage the denotational features obtain from the hypernym lexicon used to construct the denotation graph. In both cases, a global BOW feature captures the fraction of tokens in the hypothesis that are contained in the premises. Word-specific BOW features capture the product of the frequencies of each word in h and P. All other models extend the BOW-H model.

Denotational Similarity Features
We compute denotational similarities nPMI and P (Section 2.4) over the pairs of nodes in a denotation graph that is restricted to the training images. We only consider pairs of nodes $n, n'$ if their denotations contain at least 10 images and their intersection contains at least 2 images.
To map an item $\langle P, h \rangle$ to denotational similarity features, we represent the premises as the set of all nodes $\mathbf{P}$ that are ancestors of its captions. A sentential hypothesis is represented as the set of nodes $H = \{h_S, h_{sbj}, h_{VP}, h_v, h_{dobj}\}$ that correspond to the sentence ($h$ itself), its subject, its VP, its verb and its direct object. A VP hypothesis has only the nodes $H = \{h_{VP}, h_v, h_{dobj}\}$. In both cases, $h_{dobj}$ may be empty. Both of the denotational similarities $\mathrm{nPMI}(h, p)$ and $P(h \mid p)$ for $h \in H$, $p \in \mathbf{P}$ lead to two constituent-specific features, $\mathrm{sum}_x$ and $\mathrm{max}_x$ (e.g. $\mathrm{sum}_{sbj} = \sum_p \mathrm{sim}(h_{sbj}, p)$, $\mathrm{max}_{dobj} = \max_p \mathrm{sim}(h_{dobj}, p)$), and two global features $\mathrm{sum}_{p,h} = \sum_{p,h} \mathrm{sim}(h, p)$ and $\mathrm{max}_{p,h} = \max_{p,h} \mathrm{sim}(h, p)$. Each constituent type also has a set of node-specific $\mathrm{sum}_{x,s}$ and $\mathrm{max}_{x,s}$ features that are on when constituent $x$ in $h$ is equal to the string $s$ and whose value is equal to the constituent-based feature. For $P$, each constituent (and each constituent-node pair) has an additional feature $P(h \mid \mathbf{P}) = 1 - \prod_n (1 - P(h \mid p_n))$ that estimates the probability that $h$ is generated by some node in the premise.
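A minimal sketch of the sum, max and noisy-or style aggregations for a single hypothesis constituent is given below. The function names and the toy similarity table are assumptions made for illustration, and the default of 0 for an empty premise set is also an assumption.

```python
def aggregate_features(h_node, premise_nodes, sim):
    """sum and max of sim(h_node, p) over all premise nodes p."""
    scores = [sim(h_node, p) for p in premise_nodes]
    return {"sum": sum(scores), "max": max(scores, default=0.0)}

def prob_generated_by_premises(cond_probs):
    """P(h | P) = 1 - prod_n (1 - P(h | p_n)) for the given conditional probabilities."""
    none_generates_h = 1.0
    for q in cond_probs:
        none_generates_h *= 1.0 - q
    return 1.0 - none_generates_h

# Hypothetical conditional probabilities P(h | p) for the hypothesis node "man sit".
toy_table = {
    ("man sit", "man work at computer"): 0.5,
    ("man sit", "man use computer"): 0.25,
}

def toy_sim(h, p):
    return toy_table.get((h, p), 0.0)

premises = ["man work at computer", "man use computer"]
features = aggregate_features("man sit", premises, toy_sim)
p_any = prob_generated_by_premises([toy_sim("man sit", p) for p in premises])
```

The last feature treats the premise nodes as independent sources, so it can only increase as more premise nodes with non-zero conditional probability are added.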

Lexical Similarity Features
We use two symmetric lexical similarities: the standard cosine similarity (cos), and Lin (1998)'s similarity (Lin). We use two directed lexical similarities: Clarke (2009)'s similarity (Clk), and Szpektor and Dagan (2008)'s balanced precision (Bal), which builds on Lin and on Weeds and Weir (2003)'s similarity (W). We also use two publicly available resources that provide precomputed similarities, Kotlerman et al. (2010)'s DIRECT noun and verb rules and Chklovski and Pantel (2004)'s VERBOCEAN rules. Both are motivated by the need for numerically quantifiable semantic inferences between predicates. We only use entries that correspond to single tokens (ignoring e.g. phrasal verbs). Each lexical similarity results in the following features: each word $w$ in the hypothesis is represented by a max-sim$_w$ feature, which captures its maximum similarity with any word in the premises ($\text{max-sim}_w = \max_{w' \in P} \mathrm{sim}(w, w')$), and by a sum-sim$_w$ feature, which captures the sum of its similarities to the words in the premises ($\text{sum-sim}_w = \sum_{w' \in P} \mathrm{sim}(w, w')$). Global max-sim and sum-sim features capture the maximal (resp. total) similarity of any word in the hypothesis to the premise.
We compute distributional and compositional similarities (cos, Lin, Bal, Clk, Σ, Π) on our image captions ("Cap"), the BNC and Gigaword. For each corpus $C$, we map each word $w$ that appears at least 10 times in $C$ to a vector $\mathbf{w}_C$ of the non-negative normalized pointwise mutual information scores (Section 2.4) of $w$ and the 1,000 words (excluding stop words) that occur in the most sentences of $C$. We generally define $P(w)$ (and $P(w, w')$) as the fraction of sentences in $C$ in which $w$ (and $w'$) occur. To allow a direct comparison between distributional and denotational similarities, we first define $P(w)$ (and $P(w, w')$) over individual captions ("Cap"), and then, to level the playing field, we redefine $P(w)$ (and $P(w, w')$) as the fraction of images in whose captions $w$ (and $w'$) occur ("Img"), and then we use our lexicon to augment captions with all hypernyms ("+Hyp"). Finally, we include BNC and Gigaword similarity features ("All").
Compositional Similarity Features We use two standard compositional baselines to combine the word vectors of a sentence into a single vector: addition ($\mathbf{s} = \mathbf{w}_1 + \dots + \mathbf{w}_n$, which can be interpreted as a disjunctive operation), and element-wise (Hadamard) multiplication ($\mathbf{s} = \mathbf{w}_1 \odot \dots \odot \mathbf{w}_n$, which can be seen as a conjunctive operation). In both cases, we represent the premises (which consist of four captions) as the sum of each caption's vector, $\mathbf{p} = \mathbf{p}_1 + \dots + \mathbf{p}_4$. This gives two compositional similarity features: $\Sigma = \cos(\mathbf{p}_\Sigma, \mathbf{h}_\Sigma)$, and $\Pi = \cos(\mathbf{p}_\Pi, \mathbf{h}_\Pi)$.

Experimental Results
Table 2 provides the test accuracy of our models on the VP and S tasks. Adding hypernyms (BOW-H) yields a slight improvement over the basic BOW model. Among the external resources, VERBOCEAN is more beneficial than DIRECT, but neither helps as much as in-domain distributional similarities (this may be due to sparsity). Table 2 shows only the simplest ("Cap") and the most complex ("All") distributional and compositional models, but Table 3 provides accuracies of these models as we go from standard sentence-based co-occurrence counts towards more denotation graph-like co-occurrence counts that are based on all captions describing the same image ("Img"), include hypernyms ("+Hyp"), and add information from other corpora ("All"). The "+Hyp" column in Table 3 shows that the denotational metrics clearly outperform any distributional metric when both have access to the same information. Although the distributional models benefit from the BNC and Gigaword-based similarities ("All"), their performance is still below that of the denotational models. Among the distributional models, the simple cos performs better than Lin, or the directed Clk and Bal similarities. In all cases, giving models access to different similarity features improves performance. Table 4 shows the results by hypothesis length. As the length of h increases, classifiers that use similarities between pairs of words (BOW-H and cos) continue to improve in performance relative to the classifiers that use similarities between phrases and sentences (Σ and nPMI). Most likely, this is due to the lexical similarities having a larger set of features to work with for longer h. nPMI does especially well on shorter h, likely due to the shorter h having larger denotations.

Task 2: Semantic Textual Similarity
To assess how the denotational similarities perform on a more established task and domain, we apply them to the 1500 sentence pairs from the MSR Video Description Corpus (Chen and Dolan, 2011) that were annotated for the SemEval 2012 Semantic Textual Similarity (STS) task (Agirre et al., 2012). The goal of this task is to assign scores between 0 and 5 to a pair of sentences, where 5 indicates equivalence, and 0 unrelatedness. Since this is a symmetric task, we do not consider directed similarities. And because the goal of this experiment is not to achieve the best possible performance on this task, but to compare the effectiveness of denotational and more established similarities, we only compare the impact of denotational similarities with compositional similarities computed on our own corpus. Since the MSR Video corpus associates each video with multiple sentences, it is in principle also amenable to a denotational treatment, but the STS task description explicitly forbids its use.

Models
Baseline and Compositional Features Our starting point is Bär et al. (2013)'s DKPro Similarity, one of the top-performing models from the 2012 STS shared task, which is available and easily modified. It consists of a log-linear regression model trained on multiple text features (word and character n-grams, longest common substring and longest common subsequence, Gabrilovich and Markovitch (2007)'s Explicit Semantic Analysis, and Resnik (1995)'s WordNet-based similarity). We investigate the effects of adding compositional (computed on the vectors obtained from the image-caption training data) and denotational similarity features to this state-of-the-art system.
Table 5: Performance on the STS MSRvid task: DKPro (Bär et al., 2013) plus compositional (Σ, Π) and/or denotational similarities (nPMI) from our corpus.
Denotational Features Since the STS task is symmetric, we only consider nPMI similarities. We again represent each sentence s by features based on 5 types of constituents: $S = \{s_S, s_{sbj}, s_{VP}, s_v, s_{dobj}\}$. Since sentences might be complex, they might contain multiple constituents of the same type, and we therefore think of each feature as a feature over sets of nodes. For each constituent C we consider two sets of nodes in the denotation graph: C itself (typically leaf nodes), and $C^{anc}$, their parents and grandparents. For each pair of sentences, C-C similarities compute the similarity of the constituents of the same type, while C-all similarities compute the similarity of a C constituent in one sentence against all constituents in the other sentence. For each pair of constituents we consider three similarity features: $sim(C_1, C_2)$, $\max(sim(C_1, C_2^{anc}), sim(C_1^{anc}, C_2))$, and $sim(C_1^{anc}, C_2^{anc})$. The similarity of two sets of nodes is determined by the maximal similarity of any pair of their elements: $sim(C_1, C_2) = \max_{c_1 \in C_1, c_2 \in C_2} nPMI(c_1, c_2)$. With 5 constituent types and three features each, this gives us 15 C-C features and 15 C-all features.
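As an illustration of the set-level similarity and the three features per constituent pair, the following is a minimal sketch rather than the implementation used in our experiments. It assumes that each node's denotation is available as a set of image ids, and that nPMI is the PMI of two denotations divided by the negative log of their joint probability (a common normalization that this section does not spell out). The names npmi, set_sim, pair_features and the default value for missing constituents are hypothetical.

```python
import math
from itertools import product


def npmi(den_a, den_b, n_images):
    """Normalized PMI of two denotations (sets of image ids).

    Assumption: PMI / -log P(a, b), which ranges from -1 to 1.
    """
    joint = len(den_a & den_b)
    if joint == 0:
        return -1.0  # the two expressions never describe the same image
    if joint == n_images:
        return 1.0  # both describe every image; avoids dividing by log(1) = 0
    p_a = len(den_a) / n_images
    p_b = len(den_b) / n_images
    p_ab = joint / n_images
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def set_sim(nodes_1, nodes_2, denotation, n_images, default=-1.0):
    """Maximal nPMI over all pairs of nodes drawn from the two sets.

    `default` (an assumption) is returned when either set is empty,
    e.g. when a sentence has no constituent of the given type.
    """
    scores = [
        npmi(denotation[c1], denotation[c2], n_images)
        for c1, c2 in product(nodes_1, nodes_2)
    ]
    return max(scores) if scores else default


def pair_features(c1, c1_anc, c2, c2_anc, denotation, n_images):
    """The three similarity features for one pair of constituent node sets."""
    return (
        set_sim(c1, c2, denotation, n_images),
        max(
            set_sim(c1, c2_anc, denotation, n_images),
            set_sim(c1_anc, c2, denotation, n_images),
        ),
        set_sim(c1_anc, c2_anc, denotation, n_images),
    )


# Toy denotation graph over 8 images (made-up data for illustration only).
toy_denotation = {
    "dog run": {1, 2},
    "cat run": {3},
    "animal run": {1, 2, 3},
}
features = pair_features(
    {"dog run"}, {"animal run"}, {"cat run"}, {"animal run"}, toy_denotation, 8
)
```

In this toy example the two leaf nodes have disjoint denotations, so the first feature takes its minimum value, while the ancestor-based features are higher because both constituents share the ancestor "animal run".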

Experiments
We use the STS 2012 train/test data, normalized in the same way as the image captions for the denotation graph (i.e. we re-tokenize, lemmatize, and remove determiners). Table 5 shows experimental results for four models. DKPro is the off-the-shelf DKPro Similarity model (Bär et al., 2013). To this baseline we add, from our corpus, either the additive and multiplicative compositional features (Σ, Π) from Section 6 (img), the C-C and C-all denotational features based on nPMI, or both compositional and denotational features. Systems are evaluated by the Pearson correlation (r) of their predicted similarity scores with the human-annotated ones. We see that the denotational similarities outperform the compositional similarities, and that including compositional similarity features in addition to denotational similarity features has little effect. For additional comparison, the published number for the TakeLab Semantic Text Similarity System (Šarić et al., 2012), another top-performing model from the 2012 shared task, is r = 0.880 on this dataset.
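For reference, the evaluation metric can be sketched as follows. This is a plain-Python rendering of the standard Pearson correlation formula, not the official STS scoring script; the function name pearson_r is hypothetical.

```python
import math


def pearson_r(predicted, gold):
    """Pearson correlation between predicted and gold similarity scores."""
    if len(predicted) != len(gold) or len(predicted) < 2:
        raise ValueError("need two equally long score lists with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_g = sum(gold) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_g)
```

Because r is invariant to shifting and positive rescaling of the predictions, a system is rewarded for ordering and spacing sentence pairs like the annotators do, not for matching the 0 to 5 scale exactly.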

Conclusion
Summary of Contributions We have defined novel denotational metrics of linguistic similarity (Section 2), and have shown them to be at least competitive with, if not superior to, distributional similarities for two tasks that require simple semantic inferences (Sections 6, 7), even though our current method of computing them is somewhat brittle (Section 5). We have also introduced two new resources: a large data set of images paired with descriptive captions, and a denotation graph that pairs generalized versions of these captions with their visual denotations, i.e. the sets of images they describe. Both of these resources are freely available (http://nlp.cs.illinois.edu/Denotation.html). Although the aim of this paper is to show their utility for a purely linguistic task, we believe that they should also be of great interest for people who aim to build systems that automatically associate images with sentences that describe them (Farhadi et al., 2010; Yang et al., 2011; Mitchell et al., 2012; Kuznetsova et al., 2012; Gupta et al., 2012; Hodosh et al., 2013).

Related Work and Resources
We believe that the work reported in this paper has the potential to open up promising new research directions. There are other data sets that pair images or video with descriptive language, but we have not yet applied our approach to them. Chen and Dolan (2011)'s MSR Video Description Corpus (of which the STS data is a subset) is most similar to ours, but its curated part is significantly smaller. Instead of several independent captions, Grubinger et al. (2006)'s IAPR TC-12 data set contains longer descriptions. Ordonez et al. (2011) harvested 1 million images and their user-generated captions from Flickr to create the SBU Captioned Photo Dataset. These captions tend to be less descriptive of the image. The denotation graph is similar to Berant et al. (2012)'s 'entailment graph', but differs from it in two ways: first, entailment relations in the denotation graph are defined extensionally in terms of the images described by the expressions at each node, and second, nodes in Berant et al.'s entailment graph correspond to generic propositional templates (X treats Y), whereas nodes in our denotation graph correspond to complete propositions (a dog runs).