TreeTalk: Composition and Compression of Trees for Image Descriptions

We present a new tree-based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions. We investigate two related tasks: image caption generalization and generation, where the former is an optional subtask of the latter. The high-level idea of our approach is to harvest expressive phrases (as tree fragments) from existing image descriptions, then to compose a new description by selectively combining the extracted (and optionally pruned) tree fragments. Key algorithmic components are tree composition and compression, both integrating tree structure with sequence structure. Our proposed system attains significantly better performance than previous approaches for both image caption generalization and generation. In addition, our work is the first to show the empirical benefit of automatically generalized captions for composing natural image descriptions.


Introduction
The web is increasingly visual, with hundreds of billions of user-contributed photographs hosted online. A substantial portion of these images have some sort of accompanying text, ranging from keywords, to free text on web pages, to textual descriptions directly describing depicted image content (i.e., captions). We tap into the last kind of text, using naturally occurring pairs of images with natural language descriptions to compose expressive descriptions for query images via tree composition and compression.
Such automatic image captioning efforts could potentially be useful for many applications: from automatic organization of photo collections, to facilitating image search with complex natural language queries, to enhancing web accessibility for the visually impaired. On the intellectual side, by learning to describe the visual world from naturally existing web data, our study extends the domains of language grounding to the highly expressive language that people use in their everyday online activities.
There has been a recent spike in efforts to automatically describe visual content in natural language (Yang et al., 2011; Farhadi et al., 2010; Krishnamoorthy et al., 2013; Elliott and Keller, 2013; Yu and Siskind, 2013; Socher et al., 2014). This reflects the long-standing understanding that encoding the complexities and subtleties of image content often requires more expressive language constructs than a set of tags. Now that visual recognition algorithms are beginning to produce reliable estimates of image content (Perronnin et al., 2012; Deng et al., 2012a; Deng et al., 2010; Krizhevsky et al., 2012), the time seems ripe to begin exploring higher level semantic tasks.
There have been two main complementary directions explored for automatic image captioning. The first focuses on describing exactly those items (e.g., objects, attributes) that are detected by vision recognition, which subsequently confines what should be described and how (Yao et al., 2010; Kojima et al., 2002). Approaches in this direction could be ideal for various practical applications such as image description for the visually impaired. However, it is not clear whether the semantic expressiveness of these approaches can eventually scale up to the casual, but highly expressive language people naturally use in their online activities. In Figure 1, for example, it would be hard to compose "I noticed that this funny cow was staring at me" or "You can see these beautiful hills only in the countryside" in a purely bottom-up manner based on the exact content detected. The key technical bottleneck is that the range of describable content (i.e., objects, attributes, actions) is ultimately confined by the set of items that can be reliably recognized by state-of-the-art vision techniques.
The second direction, in a complementary avenue to the first, has explored ways to make use of the rich spectrum of visual descriptions contributed by online citizens (Kuznetsova et al., 2012; Feng and Lapata, 2013; Mason, 2013; Ordonez et al., 2011). In these approaches, the set of what can be described can be substantially larger than the set of what can be recognized, where the former is shaped and defined by the data, rather than by humans. This allows the resulting descriptions to be substantially more expressive, elaborate, and interesting than what would be possible in a purely bottom-up manner. Our work contributes to this second line of research.
One challenge in utilizing naturally existing multimodal data, however, is the noisy semantic alignment between images and text. Therefore, we also investigate a related task of image caption generalization (Kuznetsova et al., 2013), which aims to improve the semantic image-text alignment by removing bits of text from existing captions that are less likely to be transferable to other images.
The high-level idea of our system is to harvest useful bits of text (as tree fragments) from existing image descriptions using detected visual content similarity, and then to compose a new description by selectively combining these extracted (and optionally pruned) tree fragments. This overall idea of composition based on extracted phrases is not new in itself (Kuznetsova et al., 2012); however, we make several technical and empirical contributions.
First, we propose a novel stochastic tree composition algorithm based on extracted tree fragments that integrates both tree structure and sequence cohesion into structural inference. Our algorithm permits a substantially higher level of linguistic expressiveness, flexibility, and creativity than approaches based on rules or templates (Yang et al., 2011), while also addressing long-distance grammatical relations in a more principled way than approaches based on hand-coded constraints (Kuznetsova et al., 2012).
Second, we address image caption generalization as an optional subtask of image caption generation, and propose a tree compression algorithm that performs a light-weight parsing to search for the optimal set of tree branches to prune. Our work is the first to report empirical benefits of automatically compressed captions for image captioning.
The proposed approaches attain significantly better performance for both image caption generalization and generation tasks over competitive baselines and previous approaches. Our work results in an improved image caption corpus with automatic generalization, which is publicly available. 1

Harvesting Tree Fragments
Given a query image, we retrieve images that are visually similar to the query image, then extract potentially useful segments (i.e., phrases) from their corresponding image descriptions. We then compose a new image description using these retrieved text fragments (§3). Extraction of useful phrases is guided by both visual similarity and the syntactic parse of the corresponding textual description. This extraction strategy, originally proposed by Kuznetsova et al. (2012), attempts to make the best use of linguistic regularities with respect to objects, actions, and scenes, making it possible to obtain richer textual descriptions than what current state-of-the-art vision techniques can provide in isolation. In all of our experiments we use the captioned image corpus of Ordonez et al. (2011), first pre-processing the corpus for relevant content by running deformable part model object detectors (Felzenszwalb et al., 2010). For our study, we run detectors for 89 object classes and set a high confidence threshold for detection.
As illustrated in Figure 1, for a query image detection, we extract four types of phrases (as tree fragments). First, we retrieve relevant noun phrases from images with visually similar object detections. We use color, texture (Leung and Malik, 1999), and shape (Dalal and Triggs, 2005; Lowe, 2004) based features, encoded in a histogram of vector-quantized responses, to measure visual similarity. Second, we extract verb phrases for which the corresponding noun phrase takes the subject role. Third, from those images with "stuff" detections, e.g., "water" or "sky" (typically mass nouns), we extract prepositional phrases based on similarity of both visual appearance and relative spatial relationships between detected objects and "stuff". Finally, we use global "scene" similarity to extract prepositional phrases referring to the overall scene, e.g., "at the conference" or "in the market".
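The first retrieval step above can be sketched as histogram-intersection matching over vector-quantized feature histograms. The visual words, phrases, and similarity choice below are illustrative toy stand-ins, not the paper's actual descriptors:

```python
from collections import Counter

def normalize(hist):
    """L1-normalize a histogram of visual-word counts."""
    total = sum(hist.values())
    return {k: v / total for k, v in hist.items()}

def intersection_similarity(h1, h2):
    """Histogram intersection: sum of bin-wise minima of two normalized histograms."""
    keys = set(h1) | set(h2)
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in keys)

def retrieve_phrases(query_hist, corpus, top_k=2):
    """Rank (histogram, phrase) pairs by visual similarity to the query detection."""
    q = normalize(query_hist)
    scored = [(intersection_similarity(q, normalize(h)), phrase) for h, phrase in corpus]
    scored.sort(key=lambda x: -x[0])
    return [phrase for _, phrase in scored[:top_k]]

# Toy corpus: visual-word histograms paired with caption noun phrases.
corpus = [
    (Counter({"vw1": 8, "vw2": 2}), "this funny cow"),
    (Counter({"vw1": 7, "vw2": 3}), "a brown cow"),
    (Counter({"vw3": 9, "vw4": 1}), "the little boat"),
]
query = Counter({"vw1": 9, "vw2": 1})
print(retrieve_phrases(query, corpus))
```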
We perform this phrase retrieval process for each detected object in the query image and generate one sentence for each object. All sentences are then combined together to produce the final description. Optionally, we apply image caption generalization (via compression) ( §4) to all captions in the corpus prior to the phrase extraction and composition.

Tree Composition
We model tree composition as constraint optimization. The input to our algorithm is the set of retrieved phrases (i.e., tree fragments), as illustrated in §2. Let P = {p_0, ..., p_{L−1}} be the set of all phrases across the four phrase types (objects, actions, stuff, and scene). We assume a mapping function pt : [0, L) → T, where T is the set of phrase types, so that the phrase type of p_i is pt(i). In addition, let R be the set of PCFG production rules and NT be the set of nonterminal symbols of the PCFG. The goal is to find and combine a good sequence of phrases G, |G| ≤ |T| = N = 4, drawn from P, into a final sentence. More concretely, we want to select and order a subset of phrases (at most one phrase of each phrase type) while considering both the parse structure and n-gram cohesion across phrasal boundaries. Figure 2 shows a simplified example of a composed sentence with its corresponding parse structure. For brevity, the figure shows only one phrase for each phrase type, but in actuality there would be a set of candidate phrases for each type. Figure 3 shows the CKY-style representation of the internal mechanics of constraint optimization for the example composition from Figure 2. Each cell ij of the CKY matrix corresponds to G_ij, a subsequence of G starting at position i and ending at position j. If a cell in the CKY matrix is labeled with a nonterminal symbol s, it means that the corresponding tree of G_ij has s as its root.
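For N = 4 positions the CKY-style index space is small: ten cells G_ij with i ≤ j, where each non-leaf cell ij combines children (i, k) and (k+1, j) for split points k ∈ [i, j). A minimal sketch (the function names are ours):

```python
def cky_cells(n):
    """Enumerate cells (i, j) of an n-position CKY-style matrix, with i <= j."""
    return [(i, j) for i in range(n) for j in range(i, n)]

def split_points(i, j):
    """Split boundaries k in [i, j): cell ij combines children (i, k) and (k+1, j)."""
    return list(range(i, j))

cells = cky_cells(4)
print(len(cells))          # 10 cells for N = 4
print(split_points(0, 2))  # cell G_02 can split at k = 0 or k = 1
```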
Although we visualize the operation using a CKY-style representation in Figure 3, note that composition requires more complex combinatorial decisions than CKY parsing due to two additional considerations: we are (1) selecting a subset of candidate phrases, and (2) re-ordering the selected phrases (hence making the problem NP-hard). Therefore, we encode our problem using Integer Linear Programming (ILP) (Roth and Yih, 2004; Clarke and Lapata, 2008) and use the CPLEX (ILOG, Inc, 2006) solver.
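To see why the combinatorics exceed plain CKY parsing, a brute-force sketch (ours, not the paper's system) can enumerate every subset of candidate phrases, at most one per type, together with every ordering of the chosen subset; the ILP replaces this exhaustive search:

```python
from itertools import permutations

def enumerate_compositions(candidates):
    """candidates: dict phrase_type -> list of phrase strings.
    Yield every ordered selection with at most one phrase per type."""
    def choices(ts):
        # For each type, either skip it or pick one of its candidates.
        if not ts:
            yield []
            return
        head, rest = ts[0], ts[1:]
        for tail in choices(rest):
            yield tail
            for phrase in candidates[head]:
                yield [phrase] + tail
    for subset in choices(list(candidates)):
        if subset:
            yield from permutations(subset)

candidates = {"object": ["the little boat"],
              "action": ["going out to sea"],
              "scene": ["in the ocean"]}
compositions = list(enumerate_compositions(candidates))
print(len(compositions))  # 7 non-empty subsets, each in every order
```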

ILP Variables
Variables for Sequence Structure: Variables α encode phrase selection and ordering:

α_ik = 1 iff phrase p_i is selected for position k, (1)

where k ∈ [0, N) indexes one of the N = 4 positions in a sentence (the number of positions equals the number of phrase types). Additionally, we define variables for each pair of adjacent phrases to capture sequence cohesion:

α_ijk = 1 iff phrases p_i and p_j are selected for adjacent positions k and k + 1. (2)

Variables for Tree Structure: Variables β encode the parse structure:

β_ijs = 1 iff cell ij of the matrix maps to the nonterminal symbol s ∈ NT, (3)

where i ∈ [0, N) and j ∈ [i, N) index rows and columns of the CKY-style matrix in Figure 3. A corresponding example tree is shown in Figure 2, where the phrase sequence G_02 corresponds to the cell labeled with S. We also define variables γ to indicate selected PCFG rules in the resulting parse:

γ_ijrk = 1 iff rule r is used at cell ij with split point k, (4)

where r = h → pq ∈ R and k ∈ [i, j). Index k points to the boundary of the split between the two children, as shown in Figure 2 for the sequence G_02; there are exactly two children because the grammar is in Chomsky Normal Form. Here and for the remainder of the paper, the notation [i, j) means {i, i + 1, ..., j − 1}. Auxiliary Variables: For notational convenience, we also include auxiliary indicator variables (e.g., for whether a phrase is selected at all).
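The index sets of these indicator variable families can be enumerated directly; the grammar and phrase count below are illustrative toy values, and the variable names follow our notation rather than any solver API:

```python
from itertools import product

N = 4   # sentence positions (one per phrase type)
L = 6   # number of candidate phrases (toy value)
NT = ["S", "NP", "VP", "PP"]                  # toy nonterminal symbols
R = [("S", "NP", "VP"), ("VP", "VP", "PP")]   # toy CNF rules h -> p q

# alpha_ik: phrase i placed at position k.
alpha_ik = [(i, k) for i, k in product(range(L), range(N))]
# alpha_ijk: distinct phrases i, j adjacent at positions k, k+1.
alpha_ijk = [(i, j, k) for i, j, k in product(range(L), range(L), range(N - 1)) if i != j]
# beta_ijs: cell (i, j), i <= j, labeled with nonterminal s.
beta_ijs = [(i, j, s) for i in range(N) for j in range(i, N) for s in NT]
# gamma_ijrk: rule r applied at cell (i, j) with split point k in [i, j).
gamma_ijrk = [(i, j, r, k) for i in range(N) for j in range(i, N)
              for r in range(len(R)) for k in range(i, j)]

print(len(alpha_ik), len(alpha_ijk), len(beta_ijs), len(gamma_ijrk))
```

Even with these toy sizes the variable counts stay in the low hundreds, which is why phrase-level ILP instances remain lightweight.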

ILP Objective Function
We model tree composition as maximization of the following objective function:

Φ(α, γ) = Σ_{i,k} α_ik F_i + Σ_{i,j,k} α_ijk F_ij + Σ_{i,j,r,k} γ_ijrk F_r

This objective is comprised of three types of weights (confidence scores): F_i, F_ij, and F_r. F_i represents the phrase selection score based on visual similarity, described in §2. F_ij quantifies the sequence cohesion across phrase boundaries; for this, we use n-gram scores (n ∈ [2, 5]) between adjacent phrases computed using the Google Web 1-T corpus (Brants and Franz, 2006). Finally, F_r quantifies PCFG rule scores (log probabilities) estimated from the 1M image caption corpus (Ordonez et al., 2011) parsed using the Stanford parser (Klein and Manning, 2003).
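Evaluating an objective of this shape for a fixed assignment reduces to a weighted sum over the active indicators. The assignment and weights below are toy values, and the function name is ours:

```python
def objective(alpha_ik, alpha_ijk, gamma, F_i, F_ij, F_r):
    """Weighted sum of selection, cohesion, and rule scores for one assignment.
    Each of alpha_ik, alpha_ijk, gamma maps index tuples to 0/1 values."""
    sel = sum(v * F_i[i] for (i, k), v in alpha_ik.items())
    coh = sum(v * F_ij[i, j] for (i, j, k), v in alpha_ijk.items())
    rules = sum(v * F_r[r] for (i, j, r, k), v in gamma.items())
    return sel + coh + rules

# Toy assignment: phrases 0 and 1 at positions 0 and 1, one rule at cell (0, 1).
alpha_ik = {(0, 0): 1, (1, 1): 1}
alpha_ijk = {(0, 1, 0): 1}
gamma = {(0, 1, 0, 0): 1}
F_i = {0: 2.0, 1: 1.5}     # phrase selection scores (visual similarity)
F_ij = {(0, 1): 0.5}       # cohesion score across the phrase boundary
F_r = {0: -0.25}           # PCFG rule log probability
print(objective(alpha_ik, alpha_ijk, gamma, F_i, F_ij, F_r))
```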

ILP Constraints

Constraints for products of two variables have been discussed by Clarke and Lapata (2008). For Equation 2, we add the following constraints (similar constraints are also added for Equations 4 and 5).
Consistency between Tree Leafs and Sequences: The ordering of phrases implied by the α_ijk variables must be consistent with the ordering of phrases implied by the β variables. This can be achieved by aligning the leaf cells (i.e., β_kks) in the CKY-style matrix with the α variables:

α_ik ≤ Σ_{s ∈ S_i} β_kks, (8)

Σ_{i,k} α_ik = Σ_{k,s} β_kks, (9)

where S_i refers to the set of PCFG nonterminals that are compatible with the phrase type of p_i. For example, S_i = {NN, NP, ...} if p_i corresponds to an "object" (noun phrase). Thus, Equation 8 enforces the correspondence between phrase types and nonterminal symbols at the tree leafs. Equation 9 enforces the constraint that the number of selected phrases and instantiated tree leafs must be the same.
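Read operationally, leaf alignment imposes two checks on a candidate assignment: each placed phrase must be licensed by a compatible symbol in its leaf cell, and the totals must balance. A minimal sketch with illustrative names:

```python
def leaves_consistent(alpha_ik, beta_kks, S):
    """alpha_ik: dict (i, k) -> 0/1 phrase placements.
    beta_kks: dict (k, s) -> 0/1 leaf-cell symbol assignments.
    S: dict i -> set of nonterminal symbols compatible with phrase p_i."""
    # Each placed phrase needs a compatible symbol in its leaf cell.
    for (i, k), v in alpha_ik.items():
        if v and not any(beta_kks.get((k, s), 0) for s in S[i]):
            return False
    # Number of selected phrases must equal number of instantiated leaves.
    return sum(alpha_ik.values()) == sum(beta_kks.values())

S = {0: {"NP"}, 1: {"VP"}}
alpha = {(0, 0): 1, (1, 1): 1}
beta = {(0, "NP"): 1, (1, "VP"): 1}
print(leaves_consistent(alpha, beta, S))
```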
Tree Congruence Constraints: To ensure that each CKY cell has at most one symbol, we require

Σ_{s ∈ NT} β_ijs ≤ 1.

We also require that

β_ijh ≤ Σ_{r ∈ R_h} Σ_{k ∈ [i,j)} γ_ijrk,

where R_h = {r ∈ R : r = h → pq}. We enforce these constraints only for non-leafs. This constraint forbids instantiations where a nonterminal symbol h is selected for cell ij without selecting a corresponding PCFG rule.
We also ensure that we produce a valid tree structure. For instance, if we select 3 phrases as shown in Figure 3, we must have the root of the tree at the corresponding cell 02.
We also require cells that are not selected for the resulting parse structure to be empty. Additionally, we penalize solutions without the S tag at the parse root as a soft constraint.
Miscellaneous Constraints: Finally, we include several constraints to avoid degenerate solutions or otherwise to enhance the composed output: (1) enforce that a noun-phrase is selected (to ensure semantic relevance to the image content), (2) allow at most one phrase of each type, (3) do not allow multiple phrases with identical headwords (to avoid redundancy), (4) allow at most one scene phrase for all sentences in the description. We find that handling of sentence boundaries is important if the ILP formulation is based only on sequence structure, but with the integration of tree-based structure, we need not handle sentence boundaries.
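These miscellaneous constraints amount to simple predicates over the selected phrases. In the sketch below the headword extraction (last token) is a naive stand-in for parser-derived headwords, and constraint (4), which spans sentences, is omitted:

```python
def valid_selection(selected):
    """selected: list of (phrase_type, phrase_text) pairs for one sentence."""
    types = [t for t, _ in selected]
    heads = [text.split()[-1].lower() for _, text in selected]  # naive headword
    return ("object" in types                  # (1) a noun phrase must be selected
            and len(types) == len(set(types))  # (2) at most one phrase per type
            and len(heads) == len(set(heads))) # (3) no repeated headwords

print(valid_selection([("object", "this funny cow"),
                       ("action", "was staring at me")]))  # well-formed
print(valid_selection([("action", "was staring at me")]))  # no noun phrase
```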

Discussion
An interesting aspect of the description generation explored in this paper is that the building blocks of composition are tree fragments, rather than individual words. There are three practical benefits: (1) syntactic and semantic expressiveness, (2) correctness, and (3) computational efficiency. Because we extract well-formed segments from human-written captions, we are able to use expressive language, and are less likely to make syntactic or semantic errors. Our phrase extraction process can be viewed at a high level as visually-grounded or visually-situated paraphrasing. Also, because the unit of operation is tree fragments, the ILP formulation encoded in this work is computationally lightweight. If the unit of composition were words, the ILP instances would be significantly more computationally intensive, and more likely to suffer from grammatical and semantic errors.
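The efficiency argument can be made concrete by counting orderings alone: at most four phrase-level units give 4! = 24 orderings, whereas word-level composition over even a 15-word caption already faces 15! orderings, before any subset selection or tree structure is considered:

```python
import math

def ordering_count(n_units):
    """Number of ways to order n_units distinct composition units
    (permutations only, ignoring subset selection and parse structure)."""
    return math.factorial(n_units)

print(ordering_count(4))   # phrase-level units
print(ordering_count(15))  # word-level units for a 15-word caption
```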


Discussion
This objective comprises three types of weights (confidence scores): F i , F ij , and F r . F i represents the phrase selection score based on visual similarity, described in §2. F ij quantifies the sequence cohesion across phrase boundaries; for this, we use ngram scores (n ∈ [2, 5]) between adjacent phrases, computed using the Google Web 1-T corpus (Brants and Franz, 2006). Finally, F r quantifies PCFG rule scores (log probabilities) estimated from the 1M image caption corpus (Ordonez et al., 2011), parsed using the Stanford parser (Klein and Manning, 2003).
One can view F i as a content selection score, while F ij and F r correspond to linguistic fluency scores capturing sequence and tree structure, respectively. If we set positive values for all of these weights, the optimization would be biased toward verbose output, since selecting an additional phrase always increases the objective. To control verbosity, we set the linguistic fluency scores, i.e., F ij and F r , to negative values (smaller absolute values for higher fluency), balancing content selection against linguistic fluency.
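As a toy illustration of this balance (made-up phrases and scores, not the paper's ILP), a brute-force search over ordered phrase subsets shows how negative cohesion weights suppress verbose output:

```python
from itertools import permutations

# Toy illustration of the objective's balance: content scores (F_i) are
# positive, cross-phrase cohesion scores (F_ij) are negative, so an extra
# phrase is selected only when its content score outweighs the fluency
# penalty it incurs.
def best_composition(content, cohesion, max_len=3):
    """Brute-force the highest-scoring ordered subset of phrases."""
    best, best_score = (), float("-inf")
    for n in range(1, max_len + 1):
        for order in permutations(content, n):
            score = sum(content[p] for p in order)
            # unknown adjacent pairs get a harsh default cohesion penalty
            score += sum(cohesion.get((a, b), -5.0)
                         for a, b in zip(order, order[1:]))
            if score > best_score:
                best, best_score = order, score
    return best, best_score

content = {"dog": 2.0, "running": 1.5, "on the beach": 1.2, "yesterday": 0.1}
cohesion = {("dog", "running"): -0.5, ("running", "on the beach"): -0.8}
order, score = best_composition(content, cohesion)
# "yesterday" is left out: its content score cannot pay for the cohesion cost
```

With the cohesion weights set positive instead, every additional phrase would raise the score and the longest composition would always win.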

ILP Constraints
Soundness Constraints: We need constraints to enforce consistency between the different types of variables (Equations 2, 4, 5). Constraints for a product of two variables have been discussed by Clarke and Lapata (2008). For Equation 2, we add the following constraints (similar constraints are also added for Equations 4 and 5).
Consistency between Tree Leaves and Sequences: The ordering of phrases implied by the α ijk must be consistent with the ordering implied by the β variables. This can be achieved by aligning the leaf cells (i.e., β kks ) in the CKY-style matrix with the α variables (Equations 8 and 9), where N T i refers to the set of PCFG nonterminals that are compatible with the phrase type pt(i) of p i . For example, N T i = {NN, NP, ...} if p i corresponds to an "object" (noun-phrase). Thus, Equation 8 enforces the correspondence between phrase types and nonterminal symbols at the tree leaves, and Equation 9 enforces that the number of selected phrases and instantiated tree leaves must be equal.
Tree Congruence Constraints: To ensure that each CKY cell has at most one symbol, we require: We also require: where R h = {r ∈ R : r = h → pq}. We enforce these constraints only for non-leaf cells. This constraint forbids instantiations where a nonterminal symbol h is selected for cell ij without selecting a corresponding PCFG rule.
We also ensure that we produce a valid tree structure. For instance, if we select 3 phrases as shown in Figure 3, we must have the root of the tree at the corresponding cell 02.
We also require cells that are not selected for the resulting parse structure to be empty. Additionally, as a soft constraint, we penalize solutions without the S tag at the parse root.
Miscellaneous Constraints: Finally, we include several constraints to avoid degenerate solutions or to otherwise enhance the composed output. We: (1) enforce that a noun-phrase is selected (to ensure semantic relevance to the image content), (2) allow at most one phrase of each type, (3) do not allow multiple phrases with identical headwords (to avoid redundancy), (4) allow at most one scene phrase for all sentences in the description. We find that handling of sentence boundaries is important if the ILP formulation is based only on sequence structure, but with the integration of tree-based structure, we do not need to specifically handle sentence boundaries.
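The miscellaneous constraints above can also be mimicked outside the ILP as a simple post-hoc validity check. The sketch below is illustrative only (the phrase-type labels "object" and "scene" are assumed names, and this is a checker, not the actual ILP encoding):

```python
def satisfies_misc_constraints(selection):
    """selection: list of (phrase_type, headword) pairs for one description.
    Checks the degenerate-solution constraints described in the text."""
    types = [t for t, _ in selection]
    heads = [h for _, h in selection]
    if "object" not in types:                        # (1) a noun-phrase must be selected
        return False
    if any(types.count(t) > 1 for t in set(types)):  # (2) at most one phrase per type
        return False
    if len(set(heads)) != len(heads):                # (3) no identical headwords
        return False
    if types.count("scene") > 1:                     # (4) at most one scene phrase
        return False
    return True
```

A selection such as [("object", "dog"), ("verb", "running")] passes, while one lacking a noun-phrase, or repeating a type or headword, is rejected.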

Discussion
An interesting aspect of the description generation explored in this paper is using tree fragments, rather than individual words, as the building blocks of composition. This has three practical benefits: (1) syntactic and semantic expressiveness, (2) correctness, and (3) computational efficiency. Because we extract phrases from human-written captions, we are able to use expressive language and are less likely to make syntactic or semantic errors. Our phrase extraction process can be viewed, at a high level, as visually-grounded or visually-situated paraphrasing. Also, because the unit of operation is the tree fragment, the ILP formulation encoded in this work is computationally lightweight. If the unit of composition were words, the ILP instances would be significantly more computationally intensive and more likely to suffer from grammatical and semantic errors.

Tree Compression
As noted by recent studies (Mason and Charniak, 2013; Kuznetsova et al., 2013; Jamieson et al., 2010), naturally existing image captions often include contextual information that does not directly describe visual content, which ultimately hinders their usefulness for describing other images. Therefore, to improve the fidelity of the generated descriptions, we explore image caption generalization as an optional pre-processing step. Figure 4 illustrates a concrete example of image caption generalization in the context of image caption generation.
We cast caption generalization as sentence compression. We encode the problem as tree pruning via lightweight CKY parsing, while also incorporating several other considerations such as leaf-level ngram cohesion scores and visually informed content selection. Figure 5 shows an example compression, and Figure 6 shows the corresponding CKY matrix.
At a high level, the compression operation resembles bottom-up CKY parsing, but in addition to parsing, we also consider deletion of parts of the tree. When deleting parts of the original tree, we may need to re-parse the remainder of the tree. Note that we consider re-parsing only with respect to the original parse tree produced by a state-of-the-art parser, hence it is only lightweight parsing.

Dynamic Programming
Input to the algorithm is a sentence, represented as a vector x = x 0 ... x n−1 = x[0 : n − 1], and its PCFG parse π(x) obtained from the Stanford parser. For simplicity of notation, we assume that both the parse tree and the word sequence are encoded in x. Then, the compression can be formalized as:

ŷ = arg max_y ∏_i φ i (x, y)    (14)

where each φ i is a potential function corresponding to one criterion of the desired compression:

φ i (x, y) = exp(θ i · f i (x, y))

where θ i is the weight for a particular criterion (described in §4.2), whose scoring function is f i . (Integrating full parsing of the original sentence would be conceptually straightforward, but may not be an empirically better choice when parsing for compression is based on vanilla unlexicalized parsing.) We solve the decoding problem (Equation 14) using dynamic programming. For this, we solve compression sub-problems for sequences x[i : j], which can be viewed as branches ŷ[i, j] of the final tree ŷ[0 : n − 1] (e.g., Figure 5), and backtrack to reconstruct the final compression (the exact solution to Equation 14).
The recurrence fills D[i, j, h], the best score for compressing span x[i : j] into a subtree rooted at nonterminal h. For each rule r = h → pq and split point k, it takes the maximum over three cases:

(1) D[i, k, p] + D[k + 1, j, q] + ∆φ[r, ij] (parsing: both children kept)
(2) D[i, k, p] + ∆φ[r, ij] (the right child is deleted)
(3) D[k + 1, j, q] + ∆φ[r, ij] (the left child is deleted)

where ∆φ[r, ij] is the gain in the potential from applying rule r at cell ij (§4.2). Cases (2) and (3) are the pruning cases. There are two types of deletion, as illustrated in Figures 5 and 6. The first corresponds to deletion of a child node: for example, the second child NN of the rule NP → NP NN is deleted, which yields deletion of "shot". The second type is a special case of propagating a node to a higher level of the tree. In Figure 6, this situation occurs when deleting JJ "Vintage", which causes the propagation of NN from cell 11 to cell 01. For this purpose, we expand the set of rules R with additional special rules of the form h → h, e.g., NN → NN, which allow propagation of tree nodes to higher levels of the compressed tree. 6
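The recurrence above can be sketched in a few lines of Python. This is a deliberately simplified toy (binary rules only, one flat deletion penalty instead of the learned ∆φ terms, and no h → h propagation rules), not the paper's implementation:

```python
# Toy sketch of the CKY-with-deletion recurrence: D[(i, j, h)] holds the
# best (score, kept word indices) for compressing span [i, j] into a
# subtree rooted at nonterminal h.
def compress(leaf_tags, rules, delete_penalty=-1.0):
    """leaf_tags: POS tag per word; rules: {(h, p, q): log-prob weight}."""
    n = len(leaf_tags)
    D = {}
    for i, tag in enumerate(leaf_tags):
        D[(i, i, tag)] = (0.0, [i])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (h, p, q), w in rules.items():
                for k in range(i, j):
                    left, right = D.get((i, k, p)), D.get((k + 1, j, q))
                    cands = []
                    if left and right:   # case (1): keep both children
                        cands.append((left[0] + right[0] + w,
                                      left[1] + right[1]))
                    if left:             # case (2): delete the right child
                        cands.append((left[0] + w + delete_penalty, left[1]))
                    if right:            # case (3): delete the left child
                        cands.append((right[0] + w + delete_penalty, right[1]))
                    for score, kept in cands:
                        if (i, j, h) not in D or score > D[(i, j, h)][0]:
                            D[(i, j, h)] = (score, kept)
    return D
```

For tags ["DT", "JJ", "NN"] with rules NP → DT NN, NP → JJ NN, and NP → DT NP, the full three-word parse outscores any pruned alternative unless the deletion penalty is made small relative to the rule weights.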

Modeling Compression Criteria
The ∆φ term 7 in Equation 17 denotes the sum of the logs of the potential functions for each criterion q:

∆φ[r, ij] = Σ_q θ q · ∆f q (r, ij)

Note that ∆φ depends on the current rule r, along with the history before the current step ij, such as the original rule r ij and the ngrams on the border between the left and right child branches of r ij . We use the following four criteria f q in our model, demonstrated in Figures 5 and 6.

I. Tree Structure: We capture PCFG rule probabilities estimated from the corpus as ∆f pcfg = log P pcfg (r).

II. Sequence Structure: We incorporate ngram cohesion scores only across the border between the two branches of a subtree.

III. Branch Deletion Probabilities: We compute the probability of deleting a child as

P del (r t | r ij ) = count(r t , r ij ) / count(r ij )

where count(r t , r ij ) is the frequency with which r ij is transformed into r t by deletion of one of its children, and count(r ij ) is the count of r ij in uncompressed sentences. We estimate these counts from a training corpus, described in §4.3.

6 We assign these special propagation rules probability 1 so that they do not affect the final parse tree score. Turner and Charniak (2005) handled propagation cases similarly.
7 We use ∆ to distinguish the potential value for the whole sentence from the gain of the potential during a single step of the algorithm.

IV. Vision Detection (Content Selection):
We want to keep words referring to actual objects in the image. Thus, we use V (x j ), a visual similarity score, as our confidence of an object corresponding to word x j . This similarity is obtained from the visual recognition predictions of (Deng et al., 2012b). Note that some test instances include rules that we have not observed during training. We default to the original caption in those cases. The weights θ i are set using a tuning dataset. We control overcompression by setting the weight for f del to a small value relative to the other weights.
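The deletion-probability estimate of criterion III is a simple ratio of counts over aligned rule pairs. A minimal sketch follows; the rule-string format and the alignment input are assumptions for illustration, not the paper's data structures:

```python
from collections import Counter

# Estimate P_del(r_t | r_ij) = count(r_t, r_ij) / count(r_ij):
# how often original rule r_ij turned into r_t via deletion of one child.
def deletion_probs(aligned_rules, original_rules):
    """aligned_rules: (r_ij, r_t) pairs observed in original/compressed
    caption alignments; original_rules: all rules in uncompressed sentences."""
    pair_counts = Counter(aligned_rules)
    rule_counts = Counter(original_rules)
    return {(r_ij, r_t): c / rule_counts[r_ij]
            for (r_ij, r_t), c in pair_counts.items()}

# E.g., "NP -> NP NN" compressed to "NP -> NP" in 3 of 10 occurrences.
probs = deletion_probs(
    aligned_rules=[("NP -> NP NN", "NP -> NP")] * 3,
    original_rules=["NP -> NP NN"] * 10)
```

Rules never observed in the aligned training pairs simply receive no entry, mirroring the fallback to the original caption described above.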

[Figure: relevance problem. Orig: "There's something about having 5 trucks parked in front of my house that makes me feel all important-like." SeqCompression: "Front of my house." TreePruning: "Trucks in front my house."]

... be extraneous for image caption generalization). To learn the syntactic patterns for caption generalization, we collect a small set of example compressed captions (380 in total) using Amazon Mechanical Turk (AMT) (Snow et al., 2008). For each image, we asked 3 turkers to first list all visible objects in the image and then to write a compressed caption by removing the bits of text that are not visually verifiable. We then align the original and compressed captions to measure rule deletion probabilities, excluding misalignments, similar to Knight and Marcu (2000). Note that we remove this dataset from the 1M caption corpus when we perform description generation.

Experiments
We use the 1M captioned image corpus of Ordonez et al. (2011). We reserve 1K images as a test set, and use the rest of the corpus for phrase extraction. We experiment with the following approaches:

Proposed Approaches:
• TREEPRUNING: Our tree compression approach as described in §4.
• SEQ+TREE: Our tree composition approach as described in §3.
• SEQ+TREE+PRUNING: SEQ+TREE using the compressed captions of TREEPRUNING as building blocks.

Baselines for Composition:
• SEQ+LINGRULE: The closest equivalent to the earlier sequence-driven system (Kuznetsova et al., 2012), with a few minor enhancements, such as sentence-boundary statistics, to improve grammaticality.
• SEQ: The §3 system without tree models and without the enhancements of SEQ+LINGRULE.

We also experiment with the compression of human-written captions, which are used to generate image descriptions for the new target images.

Baselines for Compression:
• SEQCOMPRESSION (Kuznetsova et al., 2013): Inference operates over the sequence structure. Although optimization is subject to constraints derived from the dependency parse, parsing is not an explicit part of the inference structure. Example outputs are shown in Figure 7.

Automatic Evaluation
We perform automatic evaluation using two measures widely used in machine translation: BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2011). We remove all punctuation and convert captions to lower case. We use the 1K test images from the captioned image corpus, and take the original captions as the gold standard to compare against. The results in Table 1 show that both the integration of the tree structure (+TREE) and the generalization of captions using tree compression (+PRUNING) improve the BLEU score without brevity penalty significantly, while improving METEOR only moderately (due to an improvement in precision with a decrease in recall).

[Table 2: Human Evaluation, posed as a binary question "which of the two options is better?" with respect to Relevance (Rel), Grammar (Gmar), and Overall (All). According to Pearson's χ2 test, all results are statistically significant.]
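For reference, BLEU without the brevity penalty reduces to the geometric mean of modified n-gram precisions. A self-contained sketch against a single reference caption (simplified relative to standard multi-reference, smoothed BLEU):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# BLEU without the brevity penalty: geometric mean of modified n-gram
# precisions, clipping candidate n-gram counts by the reference counts.
def bleu_no_bp(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        if total == 0:          # candidate shorter than n tokens
            return 0.0
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / total)
    if min(precisions) == 0:    # no smoothing in this sketch
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference score 1.0; any zero n-gram precision collapses the unsmoothed score to 0.0, which is why smoothed variants are common in practice.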

Human Evaluation
Neither BLEU nor METEOR directly measures grammatical correctness over long distances, and neither corresponds perfectly to human judgments. Therefore, we supplement automatic evaluation with human evaluation. We present two options generated by two competing systems, and ask turkers to choose the one that is better with respect to relevance, grammar, and overall quality. Results are shown in Table 2, with 3 turker ratings per image. We filter out turkers based on a control question, and then compute the selection rate (%) of preferring method-1 over method-2. Because agreement among turkers is a frequent concern, we also vary the set of dependable users based on their Cohen's kappa score (κ) against other users. It turns out that filtering users based on κ does not make a big difference in determining the winning method. As expected, tree-based systems significantly outperform sequence-based counterparts.

[Figure: example output. Seq+Tree+Pruning: "Not the clock face in the world."]
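Cohen's kappa for this kind of rater screening can be computed directly. A minimal two-rater implementation (the screening protocol is the paper's; this code is only an illustrative sketch):

```python
from collections import Counter

# Cohen's kappa for two raters over the same items: observed agreement p_o
# corrected by the agreement p_e expected from each rater's label marginals.
def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(rater_a) | set(rater_b))
    if p_e == 1.0:              # degenerate case: both raters use one label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Two raters in perfect agreement on a balanced binary task score κ = 1, while systematic disagreement drives κ negative, which is what makes κ usable as a screen against unreliable turkers.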

[Figure 10: grammar problems. Human: "A butterfly in a field in the Santa Monica mountains." Seq+Tree+Pruning: "Monarch in her bedroom before the wedding ceremony."]

... plausible, thanks to the expressive, but somewhat predictable, descriptions online users write about their photos. Even among the bad examples (Figure 10) one can find highly creative captions whose relevance is metaphorical rather than literal: "Monarch in her bedroom before the wedding ceremony" ("Monarch" can be a type of butterfly). The complete system captions and the original captions are available at http://ilp-cky.appspot.com/

Related Work
Sentence Fusion: Sentence fusion has been studied mostly for multi-document summarization (Barzilay and McKeown, 2005), where redundancy across multiple sentences serves as a guideline for syntactic and semantic validity of generation. In contrast, we do not have such natural redundancy to rely upon in our task, which requires the composition algorithm to be intrinsically better constrained toward correct sentence structures.

Sentence Compression: At the core of the image caption generalization task is sentence compression. Much work has considered deletion-only edits like ours (Knight and Marcu, 2000; Turner and Charniak, 2005; Cohn and Lapata, 2007; Filippova and Altun, 2013), while more recent work explores more complex edits, such as substitutions, insertions, and reordering (Cohn and Lapata, 2008); the latter generally requires a larger training corpus. We leave more expressive compression as future work.

Conclusion
In this paper, we have presented a novel tree composition approach for generating expressive image descriptions. As an optional preprocessing step, we also presented a tree compression approach and reported the empirical benefit of using automatically compressed captions to improve image description generation. By integrating both the tree structure and the sequence structure, we have significantly improved the quality of composed image captions over several competitive baselines.