Combined Distributional and Logical Semantics

We introduce a new approach to semantics which combines the benefits of distributional and formal logical semantics. Distributional models have been successful in modelling the meanings of content words, but logical semantics is necessary to adequately represent many function words. We follow formal semantics in mapping language to logical representations, but differ in that the relational constants used are induced by offline distributional clustering at the level of predicate-argument structure. Our clustering algorithm is highly scalable, allowing us to run on corpora the size of Gigaword. Different senses of a word are disambiguated based on their induced types. We outperform a variety of existing approaches on a wide-coverage question answering task, and demonstrate the ability to make complex multi-sentence inferences involving quantifiers on the FraCaS suite.


Introduction
Mapping natural language to meaning representations is a central challenge of NLP. There has been much recent progress in unsupervised distributional semantics, in which the meaning of a word is induced based on its usage in large corpora. This approach is useful for a range of key applications including question answering and relation extraction (Lin and Pantel, 2001; Poon and Domingos, 2009; Yao et al., 2011). Because such a semantics can be automatically induced, it escapes the limitation of depending on relations from hand-built training data, knowledge bases or ontologies, which have proved of limited use in capturing the huge variety of meanings that can be expressed in language.
However, distributional semantics has largely developed in isolation from the formal semantics literature. Whilst distributional semantics has been effective in modelling the meanings of content words such as nouns and verbs, it is less clear that it can be applied to the meanings of function words. Semantic operators, such as determiners, negation, conjunctions, modals, tense, mood, aspect, and plurals are ubiquitous in natural language, and are crucial for high performance on many practical applications, but current distributional models struggle to capture even simple examples. Conversely, computational models of formal semantics have shown low recall on practical applications, stemming from their reliance on ontologies such as WordNet (Miller, 1995) to model the meanings of content words (Bobrow et al., 2007; Bos and Markert, 2005).
For example, consider what is needed to answer a question like Did Google buy YouTube? from the following sentences:
1. Google purchased YouTube
2. Google's acquisition of YouTube
3. Google acquired every company
4. YouTube may be sold to Google
5. Google will buy YouTube or Microsoft
6. Google didn't take over YouTube
All of these require knowledge of lexical semantics (e.g. that buy and purchase are synonyms), but some also need interpretation of quantifiers, negatives, modals and disjunction. It seems unlikely that either distributional or formal approaches can accomplish the task alone.
We propose a method for mapping natural language to first-order logic representations capable of capturing the meanings of function words such as every, not and or, but which also uses distributional statistics to model the meaning of content words. Our approach differs from standard formal semantics in that the non-logical symbols used in the logical form are cluster identifiers. Where standard semantic formalisms would map the verb write to a write' symbol, we map it to a cluster identifier such as relation37, which the noun author may also map to. This mapping is learnt by offline clustering.
Unlike previous distributional approaches, we perform clustering at the level of predicate-argument structure, rather than syntactic dependency structure. This means that we abstract away from many syntactic differences that are not present in the semantics, such as conjunctions, passives, relative clauses, and long-range dependencies. This significantly reduces sparsity, so we have fewer predicates to cluster and more observations for each.
Of course, many practical inferences rely heavily on background knowledge about the world; such knowledge falls outside the scope of this work.

Background
Our approach is based on Combinatory Categorial Grammar (CCG; Steedman, 2000), a strongly lexicalised theory of language in which lexical entries for words contain all language-specific information. The lexical entry for each word contains a syntactic category, which determines which other categories the word may combine with, and a semantic interpretation, which defines the compositional semantics. For example, the lexicon may contain the entry:

write (S\NP)/NP : λxλy.write'(x, y)

Crucially, there is a transparent interface between the syntactic category and the semantics: the transitive verb entry above defines the verb syntactically as a function mapping two noun phrases to a sentence, and semantically as a binary relation between its two argument entities. This means that it is relatively straightforward to deterministically map parser output to a logical form, as in the Boxer system (Bos, 2008). This form of semantics captures the underlying predicate-argument structure, but fails to license many important inferences: for example, write and author do not map to the same predicate.

Figure 1: A standard logical form derivation for Every dog barks using CCG. The NP↑ notation means that the subject is type-raised, taking the verb phrase as an argument, and so is an abbreviation of S/(S\NP). This is necessary in part to support a correct semantics for quantifiers.

Figure 2: The layers of analysis for the input sentence Shakespeare wrote Macbeth:
Shakespeare wrote Macbeth
⇓ Initial semantic analysis: write_arg0,arg1(shakespeare, macbeth)
⇓ Entity typing: write_arg0:PER,arg1:BOOK(shakespeare:PER, macbeth:BOOK)
⇓ Distributional semantic analysis: relation37(shakespeare:PER, macbeth:BOOK)
In addition to the lexicon, there is a small set of binary combinators and unary rules, which have a syntactic and semantic interpretation. Figure 1 gives an example CCG derivation.

Overview of Approach
We attempt to learn a CCG lexicon which maps equivalent words onto the same logical form, for example learning entries such as:

author N/PP[of] : λxλy.relation37(x, y)
write (S\NP)/NP : λxλy.relation37(x, y)

The only change to the standard CCG derivation is that the symbols used in the logical form are arbitrary relation identifiers. We learn these by first mapping to a deterministic logical form (using predicates such as author' and write'), using a process similar to Boxer, and then clustering predicates based on their arguments. This lexicon can then be used to parse new sentences, and integrates seamlessly with CCG theories of formal semantics.
Typing predicates, for example determining that writing is a relation between people and books, has become standard in relation clustering (Schoenmackers et al., 2010; Berant et al., 2011; Yao et al., 2012). We demonstrate how to build a typing model into the CCG derivation, by subcategorizing all terms representing entities in the logical form with a more detailed type. These types are also induced from text, as explained in Section 5, but for convenience we describe them with human-readable labels, such as PER, LOC and BOOK.
A key advantage of typing is that it allows us to model ambiguous predicates. Following Berant et al. (2011), we assume that different type signatures of the same predicate have different meanings, but that given a type signature a predicate is unambiguous. For example, a different lexical entry for the verb born is used in the contexts Obama was born in Hawaii and Obama was born in 1961, reflecting a distinction in the semantics that is not obvious in the syntax. Typing also greatly improves the efficiency of clustering, as we only need to compare predicates with the same type during clustering (for example, we do not have to consider clustering a predicate between people and places with predicates between people and dates).
In this work, we focus on inducing binary relations. Many existing approaches have shown how to produce good clusterings of (non-event) nouns (Brown et al., 1992), any of which could be simply integrated into our semantics, but relation clustering remains an open problem (see Section 9). N-ary relations are binarized by creating a binary relation between each pair of arguments. For example, for the sentence Russia sold Alaska to the United States, the system creates three binary relations, corresponding to sellToSomeone(Russia, Alaska), buyFromSomeone(US, Alaska) and sellSomethingTo(Russia, US). This transformation does not exactly preserve meaning, but still captures the most important relations. Note that this allows us to compare semantic relations across different syntactic types: for example, both transitive verbs and argument-taking nouns can be seen as expressing binary semantic relations between entities. Figure 2 shows the layers used in our model.
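As a minimal sketch of the binarization step (the predicate labels and argument-key format below are illustrative, not the system's actual identifiers):

```python
from itertools import combinations

def binarize(lemma, args):
    """Split an n-ary predicate into one binary relation per argument pair.

    `args` is a list of (argument_key, entity) pairs, e.g. from a
    Davidsonian analysis of "Russia sold Alaska to the United States".
    """
    relations = []
    for (key1, ent1), (key2, ent2) in combinations(args, 2):
        # Each binarized predicate is labelled with the lemma and the
        # argument keys of the pair it relates (format is illustrative).
        relations.append((f"{lemma}_{key1},{key2}", ent1, ent2))
    return relations

# "Russia sold Alaska to the United States": three argument pairs,
# hence three binary relations.
triples = binarize("sell", [("arg0", "Russia"),
                            ("arg1", "Alaska"),
                            ("argTo", "US")])
```

Each pair of the n arguments yields one binary relation, so a three-argument predicate produces three relations, matching the example above.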

Initial Semantic Analysis
The initial semantic analysis maps parser output onto a logical form, in a similar way to Boxer. The semantic formalism is based on Steedman (2012). The first step is syntactic parsing. We use the C&C parser (Clark and Curran, 2004), trained on CCGBank (Hockenmaier and Steedman, 2007), using the refined version of Honnibal et al. (2010), which brings the syntax closer to the predicate-argument structure. An automatic post-processing step makes a number of minor changes to the parser output, converting the grammar into one more suitable for our semantics. PP (prepositional phrase) and PR (phrasal verb complement) categories are sub-categorised with the relevant preposition. Noun compounds with the same MUC named-entity type (Chinchor and Robinson, 1997) are merged into a single non-compositional node (we otherwise ignore named-entity types). All argument NPs and PPs are type-raised, allowing us to represent quantifiers. All prepositional phrases are treated as core arguments (i.e. given the category PP, not adjunct categories like (N\N)/NP or ((S\NP)\(S\NP))/NP), as it is difficult for the parser to distinguish arguments from adjuncts.
Initial semantic lexical entries for almost all words can be generated automatically from the syntactic category and POS tag (obtained from the parser), as the syntactic category captures the underlying predicate-argument structure. We use a Davidsonian-style representation of arguments (Davidson, 1967), which we binarize by creating a separate predicate for each pair of arguments of a word. These predicates are labelled with the lemma of the head word and a Propbank-style argument key (Kingsbury and Palmer, 2002), e.g. arg0, argIn. We distinguish noun and verb predicates based on POS tag; so, for example, we have different predicates for effect as a noun or verb. This algorithm can be overridden with manual lexical entries for specific closed-class function words. Whilst it may be possible to learn these from data, our approach is pragmatic, as there are relatively few such words, and the complex logical forms required would be difficult to induce from distributional statistics. We add a small number of lexical entries for words such as negatives (no, not, etc.) and quantifiers (numbers, each, every, all, etc.).
Some example initial lexical entries are shown in Figure 3.

Entity Typing Model
Our entity-typing model assigns types to nouns, which is useful for disambiguating polysemous predicates. Our approach is similar to O'Seaghdha (2010) in that we aim to cluster entities based on the noun and unary predicates applied to them (it is simple to convert from binary predicates to unary predicates). For example, we want the pair (born_argIn, 1961) to map to a DAT type, and (born_argIn, Hawaii) to map to a LOC type. This is non-trivial, as both the predicates and arguments can be ambiguous between multiple types, but topic models offer a good solution (described below).

Topic Model
We assume that the type of each argument of a predicate depends only on the predicate and argument, although Ritter et al. (2010) demonstrate an advantage of modelling the joint probability of the types of multiple arguments of the same predicate. We use the standard Latent Dirichlet Allocation model (Blei et al., 2003), which performs comparably to more complex models proposed in O'Seaghdha (2010).
In topic-modelling terminology, we construct a document for each unary predicate (e.g. born_argIn), based on all of its argument entities (words). We assume that these arguments are drawn from a small number of types (topics), such as PER, DAT or LOC. Each type j has a multinomial distribution φ_j over arguments (for example, a LOC type is more likely to generate Hawaii than 1961). Each unary predicate i has a multinomial distribution θ_i over topics, so the born_argIn predicate will normally generate a DAT or LOC type. Sparse Dirichlet priors α and β on the multinomials bias the distributions to be peaky. The parameters are estimated by Gibbs sampling, using the Mallet implementation (McCallum, 2002).
The generative story to create the data is:
• For every type k: draw the p(arg|k) distribution φ_k from Dir(β)
• For every unary predicate i: draw the p(type|i) distribution θ_i from Dir(α)
• For every argument j of predicate i: draw a type z_ij from Mult(θ_i), then draw an argument w_ij from Mult(φ_{z_ij})
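The same model can be sketched with an off-the-shelf LDA implementation. The paper estimates parameters by Gibbs sampling via Mallet, whereas scikit-learn uses variational inference; the tiny corpus and hyperparameter values below are purely illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One "document" per unary predicate, listing its observed arguments.
docs = {
    "born_argIn": "hawaii 1961 1961 kenya 1962 london",
    "visit_arg1": "hawaii london kenya paris",
    "die_argIn":  "1961 1962 1963 london",
}

vec = CountVectorizer()
X = vec.fit_transform(docs.values())

# Sparse Dirichlet priors (alpha, beta) bias the distributions to be
# peaky, as in the paper's model.
lda = LatentDirichletAllocation(n_components=2,
                                doc_topic_prior=0.1,
                                topic_word_prior=0.01,
                                random_state=0)
theta = lda.fit_transform(X)   # p(type | predicate), one row per predicate
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # p(arg | type)
```

Here `theta` plays the role of the θ_i distributions over types and `phi` the φ_k distributions over arguments; the paper uses 15 types rather than the 2 shown here.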

Typing in Logical Form
In the logical form, all constants and variables representing entities x can be assigned a distribution over types p_x(t) using the type model. An initial type distribution is applied in the lexicon, using the φ distributions for the types of nouns, and the θ_i distributions for the types of arguments of binary predicates (inverted using Bayes' rule). Then, at each β-reduction in the derivation, we update the type probabilities to be the product of the type distributions of the terms being reduced: if two terms x and y combine to form a term z, then p_z(t) ∝ p_x(t)·p_y(t). For example, in wore a suit and file a suit, the variable representing suit may be lexically ambiguous between CLOTHES and LEGAL types, but the variables representing the objects of wear and file will have preferences that allow us to choose the correct type when the terms combine. Figure 4 shows an example derivation using the type model for disambiguation.

Figure 4: Using the type model for disambiguation in the derivation of file a suit. Type distributions are shown after the variable declarations. Both suit and the object of file are lexically ambiguous between different types, but after the β-reduction only one interpretation is likely. If the verb were wear, a different interpretation would be preferred.
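A minimal sketch of the type update applied at each β-reduction, assuming type distributions are represented as dictionaries (the handling of fully incompatible types is a simplification):

```python
def combine_types(p_x, p_y):
    """Pointwise product of two type distributions, renormalized.

    This is the update applied when two terms combine at a
    beta-reduction: the combined term's type distribution is
    proportional to the product of its parts'.
    """
    product = {t: p_x.get(t, 0.0) * p_y.get(t, 0.0)
               for t in set(p_x) | set(p_y)}
    total = sum(product.values())
    if total == 0.0:
        return product  # incompatible types; in practice smoothing avoids this
    return {t: p / total for t, p in product.items()}

# "suit" is lexically ambiguous; the object slot of "file" prefers LEGAL.
suit = {"CLOTHES": 0.5, "LEGAL": 0.5}
file_obj = {"LEGAL": 0.9, "CLOTHES": 0.1}
combined = combine_types(suit, file_obj)  # LEGAL dominates after combination
```

If the verb were wear instead, its object slot's preference for CLOTHES would flip the result, matching the behaviour described for Figure 4.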

Distributional Relation Clustering Model
The typed binary predicates can be grouped into clusters, each of which represents a distinct semantic relation. Note that because we cluster typed predicates, born_arg0:PER,argIn:LOC and born_arg0:PER,argIn:DAT can be clustered separately.

Corpus statistics
Typed binary predicates are clustered based on the expected number of times they hold between each argument-pair in the corpus. This means we create a single vector of argument-pair counts for each predicate (not a separate vector for each argument). For example, the vector for the typed predicate write_arg0:PER,arg1:BOOK may contain non-zero counts for entity-pairs such as (Shakespeare, Macbeth), (Dickens, Oliver Twist) and (Rowling, Harry Potter).
The entity-pair counts for author_arg0:PER,argOf:BOOK may be similar, on the assumption that both are samples from the same underlying semantic relation.
To find the expected number of occurrences of argument-pairs for typed binary predicates in a corpus, we first apply the type model to the derivation of each sentence, as described in Section 5.2. This outputs untyped binary predicates, with distributions over the types of their arguments. The type of the predicate must match the type of its arguments, so the type distribution of a binary predicate is simply the joint distribution of the two argument type distributions.
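A sketch of how the expected argument-pair counts can be accumulated, assuming each observation carries the per-argument type distributions output by the type model (the data structures and key format are illustrative):

```python
from collections import defaultdict
from itertools import product

def argument_pair_counts(observations):
    """Expected argument-pair counts for each typed binary predicate.

    Each observation is an untyped binary predicate with its argument
    pair and per-argument type distributions; the count mass is spread
    over type signatures according to the joint type distribution.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for pred, (e1, types1), (e2, types2) in observations:
        for (t1, p1), (t2, p2) in product(types1.items(), types2.items()):
            typed_pred = f"{pred}:{t1},{t2}"
            # Expected count of this entity-pair under this type signature.
            counts[typed_pred][(e1, e2)] += p1 * p2
    return counts

obs = [("write_arg0,arg1",
        ("shakespeare", {"PER": 1.0}),
        ("macbeth", {"BOOK": 0.8, "PER": 0.2}))]
vectors = argument_pair_counts(obs)
```

The resulting per-predicate vectors are exactly the argument-pair count vectors clustered in the next section.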

Clustering
Many algorithms have been proposed for clustering predicates based on their arguments (Poon and Domingos, 2009;Yao et al., 2012). The number of relations in the corpus is unbounded, so the clustering algorithm should be non-parametric. It is also important that it remains tractable for very large numbers of predicates and arguments, in order to give us a greater coverage of language than can be achieved by hand-built ontologies.
We cluster the typed predicate vectors using the Chinese Whispers algorithm (Biemann, 2006), which, although somewhat ad hoc, is both non-parametric and highly scalable. This algorithm has previously been used for noun clustering by Fountain and Lapata (2011), who argue that it is a cognitively plausible model for language acquisition. The collection of predicates and arguments is converted into a graph with one node per predicate, and edge weights representing the similarity between predicates. Predicates with different types have zero similarity; otherwise, similarity is computed as the cosine similarity of the tf-idf vectors of argument-pairs. We prune nodes occurring fewer than 20 times, edges with weights less than 10^-3, and a short list of stop predicates.
The algorithm proceeds as follows: initially, each predicate node is assigned to its own cluster; then, iterating over the nodes in random order, each node adopts the cluster label with the highest total edge weight among its neighbours; this is repeated until the labelling converges.
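A minimal sketch of Chinese Whispers under this standard description (graph construction, pruning and tf-idf weighting are omitted; the node names and edge weights are illustrative):

```python
import random

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """Sketch of Chinese Whispers (Biemann, 2006).

    `edges` maps (node_a, node_b) pairs to similarity weights.  Every
    node starts in its own cluster; nodes then repeatedly adopt the
    label with the highest total edge weight among their neighbours.
    The number of clusters is not fixed in advance (non-parametric).
    """
    rng = random.Random(seed)
    label = {n: n for n in nodes}
    neighbours = {n: [] for n in nodes}
    for (a, b), w in edges.items():
        neighbours[a].append((b, w))
        neighbours[b].append((a, w))
    for _ in range(iterations):
        order = list(nodes)
        rng.shuffle(order)
        for n in order:
            if not neighbours[n]:
                continue
            # Total edge weight towards each neighbouring label.
            scores = {}
            for m, w in neighbours[n]:
                scores[label[m]] = scores.get(label[m], 0.0) + w
            label[n] = max(scores, key=scores.get)
    return label

# Two tight groups of typed predicates, with no edge between the groups.
nodes = ["write:PER,BOOK", "author:PER,BOOK", "born:PER,LOC", "native:PER,LOC"]
edges = {("write:PER,BOOK", "author:PER,BOOK"): 0.8,
         ("born:PER,LOC", "native:PER,LOC"): 0.7}
clusters = chinese_whispers(nodes, edges)
```

Because predicates with different types have zero similarity, they can never merge; in the toy graph above the two groups converge to two separate cluster labels.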

Semantic Parsing using Relation Clusters
The final phase is to use our relation clusters in the lexical entries of the CCG semantic derivation. This is slightly complicated by the fact that our predicates are lexically ambiguous between all the possible types they could take, and hence the relations they could express. For example, the system cannot tell whether born in is expressing a birthplace or birthdate relation until later in the derivation, when it combines with its arguments. However, all the possible logical forms are identical except for the symbols used, which means we can produce a packed logical form capturing the full distribution over logical forms. To do this, we make the predicate a function from argument types to relations. For each word, we first take the lexical semantic definition produced by the algorithm in Section 4. For binary predicates in this definition (which will be untyped), we perform a deterministic lookup in the cluster model learned in Section 6, using all possible corresponding typed predicates. This allows us to represent the binary predicates as packed predicates: functions from argument types to relations.
If 1961 has a type distribution (LOC=0.1, DAT=0.9), the output for Obama was born in Hawaii in 1961 is a packed logical form representing both the birthplace and birthdate readings with their probabilities; the probability of any given logical form can be read directly from it.
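A sketch of a packed predicate as a function from argument types to relation clusters; the cluster identifiers and lookup-key format below are hypothetical:

```python
def packed_predicate(untyped_pred, cluster_model):
    """Represent a binary predicate as a function from argument types
    to relation clusters, via lookup in the learned cluster model.
    Returns None for type signatures with no clustered predicate."""
    def relation(t1, t2):
        return cluster_model.get(f"{untyped_pred}:{t1},{t2}")
    return relation

# Hypothetical fragment of a learned cluster model.
clusters = {"born_argIn:PER,LOC": "relation12",   # birthplace reading
            "born_argIn:PER,DAT": "relation41"}   # birthdate reading
born_in = packed_predicate("born_argIn", clusters)
```

Applying `born_in` to the type distribution of its arguments then yields a distribution over relations, which is how the packed logical form defers disambiguation until the predicate combines with its arguments.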

Experiments
Our approach aims to offer a strong model of both formal and lexical semantics. We perform two evaluations, aiming to target each of these separately, but using the same semantic representations in each. We train our system on Gigaword (Graff et al., 2003), which contains around 4 billion words of newswire. The type model is trained using 15 types and 5,000 iterations of Gibbs sampling (using the distributions from the final sample). Table 1 shows some example types.

Table 1: Example induced types
Type | Top Words
1    | suspect, assailant, fugitive, accomplice
2    | author, singer, actress, actor, dad
5    | city, area, country, region, town, capital
8    | subsidiary, automaker, airline, Co., GM
10   | musical, thriller, sequel, special

The relation clustering uses only proper nouns, to improve precision (sparsity problems are partly offset by the large input corpus). Aside from parsing, the pipeline takes around a day to run using 12 cores.

Question Answering Experiments
As yet, there is no standard way of evaluating lexical semantics. Existing tasks like Recognising Textual Entailment (RTE; Dagan et al., 2006) rely heavily on background knowledge, which is beyond the scope of this work. Intrinsic evaluations of entailment relations have low inter-annotator agreement (Szpektor et al., 2007), due to the difficulty of evaluating relations out of context. Our evaluation is based on that performed by Poon and Domingos (2009). We automatically construct a set of questions by sampling from text, and then evaluate how many answers can be found in a different corpus. From dependency-parsed newswire, we sample subject-verb-object triples in which the subject X and object Y are proper nouns and the verb is not on a list of stop verbs, and deterministically convert these to questions. For example, from Google bought YouTube we create the questions What did Google buy? and What bought YouTube?. The task is to find proper-noun answers to these questions in a different corpus, which are then evaluated by human annotators based on the sentence the answer was retrieved from. Systems can return multiple answers to the same question (e.g. What did Google buy? may have many valid answers), and all of these contribute to the result. As none of the systems model tense or temporal semantics, annotators were instructed to annotate answers as correct if they were true at any time. This approach means we evaluate on relations in proportion to corpus frequency. We sample 1000 questions from the New York Times subset of Gigaword from 2010, and search for answers in the New York Times from 2009. We evaluate the following approaches:
• CCG-Baseline The logical form produced by our CCG derivation, without the clustering.
• CCG-WordNet The CCG logical form, plus WordNet as a model of lexical semantics.
• CCG-Distributional The logical form including the type model and clusters.
• Relational LDA An LDA-based model for clustering dependency paths (Yao et al., 2011). We train on the New York Times subset of Gigaword, using their setup of 50 iterations with 100 relation types.
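The question-generation step described above can be sketched as follows (the tense handling is simplified; the real system works from dependency parses):

```python
def make_questions(subj, verb_base, verb_past, obj):
    """From a subject-verb-object triple with proper-noun arguments,
    build two questions, each replacing one argument with a wh-word.

    `verb_base` and `verb_past` are supplied separately here as a
    simplification; a real system would inflect the verb itself.
    """
    return [f"What did {subj} {verb_base}?",   # queries the object
            f"What {verb_past} {obj}?"]        # queries the subject

qs = make_questions("Google", "buy", "bought", "YouTube")
```

This reproduces the paper's example pair: What did Google buy? and What bought YouTube?.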
Unsupervised Semantic Parsing (USP; Poon and Domingos, 2009; Poon and Domingos, 2010; Titov and Klementiev, 2011) would be another obvious baseline. However, its memory requirements mean it is not possible to run at this scale (our system is trained on 4 orders of magnitude more data than the USP evaluation). Yao et al. (2011) found it had comparable performance to Relational LDA.
For the CCG models, rather than performing full first-order inference on a large corpus, we simply test whether the question predicate subsumes a candidate answer predicate, and whether the arguments match. In the case of CCG-Distributional, we calculate the probability that the two packed predicates are in the same cluster, marginalizing over their argument types. Answers are ranked by this probability. For CCG-WordNet, we check whether the question predicate is a hypernym of the candidate answer predicate (using any WordNet sense of either term).

Results are shown in Table 2.

Table 2: Results on the wide-coverage Question Answering task. CCG-Distributional ranks question/answer pairs by confidence; @250 means we evaluate the top 250 of these. It is not possible to give a recall figure, as the total number of correct answers in the corpus is unknown.

Relational-LDA induces many meaningful clusters, but predicates must be assigned to one of 100 relations, so results are dominated by large, noisy clusters (it is not possible to take the N-best answers, as the cluster assignments do not have a confidence score). The CCG-Baseline errors are mainly caused by parser errors, or by relations in the scope of non-factive operators. CCG-WordNet adds few answers to CCG-Baseline, reflecting the limitations of hand-built ontologies.
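The same-cluster probability used to rank answers can be sketched as a marginalization over argument-type signatures (the cluster model fragment below is hypothetical):

```python
def same_cluster_prob(pred1, types1, pred2, types2, clusters):
    """Probability that two packed predicates express the same relation,
    marginalizing over their argument types.

    `types1`/`types2` are joint distributions over (t1, t2) signatures;
    `clusters` maps (predicate, t1, t2) to a relation cluster.
    """
    prob = 0.0
    for sig1, p1 in types1.items():
        for sig2, p2 in types2.items():
            c1 = clusters.get((pred1,) + sig1)
            c2 = clusters.get((pred2,) + sig2)
            if c1 is not None and c1 == c2:
                prob += p1 * p2
    return prob

# Hypothetical cluster model fragment.
clusters = {("buy_arg0,arg1", "ORG", "ORG"): "relation7",
            ("purchase_arg0,arg1", "ORG", "ORG"): "relation7"}
p = same_cluster_prob("buy_arg0,arg1",
                      {("ORG", "ORG"): 0.9, ("PER", "ORG"): 0.1},
                      "purchase_arg0,arg1",
                      {("ORG", "ORG"): 1.0},
                      clusters)
```

Candidate answers are then ranked by this probability, which is how the @N cutoffs in Table 2 are obtained.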
CCG-Distributional substantially improves recall over other approaches whilst retaining good precision, demonstrating that we have learnt a powerful model of lexical semantics. Table 3 shows some correctly answered questions. The system improves over the baseline by mapping expressions such as merge with and acquisition of to the same relation cluster. Many of the errors are caused by conflating predicates where the entailment only holds in one direction, such as was elected to with ran for. Hierarchical clustering could be used to address this.

Experiments on the FraCaS Suite
We are also interested in evaluating our approach as a model of formal semantics-demonstrating that it is possible to integrate the formal semantics of Steedman (2012) with our distributional clusters.
The FraCaS suite (Cooper et al., 1996) contains a hand-built set of entailment problems designed to be challenging in terms of formal semantics. We use Section 1, which contains 74 problems requiring an understanding of quantifiers. They do not require any knowledge of lexical semantics, meaning we can evaluate the formal component of our system in isolation. However, we use the same representations as in our previous experiment, even though the clusters provide no benefit on this task. Figure 5 gives an example problem.
The only previous work we are aware of on this dataset is by MacCartney and Manning (2007). Their approach learns the monotonicity properties of words from a hand-built training set, and uses these to transform a sentence into a polarity-annotated string. The system then aims to transform the premise string into the hypothesis string. Positively polarized words can be replaced with less specific ones (e.g. by deleting adjuncts), whereas negatively polarized words can be replaced with more specific ones (e.g. by adding adjuncts). Whilst this is high-precision and often useful, the logic is unable to perform inferences over multiple premise sentences (in contrast to our first-order logic).
Development consists of adding entries to our lexicon for quantifiers. For simplicity, we treat multiword quantifiers like at least a few as single multiword expressions, although a more compositional analysis may be possible. Following MacCartney and Manning (2007), we do not use held-out data: each problem is designed to test a different issue, so it is not possible to generalize from one subset of the suite to another. As we are interested in evaluating the semantics, not the parser, we manually supply gold-standard lexical categories for sentences with parser errors (any syntactic mistake causes incorrect semantics). Our derivations produce a distribution over logical forms; we license the inference if it holds in any interpretation with non-zero probability. We use the Prover9 theorem prover (McCune, 2005) for inference, returning yes if the premise implies the hypothesis, no if it implies the negation of the hypothesis, and unknown otherwise.
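The three-way decision procedure can be sketched as follows, with the theorem prover abstracted as a function (the toy prover here only checks literal membership in the premises; the real system calls Prover9 for full first-order inference):

```python
def fracas_answer(premises, hypothesis, prove):
    """Three-way entailment decision used on the FraCaS problems.

    `prove(premises, goal)` should return True if the premises entail
    the goal; the paper uses Prover9, abstracted here as a parameter.
    """
    if prove(premises, hypothesis):
        return "yes"
    if prove(premises, ("not", hypothesis)):
        return "no"
    return "unknown"

# Toy "prover" for illustration only: a goal counts as proved if it
# appears literally among the premises.
toy_prove = lambda premises, goal: goal in premises

ans_yes = fracas_answer({"p"}, "p", toy_prove)
ans_no = fracas_answer({("not", "q")}, "q", toy_prove)
ans_unk = fracas_answer({"p"}, "q", toy_prove)
```

Because `premises` can contain multiple sentences' logical forms, this setup naturally handles the multi-premise problems that string-rewriting approaches cannot attempt.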
Results are shown in Table 4 (we use the version of the suite converted to machine-readable format by MacCartney and Manning, 2007). Our system improves on previous work by making multi-sentence inferences. Causes of errors include missing a distinct lexical entry for plural the, only taking existential interpretations of bare plurals, failing to interpret mass-noun determiners such as a lot of, and not providing a good semantics for non-monotone determiners such as most. We believe these problems will be surmountable with more work. Almost all errors are due to incorrectly predicting unknown: the system makes just one error on yes or no predictions (with or without gold syntax). This suggests that making first-order logic inferences in applications will not harm precision. We are less robust than MacCartney and Manning (2007) to syntax errors but, conversely, we are able to attempt more of the problems (i.e. those with multi-sentence premises).
Other approaches based on distributional semantics seem unable to tackle any of these problems, as they do not represent quantifiers or negation.

Related Work
Much work on semantics has taken place in a supervised setting, for example the GeoQuery (Zelle and Mooney, 1996) and ATIS (Dahl et al., 1994) semantic parsing tasks. This approach makes sense for generating queries for a specific database, but means the semantic representations do not generalize to other datasets. There have been several attempts to annotate larger corpora with semantics, such as Ontonotes (Hovy et al., 2006) or the Groningen Meaning Bank (Basile et al., 2012). These typically map words onto senses in ontologies such as WordNet, VerbNet (Kipper et al., 2000) and FrameNet (Baker et al., 1998). However, limitations of these ontologies mean that they do not support inferences such as X is the author of Y → X wrote Y. Given the difficulty of annotating large amounts of text with semantics, various approaches have attempted to learn meaning without annotated text. Distant Supervision approaches leverage existing knowledge bases, such as Freebase (Bollacker et al., 2008), to learn semantics (Mintz et al., 2009; Krishnamurthy and Mitchell, 2012). Dependency-based Compositional Semantics (Liang et al., 2011) learns the meaning of questions by using their answers as denotations, but this appears to be specific to question parsing. Such approaches can only learn the pre-specified relations in the knowledge base.
The approaches discussed so far in this section have all attempted to map language onto some pre-specified set of relations. Various attempts have been made to instead induce relations from text by clustering predicates based on their arguments. For example, Yao et al. (2011) propose a series of LDA-based models which cluster relations between entities based on a variety of lexical, syntactic and semantic features. Unsupervised Semantic Parsing (Poon and Domingos, 2009) recursively clusters fragments of dependency trees based on their arguments. Although USP is an elegant model, it is too computationally expensive to run on large corpora. It is also based on frame semantics, so does not cluster equivalent predicates with different frames. To our knowledge, our work is the first such approach to be integrated within a linguistic theory supporting formal semantics for logical operators.
Vector space models represent words by vectors based on co-occurrence counts. Recent work has tackled the problem of composing these matrices to build up the semantics of phrases or sentences (Mitchell and Lapata, 2008). Another strand (Coecke et al., 2010;Grefenstette et al., 2011) has shown how to represent meanings as tensors, whose order depends on the syntactic category, allowing an elegant correspondence between syntactic and semantic types. Socher et al. (2012) train a composition function using a neural network-however their method requires annotated data. It is also not obvious how to represent logical relations such as quantification in vector-space models. Baroni et al. (2012) make progress towards this by learning a classifier that can recognise entailments such as all dogs =⇒ some dogs, but this remains some way from the power of first-order theorem proving of the kind required by the problem in Figure 5.
An alternative strand of research has attempted to build computational models of linguistic theories based on formal compositional semantics, such as the CCG-based Boxer (Bos, 2008) and the LFG-based XLE (Bobrow et al., 2007). Such approaches convert parser output into formal semantic representations, and have demonstrated some ability to model complex phenomena such as negation. For lexical semantics, they typically compile lexical resources such as VerbNet and WordNet into inference rules, but still achieve only low recall on open-domain tasks, such as RTE, mostly due to the low coverage of such resources. Garrette et al. (2011) use distributional statistics to determine the probability that a WordNet-derived inference rule is valid in a given context. Our approach differs in that we learn inference rules not present in WordNet. Our lexical semantics is integrated into the lexicon, rather than being implemented as additional inference rules, meaning that inference is more efficient, as equivalent statements have the same logical form.
Natural Logic (MacCartney and Manning, 2007) offers an interesting alternative to symbolic logics, and has been shown to be able to capture complex logical inferences by simply identifying the scope of negation in text. This approach achieves similar precision and much higher recall than Boxer on the RTE task. Their approach also suffers from such limitations as only being able to make inferences between two sentences. It is also sensitive to word order, so cannot make inferences such as Shakespeare wrote Macbeth =⇒ Macbeth was written by Shakespeare.

Conclusions and Future Work
This is the first work we are aware of that combines a distributionally induced lexicon with formal semantics. Experiments suggest our approach compares favourably with existing work in both areas.
Many potential areas for improvement remain. Hierarchical clustering would allow us to capture hypernym relations, rather than the synonyms captured by our flat clustering. There is much potential for integrating existing hand-built resources, such as Ontonotes and WordNet, to improve the accuracy of clustering. There are cases where the existing CCGBank grammar does not match the required predicate-argument structure, for example in the case of light verbs. It may be possible to rebank CCGBank, in a way similar to Honnibal et al. (2010), to improve it on this point.