Learning a Compositional Semantics for Freebase with an Open Predicate Vocabulary

We present an approach to learning a model-theoretic semantics for natural language tied to Freebase. Crucially, our approach uses an open predicate vocabulary, enabling it to produce denotations for phrases such as “Republican front-runner from Texas” whose semantics cannot be represented using the Freebase schema. Our approach directly converts a sentence’s syntactic CCG parse into a logical form containing predicates derived from the words in the sentence, assigning each word a consistent semantics across sentences. This logical form is evaluated against a learned probabilistic database that defines a distribution over denotations for each textual predicate. A training phase produces this probabilistic database using a corpus of entity-linked text and probabilistic matrix factorization with a novel ranking objective function. We evaluate our approach on a compositional question answering task where it outperforms several competitive baselines. We also compare our approach against manually annotated Freebase queries, finding that our open predicate vocabulary enables us to answer many questions that Freebase cannot.


Introduction
Traditional knowledge representation assumes that world knowledge can be encoded using a closed vocabulary of formal predicates. In recent years, semantic parsing has enabled us to build compositional models of natural language semantics using such a closed predicate vocabulary (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005).
These semantic parsers map natural language statements to database queries, enabling applications such as answering questions using a large knowledge base (Yahya et al., 2012; Krishnamurthy and Mitchell, 2012; Cai and Yates, 2013; Kwiatkowski et al., 2013; Berant et al., 2013; Berant and Liang, 2014; Reddy et al., 2014). Furthermore, the model-theoretic semantics provided by such parsers have the potential to improve performance on other tasks, such as information extraction and coreference resolution.
However, a closed predicate vocabulary has inherent limitations. First, its coverage will be limited, as such vocabularies are typically manually constructed. Second, it may abstract away potentially relevant semantic differences. For example, the semantics of "Republican front-runner" cannot be adequately encoded in the Freebase schema because it lacks the concept of a "front-runner." We could choose to encode this concept as "politician" at the cost of abstracting away the distinction between the two. As this example illustrates, these two problems are prevalent in even the largest knowledge bases.
An alternative paradigm is an open predicate vocabulary, where each natural language word or phrase is given its own formal predicate. This paradigm is embodied in both open information extraction (Banko et al., 2007) and universal schema. Open predicate vocabularies have the potential to capture subtle semantic distinctions and achieve high coverage. However, we have yet to develop compelling approaches to compositional semantics within this paradigm. This paper takes a step toward compositional semantics with an open predicate vocabulary. Our approach defines a distribution over denotations (sets of Freebase entities) given an input text. The model has two components, shown in Figure 1. The first component is a rule-based semantic parser that uses a syntactic CCG parser and manually-defined rules to map entity-linked texts to logical forms containing predicates derived from the words in the text. The second component is a probabilistic database with a possible worlds semantics that defines a distribution over denotations for each textually-derived predicate. This database assigns independent probabilities to individual predicate instances, such as P(FRONT-RUNNER(/EN/GEORGE BUSH)) = 0.9. Together, these components define an exponentially large distribution over denotations for an input text; to simplify this output, we compute the marginal probability, over all possible worlds, that each entity is an element of the text's denotation.

Figure 1: Overview of our approach. Top left: the text is converted to logical form by CCG syntactic parsing and a collection of manually-defined rules. Bottom: low-dimensional embeddings of each entity (entity pair) and category (relation) are learned from an entity-linked web corpus. These embeddings are used to construct a probabilistic database. The labels of these matrices are shortened for space reasons. Top right: evaluating the logical form on the probabilistic database computes the marginal probability that each entity is an element of the text's denotation.
The learning problem in our approach is to train the probabilistic database to predict a denotation for each predicate. We pose this problem as probabilistic matrix factorization with a novel query/answer ranking objective. This factorization learns a lowdimensional embedding of each entity (entity pair) and category (relation) such that the denotation of a predicate is likely to contain entities or entity pairs with nearby vectors. To train the database, we first collect training data by analyzing entity-linked sentences in a large web corpus with the rule-based semantic parser. This process generates a collection of logical form queries with observed entity answers. The query/answer ranking objective, when optimized, trains the database to rank the observed answers for each query above unobserved answers.
We evaluate our approach on a question answering task, finding that our approach outperforms several baselines and that our new training objective improves performance over a previously-proposed objective. We also evaluate the trade-offs between open and closed predicate vocabularies by comparing our approach to a manually-annotated Freebase query for each question. This comparison reveals that, when Freebase contains predicates that cover the question, it achieves higher precision and recall than our approach. However, our approach can correctly answer many questions not covered by Freebase.

System Overview
The purpose of our system is to predict a denotation γ for a given natural language text s. The denotation γ is the set of Freebase entities that s refers to; for example, if s = "president of the US," then γ = {/EN/OBAMA, /EN/BUSH, ...}. Our system represents this prediction problem using the following probabilistic model:

P(γ|s) = Σ_ℓ Σ_w P(γ|ℓ, w) P(ℓ|s) P(w)

The first term in this factorization, P(ℓ|s), is a distribution over logical forms ℓ given the text s. This term corresponds to the rule-based semantic parser (Section 3). This semantic parser is deterministic, so this term assigns probability 1 to a single logical form for each text. The second term, P(w), represents a distribution over possible worlds, where each world is an assignment of truth values to all possible predicate instances. The distribution over worlds is represented by a probabilistic database (Section 4). The final term, P(γ|ℓ, w), deterministically evaluates the logical form ℓ on the world w to produce a denotation γ. This term represents query evaluation against a fixed database, as in other work on semantic parsing.
Section 5 describes inference in our model. To produce a ranked list of entities (Figure 1, top right) from P(γ|s), our system computes the marginal probability that each entity is an element of the denotation γ. This problem corresponds to query evaluation in a probabilistic database, which is known to be tractable in many cases (Suciu et al., 2011).
Section 6 describes training, which estimates parameters for the probabilistic database P (w). This step first automatically generates training data using the rule-based semantic parser. This data is used to formulate a matrix factorization problem that is optimized to estimate the database parameters.

Rule-Based Semantic Parser
The first part of our compositional semantics system is a rule-based system that deterministically computes a logical form for a text s. This component is used during inference to analyze the logical structure of text, and during training to generate training data (see Section 6.1). Several input/output pairs for this system are shown in Figure 2.
The conversion to logical form has three phases:
1. CCG syntactic parsing parses the text and applies several deterministic syntactic transformations to facilitate semantic analysis.
2. Entity linking marks known Freebase entities in the text.
3. Semantic analysis assigns a logical form to each word, then composes them to produce a logical form for the complete text.

Syntactic Parsing
The first step in our analysis is to syntactically parse the text. We use the ASP-SYN parser (Krishnamurthy and Mitchell, 2014) trained on CCGBank (Hockenmaier and Steedman, 2002). We then automatically transform the resulting syntactic parse to make the syntactic structure more amenable to semantic analysis. This step marks NPs in conjunctions by replacing their syntactic category with NP[conj]. This transformation allows semantic analysis to distinguish between appositives and comma-separated lists. It also transforms all verb arguments to core arguments, i.e., using the category PP/NP as opposed to ((S\NP)\(S\NP))/NP. This step simplifies the semantic analysis of verbs with prepositional phrase arguments. The final transformation adds a word feature to each PP category, e.g., mapping PP to PP[by]. These features are used to generate verb-preposition relation predicates, such as DIRECTED BY.

Entity Linking
The second step is to identify mentions of Freebase entities in the text. This step could be performed by an off-the-shelf entity linking system (Ratinov et al., 2011; Milne and Witten, 2008) or string matching. However, our training and test data are derived from Clueweb 2009, so we rely on the entity linking for this corpus provided by Gabrilovich et al. (2013).
Our system incorporates the provided entity links into the syntactic parse provided that they are consistent with the parse structure. Specifically, we require that each mention is either (1) a constituent in the parse tree with syntactic category N or NP or (2) a collection of N/N or NP/NP modifiers with a single head word. The first case covers noun and noun phrase mentions, while the second case covers noun compounds. In both cases, we substitute a single multi-word terminal into the parse tree spanning the mention and invoke special semantic rules for mentions described in the next section.

Semantic Analysis
The final step uses the syntactic parse and entity links to produce a logical form for the text. The system induces a logical form for every word in the text based on its syntactic CCG category. Composing these logical forms according to the syntactic parse produces a logical form for the entire text.
Our semantic analyses are based on a relatively naïve model-theoretic semantics. We focus on language whose semantics can be represented with existentially-quantified conjunctions of unary and binary predicates, ignoring, for example, temporal scope and superlatives. Generally, our system models nouns and adjectives as unary predicates, and verbs and prepositions as binary predicates. Special multi-word predicates are generated for verb-preposition combinations. Entity mentions are mapped to the mentioned entity in the logical form. We also created special rules for analyzing conjunctions, appositives, and relativizing conjunctions. The complete list of rules used to produce these logical forms is available online (http://rtw.ml.cmu.edu/tacl2015_csf).

We made several notable choices in designing this component. First, multi-argument verbs are analyzed using pairwise relations, as in the third example in Figure 2. This analysis allows us to avoid reasoning about entity triples (quadruples, etc.), which are challenging for the matrix factorization due to sparsity. Second, noun-preposition combinations are analyzed as a category and relation, as in the first example in Figure 2. We empirically found that combining the noun and preposition in such instances resulted in worse performance, as it dramatically increased the sparsity of training instances for the combined relations. Third, entity mentions with the N/N category are analyzed using a special noun-noun relation, as in the second example in Figure 2. Our intuition is that this relation shares instances with other relations (e.g., "city in Texas" implies "Texan city"). Finally, we lowercased each word to create its predicate name, but performed no lemmatization or other normalization.
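As a rough illustration of how word-level logical forms compose into an existentially-quantified conjunction, the sketch below analyzes a simple noun-preposition-entity phrase. The `analyze` helper and the tuple representation of conjuncts are invented for this example and are far simpler than the CCG-driven system described above.

```python
# A toy stand-in for the semantic analysis phase: nouns/adjectives become
# unary predicates, prepositions become binary predicates, and entity
# mentions fill the open argument slot. Roles here are a coarse,
# hypothetical substitute for real CCG categories.

def analyze(tagged_words):
    """tagged_words: list of (word, role) pairs; role in {noun, prep, entity}."""
    conjuncts = []
    entity = None
    for word, role in tagged_words:
        if role == "noun":                      # unary predicate over x
            conjuncts.append((word.lower(), ("x",)))
        elif role == "prep":                    # binary predicate, slot open
            conjuncts.append((word.lower(), ("x", None)))
        elif role == "entity":                  # Freebase entity mention
            entity = word
    # bind the entity into the open argument slot of each binary predicate
    return [(p, ("x", entity)) if len(args) == 2 else (p, args)
            for p, args in conjuncts]

# "president of /en/sprint" -> λx. president(x) ∧ of(x, /EN/SPRINT)
print(analyze([("president", "noun"), ("of", "prep"), ("/en/sprint", "entity")]))
# → [('president', ('x',)), ('of', ('x', '/en/sprint'))]
```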

Discussion
The scope of our semantic analysis system is somewhat limited relative to other similar systems (Bos, 2008;Lewis and Steedman, 2013) as it only outputs existentially-quantified conjunctions of predicates. Our goal in building this system was to analyze noun phrases and simple sentences, for which this representation generally suffices. The reason for this focus is twofold. First, this subset of language is sufficient to capture much of the language surrounding Freebase entities. Second, for various technical reasons, this restricted semantic representation is easier to use (and more informative) for training the probabilistic database (see Section 6.3). Note that this system can be straightforwardly extended to model additional linguistic phenomena, such as additional logical operators and generalized quantifiers, by writing additional rules. The semantics of logical forms including these operations are well-defined in our model, and the system does not even need to be re-trained to incorporate these additions.

Probabilistic Database
The second part of our compositional semantics system is a probabilistic database. This database represents a distribution over possible worlds, where each world is an assignment of truth values to every predicate instance. Equivalently, the probabilistic database can be viewed as a distribution over databases or knowledge bases.
Formally, a probabilistic database is a collection of random variables, each of which represents the truth value of a single predicate instance. Given entities e ∈ E, categories c ∈ C, and relations r ∈ R, the probabilistic database contains boolean random variables c(e) and r(e₁, e₂) for each category and relation instance, respectively. All of these random variables are assumed to be independent. Let a world w represent an assignment of truth values to all of these random variables, where c(e) = w_{c,e} and r(e₁, e₂) = w_{r,e₁,e₂}. By independence, the probability of a world can be written as:

P(w) = ∏_{c,e} P(c(e) = w_{c,e}) × ∏_{r,e₁,e₂} P(r(e₁, e₂) = w_{r,e₁,e₂})

The next section discusses how probabilistic matrix factorization is used to model the probabilities of these predicate instances.
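The independence assumption makes the probability of any world a simple product over predicate instances. A minimal sketch, with made-up instance probabilities:

```python
import math

# A probabilistic database as independent Bernoulli variables, one per
# predicate instance. The probabilities below are illustrative only.
db = {
    ("front-runner", "/en/george_bush"): 0.9,
    ("republican",   "/en/george_bush"): 0.8,
}

def log_prob_world(db, world):
    """world: dict mapping each predicate instance to a truth value.
    By independence, P(w) is a product over instances, so log P(w) is a sum."""
    lp = 0.0
    for instance, truth in world.items():
        p = db[instance]
        lp += math.log(p if truth else 1.0 - p)
    return lp

w = {("front-runner", "/en/george_bush"): True,
     ("republican",   "/en/george_bush"): False}
# P(w) = 0.9 * (1 - 0.8) = 0.18
print(math.exp(log_prob_world(db, w)))
```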

Matrix Factorization Model
The probabilistic matrix factorization model treats the truth of each predicate instance as an independent boolean random variable that is true with probability:

P(c(e) = TRUE) = σ(θ_c^T φ_e)
P(r(e₁, e₂) = TRUE) = σ(θ_r^T φ_{(e₁,e₂)})

where σ(x) = 1/(1 + e^{−x}) is the logistic function. In these equations, θ_c and θ_r represent k-dimensional vectors of per-predicate parameters, while φ_e and φ_{(e₁,e₂)} represent k-dimensional vector embeddings of each entity and entity pair. This model contains a low-dimensional embedding of each predicate and entity such that each predicate's denotation has a high probability of containing entities with nearby vectors. The probability that each variable is false is simply 1 minus the probability that it is true.
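A sketch of the instance probability computation. The vectors mirror the θ_c and φ_e notation above, but their values are illustrative rather than learned:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative (not learned) parameter vectors with k = 4:
theta_c = [0.5, -1.0, 0.25, 2.0]   # parameters of a category predicate c
phi_e   = [1.0,  0.5, -0.5, 0.5]   # embedding of an entity e

# P(c(e) = TRUE) = sigmoid(theta_c . phi_e)
score = sum(t * p for t, p in zip(theta_c, phi_e))   # dot product = 0.875
p_true = sigmoid(score)
print(p_true)          # ≈ 0.706
print(1.0 - p_true)    # P(c(e) = FALSE)
```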
This model can be viewed as matrix factorization, as depicted in Figure 1. The category and relation instance probabilities can be arranged in a pair of matrices of dimension |E| × |C| and |E|² × |R|. Each row of these matrices represents an entity or entity pair, each column represents a category or relation, and each value is between 0 and 1 and represents a truth probability (Figure 1, bottom right). These two matrices are factored into matrices of size |E| × k and k × |C|, and |E|² × k and k × |R|, respectively, containing k-dimensional embeddings of each entity, category, entity pair and relation (Figure 1, bottom left). These low-dimensional embeddings are represented by the parameters φ and θ.

Inference: Computing Marginal Probabilities
Inference computes the marginal probability, over all possible worlds, that each entity is an element of a text's denotation. In many cases, depending on the text, these marginal probabilities can be computed exactly in polynomial time.
The inference problem is to calculate P(e ∈ γ|s) for each entity e. Because both the semantic parser P(ℓ|s) and query evaluation P(γ|ℓ, w) are deterministic, this problem can be rewritten as:

P(e ∈ γ|s) = Σ_w 1(e ∈ ⟦ℓ⟧_w) P(w)

Above, ℓ represents the logical form for the text s produced by the rule-based semantic parser, and 1 represents the indicator function. The notation ⟦ℓ⟧_w represents the denotation produced by (deterministically) evaluating the logical form ℓ on world w. This inference problem corresponds to query evaluation in a probabilistic database, which is #P-hard in general. Intuitively, this problem can be difficult because P(γ|s) is a joint distribution over sets of entities that can be exponentially large in the number of entities.
However, a large subset of probabilistic database queries, known as safe queries, permit polynomial-time evaluation (Dalvi and Suciu, 2007). Safe queries can be evaluated extensionally using a probabilistic notion of a denotation that treats each entity as independent. Let ⟦ℓ⟧_P denote the probabilistic denotation of ℓ, which is a function from entities (or entity pairs) to probabilities, i.e., ⟦ℓ⟧_P(e) ∈ [0, 1]. The denotation of a logical form is then computed recursively, in the same manner as a non-probabilistic denotation, using probabilistic extensions of the typical rules, such as:

⟦c⟧_P(x) = P(c(x) = TRUE)
⟦r⟧_P(x, y) = P(r(x, y) = TRUE)
⟦ℓ₁ ∧ ℓ₂⟧_P(x) = ⟦ℓ₁⟧_P(x) × ⟦ℓ₂⟧_P(x)
⟦∃y.ℓ⟧_P(x) = 1 − ∏_y (1 − ⟦ℓ⟧_P(x, y))

The first two rules are base cases that simply retrieve predicate probabilities from the probabilistic database. The remaining rules compute the probabilistic denotation of a logical form from the denotations of its parts. The formula for the probabilistic computation on the right of each of these rules is a straightforward consequence of the (assumed) independence of entities. For example, the last rule computes the probability of an OR of a set of independent random variables (indexed by y) using the identity A ∨ B = ¬(¬A ∧ ¬B). For safe queries, ⟦ℓ⟧_P(e) = P(e ∈ γ|s); that is, the probabilistic denotation computed according to the above rules is equal to the marginal probability distribution. In practice, all of the queries in the experiments are safe, because they contain only one query variable and do not contain repeated predicates. For more information on query evaluation in probabilistic databases, we refer the reader to Suciu et al. (2011).
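For a safe conjunctive query such as λx.∃y. CEO(x) ∧ OF(x, y), these rules reduce marginal computation to products over instance probabilities. A small sketch with invented probabilities (the predicate and entity names are illustrative):

```python
# Extensional evaluation of a safe query lambda x. exists y. c(x) AND r(x, y):
#   P(x) = P(c(x)) * (1 - prod_y (1 - P(r(x, y))))
# using independence of all predicate instances.

categories = {("ceo", "e1"): 0.9, ("ceo", "e2"): 0.4}
relations = {("of", "e1", "sprint"): 0.8, ("of", "e1", "nextel"): 0.5,
             ("of", "e2", "sprint"): 0.9}

def marginal(x, cat, rel):
    p_cat = categories.get((cat, x), 0.0)
    # OR over y via the identity A OR B = NOT(NOT A AND NOT B)
    p_no_rel = 1.0
    for (r, x2, y), p in relations.items():
        if r == rel and x2 == x:
            p_no_rel *= 1.0 - p
    return p_cat * (1.0 - p_no_rel)

for e in ["e1", "e2"]:
    print(e, marginal(e, "ceo", "of"))
# e1: 0.9 * (1 - 0.2 * 0.5) = 0.81
# e2: 0.4 * (1 - 0.1)       = 0.36
```

Sorting entities by these marginals yields the ranked list the system outputs (Figure 1, top right).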
Note that inference does not compute the most probable denotation, max γ P (γ|s). In some sense, the most probable denotation is the correct output for a model-theoretic semantics. However, it is highly sensitive to the probabilities in the database, and in many cases it is empty (because a conjunction of independent boolean random variables is unlikely to be true). Producing a ranked list of entities is also useful for evaluation purposes.

Training
The training problem in our approach is to learn parameters θ and φ for the probabilistic database. We consider two different objective functions for learning these parameters that use slightly different forms of training data. In both cases, training has two phases. First, we generate training data, in the form of observed assertions or query-answer pairs, by applying the rule-based semantic parser to a corpus of entity-linked web text. Second, we optimize the parameters of the probabilistic database to rank observed assertions or answers above unobserved assertions or answers.

Training Data
Training data is generated by applying the process illustrated in Figure 3 to each sentence in an entity-linked web corpus. First, we apply our rule-based semantic parser to the sentence to produce a logical form. Next, we extract portions of this logical form where every variable is bound to a particular Freebase entity, resulting in a simplified logical form. Because the logical forms are existentially-quantified conjunctions of predicates, this step simply discards any conjuncts in the logical form containing a variable that is not bound to a Freebase entity. From this simplified logical form, we generate two types of training data: (1) predicate instances, and (2) queries with known answers (see Figure 3). In both cases, the corpus consists entirely of assumed-to-be-true statements, making obtaining negative examples a major challenge for training.

Predicate Ranking Objective
The data for this objective consists of two collections, {(c_i, e_i)}_{i=1}^{n} and {(r_j, t_j)}_{j=1}^{m}, of observed category and relation instances. We use t_j to denote a tuple of entities, t_j = (e_{j,1}, e_{j,2}), to simplify notation. The predicate ranking objective is:

O_P(θ, φ) = Σ_{i=1}^{n} log σ(θ_{c_i}^T (φ_{e_i} − φ_{e_i'})) + Σ_{j=1}^{m} log σ(θ_{r_j}^T (φ_{t_j} − φ_{t_j'}))
where e_i' is a randomly sampled entity such that (c_i, e_i') does not occur in the training data. Similarly, t_j' is a random entity tuple such that (r_j, t_j') does not occur. Maximizing this function attempts to find θ_{c_i}, φ_{e_i} and φ_{e_i'} such that P(c_i(e_i)) is larger than P(c_i(e_i')) (and similarly for relations). During training, e_i' and t_j' are resampled on each pass over the data set according to each entity or tuple's empirical frequency.
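A toy gradient-ascent step on the category part of this ranking objective might look as follows. This uses plain SGD rather than the AdaGrad optimizer used in the experiments, and all vectors are small illustrative values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Illustrative k = 4 vectors: category parameters, observed entity embedding,
# and a sampled negative entity embedding.
theta   = [0.1, -0.2, 0.05, 0.3]
phi_pos = [0.2, 0.1, -0.1, 0.0]
phi_neg = [-0.3, 0.4, 0.2, -0.1]

def step(theta, pos, neg, lr=0.1):
    """One ascent step on log sigmoid(theta . (pos - neg))."""
    diff = [p - n for p, n in zip(pos, neg)]
    g = 1.0 - sigmoid(dot(theta, diff))      # d/dx log sigmoid(x) = 1 - sigmoid(x)
    return ([t + lr * g * d for t, d in zip(theta, diff)],
            [p + lr * g * t for p, t in zip(pos, theta)],
            [n - lr * g * t for n, t in zip(neg, theta)])

before = sigmoid(dot(theta, [p - n for p, n in zip(phi_pos, phi_neg)]))
for _ in range(100):
    theta, phi_pos, phi_neg = step(theta, phi_pos, phi_neg)
after = sigmoid(dot(theta, [p - n for p, n in zip(phi_pos, phi_neg)]))
assert after > before   # the observed entity is ranked higher than the negative
print(before, after)
```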

Query Ranking Objective
The previous objective aims to rank the entities within each predicate well. However, such within-predicate rankings are insufficient to produce correct answers for queries containing multiple predicates: the scores for each predicate must further be calibrated to work well with each other given the independence assumptions of the probabilistic database. We introduce a new training objective that encourages good rankings for entire queries instead of single predicates. The data for this objective consists of tuples, {(ℓ_i, e_i)}_{i=1}^{n}, of a query ℓ_i with an observed answer e_i (Figure 3, bottom right). Each ℓ_i is a function with exactly one entity argument, and ℓ_i(e) is a conjunction of predicate instances. For example, the last query in Figure 3 is a function of one argument z, and ℓ(e) is a single predicate instance, 'S(e, /EN/LATE). The new objective aims to rank the observed entity answer above unobserved entities for each query:

O_Q(θ, φ) = Σ_{i=1}^{n} log P_rank(ℓ_i, e_i, e_i')

P_rank generalizes the approximate ranking probability defined by the predicate ranking objective to more general queries. The expression σ(θ_c^T (φ_e − φ_{e'})) in the predicate ranking objective can be viewed as an approximation of the probability that e is ranked above e' in category c. P_rank uses this approximation for each individual predicate in the query. For example, given the query ℓ = λx.c(x) ∧ r(x, y) and entities (e, e'), P_rank(ℓ, e, e') = σ(θ_c^T (φ_e − φ_{e'})) × σ(θ_r^T (φ_{(e,y)} − φ_{(e',y)})). For this objective, we sample e_i' such that (ℓ_i, e_i') does not occur in the training data.
When ℓ's body consists of a conjunction of predicates, the query ranking objective simplifies considerably. In this case, ℓ can be described as three sets of one-argument functions: categories C(ℓ) = {λx.c(x)}, left arguments of relations R_L(ℓ) = {λx.r(x, y)}, and right arguments of relations R_R(ℓ) = {λx.r(y, x)}. Furthermore, P_rank is a product, so we can distribute the log:

O_Q(θ, φ) = Σ_{i=1}^{n} [ Σ_{λx.c(x) ∈ C(ℓ_i)} log σ(θ_c^T (φ_{e_i} − φ_{e_i'})) + Σ_{λx.r(x,y) ∈ R_L(ℓ_i)} log σ(θ_r^T (φ_{(e_i,y)} − φ_{(e_i',y)})) + Σ_{λx.r(y,x) ∈ R_R(ℓ_i)} log σ(θ_r^T (φ_{(y,e_i)} − φ_{(y,e_i')})) ]

This simplification reveals that the main difference between O_Q and O_P is the sampling of the unobserved entities e' and tuples t'. O_P samples them in an unconstrained fashion from their empirical distributions for every predicate. O_Q considers the larger context in which each predicate occurs, with two major effects. First, more negative examples are generated for categories because the logical forms are more specific. For example, both "president of Sprint" and "president of the US" generate instances of the PRESIDENT predicate; O_Q will use entities that only occur with one of these as negative examples for the other. Second, the relation parameters are trained to rank tuples with a shared argument, as opposed to tuples in general.
Note that, although P_rank generalizes to more complex logical forms than existentially-quantified conjunctions, training with these logical forms is more difficult because P_rank is no longer a product. In these cases, it becomes necessary to perform inference within the gradient computation, which can be expensive. The restriction to conjunctions makes inference trivial, enabling the factorization above.
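For conjunctive queries, P_rank is a product of per-predicate ranking factors, so its log decomposes into a sum, which is exactly the simplification exploited above. A minimal sketch with illustrative margin scores:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_rank(scores):
    """scores: per-predicate margin scores theta^T (phi_e - phi_e') for one
    query and one (observed, negative) entity pair; values are illustrative."""
    p = 1.0
    for s in scores:
        p *= sigmoid(s)
    return p

# log of the product equals the sum of log-sigmoid terms, one per predicate:
scores = [2.0, 0.5]   # e.g. one category factor and one relation factor
lhs = math.log(p_rank(scores))
rhs = sum(math.log(sigmoid(s)) for s in scores)
print(lhs, rhs)   # identical up to floating-point error
```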

Evaluation
We evaluate our approach to compositional semantics on a question answering task. Each test example is a (compositional) natural language question whose answer is a set of Freebase entities. We compare our open domain approach to several baselines based on prior work, as well as a human-annotated Freebase query for each example.

Data
We used the ClueWeb09 web corpus with the corresponding Google FACC entity linking (Gabrilovich et al., 2013) to create the training and test data for our experiments. The training data is derived from 3 million webpages, and contains 2.1m predicate instances, 1.1m queries, 172k entities and 181k entity pairs. Predicates that appeared fewer than 6 times in the training data were replaced with the predicate UNK, resulting in 25k categories and 2.2k relations.
Our test data consists of fill-in-the-blank natural language questions such as "Incan emperor " or "Cunningham directed Auchtre's second music video ." These questions were created by applying the training data generation process (Section 6.1) to a collection of held-out webpages. Each natural language question has a corresponding logical form query containing at least one category and relation. We chose not to use existing data sets for semantic parsing into Freebase as our goal is to model the semantics of language that cannot necessarily be modelled using the Freebase schema. Existing data sets, such as Free917 (Cai and Yates, 2013) and WebQuestions (Berant et al., 2013), would not allow us to evaluate performance on this subset of language. Consequently, we evaluate our system on a new data set with unconstrained language. However, we do compare our approach against manually-annotated Freebase queries on our new data set (Section 7.5).
All of the data for our experiments is available at http://rtw.ml.cmu.edu/tacl2015_csf.

Methodology
Our evaluation methodology is inspired by information retrieval evaluations (Manning et al., 2008). Each system predicts a ranked list of 100 answers for each test question. We then pool the top 30 answers of each system and manually judge their correctness. The correct answers from the pool are then used to evaluate the precision and recall of each system. In particular, we compute average precision (AP) for each question and report the mean average precision (MAP) across all questions. We also report a weighted version of MAP, where each question's AP is weighted by its number of annotated correct answers. Average precision is computed as:

AP = (1/R) Σ_{k=1}^{m} Prec(k) × Correct(k)

where Prec(k) is the precision at rank k, Correct(k) is an indicator function for whether the kth answer is correct, R is the number of annotated correct answers, and m is the number of returned answers (at most 100).
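A sketch of this average precision computation. It assumes the standard information-retrieval normalization by the number of pooled correct answers:

```python
def average_precision(ranked_correct, num_correct):
    """ranked_correct: booleans, True where the k-th returned answer is
    correct. num_correct: total correct answers in the judged pool."""
    if num_correct == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, correct in enumerate(ranked_correct, start=1):
        if correct:
            hits += 1
            total += hits / k   # Prec(k), accumulated at each correct answer
    return total / num_correct

# Answers at ranks 1 and 3 correct, out of 2 pooled correct answers:
# AP = (1/2) * (1/1 + 2/3) = 5/6
print(average_precision([True, False, True], 2))
```

Mean average precision is then the arithmetic mean of these per-question AP values; the weighted variant weights each AP by `num_correct`.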
Statistics of the annotated test set are shown in Table 1. A consequence of our unconstrained data generation approach is that some test questions are difficult to answer: of the 220 queries, at least one system was able to produce a correct answer for only 116. The remaining questions are mostly unanswerable because they reference rare entities unseen in the training data.

Models and Baselines
We implemented two baseline models based on existing techniques. The CORPUSLOOKUP baseline answers test questions by directly using the predicate instances in the training data as its knowledge base. For example, given the query λx.CEO(x) ∧ OF(x, /EN/SPRINT), this model will return the set of entities e such that CEO(e) and OF(e, /EN/SPRINT) both appear in the training data. All answers found in this fashion are assigned probability 1. The CLUSTERING baseline first clusters the predicates in the training corpus, then answers questions using the clustered predicates. The clustering aggregates predicates with similar denotations, ideally identifying synonyms to smooth over sparsity in the training data. This baseline is closely based on Lewis and Steedman (2013), though it is also conceptually related to approaches such as DIRT (Lin and Pantel, 2001) and USP (Poon and Domingos, 2009). We use the Chinese Whispers clustering algorithm (Biemann, 2006) and calculate the similarity between predicates as the cosine similarity of their TF-IDF weighted entity count vectors. The denotation of each cluster is the union of the denotations of the clustered predicates, and each entity in the denotation is assigned probability 1.
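The CLUSTERING baseline's predicate similarity can be sketched as cosine similarity between TF-IDF weighted entity count vectors. The predicate names and counts below are invented for illustration:

```python
import math
from collections import Counter

# Illustrative entity counts for three predicates:
counts = {
    "president": Counter({"/en/obama": 5, "/en/bush": 3}),
    "leader":    Counter({"/en/obama": 2, "/en/bush": 1, "/en/merkel": 4}),
    "city":      Counter({"/en/austin": 7}),
}

def tfidf(pred):
    """TF-IDF weight each entity count; IDF uses predicate document frequency."""
    n_preds = len(counts)
    vec = {}
    for ent, tf in counts[pred].items():
        df = sum(1 for c in counts.values() if ent in c)
        vec[ent] = tf * math.log(n_preds / df)
    return vec

def cosine(u, v):
    dot = sum(u[e] * v.get(e, 0.0) for e in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(tfidf("president"), tfidf("leader")))  # > 0: shared entities
print(cosine(tfidf("president"), tfidf("city")))    # 0.0: disjoint entities
```

Predicates whose vectors are similar under this measure end up in the same Chinese Whispers cluster, and their denotations are unioned.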
We also trained two probabilistic database models, FACTORIZATION (O_P) and FACTORIZATION (O_Q), using the two objective functions described in Sections 6.2 and 6.3, respectively. We optimized both objectives by performing 100 passes over the training data with AdaGrad (Duchi et al., 2011) using an L2 regularization parameter of λ = 10⁻⁴. The predicate and entity embeddings have 300 dimensions. These parameters were selected on the basis of preliminary experiments with a small validation set.

Finally, we observed that CORPUSLOOKUP has high precision but low recall, while both matrix factorization models have high recall with somewhat lower precision. This observation suggested that an ensemble of CORPUSLOOKUP and FACTORIZATION could outperform either model individually. We created two ensembles, ENSEMBLE (O_P) and ENSEMBLE (O_Q), by calculating the probability of each predicate as a 50/50 mixture of each model's predicted probability.

Table 2 shows the results of our MAP evaluation, and Figure 4 shows a precision/recall curve for each model. The MAP numbers are somewhat low because almost half of the test questions have no correct answers and all models get an average precision of 0 on these questions. The upper bound on MAP is the fraction of questions with at least 1 correct answer. Note that the models perform well on the answerable questions, as reflected by the ratio of the achieved MAP to the upper bound. The weighted MAP metric also corrects for these unanswerable questions, as they are assigned 0 weight in the weighted average.
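The ensemble's 50/50 mixture is simply a per-predicate average of the two component models' probabilities:

```python
# Per-predicate ensemble of the lookup model and the factorization model.
# The probabilities below are illustrative.
def ensemble_prob(p_lookup, p_factorization, weight=0.5):
    return weight * p_lookup + (1 - weight) * p_factorization

# Lookup is confident (1.0) when the instance occurred in the corpus;
# factorization generalizes to unseen instances.
print(ensemble_prob(1.0, 0.7))   # instance observed in training data
print(ensemble_prob(0.0, 0.7))   # instance unseen by the lookup model
```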

Results
These results demonstrate several findings. First, we find that both FACTORIZATION models outperform the baselines in both MAP and weighted MAP.  The performance improvement seems to be most significant in the high recall regime (right side of Figure 4). Second, we find that the query ranking objective O Q improves performance over the predicate ranking objective O P by 2-4% on the answerable queries. The precision/recall curves show that this improvement is concentrated in the low recall regime. Finally, the ensemble models are considerably better than their component models; however, even in the ensembled models, we find that O Q outperforms O P by a few percent.

Comparison to Semantic Parsing to Freebase
A natural question is whether our open vocabulary approach outperforms a closed approach for the same problem, such as semantic parsing to Freebase (e.g., Reddy et al. (2014)).
In order to answer this question, we compared our best performing model to a manually-annotated Freebase query for each test question. This comparison allows us to understand the relative advantages of open and closed predicate vocabularies. The first author manually annotated a Freebase MQL query for each natural language question in the test data set. This annotation is somewhat subjective, as many of the questions can only be inexactly mapped onto the Freebase schema. We used the following guidelines in performing the mapping: (1) all relations in the text must be mapped to one or more Freebase relations, (2) all entities mentioned in the text must be included in the query, (3) adjective modifiers can be ignored, and (4) entities not mentioned in the text may be included in the query. The fourth condition is necessary because many one-place predicates, such as MAYOR(x), are represented in Freebase using a binary relation to a particular entity, such as GOVERNMENT.

We compared our best performing model, ENSEMBLE (O_Q), to the manually annotated Freebase queries using the same pooled evaluation methodology. The set of correct answers contains the correct predictions of ENSEMBLE (O_Q) from the previous evaluation along with all answers from Freebase.
Results from this evaluation are shown in Table 4. In terms of overall MAP, Freebase outperforms our approach by a fair margin. However, this initial impression belies a more complex reality, shown in Table 5, which compares the two approaches by their relative performance on each test question. On approximately one-third of the questions, Freebase has a higher AP than our approach. On another third, our approach has a higher AP than Freebase. On the final third, both approaches perform equally well; these are typically questions where neither approach returns any correct answers (67 of the 75). Freebase outperforms in overall MAP because it tends to return more correct answers to each question.
Note that the annotated Freebase queries have several advantages in this evaluation. First, Freebase contains significantly more predicate instances than our training data, which allows it to produce more complete answers. Second, the Freebase queries correspond to the performance of a perfect semantic parser, while current semantic parsers achieve accuracies around 68% (Berant and Liang, 2014). The results from this experiment suggest that closed and open predicate vocabularies are complementary: Freebase produces high-quality answers when it covers a question, but many of the remaining questions can be answered correctly using an open vocabulary approach like ours. This evaluation also suggests that recall is a limiting factor of our approach; in the future, recall can be improved by using a larger corpus or by including Freebase instances during training.

Related Work

Open Predicate Vocabularies
There has been considerable work on generating semantic representations with an open predicate vocabulary. Much of this work is non-compositional, focusing on identifying similar predicates and entities. DIRT (Lin and Pantel, 2001), Resolver (Yates and Etzioni, 2007) and other systems (Yao et al., 2012) cluster synonymous expressions in a corpus of relation triples. Matrix factorization is an alternative approach to clustering that has been used for relation extraction and for finding analogies (Turney, 2008; Speer et al., 2008). All of this work is closely related to distributional semantics, which uses distributional information to identify semantically similar words and phrases (Turney and Pantel, 2010; Griffiths et al., 2007).

Some work has considered the problem of compositional semantics with an open predicate vocabulary. Unsupervised semantic parsing (Poon and Domingos, 2009; Titov and Klementiev, 2011) is a clustering-based approach that incorporates composition using a generative model for each sentence that factors according to its parse tree. Lewis and Steedman (2013) also present a clustering-based approach that uses CCG to perform semantic composition. This approach is similar to ours, except that we use matrix factorization and Freebase entities.
Finally, some work has focused on the problem of textual inference within this paradigm. Fader et al. (2013) present a question answering system that learns to paraphrase a question so that it can be answered using a corpus of Open IE triples (Fader et al., 2011). Distributional similarity has also been used to learn weighted logical inference rules that can be used for recognizing textual entailment or identifying semantically similar text (Garrette et al., 2011;Beltagy et al., 2013). This line of work focuses on performing inference between texts, whereas our work computes a text's denotation.
A significant difference between our work and most of the related work above is that our work computes denotations containing Freebase entities. Using these entities has two advantages: (1) it enables us to use entity linking to disambiguate textual mentions, and (2) it facilitates a comparison against alternative approaches that rely on a closed predicate vocabulary. Disambiguating textual mentions is a major challenge for previous approaches, so an entity-linked corpus is a much cleaner source of data. However, our approach could also work with automatically constructed entities, for example, created by clustering mentions in an unsupervised fashion (Singh et al., 2011).

Semantic Parsing
Several semantic parsers have been developed for Freebase (Cai and Yates, 2013; Kwiatkowski et al., 2013; Berant et al., 2013; Berant and Liang, 2014). Our approach is most similar to that of Reddy et al. (2014), which uses fixed syntactic parses of unlabeled text to train a Freebase semantic parser. Like our approach, this system automatically generates query/answer pairs for training. However, this system, like all Freebase semantic parsers, uses a closed predicate vocabulary consisting of only Freebase predicates. In contrast, our approach uses an open predicate vocabulary and can learn denotations for words whose semantics cannot be represented using Freebase predicates. Consequently, our approach can answer many questions that these Freebase semantic parsers cannot (see Section 7.5).
The rule-based semantic parser used in this paper is very similar to several other rule-based systems that produce logical forms from syntactic CCG parses (Bos, 2008;Lewis and Steedman, 2013). We developed our own system in order to have control over the particulars of the analysis; however, our approach is compatible with these systems as well.
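As a hedged illustration of the kind of output such a rule-based system produces, the sketch below assembles an existentially quantified conjunction for a noun phrase from per-word predicates. The function and predicate names are our own illustrative inventions, not the paper's implementation, and the sketch ignores the actual CCG combinatory machinery.

```python
# Illustrative sketch only: build a conjunctive logical form for a noun
# phrase from predicates derived from its words, in the spirit of the
# rule-based CCG-to-logical-form conversion described in the text.
def np_logical_form(head_noun, adjectives, pp_relations):
    atoms = [f"{head_noun.upper()}(x)"]                   # head noun predicate
    atoms += [f"{adj.upper()}(x)" for adj in adjectives]  # intersective modifiers
    # Prepositional phrases become two-place predicates with a linked entity.
    atoms += [f"{rel.upper()}(x, {ent})" for rel, ent in pp_relations]
    return "∃x. " + " ∧ ".join(atoms)

# "Republican front-runner from Texas"
lf = np_logical_form("front_runner", ["republican"], [("from", "/en/texas")])
# → ∃x. FRONT_RUNNER(x) ∧ REPUBLICAN(x) ∧ FROM(x, /en/texas)
```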

Probabilistic Databases
Our system assigns a model-theoretic semantics to statements in natural language (Dowty et al., 1981) using a learned distribution over possible worlds. This distribution is concisely represented in a probabilistic database, which can be viewed as a simple Markov Logic Network (Richardson and Domingos, 2006) where all of the random variables are independent. This independence simplifies query evaluation: probabilistic databases permit efficient exact inference for safe queries (Suciu et al., 2011), and approximate inference for the remainder (Gatterbauer et al., 2010;Gatterbauer and Suciu, 2015).
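As a minimal sketch of this setup, assume each predicate-entity fact is an independent Bernoulli variable; the probability that an entity satisfies a conjunctive query then factors into a product of per-fact marginals. The entities, predicate names, and probabilities below are invented for illustration.

```python
# Toy probabilistic database: independent facts with marginal probabilities.
# All predicates, entities, and probabilities here are illustrative only.
prob_db = {
    ("republican", "/en/rick_perry"): 0.90,
    ("from_texas", "/en/rick_perry"): 0.95,
    ("republican", "/en/jeb_bush"): 0.85,
    ("from_texas", "/en/jeb_bush"): 0.10,
}

def denotation_prob(entity, predicates, db):
    # Because the fact variables are independent, the probability that an
    # entity satisfies a conjunction is the product of per-fact marginals.
    prob = 1.0
    for pred in predicates:
        prob *= db.get((pred, entity), 0.0)
    return prob

# Rank entities by their probability of belonging to the query's denotation.
query = ["republican", "from_texas"]
ranking = sorted({e for _, e in prob_db},
                 key=lambda e: denotation_prob(e, query, prob_db),
                 reverse=True)
```

This independence assumption is what makes exact evaluation of safe conjunctive queries a simple product, as noted above; richer correlations would require general Markov Logic inference.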

Discussion
This paper presents an approach for compositional semantics with an open predicate vocabulary. Our approach defines a probabilistic model over denotations (sets of Freebase entities) conditioned on an input text. The model has two components: a rule-based semantic parser that produces a logical form for the text, and a probabilistic database that defines a distribution over denotations for each predicate. A training phase learns the probabilistic database by applying probabilistic matrix factorization with a query/answer ranking objective to logical forms derived from a large, entity-linked web corpus. An experimental analysis demonstrates that this approach outperforms several baselines and can answer many questions that cannot be answered by semantic parsing into Freebase.
Our approach learns a model-theoretic semantics for natural language text tied to Freebase, as do some semantic parsers, except with an open predicate vocabulary. This difference influences several other aspects of the system's design. First, because no knowledge base with the necessary knowledge exists, the system is forced to learn its knowledge base (in the form of a probabilistic database). Second, the system can directly map syntactic CCG parses to logical forms, as it is no longer necessary to map words to a closed vocabulary of knowledge base predicates. In some sense, our approach is the exact opposite of the typical semantic parsing approach: usually, the semantic parser is learned and the knowledge base is fixed; here, the knowledge base is learned and the semantic parser is fixed. From a machine learning perspective, training a probabilistic database via matrix factorization is easier than training a semantic parser, as there are no difficult inference problems. However, it remains to be seen whether a learned knowledge base can achieve recall comparable to that of a fixed knowledge base on the subset of language it covers.
There are two limitations of this work. The most obvious limitation is the restriction to existentially quantified conjunctions of predicates. This limitation is not inherent to the approach, however, and can be removed in future work by using a system like Boxer (Bos, 2008) for semantic parsing. A more serious limitation is the restriction to one- and two-argument predicates, which prevents our system from representing events and n-ary relations. Conceptually, a similar matrix factorization approach could be used to learn embeddings for n-ary entity tuples; however, in practice, the sparsity of these tuples makes learning challenging. Developing methods for learning n-ary relations is an important problem for future work.
A direction for future work is scaling up the size of the training corpus to improve recall. Low recall is the main limitation of our current system as demonstrated by the experimental analysis. Both stages of training, the data generation and matrix factorization, can be parallelized using a cluster. All of the relation instances in Freebase can also be added to the training corpus. It should be feasible to increase the quantity of training data by a factor of 10-100, for example, to train on all of ClueWeb. Scaling up the training data may allow a semantic parser with an open predicate vocabulary to outperform comparable closed vocabulary systems.