A Novel Feature-based Bayesian Model for Query Focused Multi-document Summarization

Supervised learning methods and LDA based topic model have been successfully applied in the field of multi-document summarization. In this paper, we propose a novel supervised approach that can incorporate rich sentence features into Bayesian topic models in a principled way, thus taking advantages of both topic model and feature based supervised learning methods. Experimental results on DUC2007, TAC2008 and TAC2009 demonstrate the effectiveness of our approach.


Introduction
Query-focused multi-document summarization (Nenkova et al., 2006;Wan et al., 2007;Ouyang et al., 2010) can facilitate users to grasp the main idea of documents. In query-focused summarization, a specific topic description, such as a query, which expresses the most important topic information is proposed before the document collection, and a summary would be generated according to the given topic.
Supervised models have been widely used in summarization (Li, et al., 2009, Shen et al., 2007, Ouyang et al., 2010. Supervised models usually regard summarization as a classification or regression problem and use various sentence features to build a classifier based on labeled negative or positive samples. However, existing supervised approaches seldom exploit the intrinsic structure among sentences. This disadvantage usually gives rise to serious problems such as unbalance and low recall in summaries. Recently, LDA-based (Blei et al., 2003) Bayesian topic models have widely been applied in multidocument summarization in that Bayesian approaches can offer clear and rigorous probabilistic interpretations for summaries (Daume and Marcu, 2006;Haghighi and Vanderwende, 2009;Jin et al., 2010;Mason and Charniak, 2011;Delort and Alfonseca, 2012). Exiting Bayesian approaches label sentences or words with topics and sentences which are closely related with query or can highly generalize documents are selected into summaries. However, LDA topic model suffers from the intrinsic disadvantages that it only uses word frequency for topic modeling and can not use useful text features such as position, word order etc (Zhu and Xing, 2010). For example, the first sentence in a document may be more important for summary since it is more likely to give a global generalization about the document. It is hard for LDA model to consider such information, making useful information lost.
It naturally comes to our minds that we can improve summarization performance by making full use of both useful text features and the latent semantic structures from by LDA topic model. One related work is from Celikyilmaz and Hakkani-Tur (2010). They built a hierarchical topic model called Hybhsum based on LDA for topic discovery and assumed this model can produce appropriate scores for sentence evaluation. Then the scores are used for tuning the weights of various features that helpful for summary generation. Their work made a good step of combining topic model with feature based supervised learning. However, what their approach confuses us is that whether a topic model only based on word frequency is good enough to generate an appropriate sentence score for regression. Actually, how to incorporate features into LDA topic model has been a open problem. Supervised topic models such as sLDA (Blei and MacAuliffe 2007) give us some inspiration. In sLDA, each document is associated with a labeled feature and sLDA can integrate such feature into LDA for topic modeling in a prin-cipled way.
With reference to the work of supervised LDA models, in this paper, we propose a novel sentence feature based Bayesian model S-sLDA for multidocument summarization. Our approach can naturally combine feature based supervised methods and topic models. The most important and challenging problem in our model is the tuning of feature weights. To solve this problem, we transform the problem of finding optimum feature weights into an optimization algorithm and learn these weights in a supervised way. A set of experiments are conducted based on the benchmark data of DUC2007, TAC2008 and TAC2009, and experimental results show the effectiveness of our model.
The rest of the paper is organized as follows. Section 2 describes some background and related works. Section 3 describes our details of S-sLDA model. Section 4 demonstrates details of our approaches, including learning, inference and summary generation. Section 5 provides experiments results and Section 6 concludes the paper.

Related Work
A variety of approaches have been proposed for query-focused multi-document summarizations such as unsupervised (semi-supervised) approaches, supervised approaches, and Bayesian approaches.
Unsupervised (semi-supervised) approaches such as Lexrank (Erkan and Radex, 2004), manifold (Wan et al., 2007) treat summarization as a graphbased ranking problem. The relatedness between the query and each sentence is achieved by imposing querys influence on each sentence along with the propagation of graph. Most supervised approaches regard summarization task as a sentence level two class classification problem. Supervised machine learning methods such as Support Vector Machine(SVM) (Li, et al., 2009), Maximum Entropy (Osborne, 2002) , Conditional Random Field (Shen et al., 2007 and regression models (Ouyang et al., 2010) have been adopted to leverage the rich sentence features for summarization.
Recently, Bayesian topic models have shown their power in summarization for its clear probabilistic interpretation. Daume and Marcu (2006) proposed Bayesum model for sentence extraction based on query expansion concept in information retrieval. Haghighi and Vanderwende (2009) proposed topicsum and hiersum which use a LDA-like topic model and assign each sentence a distribution over background topic, doc-specific topic and content topics. Celikyilmaz and Hakkani-Tur (2010) made a good step in combining topic model with supervised feature based regression for sentence scoring in summarization. In their model, the score of training sentences are firstly got through a novel hierarchical topic model. Then a featured based support vector regression (SVR) is used for sentence score prediction. The problem of Celikyilmaz and Hakkani-Turs model is that topic model and feature based regression are two separate processes and the score of training sentences may be biased because their topic model only consider word frequency and fail to consider other important features. Supervised feature based topic models have been proposed in recent years to incorporate different kinds of features into LDA model. Blei (2007) proposed sLDA for document response pairs and Daniel et al. (2009) proposed Labeled LDA by defining a one to one correspondence between latent topic and user tags. Zhu and Xing (2010) proposed conditional topic random field (CTRF) which addresses feature and independent limitation in LDA. The hierarchical Bayesian LDA (Blei et al., 2003) models the probability of a corpus on hidden topics as shown in Figure 1(a). Let K be the number of topics , M be the number of documents in the corpus and V be vocabulary size. The topic distribution of each document θ m is drawn from a prior Dirichlet distribution Dir(α), and each document word w mn is sampled from a topic-word distribution φ z specified by a drawn from the topic-document distribution θ m . β is a K × M dimensional matrix and each β k is a distribution over the V terms. The generating procedure of LDA is illustrated in Figure 2. θ m is a mixture proportion over topics of document m and z mn is a K dimensional variable that presents the topic assignment distribution of different words.
Supervised LDA (sLDA) (Blei and McAuliffe 2007) is a document feature based model and intro- duces a response variable to each document for topic discovering, as shown in Figure 1(b). In the generative procedure of sLDA, the document pairwise la-

Problem Formulation
Here we firstly give a standard formulation of the task. Let K be the number of topics, V be the vocabulary size and M be the number of documents.
where N m denotes the number of sentences in m th document. Each sentence is represented with a collection of words {w msn } n=Nms n=1 where N ms denotes the number of words in current sentence.
− − → Y ms denotes the feature vector of current sentence and we assume that these features are independent.

S-sLDA
z ms is the hidden variable indicating the topic of current sentence. In S-sLDA, we make an assumption that words in the same sentence are generated from the same topic which was proposed by Gruber (2007). z msn denotes the topic assignment of current word. According to our assumption, z msn =  Figure 4: generation process for S-sLDA z ms for any n ∈ [1, N ms ]. The generative approach of S-sLDA is shown in Figure 3 and Figure 4. We can see that the generative process involves not only the words within current sentence, but also a series of sentence features. The mixture weights over features in S-sLDA are defined with a generalized linear model (GLM).
Here we assume that each sentence has T features and − − → Y ms is a T × 1 dimensional vector. η is a K × T weight matrix of each feature upon topics, which largely controls the feature generation procedure. Unlike s-LDA where η is a latent variable estimated from the maximum likelihood estimation algorithm, in S-sLDA the value of η is trained through a supervised algorithm which will be illustrated in detail in Section 3.

Posterior Inference and Estimation
Given a document and labels for each sentence, the posterior distribution of the latent variables is: Eqn.
(2) cannot be efficiently computed. By applying the Jensens inequality, we obtain a lower bound of the log likelihood of document where H(q) = −E[logq] and it is the entropy of variational distribution q is defined as here γ a K-dimensional Dirichlet parameter vector and multinomial parameters. The first, third and forth terms of Eqn.
(3) are identical to the corresponding terms for unsupervised LDA (Blei et al., 2003). The second term is the expectation of log probability of features given the latent topic assignments.
The Bayes estimation for S-sLDA model can be got via a variational EM algorithm. In EM procedure, the lower bound is firstly minimized with respect to γ and φ, and then minimized with α and β by fixing γ and φ. E-step: The updating of Dirichlet parameter γ is identical to that of unsupervised LDA, and does not involve where Ψ(·) denotes the log Γ function. m s denotes the document that current sentence comes from and Y st denotes the t th feature of sentence s.

M-step:
The M-step for updating β is the same as the procedure in unsupervised LDA, where the probability of a word generated from a topic is proportional to the number of times this word assigned to the topic. In this subsection, we describe how we learn the feature weight η in a supervised way. The learning process of η is a supervised algorithm combined with variational inference of S-sLDA. Given a topic description Q 1 and a collection of training sentences S from related documents, human assessors assign a score v(v = −2, −1, 0, 1, 1) to each sentence in S. The score is an integer between −2 (the least desired summary sentences) and +2 (the most desired summary sentences), and score 0 denotes neutral atti- is the set containing sentences with score v. Let φ Qk denote the probability that query is generated from topic k. Since query does not belong to any document, we use the following strategy to leverage φ Qk (9) In Equ.(9), w∈Q β kw denotes the probability that all terms in query are generated from topic k and 1 can be seen as the average probability that all documents in the corpus are talking about topic k. Eqn. (9) is based on the assumption that query topic is relevant to the main topic discussed by the document corpus. This is a reasonable assumption and most previous LDA summarization models are based on similar assumptions.
Next, we define φ Ov,k for sentence set O v , which can be interpreted as the probability that all sentences in collection O v are generated from topic k.
|O v | denotes the number of sentences in set O v . Inspired by the idea that desired summary sentences would be more semantically related with the query, we transform problem of finding optimum η to the following optimization problem: T t=1 η kt = 1 (11) 1 We select multiple queries and their related sentences for training where KL(O v ||Q) is the Kullback-Leibler divergence between the topic and sentence set O v as shown in Eqn. (12).
In Eqn. (11), we can see that O 2 , which contain desirable sentences, would be given the largest penalty for its KL divergence from Query. The case is just opposite for undesired set.
Our idea is to incorporate the minimization process of Eqn.(11) into variational inference process of S-sLDA model. Here we perform gradient based optimization method to minimize Eqn.(11). Firstly, we derive the gradient of L(η) with respect to η.
For simplification, we regard β and γ as constant during updating process of η, so ∂φ Qk ∂ηxy = 0. 2 We can further get first derivative for each labeled sentence.

Feature Space
Lots of features have been proven to be useful for summarization (Louis et al., 2010  In addition, we also use the commonly used features including sentence position, paragraph position, sentence length and sentence bigram frequency.

E-step
initialize φ 0 sk := 1/K for all i and s. initialize γ mi := α mi + N )m/K for all i. initialize η kt = 0 for all k and t.

Sentence Selection Strategy
Next we explain our sentence selection strategy. According to our intuition that the desired summary should have a small KL divergence with query, we propose a function to score a set of sentences Sum. We use a decreasing logistic function ζ(x) = 1/(1+ e x ) to refine the score to the range of (0,1).
Let Sum denote the optimum update summary. We can get Sum by maximizing the scoring function.

Score(Sum)
1. Learning: Given labeled set O v , learn the feature weight vector η using algorithm in Figure 5. 2. Given new data set and η, use algorithm in section 3.3 for inference. (The only difference between this step and step (1) is that in this step we do not need minimize L(η). 3. Select sentences for summarization from algorithm in Figure 6. A greedy algorithm is applied by adding sentence one by one to obtain Sum . We use G to denote the sentence set containing selected sentences. The algorithm first initializes G to Φ and X to SU . During each iteration, we select one sentence from X which maximize Score(s m ∪ G). To avoid topic redundancy in the summary, we also revise the MMR strategy (Goldstein et al., 1999;Ouyang et al., 2007) in the process of sentence selection. For each s m , we compute the semantic similarity between s m and each sentence s t in set Y in Eqn. (18).
We need to assure that the value of semantic similarity between two sentences is less than T h sem . The whole procedure for summarization using S-sLDA model is illustrated in Figure 6. T h sem is set to 0.5 in the experiments.

Experiments Set-up
The query-focused multi-document summarization task defined in DUC 3 (Document Understanding Conference) and TAC 4 (Text Analysis Conference) evaluations requires generating a concise and well organized summary for a collection of related news documents according to a given query which describes the users information need. The query usually consists of a title and one or more narrative/question sentences. The system-generated summaries for DUC and TAC are respectively limited to 250 words and 100 words. Stop-words in both documents and queries are removed using a stop-word list of 598 words, and the remaining words are stemmed by Porter Stemmer 6 . As for the automatic evaluation of summarization, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures, including ROUGE-1, ROUGE-2, and ROUGE-SU4 7 and their corresponding 95% confidence intervals, are used to evaluate the performance of the summaries. In order to obtain a more comprehensive measure of summary quality, we also conduct manual evaluation on TAC data with reference to (Haghighi and Vanderwende, 2009;Celikyilmaz and Hakkani-Tur, 2011;Delort and Alfonseca, 2011).

Comparison with other Bayesian models
In this subsection, we compare our model with the following Bayesian baselines: KL-sum: It is developed by Haghighi and Vanderwende (Lin et al., 2006) by using a KLdivergence based sentence selection strategy.
where P s is the unigram distribution of candidate summary and Q d denotes the unigram distribution of document collection. Sentences with higher ranking score is selected into the summary. HierSum: A LDA based approach proposed by Haghighi and Vanderwende (2009), where unigram distribution is calculated from LDA topic model in Equ.(14).
For fair comparison, baselines use the same proprecessing methods with our model and all sum-maries are truncated to the same length of 100 words.
From Table 1 and  see that among all the Bayesian baselines, Hybhsum achieves the best result. This further illustrates the advantages of combining topic model with supervised method. In Table 1, we can see that our S-sLDA model performs better than Hybhsum and the improvements are 3.4% and 3.7% with respect to ROUGE-2 and ROUGE-SU4 on TAC2008 data. The comparison can be extended to TAC2009 data as shown in Table 2: the performance of S-sLDA is above Hybhsum by 4.3% in ROUGE-2 and 5.1% in ROUGE-SU4. It is worth explaining that these achievements are significant, because in the TAC2008 evaluation, the performance of the top ranking systems are very close, i.e. the best system is only 4.2% above the 4th best system on ROUGE-2 and 1.2% on ROUGE-SU4.

Comparison with other baselines.
In this subsection, we compare our model with some widely used models in summarization. Manifold: It is the one-layer graph based semisupervised summarization approach developed by Wan et al.(2008). The graph is constructed only considering sentence relations using tf-idf and neglects topic information.
LexRank: Graph based summarization approach (Erkan and Radev, 2004), which is a revised version of famous web ranking algorithm PageRank. It is an unsupervised ranking algorithms compared with Manifold.
SVM: A supervised method -Support Vector Machine (SVM) (Vapnik 1995) which uses the same features as our approach.
MEAD: A centroid based summary algorithm by Radev et al. (2004). Cluster centroids in MEAD consists of words which are central not only to one article in a cluster, but to all the articles. Similarity is measure using tf-idf.
At the same time, we also present the top three participating systems with regard to ROUGE-2 on TAC2008 and TAC2009 for comparison, denoted as (denoted as SysRank 1st, 2nd and 3rd) (Gillick et al., 2008;Zhang et al., 2008;Gillick et al., 2009;Varma et al., 2009). The ROUGE scores of the top TAC system are directly provided by the TAC evaluation.
From Table 3 and Table 4, we can see that our approach outperforms the baselines in terms of ROUGE metrics consistently. When compared with the standard supervised method SVM, the relative improvements over the ROUGE-1, ROUGE-2 and ROUGE-SU4 scores are 4.3%, 13.1%, 8.3% respectively on TAC2008 and 7.2%, 14.9%, 14.3% on TAC2009. Our model is not as good as top participating systems on TAC2008 and TAC2009. But considering the fact that our model neither uses sentence compression algorithm nor leverage domain knowledge bases like Wikipedia or training data, such small difference in ROUGE scores is reasonable.

Manual Evaluations
In order to obtain a more accurate measure of summary quality for our S-sLDA model and Hybhsum, we performed a simple user study concerning the following aspects: (1) Overall quality: Which summary is better overall? (2) Focus: Which summary contains less irrelevant content? (3)Responsiveness: Which summary is more responsive to the query. (4) Non-Redundancy: Which summary is less redundant? 8 judges who specialize in NLP participated in the blind evaluation task. Evaluators are presented with two summaries generated by S-sLDA   and Hybhsum, as well as the four questions above. Then they need to answer which summary is better (tie). We randomly select 20 document collections from TAC 2008 data and randomly assign two summaries for each collection to three different evaluators to judge which model is better in each aspect. As we can see from  Responsiveness, S-sLDA model outputs Hybhsum based on t-test on 95% confidence level. Table 6 shows the example summaries generated respectively by two models for document collection D0803A-A in TAC2008, whose query is "Describe the coal mine accidents in China and actions taken". From table 6, we can see that each sentence in these two summaries is somewhat related to topics of coal mines in China. We also observe that the summary in Table 6(a) is better than that in Table 6(b), tending to select shorter sentences and provide more information. This is because, in S-sLDA model, topic modeling is determined simultaneously by various features including terms and other ones such as sentence length, sentence position and so on, which can contribute to summary quality. As we can see, in Table 6(b), sentences (3) and (5) provide some unimportant information such as "somebody said", though they contain some words which are related to topics about coal mines.
(1)China to close at least 4,000 coal mines this year: official (2)By Oct. 10 this year there had been 43 coal mine accidents that killed 10 or more people, (3)Officials had stakes in coal mines. (4)All the coal mines will be closed down this year. (5) In the first eight months, the death toll of coal mine accidents rose 8.5 percent last year. (6) The government has issued a series of regulations and measures to improve the coun.try's coal mine safety situation. (7)The mining safety technology and equipments have been sold to countries. (8)More than 6,000 miners died in accidents in China (1) In the first eight months, the death toll of coal mine accidents across China rose 8.5 percent from the same period last year. (2)China will close down a number of ill-operated coal mines at the end of this month, said a work safety official here Monday.
(3) Li Yizhong, director of the National Bureau of Production Safety Supervision and Administration, has said the collusion between mine owners and officials is to be condemned.
(4)from January to September this year, 4,228 people were killed in 2,337 coal mine accidents. (5) Chen said officials who refused to register their stakes in coal mines within the required time

Conclusion
In this paper, we propose a novel supervised approach based on revised supervised topic model for query-focused multi document summarization. Our approach naturally combines Bayesian topic model with supervised method and enjoy the advantages of both models. Experiments on benchmark demonstrate good performance of our model.