As the amount of online information increases, the task of query-focused multi-document summarization has received growing attention. Query-focused summarization is the process of summarizing document clusters in response to a specific input query. This literature review surveys the recent growth in extractive query-focused multi-document summarization methods and explains the state-of-the-art research in this area. It also describes the effectiveness, the limitations, and the most common quality evaluation measures of the different methods. The review acknowledges that some challenging problems persist, most importantly the need for a collective approach that combines varied methods to produce a coherent, non-redundant, and concise summary in response to a user query. This review could be the first step in developing a hybrid of different approaches that yields good results for varied types of documents.
In the modern world, the number of documents available on the World Wide Web has grown rapidly. As a result, retrieving useful information from the vast amount of electronically accessible documents has become a major challenge. Text summarization can reduce this problem efficiently. An automatic text summarization system takes a single document or a cluster of documents as input and produces as output a succinct and coherent summary of the essential information. In recent years, various summarization applications have been developed as fundamental and efficient tools for document understanding, organization, and navigation.
Query-focused multi-document text summarization is the task of returning a short, concise summary in response to a query entered by a user. Recently, this variant of automatic summarization has been a very active research area. In fact, text summarization has received an increasing amount of attention given the plethora of websites and the abundance of information available on the internet. It is a valid and fundamental tool for understanding and navigating online documents such as news, blogs, facts, and articles. Modern search engines use query-focused text summarization techniques to produce a summary for each retrieved document, which helps users grasp the critical content of a document. This saves users time and therefore increases the quality of the search engine’s service. Moreover, it is important in marketing: through user queries, companies get an idea of what interests a particular user, and that information can later be stored in cookies and other information-gathering tools.
In this literature review, query-focused multi-document summarization shall be assessed in detail, and different techniques, such as graph-based, machine-learning, and integer linear programming approaches, shall be discussed. Furthermore, this literature review will explore the hypothesis that multiple approaches can be used in combination to improve the quality of the produced summary. Specifically, the prospect of combining integer linear programming, graph-based, and machine learning approaches will be investigated.
The paper is organized as follows. The next section gives the background on query-focused text summarization needed to understand the topic thoroughly. Section three explains the purpose and the hypothesis of this literature review, including the challenges faced and the different approaches, mainly graph-based, machine-learning, and integer linear programming. The fourth section compares these approaches. Section five describes the research direction and proposal. Finally, section six presents the summary and conclusion.
The explosion of online information makes knowledge extraction a complicated process for users. As a result, many relevant and essential documents are discarded. Automatic summarization now receives considerable attention in the natural language processing field due to its broad practical applications.
Text summarization is the process of shortening a longer text such that its central message is preserved while the shortened text remains concise and fluent. Several distinctions are frequently made in the automatic text summarization literature. Extractive and abstractive summaries are the two major categories 16. An extractive summary is created by joining chosen sentences from the original text; these sentences appear in the output summary precisely as they appear in the original text, without any modification. An abstractive summary, on the other hand, requires the system to read and understand the text in order to recognize its content, and then to paraphrase the primary information of the input document so that the summary looks more human-like. Automatic abstractive summarization is, in general, much harder to construct than automatic extractive summarization.
Moreover, depending on the number of input documents, summarization systems can be broadly classified as single-document or multi-document. Early work in automatic summarization dealt with only one document, which the system reduces to a shorter representation. Later, with the progression of research and the popularization of the internet, multi-document summarization became more useful. Considering the tremendous amount of redundancy on the web, summarization systems are more helpful if they provide a brief digest of multiple documents on the same topic. Multi-document summarization is thus the task of producing a summary from clusters of related documents. The output summary enables users to understand the document set and helps identify the essential common aspects of these documents as well as their differences.
Furthermore, an output summary can be generic, query-focused, or an update summary. Typically, a generic summary gives an overview of the important information in the input (a single document or multiple documents) as a whole, helping the reader quickly grasp what the input is about, possibly without reading the documents themselves. In this case, the importance of information is determined only with respect to the content of the whole input. In contrast, when the summary is produced in response to a user query, the query itself determines what relevant information should be included in the output summary. In other words, query-focused summarization produces a summary that answers the user’s information need as expressed in the input query. Finally, an update summary is time sensitive: the output summary must convey the important developments of an event beyond what the user has already read.
The two main issues for automatic summarization methods are the ranking and selection problems 25. The ranking problem is how to rank the sentences in the document set; it requires a model that assigns a relevance score to each sentence based on the given query. The selection problem is how to select a subset of those ranked sentences to form the summary. This demands a system that increases variety and reduces redundancy so that the summary is maximally informative within the limited length.
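To make the selection problem concrete, the following is a minimal sketch of one possible greedy heuristic, not a method taken from the cited works. It assumes the ranking step has already produced (score, sentence) pairs. The function names, the Jaccard word-overlap test for redundancy, and the `max_overlap` threshold are all illustrative assumptions.

```python
def jaccard(a, b):
    """Word-set overlap between two sentences (0.0 = disjoint, 1.0 = identical sets)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_summary(scored_sentences, max_words, max_overlap=0.5):
    """Greedily pick high-scoring sentences under a word budget.

    scored_sentences: list of (relevance_score, sentence) pairs (assumed to
    come from a separate ranking step).
    max_overlap: hypothetical redundancy threshold; a candidate is skipped if
    its Jaccard overlap with any already selected sentence exceeds it.
    """
    summary = []
    used_words = 0
    ranked = sorted(scored_sentences, key=lambda pair: pair[0], reverse=True)
    for _score, sentence in ranked:
        length = len(sentence.split())
        if used_words + length > max_words:
            continue  # would exceed the length limit
        if any(jaccard(sentence, chosen) > max_overlap for chosen in summary):
            continue  # too similar to something already in the summary
        summary.append(sentence)
        used_words += length
    return summary
```

In this sketch the relevance score drives the order in which candidates are considered, while the overlap test and the word budget enforce the variety and length constraints described above.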
Generally, from an information retrieval perspective, the system represents the entered query and each sentence in the document set as vectors, and the similarity score between them is calculated. The top k ranked sentences from each document are then retrieved as a response to that query. Next, the selection process is applied to create the best summary. One of the most frequently used similarity measures is the cosine measure, since it is very sensitive to the text vector pattern. It calculates the cosine of the angle between two feature vectors: the sum of the products of the corresponding term weights in the two sentences, divided by the product of the square roots of each sentence’s sum of squared term weights 4.
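The vector-space ranking step described above can be sketched as follows. This is a minimal illustration that assumes raw term counts as vector weights and whitespace tokenization; the function names are hypothetical and are not taken from the cited works.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-count vector (lowercased, whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


def top_k_sentences(query, sentences, k):
    """Return the k (score, sentence) pairs most similar to the query."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(s)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

For example, `top_k_sentences("solar power", sentences, 2)` would return the two sentences whose term vectors point most nearly in the same direction as the query vector, regardless of sentence length.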
In fact, the basic scoring factor used in traditional information retrieval is term frequency (tf), which is simply the number of occurrences of the term (t) in a sentence. Using the (tf) score alone is not enough. For example, even though some terms have a higher frequency than others, this does not necessarily mean that they are more relevant to the query. This is clear in the case of function words such as “and”, “then”, and “is”, which are not essential. Therefore, (tf) should be used in combination with other factors to get more accurate results. The inverse sentence frequency (isf) measure has been introduced to improve the discriminating power of terms. It represents the global weighting factor of term (t). It assigns a very low score to a term that appears in all sentences in the collection, indicating that the term is of no use as a sentence discriminator 4,17.
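The text does not give a formula for isf, so the sketch below assumes a common formulation by analogy with inverse document frequency: isf(t) = log(N / n_t), where N is the number of sentences and n_t is the number of sentences containing t. Under this assumption a term that occurs in every sentence receives a weight of zero. The function name and the whitespace tokenization are illustrative.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Weight each term in each sentence by tf * isf.

    Assumed formulation: isf(t) = log(N / n_t), with N the number of
    sentences and n_t the number of sentences that contain term t.
    """
    tokenized = [s.lower().split() for s in sentences]
    n_sentences = len(tokenized)

    # Count, for every term, how many sentences it appears in.
    sentence_freq = Counter()
    for tokens in tokenized:
        sentence_freq.update(set(tokens))

    isf = {t: math.log(n_sentences / sf) for t, sf in sentence_freq.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * isf[t] for t in tf})
    return vectors
```

With the sentences “the cat sat”, “the dog ran”, and “the cat ran”, the term “the” is weighted zero in every vector, while the rarer term “sat” receives a higher weight than “cat”.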
3 Purpose and Hypothesis
Query-focused multi-document summarization has the potential to ease the struggle of retrieving beneficial information from mass data and to improve the effectiveness of obtaining and utilizing information. In this literature review, extractive query-focused multi-document summarization will be discussed. Three summarization approaches will be examined: graph-based, machine-learning, and integer linear programming. The hypothesis is that combining them will produce a better output summary, since each approach can address some of the limitations of the others.
3.1 Challenges of query-focused multi-document summarization
Automatic text summarization faces many difficulties, especially when it is query-focused. He, Shao, Yang, and Ma 1 posit that query-based summarization is complex because the user query has to determine the relevancy of sentences. The summary is expected not only to capture as much as possible of the essential information contained in the whole document set, but also to ensure that the information is biased toward the given query. In other words, the query decides which sentences are suitable for inclusion in the summary. Query characteristics should therefore be taken into account during summarization by computing a correlation measure between the entered query and every sentence in the documents and then ranking the sentences by the resulting scores. Considering only exact matches between the query’s terms and the sentences’ terms might not always satisfy the information need. Furthermore, applying this to a multi-document set makes the task potentially more challenging, since the differences and similarities between the documents must be handled. Recently, several techniques have been developed for assigning the importance of a sentence with respect to a query. These methods rank sentences using different types of sentence features together with the information in the query, such as the relevance feature, the information richness feature 1, the term overlap feature 18, the raw frequency of query terms, the log-likelihood ratio 5, sentence position, and sentence length 14.
Moreover, redundancy is a common problem faced by almost every approach to text summarization 12. A summary must be informative and non-redundant. In single-document summarization, sentences are mostly unique and rarely carry redundant information. Multiple documents, on the other hand, certainly contain overlapping information. The information in the document set might be repeated, or the same information might be expressed with different sentence structures without adding anything new. This problem makes summarization more complicated, since it requires knowledge synthesis and discovery to analyze the sentences in all documents and filter out redundant and repetitive information 13.
Also, time efficiency is an essential issue for query-focused summarization. In practice, the user needs the response to the query in the shortest time possible. This task therefore combines the difficulties of both information retrieval and automatic summarization.
Tal, Cohen, and Elhadad 9 examined the challenges faced in evaluating query-based text summarization. Their study revealed that the Document Understanding Conferences (DUC) datasets make it difficult to determine the effectiveness of algorithms.
Additionally, it is still difficult to connect a document’s semantic meaning with its basic sentence units. Some studies try to overcome this problem through deep learning 25.
Last but not least, the most commonly used score-based methods do not guarantee that the generated summaries will contain words that were part of the query 3. Hence, query-focused text summarization still has a long way to go to overcome these challenges.
Integer linear programming and machine learning have been found to improve query-focused text summarization significantly. However, graph-based approaches are the most common. Log-likelihood and frequency-based calculations, which determine the relevancy between a sentence vector and a topic vector, are used as sub-methods in these approaches. We discuss these approaches and their possible combinations in this section.
3.2.1 Graph-Based Text Summarization
The graph-based approach is the most frequently used one for automatic summarization. Basically, it treats a document as a graph whose vertices and edges are formed from similarities and relevancy to the query sentence. In the general algorithm, sentences are represented as nodes, and the correlation scores between these nodes are reflected on the edges. Graph-based approaches predominantly rely on ranking algorithms. Many variants of graph-based text summarization methods now exist, with varying efficiency. A multi-modality system developed by Xiaojun and Xia 16 improves on the manifold-ranking algorithm by taking into account intra-document relationships and relevancy as well as similarities among other documents, and it showed improvements over the basic manifold-ranking algorithm. Redundancy, as mentioned earlier, is a common bottleneck in text summarization. Balaji, Geetha, and Parthasarathi 12 proposed a model that successfully reduced redundancy in graph-based query-focused text summarization. In this research, a global semantic graph is generated by graph matching, which reduces how often similar sentences are shown. In other words, the query graph is integrated into the spreading activation algorithm, along with the semantic relations, which is used to reduce redundant subgraphs 12.
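The general graph-based scheme (sentences as nodes, similarity scores on edges, a ranking algorithm on top) can be sketched as below. This is a generic query-biased, PageRank-style random walk written for illustration; it is not the manifold-ranking or spreading-activation algorithm of the cited works. The function names, the damping factor, and the use of term-count cosine similarity for edge weights are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va if t in vb)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def graph_rank(query, sentences, damping=0.85, iterations=50):
    """Score sentences with a query-biased random walk on a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []

    # Edge weights: pairwise sentence similarity, no self-loops.
    weights = [[cosine(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]

    # Query bias: similarity of each node to the query, normalized to sum to 1.
    relevance = [cosine(query, s) for s in sentences]
    total = sum(relevance)
    bias = [r / total for r in relevance] if total > 0 else [1.0 / n] * n

    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Score flowing into node j from its neighbours; isolated nodes
            # (zero outgoing weight) pass nothing on in this simple sketch.
            incoming = sum(scores[i] * weights[i][j] / out_weight[i]
                           for i in range(n) if out_weight[i] > 0)
            new_scores.append((1 - damping) * bias[j] + damping * incoming)
        scores = new_scores
    return scores
```

Sentences that are both similar to the query and well connected to other relevant sentences accumulate the highest scores, while a sentence with no similarity to the query or to its neighbours ends with a score of zero.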
An efficient query-focused summary is expected to have minimum redundancy and maximum query relevancy. Existing basic graph-based methods do not consider semantic meaning, so they might assign two sentences with low lexical similarity to two different topic clusters even if they are semantically very similar. As a result, the top-scored sentences in each cluster might actually be very similar, which causes redundancy. Xiong and Ji 14 formulated a hypergraph method that considers many factors, such as relevancy and similarity to the query as well as feature-based approaches, which help in determining the rank of sentences. In this approach, the “cosine distance” between sentences is calculated before inclusion in the hypergraph 14. Similar sentences are clustered into a group, and a graph is built over them. These hypergraph-based methods yielded encouraging results on test data.
Also, Zhao, Wu, and Huang 11 used a graph-ranking algorithm that relied on “pair-wise similarities” and “affinity” matrices to summarize documents. The reported results were very positive, and the authors compared the algorithm with state-of-the-art models. Varadarajan and Hristidis 8 used ranking spanning trees to find the relationship between topic sentences and actual sentences across multiple documents; an information retrieval ranking system was used to rate sentences in this model 8. Bhaskar and Bandyopadhyay 15 presented a text summarization algorithm that clusters similar sentences, where similarity is determined by edge scores (each sentence in a document is an edge). The top-scored sentences in the edge clusters are selected and compressed using parsers for relevancy. The approach gave commendable results on standard documents. Graph-based methods are the most commonly used in text summarization, and much of the research described above is based on variations of graph-based algorithms.