Semantic passage segmentation based on sentence topics for question answering
Introduction
Segmenting a document into passages can be helpful in a variety of information access activities. When passages are identified based on semantics, they can serve as units of analysis and access that are smaller and more coherent than whole documents in a variety of text-based information systems.
Passage-level information access offers several advantages over document-based information access. First, the effectiveness of the retrieval system can be improved when passages are used as retrieval units first and their retrieval status values are combined later to rank documents. This is partly because the problem caused by varying document lengths can be avoided. Some similarity measures tend to favor shorter documents and thus may produce anomalous results for collections of documents of different lengths. For fixed-length passages, however, the problem of length normalization is less significant [19], [36], [38].
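As an illustration of the first point, the following minimal sketch ranks documents by combining the retrieval status values of their passages. The function name `rank_documents`, the example scores, and the choice of `max` or `sum` as the combination function are assumptions made for illustration, not the combination scheme of any cited system.

```python
from collections import defaultdict

def rank_documents(passage_scores, combine=max):
    """Rank documents from (doc_id, score) pairs, one pair per passage.

    `combine` merges the passage scores of one document into a single
    document score (for example max or sum); both are illustrative choices.
    """
    by_doc = defaultdict(list)
    for doc_id, score in passage_scores:
        by_doc[doc_id].append(score)
    doc_scores = {doc_id: combine(scores) for doc_id, scores in by_doc.items()}
    # Sort by descending score, breaking ties by document id.
    return sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical passage-level retrieval status values.
scores = [("d1", 0.2), ("d1", 0.9), ("d2", 0.5), ("d2", 0.7), ("d3", 0.1)]
ranking_max = rank_documents(scores, combine=max)
ranking_sum = rank_documents(scores, combine=sum)
```

With `max`, a document is ranked by its single best passage; with `sum`, documents with several moderately good passages can overtake it, so the choice of combination function changes the ranking.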
Second, segmented passages can help in the operation of other text-based information systems. For example, identifying the relevant passages of a document, which are smaller and more coherent than the document containing them, helps a question answering (QA) system locate answers. Empirical studies on QA systems show that extracting answers from passages is more effective than extracting answers from documents [9], [13], [25], [32].
Finally, passages are more convenient for presentation of retrieval results than full documents that may be overly long and contain several intertwined topics. That is, passages allow users to focus on relevant parts of a document rather than on the document as a whole.
There are three approaches to passage boundary identification. The first uses the structural information of a document, i.e., paragraph or section boundaries, to delimit passages [36]. The second defines passages of fixed length [5], [7], [25]. The third uses semantic clues or topicality to identify passages [2], [14], [33]. Determining passages based on topicality has proven successful in QA systems.
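The first two approaches can be sketched in a few lines. The example below is a simplified illustration, assuming that paragraphs are separated by blank lines and that passage length is measured in whitespace-separated words; the function names and the `size` and `overlap` parameters are hypothetical.

```python
def paragraph_passages(text):
    """Structural passages: split the text at blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def fixed_length_passages(text, size=5, overlap=0):
    """Fixed-length passages: windows of `size` words, optionally overlapping."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return passages

structural = paragraph_passages("one two\n\nthree\n\n\nfour")
windows = fixed_length_passages("a b c d e f g", size=3)
overlapping = fixed_length_passages("a b c d e f g", size=3, overlap=1)
```

Neither function looks at meaning: a fixed-length window can cut a topic in half, and a paragraph can mix several topics, which is the motivation for semantic approaches.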
In this paper, we propose a semantic passage segmentation method based on the topic of a sentence and examine its efficacy in several contexts. The topic of a sentence is defined to be the subject or event that is most often mentioned and deemed significant in a particular domain to which the document containing the sentence belongs. We then show how semantic passages can serve as useful and meaningful units for improving the accuracy of the QA system.
The paper is organized as follows. In Section 2 we introduce related work on passage segmentation and retrieval and its use for QA. Section 3 explains the notions of semantic passages and sentence topics in order to set the stage for semantic passage segmentation. In Section 4, we present our new semantic passage segmentation method in detail. Section 5 describes how we apply semantic passages in our QA system, and Section 6 presents our evaluation results and conclusions.
Section snippets
Related work
Text segmentation has been studied in various contexts, with or without specific applications in mind. We first introduce text segmentation methods developed for flat text with no structural information and those applied to the segmentation of spoken dialogues. In contrast to these are the techniques developed for situations in which some text structure information is available. We then introduce ways in which text segmentation is used for information retrieval and question answering. Throughout
Semantic passages
The ultimate goal of our semantic passage identification is to help the QA system by providing small, meaningful text units as candidate answer passages. More specifically, we use this method in building our QA system that provides answers from a Korean encyclopedia. As such, we studied the characteristics of the encyclopedia and incorporated them in our semantic passage segmentation method.
Our encyclopedia, Pascal™ Encyclopedia (http://www.epascal.co.kr), contains articles in various domains
Semantic passage segmentation
Our semantic passage segmentation process consists of two phases: the topic assignment phase, where sentences are classified into sentence topics; and the sentence reorganization phase, where sentences are grouped into semantic passages. For the first phase, we use terms and other linguistic features in our learning-based classification approach. Compared with typical document classification methods using term distribution statistics, sentence classification needs additional features because
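The two phases can be illustrated with a deliberately simplified sketch. Here the topic assignment phase is replaced by a hypothetical keyword lookup (the paper uses a learned classifier instead), and the reorganization phase is assumed to group all sentences that share a topic label while keeping their original order. The topic names and keywords are invented for the example.

```python
# Hypothetical topic keywords; a stand-in for the learned sentence classifier.
TOPIC_KEYWORDS = {
    "birth": {"born", "birth"},
    "career": {"worked", "published", "appointed"},
}

def assign_topic(sentence, default="other"):
    """Phase 1 (stand-in): pick the first topic whose keywords occur."""
    tokens = {t.strip(".,;:").lower() for t in sentence.split()}
    for topic, keywords in TOPIC_KEYWORDS.items():
        if tokens & keywords:
            return topic
    return default

def segment(sentences):
    """Phase 2 (assumed): group sentences by their assigned topic."""
    passages = {}
    for sentence in sentences:
        passages.setdefault(assign_topic(sentence), []).append(sentence)
    return passages

article = [
    "He was born in 1900.",
    "He worked as a teacher.",
    "His birth town was small.",
    "He published two books.",
]
passages = segment(article)
```

In this toy article the first and third sentences end up in one passage and the second and fourth in another, even though they are not adjacent in the text.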
Use of semantic passages in QA
Many studies on QA show that extracting answers from passages is more effective than from documents [9], [13], [25]. With the use of information extraction (IE) techniques, some QA systems use knowledge bases containing pre-acquired answers. In this section, we describe how the semantic passages are used for passage retrieval and information extraction for QA. Semantic passages can be used as more meaningful units than whole documents for indexing and retrieval as well as extracting facts to
Empirical evaluation
To verify the efficacy of our proposed methods, we conducted a set of experiments. In a preliminary experiment, we first compared several classification algorithms to find the most suitable one for sentence topic classification. After fixing the topic classification method, we evaluated the performance of semantic passage segmentation and then showed the effects of sentence topics in passage retrieval and KB construction for QA.
Conclusion
We proposed a semantic passage segmentation method based on the notion of sentence topics to enhance the performance of our question answering system. We defined a semantic passage as a set of sentences grouped by semantic coherence determined by the topic assigned to individual sentences. We built a sentence topic classifier based on the Maximum Entropy (ME) model using terms and additional linguistic information called sentence patterns as features. Finally, we showed experimental results on the
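To make the idea of an ME sentence topic classifier concrete, the sketch below trains a small conditional maximum entropy (multinomial logistic) model in pure Python. It is only an illustration under stated assumptions: training uses plain stochastic gradient ascent without regularization rather than any procedure described in the paper, the "pattern" feature (presence of a four-digit number) is a hypothetical stand-in for the sentence patterns, and the training sentences and topic labels are invented.

```python
import math

def featurize(sentence):
    """Binary features: lower-cased terms plus one toy 'pattern' feature."""
    tokens = [t.strip(".,;:").lower() for t in sentence.split()]
    feats = {"term=" + t for t in tokens if t}
    # Hypothetical pattern feature: the sentence contains a 4-digit number.
    if any(t.isdigit() and len(t) == 4 for t in tokens):
        feats.add("pattern=has_year")
    return feats

def predict_proba(weights, labels, feats):
    """P(label | features) under the log-linear model (softmax of scores)."""
    scores = [sum(weights.get((label, f), 0.0) for f in feats) for label in labels]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

def train(examples, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({label for _, label in examples})
    data = [(featurize(sentence), label) for sentence, label in examples]
    weights = {}
    for _ in range(epochs):
        for feats, gold in data:
            probs = predict_proba(weights, labels, feats)
            for label in labels:
                # Gradient per active feature: observed minus expected count.
                grad = (1.0 if label == gold else 0.0) - probs[label]
                for f in feats:
                    weights[(label, f)] = weights.get((label, f), 0.0) + lr * grad
    return weights, labels

def classify(weights, labels, sentence):
    probs = predict_proba(weights, labels, featurize(sentence))
    return max(labels, key=lambda label: probs[label])

# Invented training sentences and topic labels.
train_data = [
    ("He was born in 1900.", "birth"),
    ("She was born in Seoul.", "birth"),
    ("He published a novel.", "career"),
    ("She worked as a teacher.", "career"),
]
weights, labels = train(train_data)
topic_a = classify(weights, labels, "He was born in 1850.")
topic_b = classify(weights, labels, "She published a book.")
```

Each weight is attached to a (topic, feature) pair, so terms and pattern features are treated uniformly by the model, and adding a new kind of linguistic feature only requires extending `featurize`.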
References (41)
- et al., Automatic text structuring and summarization, Information Processing and Management (1997)
- et al., Document length normalization, Information Processing and Management (1996)
- et al., Maximum entropy approach to natural language processing, Computational Linguistics (1996)
- T. Brants, F. Chen, I. Tsochantaridis, Topic-based document segmentation with probabilistic latent semantic analysis,...
- M. Caillet, J. Pessiot, M. Amini, P. Gallinari, Unsupervised learning with term clustering for thematic segmentation of...
- J.P. Callan, Passage-retrieval evidence in document retrieval, in: Proceedings of 17th annual international ACM-SIGIR,...
- et al., An approach for constructing complex discriminating surfaces based on Bayesian interference of the maximum entropy, Information Sciences (2004)
- G. Chao, M.G. Dyer, Maximum entropy models for word sense disambiguation, in: Proceedings of 19th International...
- H. Christensen, B. Kolluru, Y. Gotoh, S. Renals, Maximum entropy segmentation on broadcast news, in: Proceedings of the...
- M.R. Choi, J. Hur, M.G. Jang, Constructing Korean lexical concept network for encyclopedia question-answering system,...
- Generalized iterative scaling for log-linear models, The Annals of Mathematical Statistics
- Inducing features of random fields, IEEE Transactions on Pattern Analysis and Machine Intelligence
- WordNet, an electronic lexical database
- Learning to classify text using support vector machines