Semantic passage segmentation based on sentence topics for question answering

doi:10.1016/j.ins.2007.02.038

Information Sciences

Volume 177, Issue 18, 15 September 2007, Pages 3696-3717

Abstract

We propose a semantic passage segmentation method for a Question Answering (QA) system. We define a semantic passage as a set of sentences grouped by semantic coherence, which is determined by the topic assigned to each sentence. Topics are assigned by a sentence classifier based on a statistical classification technique, Maximum Entropy (ME), combined with multiple linguistic features. We ran experiments to evaluate the proposed method and its impact on two application tasks for question answering: passage retrieval and template-filling. The experimental results show that our semantic passage retrieval method using topic matching is more useful than fixed-length passage retrieval. Results on the template-filling task, which is used for information extraction in the QA system, further support the value of the sentence topic assignment method.

Introduction

Segmenting a document into passages can be helpful in a variety of information access activities. When passages are identified based on semantics, they can serve as units of analysis and access that are smaller and more coherent than whole documents, which makes them easier to handle in text-based information systems.

Passage-level information access offers several advantages over document-based information access. First, the effectiveness of the retrieval system can be improved when passages are used as retrieval units first and their retrieval status values are combined later to rank documents. This is partly because the problem caused by varying document lengths can be avoided. Some similarity measures tend to favor shorter documents and thus may produce anomalous results for collections of documents of different lengths. For fixed-length passages, however, the problem of length normalization is less significant [19], [36], [38].
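The idea of retrieving passages first and combining their scores later to rank documents can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy term-overlap similarity, and the choice of taking the best passage score per document are not taken from the paper or from the cited work.

```python
from collections import Counter

def split_fixed(tokens, size):
    """Cut a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def overlap_score(query_tokens, passage_tokens):
    """Toy similarity (an assumption): count query-term occurrences in the passage."""
    counts = Counter(passage_tokens)
    return sum(counts[t] for t in set(query_tokens))

def rank_documents(query, docs, size=5):
    """Score each document by its best-scoring passage, then sort descending."""
    q = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        passages = split_fixed(text.lower().split(), size)
        best = max((overlap_score(q, p) for p in passages), default=0)
        scored.append((doc_id, best))
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

docs = {
    "short": "maximum entropy classifier",
    "long": "a long article about many things " * 3
            + "maximum entropy topic classifier for sentences",
}
ranking = rank_documents("entropy classifier", docs)
print(ranking)
```

Because every passage has at most `size` tokens, the scores being compared come from units of similar length, which is the property the paragraph above attributes to fixed-length passages.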

Second, segmented passages can help in the operation of other text-based information systems. For example, identifying the relevant passages of a document, which are smaller and more coherent than the whole document containing them, helps a question answering (QA) system locate answers. Empirical studies on QA systems show that extracting answers from passages is more effective than extracting answers from documents [9], [13], [25], [32].

Finally, passages are more convenient for presentation of retrieval results than full documents that may be overly long and contain several intertwined topics. That is, passages allow users to focus on relevant parts of a document rather than on the document as a whole.

There are three approaches to passage boundary identification. The first uses the structural information of a document to delimit passages, i.e., paragraph or section boundaries [36]. The second defines passages of fixed length [5], [7], [25]. The third uses semantic clues or topicality to identify passages [2], [14], [33]. Determining passages based on topicality has proven to be remarkably successful in QA systems.
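The difference between the first two approaches can be seen in a small sketch. The helper names, the blank-line paragraph convention, and the word-based window are assumptions chosen for illustration; the cited systems may define structure and length differently.

```python
def structural_passages(text):
    """Structural approach: treat each blank-line-separated paragraph as a passage."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def fixed_length_passages(text, size, stride=None):
    """Fixed-length approach: slide a window of `size` words over the text.

    `stride` defaults to `size`, which gives non-overlapping windows.
    """
    words = text.split()
    step = stride or size
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return passages

text = "Kim was born in Seoul.\n\nHe studied physics and later taught at a university."
by_structure = structural_passages(text)
by_length = fixed_length_passages(text, size=6)
print(by_structure)
print(by_length)
```

In this toy input the first fixed-length window runs across the paragraph break, which shows why boundaries chosen by length alone need not coincide with structural or topical boundaries.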

In this paper, we propose a semantic passage segmentation method based on the topic of a sentence and examine its efficacy in several contexts. The topic of a sentence is defined to be the subject or event that is most often mentioned and deemed significant in a particular domain to which the document containing the sentence belongs. We then show how semantic passages can serve as useful and meaningful units for improving the accuracy of the QA system.

The paper is organized as follows. In Section 2 we introduce related work on passage segmentation and retrieval and its use for QA. Section 3 explains the notions of semantic passages and sentence topics in order to set the stage for semantic passage segmentation. In Section 4, we present our new semantic passage segmentation method in detail. Section 5 describes how we apply semantic passages in our QA system, and Section 6 presents our evaluation results and conclusions.

Section snippets

Related work

Text segmentation has been studied in various contexts, with or without specific applications in mind. We first introduce text segmentation methods developed for flat text with no structural information and those applied to the segmentation of spoken dialogues. Contrasted with these are the techniques developed for situations in which some text structure information is available. We then introduce ways in which text segmentation is used for information retrieval and question answering. Throughout

Semantic passages

The ultimate goal of our semantic passage identification is to help the QA system by providing small, meaningful text units as candidate answer passages. More specifically, we use this method in building our QA system that provides answers from a Korean encyclopedia. As such, we studied the characteristics of the encyclopedia and incorporated them in our semantic passage segmentation method.

Our encyclopedia, Pascal™ Encyclopedia (http://www.epascal.co.kr), contains articles in various domains

Semantic passage segmentation

Our semantic passage segmentation process consists of two phases: the topic assignment phase, where sentences are classified into sentence topics; and the sentence reorganization phase, where sentences are grouped into semantic passages. For the first phase, we use terms and other linguistic features in our learning-based classification approach. Compared with typical document classification methods using term distribution statistics, sentence classification needs additional features because
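The two phases described in this section can be sketched as follows. This is only an illustration under stated assumptions: the keyword lookup in `assign_topic` is a hypothetical stand-in for the learning-based classifier, the topic names and cue words are invented, and grouping all sentences that share a topic (whether or not they are adjacent) is one possible reading of the reorganization phase, not a confirmed detail of the method.

```python
# Hypothetical topic cues. The paper trains a Maximum Entropy classifier
# with terms and other linguistic features; this lookup only imitates its output.
TOPIC_CUES = {
    "birth": ("born", "birthplace"),
    "career": ("worked", "taught", "published"),
}

def assign_topic(sentence):
    """Phase 1 stand-in: return the first topic whose cue word appears."""
    words = sentence.lower().replace(".", " ").split()
    for topic, cues in TOPIC_CUES.items():
        if any(cue in words for cue in cues):
            return topic
    return "other"

def group_by_topic(sentences):
    """Phase 2: collect sentences that share a topic into one passage.

    Sentences keep their original relative order inside each passage.
    """
    passages = {}
    for sentence in sentences:
        passages.setdefault(assign_topic(sentence), []).append(sentence)
    return passages

sentences = [
    "He was born in Seoul.",
    "He taught physics for ten years.",
    "His birthplace is now a museum.",
    "He published two textbooks.",
]
passages = group_by_topic(sentences)
for topic, members in passages.items():
    print(topic, members)
```

Keeping the two phases as separate functions means the toy classifier could be replaced by a trained model without touching the grouping step.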

Use of semantic passages in QA

Many studies on QA show that extracting answers from passages is more effective than from documents [9], [13], [25]. With the use of information extraction (IE) techniques, some QA systems use knowledge bases containing pre-acquired answers. In this section, we describe how the semantic passages are used for passage retrieval and information extraction for QA. Semantic passages can be used as more meaningful units than whole documents for indexing and retrieval as well as extracting facts to

Empirical evaluation

To verify the efficacy of our proposed methods, we conducted a set of experiments. In a preliminary experiment, we first compared several classification algorithms to find the most suitable one for sentence topic classification. After fixing the topic classification method, we evaluated the performance of semantic passage segmentation and then showed the effects of sentence topics in passage retrieval and KB construction for QA.

Conclusion

We proposed a semantic passage segmentation method based on the notion of sentence topics to enhance the performance of our question answering system. We defined a semantic passage as a set of sentences grouped by semantic coherence determined by the topic assigned to individual sentences. We built a sentence topic classifier based on the Maximum Entropy (ME) model using terms and additional linguistic information called sentence patterns as features. Finally, we showed experimental results on the

References (41)

G. Salton et al.
Automatic text structuring and summarization
Information Processing and Management
(1997)
A.K. Singhal et al.
Document length normalization
Information Processing and Management
(1996)
A.L. Berger et al.
Maximum entropy approach to natural language processing
Computational Linguistics
(1996)
T. Brants, F. Chen, I. Tsochantaridis, Topic-based document segmentation with probabilistic latent semantic analysis,...
M. Caillet, J. Pessiot, M. Amini, P. Gallinari, Unsupervised learning with term clustering for thematic segmentation of...
J.P. Callan, Passage-retrieval evidence in document retrieval, in: Proceedings of 17th annual international ACM-SIGIR,...
F.E. Chakik et al.
An approach for constructing complex discriminating surfaces based on Bayesian interference of the maximum entropy
Information Sciences
(2004)
G. Chao, M.G. Dyer, Maximum entropy models for word sense disambiguation, in: Proceedings of 19th International...
H. Christensen, B. Kolluru, Y. Gotoh, S. Renals, Maximum entropy segmentation on broadcast news, in: Proceedings of the...
M.R. Choi, J. Hur, M.G. Jang, Constructing Korean lexical concept network for encyclopedia question-answering system,...
C.L.A. Clarke, G.V. Cormack, T.R. Lynam, C.M. Li, G.L. McLearn, Web reinforced question answering (MultiText...
J. Darroch et al.
Generalized iterative scaling for log-linear models
The Annals of Mathematical Statistics
(1972)
S. Della Pietra et al.
Inducing features of random fields
IEEE Transactions on Pattern Analysis and Machine Intelligence
(1997)
Christiane Fellbaum
WordNet, an electronic lexical database
(1998)
S.M. Harabagiu, S.J. Maiorano, Finding answers in large collections of texts: paragraph indexing+abductive inference,...
M. Hearst, Multi-paragraph segmentation of expository text, in: Proceedings of the 32nd Annual meeting of the...
P. Hsueh, J. Moore, S. Renals, Automatic segmentation of multiparty dialogue, in: Proceeding of the 11th Conference of...
X. Ji, H. Zha, Domain-independent text segmentation using anisotropic diffusion and dynamic programming, in:...
Thorsten Joachims
Learning to classify text using support vector machines
(2002)
B.Y. Kang, S.H. Myaeng, Theme assignment for sentences based on head-driven patterns, in: Proceedings of 8th Conference...
