Abstract
Documents, especially long ones, may contain very diverse passages related to different topics. Passage retrieval approaches have shown that, in most cases, there is great potential benefit in considering these passages independently when computing the similarity of a document to a user’s query. Experiments have been conducted to identify the kinds of passages best suited to such a process. Contrary to what might have been expected, working with thematic segments, each of which is likely to cover a single topic, has led to much lower retrieval effectiveness than using arbitrary sequences of words. In this paper, we show that this paradoxical observation is mainly due to biases induced by the wide variation in length among thematic passages. We therefore propose to cope with these biases by using a more powerful text length normalization technique. Experiments show that, once length biases are set aside, thematic passages are better suited than arbitrary sequences of words for retrieving relevant information in response to a user’s query.
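To make the length-bias issue concrete, the following is a minimal sketch of passage-level document scoring with a length normalization factor. It is not the normalization technique proposed in the paper: it uses the well-known pivoted form, (1 - slope) + slope * length / average length, as a stand-in. The function names, the log-damped term-frequency weighting, and the slope value of 0.2 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def pivoted_norm(length, avg_length, slope=0.2):
    """Pivoted length normalization factor (illustrative stand-in)."""
    return (1.0 - slope) + slope * (length / avg_length)


def passage_score(query_terms, passage_tokens, avg_length, slope=0.2):
    """Sum log-damped term frequencies of the query terms found in the
    passage, then divide by the pivoted length normalization factor."""
    if not passage_tokens:
        return 0.0
    tf = Counter(passage_tokens)
    raw = sum(1.0 + math.log(tf[t]) for t in set(query_terms) if tf[t] > 0)
    return raw / pivoted_norm(len(passage_tokens), avg_length, slope)


def document_score(query_terms, passages, avg_length, slope=0.2):
    """Score a document by its best-scoring passage (assumed aggregation)."""
    scores = [passage_score(query_terms, p, avg_length, slope) for p in passages]
    return max(scores, default=0.0)


# Hypothetical data: two passages with the same query matches but very
# different lengths, as can happen with thematic segments.
query = ["passage", "retrieval"]
short_passage = ["passage", "retrieval"]
long_passage = ["passage", "retrieval"] + ["filler"] * 18
avg_len = 11  # assumed average passage length in the collection
best = document_score(query, [short_passage, long_passage], avg_len)
```

With a fixed slope, the short passage outscores the long one even though both contain the same matches. When passage lengths vary widely, the choice of normalization can therefore dominate the ranking, which is the kind of bias the abstract attributes to thematic segments.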
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lamprier, S., Amghar, T., Levrat, B., Saubion, F. (2008). Thematic Segment Retrieval Revisited. In: Dochev, D., Pistore, M., Traverso, P. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2008. Lecture Notes in Computer Science, vol 5253. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85776-1_14
DOI: https://doi.org/10.1007/978-3-540-85776-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85775-4
Online ISBN: 978-3-540-85776-1
eBook Packages: Computer Science (R0)