Abstract
We propose a method for grasping the content of a Web page and extracting the part of the page related to the query keywords, in order to generate more effective snippets for a Web search engine. We regard the content as the set of words in the text of a Web page, and we generate a content-density distribution from both the position and the influence of each word. In our experiments, the proposed method made it easier to recognize the content of Web pages than conventional snippet-based methods.
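The abstract does not give the formula for the content-density distribution, so the following is only a minimal sketch of one way such a distribution could be computed, not the authors' implementation. It assumes that each word's influence is a numeric weight supplied by the caller, that the influence is spread over neighbouring word positions with a Hann window, and that the extracted part is the fixed-length span of words with the largest total density. The names `content_density`, `best_span`, `influence`, and `width` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def content_density(words: List[str],
                    influence: Dict[str, float],
                    width: int = 5) -> List[float]:
    """Spread each word's influence over nearby positions (assumed Hann window)."""
    n = len(words)
    density = [0.0] * n
    for pos, word in enumerate(words):
        weight = influence.get(word, 0.0)  # words without a weight contribute nothing
        if weight == 0.0:
            continue
        for offset in range(-width, width + 1):
            target = pos + offset
            if 0 <= target < n:
                # Hann window: 1 at the word itself, reaching 0 at +/-(width + 1)
                kernel = 0.5 * (1.0 + math.cos(math.pi * offset / (width + 1)))
                density[target] += weight * kernel
    return density


def best_span(density: List[float], length: int) -> Tuple[int, int]:
    """Return (start, end) of the window of `length` positions with the largest total density."""
    length = min(length, len(density))
    current = sum(density[:length])
    best_sum, best_start = current, 0
    for start in range(1, len(density) - length + 1):
        # Slide the window one position to the right
        current += density[start + length - 1] - density[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_start + length


if __name__ == "__main__":
    words = "a b c query d e f g h i".split()
    density = content_density(words, {"query": 1.0}, width=2)
    start, end = best_span(density, 3)
    print(" ".join(words[start:end]))
```

In this sketch the query keyword itself receives the highest density and the density falls off smoothly with distance, so the selected span is centred on the region where influential words cluster. A different kernel or a different definition of influence would change the shape of the distribution but not the overall procedure.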
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kitahara, S., Tamura, K., Hatano, K. (2011). Extraction of Web Texts Using Content-Density Distribution. In: Salem, M.V.M., Shaalan, K., Oroumchian, F., Shakery, A., Khelalfa, H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25631-8_25
DOI: https://doi.org/10.1007/978-3-642-25631-8_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25630-1
Online ISBN: 978-3-642-25631-8
eBook Packages: Computer Science, Computer Science (R0)