Extraction of Web Texts Using Content-Density Distribution

Kitahara, Saori; Tamura, Koya; Hatano, Kenji

doi:10.1007/978-3-642-25631-8_25

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7097)

Abstract

We propose a method that captures the content of each Web page and extracts the part of the page related to the query keywords, with the aim of producing more effective snippets for Web search engines. We regard the content as the set of words in the text of a Web page, and we generate a content-density distribution from both the position and the influence of each word. In our experiments, the proposed method made the content of Web pages easier to recognize than conventional snippet-based methods did.
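
The abstract does not give the formula for the content-density distribution, so the following is only a minimal sketch of one way such a distribution could be built, not the authors' implementation. It assumes that each word's "influence" is a numeric weight supplied by the caller (the `weights` mapping is hypothetical), that position is handled by smoothing those weights over neighbouring token positions with a Hann window, and that a snippet is the fixed-length span with the highest total density. The function names `hann_window`, `density_distribution`, and `best_span` are illustrative.

```python
import math


def hann_window(width):
    """Return a Hann window of odd length `width`, normalized to sum to 1."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    # Using (i + 1) / (width + 1) keeps the end coefficients non-zero.
    raw = [0.5 * (1.0 - math.cos(2.0 * math.pi * (i + 1) / (width + 1)))
           for i in range(width)]
    total = sum(raw)
    return [v / total for v in raw]


def density_distribution(tokens, weights, width=5):
    """Spread each token's weight over nearby positions (assumed smoothing).

    `weights` maps a word to its assumed influence; unknown words count as 0.
    """
    window = hann_window(width)
    half = width // 2
    density = [0.0] * len(tokens)
    for pos, token in enumerate(tokens):
        w = weights.get(token, 0.0)
        if w == 0.0:
            continue
        for k, coeff in enumerate(window):
            j = pos + k - half
            if 0 <= j < len(tokens):  # drop mass that falls off either end
                density[j] += w * coeff
    return density


def best_span(density, span_len):
    """Return the start index of the span with the highest total density."""
    if not 1 <= span_len <= len(density):
        raise ValueError("span_len must be between 1 and len(density)")
    current = sum(density[:span_len])
    best_start, best_sum = 0, current
    for start in range(1, len(density) - span_len + 1):
        # Slide the span one position to the right.
        current += density[start + span_len - 1] - density[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start


if __name__ == "__main__":
    tokens = "cheap hotels near kyoto station and kyoto tower".split()
    weights = {"kyoto": 1.0, "station": 0.5}  # hypothetical query-term weights
    density = density_distribution(tokens, weights, width=3)
    start = best_span(density, span_len=3)
    print(" ".join(tokens[start:start + 3]))
```

In this sketch the window width controls how far a word's influence reaches: a wider window favours regions where weighted words cluster loosely, while a narrow one favours exact positions.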

Author information

Authors and Affiliations

Graduate School of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe, Kyoto, 610-0394, Japan
Saori Kitahara
UX Department, Mixi Inc., 1-2-20 Higashi, Shibuya, Tokyo, 150-0011, Japan
Koya Tamura
Faculty of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe, Kyoto, 610-0394, Japan
Kenji Hatano

Editor information

Editors and Affiliations

Faculty of Computer Science and Engineering, University of Wollongong, Dubai Knowledge Village, P.O. Box 20182, Dubai, United Arab Emirates
Mohamed Vall Mohamed Salem
Faculty of Engineering and IT, Dubai International Academic City, Block 11, 1st and 2nd Floor, P.O. Box 345015, Dubai, United Arab Emirates
Khaled Shaalan
Faculty of Computer Science and Engineering, University of Wollongong, Dubai Knowledge Village, P.O. Box 20183, Dubai, United Arab Emirates
Farhad Oroumchian
Department of Electrical and Computer Engineering, University of Tehran, Faculty of Engineering, North Kargar Street, P.O. Box 14395-515, Tehran, Iran
Azadeh Shakery
Faculty of Computer Science and Engineering, University of Wollongong, Dubai Knowledge Village, P.O. Box 20183, Dubai, United Arab Emirates
Halim Khelalfa

About this paper

Cite this paper

Kitahara, S., Tamura, K., Hatano, K. (2011). Extraction of Web Texts Using Content-Density Distribution. In: Salem, M.V.M., Shaalan, K., Oroumchian, F., Shakery, A., Khelalfa, H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25631-8_25

DOI: https://doi.org/10.1007/978-3-642-25631-8_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25630-1
Online ISBN: 978-3-642-25631-8
eBook Packages: Computer Science, Computer Science (R0)
