Finding and Using the Content Texts of HTML Pages

MA, Jun; Chen, Zhumin; Lian, Li; Li, Lianxia

doi:10.1007/978-3-540-68636-1_79

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

A novel algorithm for finding the content text of an HTML page is proposed, based on a number of features of the textual blocks in the page. Experiments show that the new algorithm outperforms known ones on two measures: the ratio of correctly removed noise blocks and the ratio of correctly found content blocks. The application of the algorithm to hidden-web classification is demonstrated as well.
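
The abstract does not say which block features the algorithm uses, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes two illustrative features (word count and link density), hypothetical thresholds, and a hypothetical set of block-level tags; the names `BlockCollector`, `extract_blocks`, `is_content`, and `evaluate` are likewise made up for illustration. The `evaluate` function reflects one plausible reading of the two ratios named in the abstract.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit textual blocks (an illustrative choice).
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "ul", "table"}


class BlockCollector(HTMLParser):
    """Split a page into textual blocks and count words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks
        self._words = []      # words of the block being built
        self._link_words = 0  # how many of those words sit inside <a>
        self._in_link = 0     # nesting depth of <a>
        self._skip = 0        # nesting depth of <script>/<style>

    def flush(self):
        """Close the current block, keeping it only if it has text."""
        if self._words:
            self.blocks.append(
                {"words": self._words, "link_words": self._link_words}
            )
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_blocks(page):
    """Return the textual blocks of an HTML string."""
    parser = BlockCollector()
    parser.feed(page)
    parser.close()
    parser.flush()
    return parser.blocks


def is_content(block, min_words=10, max_link_density=0.3):
    """Hypothetical rule: enough words, and few of them inside links."""
    n = len(block["words"])
    return n >= min_words and block["link_words"] / n <= max_link_density


def evaluate(predicted, gold):
    """Return (noise-removal ratio, content-found ratio).

    Both arguments are lists of booleans, True meaning "content block".
    """
    pairs = list(zip(predicted, gold))
    noise_total = sum(1 for _, g in pairs if not g)
    content_total = sum(1 for _, g in pairs if g)
    noise_removed = sum(1 for p, g in pairs if not g and not p)
    content_found = sum(1 for p, g in pairs if g and p)
    return (
        noise_removed / noise_total if noise_total else 0.0,
        content_found / content_total if content_total else 0.0,
    )
```

With this sketch, a navigation bar made entirely of links and a two-word copyright line would both be classed as noise, while a paragraph of running prose would be kept. A real system would need features and thresholds tuned on labelled pages.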

Author information

Authors and Affiliations

The College of Computer Science and Technology, Shandong University, Jinan, China
Jun MA, Zhumin Chen, Li Lian & Lianxia Li

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

About this paper

Cite this paper

MA, J., Chen, Z., Lian, L., Li, L. (2008). Finding and Using the Content Texts of HTML Pages. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_79

DOI: https://doi.org/10.1007/978-3-540-68636-1_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)