Finding and Using the Content Texts of HTML Pages

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

This paper proposes a novel algorithm that finds the content text of an HTML page using a number of features of the page's textual blocks. Experiments show that the new algorithm outperforms known ones on two measures: the ratio of correctly removed noise blocks and the ratio of correctly found content blocks. The application of the algorithm to hidden web classification is demonstrated as well.
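
The two evaluation ratios named in the abstract can be illustrated with a minimal sketch. It assumes that each textual block carries a ground-truth label and a predicted label (True for content, False for noise). The function name `block_ratios` and this labelling scheme are hypothetical and are not taken from the paper, whose exact definitions are not given in the abstract.

```python
from typing import Sequence, Tuple


def block_ratios(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Return (noise_removed_ratio, content_found_ratio).

    Assumption (not from the paper): True marks a content block and
    False marks a noise block, in both the predicted and actual labels.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    pairs = list(zip(predicted, actual))
    noise_total = sum(1 for _, a in pairs if not a)
    content_total = sum(1 for _, a in pairs if a)

    # A noise block is correctly removed when it is also predicted as noise.
    noise_removed = sum(1 for p, a in pairs if not a and not p)
    # A content block is correctly found when it is also predicted as content.
    content_found = sum(1 for p, a in pairs if a and p)

    noise_ratio = noise_removed / noise_total if noise_total else 0.0
    content_ratio = content_found / content_total if content_total else 0.0
    return noise_ratio, content_ratio


if __name__ == "__main__":
    actual = [True, True, False, False, False]
    predicted = [True, False, False, False, True]
    print(block_ratios(predicted, actual))
```

Reporting the two ratios separately keeps the trade-off visible: an extractor that discards every block would remove all noise but find no content.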

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ma, J., Chen, Z., Lian, L., Li, L. (2008). Finding and Using the Content Texts of HTML Pages. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_79

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_79

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science, Computer Science (R0)
