Abstract
A novel algorithm for finding the content text of an HTML page is proposed, based on a number of features of the textual blocks in the page. Experiments show that the new algorithm outperforms known ones on two measures: the ratio of correctly removed noise blocks and the ratio of correctly found content blocks. The application of the algorithm to hidden web classification is also demonstrated.
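The two evaluation measures named in the abstract can be illustrated with a small sketch. The abstract does not give their exact definitions, so the code below assumes one plausible reading: each block has a ground-truth label and a predicted label, the noise ratio is the share of true noise blocks that were labelled noise, and the content ratio is the share of true content blocks that were labelled content. The function name `block_ratios` and the label strings `"content"` and `"noise"` are hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple


def block_ratios(truth: Sequence[str], predicted: Sequence[str]) -> Tuple[float, float]:
    """Return (noise_removed_ratio, content_found_ratio).

    Assumed definitions (not confirmed by the paper):
      noise_removed_ratio = true noise blocks predicted as noise / all true noise blocks
      content_found_ratio = true content blocks predicted as content / all true content blocks
    Labels are the hypothetical strings "noise" and "content".
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    noise_total = sum(1 for t in truth if t == "noise")
    content_total = sum(1 for t in truth if t == "content")

    # Count blocks whose predicted label matches the ground truth, per class.
    noise_ok = sum(1 for t, p in zip(truth, predicted) if t == "noise" and p == "noise")
    content_ok = sum(1 for t, p in zip(truth, predicted) if t == "content" and p == "content")

    # Report 0.0 when a class is absent, to avoid division by zero.
    noise_ratio = noise_ok / noise_total if noise_total else 0.0
    content_ratio = content_ok / content_total if content_total else 0.0
    return noise_ratio, content_ratio


if __name__ == "__main__":
    truth = ["content", "noise", "noise", "content", "noise"]
    predicted = ["content", "noise", "content", "content", "noise"]
    print(block_ratios(truth, predicted))
```

Reporting the two ratios separately, rather than a single accuracy figure, keeps a method that discards most blocks from looking good merely because noise blocks are the majority on a typical page.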
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ma, J., Chen, Z., Lian, L., Li, L. (2008). Finding and Using the Content Texts of HTML Pages. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_79
DOI: https://doi.org/10.1007/978-3-540-68636-1_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science (R0)