Extracting Content Structure for Web Pages Based on Visual Representation

Cai, Deng; Yu, Shipeng; Wen, Ji-Rong; Ma, Wei-Ying

doi:10.1007/3-540-36901-5_42

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2642)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

This paper proposes a new web content structure based on visual representation. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. The paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. The approach simulates how users understand the layout structure of a web page based on their visual perception. Compared with existing techniques, it is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure. Experiments show satisfactory results.
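
The abstract gives no algorithmic detail, so the following is only a generic illustration of what a top-down, layout-driven segmentation could look like. It is not the authors' method. It recursively cuts a page along its widest empty band, using rendered bounding boxes instead of the tag tree. The names `Box`, `Block`, `segment`, `outline`, the `min_gap` threshold, and all coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Rendered rectangle of one leaf element (hypothetical input format)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float


@dataclass
class Block:
    """Node of the resulting content structure tree."""
    boxes: List[Box]
    children: List["Block"] = field(default_factory=list)


def widest_gap(boxes: List[Box],
               lo: Callable[[Box], float],
               hi: Callable[[Box], float]) -> Tuple[float, Optional[float]]:
    """Return (width, midpoint) of the widest empty band along one axis."""
    spans = sorted((lo(b), hi(b)) for b in boxes)
    best: Tuple[float, Optional[float]] = (0.0, None)
    reach = spans[0][1]  # furthest extent covered so far
    for start, end in spans[1:]:
        if start - reach > best[0]:
            best = (start - reach, (start + reach) / 2)
        reach = max(reach, end)
    return best


def segment(boxes: List[Box], min_gap: float = 10.0) -> Block:
    """Split top-down along the widest whitespace band until none is wide enough."""
    block = Block(boxes)
    if len(boxes) < 2:
        return block
    h_gap, h_pos = widest_gap(boxes, lambda b: b.top, lambda b: b.bottom)
    v_gap, v_pos = widest_gap(boxes, lambda b: b.left, lambda b: b.right)
    if max(h_gap, v_gap) < min_gap:
        return block  # visually coherent: stop dividing
    if h_gap >= v_gap:  # horizontal band: split into upper and lower parts
        first = [b for b in boxes if b.bottom <= h_pos]
        second = [b for b in boxes if b.top >= h_pos]
    else:  # vertical band: split into left and right parts
        first = [b for b in boxes if b.right <= v_pos]
        second = [b for b in boxes if b.left >= v_pos]
    block.children = [segment(first, min_gap), segment(second, min_gap)]
    return block


def outline(block: Block):
    """Nested-list view of the tree, with leaf blocks shown by element name."""
    if not block.children:
        return "+".join(b.name for b in block.boxes)
    return [outline(child) for child in block.children]


# Hypothetical page: header, left navigation, main body, footer.
page = [
    Box("header", 0, 0, 800, 60),
    Box("nav", 0, 80, 150, 600),
    Box("body", 170, 80, 800, 600),
    Box("footer", 0, 620, 800, 660),
]
print(outline(segment(page)))
```

The sketch uses only geometry, so two elements that are far apart in the markup but adjacent on screen end up in the same block. A realistic system would also need other visual cues such as font, color, and separators, which this example ignores.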

Author information

Authors and Affiliations

Microsoft Research Asia, China
Deng Cai, Shipeng Yu, Ji-Rong Wen & Wei-Ying Ma
Tsinghua University, Beijing, P.R. China
Deng Cai
Peking University, Beijing, P.R. China
Shipeng Yu

Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD, 4072, Australia
Xiaofang Zhou & Maria E. Orlowska
Department of Mathematics and Computing, University of Southern Queensland, Toowoomba, QLD, 4350, Australia
Yanchun Zhang

About this paper

Cite this paper

Cai, D., Yu, S., Wen, JR., Ma, WY. (2003). Extracting Content Structure for Web Pages Based on Visual Representation. In: Zhou, X., Orlowska, M.E., Zhang, Y. (eds) Web Technologies and Applications. APWeb 2003. Lecture Notes in Computer Science, vol 2642. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36901-5_42

DOI: https://doi.org/10.1007/3-540-36901-5_42
Published: 15 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-02354-8
Online ISBN: 978-3-540-36901-1
eBook Packages: Springer Book Archive
