Capturing Semantics in HTML Documents

Liu, Mengchi

doi:10.1007/3-540-46146-9_11

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2453)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Most documents available over the web conform to the HTML specification. They are intended to be human readable through a web browser and thus are constructed following some common conventions. Based on such common conventions, the Conceptual Model for HTML was proposed recently to automatically capture the hierarchical structure within web documents. However, certain key semantic information about the contents of the documents, which is obvious to humans, is often omitted. As a result, web data processing, manipulation and integration are still quite difficult. In this paper, we discuss how to extend the Conceptual Model for HTML to capture the intended semantics of HTML documents. We show that with the new constructs introduced, using an Intelligent Wrapper and limited human interaction, semantics can be transferred from humans into the Extended Conceptual Model so that further meaningful processing, manipulation and integration of web documents become possible.

Author information

Authors and Affiliations

School of Computer Science, Carleton University, K1S 5B6, Ontario, Canada
Mengchi Liu

Editor information

Editors and Affiliations

Université Paul Sabatier, IRIT, 118 route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Département Informatique, Université Aix-Marseille II, IUT, 413 Avenue Gaston Berger, 13625, Aix-en-Provence Cedex 1, France
Rosine Cicchetti
Institute of Applied Computer Science, University of Linz, Altenbergerstr. 69, 4040, Linz, Austria
Roland Traunmüller

Cite this paper

Liu, M. (2002). Capturing Semantics in HTML Documents. In: Hameurlain, A., Cicchetti, R., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2002. Lecture Notes in Computer Science, vol 2453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46146-9_11

DOI: https://doi.org/10.1007/3-540-46146-9_11
Published: 20 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44126-7
Online ISBN: 978-3-540-46146-3
eBook Packages: Springer Book Archive
