Capturing Semantics in HTML Documents

  • Conference paper
Database and Expert Systems Applications (DEXA 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2453)

Abstract

Most documents available on the web conform to the HTML specification. They are intended to be read by humans through a web browser and are therefore constructed following some common conventions. Based on these conventions, the Conceptual Model for HTML was recently proposed to automatically capture the hierarchical structure within web documents. However, certain key semantic information about the contents of the documents, which is obvious to humans, is often omitted. As a result, web data processing, manipulation and integration are still quite difficult. In this paper, we discuss how to extend the Conceptual Model for HTML to capture the intended semantics of HTML documents. We show that, with the new constructs introduced, an Intelligent Wrapper, and limited human interaction, semantics can be transferred from humans into the Extended Conceptual Model so that further meaningful processing, manipulation and integration of web documents become possible.
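
To make the idea of capturing hierarchical structure from common HTML conventions concrete, the following is a minimal sketch in Python. It is not the Conceptual Model for HTML described in the paper; it only assumes one convention, that heading tags (h1 to h6) introduce nested sections. The names `OutlineParser` and `outline` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Assumed convention: heading tags mark section boundaries and nesting depth.
HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class OutlineParser(HTMLParser):
    """Builds a tree of sections from heading tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.root = {"title": None, "level": 0, "text": [], "children": []}
        self.stack = [self.root]   # path from the root to the current section
        self.in_heading = None     # heading tag currently open, if any
        self.buf = []              # text collected inside the open heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading = tag
            self.buf = []

    def handle_endtag(self, tag):
        if tag == self.in_heading:
            level = HEADINGS[tag]
            node = {"title": "".join(self.buf).strip(), "level": level,
                    "text": [], "children": []}
            # Close sections at the same or a deeper level, then attach.
            while self.stack[-1]["level"] >= level:
                self.stack.pop()
            self.stack[-1]["children"].append(node)
            self.stack.append(node)
            self.in_heading = None

    def handle_data(self, data):
        if self.in_heading:
            self.buf.append(data)
        elif data.strip():
            # Non-heading text belongs to the innermost open section.
            self.stack[-1]["text"].append(data.strip())


def outline(html):
    """Return the section tree for an HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.root


sample = ("<h1>Faculty</h1><p>Intro</p>"
          "<h2>Alice</h2><p>Databases</p>"
          "<h2>Bob</h2><p>Networks</p>")
tree = outline(sample)
```

In this sketch the tree records that "Alice" is a subsection of "Faculty" containing the text "Databases", but nothing states that Alice is a person or that "Databases" is a research area. That missing information is the kind of semantics the abstract says is obvious to humans yet absent from a purely structural model.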

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, M. (2002). Capturing Semantics in HTML Documents. In: Hameurlain, A., Cicchetti, R., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2002. Lecture Notes in Computer Science, vol 2453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46146-9_11

  • DOI: https://doi.org/10.1007/3-540-46146-9_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44126-7

  • Online ISBN: 978-3-540-46146-3

  • eBook Packages: Springer Book Archive
