Abstract
Most documents available over the web conform to the HTML specification. They are intended to be read by humans through a web browser and are therefore constructed following some common conventions. Based on these conventions, the Conceptual Model for HTML was recently proposed to automatically capture the hierarchical structure within web documents. However, certain key semantic information about the contents of the documents, which is obvious to a human reader, is often omitted. As a result, web data processing, manipulation, and integration are still quite difficult. In this paper, we discuss how to extend the Conceptual Model for HTML to capture the intended semantics of HTML documents. We show that, with the newly introduced constructs, an Intelligent Wrapper, and limited human interaction, semantics can be transferred from the human user into the Extended Conceptual Model, so that further meaningful processing, manipulation, and integration of web documents become possible.
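To make the idea of capturing hierarchical structure from common HTML conventions concrete, the following is a minimal sketch, not the Conceptual Model for HTML itself. It assumes one particular convention (heading tags `h1` to `h6` delimit nested sections) and uses only Python's standard `html.parser` module; the names `OutlineParser` and `outline` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed convention: heading tags mark section boundaries and nesting depth.
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class OutlineParser(HTMLParser):
    """Builds a nested outline from heading tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.root = {"title": None, "level": 0, "children": []}
        self._stack = [self.root]  # currently open sections, outermost first
        self._current = None       # level of the heading being read, if any
        self._text = []            # text fragments of that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self._current = HEADING_LEVELS[tag]
            self._text = []

    def handle_data(self, data):
        # Only text inside a heading contributes to a section title.
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and HEADING_LEVELS.get(tag) == self._current:
            node = {
                "title": "".join(self._text).strip(),
                "level": self._current,
                "children": [],
            }
            # Close every open section at the same or a deeper level.
            while self._stack[-1]["level"] >= node["level"]:
                self._stack.pop()
            self._stack[-1]["children"].append(node)
            self._stack.append(node)
            self._current = None


def outline(html):
    """Return the top-level sections of an HTML string as nested dicts."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.root["children"]
```

For example, `outline("<h1>Faculty</h1><h2>Professors</h2><h1>Courses</h1>")` yields two top-level sections, the first containing a "Professors" subsection. The sketch recovers structure only: nothing in the result says that "Professors" denotes a group of people rather than a course title, which is the kind of semantic information the abstract describes as obvious to a human reader but omitted from the document.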
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, M. (2002). Capturing Semantics in HTML Documents. In: Hameurlain, A., Cicchetti, R., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2002. Lecture Notes in Computer Science, vol 2453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46146-9_11
DOI: https://doi.org/10.1007/3-540-46146-9_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44126-7
Online ISBN: 978-3-540-46146-3
eBook Packages: Springer Book Archive