A Conceptual Model and Rule-Based Query Language for HTML

World Wide Web

Abstract

Most documents available on the Web conform to the HTML specification and are hierarchical in structure. Existing data models for the Web either fail to capture the hierarchical structure within documents or provide only a very low-level representation of it. How to represent and query HTML documents at a higher level is therefore an important issue. In this paper, we first propose a novel conceptual model for HTML. The model has only a few simple constructs, yet it can represent the complex hierarchical structure within HTML documents at a level close to how humans conceptualize and visualize the documents. We also describe how to convert HTML documents into this conceptual model. Using the conceptual model and the conversion method, one can capture the essence (i.e., the semistructure) of HTML documents in a natural and simple way. Based on this conceptual model, we then present a rule-based language for querying HTML documents over the Internet. This language provides a simple but powerful way to query both intra-document and inter-document structures and allows query results to be restructured. Being rule-based, it naturally supports negation and recursion and is therefore more expressive than SQL-based languages. A logical semantics is also provided.

Cite this article

Liu, M., Ling, T.W. A Conceptual Model and Rule-Based Query Language for HTML. World Wide Web 4, 49–77 (2001). https://doi.org/10.1023/A:1012408428703

  • DOI: https://doi.org/10.1023/A:1012408428703
