Exploring XML web collections with DescribeX

Authors:
Mariano P. Consens, University of Toronto, Toronto, Canada
Renée J. Miller, University of Toronto, Toronto, Canada
Flavio Rizzolo, University of Ottawa and Carleton University
Alejandro A. Vaisman, Universidad de Buenos Aires, Buenos Aires, Argentina

ACM Transactions on the Web, Volume 4, Issue 3, Article No. 11, pp. 1–46. https://doi.org/10.1145/1806916.1806920

Published: 20 July 2010

Abstract

As Web applications mature and evolve, the nature of the semistructured data that drives these applications also changes. An important trend is the need for increased flexibility in the structure of Web documents. Hence, applications cannot rely solely on schemas to provide the complex knowledge needed to visualize, use, query and manage documents. Even when XML Web documents are valid with regard to a schema, the actual structure of such documents may exhibit significant variations across collections for several reasons: the schema may be very lax (e.g., RSS feeds), the schema may be large and different subsets of it may be used in different documents (e.g., industry standards like UBL), or open content models may allow arbitrary schemas to be mixed (e.g., RSS extensions like those used for podcasting). For these reasons, many applications that incorporate XPath queries to process a large Web document collection require an understanding of the actual structure present in the collection, and not just the schema.

To support modern Web applications, we introduce DescribeX, a powerful framework that is capable of describing complex XML summaries of Web collections. DescribeX supports the construction of heterogeneous summaries that can be declaratively defined and refined by means of axis path regular expressions (AxPREs). AxPREs provide the flexibility necessary for declaratively defining complex mappings between instance nodes (in the documents) and summary nodes. These mappings are capable of expressing order and cardinality, among other properties, which can significantly help in the understanding of the structure of large collections of XML documents and enhance the performance of Web applications over these collections. DescribeX captures most summary proposals in the literature by providing (for the first time) a common declarative definition for them. Experimental results demonstrate the scalability of DescribeX summary operations (summary creation, as well as refinement and stabilization, two key enablers for tailoring summaries) on multi-gigabyte Web collections.
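
To make the idea of mapping instance nodes to summary nodes concrete, the following is a minimal sketch of one simple kind of structural summary: element nodes are grouped by their root-to-node label path, and each summary node records how many instance nodes map to it. This is only an illustration under stated assumptions. It is not the DescribeX implementation and does not use AxPRE syntax; the function name `path_summary` and the sample RSS-like documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def path_summary(xml_docs):
    """Group element nodes by root-to-node label path (hypothetical helper).

    Returns a dict mapping each label path (a summary node) to the number
    of instance nodes in the collection that share that path.
    """
    extents = defaultdict(int)

    def visit(node, prefix):
        # The summary node for an element is identified by its label path.
        path = prefix + "/" + node.tag
        extents[path] += 1
        for child in node:
            visit(child, path)

    for doc in xml_docs:
        visit(ET.fromstring(doc), "")
    return dict(extents)


# Two made-up documents that are structurally similar but not identical:
# only the second one uses an <enclosure> element inside an <item>.
docs = [
    "<rss><channel><title>A</title>"
    "<item><title>x</title></item></channel></rss>",
    "<rss><channel><title>B</title>"
    "<item><title>y</title><enclosure/></item>"
    "<item><title>z</title></item></channel></rss>",
]

summary = path_summary(docs)
for path, count in sorted(summary.items()):
    print(path, count)
```

Even this coarse summary exposes structure that a lax schema would not: the path `/rss/channel/item/enclosure` appears once across the collection, while `/rss/channel/item` appears three times. Refining such a partition, for example by also distinguishing nodes by the children they have, is the kind of operation the abstract attributes to AxPREs, though the sketch above does not attempt it.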

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Information systems
  1. Data management systems

Published in

ACM Transactions on the Web Volume 4, Issue 3
July 2010
166 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/1806916

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 July 2010
- Accepted: 1 January 2010
- Revised: 1 June 2009
- Received: 1 July 2008
Author Tags
Semistructured data
XML
XPath
structural summaries
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 4
- Total Downloads: 446
- Downloads (Last 12 months): 2
- Downloads (Last 6 weeks): 0