Replacement strategies for XQuery caching systems

https://doi.org/10.1016/j.datak.2003.10.002

Abstract

To improve the query performance over XML documents in a distributed environment, we develop a semantic caching system named ACE-XQ for XQuery queries. ACE-XQ applies innovative query containment and rewriting techniques to answer user queries using cached queries. We also design a fine-grained replacement strategy which records user access statistics at a finer granularity than the complete XML query regions. As a result, less frequently used XML view fragments are replaced to maintain a better utilization of the cache space. Extensive experimental results illustrate the performance improvement achieved by this strategy over the traditional one for a variety of situations.

Introduction

Due to the growing demand by web applications for retrieving information from multiple remote XML sources, it has become increasingly critical to improve the efficiency of XML query evaluation. One key step towards achieving such an optimization is to exploit caching technology to reduce the response latency caused by data transmission over the Internet. Inspired by the semantic caching idea [14], which utilizes cached queries and their results to answer subsequent queries by reasoning about their containment relationships, we propose to build such a caching system to facilitate XML query processing in the Web environment.

One major difference between semantic caching systems [6], [14], [21] and the traditional tuple [16] or page-based [6] caching systems is that the data cached at the client side of the former is logically organized by queries instead of physical tuple identifications or page numbers. To achieve effective cache management, the access and management of the cached data in a semantic caching system is thus typically at the level of query descriptions. For example, the decision of whether the answers of a new query can be retrieved from the local cache is based on the query containment analysis of the new query and the cached query descriptors themselves, rather than by looking up each and every tuple or page identification of objects that could possibly answer a current user request.

The semantic caching idea has been extensively studied in the relational context [14]. However, query evaluation and containment dealing with XML data differ in their nature and difficulty from those in the relational setting. New challenges are being imposed by the tree-oriented nature of XML and the XQuery language on the tasks of query containment and rewriting, as we will point out in this paper.

We have developed the first XQuery-based caching system, named ACE-XQ [9], [11], to deploy our proposed query containment and cache management techniques in the XML context. In ACE-XQ, new and cached queries are both expressed in XQuery, a rapidly maturing XML query language proposed by the W3C as the standard [44]. The query descriptors in the ACE-XQ system capture the query semantics that are used to decide query containment. While [9] describes the XQuery containment and rewriting techniques in ACE-XQ, in this paper we focus on the cache management of ACE-XQ, in particular on cache replacement issues.

Typically, a cache system utilizes a replacement manager to decide what to retain in the cache and what to discard when the cache is full. In a query-based caching system, the data granularity for replacement is the query and its associated query result. The cache manager in ACE-XQ maintains a collection of query regions, each composed of a query descriptor and the corresponding XML view document, i.e., query region = query descriptor + result XML view. Query descriptors can be utilized for reasoning about the containment relationships between the cached queries and the new query. Also, user access statistics may be attached to the query descriptors by the deployed replacement strategy to calculate the region utility values. The replacement manager usually picks the cached query with the lowest utility value and purges it to make room for the new query.
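The region-level replacement described above can be illustrated with a minimal Java sketch. It is not the ACE-XQ implementation: the class names `QueryRegion` and `TotalReplacementCache`, the character-count size estimate, and the externally supplied utility value are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a query region: descriptor + result XML view + one utility value.
final class QueryRegion {
    final String descriptor; // stands in for the encoded query descriptor
    final String xmlView;    // stands in for the cached result XML view
    final double utility;    // assumed to be computed elsewhere by the replacement strategy

    QueryRegion(String descriptor, String xmlView, double utility) {
        this.descriptor = descriptor;
        this.xmlView = xmlView;
        this.utility = utility;
    }

    // Crude size estimate (character count); a real cache would measure bytes.
    int size() {
        return xmlView.length();
    }
}

// Hypothetical cache that always evicts whole regions.
final class TotalReplacementCache {
    private final int capacity;
    private final List<QueryRegion> regions = new ArrayList<>();
    private int used = 0;

    TotalReplacementCache(int capacity) {
        this.capacity = capacity;
    }

    // Admits a region, purging complete regions with the lowest utility
    // until the new one fits. Returns the regions that were purged.
    List<QueryRegion> admit(QueryRegion incoming) {
        List<QueryRegion> evicted = new ArrayList<>();
        while (used + incoming.size() > capacity && !regions.isEmpty()) {
            QueryRegion victim = regions.get(0);
            for (QueryRegion r : regions) {
                if (r.utility < victim.utility) {
                    victim = r;
                }
            }
            regions.remove(victim);
            used -= victim.size();
            evicted.add(victim);
        }
        // A region larger than the whole cache is simply not admitted.
        if (used + incoming.size() <= capacity) {
            regions.add(incoming);
            used += incoming.size();
        }
        return evicted;
    }

    int regionCount() {
        return regions.size();
    }

    int usedSize() {
        return used;
    }
}
```

In this sketch the unit of eviction is always a complete region, so a large view with one frequently used fragment is either kept whole or lost whole.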

Since a new query is often conceptually subsumed by or overlapping with previously cached queries, the query region of the latter can be seen as logically segmented into two pieces. One corresponds to the overlapping part, which is retrieved by the probe query for answering the new query. The left-over piece does not contribute to answering the new query. The replacement manager of a traditional query-based caching system may split the containing query region into two regions corresponding to their respective usefulness in this latest query answering process. After the splitting, a uniform utility value is maintained for each query region. Whenever the cache is full, a complete query region is the unit for replacement. However, such a region-splitting scheme entails a large decomposition overhead each time a new query overlaps with the cached queries. It also produces, over time, an increasing number of small XML view documents that are possibly less useful in answering future queries due to their fragmentation.

An alternative solution is to tolerate some redundancy in the cached queries. That is, even if newly incoming queries partially overlap with existing queries, we opt not to split existing queries in order to avoid fragmentation. A straightforward application of replacement would then be to replace a complete query region at each iteration. However, deleting a whole query region each time may be too coarse a granularity for “large” XML views, which would hurt cache space utilization. Also, such a replacement strategy does not reflect the contribution of different fragments in a cached XML view, which may participate in answering different subsequent queries. Replacement at the granularity of complete XML views hence suffers from apparent drawbacks.

We now propose a refined replacement strategy, namely, to record utility values for finer regions of existing cached views in terms of their internal structure rather than assigning a uniform value to the whole cached query region [12]. To be precise, we attach to each query descriptor a detailed path table listing all paths returned in the query. When a cached query contains or partially overlaps with a new query, the utility statistics of those paths requested by the probe query are updated, without splitting the cached query. When the cache is full, the replacement manager does not select complete regions but only specific paths with the lowest utility value within such query regions for replacement. It then composes a filter query to remove the fragments corresponding to those paths from the cached XML view. The relevant query descriptors are then modified accordingly to be consistent with the changed XML view.
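The path table bookkeeping can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class name `PathTableRegion` is hypothetical, utility is modeled as a plain access count, ties at the lowest utility are all evicted together, and composing the actual filter query is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query region whose descriptor carries a per-path utility table.
final class PathTableRegion {
    final String descriptor;
    // path -> utility; a simple access count is an assumption made for this sketch
    private final Map<String, Integer> pathUtility = new LinkedHashMap<>();

    PathTableRegion(String descriptor, List<String> returnedPaths) {
        this.descriptor = descriptor;
        for (String path : returnedPaths) {
            pathUtility.put(path, 0);
        }
    }

    // Called when a probe query reads some cached paths; the region itself is not split.
    void recordAccess(List<String> requestedPaths) {
        for (String path : requestedPaths) {
            pathUtility.computeIfPresent(path, (key, count) -> count + 1);
        }
    }

    // Removes the paths with the lowest utility from the table and returns them,
    // so that a filter query over the cached XML view could be composed from them.
    List<String> evictColdestPaths() {
        List<String> victims = new ArrayList<>();
        if (pathUtility.isEmpty()) {
            return victims;
        }
        int lowest = Collections.min(pathUtility.values());
        for (Map.Entry<String, Integer> entry : pathUtility.entrySet()) {
            if (entry.getValue() == lowest) {
                victims.add(entry.getKey());
            }
        }
        pathUtility.keySet().removeAll(victims);
        return victims;
    }

    List<String> remainingPaths() {
        return new ArrayList<>(pathUtility.keySet());
    }
}
```

For a region with hypothetical paths `/book/title`, `/book/price` and `/book/review`, where only the first two are ever requested by probe queries, `evictColdestPaths()` would select `/book/review` and leave the other two paths in the table.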

This proposed partial replacement strategy utilizes the view structure to maintain utility values at a finer granularity than complete query regions. This way, replacement helps to keep the query regions that are most likely “hot” in the cache, because the original cached queries may be refined by future filter queries that remove the less useful fragments within them. The strategy hence forgoes explicit region splitting upon every new incoming query, avoiding the generation of too many small region fragments with little use for answering future queries.

We have implemented both the proposed partial replacement strategy and the complete region replacement strategy (which we now call total replacement) within our ACE-XQ caching system. In this paper, we report on the extensive experimental study we have conducted to compare the performance of our partial replacement and the alternative total replacement strategies in a variety of scenarios. The results show that in most cases, especially when the cache size is medium, the partial replacement strategy outperforms the total replacement strategy in terms of hit count ratio (HCR), hit byte ratio (HBR) and query response delay.

The rest of the paper is organized as follows. In Section 2, we show the running example queries to motivate the need for a query containment and rewriting solution in the context of XQuery. An overview of our overall XQuery caching solution ACE-XQ is given in Section 3. We then focus on the cache management aspect of the ACE-XQ system. We analyze the advantages and disadvantages of alternative query region managing schemes in Section 4, while in Section 5 we describe the fine-grained replacement strategy (i.e., partial replacement) deployed in ACE-XQ. The experimental studies comparing our partial replacement strategy with the traditional total replacement strategy are given in Section 7. The related work is described in Section 8 and we conclude in Section 9.

Section snippets

Running example of XQuery containment and rewriting

The foundation of query-based caching is query containment, i.e., verifying whether one query necessarily yields a subset of the result of another query. In the relational context, the containment problem for conjunctive queries has been extensively studied [8], [27], [29], [38]. Its complexity was shown to be NP-complete in [7].

A query Q1 is contained in a query Q2, denoted Q1⊑Q2, if for any database D, the answers to Q1 form a subset of the answers to Q2. The two queries are equivalent,

The ACE-XQ system overview

The framework of the ACE-XQ system is depicted in Fig. 5. It consists of two subsystems, a Query Matcher which implements the query containment and rewriting techniques and a Cache Manager which manages the cache space and applies replacement and coalescing techniques.

When a new user query comes in, the Query Decomposer (shown in the Query Matcher subsystem on the left hand side of Fig. 5) applies normalization rules [9], [32] to derive its nesting format, revealing the variable dependency

Design choices for alternative cache region management schemes

In a traditional query-based caching system [14], a query region is the minimal granularity managed in the cache. A query region consists of an encoded query descriptor and a pointer to access the associated result XML view. In this section, we will take a look at existing alternative schemes and compare them in their ways of managing the query regions in the presence of replacement activities.

When a new query arrives, the containment mapper will first determine if it is contained or partially

Query descriptor hierarchy

To overcome the drawbacks identified above of naive region-splitting replacement strategies, we instead suggest here that different utility values may be maintained for finer parts within a given region to account for different levels of accesses by users. This then should be done independently from the final decision of splitting the region. When a new query overlaps with a cached query region, the overlapped portion in the cached region is accessed by the return expressions of the probe

The analysis of cache performance

We have discussed earlier the methodologies adopted by our partial replacement approach versus the alternative approaches. Here, we attempt to give an analytical model for better understanding how the caching system interacts with various factors such as cache size and query access pattern. Based on this model, we analyze how the cache would behave in the face of different replacement strategies. This may help us to gain insights into the reason why the cache equipped with our partial

System setup

We have implemented our ACE-XQ [9] in Java 1.3. We utilize the Quilt parser and Kweelt query engine available at: http://cheops.cis.upenn.edu/Kweelt to analyze and evaluate the input XQuery. To realize the type-enhanced query containment and rewriting algorithm, we deploy the type inference and subtyping mechanisms provided by the XDuce system [20] in ACE-XQ.

We installed the Kweelt query engine on a local UNIX machine where the ACE-XQ system resides, and another one on a remote web server where

XML query containment

The problem of query containment is fundamental to query evaluation and optimization in database systems. This problem was first studied by Chandra and Merlin [7] for conjunctive queries, whose expressive power is equivalent to that of the Select-Project-Join (SPJ) queries in relational algebra. A flurry of research has followed to investigate all the relevant aspects ranging from the complexity theory to its practical applications in optimizing queries, answering queries

Conclusion

We have proposed a fine granularity replacement strategy and deployed it in our ACE-XQ XQuery caching system. As opposed to the total replacement at the query level, this strategy maintains utility values at the granularity of the XPath structures of a cached view. That is, our partial replacement discards the non-beneficial XML fragments while retaining the useful portions within the XML view document.

In this paper, we also report on extensive experiments which are conducted to compare the

Acknowledgements

This work was supported in part by the NSF NYI grant IIS-979624. Li Chen would like to thank IBM for the IBM Corporate Fellowship.

References (46)

  • A. Deutsch, V. Tannen, Containment of regular path expressions under integrity constraints, in: 8th International...
  • P. Buneman, S. Davidson, W. Fan, C. Hara, Keys for XML, in: World Wide Web Conference (WWW10), Hong Kong, China, 2001,...
  • D. Calvanese, G.D. Giacomo, M. Lenzerini, M.Y. Vardi, View-based query processing for regular path queries with...
  • P. Cao, E.W. Felten, K. Li, Application-controlled file caching policies, in: Proceedings of the USENIX Summer 1994...
  • P. Cao, S. Irani, Cost aware WWW proxy caching algorithms, in: Proceedings of USENIX Symposium on Internet Technologies...
  • M.J. Carey, M.J. Franklin, M. Zaharioudakis, Fine-grained sharing in a page server OODBMS, in: SIGMOD, Minneapolis,...
  • A.K. Chandra, P.M. Merlin, Optimal implementations of conjunctive queries in relational data bases, in: STOC, 1977, pp....
  • C.M. Chen, N. Roussopoulos, The implementation and performance evaluation of the ADMS query optimizer: integrating...
  • L. Chen, E.A. Rundensteiner, ACE-XQ: a CachE-aware XQuery answering system, in: Proceedings of the 5th International...
  • L. Chen, E.A. Rundensteiner, A semantic caching system for XQueries, Technical Report, Computer Science Department,...
  • L. Chen, E.A. Rundensteiner, S. Wang, XCache––a semantic caching system for XML queries, in: SIGMOD demonstration...
  • L. Chen, S. Wang, E.A. Rundensteiner, A fine-grained replacement strategy for XML query cache, in: 4th International...
  • E.G. Coffman et al., Operating Systems Theory, 1973
  • S. Dar, M.J. Franklin, B. Jonsson, Semantic data caching and replacement, in: VLDB, Bombay, India, 1996, pp....
  • A. Deutsch, M. Fernandez, D. Florescu, A. Levy, D. Suciu, A query language for XML, in: Proceedings of the 8th...
  • D. DeWitt, P. Futtersack, D. Maier, F. Velez, A study of three alternative workstation–server architectures for...
  • D. Florescu, A. Levy, D. Suciu, Query containment for conjunctive queries with regular expressions, in: Symposium on...
  • G. Miklau, D. Suciu, Containment and equivalence for an XPath fragment, in: Symposium on Principles of Database Systems...
  • H. Hosoya, B.C. Pierce, XDuce: a typed XML processing language, in: WebDB, Dallas, TX, May 2000, pp....
  • H. Hosoya, J. Vouillon, B.C. Pierce, Regular expression types for XML, Montreal, Canada, in: International Conference...
  • L.M. Haas, D. Kossmann, I. Ursu, Loading a cache with query results, in: Proceedings of the 25th VLDB Conference,...
  • V. Hristidis, M. Petropoulos. Semantic caching of XML databases, in: 5th International Workshop on the Web and...
  • T. Johnson, D. Shasha, 2Q: a low overhead high performance buffer management replacement algorithm, in: Proceedings of...