Current Approaches to XML Benchmarking

(Invited Talk)

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5667)

Abstract

XML benchmarking is as varied an issue as the potential applications of XML are numerous and diverse. It is, however, not yet clear which of these anticipated applications will prevail, or which of their features and components will have performance requirements demanding enough to necessitate benchmarking.

The performance evaluation of XML-based systems, tools and techniques can use either a benchmark, which consists of a predefined data set and workload, or a data set with an ad hoc workload. In both cases the data set can be real or synthetic. XML data generators such as ToXgene and Alphawork can generate XML documents whose characteristics, such as depth, breadth and various distributions, are controlled. Benchmarks are also expected to provide a data generator with a fair amount of control over the size and shape of the data, if the data is synthesized, or to offer a suite of data subsets of varying size and shape, if the data is real. Application-level evaluation emphasizes the representativeness of the data set and workload in terms of typical applications, while micro-level evaluation focuses on elementary and individual technical features.
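To make the idea of a controlled synthetic data set concrete, here is a minimal sketch of a generator whose depth and breadth are parameters. It does not reproduce how ToXgene or any other generator works; the function names `generate_tree` and `tree_depth`, the element names and the uniform fan-out distribution are all assumptions chosen for illustration.

```python
import random
import xml.etree.ElementTree as ET


def generate_tree(depth, max_fanout, seed=0):
    """Build a synthetic XML tree with exactly `depth` levels below the root.

    Each non-leaf element gets between 1 and `max_fanout` children, drawn
    uniformly (an assumed distribution; real generators offer more choices).
    """
    rng = random.Random(seed)  # seeded so the data set is reproducible
    root = ET.Element("root")
    frontier = [root]
    for level in range(1, depth + 1):
        next_frontier = []
        for parent in frontier:
            for _ in range(rng.randint(1, max_fanout)):
                child = ET.SubElement(parent, f"level{level}")
                next_frontier.append(child)
        frontier = next_frontier
    # Give the leaves text content so that queries have values to select.
    for leaf in frontier:
        leaf.text = str(rng.randint(0, 999))
    return root


def tree_depth(elem):
    """Number of edges on the longest path from `elem` down to a leaf."""
    return 1 + max((tree_depth(child) for child in elem), default=-1)


doc = generate_tree(depth=4, max_fanout=3, seed=42)
print(tree_depth(doc))  # 4, since every non-leaf has at least one child
```

Fixing the seed keeps the generated document identical across runs, which matters when the same synthetic data set has to be shared between systems under comparison.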

The dual view of XML, data view and document view, is reflected in its benchmarks. There exist several well-established benchmarks for XML data management systems that can be used to evaluate the performance of query processing. The main application-level benchmarks in this category are XOO7, XMach-1, XMark, and XBench, while The Michigan Benchmark is a micro-benchmark. For the evaluation of XML information retrieval, the prevalent benchmark is the series of INEX corpora and topics. In practice, however, whether for the evaluation of XML data management techniques or of XML retrieval techniques, researchers seem to favor real or synthetic data sets with ad hoc workloads when needed. The University of Washington repository gathers links to a variety of XML data sets. Notably, most of these data sets are small. The largest is 603MB. Popular data sets like Mondial or the Baseball Boxscore XML are much smaller. The Database and Logic Programming Bibliography XML data set, also used by many scientists, is around 500MB. All of these data sets are relatively structured and quite shallow, and thus do not necessarily convey the expected challenges associated with the semi-structured nature of XML.
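Whether a given data set is shallow and regular can be checked with a few lines of code. The sketch below computes the maximum depth and the number of distinct root-to-node label paths of a document, then times one ad hoc query against it. The helper names `shape` and `time_query` and the inline sample document are assumptions made for illustration; they are not taken from any of the benchmarks or repositories named above.

```python
import time
import xml.etree.ElementTree as ET

# Tiny hypothetical bibliography document standing in for a real data set.
SAMPLE = """<bib>
  <article><author>A</author><title>T1</title><year>2008</year></article>
  <article><author>B</author><title>T2</title><year>2009</year></article>
</bib>"""


def shape(root):
    """Return (max depth, number of distinct label paths) of an XML tree."""
    max_depth = 0
    paths = set()
    stack = [(root, 0, root.tag)]
    while stack:
        elem, depth, path = stack.pop()
        max_depth = max(max_depth, depth)
        paths.add(path)
        for child in elem:
            stack.append((child, depth + 1, path + "/" + child.tag))
    return max_depth, len(paths)


def time_query(root, xpath, repeats=100):
    """Return (hit count, average seconds) for an ad hoc query."""
    result = []
    start = time.perf_counter()
    for _ in range(repeats):
        result = root.findall(xpath)
    elapsed = time.perf_counter() - start
    return len(result), elapsed / max(repeats, 1)


root = ET.fromstring(SAMPLE)
print(shape(root))  # (2, 5): two levels deep, five distinct label paths
hits, seconds = time_query(root, ".//article[year='2009']/title")
print(hits)  # 1
```

A small number of distinct label paths relative to the number of elements indicates a regular, table-like document, which is the kind of structure the paragraph above describes.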

If the application-level data sets and workloads are not satisfactory, it may well be because XML, as a language used to structure and manage content, has not yet matured. We must ask ourselves what there really is to benchmark. As of today, XML data are most commonly produced by office suites and software development kits. Office suites supporting Office Open XML and Open Document Format are, or will soon become, the principal producers of XML. Yet in these environments XML is principally used to represent formatting instructions. Similarly, the widespread adoption of Web service standards in software development frameworks and kits (in the .NET framework, for instance) also contributes to the creation of large amounts of XML data. Here again, XML is primarily used to represent formats (e.g. SOAP messages).

Although both XML-based document standards and Web service standards have intrinsic provision for XML content and have been designed to enable the management of content in XML, few users yet have the tools, the desire and the culture to manage their data in XML. Consequently, at least for now, it seems that the huge amounts of XML data created in the background of authoring and programming activities need neither be queried nor searched; they only need to be processed by the office suites and compilers. The emphasis is still on format rather than on content structuring and management. Of course, proponents of XML as a format for content hope that the XML-ization of formats will facilitate the XML-ization of the content.

With XML-based protocols and formats, XML as a “standards' standard” (as there are compiler compilers) has been most successful at the lower layers of information management. The efforts for content organization and management, on the other hand, do not seem to have been as pervasive and prolific (in terms of the amount of XML data produced and used). For instance, the volume of data in the much talked about business XML standards (RosettaNet or Universal Business Language, for instance) is still difficult to measure and may not be or become significant. In this presentation we critically review the existing approaches to benchmarking XML-based systems and applications. We try to analyze the trends in the usage of XML in order to determine the needs and requirements for the successful design, development and adoption of benchmarks.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bressan, S. (2009). Current Approaches to XML Benchmarking. In: Chen, L., Liu, C., Liu, Q., Deng, K. (eds) Database Systems for Advanced Applications. DASFAA 2009. Lecture Notes in Computer Science, vol 5667. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04205-8_2

  • DOI: https://doi.org/10.1007/978-3-642-04205-8_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04204-1

  • Online ISBN: 978-3-642-04205-8

  • eBook Packages: Computer Science, Computer Science (R0)
