Abstract
XML data is queried with XPath expressions, which are a limited form of regular expressions. New XML stream processing applications, such as content-based routing or selective dissemination of information, require thousands or millions of XPath expressions to be evaluated simultaneously on the incoming XML stream at a high, sustained rate. Conceptually, the XPath evaluation problem is analogous to the text search problem, in which one or several regular expressions need to be matched against a given text. Here, however, the number of regular expressions is much larger, while the "text" is much shorter, since its length corresponds to the depth of the XML stream. In this paper we examine techniques that have been proposed for XML stream processing, which are variations of either a non-deterministic or a deterministic finite automaton (NFA or DFA). For the latter, we describe a series of theoretical results establishing lower and upper bounds on the number of DFA states for sets of XPath expressions.
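To make the analogy with text search concrete, the following is a minimal sketch of the NFA-style approach, not the construction used in the paper. It assumes a restricted XPath fragment (only the `/` and `//` axes, name tests, and `*`), and the function names `compile_xpath` and `match_stream` are hypothetical. All expressions are evaluated in a single pass, and the stack of active state sets never grows beyond the depth of the document, which is the sense in which the "text" is short.

```python
import io
import re
import xml.etree.ElementTree as ET


def compile_xpath(expr):
    """Split a linear XPath such as '/a//b/*' into (axis, name) steps."""
    steps = re.findall(r"(//|/)([^/]+)", expr)
    if not steps or "".join(axis + name for axis, name in steps) != expr:
        raise ValueError(f"unsupported expression: {expr!r}")
    return steps


def match_stream(xml_bytes, exprs):
    """Count, for every expression, the elements it selects, in one pass."""
    queries = [compile_xpath(e) for e in exprs]
    counts = {e: 0 for e in exprs}
    # An NFA state is (query index, number of steps already matched).
    # The stack holds one set of active states per open element.
    stack = [{(q, 0) for q in range(len(queries))}]
    # iterparse is used for brevity; it still builds a tree internally,
    # so a production filter would use a SAX-style parser instead.
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "end":
            stack.pop()  # leaving the element restores the parent's states
            continue
        active = set()
        for q, i in stack[-1]:
            axis, name = queries[q][i]
            if axis == "//":
                active.add((q, i))  # descendant axis: state stays alive below
            if name in ("*", elem.tag):
                if i + 1 == len(queries[q]):
                    counts[exprs[q]] += 1  # last step matched: element selected
                else:
                    active.add((q, i + 1))
        stack.append(active)
    return counts


doc = b"<a><b><c/></b><c/><d><b/></d></a>"
result = match_stream(doc, ["/a/b", "//b", "/a//c", "/a/*/c", "/b"])
print(result)
```

The per-element cost of this sketch grows with the number of active states, which is the overhead that motivates the deterministic alternative discussed in the abstract.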
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Suciu, D. (2002). From Searching Text to Querying XML Streams. In: Laender, A.H.F., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2002. Lecture Notes in Computer Science, vol 2476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45735-6_2
DOI: https://doi.org/10.1007/3-540-45735-6_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44158-8
Online ISBN: 978-3-540-45735-0
eBook Packages: Springer Book Archive