Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2476)

Abstract

Techniques for processing text files “as is” are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the “as-is” principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file in such a language is a mixture of single-byte and multi-byte characters. A naive solution would be (1) to convert the given text into a fixed-length encoding and then apply any string matching routine to it; or (2) to search the text file directly, byte by byte, for (the encoding of) a pattern, in which case extra work is needed for synchronization to avoid false detection. Both solutions, however, sacrifice search speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code, such as the Huffman code and variants of Unicode. We also generalize the technique to handle structured texts such as XML documents. Using this technique, we can avoid false detection of a keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of search speed.
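The false-detection problem and one way around it can be illustrated with a small sketch. It is not the authors' construction: it assumes a made-up prefix code (`CODE`) and pairs a character-level KMP state with a node of the codeword trie, so that each input byte costs one table lookup and matches can only start on character boundaries. All names (`CODE`, `encode`, `naive_byte_search`, `build_machine`, `search`) are hypothetical, and the pattern is assumed to be non-empty.

```python
# Toy prefix code, assumed purely for illustration (not a real encoding):
# one-byte codewords 0x01 and 0x02, two-byte codewords led by 0x81.
CODE = {"a": b"\x01", "b": b"\x02", "X": b"\x81\x01", "Y": b"\x81\x02"}


def encode(text):
    return b"".join(CODE[ch] for ch in text)


def naive_byte_search(text_bytes, pattern):
    """Byte-by-byte search with no synchronization; may report false hits."""
    needle = encode(pattern)
    hits, start = [], text_bytes.find(needle)
    while start != -1:
        hits.append(start)
        start = text_bytes.find(needle, start + 1)
    return hits


def build_machine(pattern):
    """Transition table over bytes; a state is (KMP state, code-trie node)."""
    # Code trie: node 0 is the root; leaf[(node, byte)] is the completed character.
    trie, leaf = [{}], {}
    for ch, word in CODE.items():
        node = 0
        for byte in word[:-1]:
            if byte not in trie[node]:
                trie.append({})
                trie[node][byte] = len(trie) - 1
            node = trie[node][byte]
        leaf[(node, word[-1])] = ch

    # Character-level KMP failure function: fail[i] is the length of the
    # longest proper border of pattern[:i].
    m = len(pattern)
    fail, k = [0] * (m + 1), 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k

    def step_char(q, ch):
        while q and (q == m or pattern[q] != ch):
            q = fail[q]
        return q + 1 if pattern[q] == ch else 0

    delta = {}
    for q in range(m + 1):
        for (node, byte), ch in leaf.items():
            # Codeword completed: advance the KMP state, return to the trie root.
            delta[(q, node, byte)] = (step_char(q, ch), 0)
        for node, edges in enumerate(trie):
            for byte, child in edges.items():
                # Inside a codeword: the KMP state is left unchanged.
                delta[(q, node, byte)] = (q, child)
    return delta


def search(text_bytes, pattern):
    """Return byte offsets of true occurrences, using one lookup per byte."""
    delta, m, width = build_machine(pattern), len(pattern), len(encode(pattern))
    q, node, hits = 0, 0, []
    for pos, byte in enumerate(text_bytes):
        q, node = delta[(q, node, byte)]
        if q == m and node == 0:
            hits.append(pos + 1 - width)
    return hits


text = encode("Xa")                    # bytes 81 01 01
print(naive_byte_search(text, "a"))    # [1, 2]: offset 1 lies inside "X"
print(search(text, "a"))               # [2]
```

The table is built once per pattern, so the scanning loop does the same amount of work per byte as an ordinary byte-level automaton would. The sketch raises `KeyError` on bytes that are not valid in the toy code; a real implementation would need a policy for malformed input.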

References

  1. A. V. Aho and M. Corasick. Efficient string matching: An aid to bibliographic search. Comm. ACM, 18(6):333–340, 1975.

  2. A. Amir and G. Benson. Efficient two-dimensional compressed matching. In Proc. Data Compression Conference, page 279, 1992.

  3. S. Arikawa and T. Shinohara. A run-time efficient realization of Aho-Corasick pattern matching machines. New Generation Computing, 2(2):171–186, 1984.

  4. S. Arikawa et al. The text database management system SIGMA: An improvement of the main engine. In Proc. of Berliner Informatik-Tage, pages 72–81, 1989.

  5. J. Jaakkola and P. Kilpeläinen. A tool to search structured text. University of Helsinki. (In preparation).

  6. S. T. Klein and D. Shapira. Pattern matching in Huffman encoded texts. In Proc. Data Compression Conference 2001, pages 449–458. IEEE Computer Society, 2001.

  7. D. E. Knuth. The Art of Computer Programming, Sorting and Searching, volume 3. Addison-Wesley, 1973.

  8. N. J. Larsson and A. Moffat. Offline dictionary-based compression. In Proc. Data Compression Conference ’99, pages 296–305. IEEE Computer Society, 1999.

  9. M. Miyazaki, S. Fukamachi, M. Takeda, and T. Shinohara. Speeding up the pattern matching machine for compressed texts. Transactions of Information Processing Society of Japan, 39(9):2638–2648, 1998. (in Japanese).

  10. D. Revuz. Minimisation of acyclic deterministic automata in linear time. Theoretical Computer Science, 92(1):181–189, 1992.

  11. Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa. A Boyer-Moore type algorithm for compressed pattern matching. In Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, volume 1848 of Lecture Notes in Computer Science, pages 181–194. Springer-Verlag, 2000.

  12. N. Uratani and M. Takeda. A fast string-searching algorithm for multiple patterns. Information Processing & Management, 29(6):775–791, 1993.

  13. M. Yoshikawa and T. Amagasa. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology, 1(1):110–141, August 2001.


Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Takeda, M. et al. (2002). Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts. In: Laender, A.H.F., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2002. Lecture Notes in Computer Science, vol 2476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45735-6_16

  • DOI: https://doi.org/10.1007/3-540-45735-6_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44158-8

  • Online ISBN: 978-3-540-45735-0

  • eBook Packages: Springer Book Archive
