Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

Takeda, Masayuki; Miyamoto, Satoru; Kida, Takuya; Shinohara, Ayumi; Fukamachi, Shuichi; Shinohara, Takeshi; Arikawa, Setsuo

doi:10.1007/3-540-45735-6_16

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2476)

Abstract

Techniques for processing text files “as is” are presented, in which the given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the “as-is” principle. Another example is string matching over multi-byte character texts, a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file in such a language is a mixture of single-byte and multi-byte characters. A naive solution would be (1) to convert the given text into a fixed-length encoding and then apply any string matching routine to it; or (2) to search the text file directly, byte by byte, for (the encoding of) a pattern, which requires extra work for synchronization to avoid false detection. Both solutions, however, sacrifice search speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code, such as the Huffman code and variants of Unicode. We also generalize the technique to handle structured texts such as XML documents. Using this technique, we can avoid false detection of a keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of search speed.
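The false-detection problem mentioned in the abstract can be illustrated with a small sketch. It uses a toy prefix code invented for this example (bytes below 0x80 are single-byte characters; any byte from 0x80 up is a lead byte followed by exactly one trail byte), and the function names are hypothetical. The synchronized scan shown here corresponds to naive solution (2), not to the authors' algorithm, which the abstract does not describe in detail.

```python
# Toy prefix code (an assumption for illustration, not a real encoding):
# a byte below 0x80 is a complete single-byte character; a byte >= 0x80
# is a lead byte that is always followed by exactly one trail byte.


def char_len(lead):
    """Length in bytes of the codeword that starts with byte `lead`."""
    return 2 if lead >= 0x80 else 1


def naive_find_all(text, pattern):
    """Byte-wise search that ignores codeword boundaries."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def synchronized_find_all(text, pattern):
    """Report only matches that begin on a codeword boundary.

    Walking the text one codeword at a time is the extra
    synchronization work that costs search speed.
    """
    hits = []
    i = 0
    while i < len(text):
        if text.startswith(pattern, i):
            hits.append(i)
        i += char_len(text[i])
    return hits


# Codewords: [0x81 0x61] [0x61] [0x82 0x61] [0x62]
# The single-byte character "a" (0x61) also occurs as a trail byte.
text = bytes([0x81, 0x61, 0x61, 0x82, 0x61, 0x62])
pattern = b"a"

print(naive_find_all(text, pattern))         # offsets 1 and 4 are false detections
print(synchronized_find_all(text, pattern))  # only the true occurrence remains
```

Because the pattern is itself a sequence of complete codewords, a match that starts on a codeword boundary in a prefix code also ends on one, so checking the start position is sufficient in this sketch.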

References

A. V. Aho and M. Corasick. Efficient string matching: An aid to bibliographic search. Comm. ACM, 18(6):333–340, 1975.
A. Amir and G. Benson. Efficient two-dimensional compressed matching. In Proc. Data Compression Conference, page 279, 1992.
S. Arikawa and T. Shinohara. A run-time efficient realization of Aho-Corasick pattern matching machines. New Generation Computing, 2(2):171–186, 1984.
S. Arikawa et al. The text database management system SIGMA: An improvement of the main engine. In Proc. of Berliner Informatik-Tage, pages 72–81, 1989.
J. Jaakkola and P. Kilpeläinen. A tool to search structured text. University of Helsinki. (In preparation).
S. T. Klein and D. Shapira. Pattern matching in Huffman encoded texts. In Proc. Data Compression Conference 2001, pages 449–458. IEEE Computer Society, 2001.
D. E. Knuth. The Art of Computer Programming, Sorting and Searching, volume 3. Addison-Wesley, 1973.
N. J. Larsson and A. Moffat. Offline dictionary-based compression. In Proc. Data Compression Conference '99, pages 296–305. IEEE Computer Society, 1999.
M. Miyazaki, S. Fukamachi, M. Takeda, and T. Shinohara. Speeding up the pattern matching machine for compressed texts. Transactions of Information Processing Society of Japan, 39(9):2638–2648, 1998. (in Japanese).
D. Revuz. Minimisation of acyclic deterministic automata in linear time. Theoretical Computer Science, 92(1):181–189, 1992.
Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa. A Boyer-Moore type algorithm for compressed pattern matching. In Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, volume 1848 of Lecture Notes in Computer Science, pages 181–194. Springer-Verlag, 2000.
N. Uratani and M. Takeda. A fast string-searching algorithm for multiple patterns. Information Processing & Management, 29(6):775–791, 1993.
M. Yoshikawa and T. Amagasa. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology, 1(1):110–141, August 2001.

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, Fukuoka, 812-8581, Japan
Masayuki Takeda, Satoru Miyamoto, Ayumi Shinohara & Setsuo Arikawa
PRESTO, Japan Science and Technology Corporation (JST), Japan
Masayuki Takeda & Ayumi Shinohara
Kyushu University Library, Fukuoka, 812-8581, Japan
Takuya Kida
Department of Artificial Intelligence, Kyushu Institute of Technology, Iizuka, 820-8502, Japan
Shuichi Fukamachi & Takeshi Shinohara

Editor information

Editors and Affiliations

Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 31270-901, Belo Horizonte, MG, Brazil
Alberto H. F. Laender
Instituto Superior Técnico, INESC-ID, R. Alves Redol 9, 1000-029, Lisboa, Portugal
Arlindo L. Oliveira

About this paper

Cite this paper

Takeda, M. et al. (2002). Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts. In: Laender, A.H.F., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2002. Lecture Notes in Computer Science, vol 2476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45735-6_16

DOI: https://doi.org/10.1007/3-540-45735-6_16
Published: 18 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44158-8
Online ISBN: 978-3-540-45735-0
eBook Packages: Springer Book Archive
