short-paper

Frequent Multi-Byte Character Subtring Extraction using a Succinct Data Structure

Authors:
Phanucheep Chotnithi

SOKENDAI (The Graduate University for Advanced Studies), Tokyo, Japan

Atsuhiro Takasu

National Institute of Informatics, Tokyo, Japan


DocEng '16: Proceedings of the 2016 ACM Symposium on Document Engineering, September 2016, Pages 103–106
https://doi.org/10.1145/2960811.2967161

Published: 13 September 2016

ABSTRACT

Frequent string mining is widely used in text processing to extract text features. Most researchers have focused on text that uses single-byte characters. Consequently, their applications run into problems when applied to text represented with multibyte characters, such as Japanese and Chinese text. The main drawback is the huge memory usage required to handle multibyte character strings. To solve this problem, we use wavelet tree-based compressed suffix arrays instead of ordinary suffix arrays to reduce memory usage, together with a novel technique that uses the rank operation to improve runtime efficiency. Our experimental evaluation shows that the proposed method reduces processing time by 45% compared with a method using only compressed suffix arrays. The proposed method also reduces memory usage by 75%.
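
To illustrate the rank operation mentioned in the abstract, the following is a minimal Python sketch of a pointer-based wavelet tree built over whole characters rather than bytes. It is not the authors' implementation: the class name WaveletTree, the balanced split of the sorted alphabet, and the plain Python lists of prefix counts (standing in for succinct bit vectors) are all assumptions made for readability. Here rank(ch, i) returns the number of occurrences of ch among the first i characters of the text.

```python
class WaveletTree:
    """Illustrative wavelet tree over characters (hypothetical helper class)."""

    def __init__(self, text, alphabet=None):
        # Sorted distinct symbols handled by this node.
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.left = None
        self.right = None
        # prefix_ones[i] = number of 1-bits among the first i positions.
        # A real succinct structure would store a compressed bit vector instead.
        self.prefix_ones = [0]
        if len(self.alphabet) <= 1:
            return  # leaf: at most one symbol, nothing left to split
        mid = len(self.alphabet) // 2
        left_symbols = set(self.alphabet[:mid])
        left_text, right_text = [], []
        for ch in text:
            bit = 0 if ch in left_symbols else 1
            self.prefix_ones.append(self.prefix_ones[-1] + bit)
            (right_text if bit else left_text).append(ch)
        self.left = WaveletTree(left_text, self.alphabet[:mid])
        self.right = WaveletTree(right_text, self.alphabet[mid:])

    def rank(self, ch, i):
        """Return the number of occurrences of ch in text[:i]."""
        node = self
        while node.left is not None:
            mid = len(node.alphabet) // 2
            ones = node.prefix_ones[i]
            if ch < node.alphabet[mid]:
                i -= ones          # positions that were routed to the left child
                node = node.left
            else:
                i = ones           # positions that were routed to the right child
                node = node.right
        # At a leaf, i counts the single symbol stored there (if it is ch).
        return i if node.alphabet == [ch] else 0


text = "東京都と京都府"
tree = WaveletTree(text)
print(tree.rank("京", len(text)))  # 2
```

Because the tree depth grows with the logarithm of the number of distinct characters, each rank query in this sketch touches one node per level, regardless of how many bytes each character occupies in its encoded form.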

Index Terms

1. Information systems
2. Theory of computation
  1. Design and analysis of algorithms
    1. Algorithm design techniques
    2. Data structures design and analysis
      1. Pattern matching

Published in
DocEng '16: Proceedings of the 2016 ACM Symposium on Document Engineering
September 2016
222 pages
ISBN:9781450344388
DOI:10.1145/2960811
General Chair: Robert Sablatnig, TU Wien, Austria
Program Chair: Tamir Hassan, HP Labs, Austria
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
frequent string mining
longest common prefix
multibyte
suffix array
wavelet tree
Qualifiers
- short-paper
Acceptance Rates
DocEng '16 Paper Acceptance Rate: 11 of 35 submissions, 31%
Overall Acceptance Rate: 178 of 537 submissions, 33%