Towards a real time algorithm for parameterized longest common prefix computation☆,☆☆
Introduction
Most indexing constructions can be done in optimal linear time. In many cases, however, this means that the amortized time per text extension is constant, while on some input symbols the algorithm spends up to linear time (e.g. [14], [22], [29], [30]). There is therefore an advantage in supplying an indexing construction with a good worst-case time per symbol extension, and much effort has been devoted to developing indexing algorithms in which the time per new symbol insertion is bounded (e.g. [5], [6], [7], [12], [18], [19]).
The parameterized matching problem was introduced by Baker [10], [11]. Her main motivation lay in software maintenance, where program fragments are to be considered “identical” even if variable names are different. Therefore, strings under this model are composed of symbols from two disjoint sets Σ and Π, containing fixed symbols and parameter symbols respectively. In this paradigm, one seeks parameterized occurrences, i.e., occurrences up to a renaming of the parameter symbols, of one string in another. This renaming is a bijection on the parameter symbols. An optimal algorithm for parameterized pattern matching appeared in [4]. In this problem the pattern and the text are given as input and one seeks to report all parameterized occurrences. Approximate parameterized pattern matching was investigated in [8], [10], [15]. Idury and Schäffer [17] considered matching of multiple parameterized patterns.
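As a minimal illustration of the definition above, the following sketch tests whether two equal-length strings p-match by building the renaming incrementally and rejecting any pair that would break injectivity. The function name `p_match` and the representation of Π as a Python set `params` are assumptions made for this example; this is the naive check, not the optimal algorithm of [4].

```python
def p_match(s, t, params):
    """Return True if s and t are equal up to a one-to-one renaming
    of the symbols in `params` (the parameter alphabet)."""
    if len(s) != len(t):
        return False
    forward = {}   # parameter of s -> parameter of t
    backward = {}  # parameter of t -> parameter of s
    for a, b in zip(s, t):
        a_is_param = a in params
        if a_is_param != (b in params):
            return False          # a parameter can only face a parameter
        if not a_is_param:
            if a != b:
                return False      # fixed symbols must be identical
            continue
        # Record the pair, or verify it agrees with the renaming so far.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


params = set("xyz")
print(p_match("xayax", "yazay", params))  # renaming x->y, y->z
print(p_match("xay", "zaz", params))      # x and y would both map to z
```

The two dictionaries together enforce that the partial renaming is one-to-one in both directions, which is what allows it to be extended to a bijection on a finite parameter alphabet.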
Parameterized matching has proven useful in other contexts as well. An interesting problem is searching for color images (e.g. [3], [9], [26]). Assume, for example, that we are seeking a given icon in any possible color map. If the colors were fixed, then this is exact two-dimensional pattern matching [2]. However, if the color map is different the exact matching algorithm would not find the pattern. Parameterized two-dimensional search is precisely what is needed. If, in addition, one is also willing to lose resolution, then a two dimensional function matching search should be used, where the renaming function is not necessarily a bijection [1].
Parameterized matching can be solved in linear time, when a constant-sized alphabet is considered [4]. Baker [10] showed that a parameterized suffix tree can be constructed in linear time, for a text over a constant-sized alphabet. Lee, Na and Park [20] showed that it can be constructed online in randomized linear time for unbounded alphabets.
In this paper, we construct a suffix tree for parameterized matching, where the time per text symbol extension is logarithmic. Our algorithm works for unbounded alphabets. Amir, Farach, and Muthukrishnan proved that parameterized matching cannot be done in linear time for unbounded alphabets [4]; thus it is impossible to perform such a symbol extension in constant time.
Our idea is to use the mechanism of Amir et al. [5]. They define a transformation from any structure that stores keys in a sorted order, where comparison is done in constant time, to a new structure which works on keys of unbounded length. This transformation is useful in enabling known structures (e.g. [16], [23], [24], [25], [27]) to support long keys, i.e. strings. Then using an external data structure they apply the transformation and supply an online indexing scheme with logarithmic time per symbol extension, and search in time.
In this paper we generalize that algorithm to parameterized matching. This is done by the following main ideas:
1. The order between two parameterized strings is defined to be the order between their respective first distinct symbols, which follow their longest common prefixes that p-match. The order between two symbols in p-strings is determined by their order after applying the pv transformation. The pv of a symbol is the distance to the previous occurrence of that symbol. It was introduced by Baker [11] and is formally defined in Section 2.
2. We show how to efficiently calculate the parameterized longest common prefix of two suffixes by using the value for the suffixes that follow them.
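The two ideas above can be illustrated with a small sketch. It assumes the usual convention that the first occurrence of a parameter is encoded as 0 and that fixed symbols are left unchanged; the names `prev_encode` and `plcp_naive` are hypothetical. The second function is a direct scan that re-encodes each suffix from scratch, so it shows only what is being computed, not the efficient calculation the paper develops.

```python
def prev_encode(s, params):
    """Encode each parameter as the distance to its previous occurrence
    (0 for a first occurrence); fixed symbols are kept as they are."""
    last = {}
    out = []
    for i, c in enumerate(s):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def plcp_naive(s, i, j, params):
    """Length of the longest prefixes of s[i:] and s[j:] that p-match,
    found by comparing the encodings of the two suffixes."""
    a = prev_encode(s[i:], params)
    b = prev_encode(s[j:], params)
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


params = set("xyz")
print(prev_encode("xayax", params))
print(plcp_naive("xaxyay", 0, 3, params))
```

Distances are stored as integers and fixed symbols as one-character strings, so a fixed symbol such as "1" can never be confused with a distance of 1. Each suffix has to be encoded on its own, because an occurrence whose predecessor lies before the start of the suffix counts as a first occurrence.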
Section snippets
Preliminaries
Let Σ be an alphabet. A string S over Σ is a finite sequence of letters from Σ. By S[i], for 1 ≤ i ≤ |S|, we denote the ith letter of S. The empty string is denoted by ϵ. By S[i..j] we denote the string S[i]…S[j], called a factor of S (if i > j, then the factor is the empty string). A factor is called a prefix if i = 1 and a suffix if j = |S|. By S_i we denote the suffix which starts from position i in S. And by lcp(S, S′) we denote the longest common prefix (lcp) of S and S′. In case S and
Fast calculation
In this section, we show how to calculate the parameterized longest common prefix of two suffixes using the value for the following suffixes. We then show how to maintain two structures, one for supporting this calculation in constant time, and the second for comparing two p-strings in constant time based on their parameterized longest common prefix.
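One elementary relation between the value for two suffixes and the value for the suffixes that follow them can be checked by brute force: if two prefixes p-match, they still p-match after the first symbol of each is dropped, so the value at positions (i, j) exceeds the value at (i + 1, j + 1) by at most one. The sketch below verifies only this bound on a sample string. It is not the constant-time calculation or the structures described in this section, and the names `plcp` and `check_bound` are assumptions made for illustration.

```python
def plcp(s, i, j, params):
    """Brute-force parameterized lcp of s[i:] and s[j:], extending a
    one-to-one renaming of parameters symbol by symbol."""
    fwd, bwd = {}, {}
    k = 0
    while i + k < len(s) and j + k < len(s):
        a, b = s[i + k], s[j + k]
        if (a in params) != (b in params):
            break                         # parameter facing a fixed symbol
        if a in params:
            if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
                break                     # renaming would stop being one-to-one
        elif a != b:
            break                         # fixed symbols differ
        k += 1
    return k


def check_bound(s, params):
    """Check plcp(i, j) <= plcp(i + 1, j + 1) + 1 for every pair of suffixes."""
    n = len(s)
    return all(
        plcp(s, i, j, params) <= plcp(s, i + 1, j + 1, params) + 1
        for i in range(n)
        for j in range(n)
    )


params = set("xyz")
print(plcp("xaxyay", 0, 3, params), plcp("xaxyay", 1, 4, params))
print(check_bound("xayxbyzaz", params))
```

The bound is only one-sided: the value at (i, j) can be much smaller than the value at (i + 1, j + 1) plus one, for instance when the first symbols already fail to match, which is why an exact calculation needs more information than this inequality provides.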
Online indexing for the parameterized case
In this section, we describe two online indexing algorithms which work on strings over unbounded alphabets and support logarithmic worst case time per text extension. During the exposition, we show how to support the parameterized case. Our method is also correct for the original data structure which stores unbounded-length keys and therefore is generalized to support unbounded-length parameterized keys.
Amir et al. [5] define a transformation from any structure that stores keys in a sorted
Conclusions and open problems
We have provided a fast method to compute the parameterized longest common prefix of two suffixes based on the value of the following suffixes. We have shown how to apply this result to achieve a parameterized online indexing algorithm. The algorithm requires external structures that enable the calculation in constant time. The structures use space. We have also presented an online indexing algorithm that supports worst-case symbol extension in logarithmic time.
A number of open questions arise from our work. The
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (30)
- et al. Alphabet dependence in parameterized matching. Inf. Process. Lett. (1994)
- et al. Parameterized matching with mismatches. J. Discret. Algorithms (2007)
- Parameterized pattern matching: algorithms and applications. J. Comput. Syst. Sci. (1996)
- et al. Function matching. SIAM J. Comput. (2006)
- et al. An alphabet independent approach to two dimensional pattern matching. SIAM J. Comput. (1994)
- et al. Separable attributes: a technique for solving the submatrices character count problem.
- et al. Managing unbounded-length keys in comparison-driven data structures with applications to online indexing. SIAM J. Comput. (2014)
- et al. Towards real-time suffix tree construction.
- et al. Real-time indexing over fixed finite alphabets.
- et al. Color indexing for efficient image retrieval. Multimed. Tools Appl. (Nov. 1995)
- Parameterized duplication in strings: algorithms and an application to software maintenance. SIAM J. Comput.
- Real-time streaming string-matching.
- Two algorithms for maintaining order in a list.
- Optimal suffix tree construction with large alphabets.
- Approximate parameterized matching.
☆☆ This work is part of the second author's Ph.D. dissertation.