Elsevier

Theoretical Computer Science

Volume 852, 8 January 2021, Pages 132-137

Towards a real time algorithm for parameterized longest common prefix computation ☆,☆☆

https://doi.org/10.1016/j.tcs.2020.11.023

Abstract

Parameterized matching has proven to be an efficient and useful tool for detecting code duplications. This paper presents a technique for calculating the parameterized Longest Common Prefix (plcp) of two suffixes in constant time, based on knowledge of the plcp of the following suffixes. Using this technique, online p-suffix tree construction can be done in worst case time O(log n) per input symbol. Searching for a pattern of length m in the resulting suffix tree takes O(min{m log(|Σ|+|Π|), m + log n} + m·τ_Π + tocc) time, where tocc is the number of occurrences of the pattern, and τ_Π depends on Π. For constant-sized Π, τ_Π = 1; for polynomial-sized Π, τ_Π = log log |Π|; and for unbounded Π, τ_Π = log |Π|.

Introduction

Most indexing constructions can be done in optimal linear time. However, in many cases this means that the amortized time per text extension is constant, while on some input text symbols the algorithm spends up to linear time (e.g. [14], [22], [29], [30]). There is therefore an advantage in an indexing construction that guarantees a good worst case time for each symbol extension. Much effort has been devoted to developing indexing algorithms where the time per new symbol insertion is bounded (e.g. [5], [6], [7], [12], [18], [19]).

The parameterized matching problem was introduced by Baker [10], [11]. Her main motivation lay in software maintenance, where program fragments are to be considered “identical” even if variable names are different. Therefore, strings under this model are comprised of symbols from two disjoint sets Σ and Π containing fixed symbols and parameter symbols respectively. In this paradigm, one seeks parameterized occurrences, i.e., occurrences up to a renaming of the parameter symbols, of one string in another. This renaming is a bijection b: Π → Π. An optimal algorithm for parameterized pattern matching appeared in [4]. In this problem the pattern and the text are given as input and one seeks to report all parameterized occurrences. Approximate parameterized pattern matching was investigated in [8], [10], [15]. Idury and Schäffer [17] considered matching of multiple parameterized patterns.
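To make the definition concrete, the following is a minimal sketch of a direct p-match test between two strings of equal length. The function name `p_match` and the representation of Π as a Python set are illustrative assumptions, not part of the paper. The check builds the renaming incrementally and rejects as soon as it stops being one-to-one; this is a plain definitional test, not the optimal algorithm of [4].

```python
def p_match(s, t, params):
    """Return True if s and t p-match.

    Fixed symbols (not in `params`) must be identical; parameter symbols
    must correspond under a one-to-one renaming, which can always be
    extended to a bijection on a finite parameter alphabet.
    `params` is a hypothetical stand-in for the parameter alphabet.
    """
    if len(s) != len(t):
        return False
    forward, backward = {}, {}  # renaming and its inverse, built on the fly
    for a, b in zip(s, t):
        a_is_param, b_is_param = a in params, b in params
        if a_is_param != b_is_param:
            return False  # a fixed symbol can never match a parameter
        if not a_is_param:
            if a != b:
                return False  # fixed symbols must agree exactly
            continue
        # Both are parameters: the pair (a, b) must be consistent both ways.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


if __name__ == "__main__":
    pi = {"x", "y"}
    print(p_match("axbyx", "aybxy", pi))  # renaming x->y, y->x
    print(p_match("axbx", "axby", pi))    # x would need two images
```

The two dictionaries together enforce injectivity in both directions, which is what distinguishes parameterized matching from function matching, where the renaming need not be a bijection.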

Parameterized matching has proven useful in other contexts as well. An interesting problem is searching for color images (e.g. [3], [9], [26]). Assume, for example, that we are seeking a given icon in any possible color map. If the colors were fixed, then this is exact two-dimensional pattern matching [2]. However, if the color map is different the exact matching algorithm would not find the pattern. Parameterized two-dimensional search is precisely what is needed. If, in addition, one is also willing to lose resolution, then a two dimensional function matching search should be used, where the renaming function is not necessarily a bijection [1].

Parameterized matching can be solved in linear time, when a constant-sized alphabet is considered [4]. Baker [10] showed that a parameterized suffix tree can be constructed in linear time, for a text over a constant-sized alphabet. Lee, Na and Park [20] showed that it can be constructed online in randomized linear time for unbounded alphabets.

In this paper, we construct a suffix tree for parameterized matching, where the time per text symbol extension is logarithmic. Our algorithm works for unbounded alphabets. Amir, Farach, and Muthukrishnan proved that parameterized matching cannot be done in linear time for unbounded alphabets [4]; thus it is impossible to perform such a symbol extension in constant time.

Our idea is to use the mechanism of Amir et al. [5]. They define a transformation from any structure that stores keys in a sorted order, where comparison is done in constant time, to a new structure which works on keys of unbounded length. This transformation is useful in enabling known structures (e.g. [16], [23], [24], [25], [27]) to support long keys, i.e. strings. Then, using an external data structure, they apply the transformation and supply an online indexing scheme with logarithmic time per symbol extension, and search in O(min{m log |Σ|, m + log n} + tocc) time.

In this paper we generalize that algorithm to parameterized matching. This is done by the following main ideas:

  1. The order between two parameterized strings is defined to be the order between their respective first distinct symbols which follow their longest common prefixes that p-match. The order between two symbols in p-strings is determined by the order after applying the pv transformation. The pv of a symbol is the distance to the last occurrence of the symbol. It was introduced by Baker [11] and is formally defined in Section 2.

  2. We show how to efficiently calculate the parameterized longest common prefix (plcp) of two suffixes i, j by using the plcp value of the following suffixes, i+1, j+1.

The above two facts allow us to generalize the Amir et al. [5] algorithm to support parameterized matching.
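The first idea can be illustrated with a small sketch. It assumes the usual convention that the first occurrence of a parameter is encoded as 0, and it assumes an arbitrary total order on encoded symbols (integers before fixed symbols); the paper's actual order is defined in its Section 2 and may differ. The names `pv_encode`, `plcp` and `p_compare` are hypothetical.

```python
def pv_encode(s, params):
    """Replace each parameter by the distance to its previous occurrence.

    Assumption: a first occurrence is encoded as 0. Fixed symbols are kept.
    """
    last_seen = {}
    encoded = []
    for pos, sym in enumerate(s):
        if sym in params:
            encoded.append(pos - last_seen[sym] if sym in last_seen else 0)
            last_seen[sym] = pos
        else:
            encoded.append(sym)
    return encoded


def plcp(s, t, params):
    """Length of the longest prefixes of s and t that p-match (naive scan)."""
    length = 0
    for a, b in zip(pv_encode(s, params), pv_encode(t, params)):
        if a != b:
            break
        length += 1
    return length


def _rank(encoded_symbol):
    # Assumed order for illustration only: integers sort before fixed symbols.
    if isinstance(encoded_symbol, int):
        return (0, encoded_symbol, "")
    return (1, 0, encoded_symbol)


def p_compare(s, t, params):
    """Return -1, 0 or 1 by comparing the first encoded symbols after the plcp."""
    es, et = pv_encode(s, params), pv_encode(t, params)
    k = plcp(s, t, params)
    if k == len(es) or k == len(et):
        return (len(es) > len(et)) - (len(es) < len(et))  # shorter string first
    return (_rank(es[k]) > _rank(et[k])) - (_rank(es[k]) < _rank(et[k]))


if __name__ == "__main__":
    pi = {"x", "y"}
    print(pv_encode("axbyx", pi))
    print(plcp("xyx", "xyy", pi), p_compare("xyx", "xyy", pi))
```

Two strings p-match exactly when their encodings are equal, so once the plcp is known the comparison reduces to a single pair of encoded symbols. This is why a constant-time plcp yields a constant-time comparison.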


Preliminaries

Let Σ be an alphabet. A string S over Σ is a finite sequence of letters from Σ. By S[i], for 1 ≤ i ≤ |S|, we denote the ith letter of S. The empty string is denoted by ϵ. By S[i..j] we denote the string S[i]⋯S[j], called a factor of S (if i > j, then the factor is the empty string). A factor is called a prefix if i = 1 and a suffix if j = |S|. By S_i = S[i..] we denote the suffix which starts from position i in S. By lcp(S, S′) we denote the longest common prefix (lcp) of S and S′. In case S and S′

Fast plcp calculation

In this section, we show how to calculate plcp(i,j) using the plcp of the following suffixes. We then show how to maintain two structures, one for supporting this calculation in constant time, and the second for comparing two p-strings in constant time based on their plcp.
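The paper's own lemma is not reproduced in this excerpt, so the sketch below is only one way such a relation can be realized, derived directly from the pv encoding; it should not be read as the authors' formulation. It assumes 0-based positions, the convention that a first occurrence is encoded as 0, and a precomputed array `fw` giving the distance from each parameter to its next occurrence (infinity if none, or if the symbol is fixed). All names are hypothetical. Given k = plcp(i+1, j+1), each step takes constant time.

```python
import math


def pv_encode(s, params):
    """pv encoding: distance to the previous occurrence, 0 for a first one."""
    last_seen, encoded = {}, []
    for pos, sym in enumerate(s):
        if sym in params:
            encoded.append(pos - last_seen[sym] if sym in last_seen else 0)
            last_seen[sym] = pos
        else:
            encoded.append(sym)
    return encoded


def naive_plcp(s, i, j, params):
    """Reference answer: compare the encodings of the two suffixes directly."""
    a, b = pv_encode(s[i:], params), pv_encode(s[j:], params)
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


def next_occurrence(s, params):
    """fw[i] = distance from i to the next occurrence of parameter s[i]."""
    fw = [math.inf] * len(s)
    nearest = {}
    for i in range(len(s) - 1, -1, -1):
        if s[i] in params:
            if s[i] in nearest:
                fw[i] = nearest[s[i]] - i
            nearest[s[i]] = i
    return fw


def plcp_step(s, i, j, k_next, fw, params):
    """Compute plcp(i, j) in O(1) from k_next = plcp(i+1, j+1)."""
    i_param, j_param = s[i] in params, s[j] in params
    if i_param != j_param or (not i_param and s[i] != s[j]):
        return 0  # the first symbols are already incompatible
    a, b = fw[i], fw[j]
    # Prepending s[i] only changes the encoding where s[i] next occurs.
    if a == b or min(a, b) > k_next:
        return k_next + 1
    return min(a, b)  # mismatch at the earlier of the two next occurrences


def plcp_by_recurrence(s, i, j, params):
    """Fold plcp_step from the end of the text back to positions (i, j)."""
    fw = next_occurrence(s, params)
    k = 0
    for d in range(len(s) - max(i, j) - 1, -1, -1):
        k = plcp_step(s, i + d, j + d, k, fw, params)
    return k


if __name__ == "__main__":
    print(plcp_by_recurrence("xyxy", 0, 1, {"x", "y"}),
          naive_plcp("xyxy", 0, 1, {"x", "y"}))
```

The fold in `plcp_by_recurrence` is only there to check the step against the naive scan; in an online index the value for the following suffixes would already be available, so only `plcp_step` would be evaluated per pair.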

Online indexing for the parameterized case

In this section, we describe two online indexing algorithms which work on strings over unbounded alphabets and support logarithmic worst case time per text extension. During the exposition, we show how to support the parameterized case. Our method is also correct for the original data structure which stores unbounded-length keys and therefore is generalized to support unbounded-length parameterized keys.

Amir et al. [5] define a transformation from any structure that stores keys in a sorted

Conclusions and open problems

We have provided a fast method to compute the plcp of two suffixes based on the plcp value of the following suffixes. We have shown how to apply this result to achieve a parameterized online indexing algorithm. The algorithm requires external structures that enable the plcp calculation in constant time. The structures use O(n) space. We have also presented an online indexing algorithm that supports worst case symbol extension in O(log n) time.

A number of open questions arise from our work. The

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (30)

  • A. Amir et al., Alphabet dependence in parameterized matching, Inf. Process. Lett. (1994)
  • A. Apostolico et al., Parameterized matching with mismatches, J. Discret. Algorithms (2007)
  • B.S. Baker, Parameterized pattern matching: algorithms and applications, J. Comput. Syst. Sci. (1996)
  • A. Amir et al., Function matching, SIAM J. Comput. (2006)
  • A. Amir et al., An alphabet independent approach to two dimensional pattern matching, SIAM J. Comput. (1994)
  • A. Amir et al., Separable attributes: a technique for solving the submatrices character count problem
  • A. Amir et al., Managing unbounded-length keys in comparison-driven data structures with applications to online indexing, SIAM J. Comput. (2014)
  • A. Amir et al., Towards real-time suffix tree construction
  • A. Amir et al., Real-time indexing over fixed finite alphabets
  • G.P. Babu et al., Color indexing for efficient image retrieval, Multimed. Tools Appl. (Nov. 1995)
  • B.S. Baker, Parameterized duplication in strings: algorithms and an application to software maintenance, SIAM J. Comput. (1997)
  • D. Breslauer et al., Real-time streaming string-matching
  • P.F. Dietz et al., Two algorithms for maintaining order in a list
  • M. Farach, Optimal suffix tree construction with large alphabets
  • C. Hazay et al., Approximate parameterized matching


☆ This work was partially supported by ISF grant 1475/18 and BSF grant 2018141.

☆☆ This work is part of the second author's Ph.D. dissertation.
