Abstract
Detecting repeated portions of strings has important applications to many areas of study including data compression and computational biology. This paper defines and presents a solution for the Most Contributory Substring Problem, which identifies the single substring that represents the largest proportion of the characters within a set of strings. We show that a solution to the problem can be achieved with an O(n) running time (where n is the total number of characters in all of the input strings) when overlapping occurrences of the most contributory substring are permitted. Furthermore, we present an extended algorithm that does not permit occurrences of the most contributory substring to overlap. The expected running time of the extended algorithm is O(n logn) while its worst case performance is O(n 2).
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Apostolico, A., Szpankowski, W.: Self-alignments in words and their applications. J. Algorithms 13(3), 446–467 (1992)
Devroye, L., Szpankowski, W., Rais, B.: A note on the height of suffix trees. SIAM J. Comput. 21(1), 48–53 (1992)
Farach, M.: Optimal suffix tree construction with large alphabets. In: FOCS 1997. Proceedings of the 38th Annual Symposium on Foundations of Computer Science, Washington, DC, USA, 1997, p. 137. IEEE Computer Society Press, Los Alamitos (1997)
Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA (1997)
Jacquet, P., Szpankowski, W.: Autocorrelation on words and its applications: Analysis of suffix trees by string-ruler approach. JCTA: Journal of Combinatorial Theory, Series A, 66 (1994)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Ukkonen, E.: Constructing suffix trees on-line in linear time. In: Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture - Information Processing 1992, vol. 1, pp. 484–492. North-Holland, Amsterdam (1992)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th IEEE Symp on Switching and Automata Theory, pp. 1–11. IEEE Computer Society Press, Los Alamitos (1973)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stephenson, B. (2007). An Efficient Algorithm for Identifying the Most Contributory Substring. In: Song, I.Y., Eder, J., Nguyen, T.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2007. Lecture Notes in Computer Science, vol 4654. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74553-2_25
Download citation
DOI: https://doi.org/10.1007/978-3-540-74553-2_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74552-5
Online ISBN: 978-3-540-74553-2
eBook Packages: Computer ScienceComputer Science (R0)