DOI: 10.1145/2939672.2939842
research-article

A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm

Published: 13 August 2016

Abstract

Information in various applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In the Big Data era, the lengths and sizes of these sequences are growing explosively, posing grand challenges for a classical NP-hard problem: searching for the Multiple Longest Common Subsequences (MLCS) of multiple sequences. In this paper, we first show that the state-of-the-art MLCS algorithms cannot be applied to long, large-scale sequence alignments. To overcome their defects and handle longer and larger-scale (or even big) sequence alignments, we present a novel parallel MLCS algorithm based on a proposed problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, and subsection calculation and serialization. Exhaustive experiments on datasets of both synthetic and real-world biological sequences demonstrate that both the time and space of the proposed algorithm are only linear in the number of dominants from the aligned sequences, and that the proposed algorithm significantly outperforms the state-of-the-art MLCS algorithms, making it applicable to longer and larger-scale sequence alignments.
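
To make the problem statement concrete, the following is a minimal sketch of computing the MLCS length by searching a graph of match points. It is not the algorithm proposed in the paper: it is a plain memoized search, and the names `mlcs_length`, `successor`, and `longest_from` are hypothetical, chosen only for illustration. A match point is a tuple of positions, one per sequence, that all hold the same character; the longest path from the virtual start point gives the MLCS length.

```python
from functools import lru_cache


def mlcs_length(seqs):
    """Return the length of a longest common subsequence of all strings in seqs.

    Illustrative baseline only (assumes seqs is a non-empty list of short
    strings); in the worst case the number of match points grows with the
    product of the sequence lengths.
    """
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = set(seqs[0]).intersection(*map(set, seqs[1:]))

    def successor(point, ch):
        """Next match point for character ch strictly after `point`, or None."""
        nxt = []
        for s, p in zip(seqs, point):
            q = s.find(ch, p + 1)
            if q == -1:
                return None  # ch does not occur again in this sequence
            nxt.append(q)
        return tuple(nxt)

    @lru_cache(maxsize=None)
    def longest_from(point):
        """Longest path (in matched characters) starting after `point`."""
        best = 0
        for ch in alphabet:
            nxt = successor(point, ch)
            if nxt is not None:
                best = max(best, 1 + longest_from(nxt))
        return best

    # Virtual start point placed before the first character of every sequence.
    return longest_from(tuple(-1 for _ in seqs))
```

For example, `mlcs_length(["ABC", "AC", "BAC"])` evaluates to 2, corresponding to the common subsequence "AC". The memoization cache stands in for the reuse of intermediate results mentioned in the abstract, but only loosely; the sketch is sequential and makes no attempt at the linear time and space behavior the abstract reports.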



Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. multiple longest common subsequences (mlcs)
  2. non-redundant common subsequence graph (ncsg)
  3. subsection calculation and serialization
  4. topological sorting

Qualifiers

  • Research-article

Conference

KDD '16

Acceptance Rates

KDD '16 Paper Acceptance Rate 66 of 1,115 submissions, 6%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


Cited By

  • (2025) A Novel Key Point Based MLCS Algorithm for Big Sequences Mining. IEEE Transactions on Knowledge and Data Engineering 37:1 (15-28). DOI: 10.1109/TKDE.2024.3485234. Online publication date: Jan-2025
  • (2023) A Space-Saving Based MLCS Algorithm. 2023 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) (0589-0594). DOI: 10.1109/DASC/PiCom/CBDCom/Cy59711.2023.10361492. Online publication date: 14-Nov-2023
  • (2020) A Heuristic Approach for Finding Similarity Indexes of Multivariate Data Sets. IEEE Access 8 (21759-21769). DOI: 10.1109/ACCESS.2020.2968222. Online publication date: 2020
  • (2019) LCSS-Based Algorithm for Computing Multivariate Data Set Similarity: A Case Study of Real-Time WSN Data. Sensors 19:1 (166). DOI: 10.3390/s19010166. Online publication date: 4-Jan-2019
  • (2019) Spell: Online Streaming Parsing of Large Unstructured System Logs. IEEE Transactions on Knowledge and Data Engineering 31:11 (2213-2227). DOI: 10.1109/TKDE.2018.2875442. Online publication date: 1-Nov-2019
  • (2019) Chemical reaction optimization for solving longest common subsequence problem for multiple string. Soft Computing - A Fusion of Foundations, Methodologies and Applications 23:14 (5485-5509). DOI: 10.1007/s00500-018-3200-3. Online publication date: 1-Jul-2019
  • (2018) Signature based trouble ticket classification. Future Generation Computer Systems 78:P1 (41-58). DOI: 10.1016/j.future.2017.07.054. Online publication date: 1-Jan-2018
  • (2017) LCS algorithm with vector-markers. 2017 Computer Science and Information Technologies (CSIT) (92-96). DOI: 10.1109/CSITechnol.2017.8312148. Online publication date: Sep-2017
  • (2016) Spell: Streaming Parsing of System Event Logs. 2016 IEEE 16th International Conference on Data Mining (ICDM) (859-864). DOI: 10.1109/ICDM.2016.0103. Online publication date: Dec-2016
  • (n.d.) COVID-19 Evolves in Human Hosts. SSRN Electronic Journal. DOI: 10.2139/ssrn.3562070
