The multiple sequence sets: problem and heuristic algorithms

Ning, Kang; Leong, Hon Wai

doi:10.1007/s10878-010-9329-3

Published: 08 May 2010

Volume 22, pages 778–796, (2011)

Journal of Combinatorial Optimization

Kang Ning¹ &
Hon Wai Leong²

Abstract

“Sequence set” is a mathematical model used in many applications, such as biological sequence analysis and text processing. However, the “single” sequence set model is not appropriate for rapidly increasing problem sizes. For example, very large genome sequences should be separated and processed chunk by chunk. For these applications, the underlying mathematical model is “Multiple Sequence Sets” (MSS). To process multiple sequence sets, sequences are distributed to different sets, and the sequences in each set are then processed in parallel. Deriving effective algorithms for MSS processing is challenging.

In this paper, we first define the cost functions for the problem of Process of Multiple Sequence Sets (PMSS). The PMSS problem is then formulated as minimizing the total cost of processing. Based on an analysis of the features of multiple sequence sets, we propose the Distribution and Deposition (DDA) algorithm and the DDA^* algorithm for the PMSS problem. In the DDA algorithm, the sequences are first distributed to multiple sets according to their alphabet contents; the sequences in each set are then processed by a deposition algorithm. The DDA^* algorithm differs from the DDA algorithm in that it distributes sequences by clustering based on a set of sequence features. Experiments showed that DDA and DDA^* always produced smaller results than the other algorithms compared, and that DDA^* outperformed DDA in most instances. Both algorithms were also efficient in time and space.
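
The two-stage structure described above (distribute, then deposit) can be illustrated with a minimal sketch. The abstract does not give the actual distribution rule, deposition algorithm, or cost functions, so everything concrete here is an assumption made for illustration: sequences are assigned to the set of their most frequent character, the deposition step is replaced by the well-known greedy majority-merge heuristic for a common supersequence, and the total cost is taken to be the summed length of the per-set supersequences. The function names `distribute_by_alphabet`, `majority_merge`, and `dda_sketch` are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def distribute_by_alphabet(seqs, alphabet="ACGT"):
    """Assumed distribution rule: group each sequence by its most frequent character."""
    sets = defaultdict(list)
    for s in seqs:
        counts = Counter(s)
        # max() returns the first maximal item, so ties follow alphabet order.
        dominant = max(alphabet, key=lambda c: counts[c])
        sets[dominant].append(s)
    return dict(sets)


def majority_merge(seqs):
    """Stand-in deposition step: greedy majority-merge common supersequence."""
    remaining = [s for s in seqs if s]
    out = []
    while remaining:
        heads = Counter(s[0] for s in remaining)
        # Emit the most common leading character; ties broken by character order.
        ch = min(heads, key=lambda c: (-heads[c], c))
        out.append(ch)
        # Consume that character from every sequence that starts with it.
        remaining = [s[1:] if s[0] == ch else s for s in remaining]
        remaining = [s for s in remaining if s]
    return "".join(out)


def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(c in it for c in s)


def dda_sketch(seqs):
    """Distribute sequences, deposit each set, and sum the assumed per-set costs."""
    sets = distribute_by_alphabet(seqs)
    depositions = {key: majority_merge(group) for key, group in sets.items()}
    # Assumed cost: total length of the per-set supersequences.
    total_cost = sum(len(d) for d in depositions.values())
    return sets, depositions, total_cost


if __name__ == "__main__":
    sets, depositions, cost = dda_sketch(["AAC", "AAG", "TTG", "TTC"])
    print(sets)
    print(depositions)
    print(cost)
```

In this sketch each set could be handed to a separate worker, since the deposition of one set does not depend on any other. A DDA^*-style variant would replace `distribute_by_alphabet` with a clustering step over sequence features; which features and which clustering method the authors use is not stated in the abstract.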

Author information

Authors and Affiliations

Department of Pathology, University of Michigan, Ann Arbor, MI, 48109, USA
Kang Ning
Department of Computer Science, National University of Singapore, Singapore, 117590, Singapore
Hon Wai Leong

Corresponding author

Correspondence to Kang Ning.

About this article

Cite this article

Ning, K., Leong, H.W. The multiple sequence sets: problem and heuristic algorithms. J Comb Optim 22, 778–796 (2011). https://doi.org/10.1007/s10878-010-9329-3

Issue Date: November 2011
DOI: https://doi.org/10.1007/s10878-010-9329-3
