Database indexing for large DNA and protein sequence collections

Hunt, Ela; Atkinson, Malcolm P.; Irving, Robert W.

doi:10.1007/s007780200064

Special issue VLDB Best papers 2001
Published: November 2002

Volume 11, pages 256–271 (2002)

The VLDB Journal

Abstract.

Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200 Mb of protein and 300 Mbp of DNA, whose disk-image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications.
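The idea that an index can cut down the dynamic programming work may be easier to see in a small sketch. The code below is not the authors' construction method or their search algorithm: it assumes a naive, depth-limited suffix trie held in memory (the names `build_suffix_trie` and `approx_search` are hypothetical), and it uses plain edit distance with an error bound `k`. Each trie edge extends the dynamic programming matrix by one column, shared prefixes are evaluated only once, and a branch is abandoned as soon as every entry in its column exceeds `k`.

```python
from typing import Dict, List, Set, Tuple


class Node:
    """One node of a naive suffix trie (illustrative, not a compact suffix tree)."""
    __slots__ = ("children", "starts")

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        # Start positions of all suffixes whose prefix spells the path to this node.
        self.starts: List[int] = []


def build_suffix_trie(text: str, depth: int) -> Node:
    """Insert the first `depth` characters of every suffix of `text`."""
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:i + depth]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(i)
    return root


def approx_search(root: Node, pattern: str, k: int) -> Tuple[Set[int], int]:
    """Return (start positions, number of DP cells evaluated).

    A position i is reported when some indexed prefix of text[i:] is within
    edit distance k of `pattern`.
    """
    m = len(pattern)
    hits: Set[int] = set()
    cells = 0
    # Column for the empty text prefix: matching pattern[:i] costs i deletions.
    stack = [(root, list(range(m + 1)))]
    while stack:
        node, col = stack.pop()
        for ch, child in node.children.items():
            new = [col[0] + 1]
            for i in range(1, m + 1):
                cost = 0 if pattern[i - 1] == ch else 1
                new.append(min(new[i - 1] + 1, col[i] + 1, col[i - 1] + cost))
            cells += m
            if new[m] <= k:
                hits.update(child.starts)
            # Column minima never decrease along a path, so a column whose
            # minimum exceeds k cannot lead to a match further down.
            if min(new) <= k:
                stack.append((child, new))
    return hits, cells


text = "ACGTACGTTAGC"
pattern = "ACGA"
k = 1
root = build_suffix_trie(text, depth=len(pattern) + k)
hits, cells = approx_search(root, pattern, k)
print(sorted(hits), cells, "cells vs", len(pattern) * len(text), "for one full matrix")
```

On a toy input like this the saving is negligible or absent; the abstract's claim concerns large indexes, where many suffixes share prefixes and most branches are pruned early. A trie of this kind also uses far more space than a suffix tree, so the sketch only illustrates the pruning principle.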

Author information

Authors and Affiliations

Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK; e-mail: {ela,mpa,rwi}@dcs.gla.ac.uk
Ela Hunt, Malcolm P. Atkinson & Robert W. Irving

Additional information

Received: November 1, 2001 / Accepted: March 2, 2002 / Published online: September 25, 2002

About this article

Cite this article

Hunt, E., Atkinson, M. & Irving, R. Database indexing for large DNA and protein sequence collections. The VLDB Journal 11, 256–271 (2002). https://doi.org/10.1007/s007780200064

Issue Date: November 2002
DOI: https://doi.org/10.1007/s007780200064

Key words: Database index – Suffix tree – Approximate matching – Biological sequence
