ATDD: An Algorithmic Tool for Domain Discovery in Protein Sequences

Angelov, Stanislav; Khanna, Sanjeev; Li, Li; Pereira, Fernando

doi:10.1007/978-3-540-30219-3_18

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3240)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

Identifying sequence domains is essential for understanding protein function. Most current methods for protein domain identification rely on prior knowledge of homologous domains and on the construction of high-quality multiple sequence alignments. With the rapid accumulation of enormous amounts of data from genome sequencing, it is important to be able to determine domain regions automatically from a set of proteins based solely on sequence information.

We describe a new algorithm for automatic protein domain detection that does not require multiple sequence alignment and differs from alignment-based methods by allowing arbitrary rearrangements (both in relative ordering and distance) of the domains within the set of proteins under study. Moreover, our algorithm extracts domains by simply performing a comparative analysis of a given set of sequences, and no auxiliary information is required. The method views protein sequences as collections of overlapping fixed-length blocks. A pair of blocks within a sequence gets a “vote of confidence” to be part of a domain if several other sequences have similar pairs of blocks at roughly the same distance from each other. Candidate domains are then identified by discovering regions in each protein sequence where most block pairs get strong votes of confidence. We applied our method to several test data sets with a fixed choice of parameters. To evaluate the results, we computed sensitivity and specificity measures using SMART-derived domain annotations as a reference.
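
The block-pair voting idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the similarity test (fraction of identical residues), the parameter names (`k`, `max_dist`, `tol`, `min_identity`, `min_votes`), and the rule for turning votes into candidate positions are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from itertools import combinations

def blocks(seq, k):
    """All overlapping length-k blocks of seq, as (start, block) pairs."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def similar(a, b, min_identity):
    """Assumed stand-in similarity test: fraction of identical positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a) >= min_identity

def has_similar_pair(seq, k, b1, b2, dist, tol, min_identity):
    """True if seq holds blocks similar to b1 and b2 whose start
    positions differ by dist, give or take tol."""
    blks = blocks(seq, k)
    for i, x in blks:
        if not similar(x, b1, min_identity):
            continue
        for j, y in blks:
            if abs((j - i) - dist) <= tol and similar(y, b2, min_identity):
                return True
    return False

def pair_votes(seqs, target, k=3, max_dist=10, tol=1, min_identity=1.0):
    """For each block pair (i, j) in seqs[target], count how many of the
    other sequences contain a similar pair at roughly the same distance."""
    others = [s for n, s in enumerate(seqs) if n != target]
    votes = {}
    for (i, b1), (j, b2) in combinations(blocks(seqs[target], k), 2):
        if j - i > max_dist:
            continue
        votes[(i, j)] = sum(
            has_similar_pair(s, k, b1, b2, j - i, tol, min_identity)
            for s in others
        )
    return votes

def candidate_positions(votes, length, k, min_votes):
    """Crude simplification: mark every residue spanned by a block pair
    that received at least min_votes votes."""
    covered = [False] * length
    for (i, j), v in votes.items():
        if v >= min_votes:
            for p in range(i, j + k):
                covered[p] = True
    return covered

# Toy input: three made-up sequences sharing the segment "WCHKM"
# at different offsets.
seqs = ["AAAAWCHKMAAAA", "GGWCHKMGGGGGG", "TTTTTTWCHKMTT"]
votes = pair_votes(seqs, target=0)
covered = candidate_positions(votes, len(seqs[0]), k=3, min_votes=2)
region = "".join(c for c, keep in zip(seqs[0], covered) if keep)
print(region)
```

On this toy input only the shared segment is marked, even though it sits at a different offset in each sequence, which is the property the abstract contrasts with alignment-based methods. A realistic version would need a biologically meaningful block similarity measure and a rule closer to "regions where most block pairs get strong votes", both of which are left open here.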

Author information

Authors and Affiliations

Department of Computer and Information Science, School of Engineering, University of Pennsylvania, Philadelphia, PA, 19104, USA
Stanislav Angelov, Sanjeev Khanna & Fernando Pereira
Department of Biology, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, 19104, USA
Li Li

Editor information

Editors and Affiliations

Department of Informatics and Computational Biology Unit, HIB, University of Bergen, 5020, Bergen, Norway
Inge Jonassen
Department of Biology, Penn Center for Bioinformatics, Penn Genomics Institute, 415 S. University Ave., Philadelphia, PA 19104, USA
Junhyong Kim

About this paper

Cite this paper

Angelov, S., Khanna, S., Li, L., Pereira, F. (2004). ATDD: An Algorithmic Tool for Domain Discovery in Protein Sequences. In: Jonassen, I., Kim, J. (eds) Algorithms in Bioinformatics. WABI 2004. Lecture Notes in Computer Science, vol 3240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30219-3_18

DOI: https://doi.org/10.1007/978-3-540-30219-3_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23018-2
Online ISBN: 978-3-540-30219-3
eBook Packages: Springer Book Archive
