Discriminating membrane proteins using the joint distribution of length sums of success and failure runs

Bersimis, Sotirios; Sachlas, Athanasios; Bagos, Pantelis G.

doi:10.1007/s10260-016-0370-y

Original Paper
Published: 01 October 2016

Volume 26, pages 251–272 (2017)

Statistical Methods & Applications

Sotirios Bersimis¹,
Athanasios Sachlas¹ &
Pantelis G. Bagos²

Abstract

Discriminating integral membrane proteins from water-soluble ones has been an important goal of computational molecular biology over the past decades. A major drawback of the methods that have appeared in the literature is that most authors tried to solve the problem using machine learning techniques. Specifically, most of the proposed methods require an appropriate dataset for training, and consequently the results depend heavily on the suitability of the dataset itself. Motivated by these facts, in this paper we develop a formal discrimination procedure based on appropriate theoretical observations on the sequence of hydrophobic and polar residues along the protein sequence and on the exact distribution of a two-dimensional runs-related statistic defined on the same sequence. Specifically, to set up our discrimination procedure, we thoroughly study the exact distribution of a bivariate random variable that accumulates the exact lengths of both success and failure runs of at least a specific length in a sequence of Bernoulli trials. To investigate the properties of this bivariate random variable, we use the Markov chain embedding technique. Finally, we apply the new procedure to a well-defined dataset of proteins.
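
The bivariate statistic described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the separate thresholds `k1` (success runs) and `k2` (failure runs), the coding of successes as 1 and failures as 0, and the assumption of independent, identically distributed Bernoulli trials with success probability `p` are all choices made here for illustration. The first function computes the pair (sum of lengths of success runs of length at least `k1`, sum of lengths of failure runs of length at least `k2`) for an observed binary sequence. The second obtains the exact joint distribution by a forward recursion over a finite state space, in the spirit of Markov chain embedding.

```python
from collections import defaultdict
from itertools import groupby


def run_length_sums(seq, k1, k2):
    """Return (S, F) for a 0/1 sequence: S is the total length of success
    runs (1s) of length >= k1, F the total length of failure runs (0s) of
    length >= k2. The thresholds k1, k2 >= 1 are illustrative parameters."""
    s_total = f_total = 0
    for symbol, group in groupby(seq):
        length = sum(1 for _ in group)
        if symbol == 1 and length >= k1:
            s_total += length
        elif symbol == 0 and length >= k2:
            f_total += length
    return s_total, f_total


def joint_distribution(n, p, k1, k2):
    """Exact joint pmf of (S, F) after n i.i.d. Bernoulli(p) trials.

    State: (S so far, F so far, last symbol, current run length capped at
    the threshold of that symbol). A run contributes its threshold k when
    it first reaches length k, and 1 for every further trial."""
    states = {(0, 0, None, 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (s, f, last, run), prob in states.items():
            for symbol, q in ((1, p), (0, 1.0 - p)):
                if q == 0.0:
                    continue
                k = k1 if symbol == 1 else k2
                new_run = run + 1 if symbol == last else 1
                if new_run == k:
                    gain = k      # run just became long enough: count all of it
                elif new_run > k:
                    gain = 1      # run was already counted: add the new trial
                else:
                    gain = 0      # run still too short
                new_s = s + gain if symbol == 1 else s
                new_f = f + gain if symbol == 0 else f
                nxt[(new_s, new_f, symbol, min(new_run, k))] += prob * q
        states = nxt
    # Marginalise over the auxiliary components (last symbol, run length).
    pmf = defaultdict(float)
    for (s, f, _, _), prob in states.items():
        pmf[(s, f)] += prob
    return dict(pmf)


if __name__ == "__main__":
    # Runs of 1s have lengths 3, 1, 2; runs of 0s have lengths 1, 4.
    print(run_length_sums([1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1], k1=2, k2=3))  # (5, 4)
    pmf = joint_distribution(n=20, p=0.4, k1=3, k2=3)
    print(sum(pmf.values()))  # total probability, should be close to 1
```

Capping the stored run length at the threshold keeps the state space finite, which is what makes the exact recursion tractable. How residues are mapped to successes and failures (for example, hydrophobic versus polar) is outside the scope of this sketch.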

Author information

Authors and Affiliations

Department of Statistics and Insurance Science, University of Piraeus, 80 Karaoli and Dimitriou str., 185 34, Piraeus, Greece
Sotirios Bersimis & Athanasios Sachlas
Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2–4, Galaneika, 35100, Lamia, Greece
Pantelis G. Bagos

Corresponding author

Correspondence to Sotirios Bersimis.

About this article

Cite this article

Bersimis, S., Sachlas, A. & Bagos, P.G. Discriminating membrane proteins using the joint distribution of length sums of success and failure runs. Stat Methods Appl 26, 251–272 (2017). https://doi.org/10.1007/s10260-016-0370-y

Accepted: 16 September 2016
Published: 01 October 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s10260-016-0370-y
