Elsevier

Computers & Chemistry

Volume 17, Issue 2, June 1993, Pages 149-163
Statistics of local complexity in amino acid sequences and sequence databases

https://doi.org/10.1016/0097-8485(93)85006-X

Abstract

Protein sequences contain surprisingly many local regions of low compositional complexity. These include different types of residue clusters, some of which contain homopolymers, short period repeats or aperiodic mosaics of a few residue types. Several different formal definitions of local complexity and probability are presented here and are compared for their utility in algorithms for localization of such regions in amino acid sequences and sequence databases. The definitions are: (1) those derived from enumeration a priori by a treatment analogous to statistical mechanics, (2) a log likelihood definition of complexity analogous to informational entropy, (3) multinomial probabilities of observed compositions, (4) an approximation resembling the χ² statistic and (5) a modification of the coefficient of divergence. These measures, together with a method based on similarity scores of self-aligned sequences at different offsets, are shown to be broadly similar for first-pass, approximate localization of low-complexity regions in protein sequences, but they give significantly different results when applied in optimal segmentation algorithms. These comparisons underpin the choice of robust optimization heuristics in an algorithm, SEG, designed to segment amino acid sequences fully automatically into subsequences of contrasting complexity. After the abundant low-complexity segments have been partitioned from the Swissprot database, the remaining high-complexity sequence set is adequately approximated by a first-order random model.
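
As a rough illustration of the entropy-like measure in definition (2) and of a first-pass localization scan, the sketch below scores each fixed-length window of a sequence by the Shannon entropy of its residue composition and reports the windows that fall at or below a cutoff. It is a minimal sketch under stated assumptions and not the SEG algorithm itself: the function names, the window length of 12 and the cutoff of 2.2 bits are illustrative choices that do not come from the abstract.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy, in bits, of the residue composition of one window."""
    counts = Counter(window)
    n = len(window)
    # Sum p * log2(p) over the residue types present, then negate.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2) -> list[int]:
    """Start indices of windows whose entropy is at or below the threshold.

    `window` and `threshold` are hypothetical defaults chosen for illustration.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if window_entropy(seq[i:i + window]) <= threshold
    ]
```

A homopolymer window scores zero bits, while a window of twelve distinct residues scores log2(12) bits, so a scan of this kind separates the two extremes cleanly. Choosing segment boundaries optimally, which is where the abstract reports that the measures diverge, would need a further step that this sketch does not attempt.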


The preliminary version of this work was presented during the Second International Workshop on Open Problems of Computational Molecular Biology, Telluride Summer Research Center, Telluride, Colo., 19 July–2 August 1992.
