Power analysis of database search using multiple scoring matrices

https://doi.org/10.1016/j.csda.2006.06.009

Abstract

Protein sequence alignment may be viewed as either a classification or a multiple hypothesis testing problem. Whereas the type one error of a method is often studied for randomly generated sequences, the power is best investigated based on real protein sequences. The SCOP database and its protein classification are used to investigate both the power and the type one error of sequence alignment as provided by BLAST. The focus is on the multiple testing case when more than one scoring matrix is used. It is demonstrated that a multiple testing correction needs to be applied in order to control the number of false positives when using more than one scoring matrix. It is also shown that a proper search procedure based on multiple scoring matrices detects slightly fewer homologous sequences present in the SCOP database than the matrix BLOSUM62 alone, while giving the opportunity of detecting a wider variety of homologous types.

Introduction

Due to mutations as well as insertions and deletions, the genetic material of organisms changes substantially over long periods of time. The longer the time span from a past common ancestor, the more the genome (and therefore also the proteins) of two organisms will differ. Local sequence alignment provides a method to check whether two sequences are homologous, i.e. share a common ancestor sequence back in time. The idea is to look for the best matching substrings in each sequence. Focusing on substrings makes sense, since parts of genes that code for functionally essential pieces of a protein usually change more slowly than other parts. Such changes often lead to a reduced functionality of the protein and are therefore penalized by evolution.

The quality of a match is evaluated by using a scoring scheme that penalizes both mutations (different letters at the same position in the two sequences) and the insertion of gaps into either sequence. While simple scoring schemes are usually considered sufficient when aligning two DNA sequences, more sophisticated scoring schemes are useful when comparing two protein sequences. The reason is that the different chemical properties of the twenty amino acids occurring in a protein make some amino acid substitutions more likely than others. The penalties for different substitutions within the protein sequence are encoded in scoring matrices. Current software usually uses two families of matrices. The PAM family of matrices proposed by Dayhoff et al. (1979) is based on a Markov chain transition model. The BLOSUM matrices, on the other hand, have been heuristically derived by clustering and aligning sequences at various degrees of relatedness (see Henikoff and Henikoff, 1992). As discussed for instance in Reese and Pearson (2002), different substitution matrices usually require different gap penalties.

Smith and Waterman (1981) proposed a dynamic programming algorithm for obtaining the best matching subsequences with respect to such a scoring scheme. Alignment software like BLAST (Altschul et al., 1997), which combines the Smith–Waterman algorithm with search heuristics, is frequently used to compare two sequences, often in the context of a database search looking for all sequences that are related to a specific query DNA or protein sequence. Correctly identified homologous sequences in the database may then provide evolutionary, structural or functional information about the query sequence.
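To make the dynamic programming idea concrete, the following is a minimal sketch of the Smith–Waterman recursion that returns only the best local alignment score. The scoring values (match, mismatch and a linear gap penalty) are toy assumptions chosen for illustration; they are not a PAM or BLOSUM matrix, and this is neither the affine-gap scheme nor the heuristic search used by BLAST.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Toy scoring scheme (an assumption for illustration): a fixed
    match/mismatch score and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Resetting negative cells to zero is what makes the alignment local: a poorly matching prefix never drags down the score of a well-matching substring that follows it.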

From the statistical point of view, the alignment score provided by BLAST may be used as a basis for either hypothesis testing or classification. The null hypothesis is that the sequences are not homologous, whereas under the alternative the sequences share a common ancestor. Obviously, both high sensitivity and high specificity are desirable. For details on BLAST, and in particular distributional approximations for the score statistics under the null hypothesis, see for instance Chapters 6 and 9 of Ewens and Grant (2001).

To increase the power of the search for homologous proteins, biologists routinely use several scoring matrices. In previous papers, we have addressed the multiple testing problem that arises when sequence alignment is performed using several scoring matrices (Frommlet et al., 2004, Frommlet and Futschik, 2004) and we proposed methods to correct the obtained p-values in order to control the family-wise error. In both papers, we focused on the distribution of scores under the null hypothesis, and our analysis was based on randomly generated pairs of sequences. The purpose of the current paper is to investigate the sensitivity and the specificity based on real biological data, when several scoring matrices are used for classifying pairs of amino acid sequences.
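Why a correction is needed can be illustrated with a generic stand-in. The sketch below uses the Bonferroni bound, which is not the correction proposed in the cited papers (their method is not described in this excerpt). The conversion from an E-value to a p-value assumes the usual Poisson approximation for the number of chance hits, and the function names are hypothetical.

```python
import math


def evalue_to_pvalue(e):
    """P(at least one chance hit), assuming the number of chance hits
    is Poisson distributed with mean e: p = 1 - exp(-e)."""
    return -math.expm1(-e)


def bonferroni_min_p(pvalues):
    """Bonferroni-adjusted p-value for the smallest of k p-values,
    one p-value per scoring matrix. A generic stand-in, NOT the
    correction of the cited papers."""
    k = len(pvalues)
    return min(1.0, k * min(pvalues))


if __name__ == "__main__":
    # Hypothetical E-values for one sequence pair under three matrices.
    evalues = [0.01, 0.3, 1.2]
    pvalues = [evalue_to_pvalue(e) for e in evalues]
    print(min(pvalues), bonferroni_min_p(pvalues))
```

Reporting the uncorrected minimum over several matrices inflates the false positive rate, since each matrix offers another chance to fall below the threshold. Bonferroni controls the family-wise error but is likely to be conservative here, because scores of the same sequence pair under different matrices are positively correlated.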

To gain an understanding of the sensitivity of an alignment procedure, it is of course important to consider specific alternative (homology) hypotheses. This is a somewhat delicate issue, because specifying the alternative is by no means an easy task. There are several different protein structure databases available, all structured according to some mixture of subjective and objective classification methods. These databases vary with respect to their accuracy, and it can be expected that the power of any specific classification procedure will depend to some extent on the closeness of this procedure to the method used to position sequences within the database. For a recent review on the problems of assigning structural domains in proteins we refer to Veretnik et al. (2004).

For our purpose, the SCOP database developed by Murzin et al. (1995) seems to be particularly suitable. One reason is that the SCOP database is believed to be particularly reliable; another is that the database design based on protein structure similarity provides an easily tractable criterion to decide whether two sequences are related. For a recent description of the SCOP database see Hubbard et al. (1999) and Andreeva et al. (2004). For purposes similar to ours, Abagyan and Batalov (1997) used the database to compare eight different substitution matrices and eight sets of gap penalties with respect to their ability to detect structural similarity. However, they did not apply local alignment, but global alignment with zero end gap penalties. They coined the term “twilight zone” for the most challenging situations where the detection of homology actually depends on the chosen matrix. Rost (1999) elaborates on the importance of this twilight zone for practical purposes. Another paper on the use of SCOP to compare different alignment procedures is by Brenner et al. (1998). They used ‘coverage versus error’ plots to evaluate different methods. Several other approaches to combining false positives and true positives into informative plots have also been proposed. Chen (2003) gives a concise overview of these, and suggests the average precision criterion as an alternative. The classical form of presentation is of course the ROC curve (Gribskov and Robinson, 1996).
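The evaluation criteria mentioned above can be sketched for a single ranked hit list. The code assumes the hits are already sorted from best to worst score and labelled as true or false homologs according to a benchmark. The average precision below follows the standard information-retrieval definition, which may differ in detail from the variant in Chen (2003); the function names are hypothetical.

```python
def average_precision(labels_ranked):
    """Mean of the precision values at each rank holding a true homolog.

    labels_ranked: booleans ordered from best to worst score,
    True meaning the hit is a true homolog in the benchmark.
    """
    hits = 0
    precisions = []
    for rank, is_true in enumerate(labels_ranked, start=1):
        if is_true:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def roc_points(labels_ranked):
    """(false positive rate, true positive rate) after each hit."""
    pos = sum(labels_ranked)
    neg = len(labels_ranked) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_true in labels_ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg if neg else 0.0, tp / pos if pos else 0.0))
    return points


if __name__ == "__main__":
    ranked = [True, True, False, True, False]  # made-up labels
    print(average_precision(ranked))
    print(roc_points(ranked))
```

Both summaries depend only on the ordering of the hits, so they can compare scoring matrices without fixing a particular E-value threshold.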

In the SCOP database, proteins are classified at four different levels called class, fold, superfamily and family. According to the work of Lindahl and Elofsson (2000), sequence-based algorithms are not really capable of detecting structural relations of proteins at the fold level. For that reason, we assume sequences to be related if they belong to the same family or superfamily, and restrict our attention to the investigation of the power when using multiple scoring matrices to detect related proteins at the family level and at the superfamily level of the SCOP database.


Methods

To assess the power of classification based on different schemes of multiple scoring matrices we will count the number of pairs of sequences correctly grouped in the same class with respect to a given benchmark. For our analysis, we used the SCOP-based benchmark PDB40D-B, from which we removed all sequences containing the letter ‘X’, indicating that the sequence is not completely known. This way we obtained a set of 5035 well defined sequences. We aligned all pairs of these sequences using

Results

Our choice of all possible pairs out of the set of all reliable sequences in the SCOP database led to 14 048 pairs whose members belong to the same protein family, and 39 210 pairs whose members belong to the same superfamily. Thus, 25 162 pairs belonged to the same superfamily but to different families.
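The bookkeeping behind these counts can be sketched as follows, assuming unordered pairs and a list with one (superfamily, family) label per sequence. The labels in the example are made up, and the function name is hypothetical; the final line only re-checks the subtraction reported above.

```python
from collections import Counter
from math import comb


def count_related_pairs(classification):
    """Count unordered sequence pairs sharing a family or a superfamily.

    classification: list of (superfamily_id, family_id) tuples, one per
    sequence; families are assumed to be nested within superfamilies.
    Returns (same_family, same_superfamily, superfamily_only).
    """
    family_sizes = Counter(classification)
    superfamily_sizes = Counter(sf for sf, _ in classification)
    same_family = sum(comb(n, 2) for n in family_sizes.values())
    same_superfamily = sum(comb(n, 2) for n in superfamily_sizes.values())
    return same_family, same_superfamily, same_superfamily - same_family


if __name__ == "__main__":
    # Made-up labels: two families in one superfamily, plus a singleton.
    toy = (
        [("a.1", "a.1.1")] * 3
        + [("a.1", "a.1.2")] * 2
        + [("b.1", "b.1.1")]
    )
    print(count_related_pairs(toy))
    print(39210 - 14048)  # pairs in the same superfamily but different families
```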

To illustrate the sensitivity of BLAST alignment with respect to the choice of the scoring matrix, we provide a table splitting the detections according to the number of matrices leading to E-values below the

Discussion

Our results indicate that the correction for multiple scoring matrices as proposed by Frommlet and Futschik (2004) works very well in terms of the type one error when applied to the protein sequences of the SCOP database. In view of our findings, a correction for multiplicity is clearly necessary when using more than one scoring matrix, if control of the type one error is desired.

The increase in power on the other hand when using more than one scoring matrix is compensated by the multiplicity

References (22)

  • Z. Chen (2003). Assessing sequence comparison methods with the average precision criterion. Bioinformatics.