A major improvement to the Network Algorithm for Fisher's Exact Test in 2×c contingency tables

https://doi.org/10.1016/j.csda.2005.09.004

Abstract

Based on the Network Algorithm proposed by Mehta and Patel for Fisher's Exact Test on 2×c contingency tables, the relations between maximum subpath lengths are studied. A recurrence relation between maximum subpath lengths is obtained and an ordering of the maximum path lengths is established. Based on these results, some modifications to the Network Algorithm for 2×c tables are proposed. These modifications produce a drastic reduction in computation time, which in some cases exceeds 99.5% compared to StatXact-5. Moreover, for purely practical purposes, a grouping of the subpath lengths of the Network Algorithm into intervals is proposed. This grouping yields the p-value to a limited number of exact figures, which is more than sufficient in practice, while drastically reducing the memory required and further reducing the computation time. The proposed modifications are valid for any 2×c contingency table and are compatible with other improvements already proposed for the Network Algorithm, especially the Hybrid Algorithm of Mehta and Patel.

Introduction

Let $X_o$ be an observed 2×c contingency table, with fixed column sums $C_1,\dots,C_c$ and fixed row sums $R_1,R_2$, and $N=\sum_i R_i=\sum_j C_j$. Let us denote by $\mathcal{F}$ the set of all possible 2×c tables with the same marginal totals as $X_o$, by $X$ a 2×c table of $\mathcal{F}$ and by $(x_1,\dots,x_c)$ the first row of $X$. It is known that, under the hypothesis of row and column independence, the random vector $(x_1,\dots,x_c)$ has a Multivariate Hypergeometric distribution $MH(R_1;C_1,\dots,C_c)$ and

$$P(X)=\frac{1}{D}\prod_{j=1}^{c}\binom{C_j}{x_j},\quad X\in\mathcal{F},\quad D=\binom{N}{R_1}.$$
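As a minimal sketch of the probability formula above, the following Python function evaluates $P(X)$ from the first row of a table and its column sums. The function name `table_probability` and the example margins are illustrative assumptions, not part of the paper.

```python
from math import comb, prod

def table_probability(first_row, col_sums):
    """P(X) for a 2 x c table given its first row (x_1..x_c) and column sums (C_1..C_c)."""
    r1 = sum(first_row)      # R_1, the first row sum
    n = sum(col_sums)        # N, the grand total
    d = comb(n, r1)          # D = C(N, R_1)
    return prod(comb(c_j, x_j) for c_j, x_j in zip(col_sums, first_row)) / d

# Hypothetical example: column sums (3, 2, 4) and first row (1, 1, 2)
p = table_probability([1, 1, 2], [3, 2, 4])
print(p)  # 36 / 126, about 0.2857
```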

The best algorithm to carry out Fisher's Exact Test is the Network Algorithm (NA) proposed by Mehta and Patel (1980, 1983, 1986a) (see Hirji and Johnson, 1996). The NA for a 2×c table can be summed up as follows. A directed acyclic network of nodes and arcs is built. The nodes are structured in $c+1$ stages, labelled $c,c-1,\dots,1,0$. In any stage $k$ there is a set of nodes, each labelled by a pair of integer values $(k,R(k))$ with

$$\max\{0,\,R_1-N+N(k)\}\le R(k)\le \min\{N(k),\,R_1\},$$

where $N(k)=C_1+\dots+C_k$. In particular, in stage $c$ there is a single node $(c,R(c))$ with $R(c)=R_1$ (initial node) and in stage 0 there is also a single node $(0,R(0))$ with $R(0)=0$ (terminal node). The $R(k)$ values of the nodes of any other stage $k$ form a series of consecutive non-negative integers. The arcs emanate from each node of stage $k$ and are directed towards a node of stage $k-1$. Each node of stage $k-1$ which is joined by an arc with a node $(k,R(k))$ of stage $k$ is a daughter node (DN) of $(k,R(k))$, which is known as the mother node (MN). Each node of stage $k-1$ is always a DN of at least one MN of stage $k$. Each arc which reaches a node is connected to each one of the arcs which emanate from that same node. A path across the network is defined as a series of connected arcs which emanate from the initial node and reach the terminal node, passing through $c-1$ intermediate nodes. Each intermediate node of a path divides the path into two subpaths.
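The node bounds above can be sketched in a few lines of Python. This is only an illustration of the stage structure; the function name `stage_nodes` and the example margins are assumptions made here.

```python
def stage_nodes(col_sums, r1):
    """Return {k: list of admissible R(k) values} for stages k = 0, ..., c."""
    n = sum(col_sums)
    nodes = {}
    n_k = 0  # N(k) = C_1 + ... + C_k, with N(0) = 0
    for k in range(len(col_sums) + 1):
        if k > 0:
            n_k += col_sums[k - 1]
        lo = max(0, r1 - n + n_k)   # lower bound on R(k)
        hi = min(n_k, r1)           # upper bound on R(k)
        nodes[k] = list(range(lo, hi + 1))
    return nodes

# Hypothetical example: column sums (3, 2, 4), first row sum R_1 = 4
print(stage_nodes([3, 2, 4], 4))
# {0: [0], 1: [0, 1, 2, 3], 2: [0, 1, 2, 3, 4], 3: [4]}
```

Stage 3 and stage 0 each hold a single node, matching the initial and terminal nodes described in the text.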

The network is constructed in such a way that each subpath from the node $(k,R(k))$ to the terminal node corresponds to just one 2×k subtable with row sums $R(k),\,N(k)-R(k)$ and column sums $C_1,\dots,C_k$. More specifically, the subpath $(k,R(k))\to(k-1,R(k-1))\to\dots\to(0,0)$ corresponds to the 2×k subtable whose first row is $x_j=R(j)-R(j-1)$, $j=1,\dots,k$. From now on, we will represent each subpath by its corresponding vector $(x_1,\dots,x_k)$. For $k=c$ we will have a one-to-one correspondence between all of the paths from the initial node to the terminal node and the set $\mathcal{F}$.
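The correspondence between subpaths and first rows can be made concrete with a short sketch. The helper names `row_to_path` and `path_to_row` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate

def row_to_path(first_row):
    """Nodes (k, R(k)) visited from stage k = len(first_row) down to stage 0."""
    partial = [0, *accumulate(first_row)]   # partial[k] = R(k) = x_1 + ... + x_k
    return [(k, partial[k]) for k in range(len(first_row), -1, -1)]

def path_to_row(path):
    """Recover (x_1, ..., x_k) from a path, using x_j = R(j) - R(j-1)."""
    r = {k: r_k for k, r_k in path}
    return [r[j] - r[j - 1] for j in range(1, len(path))]

path = row_to_path([1, 1, 2])
print(path)               # [(3, 4), (2, 2), (1, 1), (0, 0)]
print(path_to_row(path))  # [1, 1, 2]
```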

The length of the arc which joins an MN $(k,R(k))$ with its DN $(k-1,R(k-1))$ is defined by

$$\ell(k,R(k),R(k-1))=\binom{C_k}{R(k)-R(k-1)},$$

and the length of a path (or subpath) is defined as the product of the lengths of its arcs. Therefore, the length of the path which corresponds to $X\in\mathcal{F}$ will be $D\,P(X)$.
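To illustrate the arc and path lengths, the sketch below walks every path of a small network and multiplies arc lengths along the way. It enumerates paths explicitly, which the NA itself avoids; it is meant only as a check of the definitions. The function names and example margins are assumptions.

```python
from math import comb

def arc_length(col_sums, k, r_k, r_prev):
    """Length of the arc from MN (k, r_k) to DN (k-1, r_prev): C(C_k, R(k) - R(k-1))."""
    return comb(col_sums[k - 1], r_k - r_prev)

def all_path_lengths(col_sums, r1):
    """Map each first row (x_1, ..., x_c) to the length of its path through the network."""
    def walk(k, r):
        # Yields (x_1..x_k, product of arc lengths) for each subpath from (k, r) to (0, 0)
        if k == 0:
            if r == 0:
                yield (), 1
            return
        for r_prev in range(max(0, r - col_sums[k - 1]), r + 1):
            arc = arc_length(col_sums, k, r, r_prev)
            for row, length in walk(k - 1, r_prev):
                yield row + (r - r_prev,), length * arc
    return dict(walk(len(col_sums), r1))

lengths = all_path_lengths([3, 2, 4], 4)
print(lengths[(1, 1, 2)])                      # 36, i.e. D * P(X) with D = 126
print(sum(lengths.values()) == comb(9, 4))     # True: the path lengths add up to D
```

The final check reflects the fact that the probabilities $P(X)$ sum to one over all tables with the given margins, so the path lengths sum to $D$.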

The p-value associated with the observed table $X_o$ is given by

$$p\text{-value}(X_o)=\sum_{X\in\mathcal{F}_o}P(X),$$

where $\mathcal{F}_o=\{X \mid X\in\mathcal{F},\ P(X)\le P(X_o)\}$. The NA calculates this p-value by identifying and adding up the lengths of all of the paths in the network which are not longer than $D\,P(X_o)$, but with no need to explicitly enumerate each path. The decision as to whether or not the paths of the network contribute to the p-value is made for sets of paths, at the nodes of the network. For each DN $(k-1,R(k-1))$ of the MN $(k,R(k))$ we must check if one of the following conditions holds:

$$\mathrm{PAST}\cdot\ell(k,R(k),R(k-1))\cdot LP(k-1,R(k-1))\le D\,P(X_o)\tag{2}$$

and

$$\mathrm{PAST}\cdot\ell(k,R(k),R(k-1))\cdot SP(k-1,R(k-1))>D\,P(X_o),\tag{3}$$

where $\ell(k,R(k),R(k-1))$ is the length of the arc joining the MN to the DN, $LP(k-1,R(k-1))$ and $SP(k-1,R(k-1))$ are the lengths of the longest and shortest subpath, respectively, from the DN $(k-1,R(k-1))$ to the terminal node, and PAST is the length of a subpath from the initial node to the MN $(k,R(k))$ (see Mehta and Patel, 1983). If $Q$ is the set of all of the paths which pass through the DN $(k-1,R(k-1))$ and which have a common subpath (of length PAST) to the MN $(k,R(k))$, then no path of $Q$ will contribute to the p-value if (3) holds and all of the paths of $Q$ will contribute to the p-value if (2) holds, in which case the overall contribution is

$$\binom{N(k-1)}{R(k-1)}\,\ell(k,R(k),R(k-1))\,\mathrm{PAST}.$$

In both cases, the paths of $Q$ are not considered again in the NA.
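The processing described above can be sketched as follows. This is a simplified illustration of the classic NA with conditions (2) and (3), not the modified algorithm proposed in the paper. It assumes that exact LP and SP values are obtained by plain backward induction over the network and that PAST values are kept in a dictionary per node; all names are hypothetical. Integer arithmetic is used so that the comparisons with $D\,P(X_o)$ are exact.

```python
from math import comb

def network_p_value(observed_row, col_sums):
    c, r1, n = len(col_sums), sum(observed_row), sum(col_sums)
    n_k = [0]
    for c_k in col_sums:
        n_k.append(n_k[-1] + c_k)             # n_k[k] = N(k)
    obs = 1
    for c_k, x in zip(col_sums, observed_row):
        obs *= comb(c_k, x)                   # D * P(X_o) as an exact integer

    def node_range(k):
        return range(max(0, r1 - n + n_k[k]), min(n_k[k], r1) + 1)

    def daughters(k, r):
        # R(k-1) values of the DNs of (k, r): 0 <= R(k) - R(k-1) <= C_k
        return [q for q in node_range(k - 1) if 0 <= r - q <= col_sums[k - 1]]

    # Longest (lp) and shortest (sp) subpath lengths to the terminal node
    lp, sp = {(0, 0): 1}, {(0, 0): 1}
    for k in range(1, c + 1):
        for r in node_range(k):
            arcs = [(comb(col_sums[k - 1], r - q), q) for q in daughters(k, r)]
            lp[k, r] = max(a * lp[k - 1, q] for a, q in arcs)
            sp[k, r] = min(a * sp[k - 1, q] for a, q in arcs)

    total = 0
    past = {(c, r1): {1: 1}}                  # node -> {PAST value: number of subpaths}
    for k in range(c, 0, -1):
        for r in node_range(k):
            for past_len, count in past.pop((k, r), {}).items():
                for q in daughters(k, r):
                    through = past_len * comb(col_sums[k - 1], r - q)
                    if through * lp[k - 1, q] <= obs:      # condition (2): all paths count
                        total += count * through * comb(n_k[k - 1], q)
                    elif through * sp[k - 1, q] > obs:     # condition (3): no path counts
                        continue
                    else:                                   # undecided: pass PAST to the DN
                        bucket = past.setdefault((k - 1, q), {})
                        bucket[through] = bucket.get(through, 0) + count
    return total / comb(n, r1)

# Hypothetical example: column sums (3, 2, 4), observed first row (3, 0, 1)
print(network_p_value([3, 0, 1], [3, 2, 4]))  # 10 / 126, about 0.0794
```

At the terminal node both LP and SP equal 1, so one of the two conditions always holds there and every path is eventually either counted or discarded.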

Since Mehta and Patel (1980, 1983, 1986a, 1986b), various improvements have been proposed for this NA in the general case of r×c tables (Joe, 1988; Clarkson et al., 1993; Aoki, 2002). The statistical package StatXact-5 (2001) incorporates this NA for the exact analysis of unordered r×c tables. Furthermore, in the case of 2×c tables, Shao (1997) has proposed a modification of the NA which is more efficient when the column sums are equal, but is generally less efficient when the column sums are different.

In this paper, we present some modifications to the NA which are valid for any 2×c table and which always make the NA much more efficient (see Section 4). On the one hand, we propose a general recursive method to calculate all of the exact $LP(\cdot,\cdot)$ quantities which are needed when processing the NA, based on the recurrence relation obtained in Section 2. On the other hand, in each stage of the NA, conditions (2) and (3) are checked for all of the DNs and all of the PAST values corresponding to each MN, but in the overwhelming majority of cases it is not necessary to check these conditions, especially condition (2). In Section 2, we obtain a relation between the maximum lengths of groups of subpaths which emanate from an MN. Combined with the ordering of the PAST values, this relation makes the number of times that condition (2) must be checked practically negligible. At the same time, it allows the sets of paths which contribute to the p-value to be handled in big blocks. These modifications (which are described in Section 3) produce a drastic reduction in the total computation time of the NA applied to a 2×c table.

Relation between maximum subpath lengths in the NA

According to the construction of the NA, for each node $(k,R(k))$, $k>1$, there is a one-to-one correspondence between the set of subpaths from the node $(k,R(k))$ to the terminal node and the set of points of the Multivariate Hypergeometric distribution $MH(R(k);C_1,\dots,C_k)$, and there is also a one-to-one correspondence between the modes of this distribution and the longest subpaths from $(k,R(k))$ to the terminal node. Specifically, if $(x_1^*,\dots,x_k^*)$ is a mode of the previous distribution, then the longest subpath which

Modifications to the NA for 2×c tables

We can sum up the modifications to the NA in the following points:

  1. As Clarkson et al. (1993) have shown, in each stage $k$ the exact $LP(k-1,R(k-1))$ quantities must be calculated and stored, in order to be recovered when necessary. Using the recurrence relations of Theorem 1, we propose a recursive method to calculate those LPs for all of the nodes of stage $k-1$ starting from the LP which corresponds to one of the nodes of that stage (for example, the first node). This initial LP is calculated through

Discussion

In this paper, we have presented some modifications to the NA in order to implement Fisher's Exact Test in 2×c tables, which make it much more efficient than the classic NA of Mehta and Patel (1986a). Henceforth, we shall call the NA with these proposed modifications, except the one referred to previously in Point 6, the modified NA. These modifications are valid for any 2×c contingency table and are compatible with other improvements which have already been proposed for the NA. It is

Acknowledgements

The authors would like to thank the co-editor and the two referees for helpful comments which improved this paper. This research was supported by the MEC, Spain, Grant number MTM-2004-00989 (with cofinancing of the FEDER) and by the C. I. C. E., Junta de Andalucía, Spain, Grant number FQM-0235.
