A major improvement to the Network Algorithm for Fisher's Exact Test in contingency tables
Introduction
Let be an observed contingency table, with fixed column sums and fixed row sums , and . Let us denote by the set of all the possible tables with the same marginal totals as , by X a table of F and by the first row of X. It is known that, if we assume the hypothesis of row and column independence, the random vector has a Multivariate Hypergeometric distribution MH and
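As a hedged illustration of the setup above, the sketch below assumes a table with two rows, so that a table with fixed margins is determined by its first row. It writes the column sums as `col_sums` and the first-row sum as `m`, and evaluates P(X) with the standard multivariate hypergeometric probability mass function. The function names and this notation are assumptions introduced here, not the paper's own symbols, and the set F is enumerated by brute force, which is only feasible for small tables.

```python
from fractions import Fraction
from itertools import product
from math import comb


def table_probability(first_row, col_sums):
    """P(X) under independence: multivariate hypergeometric pmf of the first row.

    Assumed notation: first_row = (x_1, ..., x_c), col_sums = (n_1, ..., n_c).
    """
    total = sum(col_sums)      # grand total N
    m = sum(first_row)         # fixed first-row sum
    numerator = 1
    for x, n in zip(first_row, col_sums):
        numerator *= comb(n, x)
    return Fraction(numerator, comb(total, m))


def enumerate_first_rows(col_sums, m):
    """Return every first row with 0 <= x_j <= n_j and x_1 + ... + x_c = m.

    Each such row identifies one table of F (brute force, small tables only).
    """
    ranges = (range(n + 1) for n in col_sums)
    return [row for row in product(*ranges) if sum(row) == m]


if __name__ == "__main__":
    col_sums, m = (4, 4), 4
    rows = enumerate_first_rows(col_sums, m)
    for row in rows:
        print(row, table_probability(row, col_sums))
```

Exact rational arithmetic (`Fraction`) is used so that the probabilities over F sum to exactly 1, which is convenient for checking but not required.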
The best algorithm to carry out Fisher's Exact Test is the Network Algorithm (NA) proposed by Mehta and Patel (1980, 1983, 1986a) (see Hirji and Johnson, 1996). The NA for a table can be summed up as follows. A directed acyclic network of nodes and arcs is built. The nodes are structured in stages, labelled as . In any stage k there is a set of nodes and each of them is labelled by a pair of integer values with where . In particular, in stage c there is a single node with (initial node) and in stage 0 there is also a single node with (terminal node). The values of the nodes of any other stage k form a series of positive consecutive integer numbers. The arcs emanate from each node of stage k and are directed towards a node of stage . Each node of stage which is joined by an arc with a node of stage k is a daughter node (DN) of , which is known as the mother node (MN). Each node of stage is always a DN of at least one MN of stage k. Each arc which reaches a node is connected to each one of the arcs which emanate from that same node. A path across the network is defined as a series of connected arcs which emanate from the initial node and reach the terminal node, passing through intermediate nodes. Each intermediate node of a path divides the path into two subpaths.
The network is constructed in such a way that each subpath from the node to the terminal node corresponds to just one subtable with row sums and column sums . More specifically, subpath corresponds to the subtable whose first row is . From now on, we will represent each subpath by its corresponding vector . For we will have a one-to-one correspondence between all of the paths from the initial node to the terminal node and the set F.
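The correspondence between paths and tables can be checked numerically on small cases. The sketch below is a minimal illustration under stated assumptions: it supposes a two-row table, labels each node of stage k by the sum of the first-row entries in the first k columns (an assumed labelling, chosen only so that the node values of a stage are consecutive integers), and counts the paths from the initial node to the terminal node. The helper names `build_stages`, `count_paths` and `count_tables` are hypothetical.

```python
from itertools import product


def build_stages(col_sums, m):
    """Map each stage k = 0..c to the list of feasible node values.

    A value is feasible when it can be reached from the initial node (c, m)
    and can still reach the terminal node (0, 0).
    """
    c = len(col_sums)
    stages = {}
    for k in range(c + 1):
        head = sum(col_sums[:k])   # capacity of the first k columns
        tail = sum(col_sums[k:])   # capacity of the remaining columns
        lo = max(0, m - tail)
        hi = min(m, head)
        stages[k] = list(range(lo, hi + 1))
    return stages


def count_paths(col_sums, m):
    """Count the paths from the initial node to the terminal node."""
    stages = build_stages(col_sums, m)
    counts = {0: 1}  # one (empty) subpath from the terminal node
    for k in range(1, len(col_sums) + 1):
        n_k = col_sums[k - 1]
        new_counts = {}
        for value in stages[k]:
            # An arc to a daughter node removes x from the node value.
            new_counts[value] = sum(
                counts.get(value - x, 0) for x in range(n_k + 1)
            )
        counts = new_counts
    return counts[m]


def count_tables(col_sums, m):
    """Brute-force count of first rows with the given margins."""
    ranges = (range(n + 1) for n in col_sums)
    return sum(1 for row in product(*ranges) if sum(row) == m)


if __name__ == "__main__":
    print(build_stages((2, 3, 4), 4))
    print(count_paths((2, 3, 4), 4), count_tables((2, 3, 4), 4))
```

For column sums (2, 3, 4) and first-row sum 4, both counts agree, as the one-to-one correspondence requires.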
The length of the arc which joins an MN with its DN is defined by and the length of a path (or subpath) is defined as the product of the lengths of its arcs. Therefore, the length of the path which corresponds to will be P(X).
The p-value associated with the observed table is given by where . The NA calculates this p-value by identifying and adding up the lengths of all of the paths in the network which are not longer than DP, without explicitly enumerating each path. The decision as to whether or not paths contribute to the p-value is taken for sets of paths, and it is made at the nodes of the network. For each DN of the MN we must check if one of the following conditions holds: and where LP and SP are the lengths of the longest and shortest subpath, respectively, from the DN to the terminal node, and PAST is the length of a subpath from the initial node to the MN (see Mehta and Patel, 1983). If Q is the set of all of the paths which pass through the DN and which have a common subpath (of length PAST) to the MN , then no path of Q will contribute to the p-value if (3) holds and all of the paths of Q will contribute to the p-value if (2) holds, in which case the overall contribution is . In both cases, the paths of Q are not considered again in the NA.
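The trimming idea described above can be sketched in a few lines for small tables. The code below is a simplified depth-first variant written under several assumptions, not the algorithm as implemented by Mehta and Patel or by the authors: it supposes a two-row table, uses unnormalised integer arc lengths (binomial coefficients) and divides by the normalising constant only at the end, folds the arc length into PAST before comparing with the bounds, and does not merge equal PAST values at a node as a full implementation would. All function names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product
from math import comb


def network_p_value(observed_first_row, col_sums):
    """Two-sided exact p-value via bound-based trimming of the network."""
    col_sums = tuple(col_sums)
    m = sum(observed_first_row)
    dp = 1  # unnormalised length of the observed path
    for x, n in zip(observed_first_row, col_sums):
        dp *= comb(n, x)

    def daughters(k, mk):
        """Yield (x_k, daughter value) for node (k, mk)."""
        rest = sum(col_sums[:k - 1])
        for x in range(max(0, mk - rest), min(col_sums[k - 1], mk) + 1):
            yield x, mk - x

    @lru_cache(maxsize=None)
    def bounds(k, mk):
        """(LP, SP): longest and shortest subpath from (k, mk) to the end."""
        if k == 0:
            return 1, 1
        longest, shortest = 0, None
        for x, child in daughters(k, mk):
            arc = comb(col_sums[k - 1], x)
            lp, sp = bounds(k - 1, child)
            longest = max(longest, arc * lp)
            shortest = arc * sp if shortest is None else min(shortest, arc * sp)
        return longest, shortest

    def total(k, mk):
        """Sum of all subpath lengths from (k, mk) (Vandermonde identity)."""
        return comb(sum(col_sums[:k]), mk)

    def visit(k, mk, past):
        acc = 0
        for x, child in daughters(k, mk):
            new_past = past * comb(col_sums[k - 1], x)
            lp, sp = bounds(k - 1, child)
            if new_past * lp <= dp:      # every path through child counts
                acc += new_past * total(k - 1, child)
            elif new_past * sp > dp:     # no path through child counts
                continue
            else:                        # undecided: go one stage deeper
                acc += visit(k - 1, child, new_past)
        return acc

    return Fraction(visit(len(col_sums), m, 1), comb(sum(col_sums), m))


def brute_force_p_value(observed_first_row, col_sums):
    """Reference value by full enumeration (small tables only)."""
    m = sum(observed_first_row)

    def length(row):
        out = 1
        for x, n in zip(row, col_sums):
            out *= comb(n, x)
        return out

    dp = length(observed_first_row)
    rows = product(*(range(n + 1) for n in col_sums))
    hits = sum(length(r) for r in rows if sum(r) == m and length(r) <= dp)
    return Fraction(hits, comb(sum(col_sums), m))


if __name__ == "__main__":
    print(network_p_value((3, 1), (4, 4)))
    print(brute_force_p_value((3, 1), (4, 4)))
```

Working with exact integers avoids the floating-point tolerance that a production implementation needs when comparing path lengths with DP. The bounds here are computed by plain recursion with memoisation, which is a stand-in for the recursive LP calculation the paper proposes.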
Since Mehta and Patel (1980, 1983, 1986a, 1986b), various improvements have been proposed for this NA in the general case of tables (Joe, 1988; Clarkson et al., 1993; Aoki, 2002). The statistical package StatXact-5 (2001) incorporates this NA for the exact analysis of unordered tables. Furthermore, in the case of tables, Shao (1997) has proposed a modification of the NA which is more efficient when the column sums are equal, but is generally less efficient when the column sums are different.
In this paper, we present some modifications to the NA which are valid for any table and which always make the NA much more efficient (see Section 4). First, we propose a general recursive method to calculate all of the exact LP quantities which are needed when running the NA, based on the recurrence relation obtained in Section 2. Second, in each stage of the NA, conditions (2) and (3) are checked for all of the DNs and all of the PAST values corresponding to each MN, yet in the overwhelming majority of cases these checks are unnecessary, especially for condition (2). In Section 2, we obtain a relation between the maximum lengths of groups of subpaths which emanate from an MN. Combined with an ordering of the PAST values, this relation makes the number of times that condition (2) must be checked practically insignificant. It also allows the sets of paths which contribute to the p-value to be handled in large blocks. These modifications (which are described in Section 3) produce a drastic reduction in the total computation time of the NA applied to a table.
Relation between maximum subpath lengths in the NA
According to the construction of the NA, for each node there is a one-to-one correspondence between the set of subpaths from the node to the terminal node and the set of points of the Multivariate Hypergeometric distribution MH, and there is also a one-to-one correspondence between the modes of this distribution and the longest subpaths from to the terminal node. Specifically, if is a mode of this distribution, then the longest subpath which
Modifications to the NA for tables
We can sum up the modifications to the NA in the following points:
1. As Clarkson et al. (1993) have shown, in each stage k the exact LP quantities must be calculated and stored, in order to be recovered when necessary. Using the recurrence relations of Theorem 1, we propose a recursive method to calculate those LPs for all of the nodes of stage starting from the LP which corresponds to one of the nodes of that stage (for example, the first node). This initial LP is calculated through
Discussion
In this paper, we have presented some modifications to the NA in order to implement Fisher's Exact Test in tables, which make it much more efficient than the classic NA of Mehta and Patel (1986a). Henceforth, we shall call the NA with these proposed modifications, except the one referred to previously in Point 6, the modified NA. These modifications are valid for any contingency table and are compatible with other improvements which have already been proposed for the NA. It is
Acknowledgements
The authors would like to thank the co-editor and the two referees for helpful comments which improved this paper. This research was supported by the MEC, Spain, Grant number MTM-2004-00989 (with cofinancing of the FEDER) and by the C. I. C. E., Junta de Andalucía, Spain, Grant number FQM-0235.
References
- Hirji and Johnson. A comparison of algorithms for exact analysis of unordered contingency tables. Comput. Statist. Data Anal. (1996)
- Requena and Martín. Characterization of maximum probability points in the Multivariate Hypergeometric Distribution. Statist. Probab. Lett. (2000)
- Shao. An efficient algorithm for the exact test on unordered contingency tables with equal column sums. Comput. Statist. Data Anal. (1997)
- Aoki. Improving path trimming in a network algorithm for Fisher's Exact Test in two-way contingency tables. J. Stat. Comput. Simul. (2002)
- Clarkson et al. A remark on algorithm 643: FEXACT: an algorithm for performing Fisher's Exact Test in contingency tables. ACM Trans. Math. Software (1993)
- Joe. Extreme probabilities for contingency tables under row and column independence with application to Fisher's Exact Test. Comm. Statist. Theory Methods (1988)