BSig: evaluating the statistical significance of biclustering solutions

Henriques, Rui; Madeira, Sara C.

doi:10.1007/s10618-017-0521-2

Published: 28 June 2017

Volume 32, pages 124–161 (2018)

Data Mining and Knowledge Discovery

Rui Henriques¹ & Sara C. Madeira²,³

Abstract

Statistical evaluation of biclustering solutions is essential to guarantee the absence of spurious relations and to validate the high number of scientific statements inferred from unsupervised data analysis without proper statistical grounding. Most biclustering methods rely on merit functions to discover biclusters with specific homogeneity criteria. However, strong homogeneity does not guarantee the statistical significance of biclustering solutions. Furthermore, although some biclustering methods test the statistical significance of specific types of biclusters, there are no methods to assess the significance of flexible biclustering models. This work proposes a method to evaluate the statistical significance of biclustering solutions. It integrates state-of-the-art statistical views on the significance of local patterns and extends them with new principles to assess the significance of biclusters with additive, multiplicative, symmetric, order-preserving and plaid coherencies. The proposed statistical tests provide the unprecedented possibility to minimize the number of false positive biclusters without incurring false negatives, and to compare state-of-the-art biclustering algorithms according to the statistical significance of their outputs. Results on synthetic and real data support the soundness and relevance of the proposed contributions, and stress the need to combine significance and homogeneity criteria to guide the search for biclusters.

Notes

https://web.ist.utl.pt/rmch/software/bsig/.
An illustrative statistical test is to rely on the percentage of synthetic datasets with support higher than $\hat{\theta }$: $p(x)=\frac{1}{h}{\varSigma }_{i=1}^{h} f(x-\hat{\theta })$, where $f(z)=1$ if $z\le 0$ and 0 otherwise.
$((1-v(B))/(1-E[v(B)]))\cdot (E[v(B)]/v(B))$, where v(B) is the fraction of transactions with some but not all $\varphi _B$ items, and E[v(B)] is the expectation of v(B) in a random dataset (Aggarwal and Yu 1998).
Available at http://web.ist.utl.pt/rmch/software/bsig/.
http://www.bioinf.jku.at/software/fabia/gene_expression.html.
http://chemogenomics.stanford.edu/supplements/03nuc/datasets.html.
Using Yeastract http://yeastract.com and Enrichr http://amp.pharm.mssm.edu/Enrichr.
To run experiments, we used: fabia package from R, BicAT (Barkow et al. 2006) and BicPAMS (Henriques et al. 2017) software.

References

Aggarwal CC, Yu PS (1998) A new framework for itemset generation. In: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, ACM, New York, NY, USA, PODS ’98, pp 18–24, doi:10.1145/275487.275490
Alzahrani M, Kuwahara H, Wang W, Gao X (2017) Gracob: a novel graph-based constant-column biclustering method for mining growth phenotype data. Bioinformatics. doi:10.1093/bioinformatics/btx199
Balakrishnan S, Kolar M, Rinaldo A, Singh A, Wasserman L (2011) Statistical and computational tradeoffs in biclustering. In: NIPS 2011 workshop on computational trade-offs in statistical learning, vol 4
Barkow S, Bleuler S, Prelić A, Zimmermann P, Zitzler E (2006) Bicat: a biclustering analysis toolbox. Bioinformatics 22(10):1282. doi:10.1093/bioinformatics/btl099
Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3):213–246. doi:10.1023/A:1011429418057
Bellay J, Atluri G, Sing TL, Toufighi K, Costanzo M, Ribeiro PSM, Pandey G, Baller J, VanderSluis B, Michaut M, Han S, Kim P, Brown GW, Andrews BJ, Boone C, Kumar V, Myers CL (2011) Putting genetic interactions in context through a global modular decomposition. Genome Res 21(8):1375–1387. doi:10.1101/gr.117176.110
Ben-Dor A, Chor B, Karp R, Yakhini Z (2003) Discovering local structure in gene expression data: the order-preserving submatrix problem. J Comput Biol 10(3–4):373–384. doi:10.1089/10665270360688075
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol) 57(1):289–300. doi:10.2307/2346101
Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4):1165–1188. doi:10.1214/aos/1013699998
Bolton RJ, Hand DJ, Adams NM (2002) Determining hit rate in pattern search. Springer, Berlin, pp 36–48. doi:10.1007/3-540-45728-3_4
Brown GW (1947) On small-sample estimation. Ann Math Stat 18(4):582–585
Califano A, Stolovitzky G, Tu Y (2000) Analysis of gene expression microarrays for phenotype classification. Int Conf Intell Syst Mol Biol 8:75–85
Carmona-Saez P, Chagoyen M, Rodriguez A, Trelles O, Carazo JM, Pascual-Montano A (2006) Integrated analysis of gene expression by association rules discovery. BMC Bioinform 7(1):54. doi:10.1186/1471-2105-7-54
Chen Y, Xu J (2016) Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. J Mach Learn Res 17(1):882–938
Cheng Y, Church GM (2000) Biclustering of expression data. Intell Syst Mol Biol 8:93–103
DuMouchel W (1999) Bayesian data mining in large frequency tables, with an application to the fda spontaneous reporting system. Am Stat 53(3):177–190. doi:10.2307/2686093
DuMouchel W, Pregibon D (2001) Empirical bayes screening for multi-item associations. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’01, pp 67–76, doi:10.1145/502512.502526
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci 95(25):14863–14868
Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11(12):4241–4257. doi:10.1091/mbc.11.12.4241
Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3). doi:10.1145/1297332.1297338
Gnatyshak D, Ignatov D, Semenov A, Poelmans J (2012) Gaining insight in social networks with biclustering and triclustering. In: Perspectives in business informatics research, LNBIP, vol 128. Springer, Berlin Heidelberg, pp 162–171, doi:10.1007/978-3-642-33281-4_13
Hämäläinen W, Nykänen M (2008) Efficient discovery of statistically significant association rules. In: 2008 Eighth IEEE international conference on data mining (ICDM), pp 203–212. doi:10.1109/ICDM.2008.144
Henriques R (2016) Learning from high-dimensional data using local descriptive models. PhD thesis, Instituto Superior Tecnico, Universidade de Lisboa, Lisboa
Henriques R, Madeira S (2014a) Bicspam: flexible biclustering using sequential patterns. BMC Bioinform 15(1):130. doi:10.1186/1471-2105-15-130
Henriques R, Madeira SC (2014b) Bicpam: pattern-based biclustering for biomedical data analysis. Algorithms Mol Biol 9(1):27. doi:10.1186/s13015-014-0027-z
Henriques R, Madeira SC (2015) Biclustering with flexible plaid models to unravel interactions between biological processes. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 12(4):738–752. doi:10.1109/TCBB.2014.2388206
Henriques R, Madeira SC (2016a) Bic2pam: constraint-guided biclustering for biological data analysis with domain knowledge. Algorithms Mol Biol 11(1):23. doi:10.1186/s13015-016-0085-5
Henriques R, Madeira SC (2016b) Bicnet: flexible module discovery in large-scale biological networks using biclustering. Algorithms Mol Biol 11(1):1–30. doi:10.1186/s13015-016-0074-8
Henriques R, Antunes C, Madeira SC (2015) A structured view on pattern mining-based biclustering. Pattern Recognit 48(12):3941–3958. doi:10.1016/j.patcog.2015.06.018
Henriques R, Ferreira FL, Madeira SC (2017) Bicpams: software for biological data analysis with pattern-based biclustering. BMC Bioinform 18(1):82. doi:10.1186/s12859-017-1493-3
Hochreiter S, Bodenhofer U, Heusel M et al (2010) Fabia: factor analysis for bicluster acquisition. Bioinformatics 26(12):1520–1527. doi:10.1093/bioinformatics/btq227
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70
Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1):1. doi:10.1093/nar/gkn923
Ihmels J, Bergmann S, Barkai N (2004) Defining transcription modules using large-scale gene expression data. Bioinformatics 20(13):1993. doi:10.1093/bioinformatics/bth166
Jaroszewicz S, Scheffer T (2005) Fast discovery of unexpected patterns in data, relative to a bayesian network. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, ACM, New York, NY, USA, KDD ’05, pp 118–127. doi:10.1145/1081870.1081887
Karian Z, Dudewicz E (2010) Handbook of fitting statistical distributions with R. Taylor & Francis, Milton Park
Kirsch A, Mitzenmacher M, Pietracaprina A, Pucci G, Upfal E, Vandin F (2012) An efficient rigorous approach for identifying statistically significant frequent itemsets. J ACM 59(3):12:1–12:22. doi:10.1145/2220357.2220359
Koyuturk M, Szpankowski W, Grama A (2004) Biclustering gene-feature matrices for statistically significant dense patterns. In: Proceedings. 2004 IEEE computational systems bioinformatics conference (CSB), pp 480–484. doi:10.1109/CSB.2004.1332467
Lazzeroni L, Owen A (2002) Plaid models for gene expression data. Statistica Sinica 12(1):61–86. http://www.jstor.org/stable/24307036
Lee JD, Sun Y, Taylor JE (2015) Evaluating the statistical significance of biclusters. In: Advances in neural information processing systems 28 (NIPS), Curran Associates, Inc., pp 1324–1332
Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C (2007) A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet 39(10):1235–1244. doi:10.1038/ng2117
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 1(1):24–45. doi:10.1109/TCBB.2004.2
Madeira SC, Oliveira AL (2007) An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data. In: Asia Pacific bioinformatics conference, pp 67–80
Madeira SC, Teixeira MC, Sa-Correia I, Oliveira AL (2010) Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 7(1):153–165. doi:10.1109/TCBB.2008.34
Mahfouz MA, Ismail MA (2009) Bidens: Iterative density based biclustering algorithm with application to gene expression analysis. Int J Comput Electr Automa Control Inf Eng 3(1):40–46
Mankad S, Michailidis G (2014) Biclustering three-dimensional data arrays with plaid models. J Comput Graph Stat 23(4):943–965. doi:10.1080/10618600.2013.851608
Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of the fourth international conference on knowledge discovery and data mining, AAAI Press, KDD’98, pp 274–278
Mitra S, Banka H (2006) Multi-objective evolutionary biclustering of gene expression data. Pattern Recognit 39(12):2464–2477. doi:10.1016/j.patcog.2006.03.003
Noureen N, Kulsoom N, de la Fuente A, Fazal S, Malik SI (2009) Functional and promoter enrichment based analysis of biclustering algorithms using gene expression data of yeast. In: 2009 IEEE 13th international multitopic conference (INMIC), IEEE, pp 1–6, doi:10.1109/INMIC.2009.5383144
Ojala M, Vuokko N, Kallio A, Haiminen N, Mannila H (2008) Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of the 2008 SIAM international conference on data mining (SDM), SIAM, vol 8, pp 494–505. doi:10.1137/1.9781611972788.45
Okada Y, Fujibuchi W, Horton P (2007) A biclustering method for gene expression module discovery using closed itemset enumeration algorithm. IPSJ Trans Bioinform 3(SIG5):183–192. doi:10.2197/ipsjdc.3.183
Pio G, Ceci M, D’Elia D, Loglisci C, Malerba D (2012) A novel biclustering algorithm for the discovery of meaningful biological correlations between mirnas and mrnas. EMBnetjournal 18(A). doi:10.14806/ej.18.A.375
Ramon J, Miettinen P, Vreeken J (2013) Detecting bicliques in gf[q]. In: Proceedings of the European conference on machine learning and knowledge discovery in databases, vol 8188, Springer New York, Inc., New York, NY, USA, ECML PKDD, pp 509–524. doi:10.1007/978-3-642-40988-2_33
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM, Hurt EM, Zhao H, Averett L, Yang L, Wilson WH, Jaffe ES, Simon R, Klausner RD, Powell J, Duffey PL, Longo DL, Greiner TC, Weisenburger DD, Sanger WG, Dave BJ, Lynch JC, Vose J, Armitage JO, Montserrat E, López-Guillermo A, Grogan TM, Miller TP, LeBlanc M, Ott G, Kvaloy S, Delabie J, Holte H, Krajci P, Stokke T, Staudt LM (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-b-cell lymphoma. N Engl J Med 346(25):1937–1947. doi:10.1056/NEJMoa012914
Scheffer T (2005) Finding association rules that trade support optimally against confidence. Intell Data Anal 9(4):381–395. doi:10.1007/3-540-44794-6_35
Serin A, Vingron M (2011) Debi: Discovering differentially expressed biclusters using a frequent itemset approach. Algorithms Mol Biol 6:1–12. doi:10.1186/1748-7188-6-18
Silberschatz A, Tuzhilin A (1996) What makes patterns interesting in knowledge discovery systems. IEEE Trans Knowl Data Eng 8(6):970–974. doi:10.1109/69.553165
Silverstein C, Brin S, Motwani R (1998) Beyond market baskets: generalizing association rules to dependence rules. Data Min Knowl Discov 2(1):39–68. doi:10.1023/A:1009713703947
Tanay A, Sharan R, Shamir R (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(suppl1):S136. doi:10.1093/bioinformatics/18.suppl_1.S136
Tavazoie S, Hughes J, Campbell M, Cho R, Church G (1999) Systematic determination of genetic network architecture. Nature Genet 22(3):281–285. doi:10.1038/10343
Wang H, Wang W, Yang J, Yu PS (2002) Clustering by pattern similarity in large data sets. In: Proceedings of the 2002 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’02, pp 394–405. doi:10.1145/564691.564737
Webb GI (2007) Discovering significant patterns. Mach Learn 68(1):1–33. doi:10.1007/s10994-007-5006-x
Yang J, Wang W, Wang H, Yu P (2002) delta-clusters: capturing subspace correlation in a large data set. In: Proceedings 18th international conference on data engineering (ICDE), IEEE, pp 517–528. doi:10.1109/ICDE.2002.994771
Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’04, pp 374–383. doi:10.1145/1014052.1014094

Acknowledgements

This work was supported by FCT under the Neuroclinomics2 Project PTDC/EEI-SII/1937/2014, Research Grant SFRH/BD/75924/2011 to RH, INESC-ID plurianual Ref. UID/CEC/50021/2013, and LASIGE Research Unit Ref. UID/CEC/00408/2013.

Author information

Authors and Affiliations

INESC-ID and DEI, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
Rui Henriques
LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
Sara C. Madeira
INESC-ID, Lisbon, Portugal
Sara C. Madeira

Corresponding author

Correspondence to Rui Henriques.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Responsible editor: Ian Davidson.

Appendices

Appendix 1: Relevance of biclustering with flexible coherency

High-dimensional biomedical and social data is characterized by the presence of biclusters with flexible coherency assumptions (Table 6). Table 6 motivates the relevance of such biclusters, highlighting data contexts where their discovery is relevant for different purposes.

Table 6 Relevance of non-constant biclusters for biomedical and social data analysis

Appendix 2: Continuous adjustment factors

The proposed statistical tests can be extended to support biclusters with continuous adjustment factors. Consider the $\{1.3,2.2,1.7\}$ combination of values: a continuous range of coherent values under additive or multiplicative assumptions can be generated by exploring $\gamma $ factors (e.g. shifting factors $\gamma \in [-1.3,1.8]$ or scaling factors $\gamma \in [0,1.8]$ for values $a_{ij}\in [0,4]$). In order to robustly compute the $p_{\varphi _B}$ probability of additive and multiplicative models, and subsequently of symmetric and plaid models (with underlying additive/multiplicative assumptions), we propose a technique based on the integral of the product of (either shifted or scaled) probability density functions.

Continuous ranges of shifting factors

Consider the additive coherency assumption. Let the maximum and minimum observed values for a particular row $x_i\in I$ of a bicluster be, respectively, $max_{J|x_i}$ and $min_{J|x_i}$, and let the range of real values of the matrix A be $[min_{A},max_{A}]$. Then, for a particular pattern $\mathbf {J}|\mathbf {x}_i$, the shifting factors are defined by the interval $\gamma \in [\gamma _1=-(min_{J|x_i}-min_{A}),\gamma _2=max_{A}-max_{J|x_i}]$; for the values $\{1.3,2.2,1.7\}$ with $a_{ij}\in [0,4]$, this gives $\gamma \in [-1.3,1.8]$. The probability of a particular value $a_{ij}$ occurring under this shifting interval is:

$$\begin{aligned} \int _{a_{ij}+\gamma _1}^{a_{ij}+\gamma _2}f(x)\,dx=\int _{\gamma _1}^{\gamma _2}f(x+a_{ij})\,dx \end{aligned}$$

(A1)

where f(x) is the probability density function that approximates the $a_{ij}$ values. This calculation assumes that the range of observed values $\hat{A}$ is linearly adjusted to guarantee a unitary coherency strength $\delta \approx 1$. The probability of two values $a_{ij}$ ($a_1$) and $a_{i(j+1)}$ ($a_2$) occurring under this shifting interval is not simply the product of their individual probabilities, since a simple product would allow for non-coherent values (e.g. $\{a_{1}+\gamma _1,a_{2}+\gamma _2/2\}$). In order to correctly account for the combination of values with continuous shifting ranges, the density functions need to be aligned by the target column value and multiplied. The resulting function delivers, for each shift, the product of the individual densities. Finally, the area under this curve between the $\gamma _1$ and $\gamma _2$ values is computed in order to retrieve an estimate of the probability $p_{\varphi _B}$ for the Binomial tail calculation. This strategy is illustrated in Fig. 14, under the assumption that the values in A are described by either a Uniform or a Gaussian distribution. Given the $\varphi _{B}^i=\{a_{i1},..,a_{im}\}$ combination of values, $p_{\varphi _{B}^i}$ can be approximated by:

$$\begin{aligned} \int _{\gamma _1}^{\gamma _2}\prod _{j=1}^m f(x+a_{ij})\,dx \end{aligned}$$

(A2)

In order to compute this probability efficiently, we propose approximating the area by interpolating 100 points between $\gamma _1$ and $\gamma _2$.
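As a rough illustration of the (A2) approximation, the following Python sketch evaluates the product of shifted densities on 100 equally spaced points between $\gamma _1$ and $\gamma _2$ and applies the trapezoidal rule. It is a minimal sketch under stated assumptions, not the BSig implementation: the function names (`uniform_pdf`, `gaussian_pdf`, `shift_pattern_probability`) are hypothetical, the trapezoidal rule is one possible reading of "interpolating 100 points", the Gaussian parameters are arbitrary, and the adjustment for the coherency strength $\delta$ is omitted. The shift interval follows the worked example above, where $\{1.3,2.2,1.7\}$ with $a_{ij}\in [0,4]$ gives $\gamma \in [-1.3,1.8]$.

```python
import math

def uniform_pdf(lo, hi):
    """Density of a Uniform(lo, hi) background; zero outside [lo, hi]."""
    eps = 1e-9  # tolerance so grid endpoints are not lost to rounding
    width = hi - lo
    return lambda x: 1.0 / width if lo - eps <= x <= hi + eps else 0.0

def gaussian_pdf(mu, sigma):
    """Density of a Normal(mu, sigma) background (parameters are illustrative)."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / norm

def shift_pattern_probability(values, min_a, max_a, pdf, points=100):
    """Trapezoidal estimate of the integral over gamma of prod_j f(gamma + a_ij)."""
    gamma1 = min_a - min(values)  # largest downward shift keeping values >= min_a
    gamma2 = max_a - max(values)  # largest upward shift keeping values <= max_a
    step = (gamma2 - gamma1) / (points - 1)
    total = 0.0
    for k in range(points):
        gamma = gamma1 + k * step
        product = 1.0
        for a in values:
            product *= pdf(gamma + a)  # densities aligned on the same shift
        weight = 0.5 if k in (0, points - 1) else 1.0  # trapezoid end weights
        total += weight * product
    return total * step

pattern = [1.3, 2.2, 1.7]
p_uniform = shift_pattern_probability(pattern, 0.0, 4.0, uniform_pdf(0.0, 4.0))
p_gauss = shift_pattern_probability(pattern, 0.0, 4.0, gaussian_pdf(2.0, 1.0))
print(p_uniform, p_gauss)
```

Under the Uniform assumption the integrand is constant, $(1/4)^3$, over a shift interval of width 3.1, so the estimate reduces to $3.1/64$; for other densities the accuracy depends on the number of grid points.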

Continuous ranges of scaling factors

The probability of occurrence of a combination of real values $\varphi _{B}^i$ on the $i\hbox {th}$ row of a bicluster under a multiplicative coherency across rows can be approximated using principles similar to the ones proposed in the previous section. Let $max_i$ and $min_i$ be the maximum and minimum values of a given row $x_i$, and let $\bar{A}$ be the range of real values in A. When only positive values are allowed, the scaling range is $[\gamma _1=0,\gamma _2=\bar{A}/max_i]$. When negative values are allowed, the scaling range is given by $[\gamma _1=-d,\gamma _2=d]$, where $d=max(max_i,-min_i)/(\bar{A}/2)$.

The probability of multiple values occurring is given by the integral of the product of the size-adjusted density functions over the $[\gamma _1,\gamma _2]$ interval. Why is the size adjustment necessary? Consider the pair of observed values $\{a_1=1,a_2=2.5\}$ and the scaling range $\gamma \in [0,1]$. This means that the density function to estimate the $a_1$ value is considered over the interval [0,1], while the density function to estimate $a_2$ is considered over [0,2.5]. Therefore, the density functions need to be normalized with regard to their size: $f(x/a_1)$ and $f(x/a_2)$. Given the $\varphi _{B}^i=\{a_{i1},..,a_{im}\}$ combination of values for the $i\hbox {th}$ row, $p_{\varphi _{B}^i}$ can be approximated by:

$$\begin{aligned} \int _{\gamma _1}^{\gamma _2}\prod _{j=1}^m f(x/a_{ij})\,dx \end{aligned}$$

(A3)

Similarly, the (A3) integral can be computed efficiently by resorting to interpolation whenever the product of the input density functions is complex. This strategy is illustrated in Fig. 15, under the assumption that the values of the A matrix are described by either a single Uniform or a single Gaussian distribution.

Table 7 Impact of data size and dimensionality on the expected minimum number of observations $(\mathrm{\mathbf{n}}_{min})$ in biclusters with continuous adjustment factors to guarantee their statistical significance (assuming a $\delta =0.2$ coherency strength, uniform background values, and additive and multiplicative coherencies with varying ranges of allowed shifts/scales). Algorithm 1 was applied to compute statistical significance

Appendix 3: Complementary results

Table 7 shows how the minimum number of rows required to guarantee the statistical significance of a real-valued bicluster with continuous shifts/scales varies with the number of rows and columns of the input dataset. Two major observations can be made. First, both the size and dimensionality of data affect the significance levels; the effect of varying the size of data is clearly more pronounced, since the assessment was applied over biclusters with coherency across rows. Second, the observed pattern also largely determines the computed significance levels, as it determines the range of allowed shifts and scales (Table 8). Understandably, the larger the allowed range, the higher the probability of a bicluster pattern occurring, and thus the higher the minimum number of rows in the bicluster needed to guarantee its significance.
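The dependence of the minimum number of rows on the pattern probability can be sketched with a plain Binomial tail, which Appendix 2 names as the test fed by $p_{\varphi _B}$. The Python sketch below is only an illustration under assumptions: the function names are hypothetical, the significance level $\alpha =0.05$ is an arbitrary choice, no multiple-testing correction is applied, and Algorithm 1 (used to produce Table 7) is not reproduced here. It searches for the smallest number of rows n such that $P(X\ge n)<\alpha$ for $X\sim \mathrm{Binomial}(N,p)$, where N is the number of rows in the data and p the pattern probability.

```python
import math

def binomial_tail(n, total_rows, p):
    """P(X >= n) for X ~ Binomial(total_rows, p), 0 < p < 1, summed in log space."""
    tail = 0.0
    for k in range(n, total_rows + 1):
        log_pmf = (math.lgamma(total_rows + 1) - math.lgamma(k + 1)
                   - math.lgamma(total_rows - k + 1)
                   + k * math.log(p) + (total_rows - k) * math.log(1.0 - p))
        tail += math.exp(log_pmf)
    return min(tail, 1.0)  # guard against rounding slightly above 1

def min_significant_rows(total_rows, p, alpha=0.05):
    """Smallest n with tail probability below alpha, or None if no n qualifies."""
    for n in range(total_rows + 1):
        if binomial_tail(n, total_rows, p) < alpha:
            return n
    return None

# A more probable pattern needs more supporting rows to be significant.
n_rare = min_significant_rows(1000, 0.01)
n_common = min_significant_rows(1000, 0.05)
print(n_rare, n_common)
```

In this sketch, raising either the pattern probability or the number of rows N pushes the required minimum upwards, which is the qualitative behaviour described for Tables 7 and 8.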

Table 8 Expected probability of different patterns to occur in biclusters with continuous shifts and scales from data with approximately uniform distribution of values ($a_{ij}\in [0,1]$)

Figure 16 provides a graphical representation of the results gathered throughout Tables 1, 2 and 3, thus showing the expected minimum number of rows in a bicluster that guarantees its significance for varying: coherency assumption, pattern expectations $\varphi _B$, coherency strength $|\mathcal {L}|$, pattern length m, data size N, and data dimensionality M.

About this article

Cite this article

Henriques, R., Madeira, S.C. BSig: evaluating the statistical significance of biclustering solutions. Data Min Knowl Disc 32, 124–161 (2018). https://doi.org/10.1007/s10618-017-0521-2

Received: 08 February 2016
Accepted: 12 June 2017
Published: 28 June 2017
Issue Date: January 2018
DOI: https://doi.org/10.1007/s10618-017-0521-2
