Abstract
Frequent itemset mining (FIM) is one of the core problems in the field of Data Mining and occupies a central place in its literature. One equivalent form of FIM can be stated as follows: given a rectangular data matrix with binary entries, find every submatrix of 1s having a minimum number of columns. This paper presents a theoretical analysis of several statistical questions related to this problem when noise is present. We begin by establishing several results concerning the extremal behavior of submatrices of ones in a binary matrix with random entries. These results provide simple significance bounds for the output of FIM algorithms. We then consider the noise sensitivity of FIM algorithms under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. Thus such blocks cannot be directly recovered by FIM algorithms, which search for submatrices of all 1s. On the positive side, we show how, in the presence of noise, an error-tolerant criterion can recover a square submatrix of 1s against a background of 0s, even when the size of the target submatrix is very small.
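The significance bounds mentioned above can be illustrated with a standard first-moment calculation. The sketch below is not taken from the paper: it assumes an n x n matrix with i.i.d. Bernoulli(p) entries, and the function names are hypothetical. It computes the expected number of k x k submatrices of all 1s, which is C(n, k)^2 * p^(k^2), and finds the smallest k at which that expectation drops below 1.

```python
import math


def log_expected_count(n: int, k: int, p: float) -> float:
    """Natural log of the expected number of k x k all-ones submatrices
    in an n x n matrix with i.i.d. Bernoulli(p) entries (an assumed model).

    There are C(n, k)^2 ways to choose k rows and k columns, and each
    choice is all ones with probability p^(k*k). Logs avoid underflow.
    """
    return 2.0 * math.log(math.comb(n, k)) + k * k * math.log(p)


def first_moment_threshold(n: int, p: float) -> int:
    """Smallest k >= 1 for which the expected count is below 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    for k in range(1, n + 1):
        if log_expected_count(n, k, p) < 0.0:
            return k
    # For n >= 1 the loop always returns, since k = n gives n*n*log(p) < 0.
    raise ValueError("n must be at least 1")


if __name__ == "__main__":
    n, p = 100, 0.5
    k = first_moment_threshold(n, p)
    print(k, math.exp(log_expected_count(n, k, p)))
```

By Markov's inequality, the probability that some k x k submatrix of 1s appears is at most the expected count, so exp(log_expected_count(n, k, p)) can serve as a crude upper bound on the chance of seeing a block that large under pure noise. The paper's own bounds may be sharper than this sketch.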
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sun, X., Nobel, A. (2006). Significance and Recovery of Block Structures in Binary Matrices with Noise. In: Lugosi, G., Simon, H.U. (eds) Learning Theory. COLT 2006. Lecture Notes in Computer Science, vol 4005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11776420_11
DOI: https://doi.org/10.1007/11776420_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35294-5
Online ISBN: 978-3-540-35296-9
eBook Packages: Computer Science, Computer Science (R0)