Abstract
Integer datasets frequently appear in many applications in science and engineering. To analyze these datasets, we consider an integer matrix approximation technique that can preserve the original dataset characteristics. Because integers are discrete in nature, to the best of our knowledge, no previously proposed technique developed for real numbers can be successfully applied. In this study, we first conduct a thorough review of current algorithms that can solve integer least squares problems, and then we develop an alternative least square method based on an integer least squares estimation to obtain the integer approximation of the integer matrices. We discuss numerical applications for the approximation of randomly generated integer matrices as well as studies of association rule mining, cluster analysis, and pattern extraction. Our computed results suggest that our proposed method can calculate a more accurate solution for discrete datasets than other existing methods.
Similar content being viewed by others
References
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Min Knowl. Discov. 11(1), 5–33 (2005)
Agrell, E., Eriksson, T., Vardy, A., Zeger, K.: Closest point search in lattices. IEEE Trans. Inf. Theory 48(8), 2201–2214 (2002). doi:10.1109/TIT.2002.800499
Asuncion, A., Newman, D.: UCI machine learning repository (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html
Banerjee, A., Krumpelman, C., Ghosh, J., Basu, S., Mooney, R.J.: Model-based overlapping clustering. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pp. 532–537. ACM, New York, NY, USA (2005). doi:10.1145/1081870.1081932
Breen, S., Chang, X.W.: Column reordering for box-constrained integer least squares problems. arXiv:1204.1407 (2012)
CBCL face database \(\sharp \) 1, MIT center for biological and computation learning (2000). http://www.ai.mit.edu/projects/cbcl
Chang, X.W., Golub, G.H.: Solving ellipsoid-constrained integer least squares problems. SIAM J. Matrix Anal. Appl. 31(3), 1071–1089 (2009). doi:10.1137/060660680
Chang, X.W., Han, Q.: Solving box-constrained integer least squares problems. IEEE Trans. Wirel. Commun. 7(1), 277–287 (2008). doi:10.1109/TWC.2008.060497
Chang, X.W., Yang, X.: An efficient regularization approach for underdetermined mimo system decoding (2007). http://www.cs.mcgill.ca/~chang/pub/ChaY07a.pdf
Chowdhury, G.G.: Introduction to Modern Information Retrieval, 2nd edn. Facet Publishing, Abingdon (2004)
Chu, M.T., Lin, M.M.: Low-dimensional polytope approximation and its applications to nonnegative matrix factorization. SIAM J. Sci. Comput. 30(3), 1131–1155 (2008). doi:10.1137/070680436
Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley, Hoboken (2006)
Donoho, D., Stodden, V.: When does nonnegative matrix factorization give a correct decomposition into parts? In: Proceedings of 17th Annual Conference on Neural Information Processing Systems. NIPS, Stanford University, Stanford, CA, 2003 (2003)
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95(25), 14863–14868 (1998)
Elhamifar, E., Vidal, R.: Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2765–2781 (2013)
Filipovych, R., Resnick, S.M., Davatzikos, C.: Semi-supervised cluster analysis of imaging data. NeuroImage 54(3), 2185–2197 (2011)
Gillis, N., Glineur, F.: Using underapproximations for sparse nonnegative matrix factorization. Pattern Recogn. 43(4), 1676–1687 (2010). doi:10.1016/j.patcog.2009.11.013. http://www.sciencedirect.com/science/article/pii/S0031320309004324
Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences, 4th edn. Johns Hopkins University Press, Baltimore (2013)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
Hassibi, A., Boyd, S.: Integer parameter estimation in linear models with applications to gps. In: Proceedings of the 35th IEEE Conference on Decision and Control, 1996, vol. 3, pp. 3245–3251 vol.3 (1996). doi:10.1109/CDC.1996.573639
Houseman, E.A., Accomando, W.P., Koestler, D.C., Christensen, B.C., Marsit, C.J., Nelson, H.H., Wiencke, J.K., Kelsey, K.T.: Dna methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinform. 13, 86 (2012). doi:10.1186/1471-2105-13-86
Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004)
Kabán, A., Bingham, E.: Factorisation and denoising of 0–1 data: a variational approach. Neurocomputing 71(10), 2291–2308 (2008)
Kannan, R., Ishteva, M., Park, H.: Bounded matrix low rank approximation. In: 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 319–328 (2012). doi:10.1109/ICDM.2012.131
Kawamoto, T., Hotta, K., Mishima, T., Fujiki, J., Tanaka, M., Kurita, T.: Estimation of single tones from chord sounds using non-negative matrix factorization. Neural Netw. World 3, 429–436 (2000)
Kim, H., Park, H.: Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30(2), 713–730 (2008)
Kim, H., Park, H., Drake, B.L.: Extracting unrecognized gene relationships from the biomedical literature via matrix factorizations. BMC Bioinform. 9, 211 (2008). doi:10.1186/1471-2105-9-211
Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Global Optim. 58(2), 285–319 (2014). doi:10.1007/s10898-013-0035-4
Kim, J., Park, H.: Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J. Sci. Comput. 33(6), 3261–3281 (2011). doi:10.1137/110821172
Koyutürk, M., Grama, A.: Proximus: a framework for analyzing very high dimensional discrete-attributed datasets. In: KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 147–156. ACM, New York, NY, USA (2003). doi:10.1145/956750.956770
Koyuturk, M., Grama, A., Ramakrishnan, N.: Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Trans. Knowl. Data Eng. 17(4), 447–461 (2005). doi:10.1109/TKDE.2005.55
Koyutürk, M., Grama, A., Ramakrishnan, N.: Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Trans. Math. Softw. 32(1), 33–69 (2006). doi:10.1145/1132973.1132976
Liao, J.C., Boscolo, R., Yang, Y.L., Tran, L.M., Sabatti, C., Roychowdhury, V.P.: Network component analysis: reconstruction of regulatory signals in biological systems. In: Proceedings of the National Academy of Sciences 100(26), 15522–15527 (2003). doi:10.1073/pnas.2136632100. http://www.pnas.org/content/100/26/15522.abstract
Lin, M.M.: Discrete eckart-young theorem for integer matrices. SIAM J. Matrix Anal. Appl. 32(4), 1367–1382 (2011). doi:10.1137/10081099X
Meeds, E., Ghahramani, Z., Neal, R.M., Roweis, S.T.: Modeling dyadic data with binary latent factors. Adv. Neural Inf. Process. Syst. 19, 977 (2007)
Morgan, S.D.: Cluster analysis in electronic manufacturing. Ph.D. dissertation, North Carolina State University, Raleigh, NC 27695 (2001)
Mow, W.H.: Universal lattice decoding: a review and some recent results. In: 2004 IEEE International Conference on Communications, vol. 5, pp. 2842–2846 (2004). doi:10.1109/ICC.2004.1313048
Segal, E., Battle, A., Koller, D.: Decomposing gene expression into cellular processes. In: In Proceedings of the 8th Pacific Symposium on Biocomputing (2003)
Slawski, M., Hein, M., Lutsik, P.: Matrix factorization with binary components. In: Advances in Neural Information Processing Systems, pp. 3210–3218 (2013)
Su, K., Wassell, I.: A new ordering for efficient sphere decoding. In: 2005 IEEE International Conference on Communications, ICC 2005, vol. 3, pp. 1906–1910 (2005). doi:10.1109/ICC.2005.1494671
Tu, S., Chen, R., Xu, L.: Transcription network analysis by a sparse binary factor analysis algorithm. J. Integr. Bioinform. 9(2), 68–79 (2012)
van der Veen, A.J.: Analytical method for blind binary signal separation. IEEE Trans. Signal Process. 45(4), 1078–1082 (1997). doi:10.1109/78.564198
van Emde-Boas, P.: Another NP-complete partition problem and the complexity of computing short vectors in a lattice. Report. Department of Mathematics. University of Amsterdam. Department, University (1981). http://books.google.com.tw/books?id=tCQiHQAACAAJ
Zhang, Z., Li, T., Ding, C., Zhang, X.: Binary matrix factorization with applications. In: Seventh IEEE International Conference on Data Mining, 2007, ICDM 2007, pp. 391–400. IEEE (2007)
Author information
Authors and Affiliations
Corresponding author
Additional information
Bo Dong research was supported in part by the National Natural Science Foundation of China under Grant 11101067 and the Fundamental Research Funds for the Central Universities. Matthew M. Lin research was supported in part by the National Center for Theoretical Sciences of Taiwan and by the Ministry of Science and Technology of Taiwan under Grants 104-2115-M-006-017-MY3 and 105-2634-E-002-001. Haesun Park research was supported in part by the Defense Advanced Research Projects Agency (DARPA) XDATA program Grant FA8750-12-2-0309 and NSF Grants CCF-0808863, IIS-1242304, and IIS-1231742.
Rights and permissions
About this article
Cite this article
Dong, B., Lin, M.M. & Park, H. Integer Matrix Approximation and Data Mining. J Sci Comput 75, 198–224 (2018). https://doi.org/10.1007/s10915-017-0531-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10915-017-0531-7