Skip to main content
Log in

Integer Matrix Approximation and Data Mining

  • Published:
Journal of Scientific Computing Aims and scope Submit manuscript

Abstract

Integer datasets frequently appear in many applications in science and engineering. To analyze these datasets, we consider an integer matrix approximation technique that can preserve the original dataset characteristics. Because integers are discrete in nature, to the best of our knowledge, no previously proposed technique developed for real numbers can be successfully applied. In this study, we first conduct a thorough review of current algorithms that can solve integer least squares problems, and then we develop an alternative least square method based on an integer least squares estimation to obtain the integer approximation of the integer matrices. We discuss numerical applications for the approximation of randomly generated integer matrices as well as studies of association rule mining, cluster analysis, and pattern extraction. Our computed results suggest that our proposed method can calculate a more accurate solution for discrete datasets than other existing methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

References

  1. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Min Knowl. Discov. 11(1), 5–33 (2005)

    Article  MathSciNet  Google Scholar 

  2. Agrell, E., Eriksson, T., Vardy, A., Zeger, K.: Closest point search in lattices. IEEE Trans. Inf. Theory 48(8), 2201–2214 (2002). doi:10.1109/TIT.2002.800499

    Article  MathSciNet  MATH  Google Scholar 

  3. Asuncion, A., Newman, D.: UCI machine learning repository (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html

  4. Banerjee, A., Krumpelman, C., Ghosh, J., Basu, S., Mooney, R.J.: Model-based overlapping clustering. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pp. 532–537. ACM, New York, NY, USA (2005). doi:10.1145/1081870.1081932

  5. Breen, S., Chang, X.W.: Column reordering for box-constrained integer least squares problems. arXiv:1204.1407 (2012)

  6. CBCL face database \(\sharp \) 1, MIT center for biological and computation learning (2000). http://www.ai.mit.edu/projects/cbcl

  7. Chang, X.W., Golub, G.H.: Solving ellipsoid-constrained integer least squares problems. SIAM J. Matrix Anal. Appl. 31(3), 1071–1089 (2009). doi:10.1137/060660680

    Article  MathSciNet  MATH  Google Scholar 

  8. Chang, X.W., Han, Q.: Solving box-constrained integer least squares problems. IEEE Trans. Wirel. Commun. 7(1), 277–287 (2008). doi:10.1109/TWC.2008.060497

    Article  Google Scholar 

  9. Chang, X.W., Yang, X.: An efficient regularization approach for underdetermined mimo system decoding (2007). http://www.cs.mcgill.ca/~chang/pub/ChaY07a.pdf

  10. Chowdhury, G.G.: Introduction to Modern Information Retrieval, 2nd edn. Facet Publishing, Abingdon (2004)

    Google Scholar 

  11. Chu, M.T., Lin, M.M.: Low-dimensional polytope approximation and its applications to nonnegative matrix factorization. SIAM J. Sci. Comput. 30(3), 1131–1155 (2008). doi:10.1137/070680436

    Article  MathSciNet  MATH  Google Scholar 

  12. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley, Hoboken (2006)

    MATH  Google Scholar 

  13. Donoho, D., Stodden, V.: When does nonnegative matrix factorization give a correct decomposition into parts? In: Proceedings of 17th Annual Conference on Neural Information Processing Systems. NIPS, Stanford University, Stanford, CA, 2003 (2003)

  14. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95(25), 14863–14868 (1998)

    Article  Google Scholar 

  15. Elhamifar, E., Vidal, R.: Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2765–2781 (2013)

    Article  Google Scholar 

  16. Filipovych, R., Resnick, S.M., Davatzikos, C.: Semi-supervised cluster analysis of imaging data. NeuroImage 54(3), 2185–2197 (2011)

    Article  Google Scholar 

  17. Gillis, N., Glineur, F.: Using underapproximations for sparse nonnegative matrix factorization. Pattern Recogn. 43(4), 1676–1687 (2010). doi:10.1016/j.patcog.2009.11.013. http://www.sciencedirect.com/science/article/pii/S0031320309004324

  18. Golub, G.H., Van Loan, C.F.: Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences, 4th edn. Johns Hopkins University Press, Baltimore (2013)

    Google Scholar 

  19. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)

    MATH  Google Scholar 

  20. Hassibi, A., Boyd, S.: Integer parameter estimation in linear models with applications to gps. In: Proceedings of the 35th IEEE Conference on Decision and Control, 1996, vol. 3, pp. 3245–3251 vol.3 (1996). doi:10.1109/CDC.1996.573639

  21. Houseman, E.A., Accomando, W.P., Koestler, D.C., Christensen, B.C., Marsit, C.J., Nelson, H.H., Wiencke, J.K., Kelsey, K.T.: Dna methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinform. 13, 86 (2012). doi:10.1186/1471-2105-13-86

  22. Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004)

    MathSciNet  MATH  Google Scholar 

  23. Kabán, A., Bingham, E.: Factorisation and denoising of 0–1 data: a variational approach. Neurocomputing 71(10), 2291–2308 (2008)

    Article  Google Scholar 

  24. Kannan, R., Ishteva, M., Park, H.: Bounded matrix low rank approximation. In: 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 319–328 (2012). doi:10.1109/ICDM.2012.131

  25. Kawamoto, T., Hotta, K., Mishima, T., Fujiki, J., Tanaka, M., Kurita, T.: Estimation of single tones from chord sounds using non-negative matrix factorization. Neural Netw. World 3, 429–436 (2000)

    Google Scholar 

  26. Kim, H., Park, H.: Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30(2), 713–730 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  27. Kim, H., Park, H., Drake, B.L.: Extracting unrecognized gene relationships from the biomedical literature via matrix factorizations. BMC Bioinform. 9, 211 (2008). doi:10.1186/1471-2105-9-211

  28. Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Global Optim. 58(2), 285–319 (2014). doi:10.1007/s10898-013-0035-4

    Article  MathSciNet  MATH  Google Scholar 

  29. Kim, J., Park, H.: Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J. Sci. Comput. 33(6), 3261–3281 (2011). doi:10.1137/110821172

    Article  MathSciNet  MATH  Google Scholar 

  30. Koyutürk, M., Grama, A.: Proximus: a framework for analyzing very high dimensional discrete-attributed datasets. In: KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 147–156. ACM, New York, NY, USA (2003). doi:10.1145/956750.956770

  31. Koyuturk, M., Grama, A., Ramakrishnan, N.: Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Trans. Knowl. Data Eng. 17(4), 447–461 (2005). doi:10.1109/TKDE.2005.55

    Article  Google Scholar 

  32. Koyutürk, M., Grama, A., Ramakrishnan, N.: Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Trans. Math. Softw. 32(1), 33–69 (2006). doi:10.1145/1132973.1132976

    Article  MathSciNet  MATH  Google Scholar 

  33. Liao, J.C., Boscolo, R., Yang, Y.L., Tran, L.M., Sabatti, C., Roychowdhury, V.P.: Network component analysis: reconstruction of regulatory signals in biological systems. In: Proceedings of the National Academy of Sciences 100(26), 15522–15527 (2003). doi:10.1073/pnas.2136632100. http://www.pnas.org/content/100/26/15522.abstract

  34. Lin, M.M.: Discrete eckart-young theorem for integer matrices. SIAM J. Matrix Anal. Appl. 32(4), 1367–1382 (2011). doi:10.1137/10081099X

    Article  MathSciNet  MATH  Google Scholar 

  35. Meeds, E., Ghahramani, Z., Neal, R.M., Roweis, S.T.: Modeling dyadic data with binary latent factors. Adv. Neural Inf. Process. Syst. 19, 977 (2007)

    Google Scholar 

  36. Morgan, S.D.: Cluster analysis in electronic manufacturing. Ph.D. dissertation, North Carolina State University, Raleigh, NC 27695 (2001)

  37. Mow, W.H.: Universal lattice decoding: a review and some recent results. In: 2004 IEEE International Conference on Communications, vol. 5, pp. 2842–2846 (2004). doi:10.1109/ICC.2004.1313048

  38. Segal, E., Battle, A., Koller, D.: Decomposing gene expression into cellular processes. In: In Proceedings of the 8th Pacific Symposium on Biocomputing (2003)

  39. Slawski, M., Hein, M., Lutsik, P.: Matrix factorization with binary components. In: Advances in Neural Information Processing Systems, pp. 3210–3218 (2013)

  40. Su, K., Wassell, I.: A new ordering for efficient sphere decoding. In: 2005 IEEE International Conference on Communications, ICC 2005, vol. 3, pp. 1906–1910 (2005). doi:10.1109/ICC.2005.1494671

  41. Tu, S., Chen, R., Xu, L.: Transcription network analysis by a sparse binary factor analysis algorithm. J. Integr. Bioinform. 9(2), 68–79 (2012)

  42. van der Veen, A.J.: Analytical method for blind binary signal separation. IEEE Trans. Signal Process. 45(4), 1078–1082 (1997). doi:10.1109/78.564198

    Article  Google Scholar 

  43. van Emde-Boas, P.: Another NP-complete partition problem and the complexity of computing short vectors in a lattice. Report. Department of Mathematics. University of Amsterdam. Department, University (1981). http://books.google.com.tw/books?id=tCQiHQAACAAJ

  44. Zhang, Z., Li, T., Ding, C., Zhang, X.: Binary matrix factorization with applications. In: Seventh IEEE International Conference on Data Mining, 2007, ICDM 2007, pp. 391–400. IEEE (2007)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Matthew M. Lin.

Additional information

Bo Dong research was supported in part by the National Natural Science Foundation of China under Grant 11101067 and the Fundamental Research Funds for the Central Universities. Matthew M. Lin research was supported in part by the National Center for Theoretical Sciences of Taiwan and by the Ministry of Science and Technology of Taiwan under Grants 104-2115-M-006-017-MY3 and 105-2634-E-002-001. Haesun Park research was supported in part by the Defense Advanced Research Projects Agency (DARPA) XDATA program Grant FA8750-12-2-0309 and NSF Grants CCF-0808863, IIS-1242304, and IIS-1231742.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Dong, B., Lin, M.M. & Park, H. Integer Matrix Approximation and Data Mining. J Sci Comput 75, 198–224 (2018). https://doi.org/10.1007/s10915-017-0531-7

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10915-017-0531-7

Keywords

Navigation