Ternary Matrix Factorization: problem definitions and algorithms

  • Regular paper
  • Knowledge and Information Systems

Abstract

Can we learn from the unknown? Logical data sets of the ternary kind are often found in information systems. They contain unknown as well as true/false values. An unknown value may represent a missing entry (lost or indeterminable) or have meaning, like a Don’t Know response in a questionnaire. In this paper, we introduce algorithms for reducing the dimensionality of logical data (categorical data in general) in the context of a new data mining challenge: Ternary Matrix Factorization (TMF). For a ternary data matrix, TMF exploits ternary logic to produce a basis matrix (which holds the major patterns in the data) and a usage matrix (which maps patterns to original observations). Both matrices are interpretable, and their ternary matrix product approximates the original matrix. TMF has applications in (1) finding targeted structure in ternary data, (2) imputing values through pattern discovery in highly incomplete categorical data sets, and (3) solving instances of its encapsulated Binary Matrix Factorization problem. Our elegant algorithm FasTer (FASt TERnary Matrix Factorization) has linear run-time complexity with respect to the dimensions of the data set and is parameter-robust. A variant of FasTer that exploits useful results from combinatorics provides accuracy bounds for a core part of the algorithm in certain situations. Experiments on synthetic and real-world data sets show that our algorithms are able to outperform state-of-the-art techniques in all three TMF applications with respect to run-time and effectiveness. Finally, convincing speedup and efficiency results on a parallel version of FasTer demonstrate its suitability for weak- and strong-scaling scenarios.
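
As a rough illustration of how a usage matrix and a basis matrix can combine into an approximation of ternary data, the sketch below computes a ternary matrix product. The exact product definition is not visible in this preview, so the sketch assumes Kleene's strong three-valued logic (AND as minimum, OR as maximum). The numeric encoding, the function name `ternary_product`, the binary usage matrix, and the toy matrices are all hypothetical choices made for illustration, not the authors' implementation.

```python
# Assumed encoding (not taken from the paper): false = 0.0, unknown = 0.5, true = 1.0.
# Under Kleene's strong three-valued logic, AND is the minimum and OR is the maximum.
F, U, T = 0.0, 0.5, 1.0


def ternary_product(usage, basis):
    """Entry (i, j) is the OR over patterns p of (usage[i][p] AND basis[p][j])."""
    n_patterns = len(basis)
    n_cols = len(basis[0])
    return [
        [
            max(min(row[p], basis[p][j]) for p in range(n_patterns))
            for j in range(n_cols)
        ]
        for row in usage
    ]


# Two hypothetical patterns over four attributes; each pattern may contain unknowns.
basis = [
    [T, T, F, U],
    [F, U, T, T],
]

# Three observations; each row marks which patterns the observation uses.
usage = [
    [T, F],
    [F, T],
    [T, T],
]

approx = ternary_product(usage, basis)
for row in approx:
    print(row)
```

In this toy case, an observation that uses a single pattern reproduces that pattern, unknowns included, while an observation that uses both patterns takes the cell-wise OR, so an unknown in one pattern is overridden by a true value in the other.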

Notes

  1. Data from the UCI Machine Learning Repository.

  2. Other mapping choices yield analogous results.

  3. Using \(\tau =0.5\) for the Asso rounding threshold gives the most sensible results here.

  4. dropbox.com/s/znut1tsxutyjfvu/tmf.zip.

  5. For imputation problems, values of \(\mathfrak {u}\) are replaced with \(\mathfrak {f}\) in this selection to ensure that only binary values are present in the initial basis vector.

  6. We focus on the TUP rather than the TBP because, in the general case, the TBP involves ternary logic and cannot be easily compared with or reduced to combinatorial problems.

  7. dropbox.com/s/znut1tsxutyjfvu/tmf.zip.

  8. Aside from \(\rho _\mathfrak {u}\) which is not applicable in BMF.

  9. The publicly available implementation was used.

  10. We thank Radim Belohlavek for sharing an up-to-date implementation.

  11. We thank Pauli Miettinen for sharing an up-to-date implementation.

  12. We use the author’s original implementation with YALMIP and SDPT3 as the SDP solver.

  13. api.stackexchange.com.

  14. dropbox.com/s/znut1tsxutyjfvu/tmf.zip.

  15. “Search and research” is the first point of advice on SO’s How to Ask page.

  16. One of the highest-rated questions on the site, for example, is simply “What is a plain English explanation of Big-O?”

  17. dropbox.com/s/znut1tsxutyjfvu/tmf.zip.

  18. A nonparametric test that does not assume the population to be normally distributed.

  19. dropbox.com/s/znut1tsxutyjfvu/tmf.zip.

Acknowledgments

This work is supported by the Helmholtz Young Investigators Groups programme.

Author information

Corresponding author

Correspondence to Samuel Maurus.

Cite this article

Maurus, S., Plant, C. Ternary Matrix Factorization: problem definitions and algorithms. Knowl Inf Syst 46, 1–31 (2016). https://doi.org/10.1007/s10115-015-0838-3

  • DOI: https://doi.org/10.1007/s10115-015-0838-3
