Abstract
Can we learn from the unknown? Logical data sets of the ternary kind are often found in information systems. They contain unknown as well as true/false values. An unknown value may represent a missing entry (lost or indeterminable) or have meaning, like a Don’t Know response in a questionnaire. In this paper, we introduce algorithms for reducing the dimensionality of logical data (categorical data in general) in the context of a new data mining challenge: Ternary Matrix Factorization (TMF). For a ternary data matrix, TMF exploits ternary logic to produce a basis matrix (which holds the major patterns in the data) and a usage matrix (which maps patterns to original observations). Both matrices are interpretable, and their ternary matrix product approximates the original matrix. TMF has applications in (1) finding targeted structure in ternary data, (2) imputing values through pattern discovery in highly incomplete categorical data sets, and (3) solving instances of its encapsulated Binary Matrix Factorization problem. Our elegant algorithm FasTer (FASt TERnary Matrix Factorization) has linear run-time complexity with respect to the dimensions of the data set and is parameter-robust. A variant of FasTer that exploits useful results from combinatorics provides accuracy bounds for a core part of the algorithm in certain situations. Experiments on synthetic and real-world data sets show that our algorithms are able to outperform state-of-the-art techniques in all three TMF applications with respect to run-time and effectiveness. Finally, convincing speedup and efficiency results on a parallel version of FasTer demonstrate its suitability for weak- and strong-scaling scenarios.
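To make the notion of a ternary matrix product concrete, the following is a minimal sketch that assumes Kleene's strong three-valued logic, in which conjunction and disjunction reduce to minimum and maximum over the ordering false < unknown < true. The integer encoding, the function name `ternary_product`, and the example matrices are illustrative assumptions rather than the definitions used in the paper, which may restrict the factor matrices differently.

```python
import numpy as np

# Hypothetical integer encoding, ordered so that Kleene AND/OR become min/max.
F, U, T = 0, 1, 2  # false < unknown < true


def ternary_product(usage, basis):
    """Kleene-style product: out[i, j] = OR_k (usage[i, k] AND basis[k, j])."""
    usage = np.asarray(usage)
    basis = np.asarray(basis)
    # Broadcast to shape (n, k, m); AND is the element-wise minimum.
    conj = np.minimum(usage[:, :, None], basis[None, :, :])
    # OR over the k patterns is the maximum along the pattern axis.
    return conj.max(axis=1)


# Usage matrix: which patterns each of the three observations uses.
usage = np.array([[T, F],
                  [F, T],
                  [T, T]])
# Basis matrix: two patterns over three attributes.
basis = np.array([[T, U, F],
                  [F, T, U]])

product = ternary_product(usage, basis)
print(product)
```

In this sketch, an observation that uses both patterns (the third row of `usage`) takes the attribute-wise maximum of the two basis rows, so an unknown entry survives only where no pattern it uses asserts true.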












Notes
Data from the UCI Machine Learning Repository.
Other mapping choices yield analogous results.
Using \(\tau =0.5\) for the Asso rounding threshold gives the most sensible results here.
For imputation problems, values of \(\mathfrak {u}\) are replaced with \(\mathfrak {f}\) in this selection to ensure that only binary values are present in the initial basis vector.
We focus on the TUP rather than the TBP because, in the general case, the TBP involves ternary logic and cannot be easily compared with or reduced to combinatorial problems.
Aside from \(\rho _\mathfrak {u}\), which is not applicable in BMF.
The publicly available implementation was used.
We thank Radim Belohlavek for sharing an up-to-date implementation.
We thank Pauli Miettinen for sharing an up-to-date implementation.
We use the author’s original implementation with YALMIP and SDPT3 as the SDP solver.
“Search and research” is the first point of advice on SO’s How to Ask page.
One of the highest-rated questions on the site, for example, is simply “What is a plain English explanation of Big-O?”
A nonparametric test that does not assume the population to be normally distributed.
References
Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6):962–969
Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: ACM international conference on information and knowledge management, pp 415–424
Belohlavek R (2013) Beyond Boolean matrix decompositions: toward factor analysis and dimensionality reduction of ordinal data. In: IEEE international conference on data mining, pp 961–966
Belohlavek R, Vychodil V (2010) Discovery of optimal factors in binary data via a novel method of matrix decomposition. J Comput Syst Sci 76(1):3–20
Chvatal V (1979) A greedy heuristic for the set-covering problem. Math Oper Res 4(3):233–235
Çivril A, Magdon-Ismail M (2009) On selecting a maximum volume sub-matrix of a matrix and related problems. Theor Comput Sci 410(47–49):4801–4811
Codd EF (1986) Missing information in relational databases. ACM Sigmod Rec 15(4):53–53
Cormode G, Karloff H, Wirth A (2010) Set cover algorithms for very large datasets. In: Proceedings of the 19th ACM international conference on Information and knowledge management, pp 479–488
Francis JD, Busch L (1975) What we know about I don’t knows. Public Opin Q 39(2):207–218
Kleene SC (1938) On notation for ordinal numbers. J Symb Logic 3(4):150–155
Lin C-J (2007) Projected gradient methods for nonnegative matrix factorization. Neural Comput 19(10):2756–2779
Lu H, Vaidya J, Atluri V (2008) Optimal Boolean matrix decomposition: application to role engineering. In: IEEE international conference on data engineering, pp 297–306
Lucchese C, Orlando S, Perego R (2010) Mining top-k patterns from binary datasets. SIAM Int Conf Data Mining 10:165–176
Luk FT (1985) A parallel method for computing the generalized singular value decomposition. J Parallel Distrib Comput 2(3):250–260
Malinowski G (2007) Many-valued logic and its philosophy. In: Handbook of the history of logic, vol 8. North-Holland, Amsterdam, pp 13–94
Maurus S, Plant C (2014) Ternary Matrix Factorization. In: IEEE international conference on data mining (Best Paper Award)
Miettinen P (2008a) The Boolean column and column-row matrix decompositions. Data Mining Knowl Discov 17(1):39–56
Miettinen P (2008b) On the positive–negative partial set cover problem. Inf Process Lett 108(4):219–221
Miettinen P (2009) Matrix decomposition methods for data mining, PhD thesis, University of Helsinki
Miettinen P, Mielikainen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10):1348–1362
Miettinen P, Vreeken J (2014) MDL4BMF: minimum description length for Boolean Matrix Factorization. ACM Trans Knowl Discov Data 8(4):1–31
OpenMP Architecture Review Board (2005) OpenMP application program interface
Peleg D (2007) Approximation algorithms for the label-CoverMAX and red–blue set cover problems. J Discrete Algorithms 5(1):55–64
Pew Research Center (2010) Executive summary: tolerance and tension: Islam and Christianity in sub-Saharan Africa. Technical report, Pew Research Center
Rubin DB, Stern HS, Vehovar V (1995) Handling “Don’t know” survey responses. J Am Stat Assoc 90(431):822–828
Srebro N, Rennie J, Jaakkola TS (2004) Maximum-margin matrix factorization. Adv Neural Inf Process Syst 17:1329–1336
Vreeken J, Siebes A (2008) Filling in the blanks - Krimp minimisation for missing data. In: IEEE international conference on data mining, pp 1067–1072
Yadava P, Miettinen P (2012) BMF with missing values, Master’s Thesis, University of Saarland
Zaki MJ (1999) Parallel and distributed association mining: a survey. IEEE Concurr 7(4):14–25
Acknowledgments
This work is supported by the Helmholtz Young Investigators Groups programme.
Cite this article
Maurus, S., Plant, C. Ternary Matrix Factorization: problem definitions and algorithms. Knowl Inf Syst 46, 1–31 (2016). https://doi.org/10.1007/s10115-015-0838-3
DOI: https://doi.org/10.1007/s10115-015-0838-3