Ternary Matrix Factorization: problem definitions and algorithms

Maurus, Samuel; Plant, Claudia

doi:10.1007/s10115-015-0838-3


Regular paper
Published: 16 May 2015

Volume 46, pages 1–31, (2016)

Knowledge and Information Systems


Abstract

Can we learn from the unknown? Logical data sets of the ternary kind are often found in information systems. They contain unknown as well as true/false values. An unknown value may represent a missing entry (lost or indeterminable) or have meaning, like a Don’t Know response in a questionnaire. In this paper, we introduce algorithms for reducing the dimensionality of logical data (categorical data in general) in the context of a new data mining challenge: Ternary Matrix Factorization (TMF). For a ternary data matrix, TMF exploits ternary logic to produce a basis matrix (which holds the major patterns in the data) and a usage matrix (which maps patterns to original observations). Both matrices are interpretable, and their ternary matrix product approximates the original matrix. TMF has applications in (1) finding targeted structure in ternary data, (2) imputing values through pattern discovery in highly incomplete categorical data sets, and (3) solving instances of its encapsulated Binary Matrix Factorization problem. Our elegant algorithm FasTer (FASt TERnary Matrix Factorization) has linear run-time complexity with respect to the dimensions of the data set and is parameter-robust. A variant of FasTer that exploits useful results from combinatorics provides accuracy bounds for a core part of the algorithm in certain situations. Experiments on synthetic and real-world data sets show that our algorithms are able to outperform state-of-the-art techniques in all three TMF applications with respect to run-time and effectiveness. Finally, convincing speedup and efficiency results on a parallel version of FasTer demonstrate its suitability for weak- and strong-scaling scenarios.
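The ternary matrix product mentioned in the abstract can be illustrated with a minimal sketch. It assumes Kleene's strong three-valued logic (Kleene 1938 appears in the reference list), in which the truth values are ordered false < unknown < true, so that AND is the minimum and OR is the maximum. The integer encoding, the function names `ternary_product` and `mismatches`, and the orientation of the matrices (usage: observations × patterns; basis: patterns × attributes) are assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np

# Assumed encoding for illustration: false < unknown < true.
F, U, T = 0, 1, 2


def ternary_product(usage, basis):
    """OR-of-ANDs matrix product under Kleene's strong ternary logic.

    usage: (n_observations, k) array, basis: (k, n_attributes) array,
    both holding the values F, U, T defined above.
    """
    usage = np.asarray(usage)
    basis = np.asarray(basis)
    # conj[i, p, j] = usage[i, p] AND basis[p, j]  (AND is the minimum)
    conj = np.minimum(usage[:, :, None], basis[None, :, :])
    # OR over the k patterns (OR is the maximum)
    return conj.max(axis=1)


def mismatches(data, approx):
    """Number of cells in which the approximation differs from the data."""
    return int(np.sum(np.asarray(data) != np.asarray(approx)))


# Two patterns over three attributes.
basis = np.array([[T, T, F],
                  [F, U, T]])
# Three observations: pattern 0 only, pattern 1 only, both patterns.
usage = np.array([[T, F],
                  [F, T],
                  [T, T]])

product = ternary_product(usage, basis)
data = np.array([[T, T, F],
                 [F, U, T],
                 [T, U, T]])
error = mismatches(data, product)
```

Here the third observation uses both patterns, so each of its cells is the OR of the two basis rows. A factorization algorithm would search for the `usage` and `basis` that keep `error` small; that search is not shown here.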


Notes

Data from the UCI Machine Learning Repository.
Other mapping choices yield analogous results.
Using \(\tau =0.5\) for the Asso rounding threshold gives the most sensible results here.
dropbox.com/s/znut1tsxutyjfvu/tmf.zip.
For imputation problems, values of \(\mathfrak {u}\) are replaced with \(\mathfrak {f}\) in this selection to ensure that only binary values are present in the initial basis vector.
We focus on the TUP rather than the TBP because, in the general case, the TBP involves ternary logic and cannot be easily compared with or reduced to combinatorial problems.
Aside from \(\rho _\mathfrak {u}\) which is not applicable in BMF.
The publicly available implementation was used.
We thank Radim Belohlavek for sharing an up-to-date implementation.
We thank Pauli Miettinen for sharing an up-to-date implementation.
We use the author’s original implementation with YALMIP and SDPT3 as the SDP solver.
api.stackexchange.com.
“Search and research” is the first point of advice on SO’s How to Ask page.
One of the highest-rated questions on the site, for example, is simply “What is a plain English explanation of Big-O?”
A nonparametric test that does not assume the population to be normally distributed.

References

Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6):962–969
Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: ACM international conference on information and knowledge management, pp 415–424
Belohlavek R (2013) Beyond Boolean matrix decompositions: toward factor analysis and dimensionality reduction of ordinal data. In: IEEE international conference on data mining, pp 961–966
Belohlavek R, Vychodil V (2010) Discovery of optimal factors in binary data via a novel method of matrix decomposition. J Comput Syst Sci 76(1):3–20
Chvatal V (1979) A greedy heuristic for the set-covering problem. Math Oper Res 4(3):233–235
Çivril A, Magdon-Ismail M (2009) On selecting a maximum volume sub-matrix of a matrix and related problems. Theor Comput Sci 410(47–49):4801–4811
Codd EF (1986) Missing information in relational databases. ACM Sigmod Rec 15(4):53–53
Cormode G, Karloff H, Wirth A (2010) Set cover algorithms for very large datasets. In: Proceedings of the 19th ACM international conference on Information and knowledge management, pp 479–488
Francis JD, Busch L (1975) What we know about I don’t knows. Public Opin Q 39(2):207–218
Kleene SC (1938) On notation for ordinal numbers. J Symb Logic 3(4):150–155
Lin C-J (2007) Projected gradient methods for nonnegative matrix factorization. Neural Comput 19(10):2756–2779
Lu H, Vaidya J, Atluri V (2008) Optimal Boolean matrix decomposition: application to role engineering. In: IEEE international conference on data engineering, pp 297–306
Lucchese C, Orlando S, Perego R (2010) Mining top-k patterns from binary datasets. SIAM Int Conf Data Mining 10:165–176
Luk FT (1985) A parallel method for computing the generalized singular value decomposition. J Parallel Distrib Comput 2(3):250–260
Malinowski G (2007) Many-valued logic and its philosophy. In: Handbook of the history of logic, vol 8. North-Holland, Amsterdam, pp 13–94
Maurus S, Plant C (2014) Ternary Matrix Factorization. In: IEEE international conference on data mining (Best Paper Award)
Miettinen P (2008a) The Boolean column and column-row matrix decompositions. Data Mining Knowl Discov 17(1):39–56
Miettinen P (2008b) On the positive–negative partial set cover problem. Inf Process Lett 108(4):219–221
Miettinen P (2009) Matrix decomposition methods for data mining, PhD thesis, University of Helsinki
Miettinen P, Mielikainen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. IEEE Trans Knowl Data Eng 20(10):1348–1362
Miettinen P, Vreeken J (2014) MDL4BMF: minimum description length for Boolean Matrix Factorization. ACM Trans Knowl Discov Data 8(4):1–31
OpenMP Architecture Review Board (2005) OpenMP application program interface
Peleg D (2007) Approximation algorithms for the label-CoverMAX and red–blue set cover problems. J Discrete Algorithms 5(1):55–64
Pew Research Center (2010) Executive summary: tolerance and tension: Islam and Christianity in sub-Saharan Africa. Technical report, Pew Research Center
Rubin DB, Stern HS, Vehovar V (1995) Handling “Don’t know” survey responses. J Am Stat Assoc 90(431):822–828
Srebro N, Rennie J, Jaakkola TS (2004) Maximum-margin matrix factorization. Adv Neural Inf Process Syst 17:1329–1336
Vreeken J, Siebes A (2008) Filling in the blanks-Krimp minimisation for missing data. In: IEEE international conference on data mining, pp 1067–1072
Yadava P, Miettinen P (2012) BMF with missing values, Master’s Thesis, University of Saarland
Zaki MJ (1999) Parallel and distributed association mining: a survey. IEEE Concurr 7(4):14–25


Acknowledgments

This work is supported by the Helmholtz Young Investigators Groups programme.

Author information

Authors and Affiliations

Helmholtz Zentrum München, Technische Universität München, 85764, Neuherberg, Germany
Samuel Maurus & Claudia Plant


Corresponding author

Correspondence to Samuel Maurus.


About this article

Cite this article

Maurus, S., Plant, C. Ternary Matrix Factorization: problem definitions and algorithms. Knowl Inf Syst 46, 1–31 (2016). https://doi.org/10.1007/s10115-015-0838-3


Received: 11 December 2014
Revised: 02 March 2015
Accepted: 15 April 2015
Published: 16 May 2015
Issue Date: January 2016
DOI: https://doi.org/10.1007/s10115-015-0838-3
