research-article

Clustering Stability via Concept-based Nonnegative Matrix Factorization

Authors:
Nghia Duong-Trung (Can Tho University of Technology, FPT University, Can Tho, Vietnam)
Minh-Hoang Nguyen (Department of Information Engineering and Computer Science, University of Trento, Trento, Italy)
Hanh T. H. Nguyen (Information Systems and Machine Learning Lab, Hildesheim, Germany)

ICMLSC '19: Proceedings of the 3rd International Conference on Machine Learning and Soft Computing, January 2019, Pages 49–54. https://doi.org/10.1145/3310986.3310991

Published: 25 January 2019

ABSTRACT

One of the most important contributions of topic modeling is to accurately and effectively discover and classify documents in a collection of texts by a number of clusters/topics. However, finding an appropriate number of topics is a particularly challenging model selection question. In this context, we introduce a new unsupervised conceptual stability framework to assess the validity of a clustering solution. We integrate the proposed framework into nonnegative matrix factorization (NMF) to guide the selection of the desired number of topics. Our model provides a flexible way to enhance the interpretation of NMF for effective clustering solutions. The work presented in this paper bridges stability-based validation of clustering solutions and NMF in the context of unsupervised learning. We perform a thorough evaluation of our approach over a wide range of real-world datasets and compare it to the current state of the art: two NMF-based approaches and four Latent Dirichlet Allocation (LDA) based models. The quantitative experimental results show that integrating such conceptual stability analysis into NMF can lead to significant improvements in document clustering and information retrieval effectiveness.
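
To make the general idea of stability-guided selection of the number of topics concrete, here is a minimal sketch in Python with NumPy. It is not the concept-based stability measure proposed in the paper, whose details the abstract does not give. As a stand-in it uses a generic term-overlap stability: NMF is fitted on several random subsamples of the documents, and a candidate k scores high when the top terms of its topics agree across runs. All function names, parameters, and defaults (nmf, top_terms, agreement, stability, select_k, sample_frac, t) are assumptions made for illustration.

```python
import itertools
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix X (docs x terms) into W (docs x k) and
    H (k x terms) with multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def top_terms(H, t):
    """For each topic (row of H), return the set of its t highest-weighted term indices."""
    return [set(np.argsort(row)[-t:]) for row in H]


def agreement(topics_a, topics_b):
    """Mean Jaccard overlap under the best one-to-one matching of topics.
    Brute force over permutations, so only suitable for small k."""
    k = len(topics_a)
    best = 0.0
    for perm in itertools.permutations(range(k)):
        score = sum(
            len(topics_a[i] & topics_b[j]) / len(topics_a[i] | topics_b[j])
            for i, j in enumerate(perm)
        ) / k
        best = max(best, score)
    return best


def stability(X, k, n_runs=5, sample_frac=0.8, t=5, seed=0):
    """Average pairwise agreement of top-term sets over NMF runs on random
    document subsamples (an assumed, generic stability score in [0, 1])."""
    rng = np.random.default_rng(seed)
    n_docs = X.shape[0]
    runs = []
    for r in range(n_runs):
        idx = rng.choice(n_docs, size=int(sample_frac * n_docs), replace=False)
        _, H = nmf(X[idx], k, seed=seed + r)
        runs.append(top_terms(H, t))
    pairs = itertools.combinations(runs, 2)
    return float(np.mean([agreement(a, b) for a, b in pairs]))


def select_k(X, candidates, **kwargs):
    """Return the candidate k with the highest stability, plus all scores."""
    scores = {k: stability(X, k, **kwargs) for k in candidates}
    return max(scores, key=scores.get), scores
```

With a nonnegative document-term matrix X (for example TF-IDF weights), a call such as select_k(X, [2, 3, 4, 5]) returns the most stable candidate together with the per-k scores. The brute-force matching in agreement grows factorially with k, so for larger k an assignment solver such as the Hungarian method would be the usual replacement.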

Index Terms

Clustering Stability via Concept-based Nonnegative Matrix Factorization
1. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
      2. Document topic models
  2. Information systems applications
    1. Data mining
      1. Clustering
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Models of learning
      2. Unsupervised learning and clustering

Published in

ICMLSC '19: Proceedings of the 3rd International Conference on Machine Learning and Soft Computing
January 2019
268 pages
ISBN: 9781450366120
DOI: 10.1145/3310986

Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Concept Discovery
Conceptual Stability
Nonnegative Matrix Factorization
Topic Modeling
Unsupervised Document Clustering
Qualifiers
- research-article
- Research
- Refereed limited