ABSTRACT
One of the most important contributions of topic modeling is to accurately and effectively discover and classify documents in a collection of texts into a number of clusters/topics. However, finding an appropriate number of topics is a particularly challenging model selection question. In this context, we introduce a new unsupervised conceptual stability framework to assess the validity of a clustering solution. We integrate the proposed framework into nonnegative matrix factorization (NMF) to guide the selection of the desired number of topics. Our model provides a flexible way to enhance the interpretation of NMF for effective clustering solutions. The work presented in this paper bridges stability-based validation of clustering solutions and NMF in the context of unsupervised learning. We perform a thorough evaluation of our approach over a wide range of real-world datasets and compare it to the current state of the art: two NMF-based approaches and four Latent Dirichlet Allocation (LDA) based models. The quantitative experimental results show that integrating such conceptual stability analysis into NMF can lead to significant improvements in document clustering and information retrieval effectiveness.
Index Terms
- Clustering Stability via Concept-based Nonnegative Matrix Factorization