DOI: 10.1145/3310986.3310991
research-article

Clustering Stability via Concept-based Nonnegative Matrix Factorization

Published: 25 January 2019

ABSTRACT

One of the most important contributions of topic modeling is to accurately and effectively discover and classify documents in a collection of texts into a number of clusters/topics. However, finding an appropriate number of topics is a particularly challenging model selection question. In this context, we introduce a new unsupervised conceptual stability framework to assess the validity of a clustering solution. We integrate the proposed framework into nonnegative matrix factorization (NMF) to guide the selection of the desired number of topics. Our model provides a flexible way to enhance the interpretation of NMF for effective clustering solutions. The work presented in this paper bridges stability-based validation of clustering solutions and NMF in the context of unsupervised learning. We perform a thorough evaluation of our approach over a wide range of real-world datasets and compare it to the current state of the art, namely two NMF-based approaches and four Latent Dirichlet Allocation (LDA) based models. The quantitative experimental results show that integrating such conceptual stability analysis into NMF can lead to significant improvements in document clustering and information retrieval effectiveness.
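
As a rough illustration of stability-based selection of the number of topics, the sketch below scores how consistently repeated runs recover the same topics for each candidate k and picks the k with the highest score. It is not the paper's concept-based measure, which the abstract does not specify: plain term overlap (Jaccard) stands in for it, and the function names (`jaccard`, `agreement`, `stability`, `select_k`) and the input layout are assumptions made for this example. Each run is assumed to be a list of topics, and each topic a list of its top terms, for instance taken from the factors of repeated NMF fits.

```python
from itertools import combinations, permutations


def jaccard(a, b):
    """Jaccard overlap between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agreement(topics_a, topics_b):
    """Best average Jaccard over all one-to-one topic matchings.

    Brute force over permutations, so only suitable for small k.
    """
    k = len(topics_a)
    if k == 0 or k != len(topics_b):
        raise ValueError("both runs need the same positive number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(topics_a[i], topics_b[j])
                    for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def stability(runs):
    """Mean pairwise agreement across several runs that share one k."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)


def select_k(runs_by_k):
    """Return the most stable k together with the score of every k."""
    scores = {k: stability(runs) for k, runs in runs_by_k.items()}
    return max(scores, key=scores.get), scores


# Toy data (invented for illustration): three runs per candidate k.
toy_runs = {
    2: [
        [["ball", "team", "goal"], ["vote", "party", "law"]],
        [["vote", "law", "party"], ["team", "ball", "goal"]],
        [["goal", "team", "ball"], ["party", "vote", "law"]],
    ],
    3: [
        [["ball", "team"], ["goal", "vote"], ["party", "law"]],
        [["ball", "vote"], ["team", "law"], ["goal", "party"]],
        [["ball", "law"], ["team", "party"], ["goal", "vote"]],
    ],
}
best_k, scores = select_k(toy_runs)
```

In the toy data the runs with k = 2 recover the same two topics up to ordering, while the runs with k = 3 split the terms differently each time, so the selection favors k = 2. For larger k, the brute-force matching would normally be replaced by an assignment solver.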

              Published in

              ICMLSC '19: Proceedings of the 3rd International Conference on Machine Learning and Soft Computing
              January 2019
              268 pages
              ISBN: 9781450366120
              DOI: 10.1145/3310986

              Copyright © 2019 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States


              Qualifiers

              • research-article
              • Research
              • Refereed limited