Localized user-driven topic discovery via boosted ensemble of nonnegative matrix factorization

  • Regular Paper
Knowledge and Information Systems

Abstract

Nonnegative matrix factorization (NMF) has been widely used in topic modeling of large-scale document corpora, where a set of underlying topics is extracted as a low-rank factor matrix from NMF. However, the resulting topics often convey only general, and thus redundant, information about the documents rather than information that may be minor but potentially meaningful to users. To address this problem, we present a novel ensemble method based on nonnegative matrix factorization that discovers meaningful local topics. Our method brings the idea of an ensemble model, which has shown advantages in supervised learning, into an unsupervised topic modeling context. That is, our model successively performs NMF on a residual matrix obtained from previous stages and generates a sequence of topic sets. The update algorithm we employ is novel in two respects. The first lies in utilizing the residual matrix, inspired by a state-of-the-art gradient boosting model; the second stems from applying a sophisticated local weighting scheme to the given matrix to enhance the locality of topics, which in turn delivers high-quality, focused topics of interest to users. We subsequently extend this ensemble model by adding keyword- and document-based user interaction to introduce user-driven topic discovery.
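
The stagewise idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their MATLAB code is linked in the Notes); it only assumes a term-by-document matrix `X`, uses plain multiplicative-update NMF, and substitutes a simple cosine-similarity column weighting for the paper's local weighting scheme, whose details the abstract does not give. The names `nmf`, `fit_h`, and `boosted_nmf_ensemble` are hypothetical.

```python
import numpy as np

EPS = 1e-10  # guards against division by zero in the updates


def fit_h(X, W, n_iter=100, seed=0):
    """Fit nonnegative H in X ~= W @ H with W held fixed (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + EPS
    WtX, WtW = W.T @ X, W.T @ W
    for _ in range(n_iter):
        H *= WtX / (WtW @ H + EPS)
    return H


def nmf(X, k, n_iter=200, seed=0):
    """Plain NMF (X ~= W @ H) via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + EPS
    H = rng.random((k, X.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + EPS)
        W *= (X @ H.T) / (W @ H @ H.T + EPS)
    return W, H


def boosted_nmf_ensemble(X, k_per_stage, n_stages, seed=0):
    """Run NMF stage by stage on the residual left by earlier stages."""
    R = np.asarray(X, dtype=float).copy()  # residual starts as the data itself
    stages, residual_norms = [], [np.linalg.norm(R)]
    for s in range(n_stages):
        col_norms = np.linalg.norm(R, axis=0)
        if not np.any(col_norms > 0):
            break  # nothing left to explain
        # Assumed weighting (not from the paper): cosine similarity of each
        # document (column) to the currently least-explained document.
        anchor = R[:, np.argmax(col_norms)]
        sims = (anchor @ R) / (np.linalg.norm(anchor) * col_norms + EPS)
        # Topics for this stage come from the locally weighted residual.
        W, _ = nmf(R * sims[np.newaxis, :], k_per_stage, seed=seed + s)
        # Loadings are refit against the unweighted residual.
        H = fit_h(R, W, seed=seed + s)
        # Subtract the explained part and clip so the residual stays nonnegative.
        R = np.maximum(R - W @ H, 0.0)
        stages.append((W, H))
        residual_norms.append(np.linalg.norm(R))
    return stages, residual_norms
```

Each stage returns its own topic matrix `W`, so the ensemble yields a sequence of topic sets rather than one global factorization. Because every stage subtracts a nonnegative product from a nonnegative residual and clips at zero, the residual norm cannot increase from one stage to the next in this sketch.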

Notes

  1. https://github.com/sanghosuh/four_area_data-matlab/.

  2. The code is available at https://github.com/sanghosuh/lens_nmf-matlab.

  3. https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.

  4. http://qwone.com/~jason/20Newsgroups/.

  5. https://www.cs.cmu.edu/~./enron/.

  6. http://www.vispubdata.org/site/vispubdata/.

  7. https://github.com/kimjingu/nonnegfac-matlab.

  8. http://www.cc.gatech.edu/~hpark/software/nmf_bpas.zip.

  9. http://davian.korea.ac.kr/myfiles/list/Codes/orthonmf.zip.

  10. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm.

References

  1. Aletras N, Stevenson M (2013) Evaluating topic coherence using distributional semantics. In: Proceedings of the international conference on computational semantics, pp 13–22

  2. Andrzejewski D, Zhu X, Craven M (2009) Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In: Proceedings of the international conference on machine learning (ICML), pp 25–32

  3. Bakharia A, Bruza P, Watters J, Narayan B, Sitbon L (2016) Interactive topic modeling for aiding qualitative content analysis. In: Proceedings of the ACM SIGIR on conference on human information interaction and retrieval (CHIIR), pp 213–222

  4. Bernstein MS, Suh B, Hong L, Chen J, Kairam S, Chi EH (2010) Eddi: interactive topic-based browsing of social status streams. In: Proceedings of the annual ACM symposium on user interface software and technology (UIST), pp 303–312

  5. Biggs M, Ghodsi A, Vavasis S (2008) Nonnegative matrix factorization via rank-one downdate. In: Proceedings of the international conference on machine learning (ICML), pp 64–71

  6. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res (JMLR) 3:993–1022

  7. Brandes U, Corman SR (2003) Visual unrolling of network evolution and the analysis of dynamic discourse. Inf Vis 2(1):40–50

  8. Cho Y-S, Ver Steeg G, Ferrara E, Galstyan A (2016) Latent space model for multi-modal social data. In: Proceedings of the international conference on world wide web (WWW), pp 447–458

  9. Choo J, Lee C, Reddy CK, Park H (2013) UTOPIAN: user-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Trans Vis Comput Graph (TVCG) 19(12):1992–2001

  10. Choo J, Lee C, Reddy CK, Park H (2015) Weakly supervised nonnegative matrix factorization for user-driven clustering. Data Min Knowl Discov (DMKD) 29(6):1598–1621

  11. Cichocki A, Zdunek R, Amari S-I (2007) Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In: Independent component analysis and signal separation, pp 169–176

  12. DeCoste D (2006) Collaborative prediction using ensembles of maximum margin matrix factorizations. In: Proceedings of the international conference on machine learning (ICML), pp 249–256

  13. Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD)

  14. Freund Y, Schapire R, Abe N (1999) A short introduction to boosting. J Jpn Soc Artif Intell 14(5):771–780

  15. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232

  16. Gillis N, Glineur F (2010) Using underapproximations for sparse nonnegative matrix factorization. Pattern Recogn 43(4):1676–1687

  17. Golub GH, van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore

  18. Greene D, Cagney G, Krogan N, Cunningham P (2008) Ensemble non-negative matrix factorization methods for clustering protein-protein interactions. Bioinformatics 24(15):1722–1728

  19. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, Berlin

  20. Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the ACM SIGIR international conference on research and development in information retrieval (SIGIR), pp 50–57

  21. Hoque E, Carenini G (2015) Convisit: interactive topic modeling for exploring asynchronous online conversations. In: Proceedings of the international conference on intelligent user interfaces (IUI), pp 169–180

  22. Huang F, Zhang S, Zhang J, Yu G (2017) Multimodal learning for topic sentiment analysis in microblogging. Neurocomputing 253:144–153

  23. Jo Y, Oh AH (2011) Aspect and sentiment unification model for online review analysis. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 815–824

  24. Kim H, Park H (2007) Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12):1495–1502

  25. Kim H, Park H (2008) Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J Matrix Anal Appl 30(2):713–730

  26. Kim J, Park H (2008) Sparse nonnegative matrix factorization for clustering. Georgia Institute of Technology, Georgia

  27. Kim J, Park H (2011) Fast nonnegative matrix factorization: an active-set-like method and comparisons. SIAM J Sci Comput 33(6):3261–3281

  28. Kim J, He Y, Park H (2014) Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J Glob Optim 58(2):285–319

  29. Kim H, Choo J, Kim J, Reddy CK, Park H (2015) Simultaneous discovery of common and discriminative topics via joint nonnegative matrix factorization. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 567–576

  30. Kim M, Kang K, Park D, Choo J, Elmqvist N (2017) Topiclens: efficient multi-level visual topic exploration of large-scale document collections. IEEE Trans Vis Comput Graph (TVCG) 23(1):151–160

  31. Kuang D, Park H (2013) Fast rank-2 nonnegative matrix factorization for hierarchical document clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 739–747

  32. Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2(1–2):83–97

  33. Kumar S, Mohri M, Talwalkar A (2009) Ensemble Nyström method. In: Advances in neural information processing systems (NIPS), pp 1060–1068

  34. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791

  35. Lee H, Kihm J, Choo J, Stasko J, Park H (2012) iVisClustering: an interactive visual document clustering via topic modeling. Comput Graph Forum 31(3 pt 3):1155–1164

  36. Lee J, Sun M, Kim S, Lebanon G (2012) Automatic feature induction for stagewise collaborative filtering. In: Advances in neural information processing systems (NIPS)

  37. Lee J, Kim S, Lebanon G, Singer Y, Bengio S (2016) Llorma: local low-rank matrix approximation. J Mach Learn Res (JMLR) 17(15):1–24

  38. Li T, Zhang Y, Sindhwani V (2009) A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In: Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, pp 244–252

  39. Lin C-J (2007) Projected gradient methods for nonnegative matrix factorization. Neural Comput 19(10):2756–2779

  40. Mackey LW, Talwalkar AS, Jordan MI (2011) Divide-and-conquer matrix factorization. In: Advances in neural information processing systems (NIPS), pp 1134–1142

  41. Meyer M, Munzner T, DePace A, Pfister H (2010) Multeesum: a tool for comparative spatial and temporal gene expression data. IEEE Trans Vis Comput Graph (TVCG) 16(6):908–917

  42. Mukherjea S, Hirata K, Hara Y (1996) Visualizing the results of multimedia web search engines. In: Proceedings of the IEEE symposium on information visualization (InfoVis), pp 64–65, 122

  43. Newman D, Lau JH, Grieser K, Baldwin T (2010) Automatic evaluation of topic coherence. In: Proceedings of the annual conference of the North American chapter of the association for computational linguistics (NAACL-HLT), pp 100–108

  44. Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5:111–126

  45. Qian S, Zhang T, Xu C, Shao J (2016) Multi-modal event topic model for social event analysis. IEEE Trans Multimed 18:233–246

  46. Sill J, Takacs G, Mackey L, Lin D (2009) Feature-weighted linear stacking. Arxiv preprint arXiv:0911.0460

  47. Su X, Khoshgoftaar TM (2009) A survey of collaborative filtering techniques. Adv Artif Intell 2009:4:2

  48. Suh S, Choo J, Lee J, Reddy CK (2016) L-ensnmf: boosted local topic discovery via ensemble of nonnegative matrix factorization. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 479–488

  49. Titov I, McDonald R (2008) Modeling online reviews with multi-grain topic models. In: Proceedings of the international conference on world wide web (WWW), pp 111–120

  50. Wang S, Chen Z, Liu B (2016) Mining aspect-specific opinion using a holistic lifelong topic model. In: Proceedings of the international conference on world wide web (WWW), pp 167–176

  51. Wei F, Liu S, Song Y, Pan S, Zhou MX, Qian W, Shi L, Tan L, Zhang Q (2010) Tiara: a visual exploratory text analytic system. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 153–162

  52. Wilkinson JH (1965) The algebraic eigenvalue problem, vol 87. Clarendon Press, Oxford

  53. Wu Q, Tan M, Li X, Min H, Sun N (2015) Nmfe-sscc: non-negative matrix factorization ensemble for semi-supervised collective classification. Knowl Based Syst 89:160–172

  54. Yang P, Su X, Ou-Yang L, Chua H-N, Li X-L, Ning K (2014) Microbial community pattern detection in human body habitats via ensemble clustering framework. BMC Syst Biol 8(Suppl 4):S7

  55. Zheng Y, Zhang YJ, Larochelle H (2016) A deep and autoregressive approach for topic modeling of multimodal data. IEEE Trans Pattern Anal Mach Intell (TPAMI) 38:1056–1069

Acknowledgements

This work was supported in part by the National Science Foundation Grants IIS-1707498, IIS-1619028, and IIS-1646881 and by Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2016R1C1B2015924). Any opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of funding agencies.

Author information

Corresponding author

Correspondence to Jaegul Choo.

Additional information

This work is an extended version of [48].

About this article

Cite this article

Suh, S., Shin, S., Lee, J. et al. Localized user-driven topic discovery via boosted ensemble of nonnegative matrix factorization. Knowl Inf Syst 56, 503–531 (2018). https://doi.org/10.1007/s10115-017-1147-9

  • DOI: https://doi.org/10.1007/s10115-017-1147-9
