Cluster-based sparse topical coding for topic mining and document clustering

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

In this paper, we introduce Cluster-based Sparse Topical Coding, a document clustering method built on Sparse Topical Coding. Topic modeling can improve the clustering of text documents: documents are described by bag-of-words models and projected into a topic space, and the latent semantic descriptions derived by the topic model serve as features for clustering. In the proposed method, document clustering and topic modeling are integrated in a unified framework in order to achieve the best performance. The framework combines Sparse Topical Coding, which is responsible for topic mining, with K-means, which discovers the latent clusters in the document collection. Experimental results on widely used datasets show that the proposed method significantly outperforms traditional clustering methods and other topic-model-based clustering methods. It improves clustering accuracy by 4% to 39% and normalized mutual information by 2% to more than 44%.
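The two evaluation measures named in the abstract, clustering accuracy and normalized mutual information (NMI), can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names and the toy labels are hypothetical, the best cluster-to-class mapping is found by brute force over permutations (a stand-in for an assignment solver such as the Hungarian method, and only practical for a handful of clusters), and the square-root normalization of NMI is one common choice that the abstract does not specify.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def clustering_accuracy(true_labels, cluster_ids):
    """Fraction of documents labelled correctly under the best one-to-one
    mapping from cluster ids to class labels (brute force, small k only)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("sketch assumes no more clusters than classes")
    # overlap[(cluster, class)] = number of documents shared by the pair
    overlap = Counter(zip(cluster_ids, true_labels))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        hits = sum(overlap[(c, t)] for c, t in zip(clusters, assigned))
        best = max(best, hits)
    return best / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, cluster_ids):
    """NMI = I(T; C) / sqrt(H(T) * H(C)); assumed normalization."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    n_t = Counter(true_labels)
    n_c = Counter(cluster_ids)
    mi = sum(
        (n_tc / n) * log(n_tc * n / (n_t[t] * n_c[c]))
        for (t, c), n_tc in joint.items()
    )
    denom = sqrt(entropy(true_labels) * entropy(cluster_ids))
    return mi / denom if denom > 0 else 0.0


# Toy example: six documents, two true classes, one document misclustered.
truth = ["sport", "sport", "sport", "tech", "tech", "tech"]
found = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(truth, found)
nmi = normalized_mutual_information(truth, found)
print(f"accuracy = {acc:.3f}, NMI = {nmi:.3f}")
```

Both measures are invariant to how cluster ids are numbered, which is why accuracy needs the mapping step: cluster 0 here corresponds to "tech" even though it is listed first.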

Notes

  1. http://web.ist.utl.pt/~acardoso/datasets/.

Author information

Corresponding author

Correspondence to Parvin Ahmadi.

About this article

Cite this article

Ahmadi, P., Gholampour, I. & Tabandeh, M. Cluster-based sparse topical coding for topic mining and document clustering. Adv Data Anal Classif 12, 537–558 (2018). https://doi.org/10.1007/s11634-017-0280-3

  • DOI: https://doi.org/10.1007/s11634-017-0280-3
