Ensemble subspace clustering of text data using two-level features

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

This paper proposes a new integrated method for ensemble subspace clustering of high-dimensional sparse text data. The method employs a two-level feature representation of text data (words and topics) to generate clusters from subspaces, and uses ensemble clustering to increase the robustness of the resulting clusters. It relies on topic modeling both to build the two-level feature representation and to generate the different ensemble components. Because both topics and words are used to cluster the documents, the resulting clusters are more interpretable: the weight of each word and topic in every cluster can be measured. To evaluate the proposed method, we conducted experiments on seven real-life data sets; some are easy to cluster, others are hard, and some contain unbalanced classes. Experimental results on this diverse collection show that our method outperforms other ensemble clustering methods.
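To make the pipeline concrete, the sketch below illustrates the general idea of clustering documents with two-level (word plus topic) features and combining several clusterings through a co-association consensus. It is a minimal illustration under stated assumptions, not the authors' algorithm: the data set (a 20 Newsgroups subset), the LDA topic counts, the simple feature concatenation, and the spectral consensus step are all choices made for the example.

```python
# Illustrative sketch of ensemble clustering with two-level (word + topic)
# features. NOT the paper's exact method: feature weighting, subspace
# selection and the consensus function here are simplified assumptions.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans, SpectralClustering

# Small 20 Newsgroups subset chosen only to keep the example fast.
docs = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.autos", "talk.politics.mideast"],
    remove=("headers", "footers", "quotes"),
).data

counts = CountVectorizer(stop_words="english", max_features=5000,
                         min_df=5).fit_transform(docs)
words = TfidfTransformer().fit_transform(counts)   # word-level features

k = 3
labels_ensemble = []
for seed, n_topics in enumerate([10, 20, 30]):
    # Topic-level features from LDA; each topic model yields one component.
    topics = LatentDirichletAllocation(n_components=n_topics,
                                       random_state=seed).fit_transform(counts)
    two_level = hstack([words, csr_matrix(topics)]).tocsr()  # words + topics
    labels_ensemble.append(
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(two_level)
    )

# Consensus via a co-association matrix: the fraction of ensemble components
# that place two documents in the same cluster, then a spectral partition of it.
coassoc = np.mean([np.equal.outer(l, l) for l in labels_ensemble], axis=0)
final = SpectralClustering(n_clusters=k, affinity="precomputed",
                           random_state=0).fit_predict(coassoc)
print(np.bincount(final))   # size of each consensus cluster
```

Interpretability in this simplified setting can be probed by inspecting, per consensus cluster, the average TF-IDF weight of each word and the average LDA topic proportion of each document, which is a rough stand-in for the per-cluster word and topic weights the paper measures.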

Acknowledgments

We are very grateful to the editors and the anonymous reviewers for their helpful comments and suggestions, which improved the quality of the paper. This work was supported by the National Natural Science Foundation of China under Grants No. 61473194 and No. 61305059, and by Guangdong Province of China under Grant No. 2013B091300019.

Author information

Corresponding author

Correspondence to He Zhao.

About this article

Cite this article

Zhao, H., Salloum, S., Cai, Y. et al. Ensemble subspace clustering of text data using two-level features. Int. J. Mach. Learn. & Cyber. 8, 1751–1766 (2017). https://doi.org/10.1007/s13042-016-0556-5
