Ensemble subspace clustering of text data using two-level features

Zhao, He; Salloum, Salman; Cai, Yeshou; Huang, Joshua Zhexue

doi:10.1007/s13042-016-0556-5

Original Article
Published: 17 June 2016

Volume 8, pages 1751–1766, (2017)

International Journal of Machine Learning and Cybernetics

He Zhao¹,² (ORCID: orcid.org/0000-0001-5763-9743),
Salman Salloum³,
Yeshou Cai⁴ &
…
Joshua Zhexue Huang³

609 Accesses
3 Citations

Abstract

This paper proposes a new integrated method for ensemble subspace clustering of high-dimensional, sparse text data. The method uses a two-level feature representation of text data (words and topics) to generate clusters from subspaces, and it uses ensemble clustering to make the clusters more robust. Topic modeling supplies both the two-level feature representation and the different ensemble components. Because both topics and words are used for clustering, the weight of each word and topic in each cluster can be measured, which makes the clusters more interpretable. To evaluate the proposed method, we conducted several experiments on seven real-life data sets: some are easy to cluster, others are hard, and others contain unbalanced data. Experimental results across these diverse data sets show that our method outperforms other ensemble clustering methods.
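
The abstract does not say how the ensemble components are combined into a final clustering. As a rough illustration of the general idea only, the sketch below uses one common consensus scheme, a co-association matrix followed by thresholded linking. This is an assumption made for illustration, not the authors' method; the function names `co_association` and `consensus`, the `threshold` parameter, and the toy partitions are all hypothetical.

```python
from itertools import combinations


def co_association(partitions):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    base partitions that place documents i and j in the same cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Link every pair of documents whose co-association exceeds the
    threshold, then label the connected components (union-find)."""
    co = co_association(partitions)
    n = len(co)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical base partitions of six documents. In the paper's setting
# each would come from a different ensemble component; here they are
# simply written by hand. Cluster ids need not agree across partitions.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping, permuted labels
    [0, 0, 1, 1, 1, 1],  # document 2 placed in the other cluster
]
print(consensus(partitions))
```

Because the co-association matrix only records whether two documents share a cluster, the scheme is insensitive to how each component numbers its clusters, and a single component's disagreement (document 2 above) is outvoted by the others.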

Acknowledgments

We are very grateful to the editors and the anonymous reviewers for their helpful comments and suggestions, which improved the quality of the paper. This work was supported by the National Natural Science Foundation of China under Grants No. 61473194 and No. 61305059, and by Guangdong Province of China under Grant No. 2013B091300019.

Author information

Authors and Affiliations

Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, China
He Zhao
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
He Zhao
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
Salman Salloum & Joshua Zhexue Huang
Tencent Technology (Shenzhen) Company Ltd., Shenzhen, China
Yeshou Cai

Corresponding author

Correspondence to He Zhao.

About this article

Cite this article

Zhao, H., Salloum, S., Cai, Y. et al. Ensemble subspace clustering of text data using two-level features. Int. J. Mach. Learn. & Cyber. 8, 1751–1766 (2017). https://doi.org/10.1007/s13042-016-0556-5

Received: 06 November 2015
Accepted: 01 June 2016
Published: 17 June 2016
Issue Date: December 2017
DOI: https://doi.org/10.1007/s13042-016-0556-5
