Abstract
This paper proposes an integrated method for ensemble subspace clustering of high-dimensional, sparse text data. The method uses a two-level feature representation of text (words and topics) to generate clusters from subspaces, and applies ensemble clustering to make the resulting clusters more robust. Topic modeling provides the two-level feature representation and also generates the different ensemble components. Because documents are clustered on both topics and words, the clusters are more interpretable: the weight of each word and each topic can be measured in every cluster. We evaluate the method on seven real-life data sets, some of which are easy to cluster, some hard, and some unbalanced. Experimental results across these diverse data sets show that our method outperforms other ensemble clustering methods.
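To illustrate how ensemble components can be combined, the following is a minimal sketch of one common consensus technique, a co-association matrix followed by thresholded linking. It is not the consensus function used in this paper. The function names, the `threshold` parameter, and the toy partitions are assumptions made for the example; each partition stands in for the output of one clustering run, for instance one run per topic model.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of partitions in which each pair of documents shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    return [[c / m for c in row] for row in counts]


def consensus_clusters(partitions, threshold=0.5):
    """Link documents that are co-clustered in more than `threshold` of the
    partitions, then return the connected components as consensus labels.
    (Hypothetical helper, for illustration only.)"""
    n = len(partitions[0])
    co = co_association(partitions)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(j)] = find(i)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three toy ensemble components over six documents (made-up labels).
components = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
labels = consensus_clusters(components)
```

Because the co-association matrix only records whether two documents share a cluster, the components do not need consistent label numbering, which is why the second toy partition can use different cluster ids from the other two.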
Acknowledgments
We are very grateful to the editors and the anonymous reviewers for their helpful comments and suggestions, which improved the quality of the paper. This work was supported by the National Natural Science Foundation of China under Grants No. 61473194 and No. 61305059, and by Guangdong Province of China under Grant No. 2013B091300019.
Cite this article
Zhao, H., Salloum, S., Cai, Y. et al. Ensemble subspace clustering of text data using two-level features. Int. J. Mach. Learn. & Cyber. 8, 1751–1766 (2017). https://doi.org/10.1007/s13042-016-0556-5
DOI: https://doi.org/10.1007/s13042-016-0556-5