Effective semi-supervised document clustering via active learning with instance-level constraints

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Semi-supervised document clustering, which uses a limited amount of supervised data to group unlabeled documents into clusters, has received significant interest recently. Because obtaining supervised data may be expensive, it is important to acquire the most informative knowledge to improve clustering performance. This paper presents a semi-supervised document clustering algorithm and a new method for actively selecting informative instance-level constraints to improve clustering performance. The semi-supervised document clustering algorithm is Constrained DBSCAN (Cons-DBSCAN), which incorporates instance-level constraints to guide the clustering process in DBSCAN. An active learning approach is proposed to select informative document pairs on which to obtain user feedback. Experimental results show that Cons-DBSCAN with the proposed active learning approach can improve clustering performance significantly when given a relatively small number of constraints.
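The abstract does not spell out how Cons-DBSCAN consumes its constraints, so the following is only an illustrative sketch of what instance-level constraints commonly look like, not the paper's algorithm. It assumes constraints are given as must-link and cannot-link pairs of document indices (an assumption about the form of the constraints); the function names `must_link_components` and `violations` are hypothetical and chosen for this example only.

```python
def must_link_components(n, must_link):
    """Group n documents into the components implied by must-link pairs.

    Assumption: must-link is treated as transitive, so the pairs are
    merged with a simple union-find structure.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in must_link:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # One representative (root) per document.
    return [find(i) for i in range(n)]


def violations(labels, must_link, cannot_link):
    """Return the constraints that a given cluster labeling breaks."""
    broken_ml = [(a, b) for a, b in must_link if labels[a] != labels[b]]
    broken_cl = [(a, b) for a, b in cannot_link if labels[a] == labels[b]]
    return broken_ml, broken_cl


if __name__ == "__main__":
    labels = [0, 0, 1, 1]            # a candidate clustering of 4 documents
    must_link = [(0, 1), (1, 2)]     # hypothetical user feedback
    cannot_link = [(0, 3)]
    print(must_link_components(4, must_link))
    print(violations(labels, must_link, cannot_link))
```

A constrained clustering procedure could use a check like `violations` to reject or repair cluster assignments, and an active learner could ask the user about the document pairs it is least certain of; both uses are sketched here under the stated assumptions only.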

Author information

Corresponding author

Correspondence to Weizhong Zhao.

Additional information

This work is supported by the National Natural Science Foundation of China (Nos. 60933004, 60975039, 61072085), the National Basic Research Priorities Programme (No. 2007CB311004), the Doctoral Start-up Funding of Xiangtan University (No. 10QDZ42), and the Funding of enhancement of young teachers’ research of Northwest Normal University (No. NWNU-LKQN-10-1).

About this article

Cite this article

Zhao, W., He, Q., Ma, H. et al. Effective semi-supervised document clustering via active learning with instance-level constraints. Knowl Inf Syst 30, 569–587 (2012). https://doi.org/10.1007/s10115-011-0389-1

  • DOI: https://doi.org/10.1007/s10115-011-0389-1
