Effective semi-supervised document clustering via active learning with instance-level constraints

Zhao, Weizhong; He, Qing; Ma, Huifang; Shi, Zhongzhi

doi:10.1007/s10115-011-0389-1

Regular Paper
Published: 16 March 2011

Volume 30, pages 569–587, (2012)
Knowledge and Information Systems

Weizhong Zhao^1,2,3,
Qing He^1,2,
Huifang Ma^1,2,4 &
…
Zhongzhi Shi^1,2

Abstract

Semi-supervised document clustering, which uses a limited amount of supervised data to group unlabeled documents into clusters, has received significant interest recently. Because obtaining supervised data may be expensive, it is important to acquire the most informative knowledge in order to improve clustering performance. This paper presents a semi-supervised document clustering algorithm and a new method for actively selecting informative instance-level constraints to improve clustering performance. The semi-supervised document clustering algorithm is a Constrained DBSCAN (Cons-DBSCAN) algorithm, which incorporates instance-level constraints to guide the clustering process in DBSCAN. An active learning approach is proposed to select informative document pairs on which to obtain user feedback. Experimental results show that Cons-DBSCAN with the proposed active learning approach can improve clustering performance significantly when given a relatively small number of constraints.
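To make the idea of guiding DBSCAN with instance-level constraints concrete, the following is a minimal sketch and not the Cons-DBSCAN algorithm from the paper, whose details the abstract does not give. It assumes that a constraint is a cannot-link pair of point indices and that a density-based cluster simply refuses to absorb a point when doing so would put a cannot-link pair in the same cluster. Must-link constraints and active pair selection are left out. The names `region_query`, `violates`, and `constrained_dbscan` are hypothetical.

```python
from math import dist

NOISE = -1      # label for points that are not density-reachable
UNSEEN = None   # label for points not yet assigned


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]


def violates(j, cluster_id, labels, cannot_link):
    """True if assigning point j to cluster_id breaks a cannot-link pair."""
    for a, b in cannot_link:
        other = b if a == j else a if b == j else None
        if other is not None and labels[other] == cluster_id:
            return True
    return False


def constrained_dbscan(points, eps, min_pts, cannot_link=()):
    """DBSCAN-style expansion that skips points blocked by cannot-link pairs.

    Illustrative assumption only: a blocked point stays unassigned and may
    later seed or join a different cluster.
    """
    labels = [UNSEEN] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may still become a border point later
            continue
        labels[i] = cluster_id         # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] not in (UNSEEN, NOISE):
                continue               # already belongs to some cluster
            if violates(j, cluster_id, labels, cannot_link):
                continue               # the constraint blocks this assignment
            was_unseen = labels[j] is UNSEEN
            labels[j] = cluster_id
            if was_unseen:
                j_neighbors = region_query(points, j, eps)
                if len(j_neighbors) >= min_pts:
                    queue.extend(j_neighbors)   # j is core: keep expanding
        cluster_id += 1
    return labels


# Six points on a line, each within eps of the next.
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
unconstrained = constrained_dbscan(chain, eps=1.0, min_pts=2)
constrained = constrained_dbscan(chain, eps=1.0, min_pts=2, cannot_link=[(0, 5)])
```

In this toy chain, plain density expansion links every point into one cluster, while the single cannot-link pair between the two end points forces the last point into a cluster of its own. A real document-clustering setup would replace the Euclidean distance with a text similarity measure.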

Author information

Authors and Affiliations

The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China
Weizhong Zhao, Qing He, Huifang Ma & Zhongzhi Shi
Graduate University of Chinese Academy of Sciences, 100039, Beijing, China
Weizhong Zhao, Qing He, Huifang Ma & Zhongzhi Shi
College of Information Engineering, Xiangtan University, 411105, Xiangtan, China
Weizhong Zhao
College of Mathematics and Information, Northwest Normal University, 730070, Gansu Lanzhou, China
Huifang Ma

Corresponding author

Correspondence to Weizhong Zhao.

Additional information

This work is supported by the National Natural Science Foundation of China (No. 60933004, 60975039, 61072085), National Basic Research Priorities Programme (No. 2007CB311004), Doctoral Start-up Funding of Xiangtan University (No. 10QDZ42), Funding of enhancement of young teachers’ research of Northwest Normal University (No. NWNU-LKQN-10-1).

About this article

Cite this article

Zhao, W., He, Q., Ma, H. et al. Effective semi-supervised document clustering via active learning with instance-level constraints. Knowl Inf Syst 30, 569–587 (2012). https://doi.org/10.1007/s10115-011-0389-1

Received: 03 February 2010
Revised: 02 September 2010
Accepted: 02 March 2011
Published: 16 March 2011
Issue Date: March 2012
DOI: https://doi.org/10.1007/s10115-011-0389-1
