User-Interest-Based Document Filtering via Semi-supervised Clustering

Tang, Na; Vemuri, V. Rao

doi:10.1007/11425274_59


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3488)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems


Abstract

This paper studies user-interest-based document filtering, in which users seek documents on a specific topic within a large document collection. This is usually done by text categorization, which divides the documents into two categories: one containing all the desired documents (called positive documents) and the other containing all remaining documents (called negative documents). In many cases, however, some of the negative documents are close enough to the positive documents to warrant reconsideration; these are called deviating negative documents. Simply treating them as negative documents degrades categorization accuracy. We modify and extend a semi-supervised clustering method to perform the categorization. Compared to the original method, our approach incorporates more informative initialization and constraints and, as a result, produces better clusterings. Experiments show that our approach achieves better (sometimes significantly better) categorization accuracy than the original method in the presence of deviating negative documents.
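
To make the general idea concrete, the following is a minimal sketch of seeded, constrained two-way clustering. It is not the authors' algorithm, whose details are not given in the abstract. It assumes dense term-weight vectors, cosine similarity, at least one labeled seed document per class, and a hard constraint that seeds keep their labels. The names seeded_two_way_clustering, cosine, centroid, docs, and seeds are hypothetical and introduced only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def seeded_two_way_clustering(docs, seeds, max_iter=20):
    """Cluster docs into negative (0) and positive (1).

    docs:  list of term-weight vectors.
    seeds: {document index: 0 or 1}; assumed to cover both classes.
    Seeded documents keep their label in every round (hard constraint);
    all other documents are reassigned to the most similar centroid.
    """
    # Informative initialization: each centroid is the mean of its seeds.
    centers = [centroid([docs[i] for i, y in seeds.items() if y == c])
               for c in (0, 1)]
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for i, d in enumerate(docs):
            if i in seeds:
                new_labels.append(seeds[i])  # constraint: seed label is fixed
            else:
                new_labels.append(max((0, 1), key=lambda c: cosine(d, centers[c])))
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Recompute each centroid from the current members of its cluster.
        centers = [centroid([d for d, y in zip(docs, labels) if y == c])
                   for c in (0, 1)]
    return labels

# Toy data (made up): document 0 is a positive seed, document 2 a negative seed.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
        [0.0, 0.9, 0.1], [0.6, 0.4, 0.0]]
seeds = {0: 1, 2: 0}
labels = seeded_two_way_clustering(docs, seeds)
print(labels)
```

In this sketch, a deviating negative document could be handled by adding it to seeds with label 0, which pins it to the negative cluster even when it lies close to the positive centroid. Whether the paper treats such documents this way is not stated in the abstract.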



Author information

Authors and Affiliations

Computer Science Dept., University of California, Davis, Davis, CA, 95616, USA
Na Tang & V. Rao Vemuri


Editor information

Editors and Affiliations

LIRIS - UFR d’Informatique, Université Claude Bernard Lyon 1, 43, boulevard du 11 novembre 1918, 69622, Villeurbanne, France
Mohand-Said Hacid
Department of Computer Science, State University of New York, 12222, Albany, NY, USA
Neil V. Murray
Department of Computer Science, University of North Carolina, 28223, Charlotte, NC, USA
Zbigniew W. Raś
Shimane University, 89-1 Enya-cho Izumo, 6938501, Shimane, Japan
Shusaku Tsumoto


About this paper

Cite this paper

Tang, N., Vemuri, V.R. (2005). User-Interest-Based Document Filtering via Semi-supervised Clustering. In: Hacid, MS., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds) Foundations of Intelligent Systems. ISMIS 2005. Lecture Notes in Computer Science, vol 3488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11425274_59


DOI: https://doi.org/10.1007/11425274_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25878-0
Online ISBN: 978-3-540-31949-8
eBook Packages: Computer Science, Computer Science (R0)
