DOI: 10.1145/1390334.1390429
research-article

A comparative evaluation of different link types on enhancing document clustering

Published: 20 July 2008

Abstract

As a growing number of studies use link information to enhance document clustering, a comparative evaluation of how different link types affect clustering becomes necessary. Links between text documents come in several types: explicit links such as citation links and hyperlinks, implicit links such as co-authorship links, and pseudo links such as content similarity links. All of them convey topic similarity or topic transferring patterns, which are useful for document clustering. In this study, we adopt a Relaxation Labeling (RL)-based clustering algorithm, which employs both content and linkage information, to evaluate the effectiveness of these link types for document clustering on eight datasets. The experimental results show that linkage is quite effective in improving content-based document clustering. The experiments also yield a series of findings on the impacts of different link types on document clustering.
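The abstract does not spell out the RL update, so the following is only a minimal sketch of how a relaxation-labeling style refinement can combine content and linkage, not the authors' algorithm. It assumes each document already has a content-based distribution over clusters, and it repeatedly reweights that distribution by the average cluster distribution of the document's linked neighbors. The function name `relaxation_labeling`, the mixing weight `alpha`, and the toy data are all hypothetical.

```python
def relaxation_labeling(content_probs, links, n_iter=10, alpha=0.5):
    """Refine content-based cluster distributions using link neighbors.

    content_probs: list of rows, each a probability distribution over clusters.
    links: dict mapping a document index to the indices of its linked documents.
    alpha: assumed weight (0 <= alpha < 1) on neighbor support versus a uniform prior.
    """
    n_clusters = len(content_probs[0])
    probs = [row[:] for row in content_probs]
    for _ in range(n_iter):
        updated = []
        for i, prior in enumerate(content_probs):
            neighbors = links.get(i, [])
            if neighbors:
                # Average cluster distribution of the linked documents.
                support = [
                    sum(probs[j][c] for j in neighbors) / len(neighbors)
                    for c in range(n_clusters)
                ]
            else:
                # No links: uniform support leaves the content estimate unchanged.
                support = [1.0 / n_clusters] * n_clusters
            # Content evidence times smoothed neighbor support, then normalize.
            scores = [
                prior[c] * ((1 - alpha) / n_clusters + alpha * support[c])
                for c in range(n_clusters)
            ]
            total = sum(scores)
            updated.append([s / total for s in scores])
        probs = updated  # synchronous update of all documents
    return probs


def hard_labels(probs):
    """Assign each document to its most probable cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Toy data (made up): document 2 is ambiguous by content but links to 0 and 1.
content = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.1, 0.9]]
links = {0: [2], 1: [2], 2: [0, 1], 3: []}

content_labels = hard_labels(content)
refined = relaxation_labeling(content, links)
linked_labels = hard_labels(refined)
print(content_labels, linked_labels)
```

In this toy example, document 2 leans slightly toward cluster 1 on content alone, but its two linked neighbors sit firmly in cluster 0, so the refined assignment moves it to cluster 0. Document 3 has no links and keeps its content-based assignment.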



Published In

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN:9781605581644
DOI:10.1145/1390334

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. link-based clustering
  2. Markov random field
  3. relaxation labeling


Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%


