DOI: 10.1145/1498759.1498809
research-article

Clustering the tagged web

Published: 09 February 2009

Abstract

Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
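The extended vector space idea in the abstract can be illustrated with a minimal sketch. It assumes each page is a bag of words plus a bag of tags, L2-normalizes the two parts separately, concatenates them, and runs plain K-means (Lloyd's algorithm). The separate normalization, the `tag_weight` parameter, the deterministic seeding, and the toy pages are all assumptions made for illustration; the abstract does not specify how the paper weights or combines the two sources.

```python
import math
from collections import Counter

def l2_normalize(counts):
    """Scale a sparse count vector (a dict) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}

def combined_vector(words, tags, tag_weight=1.0):
    """Concatenate separately normalized word and tag vectors.

    Keys are prefixed with "w" or "t" so a word and a tag with the same
    spelling remain distinct dimensions. tag_weight is a hypothetical knob.
    """
    vec = {("w", k): v for k, v in l2_normalize(Counter(words)).items()}
    for k, v in l2_normalize(Counter(tags)).items():
        vec[("t", k)] = tag_weight * v
    return vec

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))

def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for k, x in v.items():
            total[k] += x
    return {k: x / len(vectors) for k, x in total.items()}

def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm; the first k vectors seed the centroids (assumption)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each page goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels

# Toy pages: (page words, user tags). Invented data, not from the paper.
pages = [
    (["python", "code", "tutorial"], ["programming", "python"]),
    (["pasta", "recipe", "sauce"], ["food", "cooking"]),
    (["java", "code", "compiler"], ["programming"]),
    (["bread", "recipe", "oven"], ["cooking", "baking"]),
]
vectors = [combined_vector(w, t) for w, t in pages]
labels = kmeans(vectors, k=2)
print(labels)
```

Normalizing the word and tag parts independently keeps either source from dominating the distance simply because it contributes more tokens; raising or lowering `tag_weight` shifts the balance between them.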

Published In

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
February 2009
314 pages
ISBN: 9781605583907
DOI: 10.1145/1498759

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

WSDM'09

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 12
  • Downloads (Last 6 weeks): 1

Reflects downloads up to 08 Feb 2025.

Citations

Cited By

  • (2025) Feature-weighted fuzzy clustering methods: An experimental review. Neurocomputing 619 (129176). DOI: 10.1016/j.neucom.2024.129176. Online publication date: Feb-2025
  • (2023) Why do banks fail? An investigation via text mining. Cogent Economics & Finance 11:2. DOI: 10.1080/23322039.2023.2251272. Online publication date: 3-Sep-2023
  • (2022) A semi-hierarchical clustering method for constructing knowledge trees from stackoverflow. Journal of Information Science 48:3 (393-405). DOI: 10.1177/0165551520961035. Online publication date: 1-Jun-2022
  • (2022) Context-Consistent Generation of Indoor Virtual Environments Based on Geometry Constraints. IEEE Transactions on Visualization and Computer Graphics 28:12 (3986-3999). DOI: 10.1109/TVCG.2021.3111729. Online publication date: 1-Dec-2022
  • (2021) Managing a natural disaster: actionable insights from microblog data. Journal of Decision Systems 31:1-2 (134-149). DOI: 10.1080/12460125.2021.1918045. Online publication date: 28-Apr-2021
  • (2021) A Joint Representation Learning Approach for Social Media Tag Recommendation. Neural Information Processing (100-112). DOI: 10.1007/978-3-030-92273-3_9. Online publication date: 5-Dec-2021
  • (2020) The origins of Objective-C at PPI/Stepstone and its evolution at NeXT. Proceedings of the ACM on Programming Languages 4:HOPL (1-74). DOI: 10.1145/3386332. Online publication date: 12-Jun-2020
  • (2020) Villages Status Classification Analysis Involving K-Means Algorithm To Support Kementerian Desa Pembangunan Daerah Tertinggal dan Transmigrasi Work Programs. Journal of Physics: Conference Series 1641 (012058). DOI: 10.1088/1742-6596/1641/1/012058. Online publication date: 24-Nov-2020
  • (2020) Research on Chinese Short Text Clustering Ensemble via Convolutional Neural Networks. Artificial Intelligence in China (622-628). DOI: 10.1007/978-981-15-0187-6_74. Online publication date: 1-Feb-2020
  • (2019) Weakly Supervised Domain Detection. Transactions of the Association for Computational Linguistics 7 (581-596). DOI: 10.1162/tacl_a_00287. Online publication date: Nov-2019
