User-Related Tag Expansion for Web Document Clustering

  • Conference paper
Advances in Information Retrieval (ECIR 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6611)

Abstract

As high-quality descriptors of web page semantics, social annotations, or tags, have been used for web document clustering with promising results. However, most web pages have few tags (fewer than 10). This sparsity seriously limits the usefulness of tags for clustering. In this work, we propose a user-related tag expansion method to overcome the problem: it incorporates additional useful tags into the original tag document by using user tagging as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may change. We address this problem by designing a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that (1) our user-related tag expansion method can be effectively applied to over 90% of tagged web documents; (2) Folk-LDA can alleviate topic drift in expansion, especially for topic-specific documents; and (3) compared to word-based clustering, our approach using only tags achieves, at best, a statistically significant 39% increase in F1 score while reducing the number of terms involved in computation by 76%.
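The abstract does not spell out the expansion procedure, so the following is only a minimal sketch of one plausible reading of "user-related" expansion: candidate tags for a document are the tags that its own taggers have applied to other documents. The `(user, doc, tag)` triple format, the function name `expand_tags`, the frequency-based ranking, and the `top_k` cutoff are all assumptions made for illustration, not the authors' method. The original and expanded tags are returned as two separate lists, since the abstract describes modeling them as separate observations rather than merging them.

```python
from collections import Counter


def expand_tags(doc_id, bookmarks, top_k=5):
    """Return (original_tags, expanded_tags) for doc_id.

    bookmarks: iterable of (user, doc, tag) triples. This input format is
    hypothetical; the abstract does not describe the data layout.
    """
    bookmarks = list(bookmarks)

    # Tags assigned directly to the document, and the users who assigned them.
    original = {tag for user, doc, tag in bookmarks if doc == doc_id}
    taggers = {user for user, doc, tag in bookmarks if doc == doc_id}

    # Candidate tags: tags those same users applied to OTHER documents.
    # Skipping tags the document already has is an assumption of this sketch.
    candidates = Counter(
        tag
        for user, doc, tag in bookmarks
        if user in taggers and doc != doc_id and tag not in original
    )

    # Rank by frequency (ties broken alphabetically so output is deterministic).
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    expanded = [tag for tag, _count in ranked[:top_k]]
    return sorted(original), expanded


# Toy data, invented purely for illustration.
bookmarks = [
    ("alice", "d1", "python"),
    ("bob", "d1", "tutorial"),
    ("alice", "d2", "programming"),
    ("alice", "d3", "programming"),
    ("bob", "d2", "code"),
    ("bob", "d3", "python"),
    ("carol", "d2", "cooking"),
]

original_tags, expanded_tags = expand_tags("d1", bookmarks, top_k=2)
print(original_tags)  # ['python', 'tutorial']
print(expanded_tags)  # ['programming', 'code']
```

In this toy run, "cooking" is never added because carol did not tag `d1`. A raw frequency ranking like this one does nothing to prevent topic drift, which is the problem the abstract assigns to Folk-LDA.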

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, P., Wang, B., Jin, W., Cui, Y. (2011). User-Related Tag Expansion for Web Document Clustering. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_5

  • DOI: https://doi.org/10.1007/978-3-642-20161-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20160-8

  • Online ISBN: 978-3-642-20161-5

  • eBook Packages: Computer Science, Computer Science (R0)
