User-Related Tag Expansion for Web Document Clustering

Li, Peng; Wang, Bin; Jin, Wei; Cui, Yachao

doi:10.1007/978-3-642-20161-5_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6611)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

As high-quality descriptors of web page semantics, social annotations (tags) have been used for web document clustering with promising results. However, most web pages have few tags (fewer than 10), and this sparsity seriously limits the usefulness of tags for clustering. In this work, we propose a user-related tag expansion method to overcome the problem. It incorporates additional useful tags into the original tag document by using user tagging as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may change. We address this problem with a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that (1) our user-related tag expansion method can be effectively applied to over 90% of tagged web documents; (2) Folk-LDA can alleviate topic drift in expansion, especially for topic-specific documents; and (3) compared to word-based clustering, in the best case our approach using only tags achieves a statistically significant 39% increase in F1 score while reducing the number of terms involved in computation by 76%.
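
The abstract does not spell out how expanded tags are selected, so the following is only a minimal sketch of one plausible reading of "user-related" expansion: candidate tags are drawn from the other bookmarks of the users who tagged the document, then ranked by frequency. The function name `expand_tags`, the `(user, doc, tag)` triple format, the frequency ranking, and the `top_k` cutoff are all assumptions made for illustration and are not taken from the paper. The sketch returns the original and expanded tags as two separate lists, which loosely mirrors the idea of treating them as distinct observations; it does not implement Folk-LDA.

```python
from collections import Counter, defaultdict


def expand_tags(bookmarks, doc_id, top_k=5):
    """Return (original_tags, expanded_tags) for one document.

    bookmarks: iterable of (user, doc, tag) triples (hypothetical format).
    Expanded tags come from other documents tagged by the same users.
    """
    doc_tags = set()                    # tags already on the target document
    doc_users = set()                   # users who tagged the target document
    user_vocab = defaultdict(Counter)   # per-user tag counts on OTHER documents

    for user, doc, tag in bookmarks:
        if doc == doc_id:
            doc_tags.add(tag)
            doc_users.add(user)
        else:
            user_vocab[user][tag] += 1

    # Pool candidate tags from the vocabularies of the document's users,
    # skipping tags the document already has.
    candidates = Counter()
    for user in doc_users:
        for tag, count in user_vocab[user].items():
            if tag not in doc_tags:
                candidates[tag] += count

    # Rank by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    expanded = [tag for tag, _ in ranked[:top_k]]
    return sorted(doc_tags), expanded


# Toy data, invented purely for illustration.
bookmarks = [
    ("alice", "d1", "python"),
    ("alice", "d2", "programming"),
    ("alice", "d3", "programming"),
    ("alice", "d3", "tutorial"),
    ("bob", "d1", "code"),
    ("bob", "d2", "programming"),
    ("bob", "d4", "python"),
    ("carol", "d5", "cooking"),
]
original, expanded = expand_tags(bookmarks, "d1", top_k=2)
print(original, expanded)
```

In this toy run, "cooking" is never proposed for `d1` because carol did not tag that document. Keeping the two lists apart lets a downstream model weight expanded tags differently from original ones, which is one simple way to limit the topic drift the abstract warns about.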

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, China
Peng Li, Bin Wang & Yachao Cui
Graduate School of the Chinese Academy of Sciences, Beijing, China
Peng Li & Yachao Cui
Department of Computer Science, North Dakota State University, USA
Wei Jin

Editor information

Editors and Affiliations

Information School, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Paul Clough
CLARITY: Centre for Sensor Web Technologies, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
Colum Foley, Cathal Gurrin & Hyowon Lee
Centre for Next Generation Localisation, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
Gareth J. F. Jones
TNO Human Factors, Brassersplein 2, 2612 CT, Delft, The Netherlands
Wessel Kraaij
Yahoo! Research, 177 Diagonal, 08018, Barcelona, Spain
Vanessa Murdock

Cite this paper

Li, P., Wang, B., Jin, W., Cui, Y. (2011). User-Related Tag Expansion for Web Document Clustering. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_5

DOI: https://doi.org/10.1007/978-3-642-20161-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20160-8
Online ISBN: 978-3-642-20161-5
eBook Packages: Computer Science, Computer Science (R0)
