A Novel Method for Clustering Web Search Results with Wikipedia Disambiguation Pages

Huang, Zhi; Niu, Zhendong; Liu, Donglei; Niu, Wenjuan; Wang, Wei

doi:10.1007/978-3-319-22324-7_1

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9052)

Abstract

Organizing the search results of an ambiguous query into topics can facilitate information search on the Web. In this paper, we propose a novel method that clusters the search results of an ambiguous query into topics about the query, where the topics are constructed from Wikipedia disambiguation pages (WDP). To improve the clustering result, we propose a concept filtering method that removes semantically unrelated concepts from each topic. We also propose the top K full relations (TKFR) algorithm, which assigns search results to relevant topics based on the similarities between the concepts in the results and those in the topics. Compared with clustering methods whose topic labels are extracted from search results, the human-edited topics of WDP are much more helpful for navigation. The experimental results show that our method works for ambiguous queries of different lengths and substantially improves the clustering results of the WDP-based method.
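The abstract names two steps, concept filtering and similarity-based assignment of results to topics, without giving their definitions. The sketch below illustrates one plausible reading of each and should not be taken as the authors' algorithm. Everything in it is an assumption made for illustration: the toy 3-dimensional concept vectors, the cosine similarity measure, the `min_sim` threshold, the choice to score a result against a topic by the mean of the `k` highest pairwise concept similarities, and all function names.

```python
from math import sqrt

# Hypothetical toy concept vectors (animal axis, car axis, unrelated axis).
VECTORS = {
    "cat": (1.0, 0.0, 0.0), "predator": (0.9, 0.1, 0.0),
    "car": (0.0, 1.0, 0.0), "engine": (0.1, 0.9, 0.0),
    "stamp": (0.0, 0.0, 1.0),
    "jungle": (0.8, 0.2, 0.0), "dealer": (0.2, 0.8, 0.0),
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_concepts(concepts, vectors, min_sim=0.3):
    """Assumed filtering rule: keep a concept only if its mean similarity
    to the other concepts of the same topic reaches min_sim."""
    kept = []
    for c in concepts:
        others = [o for o in concepts if o != c]
        if not others:
            kept.append(c)
            continue
        mean_sim = sum(cosine(vectors[c], vectors[o]) for o in others) / len(others)
        if mean_sim >= min_sim:
            kept.append(c)
    return kept

def top_k_score(result_concepts, topic_concepts, vectors, k=2):
    """Assumed scoring rule: mean of the k highest result-topic concept similarities."""
    sims = sorted(
        (cosine(vectors[r], vectors[t]) for r in result_concepts for t in topic_concepts),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0

def assign(result_concepts, topics, vectors, k=2):
    """Return the name of the topic with the highest top-k score."""
    return max(topics, key=lambda name: top_k_score(result_concepts, topics[name], vectors, k))

# Topics as they might be read off a disambiguation page for "jaguar" (made up).
topics = {
    "Jaguar (animal)": ["cat", "predator", "stamp"],
    "Jaguar Cars": ["car", "engine"],
}
filtered = {name: filter_concepts(cs, VECTORS) for name, cs in topics.items()}

print(filtered["Jaguar (animal)"])
print(assign(["jungle"], filtered, VECTORS))
print(assign(["dealer"], filtered, VECTORS))
```

In this toy data, "stamp" has no similarity to the other concepts of its topic, so the filter drops it before any result is assigned. Averaging only the top `k` similarities is one way to keep a few strong concept matches from being diluted by many weak ones; the paper's actual TKFR definition may differ.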

Notes

1. http://www.dmoz.org/
2. http://wordnet.princeton.edu/
3. http://en.wikipedia.org/wiki/Category:Disambiguation_pages
4. http://code.google.com/p/word2vec/
5. http://credo.fub.it/ambient/
6. http://lcl.uniroma1.it/moresque
7. https://github.com/HUANG-Zhi/SimpleAmbiguousQueryDataset
8. The authors organized a task for the evaluation of clustering methods on AMBIENT and MORESQUE: http://www.cs.york.ac.uk/semeval-2013/task11/

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61370137), the International Corporation Project of Beijing Institute of Technology (No. 3070012221404) and the 111 Project of Beijing Institute of Technology.

Author information

Authors and Affiliations

School of Computer Science, Beijing Institute of Technology, Beijing, 100081, China
Zhi Huang, Zhendong Niu, Donglei Liu, Wenjuan Niu & Wei Wang
Information School, University of Pittsburgh, Pennsylvania, 15260, USA
Zhendong Niu
Beijing Engineering Research Center of Massive Language Information Processing and Cloud Computing Application, Beijing Institute of Technology, Beijing, 100081, China
Zhendong Niu

Corresponding author

Correspondence to Zhendong Niu.

Editor information

Editors and Affiliations

Soochow University, Suzhou, China
An Liu
Nagoya University, Nagoya, Japan
Yoshiharu Ishikawa
Wuhan University, Wuhan, China
Tieyun Qian
University of Hong Kong, Hong Kong, China
Sarana Nutanong
Monash University, Clayton, Victoria, Australia
Muhammad Aamir Cheema

About this paper

Cite this paper

Huang, Z., Niu, Z., Liu, D., Niu, W., Wang, W. (2015). A Novel Method for Clustering Web Search Results with Wikipedia Disambiguation Pages. In: Liu, A., Ishikawa, Y., Qian, T., Nutanong, S., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9052. Springer, Cham. https://doi.org/10.1007/978-3-319-22324-7_1

DOI: https://doi.org/10.1007/978-3-319-22324-7_1
Published: 30 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22323-0
Online ISBN: 978-3-319-22324-7
eBook Packages: Computer Science, Computer Science (R0)
