Query Directed Web Page Clustering Using Suffix Tree and Wikipedia Links

Park, John; Gao, Xiaoying; Andreae, Peter

doi:10.1007/978-3-642-35527-1_8

John Park²²,
Xiaoying Gao²² &
Peter Andreae²²

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 7713))

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

3479 Accesses

Abstract

Recent research on Web page clustering has shown that the user query plays a critical role in guiding the categorisation of web search results. This paper combines our Query Directed Clustering algorithm (QDC) with another existing algorithm, Suffix Tree Clustering (STC), to identify common phrases shared by documents for base cluster identification. One main contribution is the utilising of a new Wikipedia link based measure to estimate the semantic relatedness between query and the base cluster labels, which has shown great promise in identifying the good base clusters. Our experimental results show that the performance is improved by utilising suffix trees and Wikipedia links.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1998, pp. 46–54. ACM, New York (1998)
Chapter Google Scholar
Crabtree, D., Andreae, P., Gao, X.: Query directed web page clustering. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2006, pp. 202–210. IEEE Computer Society, Washington, DC (2006)
Chapter Google Scholar
Crabtree, D., Gao, X., Andreae, P.: Query directed clustering. The Knowledge and Information Systems (KAIS) Journal (acepted July 29, 2012)
Google Scholar
Cilibrasi, R.L., Vitanyi, P.M.B.: The google similarity distance. IEEE Trans. on Knowl. and Data Eng. 19, 370–383 (2007)
Article Google Scholar
Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In: Proceedings of AAAI 2008 (2008)
Google Scholar
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pp. 1–11. IEEE Computer Society, Washington, DC (1973)
Chapter Google Scholar
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1606–1611 (2007)
Google Scholar
Strube, M., Ponzetto, S.P.: Wikirelate! computing semantic relatedness using wikipedia. In: Proceedings of Association for the Advancement of Artificial Intelligence, AAAI (2006)
Google Scholar
Bu, F., Hao, Y., Zhu, X.: Semantic relationship discovery with wikipedia structure. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, IJCAI 2011, vol. 3, pp. 1770–1775. AAAI Press (2011)
Google Scholar
Crabtree, D.: Raw data set (2005), http://www.danielcrabtree.com/research/wi05/rawdata.zip
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI 2007, pp. 1606–1611. Morgan Kaufmann Publishers Inc., San Francisco (2007)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington, New Zealand
John Park, Xiaoying Gao & Peter Andreae

Authors

John Park
View author publications
You can also search for this author in PubMed Google Scholar
Xiaoying Gao
View author publications
You can also search for this author in PubMed Google Scholar
Peter Andreae
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

School of Computer Science, Fudan University, Handan Road 220, 200433, Shanghai, China
Shuigeng Zhou
Chinese Academy of Sciences, Academy of Mathematics and Systems Science, Dongguancun East Road 55, 100190, Beijing, China
Songmao Zhang
Department of Computer Science and Engineering, University of Minnesota, Union Street SE 200, 55455, Minneapolis, MN, USA
George Karypis

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Park, J., Gao, X., Andreae, P. (2012). Query Directed Web Page Clustering Using Suffix Tree and Wikipedia Links. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science(), vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_8

Download citation

DOI: https://doi.org/10.1007/978-3-642-35527-1_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35526-4
Online ISBN: 978-3-642-35527-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics