A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures

Shen, Xiaoyan; Chen, Junliang; Meng, Xiangwu; Zhang, Yujie; Liu, Chuanchang

doi:10.1007/978-3-642-01307-2_99

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper proposes a simple but powerful algorithm, the block co-citation algorithm, which automatically finds related pages for a given web page by using HTML segmentation technologies and parallel hyperlink structure analysis. First, all hyperlinks in a web page are segmented into several blocks according to the HTML structure and text style information. Second, for each page, the similarity between every two hyperlinks in the same block of the page is computed from several kinds of information; the total similarity from one page to another is obtained once all web pages have been processed. For a given page u, the pages with the highest total similarity to u are selected as the related pages of u. Finally, the block co-citation algorithm is implemented in parallel and used to analyze a corpus of 37,482,913 pages sampled from a commercial search engine, which demonstrates its feasibility and efficiency.
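The steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which information feeds the pairwise similarity, so the sketch assumes a constant weight of 1.0 for each co-occurrence of two links in the same block. The names `block_cocitation`, `related_pages`, and `pair_weight` are hypothetical, and the HTML segmentation step is assumed to have already produced the blocks.

```python
from collections import defaultdict
from itertools import combinations


def block_cocitation(pages, pair_weight=lambda a, b, block: 1.0):
    """Accumulate pairwise similarity over every link block of every page.

    pages: iterable of pages; each page is a list of blocks, and each
    block is a list of target URLs found in that block.
    pair_weight: ASSUMPTION, a placeholder for the similarity measure
    that the abstract leaves unspecified (defaults to 1.0 per pair).
    """
    total = defaultdict(lambda: defaultdict(float))
    for blocks in pages:
        for block in blocks:
            # Count each target once per block, in a deterministic order.
            targets = sorted(set(block))
            for a, b in combinations(targets, 2):
                w = pair_weight(a, b, block)
                total[a][b] += w
                total[b][a] += w
    return total


def related_pages(total, u, k=10):
    """Return up to k pages with the highest total similarity to u."""
    scores = total.get(u, {})
    # Sort by descending score, then by URL to break ties consistently.
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


# Two toy pages, each already segmented into two link blocks.
pages = [
    [["a", "b", "c"], ["d", "e"]],
    [["a", "b"], ["c", "d"]],
]
total = block_cocitation(pages)
print(related_pages(total, "a", k=2))
```

Because the total similarity is a plain sum over pages, partial totals computed on disjoint subsets of the corpus can be added together afterwards, which is one way such a computation could be split across workers.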

This research was sponsored by the National Natural Science Foundation of China (No. 60432010) and the National 973 Project of China (No. 2007CB307100).

Author information

Authors and Affiliations

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Xiaoyan Shen, Junliang Chen, Xiangwu Meng, Yujie Zhang & Chuanchang Liu

Editor information

Editors and Affiliations

Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5 Tiwanont Road, 12000, Bangkadi, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Dept. of Computer Engineering, Faculty of Engineering, Chulalongkorn University, 10330, Bangkok, Thailand
Boonserm Kijsirikul
Faculty of Science & Engineering, York University, 355 Lumbers Building, 4700 Keele Street, M3J 1P3, Toronto, Ontario, Canada
Nick Cercone
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, 923-1292, Ishikawa, Japan
Tu-Bao Ho

About this paper

Cite this paper

Shen, X., Chen, J., Meng, X., Zhang, Y., Liu, C. (2009). A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_99

DOI: https://doi.org/10.1007/978-3-642-01307-2_99
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)
