Research on Indexing Page Collection Selection Method for Search Engine

Ru, Liyun; Li, Zhichao; Wu, Yingying; Ma, Shaoping

doi:10.1007/978-1-4614-6880-6_30

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Abstract

With the rapid development of the Internet, the number of web pages has grown explosively, and many of these pages have similar content or are of low quality. For a search engine, indexing such pages does little to improve retrieval results while increasing the indexing and retrieval burden. This paper presents a page selection algorithm that builds an indexing page collection for a search engine from massive web data. On the one hand, a web signature-based clustering algorithm filters out similar pages to reduce the size of the indexing page collection; on the other hand, the algorithm combines a variety of page-dimension and user-dimension features to ensure the quality of the collection. Experiments show that the indexing page collection selected by the proposed algorithm is only one-third the size of the entire page collection yet meets the vast majority of user click needs, which indicates strong practical value.
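The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the signature function or the features, so the MD5-over-normalized-text signature, the `page_score` and `user_score` fields, the equal weights, and the `min_quality` threshold are all assumptions made for illustration.

```python
import hashlib
import re
from collections import defaultdict


def signature(text: str) -> str:
    """Hypothetical page signature: MD5 of the lowercased word sequence."""
    words = re.findall(r"\w+", text.lower())
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()


def quality(page: dict, w_page: float = 0.5, w_user: float = 0.5) -> float:
    """Assumed quality score: weighted sum of one page feature and one user feature."""
    return w_page * page["page_score"] + w_user * page["user_score"]


def select_index_pages(pages: list[dict], min_quality: float = 0.0) -> list[dict]:
    """Cluster pages by signature, keep the best page per cluster above a threshold."""
    clusters = defaultdict(list)
    for page in pages:
        clusters[signature(page["text"])].append(page)

    selected = []
    for members in clusters.values():
        best = max(members, key=quality)  # one representative per cluster
        if quality(best) >= min_quality:  # drop low-quality clusters entirely
            selected.append(best)
    return selected


# Toy data; URLs and scores are made up for the example.
pages = [
    {"url": "a.example/1", "text": "Hello, World!", "page_score": 0.9, "user_score": 0.2},
    {"url": "b.example/1", "text": "hello world", "page_score": 0.4, "user_score": 0.9},
    {"url": "c.example/1", "text": "Something else entirely", "page_score": 0.1, "user_score": 0.1},
]
chosen = select_index_pages(pages, min_quality=0.3)
print([p["url"] for p in chosen])  # ['b.example/1']
```

In this toy run the first two pages share a signature, so only the higher-scoring one is kept, and the third page falls below the assumed quality threshold. An exact hash only catches pages that are identical after normalization; a real system would more likely use a similarity-preserving signature.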

Acknowledgements

This work was supported by Natural Science Foundation (60903107, 61073071) and National High Technology Research and Development (863) Program (2011AA01A205) of China.

Author information

Authors and Affiliations

State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Liyun Ru & Shaoping Ma
Sogou Corporation, Beijing, 100084, China
Zhichao Li
Graduate School of Arts and Sciences, Harvard University, Cambridge, MA, 02138, USA
Yingying Wu

Corresponding author

Correspondence to Liyun Ru.

Editor information

Editors and Affiliations

Dept. of Computer Science and Technology, Tsinghua University, Room 10-206, East main building, Beijing, 100084, People's Republic of China
Juanzi Li
School of Comp. Sci. & Eng., Southeast University, Dongda Road 2, Nanjing, 211189, Jiangsu, People's Republic of China
Guilin Qi
Peking University, Inst. of Computer Science & Tech., North Zhongguancun Street 128, Beijing, 100871, People's Republic of China
Dongyan Zhao
L3S Research Center, Leibniz University Hannover, Appelstr. 4, Hannover, 30167, Germany
Wolfgang Nejdl
Tsinghua Campus H202B, Shenzhen City, 518055, People's Republic of China
Hai-Tao Zheng

Cite this paper

Ru, L., Li, Z., Wu, Y., Ma, S. (2013). Research on Indexing Page Collection Selection Method for Search Engine. In: Li, J., Qi, G., Zhao, D., Nejdl, W., Zheng, HT. (eds) Semantic Web and Web Science. Springer Proceedings in Complexity. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6880-6_30

DOI: https://doi.org/10.1007/978-1-4614-6880-6_30
Published: 02 May 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-6879-0
Online ISBN: 978-1-4614-6880-6
eBook Packages: Computer Science, Computer Science (R0)
