Abstract
With the rapid development of the Internet, the number of web pages has grown explosively, and many of these pages have similar content or are of low quality. For a search engine, indexing such pages does little to improve retrieval results but increases the indexing and retrieval burden. This paper presents a page selection algorithm that builds the indexing page collection for a search engine from massive web data. On the one hand, a web signature-based clustering algorithm filters similar pages to reduce the size of the indexing page collection; on the other hand, the algorithm combines a variety of page-level and user-level features to ensure the quality of the collection. Experiments show that the indexing page collection selected by the proposed algorithm is only one-third the size of the entire page collection, yet it meets the vast majority of user click needs, which indicates strong practical value.
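The two-step idea in the abstract (group similar pages by signature, then keep pages of sufficient quality) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the signature function or the quality features, so `page_signature` (an MD5 hash of normalized text), the precomputed `quality` score, and the `min_quality` threshold are all assumptions made for illustration. An exact hash only groups pages that are identical after normalization; detecting near-duplicates would need a similarity-preserving signature instead.

```python
import hashlib
import re
from collections import defaultdict


def page_signature(text: str) -> str:
    """Hypothetical signature: MD5 of the lowercased, whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def select_index_pages(pages, min_quality=0.0):
    """Keep at most one page per signature cluster.

    `pages` is an iterable of (url, text, quality) tuples. `quality` is a
    placeholder for some combination of page and user features; how that
    score is computed is not covered here.
    """
    clusters = defaultdict(list)
    for url, text, quality in pages:
        clusters[page_signature(text)].append((quality, url))

    selected = []
    for members in clusters.values():
        quality, url = max(members)  # best-scoring page in the cluster
        if quality >= min_quality:
            selected.append(url)
    return sorted(selected)


# Made-up example data: the first two pages differ only in case and spacing.
pages = [
    ("http://a.example/1", "Hello   World", 0.9),
    ("http://b.example/1", "hello world", 0.4),
    ("http://c.example/2", "Another page", 0.7),
    ("http://d.example/3", "Low quality page", 0.1),
]
print(select_index_pages(pages, min_quality=0.5))
```

In this toy run the first two pages fall into one cluster and only the higher-scoring one is kept, while the last page is dropped by the quality threshold.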
Acknowledgements
This work was supported by the Natural Science Foundation (60903107, 61073071) and the National High Technology Research and Development (863) Program (2011AA01A205) of China.
Copyright information
© 2013 Springer Science+Business Media New York
About this paper
Cite this paper
Ru, L., Li, Z., Wu, Y., Ma, S. (2013). Research on Indexing Page Collection Selection Method for Search Engine. In: Li, J., Qi, G., Zhao, D., Nejdl, W., Zheng, HT. (eds) Semantic Web and Web Science. Springer Proceedings in Complexity. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6880-6_30
DOI: https://doi.org/10.1007/978-1-4614-6880-6_30
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-6879-0
Online ISBN: 978-1-4614-6880-6
eBook Packages: Computer Science (R0)