
Research on Indexing Page Collection Selection Method for Search Engine

  • Conference paper

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Abstract

With the rapid development of the Internet, the number of web pages has grown explosively, and many of these pages have similar content or are of low quality. For a search engine, indexing such pages does little to improve retrieval results but increases the indexing and retrieval burden. This paper presents a page selection algorithm that builds an indexing page collection from massive web data. On the one hand, a web signature-based clustering algorithm filters out similar pages to reduce the size of the collection; on the other hand, the algorithm combines a variety of page-dimension and user-dimension features to ensure the quality of the collection. Experiments show that the indexing page collection selected by the proposed algorithm is only one-third the size of the entire page collection yet still meets the vast majority of user click needs, which makes the approach highly practical.
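The two steps named in the abstract, signature-based filtering of similar pages followed by quality-based selection, can be illustrated with a minimal sketch. The abstract does not specify the signature scheme or the feature set, so everything below is an assumption made for illustration: the signature is a 64-bit SimHash over word tokens, pages are grouped greedily by Hamming distance, and the quality score is a hypothetical linear blend of one page-side feature (`page_score`) and one user-side feature (`click_count`). The names `Page`, `quality`, and `select_index_pages`, the weights, and the distance threshold are not taken from the paper.

```python
import hashlib
import math
import re
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    page_score: float  # hypothetical page-dimension feature (e.g. link-based)
    click_count: int   # hypothetical user-dimension feature


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash signature over lowercase word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def quality(page: Page) -> float:
    # Assumed blend of page-side and user-side evidence; weights are arbitrary.
    return 0.5 * page.page_score + 0.5 * math.log1p(page.click_count)


def select_index_pages(pages, max_distance: int = 3):
    """Keep one page (the highest-quality one) per group of similar pages."""
    clusters = []  # each entry: (signature of first member, best page so far)
    for page in pages:
        sig = simhash(page.text)
        for idx, (rep_sig, best) in enumerate(clusters):
            if hamming(sig, rep_sig) <= max_distance:
                if quality(page) > quality(best):
                    clusters[idx] = (rep_sig, page)
                break
        else:
            # No existing group is close enough, so start a new one.
            clusters.append((sig, page))
    return [best for _, best in clusters]


if __name__ == "__main__":
    body = "search engines index web pages to answer user queries quickly"
    pages = [
        Page("http://example.com/a", body, page_score=0.2, click_count=3),
        Page("http://example.com/b", body, page_score=0.9, click_count=120),
        Page("http://example.com/c", "a recipe for tomato soup with basil",
             page_score=0.1, click_count=0),
    ]
    for kept in select_index_pages(pages):
        print(kept.url)
```

The linear scan over groups is quadratic in the worst case and is used here only for readability; at web scale, signatures would normally be bucketed so that each page is compared against a small number of candidates.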



Acknowledgements

This work was supported by the Natural Science Foundation (60903107, 61073071) and the National High Technology Research and Development (863) Program (2011AA01A205) of China.

Author information

Corresponding author

Correspondence to Liyun Ru.

Copyright information

© 2013 Springer Science+Business Media New York

About this paper

Cite this paper

Ru, L., Li, Z., Wu, Y., Ma, S. (2013). Research on Indexing Page Collection Selection Method for Search Engine. In: Li, J., Qi, G., Zhao, D., Nejdl, W., Zheng, HT. (eds) Semantic Web and Web Science. Springer Proceedings in Complexity. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6880-6_30

  • DOI: https://doi.org/10.1007/978-1-4614-6880-6_30

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-6879-0

  • Online ISBN: 978-1-4614-6880-6

  • eBook Packages: Computer Science, Computer Science (R0)
