
Improving the freshness of the search engines by a probabilistic approach based incremental crawler

Published in: Information Systems Frontiers

Abstract

The web is flooded with data. The crawler is responsible for fetching web pages and passing them to the indexer, which makes them available to search engine users; however, the rate at which these pages change forces the crawler to employ refresh strategies so that users receive updated/modified content. Furthermore, the deep web is the part of the web that holds abundant, high-quality data (compared to the normal/surface web) but is not technically accessible to a search engine’s crawler. Existing deep web crawl methods access deep web data by filling forms with a set of queries and harvesting the result pages generated from the underlying web databases. However, these methods cannot maintain the freshness of the local databases. Both the surface web and the deep web therefore need an incremental crawl alongside the normal crawl architecture. Crawling the deep web requires selecting a set of queries that covers almost all the records in the data source while keeping the overlap between retrieved records low, so that network utilization is reduced. Since an incremental crawl adds network traffic with every increment, such a reduced query set should be used to minimize network utilization.
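The query-selection problem described above is an instance of set covering: pick a small set of queries whose result sets together cover almost all records while re-downloading as few records as possible. The paper’s semantic weighted variant is not detailed in this abstract, so the following is only a minimal greedy sketch under stated assumptions; the `query_results` map and `coverage_target` parameter are illustrative, and in practice the result sets would be estimated from a sample of the data source.

```python
def greedy_query_cover(query_results, total_records, coverage_target=0.95):
    """Greedy set cover over candidate queries.

    query_results: dict mapping each candidate query string to the set of
    record ids it retrieves (estimated from a sample in practice).
    Returns the selected queries and the set of records they cover.
    """
    covered = set()
    selected = []
    remaining = dict(query_results)
    while remaining and len(covered) < coverage_target * total_records:
        # Greedy criterion: the query contributing the most *new* records.
        # Queries that overlap heavily with what is already covered score
        # low, which keeps redundant downloads (network cost) down.
        best = max(remaining, key=lambda q: len(remaining[q] - covered))
        if not remaining[best] - covered:
            break  # no remaining query adds any new record
        covered |= remaining.pop(best)
        selected.append(best)
    return selected, covered
```

With overlapping candidates, the broadest query is picked first and near-duplicate queries are never issued once the coverage target is met, which is exactly the overlap reduction the incremental crawl needs at every increment.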
Our contributions in this work are: the design of a probabilistic incremental crawler that handles the dynamic changes of surface web pages; an adaptation of this method, with a modification, to handle the dynamic changes in deep web databases; a new evaluation measure, the ‘Crawl-hit rate’, which measures the efficiency of the incremental crawler as the fraction of predicted crawl times at which a crawl was actually necessary; and a semantic weighted set covering algorithm that reduces the query set so that the network cost of every increment of the crawl is lowered without compromising the number of records retrieved. Our evaluation shows a good improvement in the freshness of the databases and a good Crawl-hit rate (83% for web pages and 81% for deep web databases) with lower overhead compared to the baseline.
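The exact probabilistic model is not given in this abstract; a common choice in the web-crawling literature is to treat page changes as a Poisson process, crawl when the estimated change probability crosses a threshold, and then score the schedule with the Crawl-hit rate described above (the fraction of predicted crawl times at which the content had in fact changed). The function names and the threshold below are illustrative assumptions, not the paper’s method.

```python
import math

def change_probability(rate, elapsed):
    # Poisson change model: probability of at least one change
    # within `elapsed` time units, given per-unit change rate `rate`.
    return 1.0 - math.exp(-rate * elapsed)

def next_crawl_delay(rate, threshold=0.5):
    # Crawl a page/database when its change probability first exceeds
    # the threshold, i.e. after -ln(1 - threshold) / rate time units.
    return -math.log(1.0 - threshold) / rate

def crawl_hit_rate(crawl_log):
    # crawl_log: one boolean per performed crawl, True when the content
    # had actually changed at the predicted crawl time (a "hit").
    return sum(crawl_log) / len(crawl_log)
```

Under this sketch, a reported Crawl-hit rate of 83% means 83 out of every 100 predicted crawls found genuinely changed content, so higher values mean less wasted network traffic.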



Acknowledgments

We thank Anna University, Chennai, Tamil Nadu, India for financially supporting this work.

Author information

Correspondence to G. Pavai.


About this article


Cite this article

Pavai, G., Geetha, T.V. Improving the freshness of the search engines by a probabilistic approach based incremental crawler. Inf Syst Front 19, 1013–1028 (2017). https://doi.org/10.1007/s10796-016-9701-7
