Abstract
The Web is flooded with data. While the crawler is responsible for accessing web pages and handing them to the indexer, which makes them available to search engine users, the rate at which these pages change makes it necessary for the crawler to employ refresh strategies so that users receive updated content. Furthermore, the deep web is the part of the web that holds abundant, high-quality data (compared to the normal/surface web) but is not technically accessible to a search engine's crawler. Existing deep web crawl methods access deep web data through the result pages generated by filling forms with a set of queries, thereby reaching the underlying web databases. However, these methods cannot maintain the freshness of the local databases. Both the surface web and the deep web therefore need an incremental crawl alongside the normal crawl architecture to overcome this problem. Crawling the deep web requires selecting an appropriate set of queries that covers almost all the records in the data source while keeping the overlap between the retrieved records low, so that network utilization is reduced. Since an incremental crawl increases network utilization with every increment, such a reduced query set should be used to minimize the network cost.
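The query-selection problem described above is essentially a set cover: each candidate query "covers" the records its result pages return, and the goal is to cover nearly all records with few, low-overlap queries. A minimal greedy sketch of that idea follows; the function name, the coverage threshold, and the toy data are illustrative assumptions, not the paper's semantic weighted algorithm.

```python
# Greedy set-cover sketch for deep-web query selection (illustrative only).
# Each candidate query maps to the set of record IDs its result page returns;
# we repeatedly pick the query that adds the most not-yet-covered records,
# so overlapping results contribute nothing and are naturally avoided.

def select_queries(query_results, coverage_target=0.95):
    """query_results: dict mapping query string -> set of record IDs."""
    all_records = set().union(*query_results.values())
    covered, chosen = set(), []
    while len(covered) < coverage_target * len(all_records):
        # benefit of a query = number of new (uncovered) records it retrieves
        best = max(query_results, key=lambda q: len(query_results[q] - covered))
        gain = query_results[best] - covered
        if not gain:
            break  # remaining queries would only re-download duplicates
        chosen.append(best)
        covered |= gain
    return chosen, covered

# Hypothetical sample: 4 candidate queries over a 7-record data source.
queries = {
    "python":  {1, 2, 3, 4},
    "java":    {3, 4, 5},
    "crawler": {5, 6},
    "web":     {1, 6, 7},
}
chosen, covered = select_queries(queries)
```

Here the greedy pass covers all seven records with three of the four queries, skipping "java" because its records arrive as overlap from the others.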
Our contributions in this work are: the design of a probabilistic-approach-based incremental crawler to handle the dynamic changes of surface web pages; an adaptation of this method to handle the dynamic changes in deep web databases; a new evaluation measure, the 'Crawl-hit rate', which evaluates the efficiency of the incremental crawler in terms of the number of times a crawl is actually necessary at the predicted time; and a semantic weighted set covering algorithm that reduces the query set so that the network cost of every crawl increment is lowered without compromising the number of records retrieved. The evaluation of the incremental crawler shows a good improvement in the freshness of the databases and a good Crawl-hit rate (83% for web pages and 81% for deep web databases) with lower overhead than the baseline.
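Reading the Crawl-hit rate as the fraction of scheduled crawls that coincide with an actual content change, it can be computed as in the sketch below. This is our reading of the definition given above; the function name, signature, and sample data are illustrative, not taken from the paper.

```python
def crawl_hit_rate(predicted_crawls, actual_changes):
    """Fraction of scheduled (predicted) crawls that found changed content.

    predicted_crawls: booleans per time slot, True where a crawl was scheduled.
    actual_changes:   booleans per time slot, True where the page/database
                      had really changed. A higher rate means fewer wasted crawls.
    """
    hits = sum(1 for p, c in zip(predicted_crawls, actual_changes) if p and c)
    scheduled = sum(predicted_crawls)
    return hits / scheduled if scheduled else 0.0

# Hypothetical five-slot schedule: 4 crawls scheduled, 3 coincide with a change.
rate = crawl_hit_rate(
    [True, True, False, True, True],   # crawls predicted by the model
    [True, True, True, False, True],   # slots where content actually changed
)
```

With these sample inputs the rate is 0.75; the 83% and 81% figures reported above would be the analogous rates measured over the paper's web-page and deep-web test sets.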
Acknowledgments
We thank Anna University, Chennai, Tamil Nadu, India for financially supporting this work.
Cite this article
Pavai, G., Geetha, T.V. Improving the freshness of the search engines by a probabilistic approach based incremental crawler. Inf Syst Front 19, 1013–1028 (2017). https://doi.org/10.1007/s10796-016-9701-7