Abstract
The Web is flooded with data. While the crawler is responsible for accessing web pages and handing them to the indexer, which makes them available to search engine users, the rate at which these pages change makes it necessary for the crawler to employ refresh strategies so that users receive updated content. Furthermore, the deep web is the part of the web that holds abundant, high-quality data (compared to the normal/surface web) but is not technically accessible to a search engine's crawler. Existing deep web crawl methods access deep web data through the result pages generated by filling forms with a set of queries, thereby reaching the underlying web databases. However, these methods cannot maintain the freshness of the local databases. Both the surface web and the deep web therefore need an incremental crawl alongside the normal crawl architecture to overcome this problem. Crawling the deep web requires selecting an appropriate set of queries that covers almost all the records in the data source while keeping the overlap between the retrieved records low, so that network utilization is reduced. Since an incremental crawl increases network utilization with every increment, such a reduced query set should be used to minimize the network cost.
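The query-selection problem described above is essentially a set cover: each candidate query "covers" the records its result pages return, and the goal is to cover nearly all records with few, low-overlap queries. A minimal greedy sketch of that idea follows; the function name, the coverage threshold, and the toy data are illustrative assumptions, not the paper's semantic weighted algorithm.

```python
# Greedy set-cover sketch for deep-web query selection (illustrative only).
# Each candidate query maps to the set of record IDs its result page returns;
# we repeatedly pick the query that adds the most not-yet-covered records,
# so overlapping results contribute nothing and are naturally avoided.

def select_queries(query_results, coverage_target=0.95):
    """query_results: dict mapping query string -> set of record IDs."""
    all_records = set().union(*query_results.values())
    covered, chosen = set(), []
    while len(covered) < coverage_target * len(all_records):
        # benefit of a query = number of new (uncovered) records it retrieves
        best = max(query_results, key=lambda q: len(query_results[q] - covered))
        gain = query_results[best] - covered
        if not gain:
            break  # remaining queries would only re-download duplicates
        chosen.append(best)
        covered |= gain
    return chosen, covered

# Hypothetical sample: 4 candidate queries over a 7-record data source.
queries = {
    "python":  {1, 2, 3, 4},
    "java":    {3, 4, 5},
    "crawler": {5, 6},
    "web":     {1, 6, 7},
}
chosen, covered = select_queries(queries)
```

Here the greedy pass covers all seven records with three of the four queries, skipping "java" because its records arrive as overlap from the others.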
Our contributions in this work are: the design of a probabilistic-approach-based incremental crawler to handle the dynamic changes of surface web pages; an adaptation of this method to handle the dynamic changes in deep web databases; a new evaluation measure, the 'Crawl-hit rate', which evaluates the efficiency of the incremental crawler in terms of the number of times a crawl is actually necessary at the predicted time; and a semantic weighted set covering algorithm that reduces the query set so that the network cost of every crawl increment is lowered without compromising the number of records retrieved. The evaluation of the incremental crawler shows a good improvement in the freshness of the databases and a good Crawl-hit rate (83% for web pages and 81% for deep web databases) with lower overhead than the baseline.
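Reading the Crawl-hit rate as the fraction of scheduled crawls that coincide with an actual content change, it can be computed as in the sketch below. This is our reading of the definition given above; the function name, signature, and sample data are illustrative, not taken from the paper.

```python
def crawl_hit_rate(predicted_crawls, actual_changes):
    """Fraction of scheduled (predicted) crawls that found changed content.

    predicted_crawls: booleans per time slot, True where a crawl was scheduled.
    actual_changes:   booleans per time slot, True where the page/database
                      had really changed. A higher rate means fewer wasted crawls.
    """
    hits = sum(1 for p, c in zip(predicted_crawls, actual_changes) if p and c)
    scheduled = sum(predicted_crawls)
    return hits / scheduled if scheduled else 0.0

# Hypothetical five-slot schedule: 4 crawls scheduled, 3 coincide with a change.
rate = crawl_hit_rate(
    [True, True, False, True, True],   # crawls predicted by the model
    [True, True, True, False, True],   # slots where content actually changed
)
```

With these sample inputs the rate is 0.75; the 83% and 81% figures reported above would be the analogous rates measured over the paper's web-page and deep-web test sets.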
Acknowledgments
We thank Anna University, Chennai, Tamil Nadu, India for financially supporting this work.
Cite this article
Pavai, G., Geetha, T.V. Improving the freshness of the search engines by a probabilistic approach based incremental crawler. Inf Syst Front 19, 1013–1028 (2017). https://doi.org/10.1007/s10796-016-9701-7