PDist-RIA Crawler: A Peer-to-Peer Distributed Crawler for Rich Internet Applications

Mirtaheri, Seyed M.; Bochmann, Gregor V.; Jourdan, Guy-Vincent; Onut, Iosif Viorel

doi:10.1007/978-3-319-11746-1_26


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8787)

Included in the following conference series:

International Conference on Web Information Systems Engineering


Abstract

Crawling Rich Internet Applications (RIAs) is important for ensuring their security and accessibility and for indexing them for search. To crawl a RIA, the crawler has to reach every application state and execute every application event. On a large RIA, this operation takes a long time. The previously published GDist-RIA Crawler proposes a distributed architecture that parallelizes the task of crawling RIAs and runs the crawl over multiple computers to reduce crawl time. In GDist-RIA Crawler, a centralized unit calculates the next task to execute, and tasks are dispatched to worker nodes for execution. This architecture is not scalable, because the centralized unit is bound to become a bottleneck as the number of nodes increases. This paper extends GDist-RIA Crawler and proposes a fully peer-to-peer, scalable architecture for crawling RIAs, called PDist-RIA Crawler. PDist-RIA does not have the same scalability limitations, while matching the performance of GDist-RIA. We describe a prototype showing the scalability and performance of the proposed solution.
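The abstract does not spell out how work is divided among peers, so the following is only a minimal sketch of one way a crawl could proceed without a central dispatcher, not the algorithm of PDist-RIA itself. It assumes a toy application model (`APP`), a hash-based ownership rule (`owner`), and an in-process loop standing in for network broadcast; all of these names and choices are hypothetical. The sketch also ignores the cost of navigating a real browser to a state before executing an event there.

```python
import zlib
from collections import deque

# Toy RIA model (hypothetical): state -> {event: resulting state}.
APP = {
    "s0": {"click_a": "s1", "click_b": "s2"},
    "s1": {"click_a": "s3", "click_b": "s0"},
    "s2": {"click_a": "s3"},
    "s3": {"click_a": "s0"},
}


def owner(state, event, n_peers):
    """Map a (state, event) pair to a peer with a stable hash, so every peer
    computes the same assignment without consulting a coordinator.
    This ownership rule is an assumption made for illustration."""
    return zlib.crc32(f"{state}/{event}".encode()) % n_peers


class Peer:
    def __init__(self, peer_id, n_peers):
        self.peer_id = peer_id
        self.n_peers = n_peers
        self.known_states = set()
        self.todo = deque()   # owned events not yet executed
        self.executed = []    # (state, event) pairs this peer has run

    def learn_state(self, state):
        """Record a state and queue the events in it that this peer owns."""
        if state in self.known_states:
            return
        self.known_states.add(state)
        for event in APP[state]:
            if owner(state, event, self.n_peers) == self.peer_id:
                self.todo.append((state, event))

    def step(self):
        """Run one owned event; return the state reached, or None if idle."""
        if not self.todo:
            return None
        state, event = self.todo.popleft()
        self.executed.append((state, event))
        return APP[state][event]


def crawl(n_peers, initial="s0"):
    """Simulate the peers in round-robin until no peer has work left."""
    peers = [Peer(i, n_peers) for i in range(n_peers)]
    for p in peers:
        p.learn_state(initial)
    progress = True
    while progress:
        progress = False
        for p in peers:
            reached = p.step()
            if reached is not None:
                progress = True
                for q in peers:  # stands in for a broadcast message
                    q.learn_state(reached)
    return peers


if __name__ == "__main__":
    for peer in crawl(3):
        print(peer.peer_id, peer.executed)
```

In this sketch every peer ends up knowing every state, and each event is executed exactly once by the peer that owns it, which mirrors the crawl goal stated in the abstract: reach every state and execute every event. Because ownership is computed locally, no single node decides the next task for the others.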



Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada
Seyed M. Mirtaheri, Gregor V. Bochmann & Guy-Vincent Jourdan
Security AppScan® Enterprise, IBM, 770 Palladium Dr, Ottawa, Ontario, Canada
Iosif Viorel Onut


Editor information

Editors and Affiliations

University of New South Wales, Sydney, Australia
Boualem Benatallah
Boston University, Boston, MA, USA
Azer Bestavros
Aristotle University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos & Athena Vakali
Victoria University, Footscray, VIC, Australia
Yanchun Zhang


About this paper

Cite this paper

Mirtaheri, S.M., Bochmann, G.V., Jourdan, GV., Onut, I.V. (2014). PDist-RIA Crawler: A Peer-to-Peer Distributed Crawler for Rich Internet Applications. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8787. Springer, Cham. https://doi.org/10.1007/978-3-319-11746-1_26


DOI: https://doi.org/10.1007/978-3-319-11746-1_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11745-4
Online ISBN: 978-3-319-11746-1
eBook Packages: Computer Science, Computer Science (R0)
