Abstract
Processing unstructured big data requires efficient computational styles that can rapidly analyze continuously changing data. Incremental processing is a promising technique for updating search results without re-computing the whole process. PageRank, one of the most important mining algorithms, requires both incremental processing and iterative computation to keep its results up to date. In this paper, a lightweight and efficient PageRank algorithm named DiffPageRank is proposed, which uses differential dataflow in the MapReduce programming style. The proposed approach differs from standard incremental computing in two respects: (1) in DiffPageRank the updated state depends on a partial order of changes, whereas standard incremental computing applies changes in a totally ordered sequence; and (2) in DiffPageRank a set of updates rebuilds each version, whereas in incremental systems every single update continually brings the current version state up to date. DiffPageRank is compared with two other MapReduce implementations of PageRank, standard and incremental, in terms of processing time, CPU utilization, and the update speed of the mining results. The results show that when the changes to the input data are large, the differential method is considerably more time-efficient than the standard and incremental PageRank approaches, by up to 61% and 86%, respectively. DiffPageRank is also compared with two state-of-the-art incremental graph processing benchmarks on the Orkut and Twitter datasets in terms of execution time and number of updates. These comparisons show that the differential MapReduce approach reduces both the number of updates and the total execution time.
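
The contrast in points (1) and (2) can be illustrated with a minimal Python sketch. It is not the DiffPageRank implementation; it only shows the general differential dataflow idea of storing differences indexed by partially ordered timestamps and rebuilding a version by summing every difference at or before that timestamp. The `(epoch, iteration)` timestamp layout, the function names `leq` and `version_at`, and all numeric rank values are assumptions made for illustration.

```python
from collections import defaultdict


def leq(a, b):
    """Product partial order on (epoch, iteration) timestamps."""
    return a[0] <= b[0] and a[1] <= b[1]


def version_at(diffs, t):
    """Rebuild the state at timestamp t from the set of differences
    whose timestamps are less than or equal to t in the partial order."""
    state = defaultdict(float)
    for ts, delta in diffs.items():
        if leq(ts, t):
            for key, change in delta.items():
                state[key] += change
    # Drop entries whose accumulated value cancelled out to zero.
    return {k: v for k, v in state.items() if abs(v) > 1e-12}


# Hypothetical rank differences, indexed by (input epoch, iteration).
diffs = {
    (0, 0): {"a": 0.5, "b": 0.5},       # initial ranks for the first input
    (0, 1): {"a": 0.25, "b": -0.25},    # change made by one iteration
    (1, 0): {"a": -0.25, "c": 0.25},    # change caused by new input data
    (1, 1): {"a": -0.125, "c": 0.125},  # correction for epoch 1, iteration 1
}

if __name__ == "__main__":
    # (0, 1) and (1, 0) are incomparable: neither precedes the other.
    print(leq((0, 1), (1, 0)), leq((1, 0), (0, 1)))
    print(version_at(diffs, (0, 1)))
    print(version_at(diffs, (1, 1)))
```

In this sketch the version at `(1, 1)` is rebuilt from all four differences, while the version at `(0, 1)` ignores everything recorded for epoch 1. A system that applied changes in one total order would instead have to replay them sequentially on top of the current state.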

Cite this article
Nooraei Abadeh, M., Mirzaie, M. DiffPageRank: an efficient differential PageRank approach in MapReduce. J Supercomput 77, 188–211 (2021). https://doi.org/10.1007/s11227-020-03265-3
DOI: https://doi.org/10.1007/s11227-020-03265-3