A distributed incremental information acquisition model for large-scale text data

Sun, Shengtao; Gong, Jibing; Zomaya, Albert Y.; Wu, Aizhi

doi:10.1007/s10586-017-1498-8

Published: 21 December 2017

Cluster Computing, Volume 22, pages 2383–2394 (2019)

Shengtao Sun^1,2,4 (ORCID: orcid.org/0000-0001-7311-0196),
Jibing Gong^1,4,
Albert Y. Zomaya^2 &
Aizhi Wu^3

Abstract

Discovering and acquiring information from incrementally growing Internet data in a timely manner is an active research topic in the big data era. This paper presents a distributed incremental information acquisition model for large-scale text data. To achieve a lower false positive rate and higher efficiency than the traditional Bloom filter, a distributed multidimensional Bloom filter is proposed for deduplicating large-scale Web URL text data. Three Bloom filter-based methods were compared in terms of false positive rate and response efficiency. The results show that the proposed model achieves a high duplicate removal rate with a lower false positive rate.
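
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a traditional single-node Bloom filter applied to URL deduplication. It is not the paper's distributed multidimensional design, whose details are not given here. The class and function names, the SHA-256 double-hashing scheme, and the example URLs are all assumptions made for illustration. The rate function uses the standard approximation (1 - e^(-kn/m))^k for m bits, k hash functions and n inserted items.

```python
import hashlib
import math


class BloomFilter:
    """Traditional single-node Bloom filter (illustrative baseline only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing (an assumed scheme): derive k bit positions from
        # two 64-bit slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def expected_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation of the false positive rate after num_items inserts."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def deduplicate(urls, bloom: BloomFilter):
    """Return URLs not already reported by the filter, adding each one as it passes."""
    fresh = []
    for url in urls:
        if url not in bloom:
            bloom.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    bloom = BloomFilter(num_bits=1 << 16, num_hashes=4)
    crawl = ["http://example.org/a", "http://example.org/b", "http://example.org/a"]
    print(deduplicate(crawl, bloom))
    print(expected_false_positive_rate(1 << 16, 4, 1000))
```

A false positive in this setting means a new URL is wrongly treated as a duplicate and skipped, which is why the false positive rate is one of the two metrics the abstract reports.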

Acknowledgements

This work is supported by the National High Technology Research and Development 863 Program of China (No. 2015AA124102) and the Hebei Natural Science Foundation of China (No. F2015203280). Shengtao Sun also acknowledges the Chinese Scholarship Council (No. 201608130030) for a visiting scholarship at the University of Sydney. The authors thank Lin Zhang, Yi Zhao and Lili Wang of the Knowledge Engineering research group (KEG) at Yanshan University for their work.

Author information

Authors and Affiliations

1. School of Information Science and Engineering, Yanshan University, Qinhuangdao, 066004, Hebei, People’s Republic of China
   Shengtao Sun & Jibing Gong
2. School of Information Technologies, University of Sydney, Sydney, NSW, 2006, Australia
   Shengtao Sun & Albert Y. Zomaya
3. College of Vehicle and Energy, Yanshan University, Qinhuangdao, People’s Republic of China
   Aizhi Wu
4. Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, and Key Laboratory for Software Engineering of Hebei Province, Qinhuangdao, People’s Republic of China
   Shengtao Sun & Jibing Gong

Corresponding author

Correspondence to Jibing Gong.

About this article

Cite this article

Sun, S., Gong, J., Zomaya, A.Y. et al. A distributed incremental information acquisition model for large-scale text data. Cluster Comput 22 (Suppl 1), 2383–2394 (2019). https://doi.org/10.1007/s10586-017-1498-8

Received: 29 August 2017
Revised: 06 November 2017
Accepted: 07 December 2017
Published: 21 December 2017
Issue Date: 16 January 2019
DOI: https://doi.org/10.1007/s10586-017-1498-8
