Abstract
Timely discovery and acquisition of information from incrementally growing data on the Internet is an active research topic in the big data era. This paper presents a distributed incremental information acquisition model for large-scale text data. To achieve a lower false positive rate and higher efficiency than the traditional Bloom filter, a distributed multidimensional Bloom filter is designed to deduplicate large-scale Web URL text data. Three Bloom-filter-based methods are compared in terms of false positive rate and response efficiency. The results show that the proposed model achieves a high duplicate removal rate with a lower false positive rate.
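For readers unfamiliar with the underlying data structure, the following is a minimal sketch of URL deduplication with a traditional single-node Bloom filter, i.e. the baseline the abstract refers to. It is not the distributed multidimensional filter proposed in the paper, whose design is not described here. The class name `BloomFilter`, the helpers `add_if_new` and `deduplicate`, the default parameters, and the SHA-256 double-hashing scheme are all assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Traditional single-node Bloom filter (illustrative; not the paper's variant)."""

    def __init__(self, expected_items: int, target_fp_rate: float) -> None:
        # Standard sizing formulas: m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(target_fp_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Assumed hashing scheme: derive k bit positions by double hashing
        # two 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, url: str) -> bool:
        """Record url; return True if at least one of its bits was previously unset."""
        is_new = False
        for pos in self._positions(url):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def deduplicate(urls, expected_items=1000, target_fp_rate=0.01):
    """Keep the first occurrence of each URL, subject to Bloom filter false positives."""
    seen = BloomFilter(expected_items, target_fp_rate)
    return [u for u in urls if seen.add_if_new(u)]
```

A repeated URL is always rejected, because a Bloom filter has no false negatives. A previously unseen URL may occasionally be rejected as a false positive, which is the error rate the compared methods aim to reduce.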









Acknowledgements
This work is supported by the National High Technology Research and Development 863 Program of China (No. 2015AA124102) and the Hebei Natural Science Foundation of China (No. F2015203280). Shengtao Sun also acknowledges the Chinese Scholarship Council (No. 201608130030) for a visiting scholarship at the University of Sydney. The authors thank Lin Zhang, Yi Zhao and Lili Wang of the Knowledge Engineering research group (KEG) at Yanshan University for their work.
Cite this article
Sun, S., Gong, J., Zomaya, A.Y. et al. A distributed incremental information acquisition model for large-scale text data. Cluster Comput 22 (Suppl 1), 2383–2394 (2019). https://doi.org/10.1007/s10586-017-1498-8
DOI: https://doi.org/10.1007/s10586-017-1498-8