Skip to main content
Log in

A distributed incremental information acquisition model for large-scale text data

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

Timely discovering and acquiring information from incremental data on the Internet is a hot topic in a big data era. This paper presents a distributed incremental information acquisition model for large-scale text data. To obtain a lower false positive rate and higher efficiency of the traditional Bloom filter, a distributed multidimensional Bloom filter is designed and proposed to cope with the deduplication of large-scale Web URL text data. Three methods related to Bloom filter were compared based on the false positive rate and response efficiency. The results show that the distributed incremental information acquisition model for large-scale text data can achieve a high duplicate removal rate with a lower false positive rate.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

References

  1. Wang, L., Song, W., Liu, P.: Link the remote sensing big data to the image features via wavelet transformation. Clust. Comput. 19(2), 793–810 (2016)

    Article  Google Scholar 

  2. Ranjan, R., Georgakopoulos, D., Wang, L.: A note on software tools and technologies for delivering smart media-optimized big data applications in the cloud. Computing 98, 1–5 (2016)

    Article  MathSciNet  MATH  Google Scholar 

  3. Chen, D., Li, X., Wang, L., et al.: Fast and scalable multi-way analysis of massive neural data. IEEE Trans. Comput. 64(3), 707–719 (2015)

    Article  MathSciNet  MATH  Google Scholar 

  4. Deng, Z., Han, W., Wang, L., et al.: An efficient online direction-preserving compression approach for trajectory streaming data. Fut. Gener. Comput. Syst. 68, 150–162 (2017)

    Article  Google Scholar 

  5. Li, J., Zhang, P., Li, Y., et al.: A data-check based distributed storage model for storing hot temporary data. Fut. Gener. Comput. Syst. 73, 13–21 (2017)

    Article  Google Scholar 

  6. Melnik, S., Gubarev, A., Long, J.J., et al.: Dremel: interactive analysis of web-scale datasets. Commun. ACM 54, 114–123 (2011)

    Article  Google Scholar 

  7. Voras, I., Zagar, M.: Adapting the Bloom filter to multithreaded environments. In: The 15th IEEE Mediterranean Electrotechnical Conference, Valletta, Malta, pp. 1488–1493 (2010)

  8. Ma, Y., Wang, L., Zomaya, A.Y., et al.: Task-tree based large-scale mosaicking for massive remote sensed imageries with dynamic dag scheduling. IEEE Trans. Parallel Distrib. Syst. 25(8), 2126–2137 (2014)

    Article  Google Scholar 

  9. Xu, Z., Mei, L., Hu, C., Liu, Y.: The big data analytics and applications of the surveillance system using video structured description technology. Clust. Comput. 19(3), 1283–1292 (2016)

    Article  Google Scholar 

  10. Xiang, Z., Schwartz, Z., Gerdes Jr., J.H., Uysal, M.: What can big data and text analytics tell us about hotel guest experience and satisfaction? Int. J. Hosp. Manag. 44, 120–130 (2015)

    Article  Google Scholar 

  11. Jensen, K., Nguyen, H.T., Van Do, T., Arnes, A.: A big data analytics approach to combat telecommunication vulnerabilities. Clust. Comput. 20(3), 2363–2374 (2017)

    Article  Google Scholar 

  12. Ma, L., Zhang, Y.: Using Word2Vec to process big text data. In: IEEE International Conference on Big Data, Santa Clara, pp. 2895–2897 (2015)

  13. Schmidt, K., Bachle, S., Scholl, P., Nold, G.: Big Scale Text Analytics and Smart Content Navigation. Enabling Real-Time Business Intelligence, Lecture Notes in Business Information Processing, vol. 206, pp. 167–170. Springer, Berlin (2015)

    Google Scholar 

  14. Deng, Z., Wu, X., Wang, L., et al.: Parallel processing of dynamic continuous qeries over streaming data flows. IEEE Trans. Parallel Distrib. Syst. 26(3), 834–846 (2015)

    Article  Google Scholar 

  15. Chen, D., Wang, L., Zomaya, A.Y., et al.: Parallel simulation of complex evacuation scenarios with adaptive agent models. IEEE Trans. Parallel Distrib. Syst. 26(3), 847–857 (2015)

    Article  Google Scholar 

  16. Cho, J., Garcia-Molina, H.: Dealing with web data: history and look ahead. Proc. VLDB Endow. 3(1–2), 4–4 (2010)

    Article  Google Scholar 

  17. Sharma, D.K., Sharma, A.K.: A novel architecture for deep web crawler. Int. J. Inf. Technol. Web Eng. 6(1), 25–48 (2011)

    Article  Google Scholar 

  18. Zhang, Z., Dong, G., Peng, Z., et al.: A framework for incremental deep web crawler based on URL classification. In: The International Conference on Web Information Systems and Mining, Taiyuan, China, pp. 302–310 (2011)

  19. Guo, H., Chen, Q., Xin, C., Wang, X., Bi, Ye: A real environment oriented parallel duplicates removal approach for large scale Chinese webpages. J. Comput. Inf. Syst. 7(5), 1420–1427 (2011)

    Google Scholar 

  20. Zhang, F., Liu, M., Gui, F., Shen, W., Shami, Abdallah, Ma, Yunlong: A distributed frequent itemset mining algorithm using Spark for Big Data analytics. Clust. Comput. 18(4), 1493–1501 (2015)

    Article  Google Scholar 

  21. Urbani, J., Kotoulas, S., Maassen, J., Van Harmelen, F., Bal, H.: WebPIE: a web-scale parallel inference engine using MapReduce. Web Semant. 10, 59–75 (2012)

    Article  Google Scholar 

  22. Ben, X., Jia, D., Yuan, L.: A three layer distributed architecture for large-scale duplicated web page detection. Comput. Digital Eng. 10, 1751–1755 (2015)

    Google Scholar 

  23. Jose, J., Subramoni, H., Luo, M., et al.: Memcached design on high performance RDMA capable interconnects. In: The International Conference on Parallel Processing, Taipei, Taiwan, pp. 743–752 (2011)

  24. Josiah, L.: Garlson: Redis in Action. Manning Publications Co., Greenwich (2013)

    Google Scholar 

  25. Subramanyam, R., Gupta, I., Leslie, L.M., Wang, W.: Idempotent distributed counters using a forgetful bloom filter. Clust. Comput. 19(2), 879–892 (2016)

    Article  Google Scholar 

  26. Tarkoma, S., Rothenberg, C., Lagerspetz, E.: Theory and practice of bloom filters for distributed systems. IEEE Commun. Surv. Tutor. 14(1), 131–155 (2011)

    Article  Google Scholar 

  27. Crainiceanu, A., Lemire, D.: Bloofi: multidimensional Bloom filters. Inf. Syst. 54, 311–324 (2015)

    Article  Google Scholar 

  28. Wu, Y., Huang, H., Zhou, X., et al.: A space-saving URL duplication removal method for web crawler. J. Inf. Comput. Sci. 9(5), 1195–1203 (2012)

    Google Scholar 

  29. Han, H., Jung, H., Eom, H., et al.: Scatter-Gather-Merge: an efficient star-join query processing algorithm for data-parallel frameworks. Clust. Comput. 14(2), 183–197 (2011)

    Article  Google Scholar 

  30. Alewiwi, M., Orencik, C., Savas, E.: Efficient top-k similarity document search utilizing distributed file systems and cosine similarity. Clust. Comput. 19(1), 109–126 (2016)

    Article  Google Scholar 

Download references

Acknowledgements

This work is supported by the National High Technology Research and Development 863 Program of China (No. 2015AA124102) and the Hebei Natural Science Foundation of China (No. F2015203280). Shengtao Sun also acknowledges the Chinese Scholarship Council (No. 201608130030) for a visiting scholarship at University of Sydney. The authors would like to show great appreciation for the works done by Lin Zhang, Yi Zhao and Lili Wang from the research group of Knowledge Engineering (KEG), in Yanshan University.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jibing Gong.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sun, S., Gong, J., Zomaya, A.Y. et al. A distributed incremental information acquisition model for large-scale text data. Cluster Comput 22 (Suppl 1), 2383–2394 (2019). https://doi.org/10.1007/s10586-017-1498-8

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-017-1498-8

Keywords

Navigation