A parallel computing framework for big data

  • Research Article
  • Frontiers of Computer Science

Abstract

Big data has received great attention in both research and applications. However, most current efforts focus on systems and applications that address the challenges of “volume” and “velocity”; much less has been done on the theoretical foundations or on the challenge of “variety”. Based on metric-space indexing and computational complexity theory, we propose a parallel computing framework for big data. The framework consists of three components: a universal representation of big data that abstracts various data types into metric space, partitioning of big data based on pair-wise distances in metric space, and parallel computing of big data based on NC-class computing theory.
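The three components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation, which the abstract does not detail. It assumes edit distance as the metric on strings, a single hand-picked pivot, a split at the median distance to that pivot, and a thread pool as a stand-in for parallel execution. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric on strings (assumed example metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def partition_by_pivot(items, pivot, dist):
    """Split items into two groups around the median distance to the pivot."""
    dists = [dist(pivot, x) for x in items]
    radius = median(dists)
    inner = [x for x, d in zip(items, dists) if d <= radius]
    outer = [x for x, d in zip(items, dists) if d > radius]
    return radius, inner, outer


def nearest_in_partition(args):
    """Scan one partition and return (distance, item) for its closest item."""
    part, query, dist = args
    return min(((dist(query, x), x) for x in part),
               default=(float("inf"), None))


def parallel_nearest(items, query, dist, pivot):
    """Search each partition concurrently, then combine the partial results."""
    _, inner, outer = partition_by_pivot(items, pivot, dist)
    tasks = [(inner, query, dist), (outer, query, dist)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(nearest_in_partition, tasks))
    return min(results, key=lambda r: r[0])


if __name__ == "__main__":
    words = ["kitten", "sitting", "mitten", "bitten", "written", "smitten"]
    print(parallel_nearest(words, "fitten", edit_distance, pivot="kitten"))
```

Because the code touches the data only through `dist`, the same partitioning and search would apply to any data type for which a metric is supplied, which is the sense in which a metric-space abstraction addresses “variety”.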

Acknowledgements

The authors would like to thank Prof. Wenfei Fan, Prof. Benjamin Wah, and Prof. Guihai Chen for their comments. This work was partially supported by the National High Technology Research and Development Program of China (863 Program) (2015AA015305), the National Natural Science Foundation (NSF) of China (Grant Nos. U1301252, 61471243), NSF of Guangdong (2013B090500055, 2014A030313553), Guangdong Key Laboratory Project (2012A061400024), and NSF of Shenzhen (JCYJ20140418095735561, JCYJ20150529164656096, JCYJ20150731160834611).

Author information

Corresponding author

Correspondence to Rui Mao.

Additional information

Guoliang Chen received his BS degree from Xi’an Jiaotong University, China in 1961. He is currently the dean of the College of Computer Science and Software Engineering of Shenzhen University, China and the dean of the School of Software, University of Science and Technology of China. Prof. Chen is a member of the Chinese Academy of Sciences, China. He has received more than 20 national and provincial-level awards and authored about 20 books. His research interests include high performance computing and big data.

Rui Mao received his BS (1997) and MS (2000) in computer science from the University of Science and Technology of China, China, and his MS (2006) in statistics and PhD (2007) in computer science from the University of Texas at Austin, USA. After three years of working at the Oracle USA Corporation, he joined Shenzhen University (SZU), China in 2010. He is now an associate professor and an associate dean of the College of Computer Science and Software Engineering, SZU. His work on the pivot space model received the SISAP 2010 Best Paper award. His research interests include universal data management and analysis in metric space, and high performance computing.

Kezhong Lu received his BS and PhD degrees in computer software and theory from the University of Science and Technology of China, China in 2001 and 2006, respectively. He is currently a full professor in the College of Computer Science and Software Engineering, Shenzhen University (SZU), China. Prior to his current position, he worked at SZU as a lecturer from 2006 to 2007 and as an associate professor from 2007 to 2012. His research interests include wireless sensor networks, parallel computing, and big data.

About this article

Cite this article

Chen, G., Mao, R. & Lu, K. A parallel computing framework for big data. Front. Comput. Sci. 11, 608–621 (2017). https://doi.org/10.1007/s11704-016-5003-y

  • DOI: https://doi.org/10.1007/s11704-016-5003-y
