Tractable queries on big data via preprocessing with logarithmic-size output

  • Regular Paper
Knowledge and Information Systems

Abstract

To provide a dichotomy between those queries that are feasible on big data after appropriate preprocessing and those for which preprocessing does not help, Fan et al. developed the \(\sqcap \)-tractability theory, which provides a formal foundation for the tractability of query classes in the context of big data. Inspired by some technologies used to deal with big data, we introduce a novel notion of \(\sqcap '\)-tractability in this paper. We place a restriction on preprocessing functions that limits them to producing relatively short outputs, of size at most logarithmic in the size of the inputs. We define a complexity class to denote the classes of Boolean queries that are \(\sqcap '\)-tractable and conclude that it is properly contained in the class of \(\sqcap \)-tractable query classes, by exhibiting a \(\sqcap \)-tractable query class that is not \(\sqcap '\)-tractable. Using an existing reduction, which does not allow re-factorizing data and query parts, we define complete query classes for the complexity class and give an efficient way to detect such query classes. We also investigate the query classes that can be made \(\sqcap '\)-tractable and prove that all PTIME classes of Boolean queries can be made \(\sqcap '\)-tractable.
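As an informal illustration of the restriction on preprocessing output, the following toy sketch (not taken from the paper) shows a preprocessing step whose output is logarithmic in the size of the input. The query class ("does the dataset hold at least k items?"), the function names `preprocess`, `summary_bits`, and `answer`, and the choice of an item count as the summary are all assumptions made for this example; the formal definitions additionally constrain the complexity of preprocessing and of query answering.

```python
from typing import List


def preprocess(data: List[int]) -> int:
    """Hypothetical preprocessing function: keep only the number of items.

    The summary is a single integer n = len(data), which needs about
    log2(n) bits, i.e. it is logarithmic in the size of the input.
    """
    return len(data)


def summary_bits(summary: int) -> int:
    """Number of bits needed to store the summary (at least 1)."""
    return max(1, summary.bit_length())


def answer(summary: int, k: int) -> bool:
    """Answer the Boolean query 'does the dataset hold at least k items?'

    Only the preprocessed summary is consulted; the original data is
    never touched again after preprocessing.
    """
    return summary >= k


# One pass over the data, then any number of queries on the short summary.
data = list(range(1_000_000))
summary = preprocess(data)
has_half_million = answer(summary, 500_000)
has_two_million = answer(summary, 2_000_000)
```

The point of the sketch is only the size relationship: the data has a million entries, while the summary used to answer every query in this toy class fits in a few machine bits.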

Notes

  1. https://support.google.com/gsa/answer/4411411/.

References

  1. Cao Y, Fan W, Wo T, Yu W (2014) Bounded conjunctive queries. PVLDB 7(12):1231–1242

  2. Chen CP, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347

  3. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

  4. Fan W, Huai J (2014) Querying big data: bridging theory and practice. J Comput Sci Technol 29(5):849–869

  5. Fan W, Li J, Wang X, Wu Y (2012) Query preserving graph compression. In: Proceedings of the ACM 2012 international conference on management of data, pp 157–168

  6. Fan W, Geerts F, Neven F (2013) Making queries tractable on big data with preprocessing: through the eyes of complexity theory. PVLDB 6(9):685–696

  7. Fan W, Geerts F, Libkin L (2014) On scale independence for querying big data. In: Proceedings of the ACM 33rd symposium on principles of database systems, pp 51–62

  8. Fan W, Wang X, Wu Y (2014) Querying big graphs within bounded resources. In: Proceedings of the ACM 2014 international conference on management of data, pp 301–312

  9. Fiori A, Mignone A, Rospo G (2016) Decoclu: density consensus clustering approach for public transport data. Inf Sci 328:378–388

  10. Gani A, Siddiqa A, Shamshirband S, Hanum F (2016) A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46(2):241–284

  11. Greenlaw R (1993) Breadth-depth search is P-complete. Parallel Process Lett 3(3):209–222

  12. Greenlaw R, Hoover HJ, Ruzzo WL (1995) Limits to parallel computation: P-completeness theory. Oxford University Press, New York

  13. Hamooni H, Mueen A, Neel A (2016) Phoneme sequence recognition via DTW-based classification. Knowl Inf Syst 48(2):253–275

  14. Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of “big data” on cloud computing: review and open research issues. Inf Syst 47:98–115

  15. Jagadish HV, Gehrke J, Labrinidis A, Papakonstantinou Y, Patel JM, Ramakrishnan R, Shahabi C (2014) Big data and its technical challenges. Commun ACM 57(7):86–94

  16. Jung G, Gnanasambandam N, Mukherjee T (2012) Synchronous parallel processing of big-data analytics services to optimize performance in federated clouds. In: IEEE proceedings of the 5th international conference on cloud computing, pp 811–818

  17. Kang U, Tong H, Sun J, Lin C, Faloutsos C (2011) GBase: a scalable and general graph management system. In: ACM proceedings of the 17th international conference on knowledge discovery and data mining, pp 1091–1099

  18. Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems. Manning Publications Co, Greenwich

  19. Michael K, Miller KW (2013) Big data: new opportunities and new challenges. Computer 46(6):22–24

  20. Mozafari B, Zeng K, D’Antoni L, Zaniolo C (2013) High-performance complex event processing over hierarchical data. ACM T Database Syst 38(4):21

  21. National Research Council (2013) Frontiers in massive data analysis. The National Academies Press, Washington

  22. Papadimitriou CH (2003) Computational complexity. In: Encyclopedia of computer science. Wiley, Chichester, pp 260–265

  23. Ramentol E, Caballero Y, Bello R, Herrera F (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 33(2):245–265

  24. del Río S, López V, Benítez JM, Herrera F (2014) On the use of MapReduce for imbalanced big data using random forest. Inf Sci 285:112–137

  25. Sarma AD, Lee H, Gonzalez H, Madhavan J, Halevy AY (2013) Consistent thinning of large geographical data for map visualization. ACM T Database Syst 38(4):22

  26. Vardi MY (1982) The complexity of relational query languages. In: Proceedings of the 14th Annual ACM Symposium on Theory of Computing, pp 137–146

  27. Wu X, Zhu X, Wu G, Ding W (2014) Data mining with big data. IEEE T Knowl Data En 26(1):97–107

  28. Yang C, Zhang X, Zhong C, Liu C, Pei J, Ramamohanarao K, Chen J (2014) A spatiotemporal compression based approach for efficient big data processing on cloud. J Comput Syst Sci 80(8):1563–1583

Acknowledgements

The authors are very grateful to Professor Wenfei Fan and the anonymous reviewers for their invaluable suggestions. This work was supported by the National Natural Science Foundation of China (Grants Nos. 61370053, 61572003, and 61772035).

Author information

Corresponding author

Correspondence to Yongzhi Cao.

About this article

Cite this article

Yang, J., Wang, H. & Cao, Y. Tractable queries on big data via preprocessing with logarithmic-size output. Knowl Inf Syst 56, 141–163 (2018). https://doi.org/10.1007/s10115-017-1092-7

  • DOI: https://doi.org/10.1007/s10115-017-1092-7
