Tractable queries on big data via preprocessing with logarithmic-size output

Yang, Jiannan; Wang, Hanpin; Cao, Yongzhi

doi:10.1007/s10115-017-1092-7

Regular Paper
Published: 19 August 2017

Volume 56, pages 141–163, (2018)
Knowledge and Information Systems

Abstract

To separate the queries that become feasible on big data after appropriate preprocessing from those for which preprocessing does not help, Fan et al. developed the \(\sqcap\)-tractability theory, which provides a formal foundation for the tractability of query classes in the context of big data. Inspired by some technologies used to deal with big data, we introduce a novel notion of \(\sqcap'\)-tractability in this paper. We restrict preprocessing functions to produce relatively short outputs, of size at most logarithmic in the size of the input. We define a complexity class consisting of the classes of Boolean queries that are \(\sqcap'\)-tractable and show that it is properly contained in the class of \(\sqcap\)-tractable query classes, by exhibiting a \(\sqcap\)-tractable query class that is not \(\sqcap'\)-tractable. Using an existing reduction, which does not allow re-factorizing the data and query parts, we define complete query classes for this complexity class and give an efficient way to detect such query classes. We also investigate the query classes that can be made \(\sqcap'\)-tractable and prove that all PTIME classes of Boolean queries can be made \(\sqcap'\)-tractable.
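
As a rough illustration of what a preprocessing step with logarithmic-size output can look like, the sketch below uses a toy query class that is not taken from the paper. The function names `preprocess` and `answer`, and the Boolean query "does the dataset contain at least k records?", are assumptions introduced here for illustration only. The preprocessing keeps nothing but the record count written in binary, so its output length grows logarithmically with the number of records, and every query in the class is then answered from that short digest without touching the data again.

```python
from typing import Sequence


def preprocess(data: Sequence[int]) -> str:
    """Hypothetical preprocessing function (not from the paper).

    Keeps only the number of records, written in binary, so the output
    has floor(log2(n)) + 1 characters for n >= 1 records.
    """
    return format(len(data), "b")


def answer(k: int, digest: str) -> bool:
    """Answer the toy Boolean query 'are there at least k records?'
    using only the short digest produced by preprocess."""
    return int(digest, 2) >= k


# One-off preprocessing of a large dataset, then cheap repeated queries.
data = list(range(1_000_000))
digest = preprocess(data)
print(len(data), len(digest))
print(answer(999_999, digest), answer(1_000_001, digest))
```

The example only shows the shape of the restriction (a short digest that suffices for a whole query class); it says nothing about which realistic query classes admit such a digest, which is the question the paper studies.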

Notes

https://support.google.com/gsa/answer/4411411/.

References

Cao Y, Fan W, Wo T, Yu W (2014) Bounded conjunctive queries. PVLDB 7(12):1231–1242
Chen CP, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Fan W, Huai J (2014) Querying big data: bridging theory and practice. J Comput Sci Technol 29(5):849–869
Fan W, Li J, Wang X, Wu Y (2012) Query preserving graph compression. In: Proceedings of the ACM 2012 international conference on management of data, pp 157–168
Fan W, Geerts F, Neven F (2013) Making queries tractable on big data with preprocessing: through the eyes of complexity theory. PVLDB 6(9):685–696
Fan W, Geerts F, Libkin L (2014) On scale independence for querying big data. In: Proceedings of the ACM 33rd symposium on principles of database systems, pp 51–62
Fan W, Wang X, Wu Y (2014) Querying big graphs within bounded resources. In: Proceedings of the ACM 2014 international conference on management of data, pp 301–312
Fiori A, Mignone A, Rospo G (2016) Decoclu: density consensus clustering approach for public transport data. Inf Sci 328:378–388
Gani A, Siddiqa A, Shamshirband S, Hanum F (2016) A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46(2):241–284
Greenlaw R (1993) Breadth-depth search is P-complete. Parallel Process Lett 3(03):209–222
Greenlaw R, Hoover HJ, Ruzzo WL (1995) Limits to parallel computation: P-completeness theory. Oxford University Press, New York
Hamooni H, Mueen A, Neel A (2016) Phoneme sequence recognition via dtw-based classification. Knowl Inf Syst 48(2):253–275
Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of “big data” on cloud computing: review and open research issues. Inf Syst 47:98–115
Jagadish HV, Gehrke J, Labrinidis A, Papakonstantinou Y, Patel JM, Ramakrishnan R, Shahabi C (2014) Big data and its technical challenges. Commun ACM 57(7):86–94
Jung G, Gnanasambandam N, Mukherjee T (2012) Synchronous parallel processing of big-data analytics services to optimize performance in federated clouds. In: IEEE proceedings of the 5th international conference on cloud computing, pp 811–818
Kang U, Tong H, Sun J, Lin C, Faloutsos C (2011) Gbase: A scalable and general graph management system. In: ACM proceedings of the 17th international conference on knowledge discovery and data mining, pp 1091–1099
Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems. Manning Publications Co, Greenwich
Michael K, Miller KW (2013) Big data: new opportunities and new challenges. Computer 46(6):22–24
Mozafari B, Zeng K, D’Antoni L, Zaniolo C (2013) High-performance complex event processing over hierarchical data. ACM T Database Syst 38(4):21
National Research Council (2013) Frontiers in massive data analysis. The National Academies Press, Washington
Papadimitriou CH (2003) Computational complexity. In: Encyclopedia of computer science. Wiley, Chichester, pp 260–265
Ramentol E, Caballero Y, Bello R, Herrera F (2012) Smote-rsb*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using smote and rough sets theory. Knowl Inf Syst 33(2):245–265
del Río S, López V, Benítez JM, Herrera F (2014) On the use of mapreduce for imbalanced big data using random forest. Inf Sci 285:112–137
Sarma AD, Lee H, Gonzalez H, Madhavan J, Halevy AY (2013) Consistent thinning of large geographical data for map visualization. ACM T Database Syst 38(4):22
Vardi MY (1982) The complexity of relational query languages. In: Proceedings of the 14th Annual ACM Symposium on Theory of Computing, pp 137–146
Wu X, Zhu X, Wu G, Ding W (2014) Data mining with big data. IEEE T Knowl Data En 26(1):97–107
Yang C, Zhang X, Zhong C, Liu C, Pei J, Ramamohanarao K, Chen J (2014) A spatiotemporal compression based approach for efficient big data processing on cloud. J Comput Syst Sci 80(8):1563–1583

Acknowledgements

The authors are very grateful to Professor Wenfei Fan and the anonymous reviewers for their invaluable suggestions. This work was supported by the National Natural Science Foundation of China (Grants Nos. 61370053, 61572003, and 61772035).

Author information

Authors and Affiliations

Key Laboratory of High Confidence Software Technologies (MOE), School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Jiannan Yang, Hanpin Wang & Yongzhi Cao

Corresponding author

Correspondence to Yongzhi Cao.

About this article

Cite this article

Yang, J., Wang, H. & Cao, Y. Tractable queries on big data via preprocessing with logarithmic-size output. Knowl Inf Syst 56, 141–163 (2018). https://doi.org/10.1007/s10115-017-1092-7

Received: 21 August 2016
Revised: 07 July 2017
Accepted: 28 July 2017
Published: 19 August 2017
Issue Date: July 2018
DOI: https://doi.org/10.1007/s10115-017-1092-7
