Interval Estimation for Aggregate Queries on Incomplete Data

Zhang, An-Zhen; Li, Jian-Zhong; Gao, Hong

doi:10.1007/s11390-019-1970-4

Regular Paper
Published: 22 November 2019

Volume 34, pages 1203–1216, (2019)

Journal of Computer Science and Technology

Abstract

Incomplete data has been a longstanding issue in the database community, yet it remains poorly handled in both theory and practice. One common way to cope with missing values is to impute them (fill them in) as a preprocessing step before analysis. Unfortunately, no single imputation method can impute all missing values correctly in all cases, and users can hardly trust query results on such completed data without any confidence guarantee. In this paper, we propose to estimate the aggregate query result directly on incomplete data, rather than imputing the missing values. An interval estimate, composed of the upper and lower bounds of the aggregate query result over all possible interpretations of the missing values, is presented to end users. The ground-truth aggregate result is guaranteed to lie within the interval. We believe that decision support applications could benefit significantly from the estimation, since they can tolerate inexact answers as long as the results carry clearly defined semantics and guarantees. Our main techniques are parameter-free and assume no prior knowledge of the data distribution or the missingness mechanism. Experimental results are consistent with the theoretical results and suggest that the estimation is invaluable for better assessing the results of aggregate queries on incomplete data.
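
To make the notion of an interval over all possible interpretations concrete, the following is a minimal sketch and not the authors' algorithm, which this page does not describe. It assumes, purely for illustration, that every missing value (represented as None) lies in a known domain [lo, hi] and that the aggregate ranges over all rows with no selection predicate. The function names sum_interval and avg_interval are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def sum_interval(values: Sequence[Optional[float]],
                 lo: float, hi: float) -> Tuple[float, float]:
    """Bounds on SUM over every completion of the missing values.

    Assumption (not from the paper): each missing value, given as None,
    may take any value in the closed domain [lo, hi].
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    known = sum(v for v in values if v is not None)
    missing = sum(1 for v in values if v is None)
    # The smallest sum sets every missing value to lo, the largest to hi.
    return known + missing * lo, known + missing * hi


def avg_interval(values: Sequence[Optional[float]],
                 lo: float, hi: float) -> Tuple[float, float]:
    """Bounds on AVG over all rows; the row count is fixed, so the
    bounds are the SUM bounds divided by the number of rows."""
    n = len(values)
    if n == 0:
        raise ValueError("AVG is undefined on an empty column")
    s_lo, s_hi = sum_interval(values, lo, hi)
    return s_lo / n, s_hi / n


# Hypothetical column with two missing entries and an assumed domain [0, 100].
column = [10.0, None, 30.0, None]
print(sum_interval(column, 0.0, 100.0))
print(avg_interval(column, 0.0, 100.0))
```

Whatever the true missing values are, as long as they respect the assumed domain, the true SUM and AVG fall inside the returned intervals. That is the kind of guarantee the abstract describes, although the paper's own techniques may derive their bounds differently.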

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
An-Zhen Zhang, Jian-Zhong Li & Hong Gao

Corresponding author

Correspondence to An-Zhen Zhang.

Electronic supplementary material

ESM 1

(PDF 345 kb)

About this article

Cite this article

Zhang, AZ., Li, JZ. & Gao, H. Interval Estimation for Aggregate Queries on Incomplete Data. J. Comput. Sci. Technol. 34, 1203–1216 (2019). https://doi.org/10.1007/s11390-019-1970-4

Received: 26 December 2018
Revised: 12 September 2019
Published: 22 November 2019
Issue Date: November 2019
DOI: https://doi.org/10.1007/s11390-019-1970-4
