Efficient histogram-based range query estimation for dirty data

  • Research Article
  • Frontiers of Computer Science

Abstract

In recent years, data quality issues have attracted wide attention. Data quality problems are mainly caused by dirty data. Many methods for managing dirty data have been proposed; one of them is the entity-based relational database, in which each tuple represents an entity. Traditional query optimization techniques are not suitable for this entity-based model, so new ones need to be developed. In this paper, we propose a new histogram-based query selectivity estimation strategy that focuses on the overestimation to which traditional methods lead. We prove that our approaches are unbiased. Experimental results on both real and synthetic data sets show that our approaches give good estimates with low error.
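For readers unfamiliar with histogram-based selectivity estimation, the following is a minimal sketch of the classic textbook technique on clean, single-valued data. It is not the estimator proposed in this paper, and it does not model the entity-based setting. The function names, the equi-width bucketing, and the uniform-spread assumption within each bucket are all illustrative assumptions.

```python
def build_equi_width_histogram(values, lo, hi, num_buckets):
    """Count how many values fall in each of num_buckets equal-width buckets over [lo, hi]."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that v == hi lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts, width


def estimate_range_count(counts, lo, width, a, b):
    """Estimate how many values lie in [a, b].

    Assumes values are spread uniformly inside each bucket, so a bucket
    contributes its count scaled by the fraction of its width that the
    query range overlaps.
    """
    total = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        total += count * overlap / width
    return total


# Hypothetical data: ten values over the domain [0, 10], five buckets.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
counts, width = build_equi_width_histogram(values, lo=0, hi=10, num_buckets=5)
estimate = estimate_range_count(counts, lo=0, width=width, a=2, b=6)
selectivity = estimate / len(values)
```

Dividing the estimated count by the relation size gives the selectivity that a query optimizer would use. The abstract's concern is that estimators of this traditional kind overestimate on entity-based data; the sketch above only shows the baseline.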

Acknowledgements

This paper was partially supported by the National Natural Science Foundation of China (Grant Nos. U1509216 and 61472099), National Sci-Tech Support Plan (2015BAH10F01), the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (LC2016026), and MOE–Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, China.

Author information

Corresponding author

Correspondence to Hongzhi Wang.

Additional information

Yan Zhang is an associate professor and master's supervisor at Harbin Institute of Technology, China. His research areas include data quality and bioinformatics.

Hongzhi Wang is a professor and doctoral supervisor at Harbin Institute of Technology, China. His research area is big data management, including data quality, XML data management, and graph management. He is a recipient of the CCF outstanding dissertation award, the Microsoft Fellowship, and the IBM PhD Fellowship.

Long Yang is a master's student at Harbin Institute of Technology, China. His research area is data quality management.

Jianzhong Li is a professor and doctoral supervisor at Harbin Institute of Technology, China. He is a senior member of CCF. His research interests include databases, parallel computing, and wireless sensor networks.

Cite this article

Zhang, Y., Wang, H., Yang, L. et al. Efficient histogram-based range query estimation for dirty data. Front. Comput. Sci. 12, 984–999 (2018). https://doi.org/10.1007/s11704-016-5551-1
