Distributed probabilistic top-k dominating queries over uncertain databases

  • Regular Paper
  • Knowledge and Information Systems

Abstract

In many real-world applications, such as business planning and sensor data monitoring, an important yet challenging task is to rank objects (e.g., products, documents, or spatial objects) by their ranking scores and efficiently return the objects with the highest scores. In practice, because data sources are unreliable, many real-world objects contain noise and are therefore imprecise and uncertain. In this paper, we study the probabilistic top-k dominating (PTD) query over such large-scale uncertain data in a distributed environment. A PTD query retrieves, from uncertain databases distributed over multiple servers, the k uncertain objects that have the largest ranking scores with high confidence. To tackle the distributed PTD problem efficiently, we propose a MapReduce framework for processing PTD queries over distributed uncertain databases. Within this framework, we design effective pruning strategies to filter out false alarms in the distributed setting, propose cost-model-based index distribution mechanisms over servers, and develop efficient distributed PTD query processing algorithms. Extensive experiments on both real and synthetic data sets, under various experimental settings, demonstrate the efficiency and effectiveness of the proposed distributed PTD approaches.
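
To make the query semantics concrete, the following is a minimal sketch of a top-k dominating ranking over uncertain objects that are split across servers. It is not the algorithm proposed in the paper. It rests on several assumptions that the abstract does not state: each uncertain object is a list of equally likely, independent instances; smaller attribute values are better; the ranking score is the expected number of objects dominated; and no confidence threshold, pruning, or index is applied. All function names (`dominates`, `dominance_probability`, `map_partial_scores`, `reduce_top_k`, `top_k_dominating`) are hypothetical, and the "map" and "reduce" steps are plain Python functions rather than calls to a MapReduce engine.

```python
from itertools import product


def dominates(a, b):
    """True if point a is no worse than b in every dimension and strictly
    better in at least one (assumption: smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dominance_probability(obj_a, obj_b):
    """Probability that a random instance of obj_a dominates a random instance
    of obj_b (assumption: instances are equally likely and independent)."""
    pairs = list(product(obj_a, obj_b))
    return sum(dominates(a, b) for a, b in pairs) / len(pairs)


def map_partial_scores(candidates, partition):
    """'Map' step for one server: for every candidate, the expected number of
    objects stored in this server's partition that the candidate dominates."""
    return {
        name: sum(
            dominance_probability(instances, other)
            for other_name, other in partition.items()
            if other_name != name
        )
        for name, instances in candidates.items()
    }


def reduce_top_k(partials, k):
    """'Reduce' step: add up per-server partial scores and keep the k largest.
    Ties are broken by object name so the result is deterministic."""
    totals = {}
    for partial in partials:
        for name, score in partial.items():
            totals[name] = totals.get(name, 0.0) + score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def top_k_dominating(partitions, k):
    """Naive driver: every object is a candidate and is scored on every server."""
    candidates = {name: inst for part in partitions for name, inst in part.items()}
    partials = [map_partial_scores(candidates, part) for part in partitions]
    return reduce_top_k(partials, k)


# Two hypothetical servers, each holding two uncertain 2-D objects.
server1 = {"A": [(1, 1), (2, 2)], "B": [(3, 3)]}
server2 = {"C": [(2, 3), (4, 4)], "D": [(5, 5)]}
result = top_k_dominating([server1, server2], k=2)
print(result)
```

This naive version broadcasts every object to every server and compares all instance pairs, which is exactly the cost that pruning strategies and a carefully distributed index are meant to avoid.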

Acknowledgements

Funding for this work was provided by NSF OAC No. 1739491, NSF CCF No. 2217104, and Lian Startup No. 220981, Kent State University.

Author information

Corresponding author

Correspondence to Xiang Lian.

About this article

Cite this article

Rai, N., Lian, X. Distributed probabilistic top-k dominating queries over uncertain databases. Knowl Inf Syst 65, 4939–4965 (2023). https://doi.org/10.1007/s10115-023-01917-3

  • DOI: https://doi.org/10.1007/s10115-023-01917-3
