Top-k dominating queries on incomplete large dataset

The Journal of Supercomputing

Abstract

A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some dimensions, so traditional data mining methods designed for complete data struggle to extract useful information from them. The BitMap Index Guided (BIG) algorithm is a good choice for solving this problem. However, finding the top-k dominating objects is even harder on incomplete big data: when the dataset is very large, the demands on the feasibility and performance of the algorithm become very high. In this paper, we propose the Efficient Hadoop BitMap Index Guided Algorithm (EHBIG), which applies MapReduce to the whole process together with a pruning strategy. EHBIG answers TKD queries on incomplete datasets through a bitmap index and uses the MapReduce architecture to make TKD queries feasible on large datasets. The pruning strategy greatly reduces runtime and memory usage. We also propose an improved version of EHBIG, denoted IEHBIG, which optimizes the whole algorithm flow. Experimental results show that the proposed algorithms perform well on TKD queries over large incomplete datasets and achieve good performance on a Hadoop computing cluster.
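To make the query itself concrete, the following is a minimal Python sketch of a naive TKD query on incomplete data. It is not BIG, EHBIG, or IEHBIG: it uses no bitmap index, no pruning, and no MapReduce, and it simply compares every pair of objects. The abstract does not spell out the dominance relation, so the sketch assumes a common one for incomplete data: objects are compared only on dimensions observed in both, and smaller values are better. The names `dominates` and `top_k_dominating` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# An object is a tuple of values; None marks a missing dimension (assumption).
Obj = Sequence[Optional[float]]


def dominates(a: Obj, b: Obj) -> bool:
    """Return True if a dominates b on the dimensions observed in both.

    Assumed definition: a is no worse than b on every shared dimension
    and strictly better on at least one. Smaller is better (assumption).
    """
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip dimensions missing in either object
        if x > y:
            return False
        if x < y:
            strictly_better = True
    return strictly_better


def top_k_dominating(data: List[Obj], k: int) -> List[Tuple[int, int]]:
    """Return (index, score) for the k objects that dominate the most others.

    Naive O(n^2) baseline; ties are broken by the lower index.
    """
    scores = [
        sum(dominates(a, b) for j, b in enumerate(data) if j != i)
        for i, a in enumerate(data)
    ]
    ranked = sorted(range(len(data)), key=lambda i: (-scores[i], i))
    return [(i, scores[i]) for i in ranked[:k]]


# Hypothetical sample: four objects, three dimensions, some values missing.
data = [
    (1, None, 2),
    (2, 3, None),
    (None, 1, 3),
    (3, 4, 4),
]
result = top_k_dominating(data, k=2)
print(result)  # [(0, 3), (2, 2)]
```

The quadratic number of pairwise comparisons in this baseline is the kind of cost that makes an index, pruning, and a distributed framework attractive once the dataset is large.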

Author information

Corresponding author

Correspondence to Mu-En Wu.

About this article

Cite this article

Wu, J.M.-T., Wei, M., Wu, M.-E. et al. Top-k dominating queries on incomplete large dataset. J Supercomput 78, 3976–3997 (2022). https://doi.org/10.1007/s11227-021-04005-x

  • DOI: https://doi.org/10.1007/s11227-021-04005-x
