Bayesian Classifier Modeling for Dirty Data

Wang, Hongya; Cheng, Weidong; Guo, Kaiyan; Xiao, Yingyuan; Liu, Zhenyu

doi:10.1007/978-3-030-29894-4_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11672)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

Bayesian classifiers have proven effective in many practical applications. To train a Bayesian classifier, important parameters such as prior and class-conditional probabilities need to be learned from datasets. In practice, datasets are prone to errors due to dirty (missing, erroneous or duplicated) values, which can severely degrade model accuracy if no data cleaning is performed. However, cleaning the whole dataset is prohibitively laborious and thus infeasible for even medium-sized datasets. To this end, we propose inducing Bayes models by cleaning only small samples of the dataset. We derive confidence intervals as a function of sample size after data cleaning. In this way, the posterior probability is guaranteed to fall into the estimated confidence intervals with constant probability. We then design two strategies for comparing the posterior probability intervals when they overlap. An extension to the semi-naive Bayes method is also addressed. Experimental results suggest that cleaning only a small number of samples can train satisfactory Bayesian models, offering significant improvement in cost over cleaning all of the data and significant improvement in precision, recall and F-measure over cleaning none of the data.
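To make the idea of a confidence interval that depends on the cleaned sample size concrete, the following is a minimal Python sketch. It is not the derivation used in the paper, which the abstract does not give. It assumes a standard two-sided Hoeffding bound for a single estimated probability (for example, a class prior), and the function names `hoeffding_interval`, `intervals_overlap` and `pick_larger` are hypothetical. The midpoint tie-break for overlapping intervals is likewise only an illustrative stand-in for the two comparison strategies mentioned above.

```python
import math


def hoeffding_interval(successes, n, delta=0.05):
    """Interval for a proportion estimated from n cleaned samples.

    Assumption: a two-sided Hoeffding bound, which holds with
    probability at least 1 - delta. The paper may use another bound.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    # Clip to [0, 1] because the quantity is a probability.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def pick_larger(a, b):
    """Return 0 if interval a ranks higher, 1 if interval b does.

    Disjoint intervals are ordered directly. Overlapping intervals are
    ranked by midpoint, a hypothetical tie-break chosen for illustration.
    """
    if not intervals_overlap(a, b):
        return 0 if a[0] > b[1] else 1
    mid_a = (a[0] + a[1]) / 2.0
    mid_b = (b[0] + b[1]) / 2.0
    return 0 if mid_a >= mid_b else 1


if __name__ == "__main__":
    # The interval narrows as more samples are cleaned.
    for n in (100, 400, 1600):
        print(n, hoeffding_interval(n // 2, n))
```

Under this bound the half-width shrinks in proportion to 1/sqrt(n), so quadrupling the number of cleaned samples halves the interval width. This is one way to read the claim that a small cleaned sample can already give usable estimates.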

Acknowledgment

The work reported in this paper is partially supported by NSFC under grant number 61370205 and NSF of Xinjiang Key Laboratory under grant number 2019D04024.

Author information

Authors and Affiliations

School of Computer Science and Technology, Donghua University, Shanghai, China
Hongya Wang, Weidong Cheng & Kaiyan Guo
School of CSE, Tianjin University of Technology, Tianjin, China
Yingyuan Xiao
Shanghai Key Laboratory of Computer Software Testing and Evaluation, Shanghai, China
Zhenyu Liu

Corresponding author

Correspondence to Hongya Wang.

Editor information

Editors and Affiliations

Department of Computing, Macquarie University, Sydney, NSW, Australia
Abhaya C. Nayak
RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Alok Sharma

About this paper

Cite this paper

Wang, H., Cheng, W., Guo, K., Xiao, Y., Liu, Z. (2019). Bayesian Classifier Modeling for Dirty Data. In: Nayak, A., Sharma, A. (eds) PRICAI 2019: Trends in Artificial Intelligence. PRICAI 2019. Lecture Notes in Computer Science, vol 11672. Springer, Cham. https://doi.org/10.1007/978-3-030-29894-4_6

DOI: https://doi.org/10.1007/978-3-030-29894-4_6
Published: 23 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29893-7
Online ISBN: 978-3-030-29894-4
eBook Packages: Computer Science, Computer Science (R0)