TabMentor: Detect Errors on Tabular Data with Noisy Labels

Zhang, Yaru; Qin, Jianbin; Wang, Yaoshu; Ali, Muhammad Asif; Ji, Yan; Mao, Rui

doi:10.1007/978-3-031-46671-7_12


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14178)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Existing supervised methods for error detection require clean labels to train their classification models, which are difficult to obtain in practice. While most error detection algorithms ignore the effect of noisy labels, this paper designs effective techniques for error detection when both the data and the labels contain noise. We present TabMentor, a novel deep-learning model for error detection on tabular data with noisy training labels. For prediction, TabMentor introduces a deep model, Tabclassifier, that suggests the most salient features for the decision step, enabling efficient learning. For feature extraction, it uses existing error detection algorithms along with some raw features from the datasets. To reduce the negative effect of noisy training labels, TabMentor uses a second deep model, Teachernet, to supervise the training of Tabclassifier. During training, Teachernet and Tabclassifier dynamically learn a curriculum from the data, allowing Tabclassifier to focus more on cleanly labeled samples. Evaluation on five datasets shows that TabMentor outperforms the best baseline error detection system by 0.05 to 0.11 in F1 score.
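
The idea of letting a classifier focus on cleanly labeled samples can be illustrated with a minimal sketch. The code below is not the paper's Teachernet: it assumes a fixed, hand-written self-paced rule (weight 1 if a sample's loss is below a threshold, otherwise 0) in place of a learned teacher model, and the function names, toy predictions, and threshold are all hypothetical.

```python
import math


def bce_loss(p, y):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def curriculum_weights(losses, threshold):
    """Hard self-paced rule (an assumption, standing in for a learned teacher):
    samples whose loss is below the threshold get weight 1, the rest get 0."""
    return [1.0 if loss < threshold else 0.0 for loss in losses]


def weighted_mean_loss(losses, weights):
    """Average the per-sample losses using the curriculum weights."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total


# Toy data (hypothetical): the classifier's predicted error probabilities
# and noisy labels, where the last two labels are assumed to be flipped.
preds = [0.9, 0.2, 0.8, 0.1]
noisy_labels = [1, 0, 0, 1]

losses = [bce_loss(p, y) for p, y in zip(preds, noisy_labels)]
weights = curriculum_weights(losses, threshold=1.0)
training_loss = weighted_mean_loss(losses, weights)
```

In this toy setting the two samples whose labels disagree strongly with the predictions receive zero weight, so they do not contribute to the training loss. A learned teacher, as described in the abstract, would produce such weights from data rather than from a fixed threshold.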

Acknowledgements

National Key R&D program of China 2021YFB3301500, Guangdong Provincial Natural Science Foundation 2019A1515111047, Shenzhen Colleges and Universities Continuous Support Grant 20200811104054002, Guangdong “Pearl River Talent Recruitment Program” under Grant 2019ZT08X603, the 14th “115” Industrial Innovation Group (Project 4) of Anhui Province, NSFC 62072311, U2001212, Guangdong Project 2020B1515120028, and Shenzhen Project JCYJ20210324094402008.

Author information

Authors and Affiliations

Shenzhen Institute of Computing Sciences, Shenzhen University, Shenzhen, China
Yaru Zhang, Jianbin Qin, Yaoshu Wang, Yan Ji & Rui Mao
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Muhammad Asif Ali

Corresponding author

Correspondence to Jianbin Qin.

Editor information

Editors and Affiliations

Northeastern University, Shenyang, China
Xiaochun Yang
The University of Indonesia, Depok, Indonesia
Heru Suhartanto
Beijing Institute of Technology, Beijing, China
Guoren Wang
Northeastern University, Shenyang, China
Bin Wang
University of Technology Sydney, Sydney, NSW, Australia
Jing Jiang
Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
Bing Li
Sun Yat-sen University, Guangzhou, China
Huaijie Zhu
Anhui University, Hefei, China
Ningning Cui

About this paper

Cite this paper

Zhang, Y., Qin, J., Wang, Y., Ali, M.A., Ji, Y., Mao, R. (2023). TabMentor: Detect Errors on Tabular Data with Noisy Labels. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14178. Springer, Cham. https://doi.org/10.1007/978-3-031-46671-7_12

DOI: https://doi.org/10.1007/978-3-031-46671-7_12
Published: 05 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46670-0
Online ISBN: 978-3-031-46671-7
eBook Packages: Computer Science, Computer Science (R0)
