Kern: A Labeling Environment for Large-Scale, High-Quality Training Data

Hötter, Johannes; Wenck, Henrik; Feuerpfeil, Moritz; Witzke, Simon

doi:10.1007/978-3-031-08473-7_46

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13286)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

The lack of large-scale, high-quality training data is a significant bottleneck in supervised learning. We introduce kern, a labeling environment that machine learning experts and subject matter experts use to create training data and to find manual labeling errors; it is powered by weak supervision, active transfer learning, and confident learning. We describe the current workflow, give a system overview, and showcase the benefits of our system in an intent classification experiment, in which we reduce the labeling error rate of a given dataset by an absolute 4.9% while improving the \(F_1\) score of a baseline classifier by a total of 9.7%.
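
To illustrate how confident learning can surface manual labeling errors, the following is a minimal sketch of its thresholding idea, not kern's implementation and not the cleanlab API. It assumes a per-class threshold equal to the mean predicted probability of that class over the examples labeled with it, and flags an example when a different class is the most probable one among those reaching their threshold. The function names and the toy probabilities are hypothetical; in practice the probabilities would be out-of-sample predictions of a classifier.

```python
from collections import defaultdict


def class_thresholds(labels, probs):
    """Mean predicted probability of class c over the examples labeled c."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for given, p in zip(labels, probs):
        sums[given] += p[given]
        counts[given] += 1
    return {c: sums[c] / counts[c] for c in counts}


def flag_label_issues(labels, probs):
    """Return (index, given_label, suggested_label) for suspicious examples."""
    thresholds = class_thresholds(labels, probs)
    flagged = []
    for i, (given, p) in enumerate(zip(labels, probs)):
        # Classes whose predicted probability reaches that class's threshold.
        confident = [c for c in thresholds if p[c] >= thresholds[c]]
        if not confident:
            continue  # the model is not confident about any class
        suggested = max(confident, key=lambda c: p[c])
        if suggested != given:
            flagged.append((i, given, suggested))
    return flagged


# Toy data (hypothetical): six examples, two intent classes 0 and 1.
labels = [0, 0, 0, 1, 1, 1]
probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model strongly prefers class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

issues = flag_label_issues(labels, probs)
print(issues)
```

Flagged examples are candidates for manual review rather than automatic relabeling, since the model's probabilities can themselves be wrong.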

Notes

1. Snorkel is available at https://github.com/snorkel-team/snorkel.
2. modAL is available at https://github.com/modAL-python/modAL.
3. cleanlab is available at https://github.com/cleanlab/cleanlab.
4. We chose this name to allude to the fact that training data is the “core” of modern supervised learning applications, both in research and in applied systems.
5. Models are implemented using the embedding store and standard machine learning libraries such as Scikit-Learn [10].
6. Accessible at https://rapidapi.com/organization/symanto.
7. Information sources are run in containers for security and scalability reasons.
8. For instance, if the intent is to cancel an order, a system can automatically do so if it can find the order reference number within the given text message.
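
The weak supervision idea behind note 1 and the use case in note 8 can be sketched as follows. This is a hedged illustration only: it combines hand-written heuristics by simple majority vote, which is a deliberately simpler stand-in for the learned label models used by tools such as Snorkel, and it does not reflect kern's actual implementation. The labeling functions, the intent names, and the `ORD-` reference format are all hypothetical.

```python
import re
from collections import Counter

ABSTAIN = None  # a heuristic returns this when it has no opinion


# Hypothetical heuristics ("labeling functions") for intent classification.
def lf_cancel_keyword(text):
    return "cancel_order" if "cancel" in text.lower() else ABSTAIN


def lf_do_not_want(text):
    return "cancel_order" if "don't want" in text.lower() else ABSTAIN


def lf_refund_keyword(text):
    return "refund" if "refund" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_cancel_keyword, lf_do_not_want, lf_refund_keyword]


def majority_vote(text, labeling_functions):
    """Combine noisy votes; abstain when there are no votes or on a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Assumed reference format for illustration: "ORD-" followed by six digits.
ORDER_REFERENCE = re.compile(r"\bORD-\d{6}\b")


def extract_order_reference(text):
    """Return the first order reference found in the text, or None."""
    match = ORDER_REFERENCE.search(text)
    return match.group(0) if match else None


message = "Please cancel my order ORD-123456, I don't want it anymore."
intent = majority_vote(message, LABELING_FUNCTIONS)
reference = extract_order_reference(message)
print(intent, reference)
```

A downstream system would act automatically only when both the intent and the order reference are found, and would otherwise hand the message to a human.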

References

1. Basile, A., Pérez-Torró, G., Franco-Salvador, M.: Probabilistic ensembles of zero- and few-shot learning models for emotion classification. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 128–137. INCOMA Ltd., September 2021
2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013)
3. Danka, T., Horvath, P.: modAL: a modular active learning framework for Python. https://github.com/modAL-python/modAL, arXiv:1805.00979
4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
5. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019)
6. Fu, D.Y., Chen, M.F., Sala, F., Hooper, S.M., Fatahalian, K., Ré, C.: Fast and three-rious: speeding up weak supervision with triplet methods. In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020) (2020)
7. Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intell. Syst. 24, 8–12 (2009)
8. Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., Hovy, E.: Learning whom to trust with MACE. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 1120–1130. Association for Computational Linguistics, June 2013
9. Northcutt, C., Jiang, L., Chuang, I.: Confident learning: estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70, 1373–1411 (2021)
10. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
11. Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel. Proc. VLDB Endow. 11(3), 269–282 (2017)
12. Ratner, A., Sa, C.D., Wu, S., Selsam, D., Ré, C.: Data programming: creating large training sets, quickly. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, pp. 3574–3582. Curran Associates Inc., Red Hook (2016)
13. Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data - AI integration perspective. IEEE Trans. Knowl. Data Eng. 33(4), 1328–1347 (2021)
14. Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison (2009)
15. Shi, X., Fan, W., Ren, J.: Actively transfer domain knowledge. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS (LNAI), vol. 5212, pp. 342–357. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87481-2_23
16. Sun, C., Shrivastava, A., Singh, S., Gupta, A.K.: Revisiting unreasonable effectiveness of data in deep learning era. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
17. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)
18. Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)

Author information

Authors and Affiliations

kern.ai, Gerhart-Hauptmann-Allee 71, 15732, Eichwalde, Germany
Johannes Hötter, Henrik Wenck, Moritz Feuerpfeil & Simon Witzke

Corresponding author

Correspondence to Johannes Hötter.

Editor information

Editors and Affiliations

Universitat Politècnica de València, Valencia, Spain
Paolo Rosso
University of Turin, Torino, Italy
Valerio Basile
Universidad Nacional de Educación a Distancia, Madrid, Spain
Raquel Martínez
Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais
University of Derby, Derby, UK
Farid Meziane

About this paper

Cite this paper

Hötter, J., Wenck, H., Feuerpfeil, M., Witzke, S. (2022). Kern: A Labeling Environment for Large-Scale, High-Quality Training Data. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_46

DOI: https://doi.org/10.1007/978-3-031-08473-7_46
Published: 13 June 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08472-0
Online ISBN: 978-3-031-08473-7
eBook Packages: Computer Science, Computer Science (R0)
