Abstract
The lack of large-scale, high-quality training data is a significant bottleneck in supervised learning. We introduce kern, a labeling environment used by machine learning experts and subject matter experts to create training data and to find manual labeling errors, powered by weak supervision, active transfer learning, and confident learning. We describe the current workflow, give a system overview, and showcase the benefits of our system in an intent classification experiment, in which we reduce the labeling error rate of a given dataset by an absolute 4.9% while improving the F₁ score of a baseline classifier by a total of 9.7%.
Notes
1. Snorkel is available at https://github.com/snorkel-team/snorkel (a minimal weak supervision sketch follows after these notes).
2. modAL is available at https://github.com/modAL-python/modAL (an active learning sketch follows after these notes).
3. cleanlab is available at https://github.com/cleanlab/cleanlab (a confident learning sketch follows after these notes).
4. We chose this name to allude to the fact that training data is the “core” of modern supervised learning applications, both in research and in applied systems.
5. Models are implemented on top of the embedding store, using standard machine learning libraries such as Scikit-Learn [10] (a classifier sketch follows after these notes).
6. Accessible at https://rapidapi.com/organization/symanto.
7. Information sources are run in containers for security and scalability reasons.
8. For instance, if the intent is to cancel an order, a system can cancel it automatically if it can find the order reference number within the given text message (an illustrative sketch follows after these notes).
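
For footnote 1, the weak supervision workflow the paper builds on can be sketched with Snorkel's public API. The labeling functions, the toy messages, and the intent labels (CANCEL_ORDER vs. OTHER) below are hypothetical illustrations, not taken from the paper:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, CANCEL_ORDER = -1, 0, 1

@labeling_function()
def lf_contains_cancel(x):
    # Heuristic: messages mentioning "cancel" likely carry the cancel-order intent.
    return CANCEL_ORDER if "cancel" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_order(x):
    return CANCEL_ORDER if "order" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Please cancel my order from last week.",
    "What are your opening hours?",
]})

# Apply all labeling functions to build the label matrix (one column per LF).
applier = PandasLFApplier(lfs=[lf_contains_cancel, lf_contains_order])
L_train = applier.apply(df=df_train)

# The label model denoises and combines the noisy votes into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100)
preds = label_model.predict(L=L_train)
```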
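For footnote 2, an active learning loop with modAL could look roughly as follows. The pool data, seed size, and query budget are made up for illustration, and kern's actual active *transfer* learning setup (which starts from a pre-trained representation) is not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

rng = np.random.default_rng(42)
X_pool = rng.normal(size=(1000, 16))     # e.g. precomputed text embeddings
y_pool = (X_pool[:, 0] > 0).astype(int)  # hypothetical ground-truth oracle

# Seed the learner with a handful of labeled examples.
learner = ActiveLearner(
    estimator=LogisticRegression(),
    query_strategy=uncertainty_sampling,
    X_training=X_pool[:10], y_training=y_pool[:10],
)

# Repeatedly query the most uncertain instance and "label" it via the oracle.
for _ in range(20):
    query_idx, query_inst = learner.query(X_pool)
    learner.teach(X=query_inst, y=y_pool[query_idx])
```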
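For footnote 3, confident learning flags likely label errors by comparing given labels against out-of-sample predicted probabilities. A minimal sketch using cleanlab's 2.x API, with a placeholder classifier and synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
labels = (X[:, 0] > 0).astype(int)
labels[:15] = 1 - labels[:15]  # inject some label noise for demonstration

# Out-of-sample predicted probabilities via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(), X, labels, cv=5, method="predict_proba"
)

# Indices whose given label disagrees with the model's confident prediction,
# ranked by how confidently the model contradicts the given label.
issue_idx = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_idx)} potential label errors found")
```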
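Footnote 5 describes training standard Scikit-Learn models on top of stored embeddings. A sketch under the assumption that the embedding store hands back one dense vector per record (the random vectors and labels below stand in for that):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in for vectors fetched from the embedding store (e.g. transformer embeddings).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(800, 32))
labels = (embeddings[:, :4].sum(axis=1) > 0).astype(int)  # hypothetical manual labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)

# Any standard Scikit-Learn estimator works on top of fixed embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```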
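The automation example in footnote 8 amounts to intent classification plus slot extraction. In the sketch below, the order-reference format, the pattern, and the handler are all invented for illustration; the paper does not specify them:

```python
import re
from typing import Optional

# Hypothetical pattern: order references shaped like "ORD-12345".
ORDER_REF = re.compile(r"\bORD-\d{5}\b")

def extract_order_reference(message: str) -> Optional[str]:
    match = ORDER_REF.search(message)
    return match.group(0) if match else None

def handle_message(message: str, intent: str) -> str:
    # The intent would come from the trained classifier; here it is passed in.
    if intent == "cancel_order":
        ref = extract_order_reference(message)
        if ref is not None:
            return f"Order {ref} cancelled automatically."
        return "Cancellation intent detected, but no order reference found; routing to an agent."
    return "Routing to an agent."

print(handle_message("Please cancel order ORD-12345.", "cancel_order"))
```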
References
Basile, A., Pérez-Torró, G., Franco-Salvador, M.: Probabilistic ensembles of zero- and few-shot learning models for emotion classification. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 128–137. INCOMA Ltd., September 2021
Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013)
Danka, T., Horvath, P.: modAL: a modular active learning framework for Python. arXiv:1805.00979. https://github.com/modAL-python/modAL
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019)
Fu, D.Y., Chen, M.F., Sala, F., Hooper, S.M., Fatahalian, K., Ré, C.: Fast and three-rious: speeding up weak supervision with triplet methods. In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020) (2020)
Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intell. Syst. 24, 8–12 (2009)
Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., Hovy, E.: Learning whom to trust with MACE. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 1120–1130. Association for Computational Linguistics, June 2013
Northcutt, C., Jiang, L., Chuang, I.: Confident learning: estimating uncertainty in dataset labels. J. Artif. Intell. Res. 70, 1373–1411 (2021)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: rapid training data creation with weak supervision. Proc. VLDB Endow. 11(3), 269–282 (2017)
Ratner, A., Sa, C.D., Wu, S., Selsam, D., Ré, C.: Data programming: creating large training sets, quickly. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, pp. 3574–3582. Curran Associates Inc., Red Hook (2016)
Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data - AI integration perspective. IEEE Trans. Knowl. Data Eng. 33(4), 1328–1347 (2021)
Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison (2009)
Shi, X., Fan, W., Ren, J.: Actively transfer domain knowledge. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS (LNAI), vol. 5212, pp. 342–357. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87481-2_23
Sun, C., Shrivastava, A., Singh, S., Gupta, A.K.: Revisiting unreasonable effectiveness of data in deep learning era. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)
Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)