Abstract
A vast body of research is dedicated to interpreting language models, in particular probing them for linguistic properties. As in many other NLP fields, probing works tend to reuse existing datasets, yielding increasingly specialized findings. Introducing new datasets, though necessary for truly typologically diverse studies, requires labor-intensive annotation. Meanwhile, models grow heavier, the inventory of probing methods expands, and the cost of probing experiments rises accordingly. Assessing how much data a probing dataset actually needs would both reduce the annotation effort for new data and cut the computational cost of experiments on existing data.
We propose fractions probing, a novel method for validating probing dataset size. It comprises a data redundancy test for reviewing existing datasets and a data sufficiency test for guiding the collection of new ones. We illustrate the method's applicability on the SentEval probing suite, finding that it can be safely reduced. Our experiments cover two models, BERT and RoBERTa, with the latter consistently requiring more data. Fractions probing can be applied analogously to other datasets and models.
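To make the idea concrete, the following minimal sketch illustrates a fraction-based redundancy check of the kind described above: the same linear probe is trained on growing fractions of a probing dataset, and the test score is inspected for saturation. The function and parameter names are illustrative and do not reproduce the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def redundancy_curve(X, y, fractions=(0.1, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Probe accuracy as a function of the training-set fraction.

    X, y: NumPy arrays of sentence embeddings and task labels.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    rng = np.random.default_rng(seed)
    scores = {}
    for frac in fractions:
        n = max(1, int(frac * len(X_train)))
        idx = rng.choice(len(X_train), size=n, replace=False)
        probe = LogisticRegression(max_iter=1000)
        probe.fit(X_train[idx], y_train[idx])
        scores[frac] = probe.score(X_test, y_test)
    return scores
```

A fraction can then be deemed sufficient when its score falls within a chosen tolerance of the full-data score; the tolerance threshold is the experimenter's choice.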
Notes
- 1.
We compute all the metrics in both ways but interpret the layer-wise setup only for Pearson correlation (see the sketch after these notes). When computing layer-wise metrics, we omit Word content: radical changes in the absolute values of its learning curve would override the signal from the other tasks.
- 2.
The reported graphs are for the BERT model.
- 3.
The numerical metrics do not show such consistency. However, we suspect they may be subject to biases such as those described in Sect. 5.1.
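The layer-wise comparison mentioned in note 1 can be sketched as follows, assuming the per-layer probe accuracies are available as two equal-length vectors; this is a hypothetical helper, not the authors' code.

```python
from scipy.stats import pearsonr

def layerwise_agreement(scores_fraction, scores_full):
    """Pearson r between per-layer probe accuracies on a data fraction
    and on the full dataset; high r means the fraction preserves the
    layer-wise profile of the task."""
    r, p_value = pearsonr(scores_fraction, scores_full)
    return r, p_value
```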
References
Adcock, C.J.: Sample size determination: a review. J. Roy. Stat. Soc.: Ser. D (Stat.) 46(2), 261–283 (1997). https://doi.org/10.1111/1467-9884.00082
Belinkov, Y.: Probing classifiers: promises, shortcomings, and advances. arXiv:2102.12452 [cs] (2021)
Boonyanunta, N., Zeephongsekul, P.: Predicting the relationship between the size of training sample and the predictive power of classifiers. In: Negoita, M.G., Howlett, R.J., Jain, L.C. (eds.) KES 2004. LNCS (LNAI), vol. 3215, pp. 529–535. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30134-9_71
Briggs, A.H., Gray, A.M.: Power and sample size calculations for stochastic cost-effectiveness analysis. Med. Decis. Making 18(2 Suppl), S81–S92 (1998). https://doi.org/10.1177/0272989X98018002S10
Brinker, K.: Incorporating diversity in active learning with support vector machines. In: Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 59–66 (2003)
Carneiro, A.V.: Estimating sample size in clinical studies: basic methodological principles. Revista Portuguesa de Cardiologia 22(12), 1513–1521 (2003)
Conneau, A., Kruszewski, G., Lample, G., Barrault, L., Baroni, M.: What you can cram into a single vector: probing sentence embeddings for linguistic properties. arXiv:1805.01070 [cs] (2018)
Cortes, C., Jackel, L., Solla, S., Vapnik, V., Denker, J.: Learning curves: asymptotic values and rate of convergence. In: NIPS (1993)
Dalvi, F., et al.: NeuroX: a toolkit for analyzing individual neurons in neural networks. In: AAAI Conference on Artificial Intelligence (AAAI) (2019). https://www.aaai.org/ojs/index.php/AAAI/article/view/5063
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs] (2019)
Dobbin, K.K., Zhao, Y., Simon, R.M.: How large a training set is needed to develop a classifier for microarray data? Clin. Cancer Res. 14(1), 108–114 (2008). https://doi.org/10.1158/1078-0432.CCR-07-0443
Eger, S., Daxenberger, J., Gurevych, I.: How to probe sentence embeddings in low-resource languages: on structural design choices for probing task evaluation. In: Proceedings of the 24th Conference on Computational Natural Language Learning, pp. 108–118. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.conll-1.8, https://aclanthology.org/2020.conll-1.8
Elazar, Y., Ravfogel, S., Jacovi, A., Goldberg, Y.: Amnesic probing: behavioral explanation with amnesic counterfactuals. Trans. Assoc. Comput. Linguist. 9, 160–175 (2021). https://doi.org/10.1162/tacl_a_00359
Ethayarajh, K., Jurafsky, D.: Utility is in the eye of the user: a critique of NLP leaderboards. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4846–4853. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.393, https://aclanthology.org/2020.emnlp-main.393
Figueroa, R.L., Zeng-Treitler, Q., Kandula, S., Ngo, L.H.: Predicting sample size required for classification performance. BMC Med. Inform. Decis. Making 12(1), 8 (2012). https://doi.org/10.1186/1472-6947-12-8
Fréchet, M.: Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo 22, 1–72 (1906)
Fukunaga, K., Hayes, R.: Effects of sample size in classifier design. IEEE Trans. Pattern Anal. Mach. Intell. 11, 873–885 (1989). https://doi.org/10.1109/34.31448
Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, Heidelberg (2009). https://doi.org/10.1007/978-0-387-84858-7
Hess, K.R., Wei, C.: Learning curves in classification with microarray data. Semin. Oncol. 37(1), 65–68 (2010). https://doi.org/10.1053/j.seminoncol.2009.12.002
Kim, S.Y.: Effects of sample size on robustness and prediction accuracy of a prognostic gene signature. BMC Bioinform. 10, 147 (2009). https://doi.org/10.1186/1471-2105-10-147
Lenth, R.: Some practical guidelines for effective sample-size determination. Am. Stat. 55 (2001). https://doi.org/10.1198/000313001317098149
Li, M., Sethi, I.: Confidence-based active learning. IEEE Trans. Pattern Anal. Mach. Intell. 28, 1251–1261 (2006). https://doi.org/10.1109/TPAMI.2006.156
Liu, Y.: Active learning with support vector machine applied to gene expression data for cancer classification. J. Chem. Inf. Comput. Sci. 44(6), 1936–1941 (2004). https://doi.org/10.1021/ci049810a
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019). https://doi.org/10.48550/arXiv.1907.11692, http://arxiv.org/abs/1907.11692
Maxwell, S.E., Kelley, K., Rausch, J.R.: Sample size planning for statistical power and accuracy in parameter estimation. Annu. Rev. Psychol. 59, 537–563 (2008). https://doi.org/10.1146/annurev.psych.59.103006.093735
Mikhailov, V., Taktasheva, E., Sigdel, E., Artemova, E.: RuSentEval: linguistic source, encoder force! arXiv:2103.00573 [cs] (2021)
Mukherjee, S., et al.: Estimating dataset size requirements for classifying DNA microarray data. J. Comput. Biol. 10(2), 119–142 (2003). https://doi.org/10.1089/106652703321825928
Perlich, C.: Learning curves in machine learning. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning. Springer, Boston (2011). https://doi.org/10.1007/978-0-387-30164-8_452
Provost, F., Jensen, D., Oates, T.: Efficient progressive sampling. In: Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 1999, pp. 23–32. Association for Computing Machinery, New York (1999). https://doi.org/10.1145/312129.312188
Ravishankar, V., Øvrelid, L., Velldal, E.: Probing multilingual sentence representations with x-probe. In: RepL4NLP@ACL (2019)
Rodriguez, P., Barrow, J., Hoyle, A.M., Lalor, J.P., Jia, R., Boyd-Graber, J.: Evaluation examples are not equally informative: how should that change NLP leaderboards? In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4486–4503. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.acl-long.346, https://aclanthology.org/2021.acl-long.346
Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. arXiv:2002.12327 [cs] (2020)
Vaswani, A., et al.: Attention is all you need. arXiv:1706.03762 [cs] (2017)
Voita, E., Titov, I.: Information-theoretic probing with minimum description length. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 183–196. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.14, https://aclanthology.org/2020.emnlp-main.14
Warmuth, M.K., Liao, J., Rätsch, G., Mathieson, M., Putta, S., Lemmen, C.: Active learning with support vector machines in the drug discovery process. J. Chem. Inf. Comput. Sci. 43(2), 667–673 (2003). https://doi.org/10.1021/ci025620t
Zhu, Z., Wang, J., Li, B., Rudzicz, F.: On the data requirements of probing. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 4132–4147. Association for Computational Linguistics, Dublin (2022). https://doi.org/10.18653/v1/2022.findings-acl.326, https://aclanthology.org/2022.findings-acl.326
Appendices
A Detailed SentEval Tasks Description
1.1 A.1 Surface Information
These tasks test the extent to which sentence embeddings preserve surface properties of the sentences they encode. SentLen is the task of predicting the length of a sentence in words. Word content (WC) is the task of predicting which word from a closed set of 1,000 the sentence contains.
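As an illustration of the standard probing setup used for such tasks, the sketch below fits a linear probe on frozen BERT sentence embeddings for a SentLen-style task. The mean-pooling step and the toy two-bin labels are simplifying assumptions (SentEval bins SentLen into six length classes), not the original recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Frozen-encoder sentence embeddings via mean pooling over tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

sentences = ["A short one .",
             "This sentence is noticeably longer than the one before it ."]
labels = [0, 1]  # toy length bins; SentEval uses six bins in practice
probe = LogisticRegression(max_iter=1000).fit(embed(sentences), labels)
```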
1.2 A.2 Syntactic Information
Bigram shift (BShift) is the task of predicting whether a sentence has intact word order or contains two inverted random adjacent words. Tree depth (TreeDepth) is the task of determining the depth of the sentence's syntactic tree. Top constituent (TopConst) is the task of determining the sequence of top constituents immediately below the sentence (S) node.
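For instance, a BShift item can be produced by swapping one random pair of adjacent words, as in the following hypothetical generator (not the original SentEval data-creation code):

```python
import random

def bshift_example(tokens, rng=random):
    """Return (tokens, label): label 0 = intact order, 1 = inverted bigram."""
    if rng.random() < 0.5 or len(tokens) < 2:
        return tokens, 0
    i = rng.randrange(len(tokens) - 1)  # position of the bigram to invert
    swapped = tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
    return swapped, 1
```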
1.3 A.3 Semantic Information
Tense is the task of predicting the tense of the main-clause verb. Subject number (SubjNum) is the task of determining the number of the subject of the main clause. Similarly, object number (ObjNum) tests for the number of the direct object of the main clause. Semantic odd man out (SOMO) is the task of predicting whether the sentence contains a replaced verb or noun that forms bigrams with the previous and following words of the same frequency as the original. Coordination inversion (CoordInv) is the task of determining whether a sentence has intact or inverted clause order.
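A rough illustration of how a CoordInv item could be produced is given below. Real SentEval data is built from parsed sentences, so the naive string split on a conjunction is a simplifying assumption.

```python
def coordinv_example(sentence, conj=" and "):
    """Return (sentence, label): label 0 = intact, 1 = inverted clause order."""
    left, sep, right = sentence.partition(conj)
    if not sep:
        return sentence, 0          # no coordination found: keep intact
    return right + sep + left, 1    # swap the two coordinated clauses
```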
Here we would like to note that the original SentEval partition into domains should possibly be revised. For example, Tense, SubjNum, and ObjNum were categorized as semantic information because the models of the time did not have access to morphology. This is no longer the case for models that use byte-pair encoding and similar subword techniques, which expose morphological cues directly.
B Recommended Fractions for BERT and RoBERTa
Table 3 displays the results of data redundancy and data sufficiency tests on SentEval for BERT and RoBERTa.
C Limitations
Our work has a number of limitations. First, our results on SentEval cannot be extrapolated to other datasets without rerunning our method on them. Second, our method does not allow extrapolating the learning curves of the metric in the data redundancy test; one has to build them empirically by running experiments on larger fractions of the data.
One should also keep in mind the known weaknesses of probing studies, such as the potentially misleading nature of accuracy scores. Although fractions probing can easily be adapted to safer methods such as selectivity-based probes or minimum description length (MDL, sketched below), in this work we employed the vanilla classification technique. We encourage researchers to choose probing methods responsibly when using fractions probing.
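For reference, here is a compact sketch of the online-coding variant of MDL probing (Voita and Titov, 2020). The summed cross-entropy of each data block under a probe trained on all preceding blocks gives the codelength; the specific block boundaries, the probe choice, and the assumption that every class occurs in the first block are ours, not the original paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def online_codelength(X, y, boundaries=(64, 128, 256, 512)):
    """Total online codelength in bits; assumes every class occurs
    within the first `boundaries[0]` examples."""
    classes = np.unique(y)
    total = boundaries[0] * np.log2(len(classes))  # first block: uniform code
    cuts = list(boundaries) + [len(y)]
    for start, end in zip(cuts, cuts[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
        proba = probe.predict_proba(X[start:end])
        # cross-entropy of the next block under the prefix-trained probe,
        # converted from nats to bits
        total += log_loss(y[start:end], proba, labels=probe.classes_,
                          normalize=False) / np.log(2)
    return total  # lower codelength = more extractable structure
```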
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Orlov, E., Serikov, O. (2024). Less than Necessary or More than Sufficient: Validating Probing Dataset Size. In: Ignatov, D.I., et al. Analysis of Images, Social Networks and Texts. AIST 2023. Lecture Notes in Computer Science, vol 14486. Springer, Cham. https://doi.org/10.1007/978-3-031-54534-4_8
DOI: https://doi.org/10.1007/978-3-031-54534-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54533-7
Online ISBN: 978-3-031-54534-4