
Less than Necessary or More than Sufficient: Validating Probing Dataset Size

  • Conference paper
  • First Online:
Analysis of Images, Social Networks and Texts (AIST 2023)

Abstract

A vast body of research is dedicated to interpreting language models, particularly to probing them for linguistic properties. As in many other NLP fields, probing studies tend to reuse existing datasets, resulting in increasingly specialized findings. Introducing new datasets, although necessary for truly typologically diverse studies, requires labor-intensive annotation. Meanwhile, models grow heavier, the inventory of probing methods expands, and the cost of probing experiments rises accordingly. To minimize the annotation effort for new data and to reduce the computational cost of experimenting with existing data, it is beneficial to assess how large a probing dataset actually needs to be.

We propose fractions probing, a novel method for validating probing dataset size. It comprises a data redundancy test for reviewing existing datasets and a data sufficiency test for guiding the collection of new ones. We illustrate the method's applicability on the SentEval probing suite, finding that it can be safely reduced. Our experiments cover two models, BERT and RoBERTa, and show that the latter consistently requires more data. Fractions probing can be applied analogously to investigate other datasets and models.


Notes

  1. We compute all the metrics in both ways, but interpret the layer-wise setup only for Pearson correlation. When computing layer-wise metrics, we omit Word content: the radical changes in the absolute values of its learning curve override the signal from the other tasks.

  2. The reported graphs are for the BERT model.

  3. The numerical metrics do not show such consistency. However, we suspect that they may be subject to biases such as those described in Sect. 5.1.


Author information


Corresponding author

Correspondence to Evgeny Orlov.


Appendices

A Detailed Description of the SentEval Tasks

A.1 Surface Information

These tasks test the extent to which sentence embeddings preserve surface properties of the sentences they encode. SentLen is the task of predicting the length of a sentence in words. Word content (WC) is the task of predicting which word, out of a closed set of 1,000, the sentence contains.

A.2 Syntactic Information

Bigram shift (BShift) is the task of predicting whether the sentence has intact word order or contains two inverted random adjacent words. Tree depth (TreeDepth) is the task of determining the depth of the syntactic tree of the sentence. Top constituent (TopConst) is the task of determining the sequence of top constituents immediately below the sentence (S) node.

A.3 Semantic Information

Tense is the task of predicting the tense of the main-clause verb. Subject number (SubjNum) is the task of determining the number of the subject of the main clause. Similarly, object number (ObjNum) tests for the number of the direct object of the main clause. Semantic odd man out (SOMO) is the task of predicting whether the sentence contains a replaced verb or noun that forms bigrams with the previous and following words of the same frequency as the original. Coordination inversion (CoordInv) is the task of determining whether the sentence has an intact or inverted order of clauses.
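All of the tasks above reduce to training a lightweight classifier on frozen sentence embeddings. The following is a minimal sketch of this generic probing setup; the embeddings and labels are random stand-ins rather than SentEval data, and the probe is a plain logistic regression trained by gradient descent, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen sentence embeddings and binary task labels
# (in practice, e.g. BERT layer activations and BShift labels).
dim = 32
w_true = rng.normal(size=dim)                  # hypothetical separating direction
X_train = rng.normal(size=(1000, dim))
y_train = (X_train @ w_true > 0).astype(float)
X_test = rng.normal(size=(200, dim))
y_test = (X_test @ w_true > 0).astype(float)

# A minimal logistic-regression probe trained by gradient descent.
w = np.zeros(dim)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w)))   # sigmoid
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

acc = float(np.mean((X_test @ w > 0) == (y_test > 0.5)))
```

The accuracy of such a probe, measured at varying dataset fractions, is the raw material for both tests in the paper.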

Here we would like to note that the original SentEval partition of tasks into domains should possibly be revised. For example, Tense, SubjNum and ObjNum were categorized as semantic information because the models of the time did not have access to morphology. This is no longer the case for models that use byte-pair encoding and similar subword techniques.

B Recommended Fractions for BERT and RoBERTa

Table 3 displays the results of the data redundancy and data sufficiency tests on SentEval for BERT and RoBERTa.

Table 3. Recommended fractions according to different methods for BERT and RoBERTa. Data redundancy test: r is Pearson correlation; f is Fréchet distance. Data sufficiency test: D1 and D2 are the first and second discrete differences. The Mean fraction / Total dataset size row shows, for each method, both the mean fraction across tasks and the resulting fraction of the whole SentEval suite.
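To illustrate how the two redundancy metrics in Table 3 relate a fraction's learning curve to the full-data one, the sketch below computes Pearson correlation and a discrete Fréchet distance for two hypothetical layer-wise accuracy curves; the numbers are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative layer-wise probing accuracies (one value per layer).
full = np.array([0.60, 0.68, 0.74, 0.78, 0.80, 0.81])  # probe on 100% of the data
frac = np.array([0.58, 0.66, 0.73, 0.77, 0.79, 0.80])  # probe on a smaller fraction

# Pearson correlation r between the two curves.
r = np.corrcoef(full, frac)[0, 1]

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two 1-D curves (dynamic programming)."""
    n, m = len(p), len(q)
    ca = np.full((n, m), -1.0)
    def c(i, j):
        if ca[i, j] >= 0:
            return ca[i, j]
        d = abs(p[i] - q[j])
        if i == 0 and j == 0:
            ca[i, j] = d
        elif i == 0:
            ca[i, j] = max(c(0, j - 1), d)
        elif j == 0:
            ca[i, j] = max(c(i - 1, 0), d)
        else:
            ca[i, j] = max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
        return ca[i, j]
    return c(n - 1, m - 1)

f = discrete_frechet(full, frac)
```

A high r and a low f indicate that the fraction's curve already mirrors the full-data curve, i.e. the extra data is redundant under these metrics.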

C Limitations

Our work has a number of limitations. First, our results on SentEval cannot be extrapolated to other datasets without rerunning our method on them. Second, our method does not allow extrapolating the learning curves of the metric in the data redundancy test; one has to build them empirically by running experiments on larger fractions of the data.
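One illustrative way to build such a curve empirically and locate the fraction at which it plateaus is sketched below; all accuracies, fraction steps, and the threshold eps are hypothetical, with D1 and D2 as in Table 3:

```python
import numpy as np

# Hypothetical probing accuracies measured at increasing dataset fractions.
fractions = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
accuracy  = np.array([0.620, 0.710, 0.760, 0.790, 0.812,
                      0.820, 0.824, 0.827, 0.829, 0.830])

d1 = np.diff(accuracy)         # first discrete difference (D1)
d2 = np.diff(accuracy, n=2)    # second discrete difference (D2)

eps = 0.01  # hypothetical plateau threshold on D1
# Smallest fraction after which every further gain stays below eps.
plateau = next(i for i in range(len(d1)) if np.all(np.abs(d1[i:]) < eps))
recommended = float(fractions[plateau])
```

Under these made-up numbers the curve stops gaining more than one accuracy point per step after half of the data, so 0.5 would be the recommended fraction.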

Finally, one should always keep in mind the known weaknesses of probing studies, such as the misleading nature of accuracy scores. Although fractions probing can easily be adapted to safer methods, such as selectivity-based ones or MDL, in this work we employed the vanilla classification technique. We encourage researchers to choose proper probing methods responsibly when using fractions probing.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Orlov, E., Serikov, O. (2024). Less than Necessary or More than Sufficient: Validating Probing Dataset Size. In: Ignatov, D.I., et al. Analysis of Images, Social Networks and Texts. AIST 2023. Lecture Notes in Computer Science, vol 14486. Springer, Cham. https://doi.org/10.1007/978-3-031-54534-4_8

  • Print ISBN: 978-3-031-54533-7

  • Online ISBN: 978-3-031-54534-4