
Less than Necessary or More than Sufficient: Validating Probing Dataset Size

  • Conference paper
  • First Online:
Analysis of Images, Social Networks and Texts (AIST 2023)

Abstract

A vast body of research is dedicated to interpreting language models, particularly to probing them for linguistic properties. As in many other NLP fields, probing studies tend to reuse existing datasets, resulting in increasingly specialized findings. Introducing new datasets, although necessary for truly typologically diverse studies, requires labor-intensive annotation. Meanwhile, models grow heavier, the inventory of probing methods expands, and the cost of probing experiments rises accordingly. To minimize the annotation effort for new data and to reduce the computational cost of experimenting with existing data, it is beneficial to assess how large a probing dataset actually needs to be.

We propose fractions probing, a novel method for validating probing dataset size. It comprises a data redundancy test for reviewing existing datasets and a data sufficiency test for guiding the collection of new ones. We illustrate the method's applicability on the SentEval probing suite, finding that it can be safely reduced. Our experiments cover two models, BERT and RoBERTa, and show that the latter consistently requires more data. Fractions probing can be applied analogously to investigate other datasets and models.


Notes

  1. We compute all the metrics in both ways, but interpret the layer-wise setup only for Pearson correlation. When computing layer-wise metrics, we omit Word content: the radical changes in the absolute values of its learning curve override the signal from the other tasks.

  2. The reported graphs are for the BERT model.

  3. The numerical metrics do not show such consistency. However, we suspect that they may be subject to biases such as those described in Sect. 5.1.


Author information


Corresponding author

Correspondence to Evgeny Orlov.


Appendices

A Detailed Description of the SentEval Tasks

A.1 Surface Information

These tasks test the extent to which sentence embeddings preserve surface properties of the sentences they encode. SentLen is the task of predicting the length of a sentence in words. Word content (WC) is the task of predicting which word, out of a closed set of 1,000, the sentence contains.

A.2 Syntactic Information

Bigram shift (BShift) is the task of predicting whether the sentence has intact word order or contains two inverted random adjacent words. Tree depth (TreeDepth) is the task of determining the depth of the syntactic tree of the sentence. Top constituent (TopConst) is the task of determining the sequence of top constituents immediately below the sentence (S) node.

A.3 Semantic Information

Tense is the task of predicting the tense of the main-clause verb. Subject number (SubjNum) is the task of determining the number of the subject of the main clause. Similarly, object number (ObjNum) tests for the number of the direct object of the main clause. Semantic odd man out (SOMO) is the task of predicting whether the sentence contains a replaced verb or noun that forms bigrams with the previous and following words of the same frequency as the original. Coordination inversion (CoordInv) is the task of determining whether the sentence has an intact or inverted order of clauses.
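All of the tasks above reduce to training a lightweight classifier on frozen sentence embeddings. The following is a minimal sketch of this generic probing setup; the embeddings and labels are random stand-ins rather than SentEval data, and the probe is a plain logistic regression trained by gradient descent, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen sentence embeddings and binary task labels
# (in practice, e.g. BERT layer activations and BShift labels).
dim = 32
w_true = rng.normal(size=dim)                  # hypothetical separating direction
X_train = rng.normal(size=(1000, dim))
y_train = (X_train @ w_true > 0).astype(float)
X_test = rng.normal(size=(200, dim))
y_test = (X_test @ w_true > 0).astype(float)

# A minimal logistic-regression probe trained by gradient descent.
w = np.zeros(dim)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w)))   # sigmoid
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

acc = float(np.mean((X_test @ w > 0) == (y_test > 0.5)))
```

The accuracy of such a probe, measured at varying dataset fractions, is the raw material for both tests in the paper.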

Here we would like to note that the original SentEval partition of tasks into domains should possibly be revised. For example, Tense, SubjNum and ObjNum were categorized as semantic information because the models of the time did not have access to morphology. This is no longer the case for models that use byte-pair encoding and similar subword techniques.

B Recommended Fractions for BERT and RoBERTa

Table 3 displays the results of the data redundancy and data sufficiency tests on SentEval for BERT and RoBERTa.

Table 3. Recommended fractions according to different methods for BERT and RoBERTa. Data redundancy test: r is Pearson correlation; f is Fréchet distance. Data sufficiency test: D1 and D2 are the first and second discrete differences. The Mean fraction / Total dataset size row shows, for each method, both the mean fraction across tasks and the resulting fraction of the whole SentEval suite.
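To illustrate how the two redundancy metrics in Table 3 relate a fraction's learning curve to the full-data one, the sketch below computes Pearson correlation and a discrete Fréchet distance for two hypothetical layer-wise accuracy curves; the numbers are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative layer-wise probing accuracies (one value per layer).
full = np.array([0.60, 0.68, 0.74, 0.78, 0.80, 0.81])  # probe on 100% of the data
frac = np.array([0.58, 0.66, 0.73, 0.77, 0.79, 0.80])  # probe on a smaller fraction

# Pearson correlation r between the two curves.
r = np.corrcoef(full, frac)[0, 1]

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two 1-D curves (dynamic programming)."""
    n, m = len(p), len(q)
    ca = np.full((n, m), -1.0)
    def c(i, j):
        if ca[i, j] >= 0:
            return ca[i, j]
        d = abs(p[i] - q[j])
        if i == 0 and j == 0:
            ca[i, j] = d
        elif i == 0:
            ca[i, j] = max(c(0, j - 1), d)
        elif j == 0:
            ca[i, j] = max(c(i - 1, 0), d)
        else:
            ca[i, j] = max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
        return ca[i, j]
    return c(n - 1, m - 1)

f = discrete_frechet(full, frac)
```

A high r and a low f indicate that the fraction's curve already mirrors the full-data curve, i.e. the extra data is redundant under these metrics.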

C Limitations

Our work has a number of limitations. First, our results on SentEval cannot be extrapolated to other datasets without rerunning our method on them. Second, our method does not allow extrapolating the learning curves of the metric in the data redundancy test; one has to build them empirically by running experiments on larger fractions of the data.
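One illustrative way to build such a curve empirically and locate the fraction at which it plateaus is sketched below; all accuracies, fraction steps, and the threshold eps are hypothetical, with D1 and D2 as in Table 3:

```python
import numpy as np

# Hypothetical probing accuracies measured at increasing dataset fractions.
fractions = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
accuracy  = np.array([0.620, 0.710, 0.760, 0.790, 0.812,
                      0.820, 0.824, 0.827, 0.829, 0.830])

d1 = np.diff(accuracy)         # first discrete difference (D1)
d2 = np.diff(accuracy, n=2)    # second discrete difference (D2)

eps = 0.01  # hypothetical plateau threshold on D1
# Smallest fraction after which every further gain stays below eps.
plateau = next(i for i in range(len(d1)) if np.all(np.abs(d1[i:]) < eps))
recommended = float(fractions[plateau])
```

Under these made-up numbers the curve stops gaining more than one accuracy point per step after half of the data, so 0.5 would be the recommended fraction.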

Finally, one should always keep in mind the known weaknesses of probing studies, such as the misleading nature of accuracy scores. Although fractions probing can easily be adapted to safer methods, such as selectivity-based ones or MDL, in this work we employed the vanilla classification technique. We encourage researchers to choose proper probing methods responsibly when using fractions probing.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Orlov, E., Serikov, O. (2024). Less than Necessary or More than Sufficient: Validating Probing Dataset Size. In: Ignatov, D.I., et al. Analysis of Images, Social Networks and Texts. AIST 2023. Lecture Notes in Computer Science, vol 14486. Springer, Cham. https://doi.org/10.1007/978-3-031-54534-4_8

  • Print ISBN: 978-3-031-54533-7

  • Online ISBN: 978-3-031-54534-4