LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning

  • Conference paper
  • Natural Language Processing and Information Systems (NLDB 2024)

Abstract

Transformer-based models have significantly advanced natural language processing, in particular performance on text classification tasks. Nevertheless, these models struggle to process large files, primarily because their inputs are generally restricted to hundreds or thousands of tokens. Existing attempts to address this issue typically extract only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. It is optimized for efficient training on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments on seven diverse and comprehensive benchmark datasets to assess LaFiCMIL’s effectiveness. By integrating BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. A notable achievement of our approach is its ability to scale BERT to handle nearly 20,000 tokens while training on a single GPU with 32 GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL’s potential as a groundbreaking approach to large file classification.

K. Allix—Independent Researcher.
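The abstract frames large file classification as correlated multiple instance learning: a file too long for BERT is split into fixed-size chunks (instances), each chunk is embedded independently, and a lightweight head models correlations among the chunk embeddings before producing a single bag-level prediction. The PyTorch sketch below is a minimal illustration of that general recipe, not the authors' implementation; the class `CorrelatedMILClassifier`, the `chunk_file` helper, and all hyperparameters are hypothetical choices of ours.

```python
# Minimal, hypothetical sketch of chunk-level correlated MIL over BERT
# features; NOT the LaFiCMIL architecture from the paper.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class CorrelatedMILClassifier(nn.Module):
    def __init__(self, num_classes: int, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # instance-level feature extractor
        dim = self.encoder.config.hidden_size
        # Self-attention across chunk embeddings supplies the "correlated"
        # part of correlated MIL: instances are not treated as independent.
        self.instance_mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.bag_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable bag-level query
        self.head = nn.Linear(dim, num_classes)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # input_ids / attention_mask: (num_chunks, chunk_len) -- one bag per call.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        instances = out.last_hidden_state[:, 0]           # per-chunk [CLS] vectors
        bag = torch.cat([self.bag_token[0], instances]).unsqueeze(0)
        bag = self.instance_mixer(bag)                    # mix information across chunks
        return self.head(bag[:, 0])                       # bag-level logits, shape (1, num_classes)


def chunk_file(text: str, tokenizer, chunk_len: int = 512, max_chunks: int = 38):
    """Split a long text into BERT-sized instance chunks.

    38 chunks x 512 tokens is roughly the ~20,000-token scale the abstract
    reports for a single 32 GB GPU.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"][: chunk_len * max_chunks]
    chunks = [ids[i : i + chunk_len] for i in range(0, len(ids), chunk_len)]
    batch = tokenizer.pad({"input_ids": chunks}, return_tensors="pt")
    return batch["input_ids"], batch["attention_mask"]


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = CorrelatedMILClassifier(num_classes=2)
    ids, mask = chunk_file("some very long file content ... " * 500, tokenizer)
    logits = model(ids, mask)  # one prediction for the whole file
    print(logits.shape)
```

In practice, encoding all chunks in one forward pass dominates GPU memory, so schemes of this kind typically encode chunks in mini-batches or use gradient checkpointing to stay within a single GPU's budget.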



Acknowledgment

This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant references 16344458 (REPROCESS), 18154263 (UNLOCK), and 17046335 (AFR PhD grant).

Author information


Corresponding author

Correspondence to Tiezhu Sun.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sun, T., Pian, W., Daoudi, N., Allix, K., Bissyandé, T.F., Klein, J. (2024). LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14762. Springer, Cham. https://doi.org/10.1007/978-3-031-70239-6_5

  • DOI: https://doi.org/10.1007/978-3-031-70239-6_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70238-9

  • Online ISBN: 978-3-031-70239-6

  • eBook Packages: Computer Science, Computer Science (R0)
