Vec2Node: Self-Training with Tensor Augmentation for Text Classification with Few Labels

Abdali, Sara; Mukherjee, Subhabrata; Papalexakis, Evangelos E.

doi:10.1007/978-3-031-26390-3_33

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13714)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Recent advances in state-of-the-art machine learning models such as deep neural networks rely heavily on large amounts of labeled training data, which are difficult to obtain for many applications. To address label scarcity, recent work has focused on data augmentation techniques that create synthetic training data. In this work, we propose a novel data augmentation approach that leverages tensor decomposition to generate synthetic samples by exploiting local and global information in text and reducing concept drift. We develop Vec2Node, which leverages self-training from in-domain unlabeled data augmented with tensorized word embeddings and significantly improves over state-of-the-art models, particularly in low-resource settings. For instance, with only 1% of labeled training data, Vec2Node improves the accuracy of a base model by 16.7%. Furthermore, Vec2Node generates explicable augmented data by leveraging tensor embeddings.
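
To make the self-training idea mentioned in the abstract concrete, the following is a minimal, generic sketch of a pseudo-labeling loop. It is not the Vec2Node method: it uses no tensor decomposition or word embeddings, and the nearest-centroid classifier, the 2-D toy points, the confidence threshold, and all function names (`centroids`, `predict`, `self_train`) are assumptions chosen purely for illustration.

```python
import math

def centroids(points, labels):
    """Compute the mean 2-D vector of each class."""
    sums, counts = {}, {}
    for (x, y), c in zip(points, labels):
        sx, sy = sums.get(c, (0.0, 0.0))
        sums[c] = (sx + x, sy + y)
        counts[c] = counts.get(c, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}

def predict(cents, point):
    """Return (label, confidence) using a softmax over negative distances."""
    weights = {c: math.exp(-math.dist(point, m)) for c, m in cents.items()}
    total = sum(weights.values())
    label = max(weights, key=weights.get)
    return label, weights[label] / total

def self_train(labeled, labels, unlabeled, threshold=0.8, rounds=5):
    """Generic self-training: repeatedly fit on the labeled set, then move
    confidently predicted unlabeled points into it with their pseudo-labels."""
    points, ys = list(labeled), list(labels)
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = centroids(points, ys)      # "teacher" fitted on current labels
        remaining = []
        for p in pool:
            label, conf = predict(cents, p)
            if conf >= threshold:          # accept the pseudo-label
                points.append(p)
                ys.append(label)
            else:                          # too uncertain, keep unlabeled
                remaining.append(p)
        if len(remaining) == len(pool):    # nothing new was labeled, stop
            break
        pool = remaining
    return centroids(points, ys), pool

# Hypothetical toy data: two labeled points and five unlabeled ones.
labeled = [(0.0, 0.0), (4.0, 4.0)]
labels = ["neg", "pos"]
unlabeled = [(0.5, 0.2), (3.8, 4.1), (1.0, 0.5), (3.0, 3.5), (2.0, 2.0)]

final_centroids, still_unlabeled = self_train(labeled, labels, unlabeled)
```

In this toy run the ambiguous midpoint `(2.0, 2.0)` never clears the confidence threshold and stays unlabeled, which shows why a threshold is commonly used to limit the noise that pseudo-labels introduce.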

S. Abdali—This research work was conducted while the first author was a Ph.D. student at the University of California, Riverside.

Notes

1. Back-translation: the process of translating a text into another language and then translating it back into the original language.
2. https://github.com/huggingface/transformers

Acknowledgments

The GPUs used for this research were donated by NVIDIA Corp. Research was partly supported by a UCR Regents Faculty Fellowship. Research was also supported by the National Science Foundation grant no. 1901379, CAREER grant no. IIS 2046086, and grant no. 2127309 to the Computing Research Association for the CIFellows project.

Author information

Authors and Affiliations

Georgia Institute of Technology, Atlanta, USA
Sara Abdali
Microsoft Research, Redmond, USA
Subhabrata Mukherjee
University of California, Riverside, USA
Evangelos E. Papalexakis

Corresponding author

Correspondence to Sara Abdali.

Editor information

Editors and Affiliations

Grenoble Alpes University, Saint Martin d'Hères, France
Massih-Reza Amini
INSA Rouen Normandy, Saint Etienne du Rouvray, France
Stéphane Canu
Ruhr-Universität Bochum, Bochum, Germany
Asja Fischer
KU Leuven, Leuven, Belgium
Tias Guns
Central European University, Vienna, Austria
Petra Kralj Novak
Aristotle University of Thessaloniki, Thessaloniki, Greece
Grigorios Tsoumakas

About this paper

Cite this paper

Abdali, S., Mukherjee, S., Papalexakis, E.E. (2023). Vec2Node: Self-Training with Tensor Augmentation for Text Classification with Few Labels. In: Amini, MR., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science, vol 13714. Springer, Cham. https://doi.org/10.1007/978-3-031-26390-3_33

DOI: https://doi.org/10.1007/978-3-031-26390-3_33
Published: 17 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26389-7
Online ISBN: 978-3-031-26390-3
eBook Packages: Computer Science, Computer Science (R0)
