Abstract
Recent advances in state-of-the-art machine learning models such as deep neural networks rely heavily on large amounts of labeled training data, which are difficult to obtain for many applications. To address label scarcity, recent work has focused on data augmentation techniques that create synthetic training data. In this work, we propose a novel approach to data augmentation that leverages tensor decomposition to generate synthetic samples by exploiting local and global information in text and reducing concept drift. We develop Vec2Node, which leverages self-training on in-domain unlabeled data augmented with tensorized word embeddings and significantly improves over state-of-the-art models, particularly in low-resource settings. For instance, with only \(1\%\) of the labeled training data, Vec2Node improves the accuracy of a base model by \(16.7\%\). Furthermore, Vec2Node generates explicable augmented data by leveraging tensor embeddings.
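To make the self-training idea concrete, the following is a minimal, generic pseudo-labeling sketch. It is not the Vec2Node pipeline: the nearest-centroid classifier, the distance-margin confidence rule, and all names (`centroids`, `predict`, `self_train`, `min_margin`) are assumptions chosen for illustration, and the tensor-based augmentation step is omitted entirely.

```python
from math import dist


def centroids(points, labels):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(cents, p):
    """Return (label, margin): nearest centroid and its gap to the runner-up."""
    ranked = sorted((dist(p, c), y) for y, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled, labels, unlabeled, min_margin=1.0, max_rounds=5):
    """Generic self-training: pseudo-label confident points, then refit."""
    points, ys, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, ys)
        confident, rest = [], []
        for p in pool:
            y, margin = predict(cents, p)
            (confident if margin >= min_margin else rest).append((p, y))
        if not confident:
            break  # nothing new to learn from; stop early
        for p, y in confident:
            points.append(p)  # the pseudo-labeled point joins the training set
            ys.append(y)
        pool = [p for p, _ in rest]
    return centroids(points, ys), pool


# Toy data (hypothetical): two labeled points, three unlabeled ones.
model, leftover = self_train(
    labeled=[(0.0, 0.0), (4.0, 4.0)],
    labels=["neg", "pos"],
    unlabeled=[(0.5, 0.0), (3.5, 4.0), (2.0, 2.0)],
)
```

In this toy run the two unlabeled points near a class centroid are pseudo-labeled and absorbed into the training set, while the ambiguous midpoint stays in the unlabeled pool. A real system would replace the centroid model with a neural classifier and would need a calibrated confidence measure.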
S. Abdali—This research work was conducted while the first author was a Ph.D. student at the University of California, Riverside.
Notes
- 1. The process of translating a text into another language and then translating it back into the original language.
Acknowledgments
The GPUs used for this research were donated by NVIDIA Corp. Research was partly supported by a UCR Regents Faculty Fellowship. Research was also supported by National Science Foundation grant no. 1901379, CAREER grant no. IIS 2046086, and grant no. 2127309 to the Computing Research Association for the CIFellows project.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Abdali, S., Mukherjee, S., Papalexakis, E.E. (2023). Vec2Node: Self-Training with Tensor Augmentation for Text Classification with Few Labels. In: Amini, MR., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science, vol 13714. Springer, Cham. https://doi.org/10.1007/978-3-031-26390-3_33
DOI: https://doi.org/10.1007/978-3-031-26390-3_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26389-7
Online ISBN: 978-3-031-26390-3
eBook Packages: Computer Science, Computer Science (R0)