DOI: 10.1145/3535782.3535835
research-article

Towards Bilingual Word Embedding Models for Engineering: Evaluating Semantic Linking Capabilities of Engineering-Specific Word Embeddings Across Languages

Published: 18 July 2022

Abstract

Word embeddings represent the semantic meanings of words in a high-dimensional vector space. Because of this capability, they can be used in a wide range of Natural Language Processing (NLP) tasks. While domain-specific monolingual word embeddings are common in the literature, domain-specific bilingual word embeddings are rare. In general, large text corpora are required to train high-quality word embeddings, and training domain-specific embeddings additionally requires source texts from the relevant domain. To train bilingual domain-specific word embeddings, these domain-specific texts must also be available in two languages. In this paper, we use a large dataset of engineering-related articles in German and English to train bilingual engineering-specific word embedding models using different approaches. We evaluate the trained models, identify the most promising approach, and demonstrate that the best-performing model represents semantic relationships between engineering-specific words well and maps both languages into a shared vector space. Moreover, we show that additionally using an engineering-specific learning dictionary can improve the quality of bilingual engineering-specific word embeddings.
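
The abstract does not say which mapping technique the trained models use. As an illustration only, the sketch below shows one common way a bilingual dictionary can align two monolingual embedding spaces: an orthogonal (Procrustes) mapping learned from dictionary pairs. The vocabularies, the synthetic vectors, and the function names `learn_orthogonal_mapping` and `nearest_word` are hypothetical and are not taken from the paper.

```python
import numpy as np

def learn_orthogonal_mapping(src_vecs, tgt_vecs):
    """Return the orthogonal W minimising ||src_vecs @ W - tgt_vecs||_F."""
    # Closed-form Procrustes solution via SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

def nearest_word(vec, vocab, matrix):
    """Return the word whose row in `matrix` is most cosine-similar to `vec`."""
    sims = (matrix @ vec) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec))
    return vocab[int(np.argmax(sims))]

# Toy data (assumption): the German space is an exact rotation of the English one.
rng = np.random.default_rng(0)
en_vocab = ["shaft", "bearing", "valve", "gear", "bolt", "torque"]
de_vocab = ["Welle", "Lager", "Ventil", "Zahnrad", "Bolzen", "Drehmoment"]
en_vecs = rng.normal(size=(len(en_vocab), 4))
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
de_vecs = en_vecs @ rotation

# The first five word pairs act as the learning dictionary; the last is held out.
W = learn_orthogonal_mapping(de_vecs[:5], en_vecs[:5])

# Map the held-out German word into the English space and look up its neighbour.
translation = nearest_word(de_vecs[5] @ W, en_vocab, en_vecs)
```

In this synthetic setup the two spaces differ only by a rotation, so the held-out word lands on its English counterpart. Real embedding spaces are only approximately related in this way, which is why the size and domain coverage of the dictionary matter.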

Cited By

  • (2023) Semantic Label Representations with Lbl2Vec: A Similarity-Based Approach for Unsupervised Text Classification. Web Information Systems and Technologies, pp. 59-73. DOI: 10.1007/978-3-031-24197-0_4. Online publication date: 18-Jan-2023
  • (2022) Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches. Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval, pp. 6-15. DOI: 10.1145/3582768.3582795. Online publication date: 16-Dec-2022

Published In

MSIE '22: Proceedings of the 4th International Conference on Management Science and Industrial Engineering
April 2022
497 pages
ISBN: 9781450395816
DOI: 10.1145/3535782
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MSIE 2022
