Abstract
Research geared toward human well-being in developing nations often concentrates on web content written in a world language (e.g., English) and ignores a significant chunk of content written in a poorly resourced yet highly prevalent first language of the region of concern (e.g., Hindi). Such omissions are common due to the sheer mismatch between the linguistic resources offered for a world language and for its low-resource counterpart. However, during a global pandemic or an imminent war, demand for linguistic resources might get recalibrated. In this work, we focus on the high-resource and low-resource language pair \(\langle en , hi _e \rangle \) (English, and Romanized Hindi) and present a cross-lingual sampling method that takes example documents in English and retrieves similar content written in Romanized Hindi, the most popular form of Hindi observed in social media. At the core of our technique is a novel finding that a surprisingly simple constrained nearest-neighbor sampling in a polyglot Skip-gram word embedding space can retrieve substantial bilingual lexicons, even from harsh social media data sets. Our cross-lingual sampling method obtains substantial performance improvement in the important domains of detecting peace-seeking, hostility-diffusing hope speech in the context of the 2019 India-Pakistan conflict, and of detecting comments encouraging compliance with COVID-19 guidelines.
A.R. KhudaBukhsh and S. Palakodety—Equal contribution first authors.
Notes
1. Resources and additional details are available at: https://www.cs.cmu.edu/~akhudabu/SocInfo2022.html.
References
Artetxe, M., Labaka, G., Agirre, E.: Learning bilingual word embeddings with (almost) no bilingual data. In: ACL 2017, pp. 451–462 (2017). https://doi.org/10.18653/v1/P17-1042
Benesch, S.: Defining and diminishing hate speech. State World’s Minorities Indigenous Peoples 2014, 18–25 (2014)
Benesch, S., Ruths, D., Dillon, K.P., Saleem, H.M., Wright, L.: Counterspeech on Twitter: a field study. A report for Public Safety Canada under the Kanishka Project (2016)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. TACL 5, 135–146 (2017)
Cieri, C., Maxwell, M., Strassel, S., Tracey, J.: Selection criteria for low resource language programs. In: LREC, pp. 4543–4549 (2016)
Dinu, G., Lazaridou, A., Baroni, M.: Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568 (2014)
Dou, Z.Y., Zhou, Z.H., Huang, S.: Unsupervised bilingual lexicon induction via latent variable models. In: EMNLP 2018, pp. 621–626 (2018)
Gella, S., Bali, K., Choudhury, M.: “ye word kis lang ka hai bhai?” testing the limits of word level language identification. In: ICNLP-2014, pp. 368–377 (2014)
Gumperz, J.J.: Discourse Strategies, vol. 1. Cambridge University Press, Cambridge (1982)
Jegou, H., Schmid, C., Harzallah, H., Verbeek, J.: Accurate image search using the contextual dissimilarity measure. PAMI 32(1), 2–11 (2010)
KhudaBukhsh, A.R., Palakodety, S., Carbonell, J.G.: Harnessing code switching to transcend the linguistic barrier. In: IJCAI-PRICAI, pp. 4366–4374 (2020)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT summit, vol. 5, pp. 79–86 (2005)
Lample, G., Conneau, A., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. In: 6th International Conference on Learning Representations, ICLR 2018. OpenReview.net (2018). https://openreview.net/forum?id=H196sainb
Lample, G., Ott, M., Conneau, A., Denoyer, L., Ranzato, M.: Phrase-based & neural unsupervised machine translation. In: EMNLP-2018, pp. 5039–5049 (2018). https://doi.org/10.18653/v1/D18-1549, https://www.aclweb.org/anthology/D18-1549
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)
Mathew, B., et al.: Thou shalt not hate: countering online hate speech. In: Proceedings of the Thirteenth International Conference on Web and Social Media, ICWSM 2019, pp. 369–380. AAAI Press (2019)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013)
Mulcaire, P., Kasai, J., Smith, N.A.: Low-resource parsing with crosslingual contextualized representations. In: CoNLL, pp. 304–315 (2019)
Mulcaire, P., Kasai, J., Smith, N.A.: Polyglot contextual representations improve crosslingual transfer. In: NAACL-HLT-2019, pp. 3912–3918 (2019). https://doi.org/10.18653/v1/N19-1392
Mulcaire, P., Swayamdipta, S., Smith, N.A.: Polyglot semantic role labeling. In: ACL-2018, pp. 667–672 (2018). https://doi.org/10.18653/v1/P18-2106, https://www.aclweb.org/anthology/P18-2106
Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Hope speech detection: a computational analysis of the voice of peace. In: ECAI-2020, pp. 1881–1889 (2020)
Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Mining insights from large-scale corpora using fine-tuned language models. In: ECAI-20, pp. 1890–1897 (2020)
Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Voice for the voiceless: active sampling to detect comments supporting the Rohingyas. In: AAAI-20, pp. 454–462 (2020)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the EMNLP, pp. 1532–1543 (2014)
Radovanović, M., Nanopoulos, A., Ivanović, M.: Hubs in space: popular nearest neighbors in high-dimensional data. JMLR 11(Sep), 2487–2531 (2010)
Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 65, 569–631 (2019)
Saha, P., Singh, K., Kumar, A., Mathew, B., Mukherjee, A.: CounterGeDi: a controllable approach to generate polite, detoxified and emotional counter speech. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, pp. 5157–5163. ijcai.org (2022)
Sarkar, R., Mahinder, S., KhudaBukhsh, A.: The non-native speaker aspect: Indian English in social media. In: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 61–70. Association for Computational Linguistics, Online (2020)
Sarkar, R., Mahinder, S., Sarkar, H., KhudaBukhsh, A.: Social media attributions in the context of water crisis. In: EMNLP, pp. 1402–1412. Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.109, https://www.aclweb.org/anthology/2020.emnlp-main.109
Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859 (2017)
Toon, O.B., et al.: Rapidly expanding nuclear arsenals in Pakistan and India portend regional and global catastrophe. Sci. Adv. 5(10), eaay5478 (2019)
Tyagi, A., Field, A., Lathwal, P., Tsvetkov, Y., Carley, K.M.: A computational analysis of polarization on Indian and Pakistani social media. In: SocInfo 2020. Lecture Notes in Computer Science, vol. 12467, pp. 364–379 (2020). https://doi.org/10.1007/978-3-030-60975-7_27
Yoo, C.H., Palakodety, S., Sarkar, R., KhudaBukhsh, A.: Empathy and hope: resource transfer to model inter-country social media dynamics. In: Proceedings of the 1st Workshop on NLP for Positive Impact, pp. 125–134. Association for Computational Linguistics, Online (2021)
Zhang, M., Liu, Y., Luan, H., Sun, M.: Adversarial training for unsupervised bilingual lexicon induction. In: ACL-2017, pp. 1959–1970 (2017)
A Appendix
A.1 Ethics Statement
While the setting discussed in the paper involves humanitarian tasks, the techniques can be trivially adapted to conduct cross-lingual sampling and surfacing of content like hate speech, or to detect hope speech with the explicit objective of censoring it. In many recent conflicts in the Indian subcontinent, such systems can have adverse social effects, and thus particular care is needed before they are deployed. Next, language-specific features can sometimes cause syntactically similar but semantically opposite content to be surfaced, underscoring the need for a human-in-the-loop setting before such systems are deployed for social media content moderation tasks. Finally, in this work, care is taken to ensure that no particular community is the target of the sampled content. NLP methods can be utilized to selectively conduct cross-lingual sampling to discover content against disenfranchised communities; it is imperative for system designers to ensure that, unwittingly or otherwise, communities at large are not targeted by system deployments.
A.2 News Networks
See Table 10.
A.3 Analyzing Other Language Pairs
We were curious to learn whether our approach works with other language pairs. On two European language pairs, \(\langle en , es \rangle \) and \(\langle en , de \rangle \), we observed that our simple approach of constrained nearest-neighbor sampling retrieves reasonable bilingual lexicons even when trained on a single, synthetically induced multilingual corpus without any explicit attempt at alignment.
Data Sets: We conduct experiments using the Europarl [12] and Wikipedia data sets. We synthetically induce a multilingual corpus by combining two monolingual corpora and then randomly shuffling at the sentence level. Table 6 summarizes our results. We find that our overall performance improved with Wikipedia data, especially for de \(\rightarrow \) en and es \(\rightarrow \) en. A performance boost with Wikipedia data was also reported in [13].
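The corpus induction step above is simple enough to sketch directly. The following is a minimal illustration (function and variable names are ours, not from the paper): two monolingual sentence lists are concatenated and shuffled at the sentence level, with no alignment information of any kind.

```python
import random

def induce_multilingual_corpus(corpus_a, corpus_b, seed=0):
    """Combine two monolingual corpora (lists of sentences) into a
    single corpus and shuffle at the sentence level."""
    merged = list(corpus_a) + list(corpus_b)
    random.Random(seed).shuffle(merged)
    return merged

# Toy example: an English and a Spanish monolingual sample.
en = ["the parliament convened today", "the votes were counted"]
es = ["el parlamento se reunio hoy", "se contaron los votos"]
mixed = induce_multilingual_corpus(en, es)
```

Embeddings trained on such a mixed corpus then occupy a single shared vector space for both languages.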
Our primary takeaways are:
Source word frequency: Our experiments with Indian social media data sets indicate that our method performs better when we restrict ourselves to high-frequency source words. A fine-grained look at the performance based on the frequency of the source word reveals that we perform substantially better on high-frequency words belonging to \(\mathcal {V}_{source }^{0-5}\) (e.g., \(en \rightarrow es\) performance jumps from 0.25 to 0.61 when we consider words in \(\mathcal {V}_{source }^{0-5}\)).
Topical cohesion: When we sample the en part of the corpus from Europarl and the es (or de) part from Wikipedia, we remove the topical cohesion between the en and es (de) components. We observe that performance dips slightly.
A.4 \(\mathcal {D}^{ covid }\) Data Set Visualization
Figure 3 presents a 2D visualization of the word embeddings obtained using the language identifier we considered [11]. The visualization indicates that, apart from Romanized Hindi and English, our data set also exhibits a substantial presence of Hindi written in the Devanagari script, further underscoring the challenges associated with our task. The sizes of the estimated vocabularies are presented in Table 9 (Table 11).
A.5 Annotation
We used two annotators fluent in Hindi, Urdu, and English. For word translations, consensus labels were used. For hope speech annotation, the annotators first labeled independently, and differences were then resolved through discussion; the minimum Fleiss' \(\kappa \) measure was high (0.88), indicating strong inter-rater agreement.
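For completeness, Fleiss' \(\kappa \) over a set of items can be computed as follows. This is the standard formulation; the rating matrix in the example is illustrative and is not our annotation data.

```python
def fleiss_kappa(ratings):
    """ratings[i][j]: number of raters assigning item i to category j.
    Every item must be rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Mean per-item agreement P_bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement P_e from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    grand = n_items * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Two raters, two categories, four items; raters disagree on the last item.
kappa = fleiss_kappa([[2, 0], [0, 2], [2, 0], [1, 1]])
```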
A.6 System Pipeline
See Fig. 4.
A.7 Hyperparameters
Our preprocessing steps and the hyperparameters used to train embeddings are identical to those in prior literature [11]. All the models discussed in this paper are obtained by training fastText [4] Skip-gram models with the following parameters unless stated otherwise:
- Dimension: 100
- Minimum subword unit length: 2
- Maximum subword unit length: 4
- Epochs: 5
- Context window: 5
- Number of negatives sampled: 5
A.8 Hyperparameter Sensitivity Analysis
Recall that we restricted \({\mathcal {V}}_{source }\) and \({\mathcal {V}}_{target }\) to the prevalence criteria that (1) \({\mathcal {V}}_{source }\) is restricted to \({\mathcal {V}}_{source }^{0-5}\), and (2) \({\mathcal {V}}_{target }\) contains words that have appeared at least 100 times in the corpus. In Table 7, we relax the prevalence criterion on \({\mathcal {V}}_{source }\) and observe that as we move toward more infrequent words, the translation performance degrades. The performance drop is more visible with \({\mathcal {V}}_{source }^{10-100}\). Our annotators reported that poor spelling quality and the increased prevalence of contractions made the annotation task particularly challenging for rare words.
We next analyze the effect of the frequency threshold of 100 on \({\mathcal {V}}_{target }\). To reduce annotation burden, we focused only on the subset of words with perfect translation (i.e., 100% p@1 performance). When we relax the frequency threshold to 10, our p@1, p@5, and p@10 numbers are 0.38, 0.84, and 0.91, respectively. Hence, although for 91% of the source words we found a translation within the top 10 candidates, our p@1 performance took a considerable hit. Our annotators reported that with a lowered frequency threshold, the retrieved translations contained a higher degree of misspellings. We conclude from this experiment that 100 is a reasonable threshold given the noisy nature of our corpora.
We conducted a similar analysis on our word translation tasks using European language pairs. As shown in Table 8, when English is the source language, our translation performance on frequent words is substantially better than on rare words. However, when English is the target language, we did not observe any similar trend; performance was roughly equal across the entire spectrum of words ranked by frequency. With the Wikipedia corpus (not shown in the table), we observed qualitatively similar trends.
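The constrained nearest-neighbor lookup at the heart of the method can be sketched with plain NumPy: rank every word by cosine similarity to the source word, but admit only candidates carrying the target-language label. The embeddings and labels below are toy values; in practice they would come from the polyglot model and the language identifier.

```python
import numpy as np

def constrained_nn(emb, words, langs, source, target_lang, k=10):
    """k nearest neighbors of `source`, restricted to words labeled
    `target_lang`, by cosine similarity in the shared embedding space."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[words.index(source)]
    candidates = [i for i, l in enumerate(langs) if l == target_lang]
    candidates.sort(key=lambda i: -sims[i])
    return [words[i] for i in candidates[:k]]

# Toy shared space: "accha" sits near "good", "bura" near "bad".
words = ["good", "bad", "accha", "bura"]
langs = ["en", "en", "hi", "hi"]
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
top = constrained_nn(emb, words, langs, "good", "hi", k=1)
```

Restricting the candidate set to the target language is what turns a plain nearest-neighbor query over the mixed vocabulary into a translation lookup.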
A.9 Extended Examples of Lexicons
Table 12 lists an extended bilingual lexicon containing 90 word pairs (30 from each corpus) obtained using our method. We will release the complete lexicon of 1,100 word pairs upon acceptance.
A.10 Disabling Pairs
We also disabled select bigrams in the corpus to investigate the contribution of \(\langle number , string \rangle \) phrases. Any \(\langle number , string \rangle \) pair was replaced with a random number and random string pair throughout the corpus. Our results showed a 48% dip in p@1 performance, indicating that these phrases contribute substantially to the word translation phenomenon observed.
A.11 Loanwords
We now slightly abuse the definition of a loanword and consider a word a loanword if it appears in a context of words written in a different language, and we define a simple measure to quantify the extent to which this occurs in a two-language setting. Let c denote the context (single word to the left and right) of a word w. We first count the instances where the language labels of c and w agree, i.e., \(\mathcal {L}({w}) = \mathcal {L}({c})\) (e.g., is not a loanword in the following phrase: ). Let this number be denoted \(\mathcal {N}_{not\text {-}borrowed }\). Similarly, we count the instances when c and w have different language labels, i.e., \(\mathcal {L}({w}) \ne \mathcal {L}({c})\), and denote this number \(\mathcal {N}_{borrowed }\); this scenario arises when a word is borrowed from a different language (e.g., is a loanword in ). In our scheme, the Loan Word Index (LWI) of a word w is defined as LWI(w) = \(\frac{\mathcal {N}_{borrowed }}{\mathcal {N}_{borrowed } + \mathcal {N}_{not\text {-}borrowed }}\). A high LWI indicates substantial lexical borrowing of the word outside its language. For a word pair \(\langle w_{source }, w_{target } \rangle \), we define LWI(.) as the maximum of the two individual LWIs. For example, if the LWI is high for the pair , it indicates that at least one of these words was substantially borrowed.

Our hypothesis is that successfully translated word pairs have a high LWI, indicating that at least one member of the pair was used as a loanword, which facilitates translation. The average Loan Word Index of all successfully translated word pairs in our test data sets across all three corpora is 0.15; in comparison, randomly generated word pairings have an average Loan Word Index of 0.09. We next performed a frequency-preserving loanword exchange to modify the corpus, in which translated word pairs are interchanged to diminish the extent to which words are borrowed (e.g., a phrase like is rewritten as ).

Frequency is preserved by interchanging both words in a successfully translated word pair as many times as the less borrowed word is borrowed. In our example, if was borrowed 10 times, and 15 times, we alter 10 instances where is borrowed with , and 10 instances where is borrowed with . We thus preserve word frequencies while diminishing the loanword phenomenon. We observed that the retrieval performance on our p@1 set dipped by 33% after this corpus modification, indicating that frequent borrowing of words likely contributed positively to our method's translation performance.
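The LWI computation admits a direct implementation. The sketch below treats the left and right context words independently (one possible reading of the definition above) and uses hypothetical language-tagged tokens.

```python
def loan_word_index(tagged_sentences, word):
    """LWI(w) = N_borrowed / (N_borrowed + N_not_borrowed), where an
    occurrence counts toward N_borrowed when an adjacent context word
    carries a different language label than w."""
    borrowed = not_borrowed = 0
    for sent in tagged_sentences:        # sent: list of (token, lang)
        for i, (tok, lang) in enumerate(sent):
            if tok != word:
                continue
            for j in (i - 1, i + 1):     # single-word left/right context
                if 0 <= j < len(sent):
                    if sent[j][1] == lang:
                        not_borrowed += 1
                    else:
                        borrowed += 1
    total = borrowed + not_borrowed
    return borrowed / total if total else 0.0

# "mask" occurs once in an English context and once in a Hindi context.
corpus = [
    [("wear", "en"), ("a", "en"), ("mask", "en")],
    [("mask", "en"), ("pehno", "hi")],
]
lwi = loan_word_index(corpus, "mask")
```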
A.12 Analysis of Discovered Words
In our translation scheme, we found that translations for nouns, adjectives, and adverbs were successfully discovered (see Table 2). Preserving plurality ( , , ) on most occasions and translating numerals ( , ) were among the surprising observations considering the noisy social media setting. For a given source word, multiple valid synonymous target words were often among the top translations produced by our method (e.g., and for ; , and for ). Stylistic choices like contraction were reflected in the translations (e.g., (kyuki) mapped to (because), and (sahi) mapped to (right)). Verbs are conjugated differently in Hindi and English, and word-for-word translations do not typically exist; for instance, translates to , and thus words like were rarely successfully translated.
Polysemy: During single-word translation, without context, polysemous words cannot be resolved to their intended meanings. However, we noticed that in a few instances the top translation choices for a polysemous source word included valid translations of its different meanings. For example, the word can mean either low temperature or a common viral infection. In \(\mathcal {D}^{ covid }\), both of these meanings were captured in the top translations.
Nativization of Loanwords: Lexical borrowing across language pairs has been studied in both linguistics and computational linguistics. Borrowed words, also known as loanwords, are lexical items adopted from a donor language. For example, the English word or is borrowed from Hindi, while () and () are Hindi words borrowed from English. We noticed that nativized loanwords, i.e., borrowed words that underwent phonological repairs to adapt to the borrowing language, translate back to their English donor counterparts (e.g., and translate to the donor words and , respectively).
A.13 Topical Cohesion
We break topical cohesion by sampling the en part of the corpus from Europarl and the es (de) part from Wikipedia. Our results show that bilingual lexicons are still retrieved, albeit with marginally lower performance. We conclude that topical cohesion possibly helps but may not be a prerequisite for retrieving a reasonably sized bilingual lexicon.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
KhudaBukhsh, A.R., Palakodety, S., Mitchell, T.M. (2022). Harnessing Unsupervised Word Translation to Address Resource Inequality for Peace and Health. In: Hopfgartner, F., Jaidka, K., Mayr, P., Jose, J., Breitsohl, J. (eds) Social Informatics. SocInfo 2022. Lecture Notes in Computer Science, vol 13618. Springer, Cham. https://doi.org/10.1007/978-3-031-19097-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19096-4
Online ISBN: 978-3-031-19097-1