
Harnessing Unsupervised Word Translation to Address Resource Inequality for Peace and Health

Conference paper · Social Informatics (SocInfo 2022)

Abstract

Research geared toward human well-being in developing nations often concentrates on web content written in a world language (e.g., English) and ignores a significant chunk of content written in a poorly resourced yet highly prevalent first language of the region in question (e.g., Hindi). Such omissions are common due to the sheer mismatch between the linguistic resources offered in a world language and in its low-resource counterpart. However, during a global pandemic or an imminent war, the demand for linguistic resources may be recalibrated. In this work, we focus on the high-resource and low-resource language pair \(\langle en, hi_e \rangle\) (English and Romanized Hindi) and present a cross-lingual sampling method that takes example documents in English and retrieves similar content written in Romanized Hindi, the most popular form of Hindi observed on social media. At the core of our technique is a novel finding that a surprisingly simple constrained nearest-neighbor sampling in a polyglot Skip-gram word embedding space can retrieve substantial bilingual lexicons, even from noisy social media data sets. Our cross-lingual sampling method obtains substantial performance improvements in the important domains of detecting peace-seeking, hostility-diffusing hope speech in the context of the 2019 India-Pakistan conflict, and of detecting comments encouraging compliance with COVID-19 guidelines.

A. R. KhudaBukhsh and S. Palakodety are equal contribution first authors.


Notes

  1. Resources and additional details are available at: https://www.cs.cmu.edu/~akhudabu/SocInfo2022.html.

  2. https://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/prevention.html.

References

  1. Artetxe, M., Labaka, G., Agirre, E.: Learning bilingual word embeddings with (almost) no bilingual data. In: ACL 2017, pp. 451–462 (2017). https://doi.org/10.18653/v1/P17-1042

  2. Benesch, S.: Defining and diminishing hate speech. State World's Minorities Indigenous Peoples 2014, 18–25 (2014)

  3. Benesch, S., Ruths, D., Dillon, K.P., Saleem, H.M., Wright, L.: Counterspeech on Twitter: a field study. A report for Public Safety Canada under the Kanishka Project (2016)

  4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. TACL 5, 135–146 (2017)

  5. Cieri, C., Maxwell, M., Strassel, S., Tracey, J.: Selection criteria for low resource language programs. In: LREC, pp. 4543–4549 (2016)

  6. Dinu, G., Lazaridou, A., Baroni, M.: Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568 (2014)

  7. Dou, Z.Y., Zhou, Z.H., Huang, S.: Unsupervised bilingual lexicon induction via latent variable models. In: EMNLP 2018, pp. 621–626 (2018)

  8. Gella, S., Bali, K., Choudhury, M.: "ye word kis lang ka hai bhai?" Testing the limits of word level language identification. In: ICNLP-2014, pp. 368–377 (2014)

  9. Gumperz, J.J.: Discourse Strategies, vol. 1. Cambridge University Press, Cambridge (1982)

  10. Jegou, H., Schmid, C., Harzallah, H., Verbeek, J.: Accurate image search using the contextual dissimilarity measure. PAMI 32(1), 2–11 (2008)

  11. KhudaBukhsh, A.R., Palakodety, S., Carbonell, J.G.: Harnessing code switching to transcend the linguistic barrier. In: IJCAI-PRICAI, pp. 4366–4374 (2020)

  12. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT Summit, vol. 5, pp. 79–86 (2005)

  13. Lample, G., Conneau, A., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. In: ICLR 2018. OpenReview.net (2018). https://openreview.net/forum?id=H196sainb

  14. Lample, G., Ott, M., Conneau, A., Denoyer, L., Ranzato, M.: Phrase-based & neural unsupervised machine translation. In: EMNLP-2018, pp. 5039–5049 (2018). https://doi.org/10.18653/v1/D18-1549

  15. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)

  16. Mathew, B., et al.: Thou shalt not hate: countering online hate speech. In: ICWSM 2019, pp. 369–380. AAAI Press (2019)

  17. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  18. Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013)

  19. Mulcaire, P., Kasai, J., Smith, N.A.: Low-resource parsing with crosslingual contextualized representations. In: CoNLL, pp. 304–315 (2019)

  20. Mulcaire, P., Kasai, J., Smith, N.A.: Polyglot contextual representations improve crosslingual transfer. In: NAACL-HLT-2019, pp. 3912–3918 (2019). https://doi.org/10.18653/v1/N19-1392

  21. Mulcaire, P., Swayamdipta, S., Smith, N.A.: Polyglot semantic role labeling. In: ACL-2018, pp. 667–672 (2018). https://doi.org/10.18653/v1/P18-2106

  22. Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Hope speech detection: a computational analysis of the voice of peace. In: ECAI-2020, pp. 1881–1889 (2020)

  23. Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Mining insights from large-scale corpora using fine-tuned language models. In: ECAI-2020, pp. 1890–1897 (2020)

  24. Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Voice for the voiceless: active sampling to detect comments supporting the Rohingyas. In: AAAI-2020, pp. 454–462 (2020)

  25. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)

  26. Radovanović, M., Nanopoulos, A., Ivanović, M.: Hubs in space: popular nearest neighbors in high-dimensional data. JMLR 11, 2487–2531 (2010)

  27. Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 65, 569–631 (2019)

  28. Saha, P., Singh, K., Kumar, A., Mathew, B., Mukherjee, A.: CounterGeDi: a controllable approach to generate polite, detoxified and emotional counter speech. In: IJCAI 2022, pp. 5157–5163. ijcai.org (2022)

  29. Sarkar, R., Mahinder, S., KhudaBukhsh, A.: The non-native speaker aspect: Indian English in social media. In: W-NUT 2020, pp. 61–70. Association for Computational Linguistics (2020)

  30. Sarkar, R., Mahinder, S., Sarkar, H., KhudaBukhsh, A.: Social media attributions in the context of water crisis. In: EMNLP, pp. 1402–1412 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.109

  31. Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859 (2017)

  32. Toon, O.B., et al.: Rapidly expanding nuclear arsenals in Pakistan and India portend regional and global catastrophe. Sci. Adv. 5(10), eaay5478 (2019)

  33. Tyagi, A., Field, A., Lathwal, P., Tsvetkov, Y., Carley, K.M.: A computational analysis of polarization on Indian and Pakistani social media. In: SocInfo 2020, LNCS, vol. 12467, pp. 364–379 (2020). https://doi.org/10.1007/978-3-030-60975-7_27

  34. Yoo, C.H., Palakodety, S., Sarkar, R., KhudaBukhsh, A.: Empathy and hope: resource transfer to model inter-country social media dynamics. In: Proceedings of the 1st Workshop on NLP for Positive Impact, pp. 125–134. Association for Computational Linguistics (2021)

  35. Zhang, M., Liu, Y., Luan, H., Sun, M.: Adversarial training for unsupervised bilingual lexicon induction. In: ACL-2017, pp. 1959–1970 (2017)


Author information

Correspondence to Ashiqur R. KhudaBukhsh.


A Appendix

1.1 A.1 Ethics Statement

While the setting discussed in this paper involves humanitarian tasks, the techniques can be trivially adapted to conduct cross-lingual sampling and surfacing of content such as hate speech, or to detect hope speech with the explicit objective of censoring it. In many recent conflicts in the Indian subcontinent, such systems could have adverse social effects, and thus particular care is needed before they are deployed. Next, language-specific features can sometimes cause syntactically similar but semantically opposite content to be surfaced, underscoring the need for a human-in-the-loop setting before such systems are deployed for social media content moderation tasks. Finally, in this work, care is taken to ensure that no particular community is the target of the sampled content. NLP methods can be utilized to selectively conduct cross-lingual sampling to discover content against disenfranchised communities; it is imperative for system designers to ensure that, unwittingly or otherwise, communities at large are not targeted by system deployments.

1.2 A.2 News Networks

See Table 10.

1.3 A.3 Analyzing Other Language Pairs

We were curious to learn whether our approach works with other language pairs. On two European language pairs, \(\langle en, es \rangle\) and \(\langle en, de \rangle\), we observed that our simple approach of constrained nearest-neighbor sampling retrieves reasonable bilingual lexicons even when trained on a single, synthetically induced multilingual corpus without any explicit alignment step.

Data Sets: We conduct experiments using the Europarl [12] and Wikipedia data sets. We synthetically induce a multilingual corpus by combining two monolingual corpora and then randomly shuffling at the sentence level. Table 6 summarizes our results. We find that our overall performance improved with Wikipedia data, especially for de \(\rightarrow\) en and es \(\rightarrow\) en; [13] also reported a performance boost with Wikipedia data.
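This corpus-induction step can be sketched as follows (a minimal Python sketch; the file paths are placeholders, and the exact preprocessing is not reproduced here):

    import random

    def induce_multilingual_corpus(path_a, path_b, out_path, seed=0):
        # Concatenate two monolingual corpora and shuffle at the sentence
        # level, destroying any document- or sentence-level alignment.
        with open(path_a, encoding="utf-8") as f:
            sentences = f.readlines()
        with open(path_b, encoding="utf-8") as f:
            sentences += f.readlines()
        random.seed(seed)
        random.shuffle(sentences)
        with open(out_path, "w", encoding="utf-8") as f:
            f.writelines(sentences)

    # e.g., induce_multilingual_corpus("europarl.en", "europarl.es", "mixed.txt")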

Our primary takeaways are:

Source word frequency: Our experiments with Indian social media data sets indicate that our method performs better when we restrict ourselves to high-frequency source words. A fine-grained look at the performance based on the frequency of the source word reveals that we perform substantially better on high-frequency words belonging to \(\mathcal {V}_{source }^{0-5}\) (e.g., \(en \rightarrow es\) performance jumps from 0.25 to 0.61 when we consider words in \(\mathcal {V}_{source }^{0-5}\)).

Topical cohesion: When we sample the en part of the corpus from Europarl and the es (or de) part from Wikipedia, we remove the topical cohesion between the en and es (de) components. We observe that performance dips slightly.

1.4 A.4 \(\mathcal {D}^{ covid }\) Data Set Visualization

Fig. 3. A 2D visualization of \(\mathcal{D}^{covid}\). Apart from English and Romanized Hindi, Hindi in the Devanagari script also has a substantial presence in the corpus.

Table 6. Performance comparison on Europarl [12] and Wikipedia. \(\mathcal{V}_{target}\) is restricted to words that appear more than 100 times in the corpus.
Table 7. Word translation performance on social media data. Each cell summarizes the p@K performance for a given translation direction on a data set as a/b/c, where a (top) is the performance observed when the source vocabulary is restricted to \(\mathcal{V}_{source}^{0-5}\) (color-coded blue); b (middle) is the performance when the source vocabulary is restricted to \(\mathcal{V}_{source}^{5-10}\) (color-coded red); and c (bottom) is the performance when the source vocabulary is restricted to \(\mathcal{V}_{source}^{10-100}\) (color-coded gray). 500 source words are randomly selected from \(\mathcal{V}_{source}^{0-5}\); 100 source words each are randomly selected from \(\mathcal{V}_{source}^{5-10}\) and \(\mathcal{V}_{source}^{10-100}\). The selected words are mapped to target words in \(\mathcal{V}_{target}\) that appear in the corpus at least 100 times. p@K indicates top-K accuracy.
Table 8. Performance summary of our approach with Europarl [12] as the training data set and the test data set (denoted \(\mathcal{D}_{test}\)) introduced in [13]. \(\mathcal{V}_{target}\) is restricted to words that appear more than 100 times in the training data set. Each cell summarizes the p@K performance for a given translation direction as a/b/c/d, where a (top) is the overall performance observed on \(\mathcal{D}_{test}\); b is the performance on \(\mathcal{V}_{source}^{0-5} \cap \mathcal{D}_{test}\) (color-coded blue); c is the performance on \(\mathcal{V}_{source}^{5-10} \cap \mathcal{D}_{test}\) (color-coded red); and d (bottom) is the performance on \(\mathcal{V}_{source}^{10-100} \cap \mathcal{D}_{test}\) (color-coded gray).
Table 9. Sizes of the vocabularies estimated using the language identifier presented in [11] on our data sets. Spelling variations in Romanized Hindi possibly contributed to the large size of the Romanized Hindi vocabulary.

Figure 3 presents a 2D visualization of the word embeddings obtained using the language identifier we considered [11]. The visualization indicates that, apart from Romanized Hindi and English, our data set also contains a substantial amount of Hindi written in the Devanagari script, further establishing the challenges associated with our task. The sizes of the estimated vocabularies are presented in Table 9.
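A visualization of this kind can be produced with a sketch like the following, assuming t-SNE [15] as the 2D projection and per-word language labels from the identifier in [11]; variable names are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_embeddings_2d(words, model, lang_of):
        # Project word vectors to 2D with t-SNE and color each point by the
        # language label assigned to the word.
        vecs = np.stack([model.get_word_vector(w) for w in words])
        xy = TSNE(n_components=2, random_state=0).fit_transform(vecs)
        labels = [lang_of(w) for w in words]
        for lang in sorted(set(labels)):
            idx = [i for i, l in enumerate(labels) if l == lang]
            plt.scatter(xy[idx, 0], xy[idx, 1], s=2, label=lang)
        plt.legend()
        plt.show()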

Table 10. National channels.
Table 11. Evaluating the importance of topical cohesion. Blue, red, and gray denote Europarl, Wikipedia, and a mixed corpus where English is sampled from Europarl and the other language (Spanish or German) is sampled from Wikipedia, respectively. Results indicate that a lack of topical cohesion affects performance. However, in spite of the reduced topical cohesion, our method still retrieves bilingual lexicons of reasonable size.

1.5 A.5 Annotation

We used two annotators fluent in Hindi, Urdu, and English. For word translations, consensus labels were used. For hope speech annotation, the minimum Fleiss' \(\kappa\) was high (0.88), indicating strong inter-rater agreement. After independent labeling, differences were resolved through discussion.
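A minimal sketch of the agreement computation using statsmodels (the ratings matrix shown is purely illustrative):

    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Rows are comments, columns are the two annotators;
    # 1 = hope speech, 0 = not hope speech.
    ratings = [
        [1, 1],
        [0, 0],
        [1, 0],
        [1, 1],
    ]
    table, _ = aggregate_raters(ratings)  # per-item counts of each category
    print(round(fleiss_kappa(table), 2))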

1.6 A.6 System Pipeline

See Fig. 4.

Fig. 4. The full pipeline. We start with a set of source documents, run translateEmbedding on each, and then \(NN\text{-}Sample\) to return target-language documents similar to the original.
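A minimal sketch of this pipeline under stated assumptions: document embeddings are averaged fastText word vectors, translateEmbedding substitutes source words via the induced bilingual lexicon before embedding, and \(NN\text{-}Sample\) is plain cosine nearest-neighbor search over documents. The actual implementations of these steps are not reproduced here.

    import numpy as np

    def doc_embedding(tokens, model):
        # Average the fastText word vectors of a document's tokens.
        return np.mean([model.get_word_vector(t) for t in tokens], axis=0)

    def translate_embedding(src_tokens, lexicon, model):
        # Map each source word through the induced bilingual lexicon; words
        # without an entry are kept as-is (code switching makes many usable).
        return doc_embedding([lexicon.get(t, t) for t in src_tokens], model)

    def nn_sample(query_vec, target_docs, model, k=10):
        # Rank target-language documents by cosine similarity to the query.
        def cosine(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        scored = [(cosine(query_vec, doc_embedding(d, model)), d)
                  for d in target_docs]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [d for _, d in scored[:k]]

    # Usage sketch: for each English document, retrieve similar Romanized
    # Hindi documents from the corpus.
    # similar = nn_sample(translate_embedding(en_doc, lexicon, model),
    #                     hi_docs, model)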

1.7 A.7 Hyperparameters

Our preprocessing steps and the hyperparameters used to train embeddings are identical to those in prior work [11]. Unless stated otherwise, all models discussed in this paper are fastText [4] Skip-gram models trained with the following parameters (a minimal training sketch follows the list):

  • Dimension: 100

  • Minimum subword unit length: 2

  • Maximum subword unit length: 4

  • Epochs: 5

  • Context window: 5

  • Number of negatives sampled: 5
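The following sketch maps these parameters onto the fasttext Python library's train_unsupervised call; the corpus path is a placeholder:

    import fasttext

    model = fasttext.train_unsupervised(
        "polyglot_corpus.txt",  # combined English + Romanized Hindi text
        model="skipgram",
        dim=100,   # embedding dimension
        minn=2,    # minimum subword unit length
        maxn=4,    # maximum subword unit length
        epoch=5,
        ws=5,      # context window
        neg=5,     # number of negatives sampled
    )
    model.save_model("polyglot.bin")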

1.8 A.8 Hyperparameter Sensitivity Analysis

Recall that we restricted \(\mathcal{V}_{source}\) and \(\mathcal{V}_{target}\) with prevalence criteria: (1) \(\mathcal{V}_{source}\) is restricted to \(\mathcal{V}_{source}^{0-5}\); (2) \(\mathcal{V}_{target}\) contains words that appear at least 100 times in the corpus. In Table 7, we relax the prevalence criterion on \(\mathcal{V}_{source}\) and observe that, as we move toward more infrequent words, translation performance degrades. The drop is most visible for \(\mathcal{V}_{source}^{10-100}\). Our annotators reported that poor spelling and the increased prevalence of contractions made the annotation task particularly challenging for rare words.
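These prevalence constraints and the constrained nearest-neighbor lookup can be sketched as follows, under the assumption (of this sketch, not stated above) that \(\mathcal{V}^{a-b}\) denotes words ranked between the a-th and b-th percentiles of the frequency-sorted vocabulary; the counts dictionaries and trained model are hypothetical inputs:

    import numpy as np

    def percentile_band(counts, lo_pct, hi_pct):
        # Words ranked within the [lo_pct, hi_pct) percentile band of the
        # frequency-sorted vocabulary (most frequent first).
        ranked = sorted(counts, key=counts.get, reverse=True)
        lo, hi = (int(len(ranked) * p / 100) for p in (lo_pct, hi_pct))
        return ranked[lo:hi]

    def constrained_nn(word, model, target_vocab, k=10):
        # Nearest neighbors in the shared embedding space, restricted to
        # target-language words only.
        q = model.get_word_vector(word)
        mat = np.stack([model.get_word_vector(w) for w in target_vocab])
        sims = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
        return [target_vocab[i] for i in np.argsort(-sims)[:k]]

    # V_source^{0-5}: the top 5% most frequent source-language words.
    # src_words = percentile_band(src_counts, 0, 5)
    # V_target: target-language words appearing at least 100 times.
    # tgt_vocab = [w for w, c in tgt_counts.items() if c >= 100]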

We next analyze the effect of the frequency threshold of 100 on \(\mathcal{V}_{target}\). To reduce the annotation burden, we focused only on the subset of words with perfect translation (i.e., 100% p@1 performance). When we relax the frequency threshold to 10, our p@1, p@5, and p@10 numbers are 0.38, 0.84, and 0.91, respectively. Hence, although for 91% of the source words we found a translation within the top 10, our p@1 performance took a considerable hit. Our annotators reported that, with a lowered frequency threshold, the retrieved translations contained a higher degree of misspellings. We conclude that 100 is a reasonable threshold given the noisy nature of our corpora.

We conducted a similar analysis on our word translation tasks using European language pairs. As shown in Table 8, when English is the source language, our translation performance on frequent words is substantially better than on rare words. However, when English is the target language, we did not observe a similar trend; performance was roughly uniform across the entire frequency-ranked spectrum of words. With the Wikipedia corpus (not shown in the table), we observed qualitatively similar trends.

1.9 A.9 Extended Examples of Lexicons

Table 12 lists an extended bilingual lexicon containing 90 word pairs (30 from each corpus) obtained using our method. We will release the complete lexicon of 1,100 word pairs upon acceptance.

1.10 A.10 Disabling \(\langle number, string \rangle\) Pairs

We also disabled select bigrams in the corpus to investigate the contribution of \(\langle number, string \rangle\) phrases. Every such pair was replaced with a random number and random string pair throughout the corpus. Our results showed a 48% dip in p@1 performance, indicating that these phrases contribute substantially to the word translation phenomenon we observe.

1.11 A.11 Loanword

We now slightly abuse the definition of a loanword and consider a word to be a loanword if it appears in a context of words written in a different language, and we define a simple measure to quantify how often this occurs in a two-language setting. Let c denote the context (the single word to the left and to the right) of a word w. We first count the instances where the language labels of c and w agree, i.e., \(\mathcal{L}(w) = \mathcal{L}(c)\); let this number be \(\mathcal{N}_{not\text{-}borrowed}\). Similarly, we count the instances where c and w have different language labels, i.e., \(\mathcal{L}(w) \ne \mathcal{L}(c)\); this scenario arises when a word is borrowed from a different language. In our scheme, the Loan Word Index (LWI) of a word w is defined as \(LWI(w) = \frac{\mathcal{N}_{borrowed}}{\mathcal{N}_{borrowed} + \mathcal{N}_{not\text{-}borrowed}}\). A high LWI indicates substantial lexical borrowing of the word outside its language. For a word pair \(\langle w_{source}, w_{target} \rangle\), we define LWI(.) as the maximum of their individual LWIs; a high pair LWI thus indicates that at least one of the two words is substantially borrowed.

Our hypothesis is that successfully translated word pairs have a high LWI, indicating that at least one word of the pair is used as a loanword, which facilitates translation. The average Loan Word Index of all successfully translated word pairs in our test data sets across all three corpora is 0.15; in comparison, randomly generated word pairings have an average Loan Word Index of 0.09.

We next performed a frequency-preserving loanword exchange to modify the corpus: successfully translated word pairs are interchanged to diminish the extent to which words are borrowed. Frequency is preserved by interchanging both words in a pair as many times as the less frequently borrowed word is borrowed; for instance, if one word of a pair is borrowed 10 times and the other 15 times, we swap each word into the other's borrowed contexts 10 times. This preserves word frequencies while diminishing the loanword phenomenon. We observed that the retrieval performance of our p@1 set dipped by 33% after this corpus modification, indicating that frequent borrowing of words likely contributed positively to our method's translation performance.
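A minimal sketch of the LWI computation follows; lang_of stands in for a word-level language identifier (e.g., the one presented in [11]), and tokens is the corpus as a flat token sequence, ignoring sentence boundaries for brevity:

    from collections import Counter

    def loan_word_index(tokens, lang_of):
        # LWI(w) = N_borrowed / (N_borrowed + N_not_borrowed), where the
        # context of an occurrence of w is its single left and right neighbor.
        borrowed, not_borrowed = Counter(), Counter()
        for i, w in enumerate(tokens):
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens):
                    if lang_of(tokens[j]) == lang_of(w):
                        not_borrowed[w] += 1
                    else:
                        borrowed[w] += 1
        return {w: borrowed[w] / (borrowed[w] + not_borrowed[w])
                for w in set(borrowed) | set(not_borrowed)}

    def pair_lwi(lwi, w_source, w_target):
        # The LWI of a word pair is the maximum of the individual LWIs.
        return max(lwi.get(w_source, 0.0), lwi.get(w_target, 0.0))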

1.12 A.12 Analysis of Discovered Words

In our translation scheme, we found that translations of nouns, adjectives, and adverbs were successfully discovered (see Table 2). Plurality was preserved on most occasions and numerals were translated correctly, which was surprising considering the noisy social media setting. For a given source word, multiple valid synonymous target words were often among the top translations produced by our method. Stylistic choices such as contraction were also reflected in the translations (e.g., kyuki mapped to because, and sahi mapped to right). Verbs are conjugated differently in Hindi and English, and word-for-word translations typically do not exist; hence verbs were rarely translated successfully.

Polysemy: During single-word translation, without context, resolving polysemous words to their true meanings is not possible. However, we noticed that in a few instances the top translation choices of a polysemous source word included valid translations of its different meanings. For example, a single word can denote both low temperature and a common viral infection; in \(\mathcal{D}^{covid}\), both of these meanings were captured among the top translations.

Nativization of Loanwords: Lexical borrowing across language pairs has been studied in both linguistics and computational linguistics. Loanwords (or borrowed words) are lexical items borrowed from a donor language; English has borrowed words from Hindi, and Hindi has borrowed words from English. We noticed that nativized loanwords, i.e., borrowed words that underwent phonological repairs to adapt to the borrowing language, translate back to their donor counterparts in English.

1.13 A.13 Topical Cohesion

We break topical cohesion by sampling en from Europarl and es (de) from Wikipedia. Our results (Table 11) show that bilingual lexicons are still retrieved, albeit with marginally lower performance. We conclude that topical cohesion possibly helps but is not a prerequisite for retrieving a reasonably sized bilingual lexicon.

Table 12. A random sample of translated word pairs from our corpora.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

KhudaBukhsh, A.R., Palakodety, S., Mitchell, T.M. (2022). Harnessing Unsupervised Word Translation to Address Resource Inequality for Peace and Health. In: Hopfgartner, F., Jaidka, K., Mayr, P., Jose, J., Breitsohl, J. (eds) Social Informatics. SocInfo 2022. Lecture Notes in Computer Science, vol 13618. Springer, Cham. https://doi.org/10.1007/978-3-031-19097-1_10

  • DOI: https://doi.org/10.1007/978-3-031-19097-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19096-4

  • Online ISBN: 978-3-031-19097-1