Abstract
Research geared toward human well-being in developing nations often concentrates on web content written in a world language (e.g., English) and ignores a significant chunk of content written in a poorly resourced yet highly prevalent first language of the region of concern (e.g., Hindi). Such omissions are common due to the sheer mismatch between the linguistic resources offered for a world language and for its low-resource counterpart. However, during a global pandemic or an imminent war, demand for linguistic resources might get recalibrated. In this work, we focus on the high-resource and low-resource language pair \(\langle en , hi _e \rangle \) (English, and Romanized Hindi) and present a cross-lingual sampling method that takes example documents in English and retrieves similar content written in Romanized Hindi, the most popular form of Hindi observed in social media. At the core of our technique is a novel finding that a surprisingly simple constrained nearest-neighbor sampling in a polyglot Skip-gram word embedding space can retrieve substantial bilingual lexicons, even from harsh social media data sets. Our cross-lingual sampling method obtains substantial performance improvement in the important domains of detecting peace-seeking, hostility-diffusing hope speech in the context of the 2019 India-Pakistan conflict, and of detecting comments encouraging compliance with COVID-19 guidelines.
A.R. KhudaBukhsh and S. Palakodety—Equal contribution first authors.
Notes
1. Resources and additional details are available at: https://www.cs.cmu.edu/~akhudabu/SocInfo2022.html.
References
Artetxe, M., Labaka, G., Agirre, E.: Learning bilingual word embeddings with (almost) no bilingual data. In: ACL 2017, pp. 451–462 (2017). https://doi.org/10.18653/v1/P17-1042
Benesch, S.: Defining and diminishing hate speech. State World’s Minorities Indigenous Peoples 2014, 18–25 (2014)
Benesch, S., Ruths, D., Dillon, K.P., Saleem, H.M., Wright, L.: Counterspeech on Twitter: a field study. A report for Public Safety Canada under the Kanishka Project (2016)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. TACL 5, 135–146 (2017)
Cieri, C., Maxwell, M., Strassel, S., Tracey, J.: Selection criteria for low resource language programs. In: LREC, pp. 4543–4549 (2016)
Dinu, G., Lazaridou, A., Baroni, M.: Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568 (2014)
Dou, Z.Y., Zhou, Z.H., Huang, S.: Unsupervised bilingual lexicon induction via latent variable models. In: EMNLP 2018, pp. 621–626 (2018)
Gella, S., Bali, K., Choudhury, M.: “ye word kis lang ka hai bhai?” testing the limits of word level language identification. In: ICNLP-2014, pp. 368–377 (2014)
Gumperz, J.J.: Discourse Strategies, vol. 1. Cambridge University Press, Cambridge (1982)
Jegou, H., Schmid, C., Harzallah, H., Verbeek, J.: Accurate image search using the contextual dissimilarity measure. PAMI 32(1), 2–11 (2010)
KhudaBukhsh, A.R., Palakodety, S., Carbonell, J.G.: Harnessing code switching to transcend the linguistic barrier. In: IJCAI-PRICAI, pp. 4366–4374 (2020)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT summit, vol. 5, pp. 79–86 (2005)
Lample, G., Conneau, A., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. In: 6th International Conference on Learning Representations, ICLR 2018. OpenReview.net (2018). https://openreview.net/forum?id=H196sainb
Lample, G., Ott, M., Conneau, A., Denoyer, L., Ranzato, M.: Phrase-based & neural unsupervised machine translation. In: EMNLP-2018, pp. 5039–5049 (2018). https://doi.org/10.18653/v1/D18-1549, https://www.aclweb.org/anthology/D18-1549
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)
Mathew, B., et al.: Thou shalt not hate: countering online hate speech. In: Proceedings of the Thirteenth International Conference on Web and Social Media, ICWSM 2019, pp. 369–380. AAAI Press (2019)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Mikolov, T., Le, Q.V., Sutskever, I.: Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013)
Mulcaire, P., Kasai, J., Smith, N.A.: Low-resource parsing with crosslingual contextualized representations. In: CoNLL, pp. 304–315 (2019)
Mulcaire, P., Kasai, J., Smith, N.A.: Polyglot contextual representations improve crosslingual transfer. In: NAACL-HLT-2019, pp. 3912–3918 (2019). https://doi.org/10.18653/v1/N19-1392
Mulcaire, P., Swayamdipta, S., Smith, N.A.: Polyglot semantic role labeling. In: ACL-2018, pp. 667–672 (2018). https://doi.org/10.18653/v1/P18-2106, https://www.aclweb.org/anthology/P18-2106
Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Hope speech detection: a computational analysis of the voice of peace. In: ECAI-2020, pp. 1881–1889 (2020)
Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Mining insights from large-scale corpora using fine-tuned language models. In: ECAI-20, pp. 1890–1897 (2020)
Palakodety, S., KhudaBukhsh, A.R., Carbonell, J.G.: Voice for the voiceless: active sampling to detect comments supporting the Rohingyas. In: AAAI-20, pp. 454–462 (2020)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the EMNLP, pp. 1532–1543 (2014)
Radovanović, M., Nanopoulos, A., Ivanović, M.: Hubs in space: popular nearest neighbors in high-dimensional data. JMLR 11(Sep), 2487–2531 (2010)
Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 65, 569–631 (2019)
Saha, P., Singh, K., Kumar, A., Mathew, B., Mukherjee, A.: CounterGeDi: a controllable approach to generate polite, detoxified and emotional counter speech. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, pp. 5157–5163. ijcai.org (2022)
Sarkar, R., Mahinder, S., KhudaBukhsh, A.: The non-native speaker aspect: Indian English in social media. In: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 61–70. Association for Computational Linguistics, Online (2020)
Sarkar, R., Mahinder, S., Sarkar, H., KhudaBukhsh, A.: Social media attributions in the context of water crisis. In: EMNLP, pp. 1402–1412. Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.109, https://www.aclweb.org/anthology/2020.emnlp-main.109
Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859 (2017)
Toon, O.B., et al.: Rapidly expanding nuclear arsenals in Pakistan and India portend regional and global catastrophe. Sci. Adv. 5(10), eaay5478 (2019)
Tyagi, A., Field, A., Lathwal, P., Tsvetkov, Y., Carley, K.M.: A computational analysis of polarization on Indian and Pakistani social media. In: SocInfo 2020. Lecture Notes in Computer Science, vol. 12467, pp. 364–379 (2020). https://doi.org/10.1007/978-3-030-60975-7_27
Yoo, C.H., Palakodety, S., Sarkar, R., KhudaBukhsh, A.: Empathy and hope: resource transfer to model inter-country social media dynamics. In: Proceedings of the 1st Workshop on NLP for Positive Impact, pp. 125–134. Association for Computational Linguistics, Online (2021)
Zhang, M., Liu, Y., Luan, H., Sun, M.: Adversarial training for unsupervised bilingual lexicon induction. In: ACL-2017, pp. 1959–1970 (2017)
A Appendix
A.1 Ethics Statement
While the setting discussed in the paper involves humanitarian tasks, the techniques can be trivially adapted to conduct cross-lingual sampling and surfacing of content like hate speech, or to detect hope speech with the explicit objective of censoring it. In many recent conflicts in the Indian subcontinent, such systems can have adverse social effects, and thus particular care is needed before they are deployed. Next, language-specific features can sometimes cause syntactically similar but semantically opposite content to be surfaced, underscoring the need for a human-in-the-loop setting before such systems are deployed for social media content moderation tasks. Finally, in this work, care is taken to ensure that no particular community is the target of the sampled content. NLP methods can be utilized to selectively conduct cross-lingual sampling to discover content against disenfranchised communities; it is imperative for system designers to ensure that, unwittingly or otherwise, communities at large are not targeted by system deployments.
A.2 News Networks
See Table 10.
A.3 Analyzing Other Language Pairs
We were curious to learn whether our approach works with other language pairs. On two European language pairs, \(\langle en , es \rangle \) and \(\langle en , de \rangle \), we observed that our simple approach of constrained nearest-neighbor sampling retrieves reasonable bilingual lexicons even when trained on a single, synthetically induced multilingual corpus without any explicit attempt at alignment.
Data Sets: We conduct experiments using the Europarl [12] and Wikipedia data sets. We synthetically induce a multilingual corpus by combining two monolingual corpora and then randomly shuffling at the sentence level. Table 6 summarizes our results. We find that our overall performance improved with Wikipedia data, especially for de \(\rightarrow \) en and es \(\rightarrow \) en. A performance boost with Wikipedia data was also reported in [13].
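The corpus induction step above is simple enough to sketch directly. The following is a minimal illustration (function and variable names are ours, not from the paper): two monolingual sentence lists are concatenated and shuffled at the sentence level, with no alignment information of any kind.

```python
import random

def induce_multilingual_corpus(corpus_a, corpus_b, seed=0):
    """Combine two monolingual corpora (lists of sentences) into a
    single corpus and shuffle at the sentence level."""
    merged = list(corpus_a) + list(corpus_b)
    random.Random(seed).shuffle(merged)
    return merged

# Toy example: an English and a Spanish monolingual sample.
en = ["the parliament convened today", "the votes were counted"]
es = ["el parlamento se reunio hoy", "se contaron los votos"]
mixed = induce_multilingual_corpus(en, es)
```

Embeddings trained on such a mixed corpus then occupy a single shared vector space for both languages.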
Our primary takeaways are:
Source word frequency: Our experiments with Indian social media data sets indicate that our method performs better when we restrict ourselves to high-frequency source words. A fine-grained look at the performance based on the frequency of the source word reveals that we perform substantially better on high-frequency words belonging to \(\mathcal {V}_{source }^{0-5}\) (e.g., \(en \rightarrow es\) performance jumps from 0.25 to 0.61 when we consider words in \(\mathcal {V}_{source }^{0-5}\)).
Topical cohesion: When we sample the en part of the corpus from Europarl and the es (or de) part from Wikipedia, we remove the topical cohesion between the en and es (de) components. We observe that performance dips slightly.
A.4 \(\mathcal {D}^{ covid }\) Data Set Visualization
Figure 3 presents a 2D visualization of the word embeddings obtained using the language identifier we considered [11]. The visualization indicates that, apart from Romanized Hindi and English, our data set also exhibits a substantial presence of Hindi written in the Devanagari script, further underscoring the challenges associated with our task. The sizes of the estimated vocabularies are presented in Table 9 (Table 11).
A.5 Annotation
We used two annotators fluent in Hindi, Urdu, and English. For word translations, consensus labels were used. For hope speech annotation, the annotators first labeled independently, and differences were then resolved through discussion; the minimum Fleiss' \(\kappa \) measure was high (0.88), indicating strong inter-rater agreement.
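For completeness, Fleiss' \(\kappa \) over a set of items can be computed as follows. This is the standard formulation; the rating matrix in the example is illustrative and is not our annotation data.

```python
def fleiss_kappa(ratings):
    """ratings[i][j]: number of raters assigning item i to category j.
    Every item must be rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Mean per-item agreement P_bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement P_e from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    grand = n_items * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Two raters, two categories, four items; raters disagree on the last item.
kappa = fleiss_kappa([[2, 0], [0, 2], [2, 0], [1, 1]])
```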
A.6 System Pipeline
See Fig. 4.
A.7 Hyperparameters
Our preprocessing steps and the hyperparameters used to train embeddings are identical to those in prior literature [11]. All the models discussed in this paper are obtained by training fastText [4] Skip-gram models with the following parameters unless stated otherwise:
- Dimension: 100
- Minimum subword unit length: 2
- Maximum subword unit length: 4
- Epochs: 5
- Context window: 5
- Number of negatives sampled: 5
A.8 Hyperparameter Sensitivity Analysis
Recall that we restricted \({\mathcal {V}}_{source }\) and \({\mathcal {V}}_{target }\) to the prevalence criteria that (1) \({\mathcal {V}}_{source }\) is restricted to \({\mathcal {V}}_{source }^{0-5}\), and (2) \({\mathcal {V}}_{target }\) contains words that have appeared at least 100 times in the corpus. In Table 7, we relax the prevalence criterion on \({\mathcal {V}}_{source }\) and observe that as we move toward more infrequent words, the translation performance degrades. The performance drop is more visible with \({\mathcal {V}}_{source }^{10-100}\). Our annotators reported that poor spelling quality and the increased prevalence of contractions made the annotation task particularly challenging for rare words.
We next analyze the effect of the frequency threshold of 100 on \({\mathcal {V}}_{target }\). To reduce annotation burden, we focused only on the subset of words with perfect translation (i.e., 100% p@1 performance). When we relax the frequency threshold to 10, our p@1, p@5, and p@10 numbers are 0.38, 0.84, and 0.91, respectively. Hence, although for 91% of the source words we found a translation within the top 10 candidates, our p@1 performance took a considerable hit. Our annotators reported that with a lowered frequency threshold, the retrieved translations contained a higher degree of misspellings. We conclude from this experiment that 100 is a reasonable threshold given the noisy nature of our corpora.
We conducted a similar analysis on our word translation tasks using European language pairs. As shown in Table 8, when English is the source language, our translation performance on frequent words is substantially better than on rare words. However, when English is the target language, we did not observe any similar trend; performance was roughly equal across the entire spectrum of words ranked by frequency. With the Wikipedia corpus (not shown in the table), we observed qualitatively similar trends.
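The constrained nearest-neighbor lookup at the heart of the method can be sketched with plain NumPy: rank every word by cosine similarity to the source word, but admit only candidates carrying the target-language label. The embeddings and labels below are toy values; in practice they would come from the polyglot model and the language identifier.

```python
import numpy as np

def constrained_nn(emb, words, langs, source, target_lang, k=10):
    """k nearest neighbors of `source`, restricted to words labeled
    `target_lang`, by cosine similarity in the shared embedding space."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[words.index(source)]
    candidates = [i for i, l in enumerate(langs) if l == target_lang]
    candidates.sort(key=lambda i: -sims[i])
    return [words[i] for i in candidates[:k]]

# Toy shared space: "accha" sits near "good", "bura" near "bad".
words = ["good", "bad", "accha", "bura"]
langs = ["en", "en", "hi", "hi"]
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
top = constrained_nn(emb, words, langs, "good", "hi", k=1)
```

Restricting the candidate set to the target language is what turns a plain nearest-neighbor query over the mixed vocabulary into a translation lookup.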
A.9 Extended Examples of Lexicons
Table 12 lists an extended bilingual lexicon containing 90 word pairs (30 from each corpus) obtained using our method. We will release the complete lexicon of 1,100 word pairs upon acceptance.
A.10 Disabling Pairs
We also disabled select bigrams in the corpus to investigate the contribution of \(\langle number , string \rangle \) phrases. Any \(\langle number , string \rangle \) pair was replaced with a random number and random string pair throughout the corpus. Our results showed a 48% dip in p@1 performance, indicating that these phrases contribute substantially to the word translation phenomenon observed.
A.11 Loanwords
We now slightly abuse the definition of a loanword and consider a word a loanword if it appears in a context of words written in a different language, and we define a simple measure to quantify the extent to which this occurs in a two-language setting. Let c denote the context (single word to the left and right) of a word w. We first count the instances where the language labels of c and w agree, i.e., \(\mathcal {L}({w}) = \mathcal {L}({c})\) (e.g., is not a loanword in the following phrase: ). Let this number be denoted \(\mathcal {N}_{not\text {-}borrowed }\). Similarly, we count the instances when c and w have different language labels, i.e., \(\mathcal {L}({w}) \ne \mathcal {L}({c})\), and denote this number \(\mathcal {N}_{borrowed }\); this scenario arises when a word is borrowed from a different language (e.g., is a loanword in ). In our scheme, the Loan Word Index (LWI) of a word w is defined as LWI(w) = \(\frac{\mathcal {N}_{borrowed }}{\mathcal {N}_{borrowed } + \mathcal {N}_{not\text {-}borrowed }}\). A high LWI indicates substantial lexical borrowing of the word outside its language. For a word pair \(\langle w_{source }, w_{target } \rangle \), we define LWI(.) as the maximum of the two individual LWIs. For example, if the LWI is high for the pair , it indicates that at least one of these words was substantially borrowed.

Our hypothesis is that successfully translated word pairs have a high LWI, indicating that at least one member of the pair was used as a loanword, which facilitates translation. The average Loan Word Index of all successfully translated word pairs in our test data sets across all three corpora is 0.15; in comparison, randomly generated word pairings have an average Loan Word Index of 0.09. We next performed a frequency-preserving loanword exchange to modify the corpus, in which translated word pairs are interchanged to diminish the extent to which words are borrowed (e.g., a phrase like is rewritten as ).

Frequency is preserved by interchanging both words in a successfully translated word pair as many times as the less borrowed word is borrowed. In our example, if was borrowed 10 times, and 15 times, we alter 10 instances where is borrowed with , and 10 instances where is borrowed with . We thus preserve word frequencies while diminishing the loanword phenomenon. We observed that the retrieval performance on our p@1 set dipped by 33% after this corpus modification, indicating that frequent borrowing of words likely contributed positively to our method's translation performance.
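The LWI computation admits a direct implementation. The sketch below treats the left and right context words independently (one possible reading of the definition above) and uses hypothetical language-tagged tokens.

```python
def loan_word_index(tagged_sentences, word):
    """LWI(w) = N_borrowed / (N_borrowed + N_not_borrowed), where an
    occurrence counts toward N_borrowed when an adjacent context word
    carries a different language label than w."""
    borrowed = not_borrowed = 0
    for sent in tagged_sentences:        # sent: list of (token, lang)
        for i, (tok, lang) in enumerate(sent):
            if tok != word:
                continue
            for j in (i - 1, i + 1):     # single-word left/right context
                if 0 <= j < len(sent):
                    if sent[j][1] == lang:
                        not_borrowed += 1
                    else:
                        borrowed += 1
    total = borrowed + not_borrowed
    return borrowed / total if total else 0.0

# "mask" occurs once in an English context and once in a Hindi context.
corpus = [
    [("wear", "en"), ("a", "en"), ("mask", "en")],
    [("mask", "en"), ("pehno", "hi")],
]
lwi = loan_word_index(corpus, "mask")
```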
A.12 Analysis of Discovered Words
In our translation scheme, we found that translations for nouns, adjectives, and adverbs were successfully discovered (see Table 2). Preserving plurality ( , , ) on most occasions and translating numerals ( , ) were among the surprising observations considering the noisy social media setting. For a given source word, multiple valid synonymous target words were often among the top translations produced by our method (e.g., and for ; , and for ). Stylistic choices like contraction were reflected in the translations (e.g., (kyuki) mapped to (because), and (sahi) mapped to (right)). Verbs are conjugated differently in Hindi and English, and word-for-word translations do not typically exist; for instance, translates to , and thus words like were rarely successfully translated.
Polysemy: During single-word translation, without context, polysemous words cannot be resolved to their intended meanings. However, we noticed that in a few instances the top translation choices for a polysemous source word included valid translations of its different meanings. For example, the word can mean either low temperature or a common viral infection. In \(\mathcal {D}^{ covid }\), both of these meanings were captured in the top translations.
Nativization of Loanwords: Lexical borrowing across language pairs has been studied in both linguistics and computational linguistics. Borrowed words, also known as loanwords, are lexical items adopted from a donor language. For example, the English word or is borrowed from Hindi, while () and () are Hindi words borrowed from English. We noticed that nativized loanwords, i.e., borrowed words that underwent phonological repairs to adapt to the borrowing language, translate back to their English donor counterparts (e.g., and translate to the donor words and , respectively).
A.13 Topical Cohesion
We break topical cohesion by sampling the en part of the corpus from Europarl and the es (de) part from Wikipedia. Our results show that bilingual lexicons are still retrieved, albeit with marginally lower performance. We conclude that topical cohesion possibly helps but may not be a prerequisite for retrieving a reasonably sized bilingual lexicon.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
KhudaBukhsh, A.R., Palakodety, S., Mitchell, T.M. (2022). Harnessing Unsupervised Word Translation to Address Resource Inequality for Peace and Health. In: Hopfgartner, F., Jaidka, K., Mayr, P., Jose, J., Breitsohl, J. (eds) Social Informatics. SocInfo 2022. Lecture Notes in Computer Science, vol 13618. Springer, Cham. https://doi.org/10.1007/978-3-031-19097-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19096-4
Online ISBN: 978-3-031-19097-1