DOI: 10.1145/3555776.3577788
poster

A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using image and text analytics

Published: 07 June 2023

Abstract

Linguistic diversity across the world creates a disparity in the availability of good-quality digital language resources, which restricts the benefits of language technology for a majority of the human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel, scalable, and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building a parallel data corpus for two different language combinations, and we demonstrate the value of this dataset through a downstream machine translation task, improving over the current baseline by close to 3 BLEU points.
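For readers unfamiliar with the metric behind the reported gain, the following is a minimal sketch of corpus-level BLEU in plain Python. It is an illustration of the standard definition (clipped n-gram precision up to 4-grams, uniform weights, brevity penalty), not the evaluation code used by the authors. The function names `ngrams` and `corpus_bleu`, the whitespace tokenisation, the single reference per hypothesis, and the lack of smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0..1 scale, one reference per hypothesis.

    Hypothetical helper for illustration only. Inputs are lists of
    tokenised sentences (lists of str). No smoothing is applied, so any
    n-gram order with zero matches yields a score of 0.
    """
    matches = [0] * max_n  # clipped n-gram matches, one slot per order
    totals = [0] * max_n   # hypothesis n-gram counts, one slot per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    if min(matches) == 0:
        return 0.0

    # Geometric mean of the modified precisions (uniform weights).
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output that is shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat sat on a mat".split()
    print(corpus_bleu([hypothesis], [reference]))
```

This sketch returns scores between 0 and 1. Papers usually report BLEU multiplied by 100, so a gain of 3 BLEU points corresponds to a difference of 0.03 on this scale.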


Published In

SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing
March 2023
1932 pages
ISBN: 9781450395175
DOI: 10.1145/3555776
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. low resource language
  2. parallel data augmentation
  3. information mining

Qualifiers

  • Poster

Conference

SAC '23

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
