nameGist: a novel phonetic algorithm with bilingual support

Khan, Shahidul Islam; Hasan, Md. Mahmudul; Hossain, Mohammad Imran; Hoque, Abu Sayed Md. Latiful

doi:10.1007/s10772-019-09653-2

Published: 29 October 2019

Volume 22, pages 1135–1148, (2019)

International Journal of Speech Technology

Shahidul Islam Khan¹,² (ORCID: orcid.org/0000-0002-8740-2744),
Md. Mahmudul Hasan³,
Mohammad Imran Hossain⁴ &
Abu Sayed Md. Latiful Hoque¹

Abstract

Phonetic algorithms play an essential role in many applications, including name matching, database record linkage, spelling correction, and search recommendations. Since 1918, researchers have proposed many phonetic algorithms; Soundex, Match Rating Codex, NYSIIS, Metaphone, and Double Metaphone are among the most frequently used. These algorithms were developed primarily for English phonetics, and they perform well for their intended purposes. However, they do not support the Bengali language, and they perform poorly on Bengali names written phonetically in English. Some recently proposed phonetic algorithms, e.g., NameSignificance and Modified NameSignificance, handle Bengali phonetic names, but their performance on English names is not up to the mark. Moreover, these algorithms do not support names written in the Bengali language itself, i.e., in Bengali Unicode. Bengali, known as Bangla among native speakers, is counted as the seventh most spoken language in the world, with more than 250 million speakers worldwide. The use of Bengali Unicode is growing in Bangladesh and around the globe as computer use spreads. In healthcare systems, for example, a patient's name may be stored either as an English representation of the Bengali name or in Bengali Unicode. A system that cannot process Bengali Unicode fails to link the records of the same patient across multiple databases, which is a record linkage (entity matching) problem. In this paper, we propose a novel phonetic algorithm, nameGist, which can efficiently encode Bengali phonetic names in English representation, Bengali Unicode names, and English phonetic names. We tested nameGist on various datasets containing Bengali phonetic names, Bengali Unicode names, English phonetic (American or British) names, and mixtures of these types. In each case, nameGist performed better than the other algorithms in terms of accuracy and F-measure. nameGist can be used to solve record linkage and entity resolution problems effectively for Bengali, English, and mixed names.
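
To make the idea of a phonetic key concrete, the following is a minimal sketch of classic American Soundex, one of the baseline algorithms named in the abstract. It is not nameGist, whose encoding rules are not given in this abstract. The function name `soundex` and the sample names are illustrative choices and are not taken from the paper or its datasets.

```python
# Minimal sketch of classic American Soundex (a baseline, NOT nameGist).
# Consonants map to digit classes; vowels, Y, H, and W carry no digit.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the 4-character Soundex key, or "" if no A-Z letters exist."""
    # Keep only basic Latin letters; Bengali Unicode characters are dropped.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        # Emit a digit only when it differs from the previous digit class.
        if digit and digit != prev:
            key += digit
        # H and W do not separate equal codes; vowels reset the previous code.
        if ch not in "HW":
            prev = digit
    # Pad with zeros and truncate to one letter plus three digits.
    return (key + "000")[:4]


if __name__ == "__main__":
    # Illustrative names only (hypothetical inputs, not the paper's data).
    for name in ["Rahman", "Rohman", "Chowdhury", "Choudhury",
                 "Zaman", "Jaman", "রহমান"]:
        print(f"{name!r} -> {soundex(name)!r}")
```

In this sketch, the spelling variants "Rahman" and "Rohman" share the key `R550`, so they would be linked. "Zaman" and "Jaman" receive different keys because Soundex keeps the first letter verbatim, and the Bengali Unicode input yields an empty key because the algorithm only handles Latin letters. The last case is the kind of gap the abstract describes for English-oriented algorithms.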

Acknowledgements

This research is supported by the ICT Division, Ministry of Posts, Telecommunications and Information Technology, the Government of the People’s Republic of Bangladesh.

Funding

Funding was provided by Ministry of Posts, Telecommunications and Information Technology (Grant No. 56.00.0000.028.33.025.14-154).

Author information

Authors and Affiliations

Department of CSE, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Shahidul Islam Khan & Abu Sayed Md. Latiful Hoque
Department of CSE, International Islamic University Chittagong, Chittagong, Bangladesh
Shahidul Islam Khan
Dream71 Bangladesh Limited, Dhaka, Bangladesh
Md. Mahmudul Hasan
Brainstation-23 Limited, Dhaka, Bangladesh
Mohammad Imran Hossain

Corresponding author

Correspondence to Shahidul Islam Khan.

About this article

Cite this article

Khan, S.I., Hasan, M.M., Hossain, M.I. et al. nameGist: a novel phonetic algorithm with bilingual support. Int J Speech Technol 22, 1135–1148 (2019). https://doi.org/10.1007/s10772-019-09653-2

Received: 03 September 2019
Accepted: 12 October 2019
Published: 29 October 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s10772-019-09653-2
