Arabic real time entity resolution using inverted indexing

Alian, Marwah; Al-Naymat, Ghazi; Ramadan, Banda

doi:10.1007/s10579-020-09504-6

Original Paper
Published: 07 October 2020

Volume 54, pages 921–941, (2020)

Language Resources and Evaluation

Abstract

Arabic datasets that hold two or more records for the same real-world entity (e.g. a person or an object) degrade data quality and performance for the institutions that rely on them, particularly when no mechanism exists to detect the duplicates. The task of identifying records that refer to the same real-world entity is called Entity Resolution (ER). It is used to link records across databases and to match query records against existing databases in real time. Indexing is a major step in the ER process that aims to reduce the search space. Several indexing techniques are available for ER on English databases, but they have not been validated for other languages such as Arabic. The Dynamic Similarity Aware Inverted Index (DySimII) is an indexing technique used with dynamic databases to match query records in real time, and it has been shown to work well with English. In this paper, we propose a framework, Arabic Real Time Entity Resolution (ARTER), that uses DySimII with Arabic databases to perform real-time ER. We also examine different string similarity functions for comparing records in the matching process, to evaluate which is most suitable for comparing Arabic strings. Our experimental evaluation uses a real-world Arabic database, with two stemmers and three similarity functions, to observe their effect on DySimII with an Arabic dataset. The results show that matching accuracy improves with the Asem stemmer as the number of corrupted attributes increases. Tests of the three similarity functions show that the Winkler similarity function provides better matching accuracy, while N-gram provides better results when used with the Asem stemmer.
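To make the idea of index-based matching concrete, the following is a minimal Python sketch of an inverted index combined with an N-gram (bigram) similarity function. It is not the authors' DySimII or ARTER implementation: the class `SimpleInvertedIndex`, the first-character blocking key, the Dice coefficient over bigrams, and the 0.5 threshold are all illustrative assumptions. The sketch also computes similarities at query time, whereas a similarity-aware index would typically store them in advance.

```python
from collections import defaultdict


def bigrams(value):
    """Return the set of character bigrams of a string."""
    return {value[i:i + 2] for i in range(len(value) - 1)}


def bigram_similarity(a, b):
    """Dice coefficient over character bigrams, in the range [0, 1]."""
    if a == b:
        return 1.0
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


class SimpleInvertedIndex:
    """Toy inverted index (hypothetical): attribute value -> record ids."""

    def __init__(self, blocking_key=lambda v: v[:1]):
        # Assumption: block on the first character to shrink the search space.
        self.blocking_key = blocking_key
        self.record_ids = defaultdict(set)  # value -> ids of records holding it
        self.blocks = defaultdict(set)      # blocking key -> distinct values

    def insert(self, record_id, value):
        """Add one attribute value of a record to the index."""
        self.record_ids[value].add(record_id)
        self.blocks[self.blocking_key(value)].add(value)

    def query(self, value, threshold=0.5):
        """Return (record_id, similarity) pairs, best match first."""
        scores = {}
        # Only values that share the query's blocking key are compared.
        for candidate in self.blocks.get(self.blocking_key(value), ()):
            sim = bigram_similarity(value, candidate)
            if sim >= threshold:
                for rid in self.record_ids[candidate]:
                    scores[rid] = max(sim, scores.get(rid, 0.0))
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = SimpleInvertedIndex()
index.insert("r1", "محمد")
index.insert("r2", "محمود")
index.insert("r3", "أحمد")
matches = index.query("محمد")
```

In this sketch the query is compared only against the two values that share its first character, so the third record is never scored. A stemmer would be applied to the values before `insert` and `query`, and the bigram function could be swapped for another comparator such as Winkler's.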

Author information

Authors and Affiliations

Hashemite University, Zarqa, Jordan
Marwah Alian
Ajman University, Ajman, United Arab Emirates
Ghazi Al-Naymat
Princess Sumaya University for Technology, Amman, Jordan
Marwah Alian & Ghazi Al-Naymat
Prince Sultan University, Riyadh, Saudi Arabia
Banda Ramadan

Corresponding author

Correspondence to Marwah Alian.

About this article

Cite this article

Alian, M., Al-Naymat, G. & Ramadan, B. Arabic real time entity resolution using inverted indexing. Lang Resources & Evaluation 54, 921–941 (2020). https://doi.org/10.1007/s10579-020-09504-6

Accepted: 03 September 2020
Published: 07 October 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s10579-020-09504-6