LexEQUAL: Supporting Multiscript Matching in Database Systems

Kumaran, A.; Haritsa, Jayant R.

doi:10.1007/978-3-540-24741-8_18

A. Kumaran &
Jayant R. Haritsa

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2992)

Abstract

To effectively support today’s global economy, database systems need to store and manipulate text data in multiple languages simultaneously. Current database systems do support the storage and management of multilingual data, but are not capable of querying or matching text data across different scripts. As a first step towards addressing this lacuna, we propose here a new query operator called LexEQUAL, which supports multiscript matching of proper names. The operator is implemented by first transforming matches in multiscript text space into matches in the equivalent phoneme space, and then using standard approximate matching techniques to compare these phoneme strings. The algorithm incorporates tunable parameters that impact the phonetic match quality and thereby determine the match performance in the multiscript space. We evaluate the performance of the LexEQUAL operator on a real multiscript names dataset and demonstrate that it is possible to simultaneously achieve good recall and precision by appropriate parameter settings. We also show that the operator run-time can be made extremely efficient by utilizing a combination of q-gram and database indexing techniques. Thus, we show that the LexEQUAL operator can complement the standard lexicographic operators, representing a first step towards achieving complete multilingual functionality in database systems.
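The matching step described in the abstract can be illustrated with a minimal sketch. It assumes the names have already been converted to phoneme sequences, since the text-to-phoneme conversion is language-specific and is not shown here. The function names, the `threshold` parameter (the allowed edit distance as a fraction of the shorter string's length), and the sample phoneme strings are all hypothetical and are not taken from the paper. The q-gram step uses the standard count filter for approximate string matching to discard pairs cheaply before the edit distance is computed.

```python
from collections import Counter

def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]

def qgrams(seq, q=2):
    """Multiset of q-grams of seq, padded with q-1 sentinels on each side."""
    padded = ["#"] * (q - 1) + list(seq) + ["$"] * (q - 1)
    return Counter(tuple(padded[i:i + q]) for i in range(len(padded) - q + 1))

def passes_qgram_filter(a, b, k, q=2):
    """Count filter: sequences within k edits must share at least
    max(len(a), len(b)) - 1 - (k - 1) * q padded q-grams."""
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    return shared >= max(len(a), len(b)) - 1 - (k - 1) * q

def lex_equal(a, b, threshold=0.25, q=2):
    """True if phoneme sequences a and b are within the allowed edit distance.
    Assumption: the tolerance k is threshold * length of the shorter sequence."""
    k = int(threshold * min(len(a), len(b)))
    if not passes_qgram_filter(a, b, k, q):
        return False  # cheap rejection, edit distance is never computed
    return edit_distance(a, b) <= k

# Hypothetical phoneme strings for one name as rendered from two scripts.
latin = ["n", "e", "h", "r", "u"]
other = ["n", "e:", "h", "r", "u"]
nero = ["n", "e", "r", "o"]

print(lex_equal(latin, other))                # one substitution, within tolerance
print(lex_equal(latin, nero))                 # rejected by the q-gram filter
print(lex_equal(latin, nero, threshold=0.5))  # a looser threshold admits it
```

Raising `threshold` admits more phonetic variation and so trades precision for recall, which is the kind of tuning the abstract refers to. The filter never rejects a true match, so it only affects running time, not the result.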

A poster version of this paper appears in the Proc. of the 20th IEEE Intl. Conf. on Data Engineering, March 2004.

Author information

Authors and Affiliations

Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560012, INDIA
A. Kumaran & Jayant R. Haritsa

Editor information

Editors and Affiliations

Purdue University
Elisa Bertino
Laboratory of Distributed Multimedia Information Systems and Applications, Technical University of Crete (MUSIC/TUC) Chania, 73100, Crete, Greece
Stavros Christodoulakis
Institute of Computer Science, FO.R.T.H., Vassilika Vouton, P.O. Box 1385, GR 71110, Heraklion, Greece
Dimitris Plexousakis
Department of Computer Science, University of Crete, P.O.Box 2208, GR 71409, Heraklion, Greece
Vassilis Christophides
National and Kapodistrian University of Athens, Greece
Manolis Koubarakis
IPD, Universität Karlsruhe, Am Fasanengarten 5, 76131, Karlsruhe
Klemens Böhm
Department of Computer Science and Communication, University of Insubria, 22100, Varese, Italy
Elena Ferrari

About this paper

Cite this paper

Kumaran, A., Haritsa, J.R. (2004). LexEQUAL: Supporting Multiscript Matching in Database Systems. In: Bertino, E., et al. Advances in Database Technology - EDBT 2004. EDBT 2004. Lecture Notes in Computer Science, vol 2992. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24741-8_18

DOI: https://doi.org/10.1007/978-3-540-24741-8_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21200-3
Online ISBN: 978-3-540-24741-8
eBook Packages: Springer Book Archive