LexEQUAL: Supporting Multiscript Matching in Database Systems

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2992)

Abstract

To effectively support today’s global economy, database systems need to store and manipulate text data in multiple languages simultaneously. Current database systems do support the storage and management of multilingual data, but are not capable of querying or matching text data across different scripts. As a first step towards addressing this lacuna, we propose here a new query operator called LexEQUAL, which supports multiscript matching of proper names. The operator is implemented by first transforming matches in multiscript text space into matches in the equivalent phoneme space, and then using standard approximate matching techniques to compare these phoneme strings. The algorithm incorporates tunable parameters that impact the phonetic match quality and thereby determine the match performance in the multiscript space. We evaluate the performance of the LexEQUAL operator on a real multiscript names dataset and demonstrate that it is possible to simultaneously achieve good recall and precision by appropriate parameter settings. We also show that the operator run-time can be made extremely efficient by utilizing a combination of q-gram and database indexing techniques. Thus, we show that the LexEQUAL operator can complement the standard lexicographic operators, representing a first step towards achieving complete multilingual functionality in database systems.
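The pipeline outlined in the abstract (map each name to a phoneme string, then compare the phoneme strings approximately, with a q-gram filter to avoid needless comparisons) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The phoneme table `TOY_PHONEMES` is hand-written and hypothetical, standing in for a real text-to-phoneme converter. The function names, the single `threshold` parameter, and its interpretation as a fraction of the shorter phoneme string are all assumptions. The count filter is the standard padded q-gram bound for edit distance.

```python
from collections import Counter

# Hypothetical, hand-written phoneme strings (one character per phoneme).
# A real system would obtain these from a text-to-phoneme converter.
TOY_PHONEMES = {
    "Nehru": "nehru",
    "नेहरू": "neharu",
    "Nero": "niro",
}

def to_phonemes(name):
    """Look up the toy phoneme string for a name (assumed helper)."""
    return TOY_PHONEMES[name]

def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def qgrams(s, q=2):
    """Multiset of q-grams of s, padded with q-1 sentinels on each side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def passes_count_filter(a, b, k, q=2):
    """Strings within edit distance k must share at least
    max(len(a), len(b)) + q - 1 - k*q padded q-grams."""
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    return shared >= max(len(a), len(b)) + q - 1 - k * q

def lex_equal(name_a, name_b, threshold=0.25, q=2):
    """True if the two names match in phoneme space.
    Assumption: the allowed edit distance is threshold times the
    length of the shorter phoneme string, rounded down."""
    pa, pb = to_phonemes(name_a), to_phonemes(name_b)
    k = int(threshold * min(len(pa), len(pb)))
    if abs(len(pa) - len(pb)) > k:          # cheap length filter
        return False
    if not passes_count_filter(pa, pb, k, q):  # cheap q-gram filter
        return False
    return edit_distance(pa, pb) <= k       # exact check

print(lex_equal("Nehru", "नेहरू"))  # True: phoneme strings differ by one edit
print(lex_equal("Nehru", "Nero"))   # False: rejected by the q-gram filter
```

Raising `threshold` admits more candidate pairs (higher recall, lower precision), and lowering it does the reverse. The two filters never reject a pair that the exact check would accept, so they only reduce the number of edit-distance computations.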

A poster version of this paper appears in the Proc. of the 20th IEEE Intl. Conf. on Data Engineering, March 2004.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kumaran, A., Haritsa, J.R. (2004). LexEQUAL: Supporting Multiscript Matching in Database Systems. In: Bertino, E., et al. Advances in Database Technology - EDBT 2004. EDBT 2004. Lecture Notes in Computer Science, vol 2992. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24741-8_18

  • DOI: https://doi.org/10.1007/978-3-540-24741-8_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21200-3

  • Online ISBN: 978-3-540-24741-8

  • eBook Packages: Springer Book Archive
