DOI: 10.1145/1031171.1031212
Article

Indexing text data under space constraints

Published: 13 November 2004

Abstract

An important class of queries is the LIKE predicate in SQL. In the absence of an index, LIKE queries are subject to performance degradation. The notion of indexing on substrings (or q-grams) has been explored earlier without sufficient consideration of efficiency. q-grams are used to prune away rows that do not qualify for the query. The problem is to identify a finite number of grams, subject to a storage constraint, that gives maximal pruning for a given query workload. Our contributions include: i) a formal problem definition, that produces results within a provable error bound, ii) performance evaluation of the application of the novel method to real data, and iii) parallelization of the algorithm, scaling considerations and a proposal to handle scaling issues.
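
The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the cost model (one storage unit per row id kept in a gram's posting list), and the plain greedy benefit-per-cost selection are all assumptions made for illustration, and this simple greedy rule carries no error bound by itself.

```python
def qgrams(s, q):
    """Return the set of all length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def select_grams(rows, workload, q, budget):
    """Greedily pick grams with the best pruning-per-storage ratio.

    Assumed cost model: a gram costs one unit per row that contains it.
    Benefit: number of (query, row) pairs the gram lets us skip.
    """
    cands = sorted(set().union(*(qgrams(p, q) for p in workload)))
    all_rows = set(range(len(rows)))
    posting = {g: {rid for rid, t in enumerate(rows) if g in t} for g in cands}
    # pruned[g]: (query index, row id) pairs eliminated by gram g
    pruned = {
        g: {(qi, rid)
            for qi, p in enumerate(workload) if g in p
            for rid in all_rows - posting[g]}
        for g in cands
    }
    chosen, covered, used = [], set(), 0
    while True:
        best, best_ratio = None, 0.0
        for g in cands:
            cost = max(len(posting[g]), 1)
            if g in chosen or used + cost > budget:
                continue
            ratio = len(pruned[g] - covered) / cost
            if ratio > best_ratio:
                best, best_ratio = g, ratio
        if best is None:  # nothing affordable adds any new pruning
            break
        chosen.append(best)
        covered |= pruned[best]
        used += max(len(posting[best]), 1)
    return chosen


def build_index(rows, grams):
    """Map each selected gram to the set of row ids whose text contains it."""
    return {g: {rid for rid, t in enumerate(rows) if g in t} for g in grams}


def candidates(index, n_rows, pattern):
    """Rows that may satisfy LIKE '%pattern%': intersect the postings of
    every indexed gram that occurs in the pattern."""
    cand = set(range(n_rows))
    for g, rids in index.items():
        if g in pattern:
            cand &= rids
    return cand


def like(rows, index, pattern):
    """Evaluate LIKE '%pattern%': prune with the index, then verify."""
    survivors = sorted(candidates(index, len(rows), pattern))
    return [rid for rid in survivors if pattern in rows[rid]]


# Hypothetical toy data, not taken from the paper.
rows = ["database systems", "data mining", "text indexing", "suffix trees"]
workload = ["index", "data"]
chosen = select_grams(rows, workload, q=2, budget=4)
index = build_index(rows, chosen)
```

The index only narrows the candidate set; every surviving row is still checked against the pattern, so the answer stays exact regardless of which grams fit in the budget.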



Published In

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management
November 2004
678 pages
ISBN:1581138741
DOI:10.1145/1031171

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. B-tree
  2. SQL
  3. index
  4. like queries
  5. q-grams

Qualifiers

  • Article

Conference

CIKM04: Conference on Information and Knowledge Management
November 8 - 13, 2004
Washington, D.C., USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2013) Efficient processing of substring match queries with inverted variable-length gram indexes. Information Sciences 244, 119-141. DOI: 10.1016/j.ins.2013.04.037. Online publication date: Sep-2013
  • (2011) A robust index for regular expression queries. Proceedings of the 20th ACM international conference on Information and knowledge management, 2365-2368. DOI: 10.1145/2063576.2063968. Online publication date: 24-Oct-2011
  • (2010) Efficient processing of substring match queries with inverted q-gram indexes. 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 721-732. DOI: 10.1109/ICDE.2010.5447866. Online publication date: Mar-2010
  • (2009) Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Proceedings of the 2009 IEEE International Conference on Data Engineering, 604-615. DOI: 10.1109/ICDE.2009.32. Online publication date: 29-Mar-2009
  • (2008) Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 353-364. DOI: 10.1145/1376616.1376655. Online publication date: 9-Jun-2008
  • (2007) Extending q-grams to estimate selectivity of string matching with low edit distance. Proceedings of the 33rd international conference on Very large data bases, 195-206. DOI: 10.5555/1325851.1325877. Online publication date: 23-Sep-2007
