research-article

Improved count suffix trees for natural language data

Authors:

Klemens BöhmAuthors Info & Claims

IDEAS '08: Proceedings of the 2008 international symposium on Database engineering & applications

Pages 231 - 236

https://doi.org/10.1145/1451940.1451972

Published: 10 September 2008 Publication History

Abstract

With more and more natural language text stored in databases, handling respective query predicates becomes very important. Optimizing queries with predicates includes (sub)string estimation, i.e., estimating the selectivity of query terms based on small summary statistics before query execution. Count Suffix Trees (CST) are commonly used to this end. While CST yield good estimates, they are expensive to build and require a large amount of memory to be stored. To fit in the data dictionary of database systems, they have to be severely pruned. Existing pruning techniques are based on suffix frequency or tree depth. In this paper, we propose new filtering and pruning techniques that reduce both the size of CST over natural-language texts and the cost of building them. The core idea is to exploit features of the natural language data, i.e., regarding only the suffixes that are useful in a linguistic sense. The most important innovations are (a) a new aggressive approximate syllabification technique to filter out suffixes, (b) a new affix and prefix stripping procedure that conflates more terms than conventional stemming techniques, (c) the deployment of state-of-the-art trigram techniques and a new syllable-based mechanism to filter out non-words (i.e., misspellings and other language anomalies like foreign words), which would cause an over-proportional growth of the CST otherwise. -- Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80% and shorten the build phase significantly. From a different perspective, if storage space remains unchanged, the accuracy of selectivity estimates computed from the CST increases by up to 70%.

References

[1]

J. Bae and S. Lee. Substring count estimation in extremely long strings. IEICE - Trans. Inf. Syst., E89-D(3):1148--1156, 2006.

Digital Library

[2]

S. Bressan and R. Irawan. Morphologic non-word error detection. Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA '04), pages 31--35, 2004.

Digital Library

[3]

S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem, in Proceedings of ICDE 2004, Boston, MA, USA, 2004.

Digital Library

[4]

D. W. Cummings. American English Spelling: An Informal Description. Johns Hopkins University Press, 1988.

[5]

D. Graff. The aquaint corpus of english news text. Linguistic Data Consortium, Philadelphia, 2002.

[6]

H. Jagadish, O. Kapitskaia, and D. Srivastava. One-dimensional and multi-dimensional substring selectivity estimation. The VLDB Journal The International Journal on Very Large Data Bases, 9(3):214--230, 2000.

Digital Library

[7]

P. Krishnan, J. S. Vitter, and B. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In ACM SIGMOD International Conference on Management of Data, pages 12--13. ACM, 1996.

Digital Library

[8]

R. Krovetz. Viewing Morphology as an Inference Process,. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191--203, 1993.

Digital Library

[9]

K. Kukich. Techniques for automatically correcting words in text. ACM Computer Surveys, 24:379--439, 1992.

Digital Library

[10]

M. Lennon, D. Pierce, B. Tarry, and P. Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3(177--183), 1981.

[11]

D. D. Lewis. Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/.

[12]

F. M. Lian. Word hy-phen-a-tion by com-put-er. PhD thesis, Stanford University, Stanford, August 1983.

Digital Library

[13]

J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22--31, 1968.

[14]

M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.

[15]

Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. Practical methods for constructing suffix trees. The VLDB Journal, 14(3):281--299, 2005.

Digital Library

[16]

E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.

Digital Library

[17]

P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1--11, 1973.

Digital Library

[18]

E. M. Zamora, J. Pollock, and A. Zamora. The use of trigram analysis for spelling error detection. Information of Processing and Management, 17:305--316, 1981.

[19]

C. Silverstein, H. Marais, M. Henzinger, M. Moricz. Analysis of a Very Large Web Search Engine Query Log. ACM SIGIR Forum, Vol. 33, pp. 6--12, 1999, ISSN:0163-5840

Digital Library

[20]

R. Giegerich, S. Kurtz, J. Stoye, Efficient Implementation of Lazy Suffix Trees, Software: Practice and Experience, Vol. 33, No 11, 2003

Cited By

Willkomm JSchäler MBöhm K(2021)Accurate Cardinality Estimation of Co-occurring Words Using Suffix TreesDatabase Systems for Advanced Applications10.1007/978-3-030-73197-7_50(721-737)Online publication date: 11-Apr-2021
https://dl.acm.org/doi/10.1007/978-3-030-73197-7_50

Improved count suffix trees for natural language data
1. Computing methodologies
  1. Artificial intelligence
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Recommendations

Processing comparable corpora with Bilingual Suffix Trees
EMNLP '02: Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10

We introduce Bilingual Suffix Trees (BST), a data structure that is suitable for exploiting comparable corpora. We discuss algorithms that use BSTs in order to create parallel corpora and learn translations of unseen words from comparable corpora. ...
Improved selectivity estimator for XML queries based on structural synopsis

With the increasing popularity of XML database applications, the use of efficient XML query optimizers is becoming very essential. The performance of an XML query optimizer depends heavily on the query selectivity estimators it uses to find the best ...
Introduction to Chinese Natural Language Processing

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

IDEAS '08: Proceedings of the 2008 international symposium on Database engineering & applications

September 2008

289 pages

ISBN:9781605581880

DOI:10.1145/1451940

General Chair:
Bipin C. Desai
Concordia University

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 September 2008

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

IDEAS '08

Sponsor:

ACM

IDEAS '08: Proceedings of the 12th international symposium on database engineering & applications

September 10 - 12, 2008

Coimbra, Portugal

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
171
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 02 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Willkomm JSchäler MBöhm K(2021)Accurate Cardinality Estimation of Co-occurring Words Using Suffix TreesDatabase Systems for Advanced Applications10.1007/978-3-030-73197-7_50(721-737)Online publication date: 11-Apr-2021
https://dl.acm.org/doi/10.1007/978-3-030-73197-7_50

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten