Efficient Range Queries over Uncertain Strings

Dai, Dongbo; Xie, Jiang; Zhang, Huiran; Dong, Jiaqi

doi:10.1007/978-3-642-31235-9_5


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7338)

Included in the following conference series:

International Conference on Scientific and Statistical Database Management


Abstract

Edit-distance-based string range queries are used extensively in data integration, keyword search, biological function prediction, and many other applications. In the presence of uncertainty, however, answering range queries is more challenging than in deterministic scenarios, since exponentially many possible worlds must be considered. This work extends existing filtering techniques tailored for deterministic strings to uncertain settings. We first design a probabilistic q-gram filtering method that works both efficiently and effectively. Another filtering technique, frequency-distance-based filtering, is also adapted to work with uncertain strings. To achieve further speed-up, we combine two state-of-the-art approaches, based on cumulative distribution functions and local perturbation, to improve lower and upper bounds. Comprehensive experimental results show that our filter-based scheme is more efficient in uncertain settings than existing methods that leverage only cumulative distribution functions or local perturbation.
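To make the building blocks named in the abstract concrete, the following is a minimal Python sketch of the classical deterministic filters and of a brute-force possible-worlds evaluation. It is not the chapter's probabilistic method. The function names are hypothetical, and the character-level uncertainty model (each position holds an independent distribution over characters) is an assumption made here for illustration; the abstract does not state which model the chapter uses.

```python
from collections import Counter
from itertools import product


def edit_distance(s, t):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def qgrams(s, q):
    """Multiset of the overlapping q-grams of s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_qgram_filter(s, t, k, q):
    """Count filter: strings within edit distance k share at least
    max(|s|, |t|) - q + 1 - k*q q-grams. False means (s, t) can be pruned."""
    needed = max(len(s), len(t)) - q + 1 - k * q
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= needed


def frequency_distance(s, t):
    """Lower bound on edit distance from character frequency vectors."""
    fs, ft = Counter(s), Counter(t)
    alphabet = fs.keys() | ft.keys()
    surplus = sum(max(fs[c] - ft[c], 0) for c in alphabet)
    deficit = sum(max(ft[c] - fs[c], 0) for c in alphabet)
    return max(surplus, deficit)


def range_probability(uncertain, query, k):
    """Probability that a possible world lies within edit distance k of query.
    `uncertain` is a list of {char: prob} dicts, one per position (assumed
    independent). Enumerates every world, so the cost is exponential."""
    total = 0.0
    for world in product(*(pos.items() for pos in uncertain)):
        prob = 1.0
        for _, p in world:
            prob *= p
        if edit_distance("".join(c for c, _ in world), query) <= k:
            total += prob
    return total


if __name__ == "__main__":
    u = [{"a": 1.0}, {"b": 0.6, "c": 0.4}, {"d": 1.0}]
    print(range_probability(u, "abd", 0))
    print(passes_qgram_filter("abcdefgh", "hgfedcba", 1, 2))
    print(frequency_distance("kitten", "sitting"))
```

Both filters are cheap necessary conditions: a pair that fails the count filter, or whose frequency distance exceeds k, cannot be within edit distance k, so the quadratic-time edit distance computation is skipped. The exhaustive loop in `range_probability` illustrates why filtering matters in the uncertain case, since the number of worlds grows exponentially with the number of uncertain positions.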

Author information

Authors and Affiliations

School of Computer Engineering and Science, Shanghai University, Shanghai, China
Dongbo Dai, Jiang Xie & Huiran Zhang
School of Computer Science, Fudan University, Shanghai, China
Jiaqi Dong
Department of Mathematics, University of California, Irvine, CA, USA
Jiang Xie


Editor information

Editors and Affiliations

Computer Science, EPFL IC SIN-GE, Ecole Polytechnique Federale de Lausanne, Batiment BC, Station 14, 1015, Lausanne, Switzerland
Anastasia Ailamaki
Department of Computer Science, Gonzaga University, 502 E. Boone Avenue, 99258-0026, Spokane, WA, USA
Shawn Bowers

About this paper

Cite this paper

Dai, D., Xie, J., Zhang, H., Dong, J. (2012). Efficient Range Queries over Uncertain Strings. In: Ailamaki, A., Bowers, S. (eds) Scientific and Statistical Database Management. SSDBM 2012. Lecture Notes in Computer Science, vol 7338. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31235-9_5

DOI: https://doi.org/10.1007/978-3-642-31235-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31234-2
Online ISBN: 978-3-642-31235-9
eBook Packages: Computer Science, Computer Science (R0)
