Abstract
Edit-distance-based string range queries are used extensively in data integration, keyword search, biological function prediction, and many other applications. In the presence of uncertainty, however, answering range queries is more challenging than in deterministic scenarios, since exponentially many possible worlds must be considered. This work extends existing filtering techniques tailored for deterministic strings to uncertain settings. We first design a probabilistic q-gram filtering method that is both efficient and effective. Another filtering technique, frequency-distance-based filtering, is also adapted to work with uncertain strings. To achieve further speed-up, we combine two state-of-the-art approaches, based on cumulative distribution functions and local perturbation, to improve lower and upper bounds. Comprehensive experimental results show that, in uncertain settings, our filter-based scheme is more efficient than existing methods that leverage only cumulative distribution functions or local perturbation.
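The abstract does not spell out the probabilistic filters, so the following is only a minimal sketch of the deterministic filters that it says are being extended: the q-gram count filter and the frequency distance filter, applied ahead of an exact edit distance check in a range query. It assumes ordinary (certain) strings, unpadded q-grams with bag semantics, and q = 2. All function names are hypothetical and chosen for illustration; nothing here is taken from the paper's implementation.

```python
from collections import Counter


def edit_distance(s, t):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # match or substitute
        prev = cur
    return prev[-1]


def qgrams(s, q):
    """Bag (multiset) of the overlapping q-grams of s, without padding."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_qgram_filter(s, t, k, q=2):
    """Count filter: one edit destroys at most q q-grams, so strings within
    distance k share at least max(|s|, |t|) - q + 1 - k*q q-grams."""
    need = max(len(s), len(t)) - q + 1 - k * q
    if need <= 0:
        return True  # the bound is vacuous, so the filter cannot prune
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= need


def frequency_distance(s, t):
    """Lower bound on edit distance computed from character counts alone."""
    fs, ft = Counter(s), Counter(t)
    surplus_s = sum((fs - ft).values())  # characters s has in excess of t
    surplus_t = sum((ft - fs).values())  # characters t has in excess of s
    return max(surplus_s, surplus_t)


def range_query(query, strings, k, q=2):
    """Return the strings within edit distance k of query.
    The cheap filters run first; only survivors are verified exactly."""
    result = []
    for s in strings:
        if abs(len(s) - len(query)) > k:          # length filter
            continue
        if frequency_distance(query, s) > k:      # frequency distance filter
            continue
        if not passes_qgram_filter(query, s, k, q):
            continue
        if edit_distance(query, s) <= k:          # exact verification
            result.append(s)
    return result


if __name__ == "__main__":
    words = ["kitten", "sitting", "mitten", "banana"]
    print(range_query("kitten", words, k=2))
```

Both filters only discard candidates whose distance provably exceeds k, so the query returns the same answers as brute-force verification while skipping most of the quadratic-time distance computations. In the uncertain setting described in the abstract, each string is a distribution over possible worlds, and the corresponding filters would have to bound probabilities rather than counts; that extension is not shown here.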
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dai, D., Xie, J., Zhang, H., Dong, J. (2012). Efficient Range Queries over Uncertain Strings. In: Ailamaki, A., Bowers, S. (eds) Scientific and Statistical Database Management. SSDBM 2012. Lecture Notes in Computer Science, vol 7338. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31235-9_5
DOI: https://doi.org/10.1007/978-3-642-31235-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31234-2
Online ISBN: 978-3-642-31235-9
eBook Packages: Computer Science, Computer Science (R0)