An Adaptive Approach of Approximate Substring Matching

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Abstract

Approximate substring matching is a common problem in many applications. In this paper, we study approximate substring matching with edit distance constraints. Existing methods are very sensitive to query strings or to query parameters such as query length and edit distance. To address this problem, we propose a new approach based on a partition scheme. It first partitions a query into several segments and finds substrings that match these segments as candidates, then performs a bidirectional verification on the candidates to obtain the final results. We devise an even partition scheme to find candidates efficiently, and a best partition scheme to find high-quality candidates. Furthermore, through theoretical analysis, we find that the best partition scheme does not always outperform the even partition scheme. We therefore propose an adaptive approach that selects between the two schemes using statistical knowledge. We conduct comprehensive experiments to demonstrate the efficiency and quality of the proposed method.
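
The partition-then-verify pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions not stated in the abstract: the query is split into tau + 1 near-equal segments (the standard pigeonhole argument, under which any match within edit distance tau contains at least one segment unchanged), the query is longer than tau, candidates are located with a plain `str.find` scan rather than an index, and only one best extension is reported per candidate. The names `even_partition`, `best_prefix_distance`, and `approximate_substring_matches` are hypothetical.

```python
def even_partition(query, tau):
    """Split query into tau + 1 near-equal segments as (offset, segment) pairs."""
    k = tau + 1
    base, extra = divmod(len(query), k)
    segments, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        segments.append((pos, query[pos:pos + length]))
        pos += length
    return segments


def best_prefix_distance(pattern, text):
    """Return (distance, prefix_length) for the prefix of text closest to pattern."""
    # row[j] holds the edit distance between pattern[:i] and text[:j].
    row = list(range(len(text) + 1))
    for i, pc in enumerate(pattern, 1):
        new = [i]
        for j, tc in enumerate(text, 1):
            new.append(min(row[j] + 1,            # delete pc
                           new[j - 1] + 1,        # insert tc
                           row[j - 1] + (pc != tc)))  # match or substitute
        row = new
    best_len = min(range(len(row)), key=row.__getitem__)
    return row[best_len], best_len


def approximate_substring_matches(text, query, tau):
    """Return sorted (start, end, distance) triples with distance <= tau."""
    results = set()
    for offset, seg in even_partition(query, tau):
        if not seg:
            continue  # only happens when len(query) <= tau (outside the assumption)
        left_q = query[:offset]
        right_q = query[offset + len(seg):]
        start = text.find(seg)
        while start != -1:
            end = start + len(seg)
            # Verify rightwards from the exact segment match.
            right_window = text[end:end + len(right_q) + tau]
            d_r, len_r = best_prefix_distance(right_q, right_window)
            # Verify leftwards by reversing both strings.
            left_window = text[max(0, start - len(left_q) - tau):start]
            d_l, len_l = best_prefix_distance(left_q[::-1], left_window[::-1])
            if d_l + d_r <= tau:
                results.add((start - len_l, end + len_r, d_l + d_r))
            start = text.find(seg, start + 1)
    return sorted(results)


if __name__ == "__main__":
    for s, e, d in approximate_substring_matches("the quick brown fox", "quick brwn", 1):
        print(s, e, d)
```

Here the two verification directions are minimised independently, which is valid because the left and right parts of the query align with disjoint regions of the text on either side of the exactly matched segment. A best partition scheme or an adaptive choice between schemes would replace `even_partition`; the abstract does not give enough detail to sketch those.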

Notes

  1. https://genome.ucsc.edu/.

  2. http://www.uniprot.org/.

  3. https://www.gutenberg.org/.

Acknowledgements

This work was partially supported by the NSF of China for Outstanding Young Scholars under grant 61322208, the NSF of China under grants 61272178 and 61572122, the NSF of China for Key Program under grant 61532021, and ARC DP140103499.

Author information

Corresponding author

Correspondence to Xiaochun Yang.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, J., Yang, X., Wang, B., Liu, C. (2016). An Adaptive Approach of Approximate Substring Matching. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_31

  • DOI: https://doi.org/10.1007/978-3-319-32025-0_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32024-3

  • Online ISBN: 978-3-319-32025-0

  • eBook Packages: Computer Science, Computer Science (R0)
