Hierarchical filtering: improving similar substring matching under edit distance

World Wide Web

Abstract

Similar substring matching, an essential operation in applications including read mapping and text retrieval, has attracted significant attention in the research community. In this paper, we study the problem of similar substring matching with edit distance constraints. Existing methods generally use a filtering-and-verification framework to solve this problem: a filtering procedure reduces the search space before a computationally expensive verification step, and efficiency depends critically on balancing the cost of filtering against the cost of verification. The common filtering paradigm is based on the pigeonhole principle, which states that a matching result must exactly match at least a certain number of substrings of the query, where these substrings act as a filter. However, enlarging the number of substrings in a filter causes the number of filters to grow polynomially, so existing methods do not balance the cost of filtering and verification well. To this end, we propose a novel filtering paradigm, hierarchical filtering, which aims to achieve a fine-grained balance between the cost of filtering and verification. Rather than fixing the number of substrings in a filter, our method allows filters to contain different numbers of substrings, which avoids the polynomial growth of filters. The filters are selected according to a scoring metric. We devise a tree-based filtering framework for hierarchical filtering. In addition, the cost of filtering and verification is further reduced by eliminating duplicate filters. Extensive experiments have been conducted on four real-world datasets, comparing against state-of-the-art methods including Hobbes3, BWA, and BLAST. The results show that our method outperforms the competing methods under a wide range of parameter settings.
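To make the filtering-and-verification framework concrete, the following is a minimal sketch of the classic pigeonhole baseline that the abstract describes, not the hierarchical filtering method proposed in the paper. It assumes the simplest form of the principle: if the query is split into k + 1 disjoint pieces, any substring of the text within edit distance k of the query must contain at least one piece exactly. The function names (`partition`, `min_substring_distance`, `pigeonhole_search`), the even partitioning, and the window-based output are all illustrative assumptions.

```python
def partition(query, k):
    """Split query into k + 1 contiguous, disjoint pieces as (offset, piece) pairs."""
    base, extra = divmod(len(query), k + 1)
    pieces, start = [], 0
    for i in range(k + 1):
        length = base + (1 if i < extra else 0)
        pieces.append((start, query[start:start + length]))
        start += length
    return pieces


def min_substring_distance(query, window):
    """Smallest edit distance between query and any substring of window."""
    prev = [0] * (len(window) + 1)  # empty query prefix: a match may start anywhere
    for i, qc in enumerate(query, 1):
        cur = [i]  # i query characters aligned against nothing cost i edits
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1,                 # skip a query character
                           cur[j - 1] + 1,              # skip a window character
                           prev[j - 1] + (qc != wc)))   # match or substitute
        prev = cur
    return min(prev)  # a match may end anywhere


def pigeonhole_search(text, query, k):
    """Return sorted (start, end) windows of text that contain a substring
    within edit distance k of query."""
    if len(query) <= k:
        raise ValueError("query must be longer than k so every piece is non-empty")
    checked = {}  # window -> verification result, so duplicates are verified once
    for offset, piece in partition(query, k):
        pos = text.find(piece)  # filtering: exact occurrences of one piece
        while pos != -1:
            # A match containing this occurrence must lie inside this window.
            start = max(0, pos - offset - k)
            end = min(len(text), pos - offset + len(query) + k)
            if (start, end) not in checked:
                # Verification: the expensive dynamic-programming step.
                dist = min_substring_distance(query, text[start:end])
                checked[(start, end)] = dist <= k
            pos = text.find(piece, pos + 1)
    return sorted(w for w, ok in checked.items() if ok)
```

For example, `pigeonhole_search("ACGTACGTTAGC", "ACGTTA", 1)` returns `[(0, 7), (3, 11)]`: the first window contains `ACGTA` (one deletion away from the query) and the second contains an exact occurrence. The sketch also shows the trade-off the abstract refers to: short pieces occur often and trigger many verifications, while requiring more pieces to match makes each filter more selective but increases the number of filters to enumerate.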

Data Availability

The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.

Notes

  1. A subsequence may consist of non-consecutive characters.

  2. http://hgdownload.cse.ucsc.edu/goldenpath/hg18/chromosomes/

  3. http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/

  4. http://fruitfly.org/sequence/

  5. ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

References

  1. Ahmadi, A., Behm, A., Honnalli, N., Li, C., Weng, L., Xie, X.: Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Res. 40(6), e41–e41 (2011). https://doi.org/10.1093/nar/gkr1246

  2. Kim, J., Li, C., Xie, X.: Hobbes3: dynamic generation of variable-length signatures for efficient approximate subsequence mappings. In: ICDE, IEEE, pp. 169–180. https://doi.org/10.1109/ICDE.2016.7498238 (2016)

  3. Kim, Y., Park, H., Shim, K., Woo, K.G.: Efficient processing of substring match queries with inverted variable-length gram indexes. Inform. Sci. 244, 119–141 (2013). https://doi.org/10.1016/j.ins.2013.04.037

  4. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009). https://doi.org/10.1093/bioinformatics/btp324

  5. Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, IEEE, pp. 257–266. https://doi.org/10.1109/ICDE.2008.4497434 (2008)

  6. Wang, J., Li, G., Deng, D., Zhang, Y., Feng, J.: Two birds with one stone: an efficient hierarchical framework for top-k and threshold-based string similarity search. In: ICDE, IEEE, pp. 519–530. https://doi.org/10.1109/ICDE.2015.7113311 (2015)

  7. Wang, J., Yang, X., Wang, B., Liu, C.: An adaptive approach of approximate substring matching. In: DASFAA, Springer, pp. 501–516. https://doi.org/10.1007/978-3-319-32025-0_31 (2016)

  8. Qin, J., Wang, W., Xiao, C., Lu, Y., Lin, X., Wang, H.: Asymmetric signature schemes for efficient exact edit similarity query processing. ACM Trans. Database Syst. 38(3), 1–44 (2013). https://doi.org/10.1145/2508020.2508023

  9. Wang, J., Yang, X., Wang, B., Liu, C.: Ls-join: local similarity join on string collections. IEEE Trans. Knowl. Data Eng. 29(9), 1928–1942 (2017). https://doi.org/10.1109/TKDE.2017.2687460

  10. Kim, J., Li, C., Xie, X.: Improving read mapping using additional prefix grams. BMC Bioinform. 15(1), 42 (2014). https://doi.org/10.1186/1471-2105-15-42

  11. Kim, Y., Shim, K.: Efficient top-k algorithms for approximate substring matching. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, pp. 385–396 (2013)

  12. Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theo. Comput. Sci. 92(1), 191–211 (1992). https://doi.org/10.1016/0304-3975(92)90143-4

  13. Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395–415 (1999). https://doi.org/10.1145/316542.316550

  14. Cheng, H., Jiang, H., Yang, J., Xu, Y., Shang, Y.: Bitmapper: an efficient all-mapper based on bit-vector computing. BMC Bioinform. 16(1), 192 (2015). https://doi.org/10.1186/s12859-015-0626-9

  15. Li, G., Deng, D., Wang, J., Feng, J.: Pass-join: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011). https://doi.org/10.14778/2078331.2078340

  16. Yang, X., Wang, B., Li, C., Wang, J., Xie, X.: Efficient direct search on compressed genomic data. In: ICDE, IEEE, pp. 961–972. https://doi.org/10.1109/ICDE.2013.6544889 (2013)

  17. Chen, C., Qin, J., Wang, W.: On gapped set intersection size estimation. In: CIKM, ACM, pp. 1351–1360. https://doi.org/10.1145/2806416.2806438 (2015)

  18. Consortium, T.G.P.: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061–1073 (2010). https://doi.org/10.1038/nature09534

  19. Weese, D., Holtgrewe, M., Reinert, K.: Razers 3: faster, fully sensitive read mapping. Bioinformatics 28(20), 2592–2599 (2012). https://doi.org/10.1093/bioinformatics/bts505

  20. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Molecular Bio. 215(3), 403–410 (1990). https://doi.org/10.1016/S0022-2836(05)80360-2

  21. Qiu, T., Yang, X., Wang, B., Han, Y., Wang, S.: Efficient approximate subsequence matching using hybrid signatures. In: DASFAA, Springer, pp. 600–609. https://doi.org/10.1007/978-3-319-91452-7_39 (2018)

  22. Yang, X., Wang, B., Li, C.: Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD, ACM, pp. 353–364. https://doi.org/10.1145/1376616.1376655 (2008)

  23. Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41(7), e78–e78 (2013). https://doi.org/10.1093/nar/gkt005

  24. Hanhan, R., Garzón, E., Jahshan, Z., Teman, A., Lanuzza, M., Yavits, L.: Edam: edit distance tolerant approximate matching content addressable memory. In: ISCA, ACM, pp. 495–507. https://doi.org/10.1145/3470496.3527424 (2022)

  25. Lam, T.W., Sung, W.-K., Tam, S.-L., Wong, C.-K., Yiu, S.-M.: Compressed indexing and local alignment of DNA. Bioinformatics 24(6), 791–797 (2008). https://doi.org/10.1093/bioinformatics/btn032

  26. Li, H., Durbin, R.: Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26(5), 589–595 (2010). https://doi.org/10.1093/bioinformatics/btp698

  27. Yang, X., Liu, H., Wang, B.: Alae: accelerating local alignment with affine gap exactly in biosequence databases. PVLDB 5(11), 1507–1518 (2012). https://doi.org/10.14778/2350229.2350265

  28. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: FOCS, IEEE, pp. 390–398. https://doi.org/10.1109/SFCS.2000.892127 (2000)

  29. Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Tech. Rep. (1994)

  30. Newkirk, D., Biesinger, J., Chon, A., Yokomori, K., Xie, X.: Arem: aligning short reads from chip-sequencing by expectation maximization. J. Comput. Biol. 18(11), 1495–1505 (2011). https://doi.org/10.1089/cmb.2011.0185

  31. Roberts, A., Pachter, L.: Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10(1), 71–73 (2013). https://doi.org/10.1038/nmeth.2251

  32. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Bio. 10(3), R25 (2009). https://doi.org/10.1186/gb-2009-10-3-r25

  33. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with bowtie 2. Nat. Methods 9(4), 357–359 (2012). https://doi.org/10.1038/nmeth.1923

  34. Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008). https://doi.org/10.14778/1453856.1453957

  35. Echihabi, K., Zoumpatianos, K., Palpanas, T.: High-dimensional similarity search for scalable data science. In: ICDE, IEEE, pp. 2369–2372. https://doi.org/10.1109/ICDE51399.2021.00268 (2021)

  36. Cormode, G., Muthukrishnan, S.: The string edit distance matching problem with moves. ACM Trans. Alg. 3(1), 1–19 (2007). https://doi.org/10.1145/1186810.1186812

  37. Fiori, F.J., Pakalén, W., Tarhio, J.: Approximate string matching with SIMD. Comput. J. 65(6), 1472–1488 (2021). https://doi.org/10.1093/comjnl/bxaa193

  38. Song, G., Shim, K., Lee, H.: Substring similarity search with synonyms. In: ICDE, IEEE, pp. 2003–2008. https://doi.org/10.1109/ICDE51399.2021.00191 (2021)

  39. Zhang, Z., Pun, C.-M.: Learning ordinal constraint binary codes for fast similarity search. Inf. Process. Manag. 59(3), 102919 (2022). https://doi.org/10.1016/j.ipm.2022.102919

  40. Meng, Z., Shen, H.: Fast top-k similarity search in large dynamic attributed networks. Inf. Process. Manag. 56(6), 102074 (2019). https://doi.org/10.1016/j.ipm.2019.102074

  41. Lu, M., Huang, Y., Xie, M., Liu, J.: Rank hash similarity for fast similarity search. Inf. Process. Manag. 49(1), 158–168 (2013). https://doi.org/10.1016/j.ipm.2012.07.003

  42. Yuan, H., Li, G.: Distributed in-memory trajectory similarity search and join on road network. In: ICDE, IEEE, pp. 1262–1273. https://doi.org/10.1109/ICDE.2019.00115 (2019)

  43. Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: the ulisse approach. PVLDB 11(13), 2236–2248 (2018). https://doi.org/10.14778/3275366.3284968

Funding

This work is partly supported by the National Natural Science Foundation of China (Nos. 62002245 and 61802268) and the Natural Science Foundation of Liaoning Province (Nos. 2022-BS-218 and 2022-MS-303).

Author information

Contributions

Tao Qiu and Chuanyu Zong wrote the main manuscript text. Tao Qiu, Xiaochun Yang, and Bing Li proposed the algorithms. Chuanyu Zong prepared all figures. Tao Qiu and Bin Wang conducted the experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Tao Qiu.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qiu, T., Zong, C., Yang, X. et al. Hierarchical filtering: improving similar substring matching under edit distance. World Wide Web 26, 1967–2001 (2023). https://doi.org/10.1007/s11280-022-01128-w

  • DOI: https://doi.org/10.1007/s11280-022-01128-w
