Abstract
Similar substring matching, an essential operation in applications such as read mapping and text retrieval, has attracted significant attention in the research community. In this paper, we study the problem of similar substring matching with edit distance constraints. Existing methods generally use a filtering-and-verification framework to solve this problem: a filtering procedure reduces the search space before a computationally expensive verification step, and efficiency depends critically on balancing the cost of filtering against that of verification. The common filtering paradigm is based on the pigeonhole principle, which states that a matching result must exactly match at least a certain number of substrings of the query; these substrings act as a filter. However, enlarging the number of substrings in a filter causes polynomial growth in the number of filters, so the costs of filtering and verification are not well balanced in existing methods. To this end, we propose a novel filtering paradigm, hierarchical filtering, which aims to achieve a fine-grained balance between the cost of filtering and the cost of verification. Rather than using a fixed number of substrings per filter, our method allows filters to contain different numbers of substrings, which avoids the polynomial growth of filters. The filters are selected according to a scoring metric. We devise a tree-based filtering framework for hierarchical filtering. In addition, the cost of filtering and verification is further reduced by eliminating duplicate filters. Extensive experiments have been conducted on four real-world datasets, comparing against state-of-the-art methods including Hobbes3, BWA, and BLAST. The results show that our method outperforms the competing methods under a wide range of parameter settings.
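To make the filtering-and-verification idea concrete, the following is a minimal sketch of the classic pigeonhole baseline described in the abstract, not the hierarchical filtering method proposed in the paper. It assumes the simplest form of the principle: if the query is split into tau + 1 contiguous pieces, any substring of the text within edit distance tau of the query must contain at least one piece exactly. All function names are hypothetical, and verification uses a plain dynamic program (Sellers' algorithm) rather than the optimized verifiers used by real systems.

```python
def partition(query, tau):
    """Split query into tau + 1 contiguous pieces, returned with their offsets."""
    if len(query) <= tau:
        raise ValueError("query must be longer than tau")
    k = tau + 1
    base, extra = divmod(len(query), k)
    pieces, start = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        pieces.append((start, query[start:start + length]))
        start += length
    return pieces


def find_all(text, piece):
    """Yield every start position of an exact occurrence of piece in text."""
    pos = text.find(piece)
    while pos != -1:
        yield pos
        pos = text.find(piece, pos + 1)


def candidate_windows(text, query, tau):
    """Filtering: keep only text windows that contain an exact copy of a piece."""
    windows = set()
    for offset, piece in partition(query, tau):
        for pos in find_all(text, piece):
            # A match can shift by at most tau positions around the piece.
            lo = max(0, pos - offset - tau)
            hi = min(len(text), pos - offset + len(query) + tau)
            windows.add((lo, hi))
    return sorted(windows)


def min_edit_distance_to_substring(query, window):
    """Verification: smallest edit distance from query to any substring of window."""
    prev = [0] * (len(window) + 1)  # a match may start anywhere at no cost
    for i, qc in enumerate(query, 1):
        cur = [i]
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1,                # delete qc
                           cur[j - 1] + 1,             # insert wc
                           prev[j - 1] + (qc != wc)))  # match or substitute
        prev = cur
    return min(prev)  # a match may end anywhere


def has_similar_substring(text, query, tau):
    """True if some substring of text is within edit distance tau of query."""
    return any(min_edit_distance_to_substring(query, text[lo:hi]) <= tau
               for lo, hi in candidate_windows(text, query, tau))
```

For example, `has_similar_substring("the quick brown fox jumps", "quick brwon", 2)` only verifies the single window produced by the pieces "quic" and "k br", instead of scanning the whole text. Using more pieces per filter makes each filter more selective but increases the number of filters to probe, which is the trade-off the abstract refers to.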
Data Availability
The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.
Notes
A subsequence may consist of non-consecutive characters.
Funding
This work is partly supported by the National Natural Science Foundation of China (Nos. 62002245 and 61802268) and the Natural Science Foundation of Liaoning Province (Nos. 2022-BS-218 and 2022-MS-303).
Author information
Contributions
Tao Qiu and Chuanyu Zong wrote the main manuscript text. Tao Qiu, Xiaochun Yang, and Bing Li proposed the algorithms. Chuanyu Zong prepared all figures. Tao Qiu and Bin Wang conducted the experiments. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Qiu, T., Zong, C., Yang, X. et al. Hierarchical filtering: improving similar substring matching under edit distance. World Wide Web 26, 1967–2001 (2023). https://doi.org/10.1007/s11280-022-01128-w
DOI: https://doi.org/10.1007/s11280-022-01128-w