Hierarchical filtering: improving similar substring matching under edit distance

Qiu, Tao; Zong, Chuanyu; Yang, Xiaochun; Wang, Bin; Li, Bing

doi:10.1007/s11280-022-01128-w

Published: 06 December 2022

World Wide Web, volume 26, pages 1967–2001 (2023)

Abstract

Similar substring matching is an essential operation in applications such as read mapping and text retrieval, and it has attracted significant attention in the research community. In this paper, we study the problem of similar substring matching with edit distance constraints. Existing methods generally use a filtering-and-verification framework: a filtering procedure reduces the search space before a computationally expensive verification step, and efficiency depends critically on balancing the cost of filtering against the cost of verification. The common filtering paradigm is based on the pigeonhole principle, which states that a matching result must exactly match at least a certain number of substrings from the query; these substrings act as a filter. However, enlarging the number of substrings in a filter causes polynomial growth in the number of filters, so existing methods do not balance the cost of filtering and verification well. To this end, we propose a novel filtering paradigm, hierarchical filtering, which aims at a fine-grained balance between the cost of filtering and the cost of verification. Instead of using a fixed number of substrings per filter, our method allows filters to contain different numbers of substrings, which avoids the polynomial growth of filters. The filters are selected according to a scoring metric. We devise a tree-based filtering framework for hierarchical filtering. The cost of filtering and verification is further reduced by eliminating duplicate filters. Extensive experiments have been conducted on four real-world datasets, comparing against state-of-the-art methods including Hobbes3, BWA, and BLAST. The results show that our method outperforms the competing methods under a wide range of parameter settings.
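
The filtering-and-verification framework described above can be illustrated with the classic pigeonhole baseline. The sketch below is not the hierarchical filtering method of this paper: it assumes the simplest filter, in which the query is split into k + 1 disjoint pieces, so that any substring within edit distance k of the query must contain at least one piece exactly. All function names (partition, candidate_windows, best_substring_distance, similar_substring_windows) are hypothetical and chosen for illustration only.

```python
def partition(query, k):
    """Split query into k + 1 contiguous pieces, returned as (offset, piece)."""
    n_parts = k + 1
    if len(query) < n_parts:
        raise ValueError("query must have at least k + 1 characters")
    base, extra = divmod(len(query), n_parts)
    pieces, start = [], 0
    for i in range(n_parts):
        length = base + (1 if i < extra else 0)
        pieces.append((start, query[start:start + length]))
        start += length
    return pieces


def candidate_windows(text, query, k):
    """Filtering step: every exact occurrence of a piece yields one window
    of the text that could contain a match within edit distance k."""
    windows = set()
    for offset, piece in partition(query, k):
        pos = text.find(piece)
        while pos != -1:
            # The match can start at most k positions away from pos - offset.
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + len(query) + k)
            windows.add((lo, hi))
            pos = text.find(piece, pos + 1)
    return windows


def best_substring_distance(query, window):
    """Verification step (Sellers-style dynamic programming): minimum edit
    distance between query and any substring of window."""
    prev = [0] * (len(window) + 1)  # a match may start anywhere in window
    for i, qc in enumerate(query, 1):
        cur = [i]
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1,               # skip a query character
                           cur[j - 1] + 1,            # skip a window character
                           prev[j - 1] + (qc != wc)))  # match or substitute
        prev = cur
    return min(prev)  # a match may end anywhere in window


def similar_substring_windows(text, query, k):
    """Return the sorted (start, end) windows of text that contain a
    substring within edit distance k of query."""
    return sorted(
        (lo, hi)
        for lo, hi in candidate_windows(text, query, k)
        if best_substring_distance(query, text[lo:hi]) <= k
    )
```

Calling similar_substring_windows("xxACGTACGTxx", "ACGAACGT", 1) runs the expensive dynamic program only on the windows produced by exact piece lookups. Using more or longer pieces changes how many windows survive filtering, which is the filtering-versus-verification trade-off the abstract refers to.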

Data Availability

The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.

Notes

A subsequence may consist of non-consecutive characters.
http://hgdownload.cse.ucsc.edu/goldenpath/hg18/chromosomes/
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/
http://fruitfly.org/sequence/
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

References

Ahmadi, A., Behm, A., Honnalli, N., Li, C., Weng, L., Xie, X.: Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Res. 40(6), e41–e41 (2011). https://doi.org/10.1093/nar/gkr1246
Kim, J., Li, C., Xie, X.: Hobbes3: dynamic generation of variable-length signatures for efficient approximate subsequence mappings. In: ICDE, IEEE, pp. 169–180. https://doi.org/10.1109/ICDE.2016.7498238 (2016)
Kim, Y., Park, H., Shim, K., Woo, K.G.: Efficient processing of substring match queries with inverted variable-length gram indexes. Inform. Sci. 244, 119–141 (2013). https://doi.org/10.1016/j.ins.2013.04.037
Li, H., Durbin, R.: Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics 25(14), 1754–1760 (2009). https://doi.org/10.1093/bioinformatics/btp324
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, IEEE, pp. 257–266. https://doi.org/10.1109/ICDE.2008.4497434 (2008)
Wang, J., Li, G., Deng, D., Zhang, Y., Feng, J.: Two birds with one stone: an efficient hierarchical framework for top-k and threshold-based string similarity search. In: ICDE, IEEE, pp. 519–530. https://doi.org/10.1109/ICDE.2015.7113311 (2015)
Wang, J., Yang, X., Wang, B., Liu, C.: An adaptive approach of approximate substring matching. In: DASFAA, Springer, pp. 501–516. https://doi.org/10.1007/978-3-319-32025-0_31 (2016)
Qin, J., Wang, W., Xiao, C., Lu, Y., Lin, X., Wang, H.: Asymmetric signature schemes for efficient exact edit similarity query processing. ACM Trans. Database Syst. 38(3), 1–44 (2013). https://doi.org/10.1145/2508020.2508023
Wang, J., Yang, X., Wang, B., Liu, C.: Ls-join: local similarity join on string collections. IEEE Trans. Knowl. Data Eng. 29(9), 1928–1942 (2017). https://doi.org/10.1109/TKDE.2017.2687460
Kim, J., Li, C., Xie, X.: Improving read mapping using additional prefix grams. BMC Bioinform. 15(1), 42 (2014). https://doi.org/10.1186/1471-2105-15-42
Kim, Y., Shim, K.: Efficient top-k algorithms for approximate substring matching. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, pp. 385–396 (2013)
Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theo. Comput. Sci. 92(1), 191–211 (1992). https://doi.org/10.1016/0304-3975(92)90143-4
Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395–415 (1999). https://doi.org/10.1145/316542.316550
Cheng, H., Jiang, H., Yang, J., Xu, Y., Shang, Y.: Bitmapper: an efficient all-mapper based on bit-vector computing. BMC Bioinform. 16(1), 192 (2015). https://doi.org/10.1186/s12859-015-0626-9
Li, G., Deng, D., Wang, J., Feng, J.: Pass-join: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011). https://doi.org/10.14778/2078331.2078340
Yang, X., Wang, B., Li, C., Wang, J., Xie, X.: Efficient direct search on compressed genomic data. In: ICDE, IEEE, pp. 961–972. https://doi.org/10.1109/ICDE.2013.6544889 (2013)
Chen, C., Qin, J., Wang, W.: On gapped set intersection size estimation. In: CIKM, ACM, pp. 1351–1360. https://doi.org/10.1145/2806416.2806438 (2015)
Consortium, T.G.P.: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061–1073 (2010). https://doi.org/10.1038/nature09534
Weese, D., Holtgrewe, M., Reinert, K.: Razers 3: faster, fully sensitive read mapping. Bioinformatics 28(20), 2592–2599 (2012). https://doi.org/10.1093/bioinformatics/bts505
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Molecular Bio. 215(3), 403–410 (1990). https://doi.org/10.1016/S0022-2836(05)80360-2
Qiu, T., Yang, X., Wang, B., Han, Y., Wang, S.: Efficient approximate subsequence matching using hybrid signatures. In: DASFAA, Springer, pp. 600–609. https://doi.org/10.1007/978-3-319-91452-7_39 (2018)
Yang, X., Wang, B., Li, C.: Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD, ACM, pp. 353–364. https://doi.org/10.1145/1376616.1376655 (2008)
Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41(7), e78–e78 (2013). https://doi.org/10.1093/nar/gkt005
Hanhan, R., Garzón, E., Jahshan, Z., Teman, A., Lanuzza, M., Yavits, L.: Edam: edit distance tolerant approximate matching content addressable memory. In: ISCA, ACM, pp. 495–507. https://doi.org/10.1145/3470496.3527424 (2022)
Lam, T.W., Sung, W.-K., Tam, S.-L., Wong, C.-K., Yiu, S.-M.: Compressed indexing and local alignment of dna. Bioinformatics 24(6), 791–797 (2008). https://doi.org/10.1093/bioinformatics/btn032
Li, H., Durbin, R.: Fast and accurate long-read alignment with burrows–wheeler transform. Bioinformatics 26(5), 589–595 (2010). https://doi.org/10.1093/bioinformatics/btp698
Yang, X., Liu, H., Wang, B.: Alae: accelerating local alignment with affine gap exactly in biosequence databases. PVLDB 5(11), 1507–1518 (2012). https://doi.org/10.14778/2350229.2350265
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: FOCS, IEEE, pp. 390–398. https://doi.org/10.1109/SFCS.2000.892127 (2000)
Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Tech. Rep. (1994)
Newkirk, D., Biesinger, J., Chon, A., Yokomori, K., Xie, X.: Arem: aligning short reads from chip-sequencing by expectation maximization. J. Comput. Biol. 18(11), 1495–1505 (2011). https://doi.org/10.1089/cmb.2011.0185
Roberts, A., Pachter, L.: Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10(1), 71–73 (2013). https://doi.org/10.1038/nmeth.2251
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Bio. 10(3), R25 (2009). https://doi.org/10.1186/gb-2009-10-3-r25
Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with bowtie 2. Nat. Methods 9(4), 357–359 (2012). https://doi.org/10.1038/nmeth.1923
Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008). https://doi.org/10.14778/1453856.1453957
Echihabi, K., Zoumpatianos, K., Palpanas, T.: High-dimensional similarity search for scalable data science. In: ICDE, IEEE, pp. 2369–2372. https://doi.org/10.1109/ICDE51399.2021.00268 (2021)
Cormode, G., Muthukrishnan, S.: The string edit distance matching problem with moves. ACM Trans. Alg. 3(1), 1–19 (2007). https://doi.org/10.1145/1186810.1186812
Fiori, F.J., Pakalén, W., Tarhio, J.: Approximate string matching with SIMD. Comput. J. 65(6), 1472–1488 (2021). https://doi.org/10.1093/comjnl/bxaa193
Song, G., Shim, K., Lee, H.: Substring similarity search with synonyms. In: ICDE, IEEE, pp. 2003–2008. https://doi.org/10.1109/ICDE51399.2021.00191 (2021)
Zhang, Z., Pun, C.-M.: Learning ordinal constraint binary codes for fast similarity search. Inf. Process. Manag. 59(3), 102919 (2022). https://doi.org/10.1016/j.ipm.2022.102919
Meng, Z., Shen, H.: Fast top-k similarity search in large dynamic attributed networks. Inf. Process. Manag. 56(6), 102074 (2019). https://doi.org/10.1016/j.ipm.2019.102074
Lu, M., Huang, Y., Xie, M., Liu, J.: Rank hash similarity for fast similarity search. Inf. Process. Manag. 49(1), 158–168 (2013). https://doi.org/10.1016/j.ipm.2012.07.003
Yuan, H., Li, G.: Distributed in-memory trajectory similarity search and join on road network. In: ICDE, IEEE, pp. 1262–1273. https://doi.org/10.1109/ICDE.2019.00115 (2019)
Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: the ulisse approach. PVLDB 11(13), 2236–2248 (2018). https://doi.org/10.14778/3275366.3284968

Funding

This work is partly supported by the National Natural Science Foundation of China (Nos. 62002245 and 61802268) and the Natural Science Foundation of Liaoning Province (Nos. 2022-BS-218 and 2022-MS-303).

Author information

Authors and Affiliations

School of Computer Science, Shenyang Aerospace University, Shenyang, China
Tao Qiu & Chuanyu Zong
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Xiaochun Yang & Bin Wang
A*STAR Centre for Frontier AI Research (CFAR), Singapore, Singapore
Bing Li

Contributions

Tao Qiu and Chuanyu Zong wrote the main manuscript text. Tao Qiu, Xiaochun Yang, and Bing Li proposed the algorithms. Chuanyu Zong prepared all figures. Tao Qiu and Bin Wang conducted the experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Tao Qiu.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qiu, T., Zong, C., Yang, X. et al. Hierarchical filtering: improving similar substring matching under edit distance. World Wide Web 26, 1967–2001 (2023). https://doi.org/10.1007/s11280-022-01128-w

Received: 21 September 2022
Revised: 04 November 2022
Accepted: 19 November 2022
Published: 06 December 2022
Issue Date: July 2023
DOI: https://doi.org/10.1007/s11280-022-01128-w
