Efficient Approximate Subsequence Matching Using Hybrid Signatures

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10827)

Abstract

In this paper, we focus on the problem of approximate subsequence matching, also called the read mapping problem in genomics: finding the subsequences of a reference genome that are similar to a query (a DNA subsequence) under a user-specified similarity threshold k. Here, a subsequence is a substring, i.e., a run of consecutive characters. Existing methods first extract subsequences from the query to generate signatures, then use these signatures to produce candidate positions, and finally verify the candidate positions to obtain the true mapping positions. However, these methods have two main issues: (1) they produce many candidate positions; and (2) they generate large numbers of signatures, many of which are redundant. To address these two issues, we propose a novel filtering technique, called hybrid signatures, which achieves a better balance between the filtering ability of signatures and the overhead of producing candidate positions. Accordingly, we devise an adaptive algorithm to produce candidate positions using hybrid signatures. Finally, experimental results on real-world genomic sequences show that our method outperforms state-of-the-art methods in query efficiency.
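The generic filter-and-verify pipeline that the abstract attributes to existing methods can be illustrated with a minimal sketch. It is not the hybrid-signature technique proposed in the paper. It assumes that the similarity threshold k is an edit-distance bound, and it uses the classic pigeonhole partition of the query into k + 1 disjoint pieces as signatures. All function names (`split_pieces`, `find_all`, `match_ends`, `map_read`) are hypothetical and chosen only for illustration.

```python
def split_pieces(query, parts):
    """Split query into `parts` disjoint pieces; return (offset, piece) pairs."""
    base, extra = divmod(len(query), parts)
    pieces, start = [], 0
    for j in range(parts):
        length = base + (1 if j < extra else 0)
        pieces.append((start, query[start:start + length]))
        start += length
    return pieces


def find_all(text, piece):
    """Yield every (possibly overlapping) start position of piece in text."""
    pos = text.find(piece)
    while pos != -1:
        yield pos
        pos = text.find(piece, pos + 1)


def match_ends(text, query, k):
    """Semi-global edit distance (Sellers' algorithm).

    Returns every end offset e such that some substring text[s:e]
    is within edit distance k of query.
    """
    m = len(query)
    prev = list(range(m + 1))      # column for the empty text prefix
    ends = []
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)        # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost, prev[i] + 1, cur[i - 1] + 1)
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


def map_read(reference, query, k):
    """Filter with exact piece matches, then verify each candidate window."""
    m, n = len(query), len(reference)
    if m <= k:
        raise ValueError("query must be longer than k")
    # Filtering: with at most k edits, at least one of the k + 1 pieces
    # must occur unchanged, shifted by at most k positions.
    windows = set()
    for offset, piece in split_pieces(query, k + 1):
        for pos in find_all(reference, piece):
            lo = max(0, pos - offset - k)
            hi = min(n, pos - offset + m + k)
            windows.add((lo, hi))
    # Verification: run the dynamic program only inside candidate windows.
    ends = set()
    for lo, hi in windows:
        ends.update(lo + e for e in match_ends(reference[lo:hi], query, k))
    return sorted(ends)
```

Calling `map_read(reference, query, k)` returns the reference positions at which an approximate occurrence of the query ends. In this sketch every piece occurrence becomes a candidate window, so short or repetitive pieces inflate the verification cost; this is the kind of trade-off between filtering ability and candidate-generation overhead that the abstract refers to.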

The work is partially supported by the National Natural Science Foundation of China (Nos. 61572122, U1736104, 61532021).

Notes

  1. \(\binom{n}{r}\) denotes the number of r-combinations of a set of size n.

  2. http://hobbes.ics.uci.edu/

  3. http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/

  4. http://fruitfly.org/sequence/

  5. ftp://ftp-trace.ncbi.nih.gov/1000genomes/

Corresponding author

Correspondence to Tao Qiu.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

Cite this paper

Qiu, T., Yang, X., Wang, B., Han, Y., Wang, S. (2018). Efficient Approximate Subsequence Matching Using Hybrid Signatures. In: Pei, J., Manolopoulos, Y., Sadiq, S., Li, J. (eds) Database Systems for Advanced Applications. DASFAA 2018. Lecture Notes in Computer Science, vol 10827. Springer, Cham. https://doi.org/10.1007/978-3-319-91452-7_39

  • DOI: https://doi.org/10.1007/978-3-319-91452-7_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91451-0

  • Online ISBN: 978-3-319-91452-7

  • eBook Packages: Computer Science, Computer Science (R0)
