ACRES: efficient query answering on large compressed sequences

Wang, Bin; Yang, Xiaochun; Wang, Guoren

doi:10.1007/s11280-017-0518-1

Published: 30 November 2017

World Wide Web, Volume 21, pages 1349–1376 (2018)

Abstract

With the advances in next-generation sequencing, the amount of genomic sequence data being produced continues to grow at an exponential rate. It is estimated that the entire genome of each individual human, containing about 3 billion letters, could be made available in the next few years. An increasingly pressing issue in genomics and medicine is how to efficiently store and query these massive amounts of sequence data. Recently, a lossless compression technique has been proposed that drastically reduces the storage space of genomic sequences by taking advantage of the fact that any two genomes from the same species are highly similar, so only their differences need to be encoded. In this paper we study how to efficiently answer queries on the compressed sequences without first decompressing them. We study three important types of queries: retrieving a subsequence, finding subsequences matching a given pattern, and finding subsequences similar to a pattern. We propose an index structure, filtering techniques, and efficient algorithms for answering these queries. We further demonstrate the utility of these algorithms using a real dataset.
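To make the idea of querying without decompressing concrete, the following is a minimal sketch, not the ACRES index or the paper's algorithms. It assumes a deliberately simplified difference encoding in which a sequence differs from its reference only by single-letter substitutions (no insertions or deletions), and the function names `compress` and `substring` are hypothetical. Under those assumptions, a subsequence can be retrieved by patching only the requested window of the reference.

```python
from bisect import bisect_left

# Assumption: the target differs from the reference only by substitutions,
# so both sequences have the same length. Real genomic difference encodings
# also record insertions and deletions, which this sketch does not handle.

def compress(reference: str, target: str) -> list[tuple[int, str]]:
    """Encode target as a sorted list of (position, letter) substitutions."""
    if len(reference) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]

def substring(reference: str, edits: list[tuple[int, str]],
              start: int, end: int) -> str:
    """Return target[start:end] without rebuilding the whole target.

    Assumes 0 <= start <= end <= len(reference).
    """
    window = list(reference[start:end])
    # Binary search for the first edit at or after `start`.
    first = bisect_left(edits, (start, ""))
    for pos, letter in edits[first:]:
        if pos >= end:
            break
        window[pos - start] = letter
    return "".join(window)

reference = "ACGTACGTAC"
target = "ACGAACGTTC"
edits = compress(reference, target)          # two substitutions are stored
print(substring(reference, edits, 2, 9))     # same letters as target[2:9]
```

Only the edits that fall inside the requested window are touched, so the cost depends on the window length and the number of differences in it rather than on the length of the whole sequence.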

Notes

We use the terms “string” and “sequence” in a synonymous way. Note, however, that we clearly distinguish between the terms “substring” and “subsequence,” the latter being the much more general term.
We use (x,y] to denote a PMR that overlaps with its left interval; similarly, [x,y) denotes a region that overlaps with its right interval.
See http://silver.ics.uci.edu/~dnazip/index.html.
See http://www.ncbi.nlm.nih.gov/IEB/ToolBox.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Nos. 61572122, 61532021, U173610071), the Liaoning BaiQianWan Talents Program, and the Fundamental Research Funds for the Central Universities (No. N161606002).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Northeastern University, Shenyang, China
Bin Wang, Xiaochun Yang & Guoren Wang

Corresponding author

Correspondence to Xiaochun Yang.

About this article

Cite this article

Wang, B., Yang, X. & Wang, G. ACRES: efficient query answering on large compressed sequences. World Wide Web 21, 1349–1376 (2018). https://doi.org/10.1007/s11280-017-0518-1

Received: 16 October 2017
Revised: 16 November 2017
Accepted: 22 November 2017
Published: 30 November 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s11280-017-0518-1
