ACRES: efficient query answering on large compressed sequences

World Wide Web

Abstract

With advances in next-generation sequencing, the amount of genomic sequence data being produced continues to grow at an exponential rate. It is estimated that the entire genome of each individual human, about 3 billion letters long, could be made available within the next few years. An increasingly pressing issue in genomics and medicine is how to store and query these massive amounts of sequence data efficiently. Recently, a lossless compression technique was proposed that drastically reduces the storage space of genomic sequences by exploiting the fact that any two genomes from the same species are highly similar, so only their differences need to be encoded. In this paper we study how to answer queries on the compressed sequences efficiently without first decompressing them. We study three important types of queries: retrieving a subsequence, finding subsequences that match a given pattern, and finding subsequences similar to a pattern. We propose an index structure, filtering techniques, and efficient algorithms for answering these queries. We further demonstrate the utility of these algorithms using a real dataset.
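
The idea of encoding only the differences from a reference genome, and of retrieving a subsequence without decompressing the whole sequence, can be illustrated with a minimal sketch. It is not the index structure or the algorithms proposed in the paper: it assumes, for simplicity, that the two genomes differ only by single-letter substitutions (so positions line up), and the names `compress` and `subsequence` are hypothetical.

```python
from bisect import bisect_left

def compress(reference: str, genome: str) -> list[tuple[int, str]]:
    """Encode genome as a sorted list of (position, base) substitutions.

    Assumption for this sketch: substitutions only, so both strings
    have the same length and share one coordinate system.
    """
    if len(reference) != len(genome):
        raise ValueError("this sketch handles substitutions only")
    return [(i, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g]

def subsequence(reference: str, diffs: list[tuple[int, str]],
                start: int, end: int) -> str:
    """Return genome[start:end] by patching only the requested reference slice."""
    window = list(reference[start:end])
    # Binary search for the first recorded difference at or after `start`.
    first = bisect_left(diffs, (start, ""))
    for pos, base in diffs[first:]:
        if pos >= end:
            break
        window[pos - start] = base
    return "".join(window)

reference = "ACGTACGTACGT"
genome = "ACGTTCGTACGA"
diffs = compress(reference, genome)   # two substitutions stored, not 12 letters
print(subsequence(reference, diffs, 3, 8))
```

Only the differences that fall inside the requested range are touched, which is why the query cost depends on the range length and the local number of differences rather than on the genome length. Insertions and deletions shift coordinates and would need extra bookkeeping that this sketch leaves out.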

Notes

  1. We use the terms “string” and “sequence” in a synonymous way. Note, however, that we clearly distinguish between the terms “substring” and “subsequence,” the latter being the much more general term.

  2. We use (x,y] to denote a PMR that overlaps with its left interval; similarly, [x,y) denotes a region that overlaps with its right interval.

  3. See http://silver.ics.uci.edu/~dnazip/index.html.

  4. See http://www.ncbi.nlm.nih.gov/IEB/ToolBox.

Acknowledgments

The work is partially supported by the National Natural Science Foundation of China (Nos. 61572122, 61532021, U173610071), the Liaoning BaiQianWan Talents Program, and the Fundamental Research Funds for the Central Universities (No. N161606002).

Author information

Corresponding author

Correspondence to Xiaochun Yang.

Cite this article

Wang, B., Yang, X. & Wang, G. ACRES: efficient query answering on large compressed sequences. World Wide Web 21, 1349–1376 (2018). https://doi.org/10.1007/s11280-017-0518-1
