Abstract
With advances in next-generation sequencing, the amount of genomic sequence data being produced continues to grow at an exponential rate. It is estimated that the entire genome of every individual human, each containing about 3 billion letters, could be made available within the next few years. An increasingly pressing issue in genomics and medicine is how to store and query these massive amounts of sequence data efficiently. Recently, a lossless compression technique has been proposed that drastically reduces the storage space of genomic sequences by exploiting the fact that any two genomes from the same species are highly similar, so only their differences need to be encoded. In this paper we study how to answer queries on the compressed sequences efficiently without first decompressing them. We study three important types of queries: retrieving a subsequence, finding subsequences that match a given pattern, and finding subsequences similar to a pattern. We propose an index structure, filtering techniques, and efficient algorithms for answering these queries. We further demonstrate the utility of these algorithms using a real dataset.
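The idea of storing only the differences from a reference genome, and of retrieving a subsequence without decompressing the whole sequence, can be illustrated with a minimal sketch. It is not the index structure or the algorithms proposed in the paper. It assumes, for simplicity, that an individual sequence differs from the reference only by single-letter substitutions, so positions never shift; insertions and deletions are not handled. The function names `encode_substitutions` and `retrieve_subsequence` are hypothetical and chosen only for this illustration.

```python
from bisect import bisect_left


def encode_substitutions(reference, individual):
    """Record only the positions where `individual` differs from `reference`.

    Assumption of this sketch: both sequences have the same length, so every
    difference is a single-letter substitution. The result is a list of
    (position, letter) pairs sorted by position.
    """
    if len(reference) != len(individual):
        raise ValueError("this sketch handles equal-length sequences only")
    return [
        (pos, letter)
        for pos, (ref_letter, letter) in enumerate(zip(reference, individual))
        if ref_letter != letter
    ]


def retrieve_subsequence(reference, diffs, start, end):
    """Return individual[start:end] using only the reference and the diffs.

    Only the requested window of the reference is copied, and only the
    substitutions that fall inside [start, end) are applied, so the full
    individual sequence is never rebuilt.
    """
    window = list(reference[start:end])
    # (start,) sorts before any (start, letter) pair, so bisect_left finds
    # the first recorded substitution at or after `start`.
    k = bisect_left(diffs, (start,))
    while k < len(diffs) and diffs[k][0] < end:
        pos, letter = diffs[k]
        window[pos - start] = letter
        k += 1
    return "".join(window)
```

With a reference such as `"ACGTACGTAC"` and an individual sequence that differs in two positions, `encode_substitutions` stores two pairs instead of ten letters, and `retrieve_subsequence` answers a range query by touching only the requested window and the substitutions inside it.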
Notes
We use the terms “string” and “sequence” in a synonymous way. Note, however, that we clearly distinguish between the terms “substring” and “subsequence,” the latter being the much more general term.
We use (x,y] to denote a PMR that overlaps with its left interval; similarly, [x,y) denotes a region that overlaps with its right interval.
Acknowledgments
The work is partially supported by the National Natural Science Foundation of China (Nos. 61572122, 61532021, U173610071), the Liaoning BaiQianWan Talents Program, and the Fundamental Research Funds for the Central Universities (No. N161606002).
Cite this article
Wang, B., Yang, X. & Wang, G. ACRES: efficient query answering on large compressed sequences. World Wide Web 21, 1349–1376 (2018). https://doi.org/10.1007/s11280-017-0518-1
DOI: https://doi.org/10.1007/s11280-017-0518-1