Abstract
We study a lossy compression scheme linked to the biological problem of founder reconstruction: the goal in founder reconstruction is to replace a set of strings with a smaller set of founders such that the original connections are maintained as well as possible. A general formulation of this problem is NP-hard, but when restricting to reconstructions that form a segmentation of the input strings, polynomial time solutions exist. In our earlier work (WABI 2018) we proposed a linear time solution to a formulation where the minimum segment length was bounded, but it was left open whether the same running time can be obtained when the targeted compression level (number of founders) is bounded and the lossiness is minimized. This optimization is captured by the Maximum Segmentation problem: given a threshold M and a set \(\mathcal {R} = \{\mathcal {R}_1,\ldots ,\mathcal {R}_m\}\) of strings of the same length n, find a minimum cost partition P where, for each segment \([i,j] \in P\), the compression level \(\vert \{\mathcal {R}_k[i,j]: 1\le k \le m\} \vert \) is bounded from above by M. We give linear time algorithms to solve the problem for two different (compression quality) measures on P: the average length of the intervals of the partition and the length of the minimal interval of the partition. These algorithms make use of the positional Burrows–Wheeler transform and of the range maximum queue, an extension of range maximum queries to the case where the input string can be operated as a queue. For the latter, we present a new solution that may be of independent interest. The solutions work in a streaming model where one column of the input strings is introduced at a time.
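The range maximum queue mentioned in the abstract can be illustrated with the standard monotone-deque technique for sliding-window maxima. The sketch below is only an illustration under that assumption, not the paper's actual data structure; the class name `MaxQueue` is hypothetical:

```python
from collections import deque


class MaxQueue:
    """Queue supporting push, pop (FIFO), and max in amortized O(1) time.

    Illustrative sketch of a 'range maximum queue': a second, monotone
    deque keeps candidate maxima in decreasing order, so its front is
    always the maximum of the current queue contents.
    """

    def __init__(self):
        self._queue = deque()     # all elements, FIFO order
        self._monotone = deque()  # decreasing sequence of candidate maxima

    def push(self, value):
        self._queue.append(value)
        # Drop candidates smaller than the new value: they can never be
        # the maximum while `value` is still in the queue.
        while self._monotone and self._monotone[-1] < value:
            self._monotone.pop()
        self._monotone.append(value)

    def pop(self):
        value = self._queue.popleft()
        if self._monotone and self._monotone[0] == value:
            self._monotone.popleft()
        return value

    def max(self):
        return self._monotone[0]
```

Each element is pushed and popped at most once from each deque, which gives the amortized constant-time bounds.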
This work was partially supported by the Academy of Finland (grant 309048).
Notes
- 1.
Cartesian trees were originally introduced for Range Minimum Queries.
References
Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000). https://doi.org/10.1007/10719839_9
Blin, G., Rizzi, R., Sikora, F., Vialette, S.: Minimum mosaic inference of a set of recombinants. Int. J. Found. Comput. Sci. 24(1), 51–66 (2013)
Durbin, R.: Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 30(9), 1266–1272 (2014)
Fischer, J., Heun, V.: Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 36–48. Springer, Heidelberg (2006). https://doi.org/10.1007/11780441_5
Gajewska, H., Tarjan, R.E.: Deques with heap order. Inf. Process. Lett. 22(4), 197–200 (1986)
Mäkinen, V., Norri, T.: Applying the positional Burrows-Wheeler transform to all-pairs hamming distance. Submitted manuscript (2018)
Norri, T., Cazaux, B., Kosolobov, D., Mäkinen, V.: Minimum segmentation for pan-genomic founder reconstruction in linear time. In: 18th International Workshop on Algorithms in Bioinformatics, WABI 2018, Helsinki, Finland, 20–22 August 2018, pp. 15:1–15:15 (2018). https://doi.org/10.4230/LIPIcs.WABI.2018.15
Rastas, P., Ukkonen, E.: Haplotype inference via hierarchical genotype parsing. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS, vol. 4645, pp. 85–97. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74126-8_9
Ukkonen, E.: Finding founder sequences from a set of recombinants. In: Guigó, R., Gusfield, D. (eds.) WABI 2002. LNCS, vol. 2452, pp. 277–286. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45784-4_21
Valenzuela, D., Norri, T., Välimäki, N., Pitkänen, E., Mäkinen, V.: Towards pan-genome read alignment to improve variation calling. BMC Genomics 19(Suppl. 2), 87 (2018)
Appendix
About the Column Stream Model
Given an algorithm for a problem with an input \(\mathcal {I}\) and an output \(\mathcal {O}\), the space complexity of this algorithm corresponds to the space used by \(\mathcal {I}\), by \(\mathcal {O}\), and by the auxiliary space, which is the temporary space used by the algorithm. Therefore the space complexity is in \(\varOmega (\vert \mathcal {I} \vert +\vert \mathcal {O} \vert )\). In the case of the Maximum Segmentation problems, all algorithms have a space complexity of \(\varOmega (nm)\), where the input is a set of m strings of length n. As we want to avoid an auxiliary space of \(\varTheta (nm)\) (which could exceed the available memory), we cannot use the random access model: that model corresponds to loading the whole input file into temporary memory. We suggest instead a specific streaming data model where the set of strings of the same length is seen column by column: the Column Stream Model. In this model, the space needed to hold the input at any given time is in \(\varTheta (m)\), which is acceptable.
To demonstrate the practicality of this model, we implemented a streaming way to read a file and tested this implementation with files of different sizes (see Fig. 2). The experiments were run on a machine with an Intel Xeon E5-2680 v4 2.4 GHz CPU, which has a 35 MB Intel SmartCache. The machine has 256 gigabytes of memory at a speed of 2400 MT/s. The code was compiled with g++ using the -Ofast optimization flag.
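Column-by-column reading can be sketched as follows. This is only an illustration under our own assumption (not necessarily the paper's file format) that the input file stores the column representation directly, one column per line; only \(\varTheta (m)\) characters are then resident in memory at any moment:

```python
def column_stream(path):
    """Yield the columns of m equal-length strings, one column at a time.

    Illustrative sketch: we assume line i of the file holds column i as
    a string of m characters, so each iteration keeps only one column
    (O(m) characters) in memory, as in the Column Stream Model.
    """
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

A segmentation algorithm then consumes the generator one column at a time, never materializing the full \(n \times m\) input in memory.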
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Cazaux, B., Kosolobov, D., Mäkinen, V., Norri, T. (2019). Linear Time Maximum Segmentation Problems in Column Stream Model. In: Brisaboa, N., Puglisi, S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science(), vol 11811. Springer, Cham. https://doi.org/10.1007/978-3-030-32686-9_23
Print ISBN: 978-3-030-32685-2
Online ISBN: 978-3-030-32686-9