Abstract
We define a generalization of the prefix sum problem in which the vector can be masked by segments of a second (Boolean) vector. This problem is shown to be related to several other prefix sum, set intersection and approximate string matching problems, via specific algorithms, reductions and conditional lower bounds. To our knowledge, we are the first to consider fully internal measurement queries and to prove lower bounds for them. We also discuss the hardness of the sparse variation in both static and dynamic settings. Finally, we provide a parallel algorithm to compute the answers to all possible queries when both vectors are fixed.
Notes
- 1.
- 1. We will use \(\log \) to denote \(\log _2\); since all our logarithms eventually end up in asymptotic notation, the constant base is irrelevant.
Appendices
A Details Omitted from Sect. 3
Proof of Theorem 3. Given a bit vector B of length m and an array A of length n, there is a data structure that uses \(O(\frac{mn}{f(n)}+m+n)\) words of space, answers masked prefix sum queries in \(O(f(n)+g(n))\) time and supports updates in \(O(\frac{mn\log f(n)}{g(n)f(n)} +g(n))\) time, for any functions f(n) and g(n) with \(0<f(n) < n\) and \(0< g(n) < m+n\).
Alternatively, for any \(c > 0\), there is a data structure that uses \(O(\frac{mn}{f(n)} + \frac{n^{1+c}}{c\log n}+m+n)\) words of space and can answer masked prefix sum queries in \(O(\frac{f(n)}{c\log n}+g(n))\) time and support updates in \(O(\frac{mn \log f(n)}{g(n)f(n)}+\frac{n^{1+c}}{g(n)}+g(n))\) time. If \(m = O(n)\), setting \(f(n) = n^{2/3}\log n\), \(g(n) = n^{2/3}\) and \(c = 1/3\) yields an \(O(n^{4/3}/\log n)\)-word data structure with \(O(n^{2/3})\) query and update times.
Proof
We first present a data structure with amortized update bounds. The main idea is to rebuild the data structures from Theorem 1 every g(n) updates. Since Theorem 1 presents multiple trade-offs, in the rest of the proof we use s(m, n), p(m, n) and q(n) to denote the space cost, preprocessing time and query time of the data structures in that theorem. Between rebuildings, we maintain two copies of the array and the bit mask: A and B store their current content, while \(A'\) and \(B'\) store their content at the time of the previous rebuilding. Thus, the data structure D constructed in the previous rebuilding can be used to answer masked prefix sum queries over \(A'\) and \(B'\). For the updates that arrive after the previous rebuilding, we maintain two sorted lists: \(L_A\), which stores the indexes of the entries of A updated since the previous rebuilding, and \(L_B\), which stores the indexes of the entries of B updated since the previous rebuilding. Since the length of either list is at most \(g(n) < m+n\), all the data structures occupy \(O(s(m,n) + m + n)\) words.
We then answer a masked prefix sum query as follows. Let k and i be the parameters of the query, i.e., we aim to compute \(\sum _{j=1}^{k} A[j]\cdot B[i+j-1]\). We first perform this query using D in q(n) time, obtaining what the answer would be if there had been no updates since the last rebuilding. Since both \(L_A\) and \(L_B\) are sorted, we can walk through them to compute the indexes of the elements of A that have either been updated since the last rebuilding, or are mapped by the query to a bit of B that has been updated since the last rebuilding. This uses O(g(n)) time. Then, for each such index d, we consult A, \(A'\), B and \(B'\) to compute how much the update, to either A[d] or \(B[d + i-1]\), changes the answer to the query relative to the answer given by D. This again requires O(g(n)) time over all these indexes. The entire process thus answers a query in \(O(q(n)+g(n))\) time.
For each update, it requires O(1) time to keep A and B up-to-date. It also requires an update to the sorted list \(L_A\) or \(L_B\), which can be done in O(g(n)) time. Finally, since the rebuilding requires O(p(m, n)) time and it is done every g(n) updates, the amortized cost of each update is then \(O(p(m,n)/g(n) + g(n))\).
The bounds in this theorem thus follow from the specific bounds on s(m, n), p(m, n) and q(n) in Theorem 1.
Finally, to deamortize using the global rebuilding approach, instead of rebuilding this data structure entirely during the update operation that triggers the rebuilding, we rebuild it over the next g(n) updates. This requires us to create two additional lists \(L_A'\) and \(L_B'\): Each time a rebuilding starts, we rename \(L_A\) and \(L_B\) to \(L_A'\) and \(L_B'\), and create new empty lists \(L_A\) and \(L_B\) to maintain indexes of the updates that arrive after the rebuilding starts. To answer a query, we cannot use the data structure that is currently being rebuilt since it is not complete, but we use the previous version of it and consult \(L_A\), \(L_B\), \(L_A'\) and \(L_B'\) to compute the answer using ideas similar to those described in previous paragraphs. \(\square \)
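As an illustration, the amortized rebuilding scheme above can be sketched in Python. This is a minimal sketch under several assumptions: the class and function names are ours, indexes are 0-based (so the query computes \(\sum_{j=0}^{k-1} A[j]\cdot B[i+j]\)), the snapshot-based `_static_query` stands in for the static structure of Theorem 1, and the correction step uses a set rather than the sorted-merge walk, so the theorem's asymptotic bounds are not preserved.

```python
import bisect

def naive_masked_prefix_sum(A, B, k, i):
    """Reference answer: sum_{j=0}^{k-1} A[j] * B[i+j] (0-based)."""
    return sum(A[j] * B[i + j] for j in range(k))

class DynamicMaskedPrefixSum:
    """Rebuilds a snapshot every g updates; queries combine the
    snapshot answer with corrections from the short update lists."""

    def __init__(self, A, B, g):
        self.A, self.B, self.g = list(A), list(B), g
        self._rebuild()

    def _rebuild(self):
        # Snapshot (A', B') playing the role of the static structure D.
        self.A0, self.B0 = list(self.A), list(self.B)
        self.LA, self.LB = [], []  # sorted indexes updated since rebuild

    def _static_query(self, k, i):
        # Stand-in for the Theorem 1 structure: naive over the snapshot.
        return naive_masked_prefix_sum(self.A0, self.B0, k, i)

    def _after_update(self, L, d):
        if d not in L:
            bisect.insort(L, d)
        if len(self.LA) + len(self.LB) >= self.g:
            self._rebuild()

    def update_A(self, d, v):
        self.A[d] = v
        self._after_update(self.LA, d)

    def update_B(self, d, v):
        self.B[d] = v
        self._after_update(self.LB, d)

    def query(self, k, i):
        ans = self._static_query(k, i)
        # Positions of A whose contribution changed since the snapshot:
        # either A[d] itself was updated, or the bit B[i+d] it is
        # mapped to was updated.
        dirty = {d for d in self.LA if d < k}
        dirty |= {d - i for d in self.LB if 0 <= d - i < k}
        for d in dirty:
            ans += self.A[d] * self.B[i + d] - self.A0[d] * self.B0[i + d]
        return ans
```

With g(n) set as in the theorem, the rebuild cost is spread over g(n) updates, matching the amortized bound; the deamortization then replaces the eager `_rebuild` with one spread over the next g(n) updates, as described above.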
B Details Omitted from Sect. 5
Proof of Lemma 5. Assume a constant-size alphabet \(\varSigma \). Then there is a linear-time reduction from InternalHammingDistance to the internal inner product problem, and vice versa. Moreover, there is a linear-time reduction from the InternalEMWW problem to the internal inner product problem, and vice versa.
Proof
The reduction from InternalHammingDistance to the internal inner product. For each letter \(\sigma \in \varSigma \), we convert S and T into bit vectors: in T, each occurrence of \(\sigma \) becomes 1 and each letter of \(\varSigma \setminus \{\sigma \}\) becomes 0, while in S, each occurrence of \(\sigma \) becomes 0 and each letter of \(\varSigma \setminus \{\sigma \}\) becomes 1. Thus, a Hamming distance query can be answered by summing a constant number (\(|\varSigma |\)) of internal inner products.
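To make the per-letter encoding concrete, here is a small Python sketch (the function name is ours, and the naive sums stand in for the internal inner product queries of the reduction). The product of the two indicator bits at position j is 1 exactly when \(T[j] = \sigma \ne S[j]\), so summing over all letters counts each mismatching position once.

```python
def hamming_via_inner_products(S, T, alphabet):
    """Hamming distance of equal-length S and T, computed as a sum of
    |alphabet| inner products of per-letter indicator vectors."""
    assert len(S) == len(T)
    total = 0
    for sigma in alphabet:
        t_bits = [1 if c == sigma else 0 for c in T]  # sigma -> 1 in T
        s_bits = [0 if c == sigma else 1 for c in S]  # sigma -> 0 in S
        total += sum(s * t for s, t in zip(s_bits, t_bits))
    return total
```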
The reduction from the internal inner product problem to InternalHammingDistance. Assume we have two bit vectors A and B. Every 1 in A is mapped to 001 and every 0 to 010, while in B, every 1 is mapped to 001 and every 0 to 100. Let S and T be the strings obtained from A and B, respectively. It is easy to see that aligning a 1 in A with a 1 in B causes 0 mismatches between the corresponding substrings of S and T, while each of the other three combinations results in exactly 2 mismatches. Here, corresponding substrings means that the starting and ending positions of the substrings are chosen to fit the original query, i.e., by multiplying the query indexes by 3. Note that this reduction also transfers the internal inner product to InternalEMWW.
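A sketch of this encoding (function names are ours): since each of the n aligned bit pairs contributes 2 mismatches unless both bits are 1, the inner product can be read off the Hamming distance of the encodings as \(n - \text{mismatches}/2\).

```python
def encode_A(bits):
    """A-side encoding: 1 -> 001, 0 -> 010."""
    return "".join("001" if b else "010" for b in bits)

def encode_B(bits):
    """B-side encoding: 1 -> 001, 0 -> 100."""
    return "".join("001" if b else "100" for b in bits)

def inner_product_via_hamming(A, B):
    """<A, B> recovered from the Hamming distance of the encodings."""
    S, T = encode_A(A), encode_B(B)
    mismatches = sum(s != t for s, t in zip(S, T))
    # Every pair other than (1, 1) contributes exactly 2 mismatches.
    return len(A) - mismatches // 2
```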
The reduction from the InternalEMWW problem to the internal inner product. In a similar way, the inner product solves the exact matching with wildcards problem. We repeat the process described above for Hamming distance, except that wildcards are always mapped to 0 in both S and T. It is easy to see that when the sum over all the inner products is 0, there is an exact match with wildcards.
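A sketch of the wildcard variant (names and the `?` wildcard symbol are ours): mapping wildcards to 0 on both sides makes every product involving a wildcard position vanish, so the sum counts only genuine letter-versus-letter mismatches and is 0 exactly when the strings match with wildcards.

```python
def matches_with_wildcards(S, T, alphabet, wildcard="?"):
    """True iff equal-length S and T match position-wise, with the
    wildcard matching any letter; computed via per-letter inner products."""
    assert len(S) == len(T)
    total = 0
    for sigma in alphabet:
        t_bits = [1 if c == sigma else 0 for c in T]              # wildcard -> 0
        s_bits = [0 if c in (sigma, wildcard) else 1 for c in S]  # wildcard -> 0
        total += sum(s * t for s, t in zip(s_bits, t_bits))
    return total == 0
```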
\(\square \)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Das, R., He, M., Kondratovsky, E., Munro, J.I., Wu, K. (2022). Internal Masked Prefix Sums and Its Connection to Fully Internal Measurement Queries. In: Arroyuelo, D., Poblete, B. (eds) String Processing and Information Retrieval. SPIRE 2022. Lecture Notes in Computer Science, vol 13617. Springer, Cham. https://doi.org/10.1007/978-3-031-20643-6_16