Abstract
Task-based programming provides programmers with an intuitive abstraction to express parallelism, and runtimes with the flexibility to adapt the schedule and load-balancing to the hardware. Although many profiling tools have been developed to understand these characteristics, the interplay between task scheduling and data reuse in the cache hierarchy has not been explored. These interactions are particularly intriguing due to the flexibility task-based runtimes have in scheduling tasks, which may allow them to improve cache behavior.
This work presents StatTask, a novel statistical cache model that can predict cache behavior for arbitrary task schedules and cache sizes from a single execution, without programmer annotations. StatTask enables fast and accurate modeling of data locality in task-based applications for the first time. We demonstrate the potential of this new analysis for scheduling by examining applications from the BOTS benchmark suite and identifying several important opportunities for reuse-aware scheduling.
This work was supported by the Swedish Foundation for Strategic Research project FFL12-0051 and the Swedish Research Council's Linnaeus centre of excellence UPMARC.
Notes
1. Note that these properties have never been formally described.
2. For a particular input data set.
3. The sizes of the inputs were all within 5%.
A Appendix: Proofs
Lemma
Let E and \(E'\) be execution traces such that they share the exact same set of accesses, but in different order. Then \(\tilde{\mathcal {R}}^{E} = \tilde{\mathcal {R}}^{E'}\).
Proof
Let \((x_j, x_k)\) be an element of \(\tilde{\mathcal {R}}^{E}\). By definition, \(a^j = a^k\). Since E and \(E'\) share the same set of accesses, there exist \(x_0\) and \(x_1\) in \(E'\) such that \(x_j = x_0\) and \(x_k = x_1\). If \(x_0 < x_1\), then since \(a^j = a^k\) we have \((x_0, x_1) \in \tilde{\mathcal {R}}^{E'}\); if \(x_1 < x_0\), then \((x_1, x_0) \in \tilde{\mathcal {R}}^{E'}\). The symmetric argument gives the reverse inclusion.
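The lemma can be illustrated with a small executable sketch (the function name and trace encoding are illustrative, not from the paper): if \(\tilde{\mathcal {R}}^{E}\) is modeled as the set of unordered pairs of accesses to the same address, with accesses identified by identity rather than position, then the set is invariant under any reordering of the trace.

```python
from itertools import combinations, permutations

def all_reuse_pairs(trace):
    """Model of R-tilde^E: all unordered pairs of accesses to the
    same address. Accesses are (id, address) tuples, so membership
    does not depend on their position in the trace."""
    return {frozenset((x, y)) for x, y in combinations(trace, 2)
            if x[1] == y[1]}

# A tiny trace of (access id, address) pairs.
E = [(0, 'a'), (1, 'b'), (2, 'a'), (3, 'b'), (4, 'a')]
base = all_reuse_pairs(E)

# Every reordering E' of the same accesses yields the same pair set.
assert all(all_reuse_pairs(list(p)) == base for p in permutations(E))
```

Note that this only models the order-independent pair set of the lemma; consecutive reuses (\(\mathcal {R}^{E}\)) do depend on the order.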
Lemma
\(\mathcal {M}^{-1}\) and \(\mathcal {M}\) are inverses.
Proof
Let T be a task such that \(E_T = x_1\dots x_r\). We can see that
Conversely,
Theorem
\(Q = \mathcal {R}^{E_{S'}}\)
Proof
It is straightforward to prove that \(Q\subseteq \mathcal {R}^{E_{S'}}\). It is enough to observe that \(\mathcal {C}_{S}(T, T')\) gives all the pairs in \(\tilde{\mathcal {R}}^{E_S}\) that start in T and end in \(T'\). Since this set is execution independent, those pairs are also in \(\tilde{\mathcal {R}}^{E_{S'}}\). The condition \(\forall T_q \in S'\) such that \( T(x_k)< T_q < T(x_j) \Rightarrow a^j \notin a(T_q) \) filters out the non-consecutive reuses. Therefore, all the elements of Q are consecutive reuses.
We will now show an outline for the proof that \(\mathcal {R}^{E_{S'}} \subseteq Q\). If \((x_j, x_{j+d}) \in (T_0 \rightarrow T_0)_{S'}\), then \(T(x_j) = T(x_{j+d}) = T_0\). As S and \(S'\) use the same task universe, \(\exists T_r \in S\) such that \(T_0^S = T_r^{S'}\). By Lemma 1, \(E_{T_0} = E_{T_r}\). Therefore, \(\exists x_p \in E_{T_r}\) such that \(x_p = x_j\) and \(x_{p+d} = x_{j+d}\), as private reuses are relatively offset. Then \((x_j, x_{j+d}) \in (T_r \rightarrow T_r)_S \subseteq Q_{T_r, T_r}\subseteq Q\).
Let's now consider the case \((x_j, x_{j+d}) \in (T_n \rightarrow T_m)_{S'}\). The tasks \(T_n\) and \(T_m\) also occur in S. We will assume that \(T_n < T_m\). Let \(T_1,\dots ,T_k\) be such that \(T_n< T_1< \cdots< T_k < T_m\). The proof is by induction on k.
When \(k=0\), the sequence \(T_nT_m\) occurs in S. Therefore, there exists \(x_r\) in \(E_S\) such that \([x_j, x_{j+d}] = [x_r, x_{r+d}]\). Since \(x_r = x_j\) and \((x_j, x_{j+d}) \in \mathcal {R}^{E_{S'}}\), we know that \(\forall x_s\) with \(x_r< x_s < x_{r+d}\), \(a^r \ne a^s\). Therefore \((x_j, x_{j+d}) = (x_r, x_{r+d}) \in (T_n \rightarrow T_m)^0 \subseteq Q_{T_n,T_m} \subseteq Q\).
When \(k=1\), then \(T_nT_{1}T_m \in S\). Let's first assume that \(a^j \notin a(T_{1})\); thus, \(\forall x_s\) such that \(x_j< x_s < x_{j+d} \Rightarrow a^s \ne a^j\). Therefore \((x_j, x_{j+d}) \in (T_n \rightarrow T_m) \subseteq Q_{T_n,T_m}\).
Otherwise, there are unique \(x_1, \dots , x_q\) such that \(x_j< x_1< \dots< x_q < x_{j+d}\) and \(a^j = a^1 = \cdots = a^q = a^{j+d}\), with all intervening accesses having different addresses. This means that the pairs \((x_j, x_1), (x_1, x_2), \dots , (x_q, x_{j+d})\) are elements of \(\mathcal {R}^{E_S}\). Therefore, by the definition of \(\mathcal {C}\), \((x_j, x_{j+d}) \in \mathcal {C}(T_n, T_m) \subseteq Q_{T_n,T_m} \subseteq Q\).
For the inductive step \(k \Rightarrow k+1\), let \(T_nT_1\dots T_{k+1}T_m \in S\). Two cases must be considered. In the first, the sets of addresses of tasks \(T_1,\dots , T_k\) are disjoint from those of \(T_n\) and \(T_m\); then the only thing left to check is the set of addresses of \(T_{k+1}\), which is analogous to the case \(k=1\). Otherwise, a unique sequence of accesses with the same address appears in \(T_1, \dots , T_k\), which by the inductive hypothesis can be used transitively to obtain a pair in Q.
The proof for when \(T_m < T_n\) is analogous.
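The chaining step in the proof above can be made concrete with a short sketch (function name and trace encoding are illustrative): \(\mathcal {R}^{E}\) is modeled as the set of consecutive reuses of a trace, pairing each access with the previous access to the same address, and a non-consecutive reuse is recovered by chaining consecutive pairs transitively, mirroring the induction.

```python
def consecutive_reuses(trace):
    """Model of R^E: pair each access with the most recent earlier
    access to the same address, i.e. reuses with no intervening
    access to that address."""
    last = {}       # address -> most recent access seen
    pairs = set()
    for x in trace:
        addr = x[1]
        if addr in last:
            pairs.add((last[addr], x))
        last[addr] = x
    return pairs

# Trace of (access id, address) pairs.
E = [(0, 'a'), (1, 'b'), (2, 'a'), (3, 'b'), (4, 'a')]
R = consecutive_reuses(E)

# The non-consecutive reuse (0,'a') -> (4,'a') is obtained by chaining
# the consecutive pairs (0,'a')->(2,'a') and (2,'a')->(4,'a'),
# as in the transitive step of the induction.
assert ((0, 'a'), (2, 'a')) in R and ((2, 'a'), (4, 'a')) in R
```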
Theorem
\(\gamma = \delta _{\mathcal {R}^{E_{S'}}}\)
Proof
Let \((x_j, x_k) \in \mathcal {R}^{E_{S'}}\), and let \(T_{x_j} = T(x_j)\) and \(T_{x_k} = T(x_k)\). Let \(T_{n_1},\dots ,T_{n_r}\) be such that \(T_{x_j}T_{n_1}\dots T_{n_r}T_{x_k} \in S'\); these are the tasks scheduled between the starting and ending tasks of the reuse in \(S'\). Similarly, let \(T_{m_1},\dots ,T_{m_s}\) be such that \(T_{x_j}T_{m_1}\dots T_{m_s}T_{x_k} \in S\), representing the tasks between the starting and ending tasks of the reuse in S. The following memory access sequence is observed when \(S'\) is executed:
Therefore, since the access distance is linear, we can see that
On the other hand, we can also see the following sequence when S is executed:
Analogously, \( \delta _{\mathcal {R}^{E_{S}}}(x_j, x_k) = \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) + \nu _{S}(T_{x_j},T_{x_k}) + \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_1^{T_{x_k}}, x_k)\), and therefore \(\delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) + \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_1^{T_{x_k}}, x_k) = \delta _{\mathcal {R}^{E_{S}}}(x_j, x_k) - \nu _{S}(T_{x_j},T_{x_k})\). Since the sequence \(x_j \dots x_l^{T_{x_j}}\) is identical in both \(E_S\) and \(E_{S'}\), then
\(\delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) = \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_j, x_l^{T_{x_j}})\). The same holds for \(x_1^{T_{x_k}} \dots x_k\). Then,
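The decomposition used in this proof can be checked numerically with a small sketch (task contents and names are illustrative): the reuse distance \(\delta\) between two accesses splits into the accesses after the first access inside its own task, the total accesses of the intervening tasks (the role of \(\nu\)), and the accesses before the second access inside its task.

```python
def reuse_distance(trace, i, j):
    """delta: number of accesses strictly between positions i and j."""
    return j - i - 1

# A schedule of three tasks, flattened into one access trace.
T0 = ['a', 'b', 'c']    # reuse of 'a' starts at the first access of T0
T1 = ['d', 'e']         # intervening task: contributes nu accesses
T2 = ['f', 'a']         # reuse of 'a' ends at the last access of T2
E = T0 + T1 + T2

i, j = 0, len(E) - 1    # positions of the two accesses to 'a'

# Decomposition mirroring the proof: tail of the starting task,
# plus all accesses of intervening tasks (nu), plus head of the
# ending task.
tail, nu, head = len(T0) - 1, len(T1), len(T2) - 1
assert reuse_distance(E, i, j) == tail + nu + head
```

Changing only the intervening tasks changes only the \(\nu\) term, which is the linearity the proof relies on.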
© 2016 Springer International Publishing AG
Ceballos, G., Hagersten, E., Black-Schaffer, D. (2016). Formalizing Data Locality in Task Parallel Applications. In: Carretero, J., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science(), vol 10049. Springer, Cham. https://doi.org/10.1007/978-3-319-49956-7_4
Print ISBN: 978-3-319-49955-0
Online ISBN: 978-3-319-49956-7