Abstract
Task-based programming provides programmers with an intuitive abstraction to express parallelism, and runtimes with the flexibility to adapt the schedule and load-balancing to the hardware. Although many profiling tools have been developed to understand these characteristics, the interplay between task scheduling and data reuse in the cache hierarchy has not been explored. These interactions are particularly intriguing due to the flexibility task-based runtimes have in scheduling tasks, which may allow them to improve cache behavior.
This work presents StatTask, a novel statistical cache model that can predict cache behavior for arbitrary task schedules and cache sizes from a single execution, without programmer annotations. StatTask enables fast and accurate modeling of data locality in task-based applications for the first time. We demonstrate the potential of this new analysis for scheduling by examining applications from the BOTS benchmark suite and identifying several important opportunities for reuse-aware scheduling.
This work was supported by the Swedish Foundation for Strategic Research project FFL12-0051 and the Swedish Research Council's Linnaeus centre of excellence UPMARC.
Notes
1. Note that these properties have never been formally described.
2. For a particular input data set.
3. The sizes of the inputs were all within 5%.
A Appendix: Proofs
Lemma
Let E and \(E'\) be execution traces such that they share the exact same set of accesses, but in different order. Then \(\tilde{\mathcal {R}}^{E} = \tilde{\mathcal {R}}^{E'}\).
Proof
Let \((x_j, x_k)\) be an element of \(\tilde{\mathcal {R}}^{E}\). By definition, \(a^j = a^k\). Since E and \(E'\) share the same set of accesses, there exist \(x_0\) and \(x_1\) in \(E'\) such that \(x_j = x_0\) and \(x_k = x_1\). If \(x_0 < x_1\), then since \(a^j = a^k\) we have \((x_0, x_1) \in \tilde{\mathcal {R}}^{E'}\); if \(x_1 < x_0\), then \((x_1, x_0) \in \tilde{\mathcal {R}}^{E'}\). The symmetric argument gives the reverse inclusion.
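The lemma can be illustrated with a small executable sketch (the function name and trace encoding are illustrative, not from the paper): if \(\tilde{\mathcal {R}}^{E}\) is modeled as the set of unordered pairs of accesses to the same address, with accesses identified by identity rather than position, then the set is invariant under any reordering of the trace.

```python
from itertools import combinations, permutations

def all_reuse_pairs(trace):
    """Model of R-tilde^E: all unordered pairs of accesses to the
    same address. Accesses are (id, address) tuples, so membership
    does not depend on their position in the trace."""
    return {frozenset((x, y)) for x, y in combinations(trace, 2)
            if x[1] == y[1]}

# A tiny trace of (access id, address) pairs.
E = [(0, 'a'), (1, 'b'), (2, 'a'), (3, 'b'), (4, 'a')]
base = all_reuse_pairs(E)

# Every reordering E' of the same accesses yields the same pair set.
assert all(all_reuse_pairs(list(p)) == base for p in permutations(E))
```

Note that this only models the order-independent pair set of the lemma; consecutive reuses (\(\mathcal {R}^{E}\)) do depend on the order.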
Lemma
\(\mathcal {M}^{-1}\) and \(\mathcal {M}\) are inverses.
Proof
Let T be a task such that \(E_T = x_1\dots x_r\). We can see that
Conversely,
Theorem
\(Q = \mathcal {R}^{E_{S'}}\)
Proof
It is straightforward to prove that \(Q\subseteq \mathcal {R}^{E_{S'}}\). It is enough to observe that \(\mathcal {C}_{S}(T, T')\) gives all the pairs in \(\tilde{\mathcal {R}}^{E_S}\) that start in T and end in \(T'\). Since this set is execution independent, those pairs are also in \(\tilde{\mathcal {R}}^{E_{S'}}\). The condition \(\forall T_q \in S'\) such that \( T(x_k)< T_q < T(x_j) \Rightarrow a^j \notin a(T_q) \) filters out the non-consecutive reuses. Therefore, all the elements of Q are consecutive reuses.
We will now show an outline for the proof that \(\mathcal {R}^{E_{S'}} \subseteq Q\). If \((x_j, x_{j+d}) \in (T_0 \rightarrow T_0)_{S'}\), then \(T(x_j) = T(x_{j+d}) = T_0\). As S and \(S'\) use the same task universe, \(\exists T_r \in S\) such that \(T_0^S = T_r^{S'}\). By Lemma 1, \(E_{T_0} = E_{T_r}\). Therefore, \(\exists x_p \in E_{T_r}\) such that \(x_p = x_j\) and \(x_{p+d} = x_{j+d}\), as private reuses are relatively offset. Then \((x_j, x_{j+d}) \in (T_r \rightarrow T_r)_S \subseteq Q_{T_r, T_r}\subseteq Q\).
Let's now consider the case \((x_j, x_{j+d}) \in (T_n \rightarrow T_m)_{S'}\). The tasks \(T_n\) and \(T_m\) also occur in S. We will assume that \(T_n < T_m\). Let \(T_1,\dots ,T_k\) be such that \(T_n< T_1< \cdots< T_k < T_m\). The proof is by induction on k.
When \(k=0\), the sequence \(T_nT_m\) occurs in S. Therefore, there exists \(x_r\) in \(E_S\) such that \([x_j, x_{j+d}] = [x_r, x_{r+d}]\). Since \(x_r = x_j\) and \((x_j, x_{j+d}) \in \mathcal {R}^{E_{S'}}\), we know that \(\forall x_s\) with \(x_r< x_s < x_{r+d}\), \(a^r \ne a^s\). Therefore \((x_j, x_{j+d}) = (x_r, x_{r+d}) \in (T_n \rightarrow T_m)^0 \subseteq Q_{T_n,T_m} \subseteq Q\).
When \(k=1\), then \(T_nT_{1}T_m \in S\). Let's first assume that \(a^j \notin a(T_{1})\); thus, \(\forall x_s\) such that \(x_j< x_s < x_{j+d} \Rightarrow a^s \ne a^j\). Therefore \((x_j, x_{j+d}) \in (T_n \rightarrow T_m) \subseteq Q_{T_n,T_m}\).
Otherwise, there are unique \(x_1, \dots , x_q\) such that \(x_j< x_1< \dots< x_q < x_{j+d}\) and \(a^j = a^1 = \cdots = a^q = a^{j+d}\), with all intervening accesses having different addresses. This means that the pairs \((x_j, x_1), (x_1, x_2), \dots , (x_q, x_{j+d})\) are elements of \(\mathcal {R}^{E_S}\). Therefore, by the definition of \(\mathcal {C}\), \((x_j, x_{j+d}) \in \mathcal {C}(T_n, T_m) \subseteq Q_{T_n,T_m} \subseteq Q\).
For the inductive step \(k \Rightarrow k+1\), let \(T_nT_1\dots T_{k+1}T_m \in S\). Two cases must be considered. In the first, the sets of addresses of tasks \(T_1,\dots , T_k\) are disjoint from those of \(T_n\) and \(T_m\); then the only thing left to check is the set of addresses of \(T_{k+1}\), which is analogous to the case \(k=1\). Otherwise, a unique sequence of accesses with the same address appears in \(T_1, \dots , T_k\), which by the inductive hypothesis can be used transitively to obtain a pair in Q.
The proof for when \(T_m < T_n\) is analogous.
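The chaining step in the proof above can be made concrete with a short sketch (function name and trace encoding are illustrative): \(\mathcal {R}^{E}\) is modeled as the set of consecutive reuses of a trace, pairing each access with the previous access to the same address, and a non-consecutive reuse is recovered by chaining consecutive pairs transitively, mirroring the induction.

```python
def consecutive_reuses(trace):
    """Model of R^E: pair each access with the most recent earlier
    access to the same address, i.e. reuses with no intervening
    access to that address."""
    last = {}       # address -> most recent access seen
    pairs = set()
    for x in trace:
        addr = x[1]
        if addr in last:
            pairs.add((last[addr], x))
        last[addr] = x
    return pairs

# Trace of (access id, address) pairs.
E = [(0, 'a'), (1, 'b'), (2, 'a'), (3, 'b'), (4, 'a')]
R = consecutive_reuses(E)

# The non-consecutive reuse (0,'a') -> (4,'a') is obtained by chaining
# the consecutive pairs (0,'a')->(2,'a') and (2,'a')->(4,'a'),
# as in the transitive step of the induction.
assert ((0, 'a'), (2, 'a')) in R and ((2, 'a'), (4, 'a')) in R
```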
Theorem
\(\gamma = \delta _{\mathcal {R}^{E_{S'}}}\)
Proof
Let \((x_j, x_k) \in \mathcal {R}^{E_{S'}}\), and let \(T_{x_j} = T(x_j)\) and \(T_{x_k} = T(x_k)\). Let \(T_{n_1},\dots ,T_{n_r}\) be such that \(T_{x_j}T_{n_1}\dots T_{n_r}T_{x_k} \in S'\); these are the tasks scheduled between the starting and ending tasks of the reuse in \(S'\). Similarly, let \(T_{m_1},\dots ,T_{m_s}\) be such that \(T_{x_j}T_{m_1}\dots T_{m_s}T_{x_k} \in S\), representing the tasks between the starting and ending tasks of the reuse in S. The following memory access sequence is observed when \(S'\) is executed:
Therefore, since the access distance is linear, we can see that
On the other hand, we can also see the following sequence when S is executed:
Analogously, \( \delta _{\mathcal {R}^{E_{S}}}(x_j, x_k) = \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) + \nu _{S}(T_{x_j},T_{x_k}) + \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_1^{T_{x_k}}, x_k)\), and therefore \(\delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) + \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_1^{T_{x_k}}, x_k) = \delta _{\mathcal {R}^{E_{S}}}(x_j, x_k) - \nu _{S}(T_{x_j},T_{x_k})\). Since the sequence \(x_j \dots x_l^{T_{x_j}}\) is identical in both \(E_S\) and \(E_{S'}\), then
\(\delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) = \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_j, x_l^{T_{x_j}})\). The same holds for \(x_1^{T_{x_k}} \dots x_k\). Then,
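The decomposition used in this proof can be checked numerically with a small sketch (task contents and names are illustrative): the reuse distance \(\delta\) between two accesses splits into the accesses after the first access inside its own task, the total accesses of the intervening tasks (the role of \(\nu\)), and the accesses before the second access inside its task.

```python
def reuse_distance(trace, i, j):
    """delta: number of accesses strictly between positions i and j."""
    return j - i - 1

# A schedule of three tasks, flattened into one access trace.
T0 = ['a', 'b', 'c']    # reuse of 'a' starts at the first access of T0
T1 = ['d', 'e']         # intervening task: contributes nu accesses
T2 = ['f', 'a']         # reuse of 'a' ends at the last access of T2
E = T0 + T1 + T2

i, j = 0, len(E) - 1    # positions of the two accesses to 'a'

# Decomposition mirroring the proof: tail of the starting task,
# plus all accesses of intervening tasks (nu), plus head of the
# ending task.
tail, nu, head = len(T0) - 1, len(T1), len(T2) - 1
assert reuse_distance(E, i, j) == tail + nu + head
```

Changing only the intervening tasks changes only the \(\nu\) term, which is the linearity the proof relies on.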
© 2016 Springer International Publishing AG
Ceballos, G., Hagersten, E., Black-Schaffer, D. (2016). Formalizing Data Locality in Task Parallel Applications. In: Carretero, J., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science(), vol 10049. Springer, Cham. https://doi.org/10.1007/978-3-319-49956-7_4
Print ISBN: 978-3-319-49955-0
Online ISBN: 978-3-319-49956-7