
Formalizing Data Locality in Task Parallel Applications

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10049)

Abstract

Task-based programming provides programmers with an intuitive abstraction to express parallelism, and runtimes with the flexibility to adapt the schedule and load-balancing to the hardware. Although many profiling tools have been developed to understand these characteristics, the interplay between task scheduling and data reuse in the cache hierarchy has not been explored. These interactions are particularly intriguing due to the flexibility task-based runtimes have in scheduling tasks, which may allow them to improve cache behavior.

This work presents StatTask, a novel statistical cache model that can predict cache behavior for arbitrary task schedules and cache sizes from a single execution, without programmer annotations. StatTask enables fast and accurate modeling of data locality in task-based applications for the first time. We demonstrate the potential of this new analysis for scheduling by examining applications from the BOTS benchmark suite, and identifying several important opportunities for reuse-aware scheduling.

This work was supported by the Swedish Foundation for Strategic Research project FFL12-0051 and the Swedish Research Council Linnaeus UPMARC centre of excellence.


Notes

  1. Note that these properties have never been formally described.

  2. For a particular input data set.

  3. The sizes of the inputs were all within 5%.


Author information


Correspondence to Germán Ceballos, Erik Hagersten, or David Black-Schaffer.


A Appendix: Proofs

Lemma

Let E and \(E'\) be execution traces that share exactly the same set of accesses, but in a different order. Then

$$ (x_j, x_k) \in \tilde{\mathcal {R}}^{E} \Rightarrow (x_j, x_k) \in \tilde{\mathcal {R}}^{E'} \vee (x_k, x_j) \in \tilde{\mathcal {R}}^{E'}. $$

Proof

Let \((x_j, x_k)\) be an element of \(\tilde{\mathcal {R}}^{E}\). By definition, \(a^j = a^k\). Since E and \(E'\) share the same set of accesses, there exist \(x_0\) and \(x_1\) in \(E'\) such that \(x_j=x_0\) and \(x_k=x_1\). Assume first that \(x_0 < x_1\); since \(a^j = a^k\), then \((x_0, x_1) \in \tilde{\mathcal {R}}^{E'}\). If \(x_1 < x_0\), then \((x_1, x_0) \in \tilde{\mathcal {R}}^{E'}\).
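The lemma can be exercised with a small, self-contained sketch (the function name and toy traces are ours, not the paper's): the same-address reuse relation over a trace depends only on which accesses share an address, so reordering the trace preserves each pair, possibly with its endpoints swapped.

```python
from itertools import combinations

def reuse_relation(trace):
    """All position pairs (i, j), i < j, whose accesses share an address.

    A stand-in for the relation R~ over a trace; `trace` is a list of
    addresses, and positions play the role of the accesses x_i."""
    return {(i, j) for i, j in combinations(range(len(trace)), 2)
            if trace[i] == trace[j]}

E       = ['a', 'b', 'a', 'c', 'b']
E_prime = ['b', 'a', 'c', 'a', 'b']  # same accesses, different order

# Reordering cannot create or destroy same-address pairs: each pair in
# R~(E) reappears in R~(E'), with its endpoints possibly swapped.
assert len(reuse_relation(E)) == len(reuse_relation(E_prime))
```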

Lemma

\(\mathcal {M}^{-1}\) and \(\mathcal {M}\) are inverses.

Proof

Let T be a task, such that \(E_T = x_1\dots x_r\). We can see that

$$ \mathcal {M}(\mathcal {M}^{-1}(x_1\dots x_r)) = \mathcal {M}(T(x_1)) = \mathcal {M}(T) = E_T = x_1\dots x_r $$

Conversely,

$$ \mathcal {M}^{-1}(\mathcal {M}(T)) = \mathcal {M}^{-1}(E_T) = \mathcal {M}^{-1}(x_1\dots x_r) = T(x_1) = T $$

Theorem

\(Q = \mathcal {R}^{E_{S'}}\)

Proof

It is straightforward to prove that \(Q\subseteq \mathcal {R}^{E_{S'}}\). It is enough to observe that \(\mathcal {C}_{S}(T, T')\) gives all the pairs in \(\tilde{\mathcal {R}}^{E_S}\) that start in T and end in \(T'\). Since this set is execution independent, those pairs are also in \(\tilde{\mathcal {R}}^{E_{S'}}\). The condition \(\forall T_q \in S'\) such that \( T(x_k)< T_q < T(x_j) \Rightarrow a^j \notin a(T_q) \) filters out the non-consecutive reuses. Therefore, all the elements of Q are consecutive reuses.

We will now show an outline of the proof that \(\mathcal {R}^{E_{S'}} \subseteq Q\). If \((x_j, x_{j+d}) \in (T_0 \rightarrow T_0)_{S'}\), then \(T(x_j) = T(x_{j+d}) = T_0\). As S and \(S'\) use the same task universe, \(\exists T_r \in S\) such that \(T_0^S = T_r^{S'}\). By Lemma 1, \(E_{T_0} = E_{T_r}\). Therefore, \(\exists x_p \in E_{T_r}\) such that \(x_p = x_j\) and \(x_{p+d} = x_{j+d}\), as private reuses are relatively offset. Then \((x_j, x_{j+d}) \in (T_r \rightarrow T_r)_S \subseteq Q_{T_r, T_r}\subseteq Q\).

Let’s now consider the case \((x_j, x_{j+d}) \in (T_n \rightarrow T_m)_{S'}\). The tasks \(T_n\) and \(T_m\) also occur in S. We will assume that \(T_n < T_m\). Let \(T_1,\dots ,T_k\) be such that \(T_n< T_1< \cdots< T_k < T_m\). The proof is by induction on k.

When \(k=0\), the sequence \(T_nT_m\) occurs in S. Therefore, there exists \(x_r\) in \(E_S\) such that \([x_j, x_{j+d}] = [x_r, x_{r+d}]\). Since \(x_r = x_j\) and \((x_j, x_{j+d}) \in \mathcal {R}^{E_{S'}}\), we know that \(\forall x_r< x_s < x_{r+d}\), \(a^r \ne a^s\). Therefore \((x_j, x_{j+d}) = (x_r, x_{r+d}) \in (T_n \rightarrow T_m)^0 \subseteq Q_{Tn,Tm} \subseteq Q\).

When \(k=1\), then \(T_nT_{n+1}T_m \in S\). Let’s first assume that \(a(T_{n+1}) \ne a(T_n)\); thus, \(\forall x_s\) such that \(x_j< x_s < x_{j+d} \Rightarrow a^s \ne a^j\). Therefore \((x_j, x_{j+d}) \in (T_n \rightarrow T_m) \subseteq Q_{T_n,T_m} \subseteq Q\).

Otherwise, it is enough to assume that there are unique \(x_1, \dots , x_q\) such that \(x_j< x_1< \dots< x_q < x_{j+d}\) and \(a^j = a^1 = \cdots = a^q = a^{j+d}\), with all the accesses in between having different addresses. This means that the pairs \((x_j, x_1), (x_1, x_2), \dots , (x_q, x_{j+d})\) are elements of \(\mathcal {R}^{E_S}\). Therefore, by the definition of \(\mathcal {C}\), \((x_j, x_{j+d}) \in \mathcal {C}(T_n, T_m) \subseteq Q_{T_n,T_m} \subseteq Q\).

The final case is the inductive step \(k \Rightarrow k+1\). Let \(T_nT_1\dots T_{k+1}T_m \in S\). There are two cases to prove. In the first, the sets of addresses of the tasks \(T_1,\dots , T_k\) are disjoint from those of \(T_n\) and \(T_m\). If that happens, the only thing left to check is the set of addresses of \(T_{k+1}\), which is analogous to the case \(k=1\). Otherwise, a number of accesses with the same address appear in \(T_1, \dots , T_k\), which by the inductive hypothesis can be used transitively to obtain a pair in Q.

The proof for when \(T_m < T_n\) is analogous.
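The theorem's content can be made concrete with a toy sketch (our own encoding, not the paper's implementation): consecutive reuses pair each access with the *next* access to the same address, so private reuses survive any schedule, and a cross-task reuse from \(T_n\) to \(T_m\) survives a reschedule exactly when no task placed between them touches that address.

```python
def trace_of(schedule, task_accesses):
    """Concatenate per-task address sequences in schedule order."""
    return [(task, addr) for task in schedule for addr in task_accesses[task]]

def consecutive_reuses(trace):
    """Position pairs (i, j) where j is the next access to trace[i]'s address."""
    last, pairs = {}, set()
    for j, (_, addr) in enumerate(trace):
        if addr in last:
            pairs.add((last[addr], j))
        last[addr] = j
    return pairs

tasks = {'T0': ['x', 'x'], 'T1': ['y'], 'T2': ['x']}

S       = trace_of(['T0', 'T1', 'T2'], tasks)
S_prime = trace_of(['T0', 'T2', 'T1'], tasks)

# T1 never touches 'x', so both the private T0 reuse and the cross-task
# T0 -> T2 reuse of 'x' remain consecutive under either schedule.
assert len(consecutive_reuses(S)) == len(consecutive_reuses(S_prime)) == 2
```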

Theorem

\(\gamma = \delta _{\mathcal {R}^{E_{S'}}}\)

Proof

Let \((x_j, x_k) \in \mathcal {R}^{E_{S'}}\), and let \(T_{x_j} = T(x_j)\) and \(T_{x_k} = T(x_k)\). Let \(T_{n_1},\dots ,T_{n_r}\) be such that \(T_{x_j}T_{n_1}\dots T_{n_r}T_{x_k} \in S'\). These are the tasks scheduled between the starting and ending tasks causing the reuse in \(S'\). Let’s also take \(T_{m_1},\dots ,T_{m_s}\) such that \(T_{x_j}T_{m_1}\dots T_{m_s}T_{x_k} \in S\), representing the tasks between the starting and ending tasks of the reuse in S. The following memory access sequence is observed when \(S'\) is executed:

$$\begin{aligned} x_j \dots x_l^{T_{x_j}} x_1^{T_{n_1}} \dots x_l^{T_{n_1}} \dots x_1^{T_{n_r}} \dots x_l^{T_{n_r}} x_1^{T_{x_k}} \dots x_k = x_j \dots x_l^{T_{x_j}} E_{T_{n_1}} \dots E_{T_{n_r}} x_1^{T_{x_k}} \dots x_k. \end{aligned}$$

Therefore, since the access distance is linear, we can see that

$$\begin{aligned} \delta _{\mathcal {R}^{E_{S'}}}(x_j, x_k)= & {} \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_j, x_l^{T_{x_j}}) + |E_{T_{n_1}}| + \cdots + |E_{T_{n_r}}| + \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_1^{T_{x_k}}, x_k)\\= & {} \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_j, x_l^{T_{x_j}}) + |\mathcal {M}(T_{n_1})| + \cdots + |\mathcal {M}(T_{n_r})| + \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_1^{T_{x_k}}, x_k)\\= & {} \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_j, x_l^{T_{x_j}}) + \nu _{S'}(T_{x_j}, T_{x_k}) + \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_1^{T_{x_k}}, x_k) \end{aligned}$$

On the other hand, we can also see the following sequence when S is executed:

$$ x_j \dots x_l^{T_{x_j}} x_1^{T_{m_1}} \dots x_l^{T_{m_1}} \dots x_1^{T_{m_s}} \dots x_l^{T_{m_s}} x_1^{T_{x_k}} \dots x_k, $$

analogously, we see that \( \delta _{\mathcal {R}^{E_{S}}}(x_j, x_k) = \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) + \nu _{S}(T_{x_j},T_{x_k}) + \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_1^{T_{x_k}}, x_k)\), and therefore \(\delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) + \delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_1^{T_{x_k}}, x_k) = \delta _{\mathcal {R}^{E_{S}}}(x_j, x_k) - \nu _{S}(T_{x_j},T_{x_k})\). Since the sequence \(x_j \dots x_l^{T_{x_j}}\) is identical in both \(E_S\) and \(E_{S'}\), then

\(\delta _{\tilde{\mathcal {R}}^{E_{S}}}(x_j, x_l^{T_{x_j}}) = \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_j, x_l^{T_{x_j}})\). The same holds for \(x_1^{T_{x_k}} \dots x_k\). Then,

$$\begin{aligned} \delta _{\mathcal {R}^{E_{S'}}}(x_j, x_k)= & {} \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_j, x_l^{T_{x_j}}) + \nu _{S'}(T_{x_j}, T_{x_k}) + \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_1^{T_{x_k}}, x_k) \\= & {} \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_j, x_l^{T_{x_j}}) + \delta _{\tilde{\mathcal {R}}^{E_{S'}}}(x_1^{T_{x_k}}, x_k) + \nu _{S'}(T_{x_j}, T_{x_k})\\= & {} \delta _{\mathcal {R}^{E_{S}}}(x_j, x_k) - \nu _{S}(T_{x_j},T_{x_k}) + \nu _{S'}(T_{x_j}, T_{x_k})\\= & {} \gamma (x_j, x_k) \end{aligned}$$
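The theorem's recipe can be exercised numerically with a minimal sketch (toy code of our own; here \(\nu\) counts the accesses of the tasks scheduled strictly between the two endpoint tasks): the access distance of a cross-task reuse under a new schedule \(S'\) equals the distance measured under S, minus the intervening accesses in S, plus those in \(S'\), so no re-execution is needed.

```python
def nu(schedule, lengths, t_start, t_end):
    """nu_S(T_start, T_end): total accesses of tasks strictly between the endpoints."""
    i, j = schedule.index(t_start), schedule.index(t_end)
    return sum(lengths[t] for t in schedule[i + 1:j])

def gamma(delta_S, sched_S, sched_Sp, lengths, t_start, t_end):
    """Predicted distance under S': delta_S - nu_S + nu_S'."""
    return (delta_S
            - nu(sched_S, lengths, t_start, t_end)
            + nu(sched_Sp, lengths, t_start, t_end))

lengths = {'T0': 2, 'T1': 5, 'T2': 1}
# A reuse measured at distance 7 in S = T0 T1 T2 (5 of those accesses come
# from the intervening task T1) shrinks to 2 in S' = T0 T2 T1, where no
# task is scheduled between T0 and T2.
assert gamma(7, ['T0', 'T1', 'T2'], ['T0', 'T2', 'T1'], lengths, 'T0', 'T2') == 2
```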


Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Ceballos, G., Hagersten, E., Black-Schaffer, D. (2016). Formalizing Data Locality in Task Parallel Applications. In: Carretero, J., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science, vol 10049. Springer, Cham. https://doi.org/10.1007/978-3-319-49956-7_4


  • DOI: https://doi.org/10.1007/978-3-319-49956-7_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49955-0

  • Online ISBN: 978-3-319-49956-7

