A Case Study of Implementing Supernode Transformations

International Journal of Parallel Programming

Abstract

Supernode transformation is a technique that reduces communication overhead by partitioning and scheduling a loop nest on a multi-processor system. It groups a number of iterations of a perfectly nested loop with regular dependences into a supernode. Previous work has focused on finding the optimal supernode size and shape, as well as an optimal execution schedule, for multi-processor systems with unbounded resources. This paper focuses on practical implementation strategies for supernode transformations on multi-core systems with limited resources. Using the longest common subsequence (LCS) problem as an example, we present and compare three different multithreading implementations and give a formula for the total execution time of each. The techniques are benchmarked on a 12-core and a 4-core machine. On the 12-core machine, our first technique, which increases data locality, speeds up the unaltered sequential loop nest by a factor of 16.7. Combining this technique with skewing the loop by changing the linear schedule achieves a 42.6-fold speedup. A more sophisticated method that executes entire rows of the loop nest in one thread achieves a 59.5-fold speedup. The concepts presented and discussed here for the LCS problem serve as a foundation for implementing other algorithms with regular dependences.
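The tiling and skewing ideas summarized in the abstract can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the paper's implementations are multithreaded loop nests, whereas here the tiles on each anti-diagonal are computed sequentially; the tile size and helper names are our own choices. The point is that each tile (supernode) depends only on its left and upper neighbor tiles, so all tiles on one anti-diagonal of the skewed (wavefront) schedule are independent and could be assigned to separate threads.

```python
# Sketch: tiled ("supernode") LCS dynamic program with a wavefront
# schedule over tiles. Tiles on the same anti-diagonal have no
# dependences between them, so a multithreaded version could run them
# in parallel; here they run sequentially for clarity.

def lcs_length_tiled(a, b, tile=4):
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    def compute_tile(ti, tj):
        # Compute one supernode: a tile x tile block of DP cells.
        for i in range(ti * tile + 1, min((ti + 1) * tile, n) + 1):
            for j in range(tj * tile + 1, min((tj + 1) * tile, m) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    tn = (n + tile - 1) // tile  # number of tile rows
    tm = (m + tile - 1) // tile  # number of tile columns
    # Wavefront (skewed) schedule: anti-diagonal d holds the tiles
    # (ti, tj) with ti + tj == d. A tile's dependences lie on earlier
    # anti-diagonals, so tiles within one d are mutually independent.
    for d in range(tn + tm - 1):
        for ti in range(max(0, d - tm + 1), min(d, tn - 1) + 1):
            compute_tile(ti, d - ti)
    return dp[n][m]
```

Grouping cells into tiles gives the data-locality benefit mentioned in the abstract; replacing the inner `for ti` loop with a thread pool that executes one tile per task would give the skewed parallel schedule.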



Author information

Correspondence to Johann Steinbrecher.


Cite this article

Steinbrecher, J., Philippidis, C.J. & Shang, W. A Case Study of Implementing Supernode Transformations. Int J Parallel Prog 42, 320–342 (2014). https://doi.org/10.1007/s10766-013-0248-7
