A Case Study of Implementing Supernode Transformations

International Journal of Parallel Programming

Abstract

Supernode transformation is a technique that reduces communication overhead by partitioning and scheduling a loop nest on a multi-processor system. It groups a number of iterations of a perfectly nested loop with regular dependences into a supernode. Previous work has focused on finding the optimal supernode size and shape, as well as an optimal execution schedule, for multi-processor systems with unbounded resources. This paper focuses on practical implementation strategies for supernode transformations on multi-core systems with limited resources. Using the longest common subsequence (LCS) problem as an example, we present and compare three different multithreading implementations and give a formula for the total execution time of each. The techniques are benchmarked on a 12-core and a 4-core machine. On the 12-core machine, our first technique, which increases data locality, speeds up the unaltered sequential loop nest by a factor of 16.7. Combining this technique with skewing the loop by changing the linear schedule achieves a 42.6-fold speedup. A more sophisticated method that executes entire rows of the loop nest in one thread achieves a 59.5-fold speedup. The concepts presented and discussed here for the LCS problem serve as a foundation for implementing other algorithms with regular dependences.
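The tiling and skewing ideas summarized in the abstract can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the paper's implementations are multithreaded loop nests, whereas here the tiles on each anti-diagonal are computed sequentially; the tile size and helper names are our own choices. The point is that each tile (supernode) depends only on its left and upper neighbor tiles, so all tiles on one anti-diagonal of the skewed (wavefront) schedule are independent and could be assigned to separate threads.

```python
# Sketch: tiled ("supernode") LCS dynamic program with a wavefront
# schedule over tiles. Tiles on the same anti-diagonal have no
# dependences between them, so a multithreaded version could run them
# in parallel; here they run sequentially for clarity.

def lcs_length_tiled(a, b, tile=4):
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    def compute_tile(ti, tj):
        # Compute one supernode: a tile x tile block of DP cells.
        for i in range(ti * tile + 1, min((ti + 1) * tile, n) + 1):
            for j in range(tj * tile + 1, min((tj + 1) * tile, m) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    tn = (n + tile - 1) // tile  # number of tile rows
    tm = (m + tile - 1) // tile  # number of tile columns
    # Wavefront (skewed) schedule: anti-diagonal d holds the tiles
    # (ti, tj) with ti + tj == d. A tile's dependences lie on earlier
    # anti-diagonals, so tiles within one d are mutually independent.
    for d in range(tn + tm - 1):
        for ti in range(max(0, d - tm + 1), min(d, tn - 1) + 1):
            compute_tile(ti, d - ti)
    return dp[n][m]
```

Grouping cells into tiles gives the data-locality benefit mentioned in the abstract; replacing the inner `for ti` loop with a thread pool that executes one tile per task would give the skewed parallel schedule.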



Author information

Correspondence to Johann Steinbrecher.


Cite this article

Steinbrecher, J., Philippidis, C.J. & Shang, W. A Case Study of Implementing Supernode Transformations. Int J Parallel Prog 42, 320–342 (2014). https://doi.org/10.1007/s10766-013-0248-7
