Parallelizing Complex Streaming Applications on Distributed Scratchpad Memory Multicore Architecture

International Journal of Parallel Programming

Abstract

Multicore processors can provide sufficient computing power and flexibility for complex streaming applications such as high-definition video processing. To reduce hardware complexity and power consumption, a distributed scratchpad memory architecture is considered instead of a cache-based memory architecture. However, the distributed design poses new programming challenges: the combined complexity of inter-processor communication, synchronization, and workload balancing makes it difficult to exploit all available capabilities and achieve maximal throughput. In this study, we developed an efficient design flow for parallelizing multimedia applications on a distributed scratchpad memory multicore architecture. An application is first partitioned into streaming components, which are then mapped onto the multicore processor. Generating efficient task partitions and allocating resources appropriately involves various hardware-dependent factors and application-specific characteristics. To test and verify the proposed design flow, three popular multimedia applications were implemented: a full-HD motion JPEG decoder, an object detector, and a full-HD H.264/AVC decoder. For demonstration purposes, the SONY PlayStation\(^{\circledR }\)3 (PS3) was selected as the target platform. Simulation results show that, on the PS3, the full-HD motion JPEG decoder built with the proposed design flow can decode about 108.9 frames per second (fps) in the 1080p format. The object detection application can perform real-time object detection at 2.84 fps at \(1280 \times 960\) resolution, 11.75 fps at \(640 \times 480\) resolution, and 62.52 fps at \(320 \times 240\) resolution. The full-HD H.264/AVC decoder achieves nearly 50 fps.
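The partition-then-map idea in the abstract can be illustrated with a minimal sketch. It is not the authors' design flow, which the abstract does not detail. It assumes that each streaming component has a known per-frame cost and that steady-state throughput is limited by the most heavily loaded core, and it uses a generic greedy longest-processing-time heuristic for workload balancing. The function names, the component names, and the costs are all hypothetical.

```python
import heapq


def map_tasks_to_cores(task_costs, num_cores):
    """Greedily assign tasks to cores, heaviest task first.

    task_costs: dict of task name -> per-frame cost (hypothetical units, e.g. ms).
    Returns (assignment, loads): core -> list of tasks, and core -> total cost.
    """
    # Min-heap of (current load, core id): the least-loaded core pops first.
    heap = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    assignment = {core: [] for core in range(num_cores)}

    ordered = sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in ordered:
        load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (load + cost, core))

    loads = {
        core: sum(task_costs[t] for t in tasks)
        for core, tasks in assignment.items()
    }
    return assignment, loads


def pipeline_throughput(loads):
    """Frames per cost unit, assuming the busiest core is the bottleneck."""
    return 1.0 / max(loads.values())


if __name__ == "__main__":
    # Hypothetical per-frame costs (ms) for made-up decoder stages.
    costs = {"parse": 2.0, "huffman": 6.0, "dequant": 1.0, "idct": 4.0, "color": 3.0}
    assignment, loads = map_tasks_to_cores(costs, num_cores=3)
    print(assignment)
    print(loads)
    print(pipeline_throughput(loads) * 1000.0, "frames per second")
```

This sketch ignores inter-processor communication, synchronization, and scratchpad capacity, which the abstract names as the factors that make the real mapping problem hard. A practical flow would have to fold those costs into the per-core load.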

Acknowledgments

This work was supported in part by the National Science Council, Taiwan, under Grant NSC-102-2220-E-009-013- and by the Ministry of Economic Affairs, Taiwan, under Grant MOEA-101-EC-17-A-02-S1-202.

Author information

Corresponding author

Correspondence to Shin-Kai Chen.

About this article

Cite this article

Chen, SK., Hung, CY., Chen, CC. et al. Parallelizing Complex Streaming Applications on Distributed Scratchpad Memory Multicore Architecture. Int J Parallel Prog 42, 875–899 (2014). https://doi.org/10.1007/s10766-013-0256-7

  • DOI: https://doi.org/10.1007/s10766-013-0256-7
