A study of 3D Network-on-Chip design for data parallel H.264 coding

https://doi.org/10.1016/j.micpro.2011.06.009

Abstract

In this paper, we implement, analyze and compare different Network-on-Chip (NoC) architectures aiming at higher efficiency for MPEG-4/H.264 coding. Two-dimensional (2D) and three-dimensional (3D) NoCs based on Non-Uniform Cache Access (NUCA) are analyzed. We present results obtained with a full system simulator running realistic workloads. Experiments show that the average network latencies of the two 3D NoCs are reduced by 28% and 34% respectively, compared with the 2D design. It is also shown that heat dissipation is a trade-off when improving the performance of 3D chips. Our analysis and experimental results provide a guideline for designing efficient 3D NoCs for data parallel H.264 coding applications.

Highlights

► We propose and evaluate different NoC designs for data parallel H.264 coding. ► We find that shared data accesses are the system performance bottlenecks. ► Two designs, with processors on the top layer and with processors distributed, have been evaluated. ► We find that the performance improvement is negligible with more than four pillars. ► The results of this paper give a guideline for NoC-based data parallel H.264 coding.

Introduction

Portable embedded multimedia devices are very popular in today's consumer electronics market. These devices are capable of processing audio and video data, which is a major application area and trend for the MPEG-4 standard [1]. H.264, also known as Advanced Video Coding (AVC), is the latest international standard for video stream coding [1]. It is defined in MPEG-4 Part 10. Previous standards such as MPEG-2 have been widely used for coding video streams in digital television and video conferencing systems. However, new applications and services such as camera phones and on-line video services require higher coding efficiency. H.264 has been used in a wide range of applications such as Blu-ray Disc, videos from YouTube and the iTunes Store, DVB broadcast, direct-broadcast satellite television services, cable television services, and real-time video conferencing. Not surprisingly, H.264 demands high processing power, which calls for parallelism and multi-core computing. However, current portable devices usually suffer from limited processing ability and low efficiency, so a new chip design is required.

According to Moore’s Law, fast-developing Integrated Circuit (IC) manufacturing technology has provided the industry with billions of transistors on a single chip [2]. At the same time, the number of IP blocks integrated on a chip has been increasing, which leads to an exponential rise in the complexity of their interaction. If this trend holds, traditional digital system design methods, especially System-on-Chip (SoC), will encounter critical challenges and performance bottlenecks. One of the best known and most critical problems is the communication bottleneck. Most SoCs use a bus-based communication architecture, such as simple, hierarchical or crossbar-type buses. In contrast to the increasing chip capacity, bus-based systems do not scale well with system size in terms of bandwidth, clock frequency and power consumption [3].

To address these problems and improve system performance, NoC was proposed as a promising solution in the field of Chip Multiprocessors (CMPs) [3]. It brings network communication methodologies into on-chip communication. The NoC design approach is to create a communication infrastructure beforehand and then map the computational resources onto it via resource-dependent interfaces. Processing Elements (PEs) in a NoC are connected by routers and network links, and data are transferred in the form of network packets. This modular approach also provides more efficient communication by leveraging computer network principles. Therefore, in this paper, NoC is chosen as the platform for the study of H.264 coding.
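As an illustration of how packets travel between routers in such a network, the following minimal sketch models a 2D mesh and routes a packet with dimension-order (XY) routing. The routing policy, the coordinate scheme and the function names `xy_route` and `hop_count` are assumptions made for this example; the text above does not state which routing algorithm the studied NoCs use.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a router in a 2D mesh (assumed layout)


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return the routers visited from src to dst, moving along X first, then Y.

    Dimension-order routing is a common deterministic choice for meshes;
    it is used here only as an illustrative assumption.
    """
    x, y = src
    path = [(x, y)]
    # Travel along the X dimension until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then travel along the Y dimension to the destination row.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src: Coord, dst: Coord) -> int:
    """Number of router-to-router links traversed by the packet."""
    return len(xy_route(src, dst)) - 1


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(hop_count((0, 0), (2, 1)))
```

In this simplified model, the hop count of a packet equals the Manhattan distance between the source and destination routers.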

Data parallel coding, in which video stream data are distributed to PEs, is studied in this paper. Multiple video stream data can be processed simultaneously in data parallel coding. Our experiments show that data dependencies among coding threads are a major bottleneck of H.264 coding. Scalability of data parallel processing in a NoC requires minimized inter-processor communication.
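The idea of distributing video data to PEs can be sketched as follows. This is a hypothetical partitioning scheme, not necessarily the one used in the paper: a frame is split into contiguous bands of macroblock rows, one band per PE. The 16-pixel macroblock height is standard for H.264, while the frame height, the PE count and the function names are illustrative assumptions.

```python
from typing import List

MB_SIZE = 16  # an H.264 macroblock covers 16x16 luma samples


def macroblock_rows(frame_height: int) -> int:
    """Number of macroblock rows needed to cover the frame (rounded up)."""
    return (frame_height + MB_SIZE - 1) // MB_SIZE


def partition_rows(num_mb_rows: int, num_pes: int) -> List[range]:
    """Split macroblock rows into contiguous, nearly equal bands, one per PE."""
    base, extra = divmod(num_mb_rows, num_pes)
    bands = []
    start = 0
    for pe in range(num_pes):
        # The first `extra` PEs take one additional row to absorb the remainder.
        size = base + (1 if pe < extra else 0)
        bands.append(range(start, start + size))
        start += size
    return bands


if __name__ == "__main__":
    rows = macroblock_rows(1080)        # assumed 1080-line frame
    for pe, band in enumerate(partition_rows(rows, 8)):  # assumed 8 PEs
        print(f"PE {pe}: macroblock rows {band.start}..{band.stop - 1}")
```

Neighbouring bands meet at row boundaries, which is where data dependencies between PEs can arise in such a scheme; this is one way to picture why inter-processor communication limits scalability.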

To the best of our knowledge, this is the first paper on NoC designs for data parallel H.264 coding. Most prior research has focused on functional partitioning of H.264 coding. In this paper, we analyze the impact of NoC design on data parallel H.264 coding, including the temperature and performance of 2D and 3D NoCs.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 gives an illustrative example of the H.264 coding standard. Section 4 analyzes the data parallelism methods of H.264 alongside a real-world application. Section 5 presents a detailed 2D/3D NoC design analysis for data parallel H.264 coding. Experimental results are shown in Section 6. Section 7 concludes the paper.

Section snippets

Related works

Different data-parallel splitting approaches for H.264 decoding have been evaluated in [4], including single-row, multi-column, non-blocking slice-parallel, blocking slice-parallel and rotating slice-parallel. It is shown that the execution time of each parallelization approach is related to the size and shape of the frame partition.

The parallel scalability of the H.264 decoding process has been investigated in [5]. The authors pointed out that previous strategies are insufficiently scalable. Therefore,

The H.264 coding standard

Like traditional coding systems such as MPEG-1 and MPEG-2, H.264 is based on motion compensation and inverted coding. However, H.264 introduces several advanced coding technologies, such as multiple motion estimation, inter-frame estimation and multi-frame estimation [1].

A video sequence in H.264 is composed of multiple groups of pictures, each of which includes several frames; each frame includes one or more slices, which are built from multiple macroblocks. A macroblock

The parallelism of H.264 encoding

In this section, we analyze the parallelism of H.264 encoding. As discussed in the previous section, the decoding process of H.264 is a part of the encoding process; hence the analysis of the parallelization of the encoding process applies to the decoding process as well.

NoC design analysis for H.264 coding

In this section, we present and analyze different NoC architectures designed for H.264 coding.
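To give a feel for why a stacked topology can shorten network paths, the sketch below compares the average hop count of a 2D mesh with that of a 3D mesh holding the same number of routers. Everything here is an illustrative assumption: the 64-node size, the 8x8 and 4x4x4 arrangements, uniform traffic between all router pairs, minimal routing, and a vertical link at every router (the designs evaluated in the paper use a limited number of pillars, which this toy model does not capture).

```python
from itertools import product
from typing import Sequence


def average_hops(dims: Sequence[int]) -> float:
    """Average minimal hop count over all ordered pairs of distinct routers.

    `dims` gives the mesh size per dimension, e.g. (8, 8) or (4, 4, 4).
    With minimal routing in a mesh, the hop count between two routers is
    the Manhattan distance between their coordinates.
    """
    nodes = list(product(*(range(d) for d in dims)))
    total = 0
    pairs = 0
    for a in nodes:
        for b in nodes:
            if a != b:
                total += sum(abs(p - q) for p, q in zip(a, b))
                pairs += 1
    return total / pairs


if __name__ == "__main__":
    print(f"2D 8x8 mesh:   {average_hops((8, 8)):.2f} hops on average")
    print(f"3D 4x4x4 mesh: {average_hops((4, 4, 4)):.2f} hops on average")
```

Under these assumptions the 8x8 mesh averages about 5.33 hops and the 4x4x4 mesh about 3.81 hops. These figures come from the toy model only and are not results from the paper.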

Experimental evaluation

In this section, we present the experimental evaluation of system performance based on the simulation of H.264 coding. The x264 benchmark from PARSEC is used to test the encoding efficiency of different architectures, including 2D, 3D-top and 3D-DL with different numbers of pillars.

Conclusion

In this paper, different NoC design alternatives for data parallel H.264 coding have been proposed and evaluated. Our study shows that inter-thread communication and shared data accesses are the system performance bottlenecks. In NoC-based systems, a centralized non-uniform cache architecture is required to minimize the cost of data dependency. We find that 3D NoCs are better than 2D ones in terms of network delivery delay, which is reflected by hop counts. Two design approaches with

Acknowledgments

This work is supported by the Turku Centre for Computer Science (TUCS). The authors would also like to thank the anonymous reviewers for their feedback and suggestions.

References (34)

  • F.C. Pereira et al., The MPEG-4 Book, 2002
  • Intel, Intel core i7 processor extreme edition and intel core i7 processor datasheet, vol. 1, December...
  • W.J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in: Proceedings of the 38th...
  • F.H. Seitner, R.M. Schreier, M. Bleyer, M. Gelautz, Evaluation of data-parallel splitting approaches for h.264...
  • C. Meenderinck, A. Azevedo, M. Alvarez, B. Juurlink, A. Ramirez, Parallel scalability of h.264, in: Proceeding of the...
  • M. Kim, D. Kim, G. Sobelman, Mpeg-4 performance analysis for a cdma network-on-chip, in: Proceedings of the 2005...
  • J. Xu, W. Wolf, J. Henkel, S. Chakradhar, T. Lv, A case study in networks-on-chip design for embedded video, in:...
  • H.G. Lee, U. Ogras, R. Marculescu, N. Chang, Design space exploration and prototyping for on-chip multimedia...
  • F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, M. Kandemir, Design and management of 3d chip...
  • D. Park, S. Eachempati, R. Das, A.K. Mishra, Y. Xie, N. Vijaykrishnan, C.R. Das, Mira: a multi-layered on-chip...
  • T. Kgil, S. D’Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt, K. Flautner, Picoserver: using 3d...
  • M.A. Baker et al., A scalable parallel h.264 decoder on the cell broadband engine architecture
  • T. Tillo et al., Redundant slice optimal allocation for h.264 multiple description coding, IEEE Transactions on Circuits and Systems for Video Technology, 2008
  • C. Bienia, S. Kumar, J.P. Singh, K. Li, The parsec benchmark suite: characterization and architectural implications,...
  • C. Bienia, S. Kumar, K. Li, Parsec vs. splash-2: a quantitative comparison of two multithreaded benchmark suites on...
  • A.W. Yin, Generalization of Slot Table Size for Virtual Circuits on Nostrum Networks on Chip, Royal Institute of...
  • V.F. Pavlidis et al., 3-d topologies for networks-on-chip, IEEE Transactions on Very Large Scale Integration Systems, 2007

    Thomas Canhao Xu received his M.Eng. degree in Software Engineering from Zhejiang University, China in 2007. He taught the National Certification of Information Engineer (NCIE) and Wish certified Network Engineer (WNE) for two and a half years. He has authored four textbooks for WNE education. Since September 2008, he has been working in the Computer Systems laboratory, University of Turku as a researcher. He is also a Ph.D. student in the Turku Centre for Computer Science (TUCS), Turku, Finland. His research interests include software system support for network-on-chip platforms, system level 3D multiprocessor architecture design and software engineering.

    Alexander Wei Yin received his M.Sc. degree in System-on-Chip design from the Royal Institute of Technology (KTH), Stockholm, Sweden in 2008. Since January 2008, he has been working in the Computer Systems laboratory, University of Turku as a researcher. He is also a Ph.D. student in the Turku Centre for Computer Science (TUCS), Turku, Finland. His research interests include low power techniques, fault tolerant designs and 3D integrated circuit architectures on network-on-chip platforms.

    Pasi Liljeberg received his Ph.D. degree from University of Turku, Finland, in 2005. Since January 2010 he has been working in the Computer Systems laboratory as senior lecturer. During the period 2007–2009 he worked as an Academy of Finland postdoctoral researcher. His current research interests include network-on-chip intelligent communication architectures, on-chip fault tolerant design, 3D multiprocessor system architectures, globally-asynchronous locally-synchronous platforms for nanoscale NoC and formal approaches in embedded system development. He has more than 60 international refereed papers. He has established and is leading a research group focusing on fault tolerant self-timed communication platforms for nanoscale systems.

    Hannu Tenhunen received the Diploma from Helsinki University of Technology, Finland, in 1982 and the Ph.D. from Cornell University, NY, in 1986. In 1985, he joined the Signal Processing Laboratory, Tampere University of Technology, Finland, as Associate Professor and later served as professor and department director. Since 1992, he has been a Professor at the Royal Institute of Technology (KTH), Sweden, where he also served as dean. Currently he is director of the Turku Centre for Computer Science, Finland, and is with the University of Turku. His current research interests are VLSI architectures and systems, especially Network-on-Chip systems. He has over 600 reviewed publications and 16 patents internationally.

    This paper is based on an earlier paper submitted to the IEEE 27th Norchip Conference.
