Analysis of cache behaviour and software optimizations for faster on-chip network simulations

  • Original Article
International Journal of System Assurance Engineering and Management

Abstract

Fast simulations are critical to reducing the time to market of chip multiprocessors and system-on-chips. Several simulators have been used to evaluate the performance and power consumption of network-on-chips (NoCs). To speed up simulation, it is necessary to identify and optimize the hotspots in the simulator source code. Among the available simulators, Booksim2.0 was chosen for experimentation because it is used extensively in the NoC community. In this paper, the cache and memory system behavior of Booksim2.0 is analyzed to accurately monitor input-dependent performance bottlenecks. The measurements show that cache and memory usage patterns vary widely with the input parameters given to Booksim2.0. Based on these measurements, the cache configuration with the fewest misses was identified. To further reduce cache misses, three software optimization techniques were employed: removal of unused functions, loop interchange, and replacement of the post-increment operator with the pre-increment operator for non-primitive data types. These techniques reduced cache misses by 18.52%, 5.34% and 3.91%, respectively. Thread parallelization and vectorization were employed to improve the overall performance of Booksim2.0, using the OpenMP programming model and SIMD to parallelize and vectorize its more time-consuming portions. Speedups of 2.93× and 3.97× were observed for the Mesh topology with a \(30\times 30\) network size using thread parallelization and vectorization, respectively.
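The source-level techniques named in the abstract can be illustrated with a minimal C++ sketch. It is not taken from the Booksim2.0 source: the `Grid` type and all function names are hypothetical, chosen only to show loop interchange, pre-increment on a class-type iterator, and an OpenMP loop that combines thread parallelism with SIMD.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical row-major grid (an assumption for illustration, not a
// Booksim2.0 data structure). Element (r, c) lives at cells[r * cols + c].
struct Grid {
    std::size_t rows;
    std::size_t cols;
    std::vector<long> cells;

    Grid(std::size_t r, std::size_t c) : rows(r), cols(c), cells(r * c, 0) {}

    long& at(std::size_t r, std::size_t c) {
        assert(r < rows && c < cols);
        return cells[r * cols + c];
    }
    long at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return cells[r * cols + c];
    }
};

// Before loop interchange: the inner loop walks down a column, so
// consecutive accesses are `cols` elements apart in memory.
long sum_column_first(const Grid& g) {
    long total = 0;
    for (std::size_t c = 0; c < g.cols; ++c)
        for (std::size_t r = 0; r < g.rows; ++r)
            total += g.at(r, c);
    return total;
}

// After loop interchange: the inner loop walks along a row, so
// consecutive accesses are adjacent in memory (stride 1).
long sum_row_first(const Grid& g) {
    long total = 0;
    for (std::size_t r = 0; r < g.rows; ++r)
        for (std::size_t c = 0; c < g.cols; ++c)
            total += g.at(r, c);
    return total;
}

// Pre-increment on a non-primitive type: `++it` advances the iterator in
// place, whereas `it++` must also produce a copy of the old iterator value.
long sum_list(const std::list<long>& values) {
    long total = 0;
    for (std::list<long>::const_iterator it = values.begin();
         it != values.end(); ++it)
        total += *it;
    return total;
}

// Thread parallelization plus vectorization of a reduction loop. When the
// compiler is not run with OpenMP enabled, the pragma is ignored and the
// loop executes serially with the same result.
long sum_parallel_simd(const std::vector<long>& values) {
    long total = 0;
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for simd reduction(+ : total)
    for (long i = 0; i < n; ++i)
        total += values[static_cast<std::size_t>(i)];
    return total;
}
```

Both summation orders return the same value; only the memory access pattern differs, which is what a cache profiler would register. An optimizing compiler may already remove the cost of post-increment for simple iterators, so any gain from that change should be measured rather than assumed.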

Abbreviations

  • CMPs: Chip level multiprocessors
  • MPSoCs: Multi-processor system-on-chips
  • SoCs: System-on-chips
  • NoCs: Network-on-chips
  • L1: Level 1 cache
  • LL: Last level cache
  • I1: First level instruction cache
  • D1: First level data cache
  • MPKI: Misses per kilo instructions
  • CPI: Cycles per instruction
  • SIMD: Single instruction, multiple data
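For reference, the two derived metrics in the list above are conventionally computed as shown here. The symbols \(N_{\text{miss}}\), \(N_{\text{instr}}\) and \(N_{\text{cycles}}\) are notation assumed for this note, not taken from the article.

\[
\mathrm{MPKI} = \frac{N_{\text{miss}}}{N_{\text{instr}}} \times 1000,
\qquad
\mathrm{CPI} = \frac{N_{\text{cycles}}}{N_{\text{instr}}}
\]

As a purely hypothetical example, 2,000 cache misses over 1,000,000 executed instructions would correspond to an MPKI of 2.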

Acknowledgements

This work was supported by the Department of Science and Technology, Government of India, under Grant DST-SERB YSS/2015/000196.

Author information

Corresponding author

Correspondence to Khyamling Parane.

About this article

Cite this article

Prasad, B.M.P., Parane, K. & Talawar, B. Analysis of cache behaviour and software optimizations for faster on-chip network simulations. Int J Syst Assur Eng Manag 10, 696–712 (2019). https://doi.org/10.1007/s13198-019-00799-5

  • DOI: https://doi.org/10.1007/s13198-019-00799-5
