Abstract
Fast simulations are critical in reducing time to market for chip multiprocessors and system-on-chips. Several simulators have been used to evaluate the performance and power consumption of network-on-chips (NoCs). To speed up the simulations, it is necessary to investigate and optimize the hotspots in the simulator source code. Among the simulators available, Booksim2.0 has been chosen for the experimentation because it is used extensively in the NoC community. In this paper, the cache and memory system behavior of Booksim2.0 has been analyzed to accurately monitor input-dependent performance bottlenecks. The measurements show that cache and memory usage patterns vary widely with the input parameters given to Booksim2.0. Based on these measurements, the cache configuration with the fewest misses has been identified. To further reduce the cache misses, three software optimization techniques have been employed: removal of unused functions, loop interchange, and replacement of the post-increment operator with the pre-increment operator for non-primitive data types. These techniques reduced the cache misses by 18.52%, 5.34% and 3.91%, respectively. Thread parallelization and vectorization have been employed to improve the overall performance of Booksim2.0. The OpenMP programming model and SIMD are used to parallelize and vectorize the more time-consuming portions of Booksim2.0. Speedups of 2.93× and 3.97× were observed for the Mesh topology with a \(30\times 30\) network size by employing thread parallelization and vectorization, respectively.
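Two of the source-level optimizations named in the abstract, loop interchange and the use of pre-increment on non-primitive types, can be illustrated with a minimal C++ sketch. The code below is not taken from Booksim2.0: the `Grid` alias and the functions `sum_column_first`, `sum_row_first` and `sum_list` are hypothetical names chosen for illustration, and the grid is assumed to be rectangular.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical 2D container used only for this illustration.
using Grid = std::vector<std::vector<int>>;

// Before loop interchange: the inner loop moves from row to row, so
// consecutive accesses touch different row buffers (poor spatial locality).
long sum_column_first(const Grid& g) {
    long total = 0;
    if (g.empty()) return total;
    for (std::size_t c = 0; c < g[0].size(); ++c) {
        for (std::size_t r = 0; r < g.size(); ++r) {
            assert(g[r].size() == g[0].size());  // assumption: rectangular grid
            total += g[r][c];
        }
    }
    return total;
}

// After loop interchange: the inner loop walks the contiguous elements of a
// single row, which tends to reduce data cache misses. The result is the same.
long sum_row_first(const Grid& g) {
    long total = 0;
    for (std::size_t r = 0; r < g.size(); ++r) {
        for (std::size_t c = 0; c < g[r].size(); ++c) {
            total += g[r][c];
        }
    }
    return total;
}

// Pre-increment on a non-primitive type (here a list iterator): `++it`
// advances in place, whereas `it++` must also produce a copy of the old value.
long sum_list(const std::list<int>& xs) {
    long total = 0;
    for (auto it = xs.begin(); it != xs.end(); ++it) {
        total += *it;
    }
    return total;
}
```

Both traversal orders compute the same sum; only the memory access pattern differs. How much either change helps depends on the compiler, the data sizes and the cache configuration, so the sketch shows the transformations rather than the reductions reported in the paper.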
Abbreviations
- CMPs: Chip level multiprocessors
- MPSoCs: Multi-processor system-on-chips
- SoCs: System-on-chips
- NoCs: Network-on-chips
- L1: Level 1 cache
- LL: Last level cache
- I1: First level instruction cache
- D1: First level data cache
- MPKI: Misses per kilo instructions
- CPI: Cycles per instruction
- SIMD: Single instruction, multiple data
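The two derived metrics in this list follow their standard definitions. The symbols \(N_{\text{miss}}\), \(N_{\text{cycles}}\) and \(N_{\text{instr}}\) are introduced here only for illustration and do not appear in the paper:

\[
\mathrm{MPKI} = \frac{N_{\text{miss}}}{N_{\text{instr}}} \times 1000,
\qquad
\mathrm{CPI} = \frac{N_{\text{cycles}}}{N_{\text{instr}}}
\]

As a made-up numerical example, 5,000 cache misses over 2,000,000 executed instructions correspond to an MPKI of 2.5.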
Acknowledgements
This work was supported by the Department of Science and Technology, Government of India, under Grant DST-SERB YSS/2015/000196.
Cite this article
Prasad, B.M.P., Parane, K. & Talawar, B. Analysis of cache behaviour and software optimizations for faster on-chip network simulations. Int J Syst Assur Eng Manag 10, 696–712 (2019). https://doi.org/10.1007/s13198-019-00799-5
DOI: https://doi.org/10.1007/s13198-019-00799-5