ABSTRACT
With the rising number of cores in mobile devices, the cache hierarchies of mobile application processors are getting deeper and their caches larger. The cacheline size, however, has remained essentially constant over the last decade. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh, by examining two inefficiencies in the cache hierarchy that tend to be exacerbated at larger cacheline sizes: false sharing and low cacheline utilization.
Firstly, we look at false sharing, which is more likely to arise at larger cacheline sizes and can severely impact performance. False sharing occurs when non-shared data structures that map onto the same cacheline are accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. False sharing has been observed in settings ranging from scientific workloads to production applications. We find that whilst increasing the cacheline size does increase false sharing, the effect remains negligible compared to known cases of false sharing in scientific workloads, owing to the limited thread-level parallelism of mobile workloads.
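To make the mechanism concrete, the following C sketch (our own illustration, not code from the paper) has two threads update logically independent counters that happen to share a cacheline; uncommenting the padding gives each counter its own line and removes the invalidation ping-pong. A larger cacheline makes this situation more likely, since more unrelated data maps onto the same line.

```c
#include <pthread.h>
#include <stdio.h>

#define CACHELINE 64          /* assumed cacheline size in bytes */
#define ITERS 100000000L

/* Two logically independent counters. Without padding they land in
 * the same cacheline, so writes from different cores repeatedly
 * invalidate each other's copy (false sharing). */
struct counters {
    volatile long a;
    /* char pad[CACHELINE - sizeof(long)];  <- uncomment to give each
     *                                         counter its own line */
    volatile long b;
} shared;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        shared.a++;               /* only this thread touches a */
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        shared.b++;               /* only this thread touches b */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared.a, shared.b);
    return 0;
}
```

Timing the run with and without the padding makes the cost of the avoidable invalidations directly visible.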
Secondly, we look at cacheline utilization, which measures the number of bytes in a cacheline that the processor actually uses. This effect has been investigated under various names for a multitude of server and desktop applications. Low cacheline utilization means that little of each fetched cacheline is used by the processor, wasting bandwidth and energy in moving data across the memory hierarchy. Since the energy cost of data movement is much higher than that of logic operations, cache efficiency matters, especially on an energy-constrained platform such as a mobile device. We find that the cacheline utilization of mobile workloads is generally low and decreases further as the cacheline size grows. Increasing the cacheline size from 64 bytes to 128 bytes reduces the number of misses by 10%--30%, depending on the workload; because of the low cacheline utilization, however, it more than doubles the amount of unused traffic to the L1 caches.
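To illustrate the metric itself, the sketch below (a simplified model under our own assumptions, not the paper's simulation infrastructure) tracks, for a single resident cacheline, which bytes were touched between fetch and eviction, and reports the fraction actually used:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 64  /* assumed cacheline size in bytes */

/* Per-line bookkeeping: which bytes were touched while resident. */
struct line_stats {
    uint8_t touched[LINE_SIZE]; /* 1 if the byte was used */
};

/* Mark the bytes of one access as used. */
static void record_access(struct line_stats *line, uint64_t addr, size_t size) {
    uint64_t off = addr % LINE_SIZE;
    for (size_t i = 0; i < size && off + i < LINE_SIZE; i++)
        line->touched[off + i] = 1;
}

/* On eviction, report the fraction of the line actually used. */
static double utilization(const struct line_stats *line) {
    int used = 0;
    for (int i = 0; i < LINE_SIZE; i++)
        used += line->touched[i];
    return (double)used / LINE_SIZE;
}

int main(void) {
    struct line_stats line;
    memset(&line, 0, sizeof line);

    /* Hypothetical trace: three 4-byte loads to the same line. */
    record_access(&line, 0x1000, 4);
    record_access(&line, 0x1004, 4);
    record_access(&line, 0x1010, 4);

    /* 12 of 64 bytes used -> utilization = 0.1875 */
    printf("utilization = %.4f\n", utilization(&line));
    return 0;
}
```

In a full cache model the same bookkeeping would be kept for every cached line, with the utilizations averaged over all evictions.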
Using cacheline utilization as a metric in this way illustrates an important point. If a change in cacheline size were assessed only on its local effects, it would appear purely beneficial, since the miss rate decreases. At the system level, however, the same change increases the stress on the bus and the amount of energy wasted on unused traffic. Cacheline utilization as a metric thus underscores the need for system-level research when changing characteristics of the cache hierarchy.
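A back-of-the-envelope calculation makes this tension concrete. With illustrative numbers (the 20% miss reduction falls within the 10%--30% range reported above; the utilization values are assumed), fewer misses at 128-byte lines can still move more than twice as many unused bytes:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative inputs, not measurements from the paper. */
    double misses_64  = 1e6;    /* misses at 64-byte lines          */
    double misses_128 = 0.8e6;  /* 20% fewer misses at 128B lines   */
    double util_64    = 0.5;    /* fraction of fetched bytes used   */
    double util_128   = 0.3;    /* utilization drops with line size */

    double unused_64  = misses_64  * 64.0  * (1.0 - util_64);
    double unused_128 = misses_128 * 128.0 * (1.0 - util_128);

    printf("unused bytes @64B:  %.0f\n", unused_64);   /* 32.0e6  */
    printf("unused bytes @128B: %.0f\n", unused_128);  /* 71.7e6  */
    printf("ratio: %.2fx\n", unused_128 / unused_64);  /* ~2.24x  */
    return 0;
}
```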