DOI: 10.1145/2818950.2818980
research-article

Inefficiencies in the Cache Hierarchy: A Sensitivity Study of Cacheline Size with Mobile Workloads

Published: 05 October 2015

ABSTRACT

With the rising number of cores in mobile devices, the cache hierarchy in mobile application processors is getting deeper and the caches larger. However, the cacheline size in these processors has remained relatively constant over the last decade. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh by looking at two inefficiencies in the cache hierarchy that tend to be exacerbated when the cacheline size increases: false sharing and cacheline utilization.

First, we look at false sharing, which is more likely to arise at larger cacheline sizes and can severely impact performance. False sharing occurs when non-shared data structures that are mapped onto the same cacheline are accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. False sharing has been found in various places, including scientific workloads and real applications. We find that although increasing the cacheline size does increase false sharing, it remains negligible compared to known cases of false sharing in scientific workloads, because of the limited thread-level parallelism in mobile workloads.
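As a minimal sketch of why a larger cacheline makes false sharing more likely, the snippet below checks whether two non-shared variables fall on the same cacheline. The addresses and function names are hypothetical and chosen only for illustration; they do not come from the study.

```python
def cacheline_index(address: int, line_size: int) -> int:
    """Return the index of the cacheline that holds the given byte address."""
    return address // line_size


def falsely_shared(addr_a: int, addr_b: int, line_size: int) -> bool:
    """True when two distinct, non-shared variables map onto the same cacheline."""
    return (addr_a != addr_b
            and cacheline_index(addr_a, line_size) == cacheline_index(addr_b, line_size))


# Hypothetical layout: two per-thread counters placed 64 bytes apart,
# each updated by a thread running on a different core.
counter_core0 = 0x1000
counter_core1 = 0x1040

for line_size in (64, 128):
    print(line_size, falsely_shared(counter_core0, counter_core1, line_size))
```

With this assumed layout the counters sit on separate lines at 64 bytes but share one line at 128 bytes, so writes from one core would invalidate the other core's copy even though no data is actually shared.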

Second, we look at cacheline utilization, which measures the number of bytes in a cacheline that are actually used by the processor. This effect has been investigated under various names for a multitude of server and desktop applications. Low cacheline utilization implies that very little of each fetched cacheline is used by the processor, which wastes bandwidth and energy in moving data across the memory hierarchy. The energy cost of data movement is much higher than that of logic operations, which increases the need for cache efficiency, especially on an energy-constrained platform like a mobile device. We find that the cacheline utilization of mobile workloads is low in general and decreases further as the cacheline size increases. When the cacheline size is increased from 64 bytes to 128 bytes, the number of misses is reduced by 10% to 30%, depending on the workload. However, because of the low cacheline utilization, this more than doubles the amount of unused traffic to the L1 caches.
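One way to make the utilization metric concrete is to compute it from an access trace. The sketch below is an illustration under simplifying assumptions, not the methodology of the study: every distinct cacheline is assumed to be fetched exactly once (no evictions or refetches), and the trace itself is hypothetical.

```python
from collections import defaultdict


def cacheline_utilization(accesses, line_size):
    """Fraction of fetched bytes that the processor touched at least once.

    accesses: iterable of (address, size_in_bytes) pairs.
    Simplification: each distinct line is fetched exactly once.
    """
    touched = defaultdict(set)  # line index -> set of byte offsets used
    for address, size in accesses:
        for byte in range(address, address + size):
            touched[byte // line_size].add(byte % line_size)
    if not touched:
        return 0.0
    fetched_bytes = len(touched) * line_size
    used_bytes = sum(len(offsets) for offsets in touched.values())
    return used_bytes / fetched_bytes


# Hypothetical trace: one 4-byte field read from each of four objects
# that are spaced 128 bytes apart in memory.
trace = [(0x2000 + i * 128, 4) for i in range(4)]

for line_size in (64, 128):
    print(line_size, cacheline_utilization(trace, line_size))
```

For this sparse trace, doubling the line size fetches twice as many bytes while the number of bytes used stays the same, so utilization halves.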

Using cacheline utilization as a metric in this way illustrates an important point. If a change in cacheline size is assessed only on its local effects, it appears to have only advantages, because the miss rate decreases. At system level, however, the same change increases the stress on the bus and the amount of energy wasted on unused traffic. Cacheline utilization as a metric therefore underscores the need for system-level research when changing characteristics of the cache hierarchy.
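The gap between the local and the system-level view can be illustrated with a short calculation. The sketch below assumes that each miss moves exactly one full cacheline and uses made-up miss counts and utilization values; only the 10% to 30% miss reduction range comes from the abstract.

```python
def unused_traffic(misses: float, line_size: int, utilization: float) -> float:
    """Bytes moved into the L1 cache that the processor never touches."""
    return misses * line_size * (1.0 - utilization)


# Assumed values for illustration only; they are not measurements from the study.
misses_64 = 1_000_000
misses_128 = misses_64 * 0.8      # 20% fewer misses, within the reported 10% to 30% range
util_64, util_128 = 0.50, 0.30    # assumed utilization, lower at the larger line size

waste_64 = unused_traffic(misses_64, 64, util_64)
waste_128 = unused_traffic(misses_128, 128, util_128)

print(f"miss ratio (local view): {misses_128 / misses_64:.2f}")
print(f"unused-traffic ratio (system view): {waste_128 / waste_64:.2f}")
```

Under these assumptions the miss count improves while the unused traffic more than doubles, which is the kind of trade-off a purely local metric would hide.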

              Published in

              MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems
              October 2015
              278 pages
              ISBN: 9781450336048
              DOI: 10.1145/2818950

              Copyright © 2015 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Qualifiers

              • research-article
              • Research
              • Refereed limited