DOI: 10.1145/2818950.2818980

Inefficiencies in the Cache Hierarchy: A Sensitivity Study of Cacheline Size with Mobile Workloads

Published: 05 October 2015

Abstract

With the rising number of cores in mobile devices, the cache hierarchy in mobile application processors has grown deeper and its caches larger. The cacheline size, however, has remained relatively constant over the last decade. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh by examining two inefficiencies in the cache hierarchy that tend to be exacerbated at larger cacheline sizes: false sharing and low cacheline utilization.
First, we look at false sharing, which becomes more likely at larger cacheline sizes and can severely impact performance. False sharing occurs when logically unshared data structures that happen to map onto the same cacheline are accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. False sharing has been reported in a variety of settings, from scientific workloads to real applications. We find that while increasing the cacheline size does increase false sharing, it remains negligible compared to known cases in scientific workloads, owing to the limited thread-level parallelism of mobile workloads.
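The mechanism is straightforward to reproduce. The sketch below is our own illustration rather than code from the paper; the struct layout, line size, and iteration count are assumptions. Two threads increment logically independent counters; without the alignas padding the counters typically share one 64-byte cacheline, and the resulting invalidation ping-pong is exactly the false sharing described above.

```cpp
#include <atomic>
#include <thread>

// Two logically independent counters. Without padding they usually land in the
// same 64-byte cacheline, so writes from one thread invalidate the copy cached
// by the other core (false sharing). alignas(64) gives each counter its own
// line; removing it provokes the problem.
struct Counters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

int main() {
    Counters c;
    auto bump = [](std::atomic<long>& x) {
        for (long i = 0; i < 10'000'000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    return 0;
}
```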
Second, we look at cacheline utilization, which measures the number of bytes in a cacheline that the processor actually uses. This effect has been studied under various names for a multitude of server and desktop applications. Low cacheline utilization means that little of each fetched cacheline is used by the processor, wasting bandwidth and energy in moving data across the memory hierarchy. Because the energy cost of data movement is much higher than that of logic operations, cache efficiency matters all the more on an energy-constrained platform such as a mobile device. We find that the cacheline utilization of mobile workloads is generally low and decreases as the cacheline size grows. Increasing the cacheline size from 64 bytes to 128 bytes reduces the number of misses by 10%--30%, depending on the workload; because of the low cacheline utilization, however, it more than doubles the amount of unused traffic to the L1 caches.
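In a simulator, cacheline utilization can be tracked by attaching a per-line byte mask that records which bytes are touched between fill and eviction, then averaging the touched-byte count over all evicted lines. The sketch below is a minimal illustration of that bookkeeping, not the instrumentation used in the paper; the 64-byte line size and the class interface are assumptions.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr std::size_t kLineSize = 64;   // assumed cacheline size in bytes

struct LineStats {
    std::bitset<kLineSize> touched;     // bytes accessed since the line was filled
};

class UtilizationTracker {
public:
    // Record a load/store of `bytes` bytes at `addr` that hits in the cache.
    void OnAccess(std::uint64_t addr, std::size_t bytes) {
        auto& line = lines_[addr / kLineSize];
        std::size_t offset = addr % kLineSize;
        for (std::size_t i = 0; i < bytes && offset + i < kLineSize; ++i)
            line.touched.set(offset + i);
    }

    // On eviction, accumulate how much of the fetched line was actually used.
    void OnEvict(std::uint64_t addr) {
        auto it = lines_.find(addr / kLineSize);
        if (it == lines_.end()) return;
        used_bytes_  += it->second.touched.count();
        total_bytes_ += kLineSize;
        lines_.erase(it);
    }

    // Fraction of fetched bytes that were used before eviction.
    double Utilization() const {
        return total_bytes_ ? static_cast<double>(used_bytes_) / total_bytes_ : 0.0;
    }

private:
    std::unordered_map<std::uint64_t, LineStats> lines_;
    std::uint64_t used_bytes_ = 0;
    std::uint64_t total_bytes_ = 0;
};
```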
Using cacheline utilization as a metric in this way illustrates an important point. If a change in cacheline size were assessed only on its local effects, it would appear purely beneficial, since the miss rate decreases. At the system level, however, the change increases the stress on the bus and the amount of energy wasted on unused traffic. Cacheline utilization therefore underscores the need for system-level research when changing characteristics of the cache hierarchy.
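A back-of-the-envelope calculation makes the trade-off concrete. The numbers below are purely hypothetical (1 million misses at 64-byte lines, 40% utilization, a 20% miss reduction at 128 bytes with utilization dropping to 25%) and are not measurements from the paper; they merely show how total and unused L1 fill traffic can grow even as misses fall.

```cpp
#include <cstdio>

int main() {
    // Hypothetical inputs, chosen only to illustrate the trend described above.
    const double misses64  = 1'000'000;        // misses with 64-byte lines
    const double misses128 = misses64 * 0.8;   // assume a 20% miss reduction
    const double util64    = 0.40;             // assumed utilization at 64 bytes
    const double util128   = 0.25;             // assumed utilization at 128 bytes

    const double traffic64  = misses64  * 64;
    const double traffic128 = misses128 * 128;
    const double unused64   = traffic64  * (1.0 - util64);
    const double unused128  = traffic128 * (1.0 - util128);

    // Misses drop, yet unused fill traffic roughly doubles (38.4 MB -> 76.8 MB).
    std::printf("total:  %.1f MB -> %.1f MB\n", traffic64 / 1e6, traffic128 / 1e6);
    std::printf("unused: %.1f MB -> %.1f MB\n", unused64  / 1e6, unused128 / 1e6);
    return 0;
}
```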


Cited By

  • An analysis of cache configuration's impacts on the miss rate of big data applications using gem5. Serbian Journal of Electrical Engineering, 21(2):217--234, 2024. DOI: 10.2298/SJEE2402217D


Published In

MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems
October 2015
278 pages
ISBN:9781450336048
DOI:10.1145/2818950
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cacheline utilization
  2. False sharing
  3. Mobile devices
  4. Mobile workloads

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MEMSYS '15
MEMSYS '15: International Symposium on Memory Systems
October 5 - 8, 2015
Washington, DC, USA
