Active Store Window: Enabling Far Store-Load Forwarding with Scalability and Complexity-Efficiency

Zhang, Zhen-Hao; Wang, Xiao-Yin; Tong, Dong; Yi, Jiang-Fang; Lu, Jun-Lin; Wang, Ke-Yi

doi:10.1007/s11390-012-1263-7

Regular Paper
Published: 12 July 2012

Volume 27, pages 769–780, (2012)
Journal of Computer Science and Technology

Abstract

Conventional dynamically scheduled processors often use fully associative structures, called load/store queues (LSQs), to communicate values from older in-flight stores to loads and to detect store-load order violations. However, this in-flight forwarding accounts for only about 15% of all store-load communications, which makes the CAM-based micro-architecture the major bottleneck to scaling store-load communication further. This paper presents a new micro-architecture named ASW (short for active store window). It provides a new structure, the speculative active store window, which implements more aggressive speculative store-load forwarding than a conventional LSQ. This structure can forward the data of committed stores to executing loads without accessing the L1 data cache, which is referred to as far forwarding in this paper. At the back end of the pipeline, ASW uses in-order load re-execution filtered by the tagged SSBF (short for store sequence Bloom filter) to verify the correctness of the store-load forwarding. The speculative active store window and the tagged store sequence Bloom filter are both set-associative structures, which are more efficient and scalable than fully associative structures. Experiments show that this simpler and faster design outperforms a conventional load/store queue based design and the NoSQ design on most benchmarks, by 10.22% and 8.71% respectively.
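
The abstract gives no implementation details, so the following is only a behavioural sketch in Python of the two ideas it names: forwarding from a set-associative window that still holds committed stores (far forwarding), and a commit-time filter that decides whether a load must be re-executed. Every name here (`SetAssocStoreTable`, `execute_load`, `needs_reexecution`, `commit_store`), the table geometry, the oldest-first replacement, and the conservative handling of filter misses are assumptions made for illustration, not the paper's design.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    tag: int    # address bits above the set index
    ssn: int    # store sequence number (position of the store in program order)
    value: int


class SetAssocStoreTable:
    """Set-associative table of recent stores, indexed by address (no CAM search)."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]

    def _split(self, addr):
        return addr % self.num_sets, addr // self.num_sets

    def record_store(self, addr, ssn, value):
        index, tag = self._split(addr)
        # A newer store to the same address replaces the older entry.
        entries = [e for e in self.sets[index] if e.tag != tag]
        if len(entries) == self.ways:
            entries.pop(0)  # set is full: evict the oldest entry
        entries.append(StoreEntry(tag, ssn, value))
        self.sets[index] = entries

    def lookup(self, addr):
        index, tag = self._split(addr)
        for entry in self.sets[index]:
            if entry.tag == tag:
                return entry
        return None


def execute_load(window, cache, addr):
    """Speculative load: forward from the window on a hit, otherwise read the cache."""
    entry = window.lookup(addr)
    if entry is not None:
        return entry.value, entry.ssn
    return cache.get(addr, 0), None  # None: the value did not come from the window


def needs_reexecution(ssbf, addr, forwarded_ssn):
    """Commit-time filter: skip re-execution only when the last committed store
    to addr is provably the store the load forwarded from."""
    entry = ssbf.lookup(addr)
    if entry is None or forwarded_ssn is None:
        return True  # conservative simplification of this sketch
    return entry.ssn != forwarded_ssn


def commit_store(cache, ssbf, addr, ssn, value):
    """In-order commit: write the cache and record the store in the filter."""
    cache[addr] = value
    ssbf.record_store(addr, ssn, value)


cache = {}
window = SetAssocStoreTable()
ssbf = SetAssocStoreTable()

# Store 1 executes and commits; its entry stays in the window after commit.
window.record_store(0x10, ssn=1, value=7)
commit_store(cache, ssbf, 0x10, ssn=1, value=7)

# A later load is served from the window, not the cache: far forwarding.
value_a, ssn_a = execute_load(window, cache, 0x10)
skip_a = not needs_reexecution(ssbf, 0x10, ssn_a)

# A load that is younger than store 2 executes before store 2 does.
value_b, ssn_b = execute_load(window, cache, 0x10)
window.record_store(0x10, ssn=2, value=9)
commit_store(cache, ssbf, 0x10, ssn=2, value=9)
reexec_b = needs_reexecution(ssbf, 0x10, ssn_b)
corrected_b = cache[0x10] if reexec_b else value_b
```

In this toy run the first load forwards from a committed store and passes the filter without re-execution, while the second load forwarded from a stale store, is flagged by the filter, and picks up the correct value when it re-reads the cache at commit. Both tables are indexed by address bits, which is the property the abstract contrasts with a fully associative search.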

Author information

Authors and Affiliations

Microprocessor Research and Development Center, Peking University, Beijing, 100871, China
Zhen-Hao Zhang, Xiao-Yin Wang, Dong Tong (Member, CCF, ACM), Jiang-Fang Yi, Jun-Lin Lu & Ke-Yi Wang
Engineering Research Center of Microprocessor and System, Ministry of Education, Beijing, 100871, China
Zhen-Hao Zhang, Xiao-Yin Wang, Dong Tong (Member, CCF, ACM), Jiang-Fang Yi, Jun-Lin Lu & Ke-Yi Wang
School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Zhen-Hao Zhang, Xiao-Yin Wang, Dong Tong (Member, CCF, ACM), Jiang-Fang Yi, Jun-Lin Lu & Ke-Yi Wang

Corresponding author

Correspondence to Xiao-Yin Wang.

Additional information

This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2009ZX01029-001-002 and the Postdoctoral Science Foundation of China under Grant No. 20110490208.

About this article

Cite this article

Zhang, ZH., Wang, XY., Tong, D. et al. Active Store Window: Enabling Far Store-Load Forwarding with Scalability and Complexity-Efficiency. J. Comput. Sci. Technol. 27, 769–780 (2012). https://doi.org/10.1007/s11390-012-1263-7

Received: 03 June 2011
Accepted: 23 April 2012
Published: 12 July 2012
Issue Date: July 2012
DOI: https://doi.org/10.1007/s11390-012-1263-7
