Abstract:
Deep learning recommendation systems, such as Facebook’s DLRM, enhance user experiences by providing personalized recommendations on social platforms. CXL-based memory extension is gaining attention because existing server DRAM capacity cannot satisfy these models’ huge memory requirements. Typically, frequently accessed hot embedding data is stored in local memory, whereas occasionally accessed cold embedding data resides in CXL memory, with the hot/cold distinction based on training results. However, the hotness of embedding vectors can change between training sessions, making it difficult to maintain consistent inference latency as model sizes grow. This study explores techniques for accelerating large-scale DLRM inference through dynamic hot-data rearrangement. The proposed hotness score-based page promotion periodically promotes and demotes pages according to the changing hotness of embedding data. Additionally, prioritizing cache prefetch by hotness improves cache temporal locality, especially in multi-user scenarios. Simulation results demonstrate that the proposed approaches can enhance DLRM inference speed by up to 8.65% compared to existing techniques.
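The abstract's hotness score-based promotion/demotion idea can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class name, scoring, decay factor, and rebalance policy are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of hotness score-based page promotion/demotion.
# All names, thresholds, and the decay policy are illustrative assumptions;
# the paper's actual mechanism is not specified in this abstract.
from collections import defaultdict


class HotnessTracker:
    def __init__(self, local_capacity, decay=0.5):
        self.scores = defaultdict(float)      # embedding page -> hotness score
        self.local_capacity = local_capacity  # number of pages fitting in local DRAM
        self.decay = decay                    # ages out stale hotness between periods
        self.local_pages = set()              # pages currently held in local memory

    def record_access(self, page):
        # Each embedding lookup bumps the accessed page's hotness score.
        self.scores[page] += 1.0

    def rebalance(self):
        # Periodically promote the hottest pages into local memory and
        # demote the rest to CXL memory, then decay all scores so the
        # ranking can follow shifting access patterns over time.
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        hot = set(ranked[:self.local_capacity])
        promoted = hot - self.local_pages
        demoted = self.local_pages - hot
        self.local_pages = hot
        for page in self.scores:
            self.scores[page] *= self.decay
        return promoted, demoted


tracker = HotnessTracker(local_capacity=2)
for page in [0, 0, 0, 1, 1, 2]:
    tracker.record_access(page)
promoted, demoted = tracker.rebalance()
# pages 0 and 1 have the highest scores, so they now occupy local memory
```

In a real system the rebalance period, scoring granularity, and migration cost between DRAM and CXL memory would all be tuned against inference latency; the sketch only shows the score-then-migrate structure the abstract describes.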
Date of Conference: 19-22 May 2024
Date Added to IEEE Xplore: 02 July 2024