Exploring HW/SW Co-Optimizations for Accelerating Large-scale Texture Identification on Distributed GPUs

Authors: Junsong Wang, Xiaofan Zhang, Yubo Li, Yonghua Lin

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing

Article No.: 28, Pages 1 - 10

https://doi.org/10.1145/3472456.3473520

Published: 05 October 2021

Abstract

Texture identification has been developed recently to support one-to-one verification and one-to-many search, which gives it much broader applicability than texture classification in real-life applications. It has demonstrated great potential to enable product traceability by identifying the unique texture information on the surface of the targeted objects. However, existing hardware acceleration schemes are not sufficient to support large-scale texture identification, especially for the search task, where the number of texture images being searched can reach millions, creating enormous compute and memory demands and making real-time texture identification infeasible. To address these problems, we propose a comprehensive toolset with joint optimization strategies from both hardware and software to deliver optimized GPU acceleration and enable large-scale texture identification with real-time responses. Novel technologies include: 1) a highly optimized cuBLAS implementation for efficiently running the 2-nearest neighbors algorithm; 2) a hybrid cache design that incorporates host memory for streaming data toward GPUs, which delivers a 5× larger memory capacity while running the targeted workloads; 3) a batch process that fully exploits data reuse opportunities by considering available compute resources and memory bandwidth constraints; and 4) an asymmetric local feature extraction that reduces the memory footprint of the feature matrices kept for reference texture images. To the best of our knowledge, this work is the first implementation to provide real-time large-scale texture identification on GPUs. By exploring co-optimizations from both hardware and software, we deliver 31× faster search and 20× larger feature cache capacity compared to a conventional CUDA implementation.
We also demonstrate our proposed designs by building a distributed texture identification system with 14 Nvidia Tesla P100 GPUs, which can complete 872,984 texture similarity comparisons in just one second.
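
The abstract names a 2-nearest neighbors search as the core kernel but does not spell it out. The following is a minimal CPU sketch in Python of how such a search can be phrased so that its inner step becomes a matrix product, which is the form a GEMM routine such as cuBLAS `cublasSgemm` can compute on a GPU. It is not the authors' implementation. The function names, the ratio-test threshold of 0.8, and the use of the number of accepted matches as a similarity score are all assumptions made for illustration.

```python
import math

def two_nearest_neighbors(query, reference):
    """For each query descriptor, return ((d1_sq, i1), (d2_sq, i2)):
    squared distance and index of its nearest and second-nearest
    reference descriptors.

    Uses ||q - r||^2 = ||q||^2 + ||r||^2 - 2 * (q . r). The q . r terms
    for all pairs form one matrix product, which is the part a GPU
    version would hand to a GEMM call (assumption, see lead-in).
    """
    q_norms = [sum(x * x for x in q) for q in query]
    r_norms = [sum(x * x for x in r) for r in reference]
    results = []
    for qi, q in enumerate(query):
        best = (math.inf, -1)
        second = (math.inf, -1)
        for ri, r in enumerate(reference):
            dot = sum(a * b for a, b in zip(q, r))
            # Clamp at zero to absorb small negative rounding errors.
            d_sq = max(q_norms[qi] + r_norms[ri] - 2.0 * dot, 0.0)
            if d_sq < best[0]:
                second = best
                best = (d_sq, ri)
            elif d_sq < second[0]:
                second = (d_sq, ri)
        results.append((best, second))
    return results

def similarity(query, reference, ratio=0.8):
    """Hypothetical similarity score: count query descriptors whose nearest
    neighbor is clearly closer than the second nearest (ratio test)."""
    count = 0
    for (d1_sq, _), (d2_sq, _) in two_nearest_neighbors(query, reference):
        if math.sqrt(d1_sq) < ratio * math.sqrt(d2_sq):
            count += 1
    return count

# Two query descriptors, each with one unambiguous match in the reference set.
query = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
score = similarity(query, reference)

# A query equally close to two reference descriptors is rejected as ambiguous.
ambiguous_score = similarity([[1.0, 0.05]], [[1.0, 0.0], [1.0, 0.1]])
```

In a one-to-many search, a score like this would be computed between the query image and every reference image, which is why the size and reuse of the reference feature matrices dominate the compute and memory cost described in the abstract.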

