swTensor: accelerating tensor decomposition on Sunway architecture

Zhong, Xiaogang; Yang, Hailong; Luan, Zhongzhi; Gan, Lin; Yang, Guangwen; Qian, Depei

doi:10.1007/s42514-019-00017-5

swTensor: accelerating tensor decomposition on Sunway architecture

Regular Paper
Published: 20 November 2019

Volume 1, pages 161–176, (2019)
Cite this article

CCF Transactions on High Performance Computing Aims and scope Submit manuscript

Xiaogang Zhong^1,2,
Hailong Yang ORCID: orcid.org/0000-0003-1101-7927^1,2^nAff4,
Zhongzhi Luan¹,
Lin Gan³,
Guangwen Yang³ &
…
Depei Qian¹

765 Accesses
2 Altmetric
Explore all metrics

Abstract

Modern applications are digesting and generating data with rich features that are stored in high dimensional array or tensor. The computation applied to tensor, such as Canonical Polyadic decomposition (CP decomposition) plays an important role in understanding the internal relationships within the data. Using CP decomposition to analyze large tensor with billions of sizes requires tremendous computation power. In the meanwhile, the emerging Sunway many-core processor has demonstrated its computation advantage in powering the first hundred petaFLOPS supercomputer in the world. In this paper, we propose swTensor that adapts the CP decomposition to Sunway processor by leveraging the MapReduce framework for automatic parallelization and the unique architecture of Sunway for high performance. Specifically, we divide the major computation of CP decomposition into four sub-procedures and implement each using MapReduce framework with customized design key-value pair. Also, we tile the data during the computation so that it fits into the limited local device memory on Sunway for better performance. Moreover, we propose a performance auto-tuning mechanism to search for the optimal parameter settings in swTensor. The experimental results demonstrate swTensor achieves better performance than the state-of-the-art BigTensor and CSTF with the average speedup of 1.36 $\times $ and 1.24 $\times $, respectively. Besides, swTensor exhibits better scalability when scaling across multiple Sunway processors.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

MuLOT: Multi-level Optimization of the Canonical Polyadic Tensor Decomposition at Large-Scale

Accelerating the Tucker Decomposition with Compressed Sparse Tensors

PASTA: a parallel sparse tensor algorithm benchmark suite

Article 01 August 2019

References

Aarts, E., Korst, J., Michiels, W.: Simulated annealing. Local Search Combin. Optim. 36, partii(3), 187–210 (2007)
MATH Google Scholar
Acar, E., Dunlavy, D.M., Kolda, T.G., Mørup, M.: Scalable tensor factorizations for incomplete data. Chemometr. Intell. Lab. Syst. 106(1), 41–56 (2011)
Article Google Scholar
Aja-Fernández, S., de Luis Garcia, R., Tao, D., Li, X.: Tensors in Image Processing and Computer Vision. Springer, Berlin (2009)
Book Google Scholar
Bertsimas, D., Tsitsiklis, J.: Simulated annealing. Stat. Sci. 8(1), 10–15 (1993)
Article Google Scholar
Beutel, A., Talukdar, P.P., Kumar, A., Faloutsos, C., Papalexakis, E.E., Xing, E.P.: Flexifact: scalable flexible factorization of coupled tensors on hadoop. In: Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 109–117. SIAM (2014)
Blanco, Z., Liu, B., Dehnavi, M.M.: Cstf: Large-scale sparse tensor factorizations on distributed platforms. In: Proceedings of the 47th International Conference on Parallel Processing, p. 21. ACM (2018)
Chang, K.W., Yih, S.W.t., Yang, B., Meek, C.: Typed tensor decomposition of knowledge bases for relation extraction (2014)
Chen, B., Fu, H., Wei, Y., He, C., Zhang, W., Li, Y., Wan, W., Zhang, W., Gan, L., Zhang, W., et al.: Simulating the Wenchuan earthquake with accurate surface topography on Sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p. 40. IEEE Press (2018)
Choi, J., Liu, X., Smith, S., Simon, T.: Blocking optimization techniques for sparse tensor computation. In: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 568–577. IEEE (2018)
Choi, J.H., Vishwanathan, S.V.N.: Dfacto: distributed factorization of tensors. Adv. Neural Inf. Process. Syst. 2, 1296–1304 (2014)
Google Scholar
Cichocki, A., Mandic, D., De Lathauwer, L., Zhou, G., Zhao, Q., Caiafa, C., Phan, H.A.: Tensor decompositions for signal processing applications: from two-way to multiway component analysis. IEEE Signal Process. Mag. 32(2), 145–163 (2015)
Article Google Scholar
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Article Google Scholar
Duan, X., Gao, P., Zhang, T., Zhang, M., Liu, W., Zhang, W., Xue, W., Fu, H., Gan, L., Chen, D., et al.: Redesigning lammps for peta-scale and hundred-billion-atom simulation on sunway taihulight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, p. 12. IEEE Press (2018)
Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)
MATH Google Scholar
Han, Q., Yang, H., Luan, Z., Qian, D.: Accelerating tile low-rank gemm on Sunway architecture: Poster. In: Proceedings of the 16th ACM International Conference on Computing Frontiers, pp. 295–297. ACM (2019)
Harper, F.M., Konstan, J.A.: The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. (tiis) 5(4), 19 (2016)
Google Scholar
He, X., Zhang, H., Kan, M.Y., Chua, T.S.: Fast matrix factorization for online recommendation with implicit feedback. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 549–558. ACM (2016)
Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products. J. Math. Phys. 6(1–4), 164–189 (1927)
Article Google Scholar
Hu, Y., Yang, H., Luan, Z., Qian, D.: Massively scaling seismic processing on sunway taihulight supercomputer. arXiv preprint arXiv:1907.11678 (2019)
Jeon, I., Papalexakis, E.E., Faloutsos, C., Sael, L., Kang, U.: Mining billion-scale tensors: algorithms and discoveries. Int. J. Very Large Data Bases 25(4), 519–544 (2016)
Article Google Scholar
Jeon, I., Papalexakis, E.E., Kang, U., Faloutsos, C.: Haten2: Billion-scale tensor decompositions. In: Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pp. 1047–1058. IEEE (2015)
Kang, U., Papalexakis, E., Harpale, A., Faloutsos, C.: Gigatensor: scaling tensor analysis up by 100 times-algorithms and discoveries. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 316–324. ACM (2012)
Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
Article MathSciNet Google Scholar
Lei, C., Yang, Y.H.: Tri-focal tensor-based multiple video synchronization with subframe optimization. IEEE Trans. Image Process. 15(9), 2473–2480 (2006)
Article Google Scholar
Li, J., Ma, Y., Yan, C., Vuduc, R.: Optimizing sparse tensor times matrix on multi-core and many-core architectures. In: Proceedings of the Sixth Workshop on Irregular Applications: Architectures and Algorithms, pp. 26–33. IEEE Press (2016)
Li, L., Yu, T., Zhao, W., Fu, H., Wang, C., Tan, L., Yang, G., Thomson, J.: Large-scale hierarchical k-means for heterogeneous many-core supercomputers. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 160–170. IEEE (2018a)
Li, M., Liu, Y., Yang, H., Luan, Z., Qian, D.: Multi-role sptrsv on sunway many-core architecture. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 594–601. IEEE (2018b)
Lim, L.H., Comon, P.: Multiarray signal processing: Tensor decomposition meets compressed sensing. arXiv preprint arXiv:1002.4935 (2010)
Liu, B., Wen, C., Sarwate, A.D., Dehnavi, M.M.: A unified optimization approach for sparse tensor operations on gpus. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 47–57. IEEE (2017)
Liu, C., Xie, B., Liu, X., Xue, W., Yang, H., Liu, X.: Towards efficient spmv on sunway manycore architectures. In: Proceedings of the 2018 International Conference on Supercomputing, pp. 363–373. ACM (2018)
Liu, C., Yang, H., Sun, R., Luan, Z., Qian, D.: swtvm: Exploring the automated compilation for deep learning on sunway architecture. arXiv preprint arXiv:1904.07404 (2019)
Ma, Y., Li, J., Wu, X., Yan, C., Sun, J., Vuduc, R.: Optimizing sparse tensor times matrix on gpus. J. Parallel Distrib. Comput. 129, 99–109 (2019)
Article Google Scholar
Nickel, M., Tresp, V., Kriegel, H.P.: Factorizing yago: scalable machine learning for linked data. In: Proceedings of the 21st international conference on World Wide Web, pp. 271–280. ACM (2012)
Nisa, I., Li, J., Sukumaran-Rajam, A., Vuduc, R., Sadayappan, P.: Load-balanced sparse mttkrp on gpus. arXiv preprint arXiv:1904.03329 (2019)
Oh, S., Park, N., Jang, J.G., Sael, L., Kang, U.: High-performance tucker factorization on heterogeneous platforms. IEEE Trans. Parallel Distrib. Syst. (2019)
Park, N., Jeon, B., Lee, J., Kang, U.: Bigtensor: Mining billion-scale tensor made easy. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 2457–2460. ACM (2016)
Phipps, E.T., Kolda, T.G.: Software for sparse tensor decomposition on emerging computing architectures. SIAM J. Sci. Comput. 41(3), C269–C290 (2019)
Article MathSciNet Google Scholar
Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to statistics and computer vision. In: Proceedings of the 22nd international conference on Machine learning, pp. 792–799. ACM (2005)
Sidiropoulos, N.D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E.E., Faloutsos, C.: Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process. 65(13), 3551–3582 (2017)
Article MathSciNet Google Scholar
Smith, S., Choi, J.W., Li, J., Vuduc, R., Park, J., Liu, X., Karypis, G.: Frostt: The formidable repository of open sparse tensors and tools (2017)
Smith, S., Park, J., Karypis, G.: Sparse tensor factorization on many-core processors with high-bandwidth memory. In: Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, pp. 1058–1067. IEEE (2017)
Sonka, M., Hlavac, V., Boyle, R.: Image processing, analysis, and machine vision. Cengage Learning (2014)
Tew, P.A.: An investigation of sparse tensor formats for tensor libraries. Ph.D. thesis, Massachusetts Institute of Technology (2016)
Tsourakakis, C.E.: Data mining with mapreduce: Graph and tensor algorithms with applications. Diss. Master’s thesis, Carnegie Mellon University (2010)
Tucker, L.R.: Implications of factor analysis of three-way matrices for measurement of change. Probl. Meas. Change 15, 122–137 (1963)
Google Scholar
Wang, X., Xue, W., Liu, W., Wu, L.: swsptrsv: a fast sparse triangular solve with sparse level tile layout on sunway architectures. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 338–353. ACM (2018)
Article Google Scholar
Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Tech. rep., Lawrence Berkeley National Lab.(LBNL), Berkeley (2009)
Xu, Z., Lin, J., Matsuoka, S.: Benchmarking sw26010 many-core processor. In: Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pp. 743–752. IEEE (2017)
Yelp dataset challenge. (2019) https://www.yelp.com/dataset/challenge
Zhong, X., Li, M., Yang, H., Liu, Y., Qian, D.: swmr: A framework for accelerating mapreduce applications on sunway taihulight. IEEE Trans. Emerg. Topics Comput. (2018)

Download references

Acknowledgements

The authors would like to thank all anonymous reviewers for their insightful comments and suggestions. This work is supported by National Key R&D Program of China (Grant No. 2016YFB1000503 and 2016YFA0602200), National Natural Science Foundation of China (Grant No. 61502019 and 61732002), State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2018ZX-19), Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao). Hailong Yang is the corresponding author.

Author information

Hailong Yang
Present address: Beihang University, G823, New Main Building, Beijing, 100191, China

Authors and Affiliations

Sino-German Joint Software Institute, the School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Xiaogang Zhong, Hailong Yang, Zhongzhi Luan & Depei Qian
State Key Laboratory of Software Development Environment, Beihang University, Beijing, 100191, China
Xiaogang Zhong & Hailong Yang
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Lin Gan & Guangwen Yang

Authors

Xiaogang Zhong
View author publications
You can also search for this author inPubMed Google Scholar
Hailong Yang
View author publications
You can also search for this author inPubMed Google Scholar
Zhongzhi Luan
View author publications
You can also search for this author inPubMed Google Scholar
Lin Gan
View author publications
You can also search for this author inPubMed Google Scholar
Guangwen Yang
View author publications
You can also search for this author inPubMed Google Scholar
Depei Qian
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Hailong Yang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhong, X., Yang, H., Luan, Z. et al. swTensor: accelerating tensor decomposition on Sunway architecture. CCF Trans. HPC 1, 161–176 (2019). https://doi.org/10.1007/s42514-019-00017-5

Download citation

Received: 09 May 2019
Accepted: 06 November 2019
Published: 20 November 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s42514-019-00017-5

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

swTensor: accelerating tensor decomposition on Sunway architecture

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

MuLOT: Multi-level Optimization of the Canonical Polyadic Tensor Decomposition at Large-Scale

Accelerating the Tucker Decomposition with Compressed Sparse Tensors

PASTA: a parallel sparse tensor algorithm benchmark suite

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now