
A CUDA implementation of the Continuous Space Language Model


Abstract

The training phase of the Continuous Space Language Model (CSLM) was implemented in NVIDIA's Compute Unified Device Architecture (CUDA), a combined hardware/software architecture. A detailed explanation of the CSLM algorithm is provided. The implementation uses a combination of CUBLAS library routines, NVIDIA Performance Primitives (NPP) functions, and CUDA kernel calls on three CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach is demonstrated. The efficiency of the CUDA version of the open-source implementation is analyzed and compared to that of the Intel Math Kernel Library (MKL) on a variety of CUDA-enabled and multi-core CPU platforms. It is demonstrated that a substantial performance benefit can be obtained using CUDA, even with non-optimal code. Techniques for optimizing performance are then provided. Finally, an analysis determines the conditions under which the performance of CUDA exceeds that of the multi-core MKL realization.
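
To make the approach concrete, the sketch below shows the kind of CUBLAS call that dominates this style of CSLM training: the hidden-layer activations for a whole mini-batch are computed as a single dense matrix multiply, with a small custom kernel left for the nonlinearity and NPP available for element-wise steps. This is a minimal illustration under stated assumptions, not the authors' code; the dimensions (hiddenDim, inputDim, batchSize) and buffer names are hypothetical.

    // A minimal sketch (not the authors' code) of the dominant operation in
    // CSLM training: hidden-layer activations for a mini-batch computed as a
    // single dense matrix multiply, H = W * X, via one CUBLAS SGEMM call.
    // The dimensions below are illustrative assumptions, not paper values.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main(void) {
        const int hiddenDim = 512, inputDim = 768, batchSize = 128;

        // Device buffers for the weight matrix, input batch, and activations.
        float *dW, *dX, *dH;
        cudaMalloc((void **)&dW, sizeof(float) * hiddenDim * inputDim);
        cudaMalloc((void **)&dX, sizeof(float) * inputDim * batchSize);
        cudaMalloc((void **)&dH, sizeof(float) * hiddenDim * batchSize);
        // ... copy the weights W and the projected input batch X to the
        //     device here (cudaMemcpy) ...

        cublasHandle_t handle;
        cublasCreate(&handle);

        // Column-major SGEMM: H (hiddenDim x batchSize) =
        //   W (hiddenDim x inputDim) * X (inputDim x batchSize)
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    hiddenDim, batchSize, inputDim,
                    &alpha, dW, hiddenDim,
                    dX, inputDim,
                    &beta, dH, hiddenDim);

        // A small custom CUDA kernel (or an NPP element-wise routine) would
        // then apply the tanh nonlinearity to dH in place.

        cublasDestroy(handle);
        cudaFree(dW); cudaFree(dX); cudaFree(dH);
        return 0;
    }

Folding the whole mini-batch into one large SGEMM is the key design choice: a single large multiply keeps the GPU saturated, whereas many small multiplies would be dominated by kernel-launch and host-device transfer overhead.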






Acknowledgements

Many thanks to Mike Pressler, IPFW Manager of Electronics and Computer Support Services, for his outstanding technical support.

Author information


Corresponding author

Correspondence to Elizabeth A. Thompson.


About this article

Cite this article

Thompson, E.A., Anderson, T.R. A CUDA implementation of the Continuous Space Language Model. J Supercomput 68, 65–86 (2014). https://doi.org/10.1007/s11227-013-1023-7

