Using Butterfly-patterned Partial Sums to Draw from Discrete Distributions

Published: 19 November 2019

Abstract

We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate (“butterfly-patterned”) form is faster to compute, making better use of coalesced memory accesses; from this table, complete partial sums are computed on the fly during a binary search. Measurements using CUDA 7.5 on an NVIDIA Titan Black GPU show that this technique makes an entire machine-learning application that uses a Latent Dirichlet Allocation topic model with 1,024 topics about 13% faster (when using single-precision floating-point data) or about 35% faster (when using double-precision floating-point data) than doing a straightforward matrix transposition after using coalesced accesses.
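
For orientation, the following is a minimal sequential sketch of the conventional approach that the abstract contrasts against: build a complete table of partial sums of the relative probabilities, then binary-search it. It is not the butterfly-patterned table described in the article, and it is not SIMD or GPU code; the function name `draw_index` and its parameters are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def draw_index(weights, u):
    """Pick index i with probability weights[i] / sum(weights).

    weights: non-negative relative probabilities (need not sum to 1).
    u: a uniform random number in [0, 1), supplied by the caller.
    Hypothetical helper for illustration; not code from the article.
    """
    # Complete table of partial sums: partial_sums[i] = weights[0] + ... + weights[i].
    partial_sums = list(accumulate(weights))
    # Scale u to the total weight so the weights need no normalization.
    target = u * partial_sums[-1]
    # Binary search for the first partial sum strictly greater than target.
    i = bisect_right(partial_sums, target)
    # Guard against floating-point rounding pushing the result past the end.
    return min(i, len(weights) - 1)
```

In practice `u` would come from a random number generator, for example `draw_index([1, 2, 1], random.random())`. The article's contribution, per the abstract, is to replace the complete table with one that is cheaper to compute and to reconstruct the needed partial sums during the binary search.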

Published In

ACM Transactions on Parallel Computing, Volume 6, Issue 4
December 2019
188 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3372747
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 November 2019
Accepted: 01 May 2019
Revised: 01 March 2019
Received: 01 August 2018
Published in TOPC Volume 6, Issue 4

Author Tags

  1. GPU
  2. LDA
  3. SIMD
  4. butterfly
  5. coalesced memory access
  6. discrete distribution
  7. latent Dirichlet allocation
  8. machine learning
  9. memory bottleneck
  10. multithreading
  11. parallel computing
  12. random sampling
  13. transposed memory access

Qualifiers

  • Research-article
  • Research
  • Refereed
