DOI: 10.1145/3219819.3220073
research-article

Learning and Interpreting Complex Distributions in Empirical Data

Published: 19 July 2018

Abstract

Fitting empirical data distributions and then interpreting them generatively is a common research paradigm for understanding the structure and dynamics underlying data in various disciplines. However, previous work mainly fits or interprets empirical data distributions case by case. Faced with complex data distributions in the real world, can we fit and interpret them with a unified but parsimonious parametric model? In this paper, we view complex empirical data as being generated by a dynamic system that takes uniform randomness as input. By modeling the generative dynamics of data, we present a four-parameter dynamic model, together with inference and simulation algorithms, that can fit and generate a family of distributions ranging from Gaussian, Exponential, Power Law, and Stretched Exponential (Weibull) to their complex variants with multi-scale complexities. Rather than being a black box, our model can be interpreted through a unified differential equation that captures the underlying generative dynamics. More powerful models can be constructed within our framework in a principled way. We validate our model on various synthetic datasets and then apply it to 16 real-world datasets from different disciplines. We show the systematic biases that arise when these datasets are fitted by the most widely used methods, and demonstrate the superiority of our model. In short, our model potentially provides a framework for fitting complex distributions in empirical data and, more importantly, for understanding their generative mechanisms.
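To make the idea of "uniform randomness as input" concrete, the following is a minimal sketch of standard inverse-transform sampling for three of the distribution families named in the abstract. It is not the paper's four-parameter model or its inference algorithm, which this page does not specify. The function names, parameter names, and parameterizations (`rate`, `alpha`, `x_min`, `scale`, `shape`) are assumptions chosen for illustration only.

```python
import math
import random


def sample_exponential(u, rate):
    """Invert the exponential survival function S(x) = exp(-rate * x)."""
    return -math.log(1.0 - u) / rate


def sample_power_law(u, alpha, x_min):
    """Invert the Pareto survival function S(x) = (x / x_min) ** -(alpha - 1)."""
    return x_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))


def sample_weibull(u, scale, shape):
    """Invert the Weibull survival function S(x) = exp(-(x / scale) ** shape)."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def draw(sampler, n, seed=0, **params):
    """Feed n uniform draws from [0, 1) through the given inverse transform."""
    rng = random.Random(seed)
    return [sampler(rng.random(), **params) for _ in range(n)]


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    exp_samples = draw(sample_exponential, 10_000, rate=2.0)
    pl_samples = draw(sample_power_law, 10_000, alpha=2.5, x_min=1.0)
    wb_samples = draw(sample_weibull, 10_000, scale=1.0, shape=0.5)
    print("exponential mean:", sum(exp_samples) / len(exp_samples))
    print("power-law minimum:", min(pl_samples))
    print("weibull maximum:", max(wb_samples))
```

In this sketch each family differs only in the function applied to the same uniform input, which is the intuition behind treating the data as the output of one generative system. With `shape=1.0` the Weibull transform reduces to the exponential one with `rate = 1 / scale`.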

Supplementary Material

MP4 File (zang_complex_distributions.mp4)

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. complex distribution
  2. dynamic model
  3. heavy-tailed distribution
  4. interpretability
  5. survival analysis

Qualifiers

  • Research-article

Conference

KDD '18

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 570
  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 1

Reflects downloads up to 16 Feb 2025
