ABSTRACT
By the Gauss-Markov theorem, ordinary least squares is the best linear unbiased estimator when the theorem's assumptions hold. Over the last two decades, however, some researchers have claimed that least squares is substantially inaccurate for fitting power-law distributions, and this criticism has created a strong bias against the method in the research community. In this paper, we conduct extensive experiments to show that the criticism is complete nonsense. Specifically, we sample discrete and continuous data of different sizes from power-law models and show that the long-tailed noise, although itself sampled from the power-law models, cannot be treated as power-law data. We define the correct way to bin continuous power-law data into data points and propose an average strategy that lets least squares fit power-law distributions. Experiments on both simulated and real-world data show that our proposed method fits power-law data perfectly. We also uncover a fundamental flaw in the popular method proposed by Clauset et al. [12]: it tends to discard the majority of the power-law data and fit the long-tailed noise instead. Our experiments further show that the reverse cumulative distribution function is a poor choice for plotting power-law data in practice, because it usually hides the true probability distribution of the data. We hope that this research dispels the bias against using least squares to fit power-law distributions.
Source code can be found at https://github.com/xszhong/LSavg.
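For readers unfamiliar with the basic idea, the following is a minimal sketch of ordinary least squares applied to a power law in log-log space. It is not the binning or average strategy proposed in the paper (see the repository above for that); it only illustrates that a power law y = C * x^(-alpha) becomes a straight line after taking logarithms, so a linear fit recovers the exponent. The function name `fit_power_law_loglog` and the sample values (C = 3.0, alpha = 2.5) are assumptions made for this illustration.

```python
import numpy as np

def fit_power_law_loglog(x, y):
    """Fit y ~ C * x**(-alpha) by ordinary least squares on (log x, log y).

    Hypothetical helper for illustration only; this is plain log-log
    regression, not the method proposed in the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Logarithms are only defined for strictly positive values.
    mask = (x > 0) & (y > 0)
    # log y = log C - alpha * log x, so the slope is -alpha.
    slope, intercept = np.polyfit(np.log(x[mask]), np.log(y[mask]), 1)
    return -slope, np.exp(intercept)

# Noise-free data generated from an exact power law (assumed values).
x = np.arange(1, 101, dtype=float)
y = 3.0 * x ** -2.5

alpha_hat, c_hat = fit_power_law_loglog(x, y)
print(alpha_hat, c_hat)
```

On noise-free input the fit should recover the exponent and the constant up to floating-point error; how the fit behaves on sampled data with a noisy tail is exactly the question the paper investigates.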
- Lada A. Adamic and Bernardo A. Huberman. 2000. The Nature of Markets in the World Wide Web. Quarterly Journal of Electronic Commerce 1, 1 (2000), 5–12.
- Reka Albert, Hawoong Jeong, and Albert-László Barabási. 1999. Diameter of the World-Wide Web. Nature 401 (1999), 130.
- I. Artico, I. Smolyarenko, V. Vinciotti, and E. C. Wit. 2020. How rare are power-law networks really? In Proceedings of the Royal Society A, Vol. 476. 20190742.
- Eduardo M. Azevedo, Alex Deng, Jose Luis Montiel Olea, Justin Rao, and E. Glen Weyl. 2020. A/B Testing with Fat Tails. Journal of Political Economy 128, 12 (2020), 4614–000.
- Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks. Science 286 (1999), 509–512.
- H. Bauke. 2007. Parameter estimation for power-law distributions by maximum likelihood methods. The European Physical Journal B 58 (2007), 167–173.
- Bernd Blasius. 2020. Power-law distribution in the number of confirmed COVID-19 cases. Chaos: An Interdisciplinary Journal of Nonlinear Science 30, 9 (2020).
- Eric Bonnet, Olivier Bour, Noelle E. Odling, Philippe Davy, Ian Main, Patience Cowie, and Brian Berkowitz. 2001. Scaling of fracture systems in geological media. Reviews of Geophysics 39, 3 (2001), 347–383.
- Patrick Erik Bradley and Martin Behnisch. 2019. Heavy-tailed distributions for building stock data. Environment and Planning B: Urban Analytics and City Science 46, 7 (2019), 1281–1296.
- Anna D. Broido and Aaron Clauset. 2019. Scale-free networks are rare. Nature Communications 10, 1 (2019), 1–10.
- Robert Malcolm Clark, S. J. D. Cox, and Geoff M. Laslett. 1999. Generalizations of power-law distributions applicable to sampled fault-trace lengths: model choice, parameter estimation and caveats. Geophysical Journal International 136, 2 (1999), 357–372.
- Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. Power-law Distributions in Empirical Data. SIAM Rev. 51, 4 (2009), 661–703.
- William G. Cochran. 1952. The Chi-square Test of Goodness of Fit. The Annals of Mathematical Statistics 23, 3 (1952), 315–345.
- Donald Cochrane and Guy H. Orcutt. 1949. Application of least squares regression to relationships containing auto-correlated error terms. J. Amer. Statist. Assoc. 44, 245 (1949), 32–61.
- Brian Conrad and Michael Mitzenmacher. 2004. Power laws for monkeys typing randomly: the case of unequal probabilities. IEEE Transactions on Information Theory 50, 7 (2004), 1403–1414.
- Bernat Corominas-Murtra and Ricard V. Solé. 2010. Universality of Zipf’s Law. Physical Review E 82, 1 (2010), 011102.
- Alvaro Corral and Alvaro Gonzalez. 2019. Power Law Size Distributions in Geoscience Revisited. Earth and Space Science 6, 5 (2019), 673–697.
- Alvaro Corral, Isabel Serra, and Ramon Ferrer i Cancho. 2020. Distinct flavors of Zipf’s law and its maximum likelihood fitting: Rank-size and size-distribution representations. Physical Review E 102, 5 (2020), 052113.
- Frederik Michel Dekking, Cornelis Kraaikamp, Hendrik Paul Lopuhaä, and Ludolf Erwin Meester. 2005. A Modern Introduction to Probability and Statistics: Understanding why and how. Springer Science & Business Media.
- Anna Deluca and Alvaro Corral. 2013. Fitting and Goodness-of-Fit Test of Non-Truncated and Truncated Power-Law Distributions. Acta Geophysica 61, 6 (2013), 1351–1394.
- Nicole Eikmeier and David F. Gleich. 2017. Revisiting Power-law Distributions in Spectra of Real World Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 817–826.
- Zoltan Eisler, Imre Bartos, and Janos Kertesz. 2008. Fluctuation scaling in complex systems: Taylor’s law and beyond. Advances in Physics 57, 1 (2008), 89–142.
- Brian J. Enquist, Evan P. Economo, Travis E. Huxman, Andrew P. Allen, Danielle D. Ignace, and James F. Gillooly. 2003. Scaling metabolism from organisms to ecosystems. Nature 423, 6940 (2003), 639–642.
- Brian J. Enquist and Karl J. Niklas. 2001. Invariant scaling relations across tree-dominated communities. Nature 410, 6829 (2001), 655–660.
- Xavier Gabaix. 2009. Power Laws in Economics and Finance. Annual Review of Economics 1, 1 (2009), 255–294.
- M. L. Goldstein, S. A. Morris, and G. G. Yen. 2004. Problems with fitting to the power-law distribution. The European Physical Journal B 41 (2004), 255–258.
- Beno Gutenberg and Charles F. Richter. 1944. Frequency of Earthquakes in California. Bulletin of the Seismological Society of America 34, 4 (1944), 185–188.
- Bo-Ping Han and Milan Straskraba. 1998. Size dependence of biomass spectra and population density I. The effects of size scales and size intervals. Journal of Theoretical Biology 191, 3 (1998), 259–265.
- Rudolf Hanel, Bernat Corominas-Murtra, Bo Liu, and Stefan Thurner. 2017. Fitting power-laws in empirical data with estimators that work for all exponents. PLoS ONE 12, 2 (2017), 1–15.
- Charles R. Henderson. 1975. Best Linear Unbiased Estimation and Prediction under a Selection Model. Biometrics 31, 2 (1975), 423–447.
- Hawoong Jeong, Balint Tombor, Reka Albert, Zoltan N. Oltvai, and A-L. Barabasi. 2000. The Large-Scale Organization of Metabolic Networks. Nature 407, 6804 (2000), 651–654.
- Sonia Kefi, Max Rietkerk, Concepcion L. Alados, Yolanda Pueyo, Vasilios P. Papanastasis, Ahmed ElAich, and Peter C. De Ruiter. 2007. Spatial vegetation patterns and imminent desertification in Mediterranean arid ecosystems. Nature 449, 7159 (2007), 213–217.
- Wentian Li. 2002. Zipf’s Law Everywhere. Glottometrics 5 (2002), 14–21.
- Edward T. Lu and Russell J. Hamilton. 1991. Avalanches and the Distribution of Solar Flares. The Astrophysical Journal 380 (1991), L89–L92.
- R. Dean Malmgren, Daniel B. Stouffer, Adilson E. Motter, and Luis AN Amaral. 2008. A Poissonian explanation for heavy tails in e-mail communication. Proceedings of the National Academy of Sciences 105, 47 (2008), 18153–18158.
- Timothy D. Meehan. 2006. Energy Use and Animal Abundance in Litter and Soil Communities. Ecology 87, 7 (2006), 1650–1658.
- Buddhika Nettasinghe and Vikram Krishnamurthy. 2021. Maximum Likelihood Estimation of Power-law Degree Distributions via Friendship Paradox-based Sampling. ACM Transactions on Knowledge Discovery from Data 15, 6 (2021), 1–28.
- Mark EJ. Newman. 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46, 5 (2005), 323–351.
- Jan Overgoor, Austin R. Benson, and Johan Ugander. 2019. Choosing to Grow a Graph: Modeling Network Formation as Discrete Choice. In Proceedings of the 2019 World Wide Web Conference. 1409–1420.
- Karl Pearson. 1900. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can be Reasonably Supposed to have Arisen from Random Sampling. Philos. Mag. 5, 50 (1900), 157–175.
- Steven T. Piantadosi. 2014. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review 21, 5 (2014), 1112–1130.
- G. Pickering, J. M. Bull, and D. J. Sanderson. 1995. Sampling power-law distributions. Tectonophysics 248 (1995), 1–20.
- Carla M. A. Pinto, A. Mendes Lopes, and J. A. Tenreiro Machado. 2012. A review of power laws in real life phenomena. Commun Nonlinear Sci Numer Simulat 17 (2012), 3558–3578.
- Robin L. Plackett. 1949. A Historical Note on the Method of Least Squares. Biometrika 36, 3/4 (1949), 458–460.
- Derek J. De Solla Price. 1965. Networks of Scientific Papers. Science 149, 3683 (1965), 510–515.
- Salvador Pueyo and Roger Jovani. 2006. Comment on “A Keystone Mutualism Drives Pattern in a Power Function”. Science 313, 5794 (2006), 1739–1739.
- John A. Rice. 2006. Mathematical Statistics and Data Analysis. Cengage Learning.
- Andrea Rinaldo, Amos Maritan, Kent K. Cavender-Bares, and Sallie W. Chisholm. 2002. Cross-scale ecological dynamics and microbial size spectra in marine ecosystems. In Proceedings of the Royal Society of London. Series B: Biological Sciences, Vol. 269. 2051–2059.
- David W. Sims, David Righton, and Jonathan W. Pitchford. 2007. Minimizing errors in identifying Levy flight behaviour of organisms. Journal of Animal Ecology 76, 2 (2007), 222–229.
- Nickolay Smirnov. 1948. Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 2 (1948), 279–281.
- Michael A. Stephens. 1974. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association 69, 347 (1974), 730–737.
- Alex Stivala, Garry Robins, and Alessandro Lomi. 2020. Exponential random graph model parameter estimation for very large directed networks. PLoS ONE 15, 1 (2020), e0227804.
- Gilbert Strang. 2016. Introduction to Linear Algebra. Wellesley-Cambridge Press.
- Yogesh Virkar and Aaron Clauset. 2014. Power-law distributions in binned empirical data. The Annals of Applied Statistics 8, 1 (2014), 89–119.
- Geoffrey B. West, James H. Brown, and Brian J. Enquist. 1997. A General Model for the Origin of Allometric Scaling Laws in Biology. Science 276, 5309 (1997), 122–126.
- Ethan P. White, Brian J. Enquist, and Jessica L. Green. 2008. On estimating the exponent of power-law frequency distributions. Ecology 89, 4 (2008), 905–912.
- J. C. Willis and G. Udny Yule. 1922. Some Statistics of Evolution and Geographical Distribution in Plants and Animals, and their Significance. Nature 109 (1922), 177–179.
- Chengxi Zang, Peng Cui, and Wenwu Zhu. 2018. Learning and Interpreting Complex Distributions in Empirical Data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2682–2691.
- Xiaoshi Zhong. 2020. Time Expression and Named Entity Analysis and Recognition. Ph.D. Dissertation. Nanyang Technological University, Singapore.
- Tommaso Zillio and Richard Condit. 2007. The impact of neutrality, niche, differentiation and species input on diversity and abundance distributions. Oikos 116 (2007), 931–940.
- George Zipf. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press, Inc.
Index Terms
- Is Least-Squares Inaccurate in Fitting Power-Law Distributions? The Criticism is Complete Nonsense