Standard deviation estimation from sums of unequal size samples

Miguel Casquilho; Jorge Buescu

doi:10.1515/mcma-2022-2118

Published by De Gruyter August 4, 2022


From the journal Monte Carlo Methods and Applications


Abstract

In many industrial and related activities, the sums of the values of samples, frequently of unequal size, are systematically recorded for purposes such as legal compliance or quality control. For the typical case where the individual values are not known, or no longer known, we address the point estimation, with confidence intervals, of the standard deviation (and mean) of the individual items from those sums alone. The estimation may also be useful to corroborate estimates from previous statistical process control. An everyday example of such a sum is the total weight of a set of items, such as a load of bags on a truck, which is used as an illustration. For the mean and standard deviation of the distribution, assumed Gaussian, we derive point estimates, which lead to weighted statistics, and confidence intervals. For the latter, starting with a tentative reduction to equal size samples, we arrive at a solid conjecture for the mean and a proposal for the standard deviation. All results are verifiable by direct computation or by simulation in a general and effective way. These computations can be run on our public web pages, in particular for possible industrial use.

Keywords: Unequal size samples; sample sum; estimation; quality control; web computing

MSC 2010: 62-07; 62F10; 62F25; 62P30; 65C05

Funding source: Fundação para a Ciência e a Tecnologia

Award Identifier / Grant number: UIDB/ECI/04028/2020

Award Identifier / Grant number: UID/MAT/04561/2020

Funding statement: The first author does research at CERENA, Centro de Recursos Naturais e Ambiente (Research “Centre for Natural Resources and the Environment”), under the aegis of FCT, “Fundação para a Ciência e a Tecnologia” (Portuguese Science and Technology Foundation), Project UIDB/ECI/04028/2020. The second author thanks CMAFCIO, Centro de Matemática, Aplicações Fundamentais e Investigação Operacional (“Centre for Mathematics, Fundamental Applications, and Operational Research”), also under FCT, Project UID/MAT/04561/2020.

A Appendix

A.1 Maximum likelihood, variance

The sample average $\bar{X}_t$ has (see Section 2) a Gaussian distribution

$$\bar{X}_t = \frac{1}{n_t} \sum_{i=1}^{n_t} X_{t,i} \sim N\!\left(\mu, \frac{\sigma^2}{n_t}\right), \quad t = 1, \dots, T,$$

i.e., with probability density function

$$f_{\bar{X}_t}(\bar{x}_t \mid \mu, \sigma^2) = \frac{1}{(\sigma/\sqrt{n_t})\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\bar{x}_t - \mu}{\sigma/\sqrt{n_t}}\right)^2\right].$$

The likelihood is thus the product

$$L(\mu, \sigma^2) = \prod_{t=1}^{T} \frac{1}{(\sigma/\sqrt{n_t})\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\bar{x}_t - \mu}{\sigma/\sqrt{n_t}}\right)^2\right].$$

The maxima of L may be obtained, given the monotonicity of the logarithm, by taking, as usual, the logarithm of the expression:

$$\ln L(\mu, \sigma^2) = \sum_{t=1}^{T} \left[\ln\frac{\sqrt{n_t}}{\sigma\sqrt{2\pi}} - \frac{n_t}{2\sigma^2}(\bar{x}_t - \mu)^2\right].$$

The values of the parameters μ and σ that maximize L satisfy the stationarity equations

$$\begin{cases} \dfrac{\partial}{\partial\mu} \ln L(\mu, \sigma) = 0, \\[1ex] \dfrac{\partial}{\partial\sigma} \ln L(\mu, \sigma) = 0. \end{cases} \tag{A.1}$$

Solving the first equation for μ leads to

$$\frac{\partial}{\partial\mu} \ln L = -\frac{1}{2\sigma^2} \frac{\partial}{\partial\mu} \sum_{t=1}^{T} n_t (\bar{x}_t - \mu)^2 = \frac{1}{\sigma^2} \sum_{t=1}^{T} n_t (\bar{x}_t - \mu) = 0$$

or, since σ ≠ 0,

$$\sum_{t=1}^{T} n_t (\bar{x}_t - \hat{\mu}) = 0,$$

from which we finally obtain

$$\hat{\mu} = \frac{\sum_{t=1}^{T} n_t \bar{x}_t}{\sum_{t=1}^{T} n_t}. \tag{A.2}$$

Similarly, solving the second equation for σ leads to equations (A.3) to (A.5). We have

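$$\frac{\partial}{\partial\sigma} \ln L = \frac{\partial}{\partial\sigma} \left[(-T) \ln\left(\sigma\sqrt{2\pi}\right)\right] + \frac{\partial}{\partial\sigma} \sum_{t=1}^{T} \left[\frac{1}{2} \ln n_t - \frac{n_t}{2\sigma^2}(\bar{x}_t - \mu)^2\right], \tag{A.3}$$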

from which the stationarity equation (A.1) for σ follows:

$$\frac{\partial}{\partial\sigma} \ln L = -T \frac{1}{\sigma} - \frac{1}{2} \sum_{t=1}^{T} n_t (\bar{x}_t - \mu)^2 \left(-2\sigma^{-3}\right) = 0$$

or, equivalently,

$$\frac{T}{\sigma} = \frac{1}{\sigma^3} \sum_{t=1}^{T} n_t (\bar{x}_t - \hat{\mu})^2,$$

and finally

$$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} n_t (\bar{x}_t - \hat{\mu})^2. \tag{A.4}$$

It is a routine matter to check that the Hessian matrix at the critical point given by equations (A.2) and (A.4) is negative definite, since it is diagonal with negative diagonal entries (eigenvalues):

$$\frac{\partial^2 \ln L}{\partial\mu^2} = -\frac{1}{\sigma^2} \sum_{t=1}^{T} n_t, \qquad \frac{\partial^2 \ln L}{\partial\sigma^2} = -\frac{2T}{\sigma^2}.$$
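The diagonal form of the Hessian relies on the mixed derivative vanishing at the critical point. As a brief check (a sketch derived from the first-order expressions above, not part of the original text), differentiating $\partial \ln L / \partial\mu$ with respect to σ gives

$$\frac{\partial^2 \ln L}{\partial\mu\,\partial\sigma} = \frac{\partial}{\partial\sigma}\left[\frac{1}{\sigma^2} \sum_{t=1}^{T} n_t (\bar{x}_t - \mu)\right] = -\frac{2}{\sigma^3} \sum_{t=1}^{T} n_t (\bar{x}_t - \mu),$$

which is zero at $\mu = \hat{\mu}$ by equation (A.2). Likewise, differentiating $\partial \ln L / \partial\sigma$ once more and substituting $\sum_{t} n_t (\bar{x}_t - \hat{\mu})^2 = T\hat{\sigma}^2$ from equation (A.4) gives

$$\left.\frac{\partial^2 \ln L}{\partial\sigma^2}\right|_{(\hat{\mu},\hat{\sigma})} = \frac{T}{\hat{\sigma}^2} - \frac{3}{\hat{\sigma}^4}\, T\hat{\sigma}^2 = -\frac{2T}{\hat{\sigma}^2}.$$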

We conclude that the unique critical point is a local and, indeed, global maximum of L.

Correcting the bias of the ML estimator, i.e., replacing the divisor T by T − 1, gives

$$\hat{\sigma}^2 = \frac{1}{T-1} \sum_{t=1}^{T} n_t (\bar{x}_t - \hat{\mu})^2. \tag{A.5}$$
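The divisor T − 1 can be motivated by a short expectation calculation. This is a sketch under the Gaussian model with independent samples, not reproduced from the original text; it assumes $N = \sum_{t=1}^{T} n_t$, so that $\hat{\mu} \sim N(\mu, \sigma^2/N)$. Expanding around the true mean μ and using $\sum_{t} n_t (\bar{x}_t - \mu) = N(\hat{\mu} - \mu)$,

$$\sum_{t=1}^{T} n_t (\bar{x}_t - \hat{\mu})^2 = \sum_{t=1}^{T} n_t (\bar{x}_t - \mu)^2 - N(\hat{\mu} - \mu)^2,$$

and taking expectations term by term,

$$E\left[\sum_{t=1}^{T} n_t (\bar{X}_t - \hat{\mu})^2\right] = \sum_{t=1}^{T} n_t \frac{\sigma^2}{n_t} - N \frac{\sigma^2}{N} = (T - 1)\,\sigma^2,$$

so dividing by T − 1 rather than T yields an unbiased estimator of σ².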

Introducing the weights (equation (2.6)) finally leads to

$$\hat{\sigma}^2 = \frac{N}{T-1} \sum_{t=1}^{T} w_t (\bar{x}_t - \hat{\mu})^2,$$

which states that the variance sought is N times the weighted variance of the sample averages.
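As an illustration of how equations (A.2), (A.4) and (A.5) might be evaluated from recorded sums alone, the following is a minimal Python sketch. The function name `estimate_from_sums` and the sample data are hypothetical, and the code assumes that each recorded sum is $s_t = n_t \bar{x}_t$, so that the sample averages can be recovered as $s_t / n_t$.

```python
def estimate_from_sums(sizes, sums):
    """Point estimates of the mean and variance from sample sums alone.

    sizes: n_t, number of items in each sample (e.g. bags per truck)
    sums:  s_t = n_t * xbar_t, recorded total of each sample (assumption)
    Returns (mu_hat, var_ml, var_unbiased).
    """
    if len(sizes) != len(sums) or len(sizes) < 2:
        raise ValueError("need at least two samples, with one sum per size")
    T = len(sizes)
    N = sum(sizes)                       # total number of items
    mu_hat = sum(sums) / N               # (A.2): sum(n_t * xbar_t) / sum(n_t)
    means = [s / n for s, n in zip(sums, sizes)]
    # Weighted sum of squares: sum of n_t * (xbar_t - mu_hat)^2
    ss = sum(n * (xb - mu_hat) ** 2 for n, xb in zip(sizes, means))
    return mu_hat, ss / T, ss / (T - 1)  # (A.4) and (A.5)


# Hypothetical data: three loads of 2, 3 and 5 items.
mu_hat, var_ml, var_unbiased = estimate_from_sums([2, 3, 5], [20.0, 33.0, 47.0])
print(mu_hat, var_ml, var_unbiased)
```

For these made-up data the sample averages are 10, 11 and 9.4, so the estimates are approximately 10 for the mean, 1.6 for the ML variance and 2.4 for the bias-corrected variance, up to floating-point rounding.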

A.2 Gamma distribution

We recall the definitions of the gamma distribution

$$g_{\alpha,\beta}(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha-1} e^{-\frac{x}{\beta}} \quad \text{for } \alpha > 0,\ \beta > 0, \tag{A.6}$$

and, for a positive integer k, of the $\chi_k^2$ distribution

$$\chi_k^2(x) = \frac{1}{2^{\frac{k}{2}}\,\Gamma\!\left(\frac{k}{2}\right)}\, x^{\frac{k}{2}-1} e^{-\frac{x}{2}}. \tag{A.7}$$

From equations (A.6) and (A.7), it follows immediately that

$$\chi_k^2(x) \equiv g_{\frac{k}{2},\,2}(x),$$

which is the classical identity in equation (3.5) relating the gamma and $\chi^2$ distributions. On the other hand, for a > 0,

$$\chi_k^2(ax) = \frac{1}{2^{\frac{k}{2}}\,\Gamma\!\left(\frac{k}{2}\right)}\, (ax)^{\frac{k}{2}-1} e^{-\frac{ax}{2}} = a^{-1}\, \frac{1}{\left(\frac{2}{a}\right)^{\frac{k}{2}} \Gamma\!\left(\frac{k}{2}\right)}\, x^{\frac{k}{2}-1} e^{-\frac{x}{2/a}},$$

from which the desired identity

$$g_{\frac{k}{2},\,\frac{2}{a}}(x) \equiv a\,\chi_k^2(ax),$$

referred to in equation (3.6) of the text, immediately follows^[14].
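The identity can also be checked numerically. The following is a minimal Python sketch using only the standard library; the helper names `gamma_pdf`, `chi2_pdf` and `max_identity_gap`, and the values of k and a, are chosen here for illustration. It codes the densities (A.6) and (A.7) directly and compares both sides of the identity on a grid of positive x.

```python
from math import exp, gamma


def gamma_pdf(x, alpha, beta):
    """Gamma density (A.6) with shape alpha and scale beta, for x > 0."""
    return x ** (alpha - 1) * exp(-x / beta) / (beta ** alpha * gamma(alpha))


def chi2_pdf(x, k):
    """Chi-square density (A.7) with k degrees of freedom, for x > 0."""
    return x ** (k / 2 - 1) * exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))


def max_identity_gap(k, a, xs):
    """Largest |g_{k/2, 2/a}(x) - a * chi2_k(a * x)| over the points xs."""
    return max(
        abs(gamma_pdf(x, k / 2, 2 / a) - a * chi2_pdf(a * x, k)) for x in xs
    )


xs = [0.1 * i for i in range(1, 101)]  # grid on (0, 10]
print(max_identity_gap(k=5, a=0.7, xs=xs))
```

The printed gap should be at the level of floating-point rounding error, which is consistent with the two sides being the same function of x.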

Acknowledgements

We thank Prof. A. Turkman (FCUL) for her helpful advice.

References

[1] M. Bland, Estimating mean and standard deviation from the sample size, three quartiles, minimum, and maximum, Int. J. Stat. Medical Res. 4 (2015), 57–64. doi:10.6000/1929-6029.2015.04.01.6

[2] J. Buescu and M. Casquilho, Estimating the standard deviation from sums of unequal size sample, XXIV Congresso da Sociedade Portuguesa de Estatística (XXIV Congress of the Portuguese Statistical Society) (2019), Porto (Portugal), 06–09 Nov.

[3] M. Casquilho, Confidence intervals by simulation, http://web.tecnico.ulisboa.pt/~mcasquilho/compute/qc/f-BagsSimulCI.php (2020a), accessed 11-Nov-2021.

[4] M. Casquilho, From gamma to chi2, http://web.tecnico.ulisboa.pt/~mcasquilho/compute/qc/f-BagsGammaChi2.php (2020b), accessed 11-Nov-2021.

[5] M. Casquilho, Simulate trucks, http://web.tecnico.ulisboa.pt/~mcasquilho/compute/qc/f-BagsSimultrucks.php (2020c), accessed 11-Nov-2021.

[6] M. Casquilho, Sums of unequal size samples: PE & CI, http://web.tecnico.ulisboa.pt/~mcasquilho/compute/qc/f-BagsSumUneqPECI.php (2020d), accessed 11-Nov-2021.

[7] M. Casquilho, Estimates and confidence intervals for μ & σ, http://web.tecnico.ulisboa.pt/~mcasquilho/compute/qc/f-BagsPECI.php (2021), accessed 11-Nov-2021.

[8] CISTI’2018, “Conferencia Ibérica de Sistemas y Tecnologías de Información”, Iberian Conference on Information Systems and Technology, Cáceres, Spain, June 2018, http://cisti.eu/ (2018), accessed 01-Mar-2018.

[9] B. M. Colosimo, Message from the editor, J. Qual. Technol. 51 (2019), 1–2. doi:10.1080/00224065.2019.1569896

[10] D. W. Dunnett, Pairwise multiple comparisons in the homogeneous variance, unequal sample size case, J. Amer. Statist. Assoc. 75 (1980), no. 372, 789–795. doi:10.1080/01621459.1980.10477551

[11] Gnuplot, http://gnuplot.info/ (2019), accessed 11-Nov-2021.

[12] G. Goswami and J. S. Liu, On learning strategies for evolutionary Monte Carlo, Stat. Comput. 17 (2007), 23–28. doi:10.1007/s11222-006-9002-y

[13] M. S. Hamada, E. Kelly and T. Buxton, Understanding the rule of 7: Statistical properties for various sample sizes, Qual. Eng. 26 (2014), 285–289. doi:10.1080/08982112.2013.805780

[14] S. Hido, H. Kashima and Y. Takahashi, Roughly balanced bagging for imbalanced data, Stat. Anal. Data Min. 2 (2009), 412–426. doi:10.1137/1.9781611972788.13

[15] S. P. Hozo, B. Djulbegovic and I. Hozo, Estimating the mean and variance from the median, range, and the size of a sample, BMC Med. Res. Methodol. 5 (2005), Paper No. 13. doi:10.1186/1471-2288-5-13

[16] Indiana University, Common HTML error codes, https://kb.iu.edu/d/bfrc (2019), accessed 11-Nov-2021.

[17] Interactive Tools, TestScience.org, https://testscience.org/interactive-tools/ (2019), accessed 11-Nov-2021.

[18] Keisan Online Calculator, “Gamma distribution (percentile) Calculator”, https://keisan.casio.com/exec/system/1180573218, and “Chi-square distribution (percentile) Calculator”, https://keisan.casio.com/exec/system/1180573197 (2020), both accessed 11-Nov-2021.

[19] K. Kim and M. R. Reynolds, Jr., Multivariate monitoring using an mewma control chart with unequal sample sizes, J. Qual. Technol. 37 (2005), no. 4, 267–281. doi:10.1080/00224065.2005.11980330

[20] D. Kwon and I. Reis, Simulation-based estimation of mean and standard deviation for meta-analysis via approximate Bayesian computation (ABC), BMC Med. Res. Methodol. 15 (2015), Paper No. 61. doi:10.1186/s12874-015-0055-5

[21] T. K. Mak, Estimating variances for all sample sizes by the bootstrap, Comput. Statist. Data Anal. 46 (2004), 459–467. doi:10.1016/j.csda.2003.08.004

[22] Matlab (Mathworks), gampdf, https://www.mathworks.com/help/stats/gampdf.html (2019), accessed 11-Nov-2021.

[23] R. Mead, A quick method of estimating the standard deviation, Biometrika 53 (1966), no. 3–4, 559–564. doi:10.1093/biomet/53.3-4.559

[24] D. C. Montgomery, Introduction to Statistical Quality Control, 7th ed., John Wiley & Sons, Hoboken, 2013.

[25] I. Ninan, O. Arancio and D. Rabinowitz, Estimation of the mean from sums with unknown numbers of summands, Biometrics 62 (2006), 918–920. doi:10.1111/j.1541-0420.2005.00518.x

[26] NIST/Sematech e-Handbook of Statistical Methods, “Engineering Statistics Handbook”, section 1.3.6.6.11. Gamma Distribution, http://www.itl.nist.gov/div898/handbook/ (2013), accessed 11-Nov-2021.

[27] PHP, PHP Group, https://www.php.net/ (2019), accessed 11-Nov-2021.

[28] V. M. Ponce, Visualab, http://visualab.sdsu.edu/online_calc.php (2002), accessed 11-Nov-2021.

[29] Power and Sample Size.com, HyLown Consulting, Atlanta, GA (USA), http://powerandsamplesize.com/ (2013), accessed 11-Nov-2021.

[30] R, The R Project for Statistical Computing, https://www.r-project.org/ (2019), accessed 11-Nov-2021.

[31] P. H. Ramsey and P. P. Ramsey, Power of pairwise comparisons in the equal variance and unequal sample size case, British J. Math. Stat. Psychol. 61 (2008), 115–131. doi:10.1348/000711006X153051

[32] S. M. Ross, Probability and Statistics for Engineers and Scientists, 4th ed., Elsevier Academic, London, 2009.

[33] SAS Institute Inc., https://www.sas.com/ (2020), accessed 11-Nov-2021.

[34] D. J. Shahar, Minimizing the variance of a weighted average, Open J. Stat. 7 (2017), 216–224. doi:10.4236/ojs.2017.72017

[35] B. J. Smucker, E. del Castillo and J. L. Rosenberger, Model-robust designs for split-plot experiments, Comput. Statist. Data Anal. 56 (2012), 4111–4121. doi:10.1016/j.csda.2012.03.010

[36] StatPages project, (2020), accessed 11-Nov-2021.

[37] D. J. Torres, Describing the Pearson R distribution of aggregate data, Monte Carlo Methods Appl. 26 (2020), 17–32. doi:10.1515/mcma-2020-2054

[38] T. Vesala, U. Rannik, M. Leclerc, T. Foken and K. Sabelfeld, Flux and concentration footprints, Agricultural Forest Meteorol. 127 (2004), no. 3–4, 111–116. doi:10.1016/j.agrformet.2004.07.007

[39] R. E. Walpole, R. H. Myers, S. L. Myers and K. Ye, Probability & Statistics for Engineers & Scientists, 9th ed., Prentice-Hall, Boston, 2012.

[40] X. Wan, W. Wang, J. Liu and T. Tong, Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range, BMC Med. Res. Methodol. 14 (2014), Paper No. 135. doi:10.1186/1471-2288-14-135

[41] M. D. Wilkinson, Comment: The FAIR guiding principles for scientific data management and stewardship, Sci. Data 3 (2016), Article ID 160018.

[42] World Wide Web Consortium, HTML 4.01 Specification, 17. Forms, https://www.w3.org/TR/html4/interact/forms.html (2018), accessed 11-Nov-2021.

Received: 2021-11-28

Revised: 2022-07-25

Accepted: 2022-07-26

Published Online: 2022-08-04

Published in Print: 2022-09-01
