Abstract
Inspired by Diakonikolas and Kane (2016), we reduce the class of problems consisting of testing whether an unknown distribution over [n] equals a fixed distribution to the special case in which the fixed distribution is uniform over [n]. Our reduction preserves the parameters of the problem, which are n and the proximity parameter \(\epsilon >0\), up to a constant factor.
While this reduction yields no new bounds on the sample complexity of either problem, it provides a simple way of obtaining testers for equality to arbitrary fixed distributions from testers for equality to the uniform distribution. The reduction first reduces the general case to the case of “grained distributions” (in which all probabilities are multiples of \(\varOmega (1/n)\)), and then reduces this case to the case of the uniform distribution. Using grained distributions as a pivot of the exposition, we call attention to this natural class.
Notes
- 1.
As an anecdote, we mention that, in the course of their research, Goldreich, Goldwasser, and Ron considered the feasibility of testing properties of distributions; but, being in the mindset that focused on complexity that is polylogarithmic in the size of the object (see discussion in [9, Sec. 1.4]), they found no appealing example and did not report these thoughts in their paper [11].
- 2.
Testing equality to \(U_n\) is implicit in a test of the distribution of the endpoint of a relatively short random walk on a bounded-degree graph.
- 3.
See further discussion in Sect. 3.4.
- 4.
This may happen if and only if the support of \(q:[n]\rightarrow [0,1]\) is a strict subset of [n] (equiv., if \(m_i=0\) for some \(i\in [n]\)). Specifically, for every \(X\in [n]\), the support of \(F_q(X)\) equals \( S''\,{\mathop {=}\limits ^\mathrm{def}}\, S\cup \{{\langle {i,0}\rangle }:i\in [n] \& q(i)\!=\!0\} \subseteq S'\), whereas \(|S''|=m+|\{i\in [n]:q(i)\!=\!0\}|\).
- 5.
- 6.
Consider, for example, the case that \(q(i)=0.4\gamma /n\) if \(i\in [0.5n]\) and \(q(i)=(2\,-\,0.4\gamma )/n\) otherwise, and any distribution X such that \(\mathbf{Pr}[X\!=\!i]<\gamma /n\) if \(i\in [0.5n]\) and \(\mathbf{Pr}[X\!=\!i]=q(i)\) otherwise. Then, each of these possible X’s will be mapped by F to the same distribution, although such distributions may be \(0.1\gamma \)-far from the distribution associated with q.
- 7.
Typically, \(n'=n+1\). Recall that \(n'=n\) if and only if D itself is 6n-grained, in which case the reduction is not needed anyhow.
- 8.
- 9.
Like in Footnote 8, we note that Valiant and Valiant [16] stated this result for the “relative earthmover distance” (REMD) and commented that the total variation distance up to relabelling is upper-bounded by REMD. This claim appears as a special case of [18, Fact 1] (using \(\tau =0\)), and a detailed proof appears in [13].
- 10.
The constant 0.499 stands for an arbitrarily large constant that is smaller than 0.5. Recall that the definition of \(\delta \)-far mandates that the relevant distance be greater than \(\delta \).
- 11.
Specifically, \(t\ge 2\) since \(2m>n\), whereas \(t=O(1)\) and \(m'=\varOmega (n)\) since \(m=O(n)\).
- 12.
Otherwise, the following description reduces the problem of Theorem 4.3 to a testing problem regarding \((t\cdot {\lfloor m/t\rfloor })\)-grained distributions. In this case, we reduce the latter testing problem to one regarding m-grained distributions (e.g., by using a filter that maps each \(i\in [n]\) to itself with probability \(t\cdot {\lfloor m/t\rfloor }/m\) and maps it to n otherwise).
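The filter mentioned in this footnote can be sketched as follows; the name `grain_adjust_filter` and the use of the symbol n as the designated extra element are ours, not the chapter's notation.

```python
import random
from fractions import Fraction

def grain_adjust_filter(i, n, m, t, rng=random):
    """Randomized filter sketch: keep element i with probability
    t*floor(m/t)/m, and map it to the extra element n otherwise."""
    keep = Fraction(t * (m // t), m)  # equals 1 whenever t divides m
    return i if rng.random() < keep else n
```

When t divides m the keep-probability is 1 and the filter acts as the identity; otherwise a small amount of mass, at most \((m - t\cdot {\lfloor m/t\rfloor })/m\), is diverted to the extra element.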
- 13.
Specifically, let \(q:[n]\rightarrow [0,2)\) be the function resulting from the first step (i.e., \(q(i)={\lfloor m\cdot p(i)\rfloor }/m\) if \(r(i)\le 1/2m\) and \(q(i)={\lceil m\cdot p(i)\rceil }/m\) otherwise). Then, \(\delta \,{\mathop {=}\limits ^\mathrm{def}}\,\sum _{i\in [n]}|q(i)-p(i)|=\sum _{i\in [n]}\min (r(i),(1/m)-r(i))\) and \(|1-\sum _{i\in [n]}q(i)|\le \delta \), since \(\left| \sum _{i\in [n]}q(i)-\sum _{i\in [n]}p(i)\right| \le \sum _{i\in [n]}|q(i)-p(i)|\).
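The rounding step described in this footnote can be checked numerically. The sketch below (exact arithmetic via `fractions`; helper name ours) rounds each p(i) down or up as described and verifies that the resulting q sums to 1 up to the rounding cost \(\delta \).

```python
from fractions import Fraction

def round_toward_grained(p, m):
    """First rounding step: q(i) = floor(m*p(i))/m when the remainder
    r(i) is at most 1/(2m), and q(i) = ceil(m*p(i))/m otherwise.
    Returns q and delta = sum_i min(r(i), 1/m - r(i))."""
    q, delta = [], Fraction(0)
    for pi in p:
        r = pi - Fraction(int(pi * m), m)      # r(i) = p(i) - floor(m p(i))/m
        if r <= Fraction(1, 2 * m):
            qi = Fraction(int(pi * m), m)      # round down
        else:
            qi = Fraction(int(pi * m) + 1, m)  # round up
        q.append(qi)
        delta += min(r, Fraction(1, m) - r)    # per-element rounding cost
    return q, delta

p = [Fraction(3, 10), Fraction(7, 10)]
q, delta = round_toward_grained(p, m=4)
assert abs(1 - sum(q)) <= delta                # the footnote's final claim
```

Since rounding to the nearest multiple of 1/m costs exactly \(\min (r(i),(1/m)-r(i))\) per element, `delta` coincides with \(\sum _{i\in [n]}|q(i)-p(i)|\).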
- 14.
Specifically, letting \(\zeta _i=\zeta _i(f)\) denote the contribution of \(i\in H\) to \(\sum _{i\in H_f}s(i)\), we have \(\mathbb E[\zeta _i]\ge 0.9\cdot s(i)\) and \(\mathbb V[\zeta _i]\le \mathbb E[\zeta _i^2]\le s(i)^2\le s(i)/2m\). Hence, by Chebyshev’s Inequality, \(\mathbf{Pr}\left[ \sum _{i\in H}\zeta _i\le 0.2\delta \right] < \frac{\delta /2m}{(0.45\delta \,-\,0.2\delta )^2}\), since \(\mathbb V\left[ \sum _{i\in H}\zeta _i\right] \le \delta /2m\) and \(\mathbb E\left[ \sum _{i\in H}\zeta _i\right] \ge 0.9\cdot 0.5\delta \). This suffices for \(\delta =\omega (1/m)\). Actually, the same argument holds if \(\sum _{i\in H}s(i)^2=o(\delta ^2)\); the argument for the general case follows.
In general (esp., if \(\sum _{i\in H}s(i)^2=\varOmega (\delta ^2)\)), for a sufficiently small \(c'>0\), we define \(H'\,{\mathop {=}\limits ^\mathrm{def}}\,\{i\in H:s(i)\ge c'\cdot \delta \}\), and consider two cases.
- (a)
If \(\sum _{i\in H\setminus H'}s(i)>0.3\cdot \delta \), then we use \(H\setminus H'\) instead of H, while noting that \(\mathbf{Pr}\left[ \sum _{i\in H\setminus H'}\zeta _i\le 0.2\delta \right]< \frac{c'\,\cdot \,\delta ^2}{(0.07\delta )^2}<c\), since \(\mathbb E\left[ \sum _{i\in H\setminus H'}\zeta _i\right] >0.9\cdot 0.3\delta \) and \(\mathbb V[\sum _{i\in H\setminus H'}\zeta _i] \le \sum _{i\in H\setminus H'}s(i)^2\le c'\delta \cdot \delta \).
- (b)
If \(\sum _{i\in H'}s(i)>0.2\cdot \delta \), then we use \(H'\) instead of H, while noting that the probability that \(|f(H')|<|H'|\) is at most \(\binom{|H'|}{2}\cdot (1/k)\le \binom{1/c'}{2}\cdot (1/k)<c\), where the last inequality holds for sufficiently large k (i.e., \(m=c\cdot k>(1/c')^2\) suffices).
(We proceed with H replaced by either \(H'\) or \(H\setminus H'\).)
References
Batu, T., Fischer, E., Fortnow, L., Kumar, R., Rubinfeld, R., White, P.: Testing random variables for independence and identity. In: 42nd FOCS, pp. 442–451 (2001)
Batu, T., Fortnow, L., Rubinfeld, R., Smith, W.D., White, P.: Testing that distributions are close. In: 41st FOCS, pp. 259–269 (2000)
Canonne, C.L.: A survey on distribution testing: your data is big. But is it blue? In: ECCC, TR15-063 (2015)
Chan, S., Diakonikolas, I., Valiant, P., Valiant, G.: Optimal algorithms for testing closeness of discrete distributions. In: 25th ACM-SIAM Symposium on Discrete Algorithms, pp. 1193–1203 (2014)
Diakonikolas, I., Kane, D.: A new approach for testing properties of discrete distributions. arXiv:1601.05557 [cs.DS] (2016)
Diakonikolas, I., Kane, D., Nikishkin, V.: Testing identity of structured distributions. In: 26th ACM-SIAM Symposium on Discrete Algorithms, pp. 1841–1854 (2015)
Diakonikolas, I., Gouleakis, T., Peebles, J., Price, E.: Collision-based testers are optimal for uniformity and closeness. In: ECCC, TR16-178 (2016)
Goldreich, O.: Introduction to property testing: lecture notes. Superseded by [9]. Drafts are available from the author’s web-page
Goldreich, O.: Introduction to Property Testing. Cambridge University Press, Cambridge (2017)
Goldreich, O.: On the optimal analysis of the collision probability tester (an exposition). This volume
Goldreich, O., Goldwasser, S., Ron, D.: Property testing and its connection to learning and approximation. J. ACM 45, 653–750 (1998). Extended abstract in 37th FOCS, 1996
Goldreich, O., Ron, D.: On testing expansion in bounded-degree graphs. In: ECCC, TR00-020, March 2000
Goldreich, O., Ron, D.: On the relation between the relative earth mover distance and the variation distance (an exposition). This volume
Paninski, L.: A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Trans. Inf. Theory 54, 4750–4755 (2008)
Parnas, M., Ron, D., Rubinfeld, R.: Tolerant property testing and distance approximation. J. Comput. Syst. Sci. 72(6), 1012–1042 (2006)
Valiant, G., Valiant, P.: Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In: 43rd ACM Symposium on the Theory of Computing, pp. 685–694 (2011)
Valiant, G., Valiant, P.: Instance-by-instance optimal identity testing. In: ECCC, TR13-111 (2013)
Valiant, G., Valiant, P.: Instance optimal learning. CoRR abs/1504.05321 (2015)
Appendix: Reducing Testing m-Grained Distributions (over [n]) to the Case of \(n=O(m)\)
Recall that Corollary 4.2 asserts that for every \(n,m\in \mathbb N\), the set of m-grained distributions over [n] has a tester of sample complexity \(O(\epsilon ^{-2}\cdot n/\log n)\). As commented in the main text, we believe that using the techniques of [16] one can reduce the complexity to \(O(\epsilon ^{-2}\cdot n'/\log n')\), where \(n'=\min (n,m)\). Here we show an alternative proof of this result. Specifically, we shall reduce \(\epsilon \)-testing m-grained distributions over [n] to \(\varOmega (\epsilon )\)-testing m-grained distributions over [O(m)], and apply Corollary 4.2.
The reduction will consist of using a deterministic filter \(f:[n]\rightarrow [k]\), where \(k=O(m)\), that is selected uniformly at random among all such functions. We stress that this is fundamentally different from the randomized filters F used in the main text. Specifically, when applying F several times to the same input, we obtain outcomes that are independently and identically distributed, whereas when we apply a function f (which was selected at random once and for all) several times to the same input, we obtain the same output each time.
Note that applying any function \(f:[n]\rightarrow [k]\) to any m-grained distribution yields an m-grained distribution. Our main result is that, for any distribution X over [n] that is \(\epsilon \)-far from being m-grained, for almost all functions \(f:[n]\rightarrow [O(m)]\), the distribution f(X) is \(\varOmega (\epsilon )\)-far from being m-grained.
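As a quick sanity check of this fact (helper names ours), pushing an m-grained distribution through any function f leaves it m-grained, since each image probability is a sum of multiples of 1/m:

```python
import random
from fractions import Fraction

def is_m_grained(p, m):
    """All probabilities are integer multiples of 1/m."""
    return all((pi * m).denominator == 1 for pi in p)

def push_forward(p, f, k):
    """Distribution of f(X) over [k] when X has distribution p over [n];
    f is given as a list with f[i] in {0,...,k-1}."""
    q = [Fraction(0)] * k
    for i, pi in enumerate(p):
        q[f[i]] += pi
    return q

p = [Fraction(2, 6), Fraction(1, 6), Fraction(3, 6), Fraction(0)]  # 6-grained
f = [random.randrange(2) for _ in range(len(p))]                   # random f: [4] -> [2]
assert is_m_grained(push_forward(p, f, 2), 6)
```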
Lemma A.1
(relative preservation of distance from m-grained distributions): For all sufficiently small \(c>0\) and all sufficiently large n and m, the following holds. If a distribution X over [n] is \(\epsilon \)-far from being m-grained, then, with probability at least \(1-36c\) over the choice of a function \(f:[n]\rightarrow [m/c]\), the distribution f(X) is \(0.02\cdot \epsilon \)-far from being m-grained.
Hence, we obtain a randomized reduction of the general problem of testing m-grained distributions (over [n]) to the special case of \(n=O(m)\), where the reduction consists of selecting at random a function \(f:[n]\rightarrow [m/c]\) and using it as a (deterministic) filter for reducing the general problem to its special case.
Proof:
Let \(k=m/c\) and let \(p:[n]\rightarrow [0,1]\) denote the probability function that describes X. Define \(r:[n]\rightarrow [0,1/m)\) such that \(r(i)=p(i)-{\lfloor m\cdot p(i)\rfloor }/m\). Denoting by \(\varDelta _G(p)\) the statistical distance between p and the set of m-grained distributions (i.e., half the norm-1 distance), we have
\(\varDelta _G(p) \;\ge \; \frac{1}{2}\cdot \sum _{i\in [n]}\min (r(i),(1/m)-r(i))\)    (4)
\(\varDelta _G(p) \;\le \; \sum _{i\in [n]}\min (r(i),(1/m)-r(i))\)    (5)
where Eq. (4) is due to the need to transform each p(i) into a multiple of 1/m, and Eq. (5) is justified by a two-step correction process in which we first round each p(i) to the closest multiple of 1/m, and then correct the resulting function so that it sums to 1 (while keeping its values as multiples of 1/m); see Footnote 13. Hence, using Eq. (5), the lemma’s hypothesis implies that \(\sum _{i\in [n]}\min (r(i),(1/m)-r(i)) > \epsilon \). We shall prove the lemma by lower-bounding (w.h.p.) the corresponding sum for the distribution f(X), when f is selected at random. Specifically, letting \(p'(j)=\sum _{i:f(i)=j}p(i)\), we shall lower-bound the probability that \(\sum _{j\in [k]}\min (r'(j),(1/m)-r'(j))=\varOmega (\epsilon )\), where \(r'(j)=p'(j)-{\lfloor m\cdot p'(j)\rfloor }/m\), and then apply Eq. (4).
Before doing so, we introduce a few additional notations. Firstly, we let \(s(i)=\min (r(i),(1/m)-r(i))\), and let \(\delta =\sum _{i\in [n]}s(i)\), which is greater than \(\epsilon \) by the hypothesis. Next, we let \(H=\{i\in [n]:p(i)\ge 1/3m\}\) denote the set of “heavy” elements in X. We observe that \(|H|\le 3m\) and that for every \(i\in {\overline{H}}\,{\mathop {=}\limits ^\mathrm{def}}\,[n]\setminus H\) it holds that \(s(i)=r(i)=p(i)\), since \(p(i)<1/2m\) holds for every \(i\in {\overline{H}}\). We consider two cases, according to whether or not the sum \(\sum _{i\in {\overline{H}}}p(i)\) is smaller than \(0.5\cdot \delta \).
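In the proof's notation, the quantities s(i), \(\delta \), and the heavy set H can be computed directly; the sketch below (exact arithmetic, helper name ours) makes the definitions concrete. By Eqs. (4) and (5), \(\varDelta _G(p)\) lies between \(\delta /2\) and \(\delta \).

```python
from fractions import Fraction

def proof_stats(p, m):
    """s(i) = min(r(i), 1/m - r(i)), delta = sum_i s(i), and the heavy
    set H = {i : p(i) >= 1/(3m)}, as defined in the proof of Lemma A.1."""
    s, H = [], []
    for i, pi in enumerate(p):
        r = pi - Fraction(int(pi * m), m)     # r(i) = p(i) - floor(m p(i))/m
        s.append(min(r, Fraction(1, m) - r))  # distance to nearest multiple of 1/m
        if pi >= Fraction(1, 3 * m):
            H.append(i)
    return s, sum(s), H

s, delta, H = proof_stats([Fraction(3, 8), Fraction(5, 8)], m=4)
assert delta == Fraction(1, 4) and H == [0, 1]
```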
Claim A.1.1
(the first case): Suppose that \(\sum _{i\in {\overline{H}}}p(i)<0.5\cdot \delta \). Then, with probability at least \(1-16c\) over the choice of f, it holds that f(X) is \(0.05\epsilon \)-far from being m-grained.
Proof: In this case \(\sum _{i\in H}s(i) > 0.5\cdot \delta \), and we shall focus on the contribution of f(H) to the distance of f(X) from being m-grained. We shall show that, for almost all functions f, much of this weight is mapped (by f) in a one-to-one manner, and that the elements in \(\overline{H}\) do not change by much the weight mapped by f to f(H). Specifically, we consider a uniformly selected function \(f:[n]\rightarrow [k]\), and the following two good events defined on this probability space.
- 1.
The first (good) event is that the function f maps at least \(0.2\delta \) of the s(i)-mass of the i’s in H to distinct images. Intuitively, this is very likely given that the total s(i)-mass of i’s in H is greater than \(0.5\delta \) and that \(|H|\ll k\). Formally, denoting by \(H_f\) the (random variable that represents the) set of \(i\in H\) that satisfy \(f(i)\not \in f(H\setminus \{i\})\) (i.e., for every \(i\in H_f\) it holds that \(f^{-1}(f(i))\cap H=\{i\}\)), we claim that \(\mathbf{Pr}_f\left[ \sum _{i\in H_f}s(i)> 0.2\delta \right] >1-c\).
To see this, we first note that, for every \(i\in H\), conditioned on the values assigned to \(H\setminus \{i\}\), the probability that \(f(i)\not \in f(H\setminus \{i\})\) is at least \(\frac{k\,-\,(|H|\,-\,1)}{k}>1\,-\,|H|/k\ge 0.9\), where the inequality is due to \(|H|\le 3m=3c\cdot k\le 0.1\cdot k\). Hence, each \(i\in H\) contributes \(s(i)\le 1/2m\) to the sum (of s(i)’s with \(i\in H_f\)) with probability at least 0.9, also when conditioned on all other values assigned by f. It follows that \(\mathbf{Pr}_f\left[ \sum _{i\in H_f}s(i)> 0.2\delta \right] >1-c\), where the (typical) case of \(\delta =\omega (1/m)\) is straightforward (see Footnote 14).
- 2.
The second (good) event is that the function f does not map much p(i)-mass of i’s in \({\overline{H}}\) to the images occupied by H. Again, this is very likely given that \(|H|\ll k\). Specifically, observe that \(\mathbb E_f\left[ \sum _{i\in {\overline{H}}:f(i)\in f(H)}p(i)\right] \le \frac{|H|}{k}\cdot \sum _{i\in {\overline{H}}}p(i) < 3c\cdot \delta /2\), since \(p(i)=s(i)\) for every \(i\in {\overline{H}}\) (and \(|H|\le 3m\) and \(k=m/c\)). Letting \(S_f=\sum _{i\in {\overline{H}}:f(i)\in f(H)}p(i)\), we get \(\mathbf{Pr}_f[S_f<0.1\delta ]>1-\frac{3c\delta /2}{0.1\delta }=1-15c\).
Assuming that the two good events occur (which happens with probability at least \(1-16c\)), it follows that at least \(0.2\delta \) of the \(s(\cdot )\)-mass of H is mapped by f to distinct images and at most \(0.1\delta \) of the mass of \({\overline{H}}\) is mapped to these images. Hence, f(X) corresponds to a probability function \(p'\) such that \(r'(i)\,{\mathop {=}\limits ^\mathrm{def}}\, p'(i)-{\lfloor m\cdot p'(i)\rfloor }/m\) satisfies
\(\sum _{j\in f(H_f)}\min (r'(j),(1/m)-r'(j)) \;\ge \; \sum _{i\in H_f}s(i)-\sum _{i\in {\overline{H}}:f(i)\in f(H)}p(i) \;>\; 0.2\delta -0.1\delta \;=\; 0.1\delta ,\)
where \(H_f=\{i\in H:f^{-1}(f(i))\cap H=\{i\}\}\) (as above). Hence, recalling that \(\delta >\epsilon \) and using Eq. (4), with probability at least \(1-16c\) over the choice of f, it holds that f(X) is \(0.05\epsilon \)-far from being m-grained. \(\square \)
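The first good event can also be probed empirically. The Monte-Carlo sketch below (names and parameters ours) estimates how often a uniformly random f maps more than \(0.2\delta \) of the s-mass of H to images used by no other element of H; with \(|H|\ll k\), this should happen in almost all trials.

```python
import random

def injectively_mapped_mass(s, k, trials=2000, seed=0):
    """Fraction of trials in which a uniformly random f: H -> [k] maps
    more than 0.2*delta of the s-mass of H = {0,...,len(s)-1} to
    distinct images (the first good event in Claim A.1.1)."""
    rng = random.Random(seed)
    H = range(len(s))
    delta = sum(s)
    good = 0
    for _ in range(trials):
        f = [rng.randrange(k) for _ in H]
        # mass carried by elements whose image is not shared within H
        mass = sum(s[i] for i in H if f.count(f[i]) == 1)
        good += mass > 0.2 * delta
    return good / trials

# with |H| = 30 and k = 3000, collisions within H are rare
assert injectively_mapped_mass([0.01] * 30, k=3000) > 0.9
```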
Claim A.1.2
(the second case): Suppose that \(\delta '\,{\mathop {=}\limits ^\mathrm{def}}\,\sum _{i\in {\overline{H}}}p(i)\ge 0.5\cdot \delta \). Then, with probability at least \(1-36c\) over the choice of f, it holds that f(X) is \(0.02\epsilon \)-far from being m-grained.
Proof: In this case \(\sum _{i\in {\overline{H}}}s(i) > 0.5\cdot \delta \), and we shall focus on the contribution of \(f({\overline{H}})\) to the distance of f(X) from being m-grained. We shall show that, for almost all functions f, much of this weight is mapped (by f) to \([k]\setminus f(H)\), and that the mass of the elements of \(\overline{H}\) is distributed almost uniformly. Specifically, we first show that more than half of the probability mass of \(\overline{H}\) is mapped disjointly of H. That is,
\(\mathbf{Pr}_f\left[ \sum _{i\in {\overline{H}}:f(i)\not \in f(H)}p(i) \;>\; 0.5\cdot \delta '\right] \;\ge \; 1-6c,\)    (6)
where the probability is taken uniformly over all possible choices of f. The proof is similar to the analysis of the second event in the proof of Claim A.1.1. Specifically, we consider random variables \(\zeta _i\)’s such that \(\zeta _i=p(i)\) if \(f(i)\not \in f(H)\) and \(\zeta _i=0\) otherwise, and observe that \(\mathbb E[\zeta _i]\ge \frac{k\,-\,|H|}{k}\cdot p(i)\ge (1-3c)\cdot p(i)\) (since \(|H|\le 3m\) and \(m=ck\)). Thus, \(\mathbb E\left[ \sum _{i\in {\overline{H}}}\zeta _i\right] \ge (1-3c)\cdot \delta '\), and Eq. (6) follows by Markov’s Inequality while using \(\sum _{i\in {\overline{H}}}\zeta _i\le \sum _{i\in {\overline{H}}}p(i)=\delta '\). This holds also if we fix the values of f on H and condition on them, which is what we do from this point on. Hence, we fix an arbitrary sequence of values for f(H), and consider the uniform distribution of f conditioned on this fixing as well as on the event in Eq. (6).
Actually, we decompose \(f:[n]\rightarrow [k]\) into three parts, denoted \(f',f''\) and \(f'''\), that represent its restrictions to the three-way partition of [n] into (H, B, G) such that \(B=\{i\in {\overline{H}}:f(i)\in f(H)\}\) (and \(G=\{i\in {\overline{H}}:f(i)\not \in f(H)\}\)); indeed, \(f':H\rightarrow [k]\) is the restriction of f to H, whereas \(f'':B\rightarrow f(H)\) and \(f''':G\rightarrow [k]\setminus f(H)\) are its restrictions to the two parts of \(\overline{H}\). We fix arbitrary \(f':H\rightarrow [k]\) and \(f'':B\rightarrow f'(H)\), where \(B=\{i\in {\overline{H}}:f''(i)\in f'(H)\}\), such that \(\sum _{i\in B} p(i)<0.5\delta '\), while bearing in mind that such a fixing (of \(f'\) and \(f''\)) arises from the choice of a random f with probability at least \(1-6c\). Our aim will be to show that, with high probability over the choice of \(f''':G\rightarrow [k]\setminus f(H)\), it holds that
\(\sum _{i\in G:f'''(i)\in J(f''')}p(i) \;>\; 0.4\cdot \delta ',\)    (7)
where \(J(f''')\,{\mathop {=}\limits ^\mathrm{def}}\,\{j\in [k]:\sum _{i\in G:f'''(i)=j}p(i)\le 0.8/m\}\). (Recall that for any \(i\in G\subseteq {\overline{H}}\) it holds that \(p(i)=r(i)=s(i)<1/3m\).) This would imply that, with high probability, the distance of f(X) from being m-grained is at least
\(\frac{1}{2}\cdot \sum _{j\in J(f''')}\min (p'(j),(1/m)-p'(j)) \;\ge \; \frac{1}{2}\cdot \sum _{j\in J(f''')}\frac{p'(j)}{4} \;>\; \frac{0.4\cdot \delta '}{8} \;=\; 0.05\cdot \delta ',\)
where the first inequality is due to the fact that \(p'(j)\,{\mathop {=}\limits ^\mathrm{def}}\,\mathbf{Pr}[f(X)\!=\!j]=\sum _{i\in G:f'''(i)=j}p(i)\le 0.8/m\) for every \(j\in J(f''')\) and so \(\min (p'(j),0.2/m)\ge p'(j)/4\). So all that remains is to show that Eq. (7) holds with high probability over the choice of \(f'''\).
Letting \(K'\,{\mathop {=}\limits ^\mathrm{def}}\,[k]\setminus f(H)\), we start by observing that, for every \(i\in G\), it holds that
\(\mathbf{Pr}_{f'''}\left[ f'''(i)\not \in J(f''')\right] \;=\; \mathbb E_{f'''}\left[ \frac{|\{j\in K':\sum _{i'\in G\setminus \{i\}:f'''(i')=j}p(i')>(0.8/m)-p(i)\}|}{|K'|}\right] ,\)    (8)
where the equality can be seen by first fixing \(f'''\)-values for all elements in \(G\setminus \{i\}\) and then selecting \(f'''(i)\) uniformly in \(K'\). Upper-bounding the size of the set in Eq. (8) by \((1.4/3m)^{-1}\), and using \(m=ck\) and \(|K'|\ge k-3m\), we get
\(\mathbf{Pr}_{f'''}\left[ f'''(i)\not \in J(f''')\right] \;\le \; \frac{(1.4/3m)^{-1}}{k-3m} \;=\; \frac{3c}{1.4\cdot (1-3c)} \;\le \; 3c,\)
where the last inequality presupposes \(1.4\cdot (1-3c)\ge 1\) (equiv., \(c \le 2/21\)). It follows that
\(\mathbb E_{f'''}\left[ \sum _{i\in G:f'''(i)\not \in J(f''')}p(i)\right] \;\le \; 3c\cdot \delta ',\)
since \(\sum _{i\in G}p(i)\le \sum _{i\in {\overline{H}}}p(i)=\delta '\). Hence, by Markov’s Inequality,
\(\mathbf{Pr}_{f'''}\left[ \sum _{i\in G:f'''(i)\not \in J(f''')}p(i) \;\ge \; 0.1\cdot \delta '\right] \;\le \; \frac{3c\cdot \delta '}{0.1\cdot \delta '} \;=\; 30c.\)
Recalling that \(\sum _{i\in B} p(i)<0.5\delta '\), which implies \(\sum _{i\in G} p(i)>0.5\delta '\), this implies that Eq. (7) holds with probability at least \(1-30c\) (over the choice of \(f'''\)).
Lastly, recall that \(\sum _{i\in B} p(i)<0.5\delta '\), where \(B=\{i\in {\overline{H}}:f''(i)\in f'(H)\}\), holds with probability at least \(1-6c\) (over the choice of \(f'\) and \(f''\)). The claim follows, since (as argued above) Eq. (7) implies that \(\sum _{j\in J(f''')}\min (p'(j),(1/m)-p'(j))>0.1\delta '\) (whereas using \(\delta '\ge \delta /2\ge \epsilon /2\) and Eq. (4), it follows that f(X) is \(0.02\epsilon \)-far from being m-grained). \(\square \)
Combining Claims A.1.1 and A.1.2, the lemma follows. \(\blacksquare \)
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this chapter
Goldreich, O. (2020). The Uniform Distribution Is Complete with Respect to Testing Identity to a Fixed Distribution. In: Goldreich, O. (eds) Computational Complexity and Property Testing. Lecture Notes in Computer Science(), vol 12050. Springer, Cham. https://doi.org/10.1007/978-3-030-43662-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43661-2
Online ISBN: 978-3-030-43662-9