
The Uniform Distribution Is Complete with Respect to Testing Identity to a Fixed Distribution

Computational Complexity and Property Testing

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12050)

Abstract

Inspired by Diakonikolas and Kane (2016), we reduce the class of problems consisting of testing whether an unknown distribution over [n] equals a fixed distribution to the special case in which the fixed distribution is uniform over [n]. Our reduction preserves the parameters of the problem, which are n and the proximity parameter \(\epsilon >0\), up to a constant factor.

While this reduction yields no new bounds on the sample complexity of either problem, it provides a simple way of obtaining testers for equality to arbitrary fixed distributions from testers for equality to the uniform distribution. The reduction first reduces the general case to the case of “grained distributions” (in which all probabilities are multiples of \(\varOmega (1/n)\)), and then reduces this case to the case of the uniform distribution. Using grained distributions as a pivot of the exposition, we call attention to this natural class.
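To give a feel for the second step, the following minimal Python sketch renders the “flattening” filter \(F_q\) alluded to above and in Footnote 4 (the helper name make_filter is ours, and the construction is stated under the assumption that q is m-grained, i.e., \(q(i)=m_i/m\) for integers \(m_i\)): a sample i is mapped to a uniformly chosen pair \({\langle {i,j}\rangle }\) with \(j\in [m_i]\), so that if X is distributed according to q, then \(F_q(X)\) is uniform over the m-element set \(S=\{{\langle {i,j}\rangle }:i\in [n], j\in [m_i]\}\).

```python
import random

def make_filter(q, m):
    """Hedged sketch of the filter F_q for an m-grained q with
    q(i) = m_i/m: map a sample i to a pair (i, j) with j uniform in
    {1, ..., m_i}; elements with q(i) = 0 are mapped to (i, 0), as in
    Footnote 4."""
    mi = [round(m * qi) for qi in q]  # the integer multipliers m_i
    def F(i):
        return (i, random.randint(1, mi[i])) if mi[i] > 0 else (i, 0)
    return F

# Example: q = (1/2, 1/4, 1/4) is 4-grained, and F_q flattens it to the
# uniform distribution over the four pairs (0,1), (0,2), (1,1), (2,1).
F_q = make_filter([0.5, 0.25, 0.25], 4)
```

When q has full support, this map preserves the total variation distance to q exactly (each \(p(i)\) is split evenly among the \(m_i\) pairs), which is what allows a uniformity tester to be run on \(F_q(X)\).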


Notes

  1. As an anecdote, we mention that, in the course of their research, Goldreich, Goldwasser, and Ron considered the feasibility of testing properties of distributions; but, being in a mindset that focused on complexity that is polylogarithmic in the size of the object (see discussion in [9, Sec. 1.4]), they found no appealing example and did not report on these thoughts in their paper [11].

  2. Testing equality to \(U_n\) is implicit in a test of the distribution of the endpoint of a relatively short random walk on a bounded-degree graph.

  3. See further discussion in Sect. 3.4.

  4. This may happen if and only if the support of \(q:[n]\rightarrow [0,1]\) is a strict subset of [n] (equiv., if \(m_i=0\) for some \(i\in [n]\)). Specifically, for every full-support distribution X over [n], the support of \(F_q(X)\) equals \( S''\,{\mathop {=}\limits ^\mathrm{def}}\, S\cup \{{\langle {i,0}\rangle }:i\in [n] \& q(i)\!=\!0\} \subseteq S'\), whereas \(|S''|=m+|\{i\in [n]:q(i)\!=\!0\}|\).

  5. Recall that the alternatives include the tests of [14] and [4] or the collision probability test (of [12]), per its improved analysis in [7, 10].

  6. Consider, for example, the case that \(q(i)=0.4\gamma /n\) if \(i\in [0.5n]\) and \(q(i)=(2-0.4\gamma )/n\) otherwise, and any distribution X such that \(\mathbf{Pr}[X\!=\!i]<\gamma /n\) if \(i\in [0.5n]\) and \(\mathbf{Pr}[X\!=\!i]=q(i)\) otherwise. Then, each of these possible X’s will be mapped by F to the same distribution, although such distributions may be \(0.1\gamma \)-far from the distribution associated with q.

  7. Typically, \(n'=n+1\). Recall that \(n'=n\) if and only if D itself is 6n-grained, in which case the reduction is not needed anyhow.

  8. Valiant and Valiant [16] stated this result for the “relative earthmover distance” (REMD) and commented that the total variation distance up to relabelling is upper-bounded by REMD. This claim appears as a special case of [18, Fact 1] (using \(\tau =0\)), and a detailed proof appears in [13].

  9. As in Footnote 8, we note that Valiant and Valiant [16] stated this result for the “relative earthmover distance” (REMD) and commented that the total variation distance up to relabelling is upper-bounded by REMD. This claim appears as a special case of [18, Fact 1] (using \(\tau =0\)), and a detailed proof appears in [13].

  10. The constant 0.499 stands for an arbitrary constant that is smaller than 0.5 (but may be arbitrarily close to it). Recall that the definition of \(\delta \)-far mandates that the relevant distance be greater than \(\delta \).

  11. Specifically, \(t\ge 2\) since \(2m>n\), whereas \(t=O(1)\) and \(m'=\varOmega (n)\) since \(m=O(n)\).

  12. Otherwise, the following description reduces the problem of Theorem 4.3 to a testing problem regarding \((t\cdot {\lfloor m/t\rfloor })\)-grained distributions. In this case, we reduce the latter testing problem to one regarding m-grained distributions (e.g., by using a filter that maps each \(i\in [n]\) to itself with probability \(t\cdot {\lfloor m/t\rfloor }/m\) and maps it to n otherwise).

  13. Specifically, let \(q:[n]\rightarrow [0,2)\) be the function resulting from the first step (i.e., \(q(i)={\lfloor m\cdot p(i)\rfloor }/m\) if \(r(i)\le 1/2m\) and \(q(i)={\lceil m\cdot p(i)\rceil }/m\) otherwise). Then, \(\delta \,{\mathop {=}\limits ^\mathrm{def}}\,\sum _{i\in [n]}|q(i)-p(i)|=\sum _{i\in [n]}\min (r(i),(1/m)-r(i))\) and \(|1-\sum _{i\in [n]}q(i)|\le \delta \), since \(\left| \sum _{i\in [n]}q(i)-\sum _{i\in [n]}p(i)\right| \le \sum _{i\in [n]}|q(i)-p(i)|\).

  14. Specifically, letting \(\zeta _i=\zeta _i(f)\) denote the contribution of \(i\in H\) to \(\sum _{i\in H_f}s(i)\), we have \(\mathbb E[\zeta _i]\ge 0.9\cdot s(i)\) and \(\mathbb V[\zeta _i]\le \mathbb E[\zeta _i^2]\le s(i)^2\le s(i)/2m\). Hence, by Chebyshev’s Inequality, \(\mathbf{Pr}\left[ \sum _{i\in H}\zeta _i\le 0.2\delta \right] < \frac{\delta /2m}{(0.45\delta -0.2\delta )^2}\), since \(\mathbb V\left[ \sum _{i\in H}\zeta _i\right] \le \delta /2m\) and \(\mathbb E\left[ \sum _{i\in H}\zeta _i\right] \ge 0.9\cdot 0.5\delta \). This suffices for \(\delta =\omega (1/m)\). Actually, the same argument holds if \(\sum _{i\in H}s(i)^2=o(\delta ^2)\); the argument for the general case follows.

    In general (esp., if \(\sum _{i\in H}s(i)^2=\varOmega (\delta ^2)\)), for a sufficiently small \(c'>0\), we define \(H'\,{\mathop {=}\limits ^\mathrm{def}}\,\{i\in H:s(i)\ge c'\cdot \delta \}\), and consider two cases.

    (a) If \(\sum _{i\in H\setminus H'}s(i)>0.3\cdot \delta \), then we use \(H\setminus H'\) instead of H, while noting that \(\mathbf{Pr}\left[ \sum _{i\in H\setminus H'}\zeta _i\le 0.2\delta \right]< \frac{c'\cdot \delta ^2}{(0.07\delta )^2}<c\), since \(\mathbb E\left[ \sum _{i\in H\setminus H'}\zeta _i\right] >0.9\cdot 0.3\delta \) and \(\mathbb V[\sum _{i\in H\setminus H'}\zeta _i] \le \sum _{i\in H\setminus H'}s(i)^2\le c'\delta \cdot \delta \).

    (b) If \(\sum _{i\in H'}s(i)>0.2\cdot \delta \), then we use \(H'\) instead of H, while noting that the probability that \(|f(H')|<|H'|\) is at most \({{|H'|}\atopwithdelims ()2}\cdot (1/k)\le {{1/c'}\atopwithdelims ()2}\cdot (1/k)<c\), where the last inequality holds for sufficiently large k (i.e., \(m=c\cdot k>(1/c')^2\) suffices).

    (We proceed with H replaced by either \(H'\) or \(H\setminus H'\).)

References

  1. Batu, T., Fischer, E., Fortnow, L., Kumar, R., Rubinfeld, R., White, P.: Testing random variables for independence and identity. In: 42nd FOCS, pp. 442–451 (2001)

  2. Batu, T., Fortnow, L., Rubinfeld, R., Smith, W.D., White, P.: Testing that distributions are close. In: 41st FOCS, pp. 259–269 (2000)

  3. Canonne, C.L.: A survey on distribution testing: your data is big. But is it blue? In: ECCC, TR15-063 (2015)

  4. Chan, S., Diakonikolas, I., Valiant, P., Valiant, G.: Optimal algorithms for testing closeness of discrete distributions. In: 25th ACM-SIAM Symposium on Discrete Algorithms, pp. 1193–1203 (2014)

  5. Diakonikolas, I., Kane, D.: A new approach for testing properties of discrete distributions. arXiv:1601.05557 [cs.DS] (2016)

  6. Diakonikolas, I., Kane, D., Nikishkin, V.: Testing identity of structured distributions. In: 26th ACM-SIAM Symposium on Discrete Algorithms, pp. 1841–1854 (2015)

  7. Diakonikolas, I., Gouleakis, T., Peebles, J., Price, E.: Collision-based testers are optimal for uniformity and closeness. In: ECCC, TR16-178 (2016)

  8. Goldreich, O.: Introduction to property testing: lecture notes. Superseded by [9]. Drafts are available from the author’s web-page

  9. Goldreich, O.: Introduction to Property Testing. Cambridge University Press, Cambridge (2017)

  10. Goldreich, O.: On the optimal analysis of the collision probability tester (an exposition). This volume

  11. Goldreich, O., Goldwasser, S., Ron, D.: Property testing and its connection to learning and approximation. J. ACM 45, 653–750 (1998). Extended abstract in 37th FOCS, 1996

  12. Goldreich, O., Ron, D.: On testing expansion in bounded-degree graphs. In: ECCC, TR00-020, March 2000

  13. Goldreich, O., Ron, D.: On the relation between the relative earth mover distance and the variation distance (an exposition). This volume

  14. Paninski, L.: A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Trans. Inf. Theory 54, 4750–4755 (2008)

  15. Parnas, M., Ron, D., Rubinfeld, R.: Tolerant property testing and distance approximation. J. Comput. Syst. Sci. 72(6), 1012–1042 (2006)

  16. Valiant, G., Valiant, P.: Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In: 43rd ACM Symposium on the Theory of Computing, pp. 685–694 (2011)

  17. Valiant, G., Valiant, P.: Instance-by-instance optimal identity testing. In: ECCC, TR13-111 (2013)

  18. Valiant, G., Valiant, P.: Instance optimal learning. CoRR abs/1504.05321 (2015)


Appendix: Reducing Testing m-Grained Distributions (over [n]) to the Case of \(n=O(m)\)

Recall that Corollary 4.2 asserts that for every \(n,m\in \mathbb N\), the set of m-grained distributions over [n] has a tester of sample complexity \(O(\epsilon ^{-2}\cdot n/\log n)\). As commented in the main text, we believe that using the techniques of [16] one can reduce the complexity to \(O(\epsilon ^{-2}\cdot n'/\log n')\), where \(n'=\min (n,m)\). Here we show an alternative proof of this result. Specifically, we shall reduce \(\epsilon \)-testing m-grained distributions over [n] to \(\varOmega (\epsilon )\)-testing m-grained distributions over [O(m)], and apply Corollary 4.2.

The reduction will consist of using a deterministic filter \(f:[n]\rightarrow [k]\), where \(k=O(m)\), that is selected uniformly at random among all such filters. We stress that this is fundamentally different from the randomized filters F used in the main text. Specifically, when applying F several times to the same input, we obtain outcomes that are independent and identically distributed, whereas when we apply a function f (which was selected at random once and for all) several times to the same input we obtain the same output.
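The distinction is easy to render in code (a toy Python sketch; the sizes and names are ours, chosen only for illustration):

```python
import random

n, k = 1000, 100  # illustrative sizes; in the reduction k = O(m)

# Randomized filter F: fresh randomness on every application, so repeated
# applications to the same input yield i.i.d. outcomes.
def F(i):
    return random.randrange(k)

# Deterministic filter f: selected uniformly at random ONCE and then fixed,
# so repeated applications to the same input always yield the same outcome.
f = [random.randrange(k) for _ in range(n)]

i = 7
print(F(i), F(i))   # two independent values (typically distinct)
print(f[i], f[i])   # always the same value
```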

Note that applying any function \(f:[n]\rightarrow [k]\) to any m-grained distribution yields an m-grained distribution. Our main result is that, for any distribution X over [n] that is \(\epsilon \)-far from being m-grained, for almost all functions \(f:[n]\rightarrow [O(m)]\), the distribution f(X) is \(\varOmega (\epsilon )\)-far from being m-grained.

Lemma A.1

(relative preservation of distance from m-grained distributions): For all sufficiently small \(c>0\) and all sufficiently large n and m, the following holds. If a distribution X over [n] is \(\epsilon \)-far from being m-grained, then, with probability at least \(1-36c\) over the choice of a function \(f:[n]\rightarrow [m/c]\), the distribution f(X) is \(0.02\cdot \epsilon \)-far from being m-grained.

Hence, we obtain a randomized reduction of the general problem of testing m-grained distributions (over [n]) to the special case of \(n=O(m)\), where the reduction consists of selecting at random a function \(f:[n]\rightarrow [m/c]\) and using it as a (deterministic) filter for reducing the general problem to its special case.
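In code, the reduction amounts to the following wrapper (a sketch under the assumption that base_tester is a black-box tester for m-grained distributions over a domain of size k, e.g., the one provided by Corollary 4.2; the function name is ours):

```python
import random

def test_grained_via_reduction(samples, n, m, c, base_tester):
    # Select a uniformly random function f: [n] -> [k], where k = m/c.
    k = int(m / c)
    f = [random.randrange(k) for _ in range(n)]
    # Apply f as a deterministic filter to every sample of X.
    mapped = [f[i] for i in samples]
    # Run the given tester for m-grained distributions over [k] on f(X);
    # by Lemma A.1, a distribution that is eps-far from m-grained is,
    # with high probability over f, mapped to one that is 0.02*eps-far.
    return base_tester(mapped, k, m)
```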

Proof:

Let \(k=m/c\) and let \(p:[n]\rightarrow [0,1]\) denote the probability function that describes X. Define \(r:[n]\rightarrow [0,1/m)\) such that \(r(i)=p(i)-{\lfloor m\cdot p(i)\rfloor }/m\). Denoting by \(\varDelta _G(p)\) the statistical distance between p and the set of m-grained distributions (i.e., half the norm-1 distance), we have

$$\begin{aligned} 2\cdot \varDelta _G(p) \ge \sum _{i\in [n]}\min (r(i),(1/m)-r(i)) \end{aligned}$$
(4)
$$\begin{aligned} 2\cdot \varDelta _G(p) \le 2\cdot \sum _{i\in [n]}\min (r(i),(1/m)-r(i)) \end{aligned}$$
(5)

where Eq. (4) is due to the need to transform each p(i) to a multiple of 1/m, and Eq. (5) is justified by a two-step correction process in which we first round each p(i) to the closest multiple of 1/m, and then correct the resulting function so that it sums up to 1 (while keeping its values as multiples of 1/m); see Footnote 13. Hence, using Eq. (5), the lemma’s hypothesis implies that \(\sum _{i\in [n]}\min (r(i),(1/m)-r(i)) > \epsilon \). We shall prove the lemma by lower-bounding (w.h.p.) the corresponding sum that refers to the distribution f(X), when f is selected at random. Specifically, letting \(p'(j)=\sum _{i:f(i)=j}p(i)\), we shall lower-bound the probability that \(\sum _{j\in [k]}\min (r'(j),(1/m)-r'(j))=\varOmega (\epsilon )\), where \(r'(j)=p'(j)-{\lfloor m\cdot p'(j)\rfloor }/m\), and then apply Eq. (4).
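The quantity \(\sum _{i\in [n]}\min (r(i),(1/m)-r(i))\) is straightforward to compute; the following sketch (function name ours) evaluates it and returns the bounds on \(\varDelta _G(p)\) given by Eqs. (4) and (5):

```python
from math import floor

def dist_to_grained_bounds(p, m):
    """Return (S/2, S), where S = sum_i min(r(i), 1/m - r(i)) and
    r(i) = p(i) - floor(m*p(i))/m; by Eqs. (4) and (5), these are lower
    and upper bounds on Delta_G(p), the statistical distance of p from
    the set of m-grained distributions."""
    S = 0.0
    for pi in p:
        r = pi - floor(m * pi) / m
        S += min(r, 1.0 / m - r)
    return S / 2, S

# Example: the uniform distribution over [4] versus granularity m = 2:
# the bounds are (0.5, 1.0), and indeed Delta_G equals 0.5 here (the
# nearest 2-grained distribution is, e.g., (1/2, 1/2, 0, 0)).
print(dist_to_grained_bounds([0.25, 0.25, 0.25, 0.25], 2))
```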

Before doing so, we introduce a few additional notations. Firstly, we let \(s(i)=\min (r(i),(1/m)-r(i))\), and let \(\delta =\sum _{i\in [n]}s(i)\), which is greater than \(\epsilon \) by the hypothesis. Next, we let \(H=\{i\in [n]:p(i)\ge 1/3m\}\) denote the set of “heavy” elements in X. We observe that \(|H|\le 3m\) and that for every \(i\in {\overline{H}}\,{\mathop {=}\limits ^\mathrm{def}}\,[n]\setminus H\) it holds that \(s(i)=r(i)=p(i)\), since \(p(i)<1/3m<1/2m\) holds for every \(i\in {\overline{H}}\). We consider two cases, according to whether or not the sum \(\sum _{i\in {\overline{H}}}p(i)\) is smaller than \(0.5\cdot \delta \).
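In terms of the notation just introduced, the case split can be sketched as follows (a hypothetical helper, for illustration only):

```python
from math import floor

def proof_case(p, m):
    """Compute s(i), delta, and the heavy set H, and report which of
    the two claims below applies (the light mass decides the case)."""
    n = len(p)
    r = [pi - floor(m * pi) / m for pi in p]
    s = [min(ri, 1 / m - ri) for ri in r]
    delta = sum(s)                                      # > eps by hypothesis
    H = {i for i in range(n) if p[i] >= 1 / (3 * m)}    # heavy elements
    light = sum(p[i] for i in range(n) if i not in H)   # mass of H-bar
    return "Claim A.1.1" if light < 0.5 * delta else "Claim A.1.2"
```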

Claim A.1.1

(the first case): Suppose that \(\sum _{i\in {\overline{H}}}p(i)<0.5\cdot \delta \). Then, with probability at least \(1-16c\) over the choice of f, it holds that f(X) is \(0.05\epsilon \)-far from being m-grained.

Proof: In this case \(\sum _{i\in H}s(i) > 0.5\cdot \delta \), and we shall focus on the contribution of f(H) to the distance of f(X) from being m-grained. We shall show that, for almost all functions f, much of this weight is mapped (by f) in a one-to-one manner, and that the elements in \(\overline{H}\) do not change by much the weight mapped by f to f(H). Specifically, we consider a uniformly selected function \(f:[n]\rightarrow [k]\), and the following two good events defined on this probability space.

  1. The first (good) event is that the function f maps at least \(0.2\delta \) of the s(i)-mass of the i’s in H to distinct images. Intuitively, this is very likely given that the total s(i)-mass of i’s in H is greater than \(0.5\delta \) and that \(|H|\ll k\). Formally, denoting by \(H_f\) the (random variable that represents the) set of \(i\in H\) that satisfy \(f(i)\not \in f(H\setminus \{i\})\) (i.e., for every \(i\in H_f\) it holds that \(f^{-1}(f(i))\cap H=\{i\}\)), we claim that \(\mathbf{Pr}_f\left[ \sum _{i\in H_f}s(i)> 0.2\delta \right] >1-c\).

    To see this, we first note that, for every \(i\in H\), conditioned on the values assigned to \(H\setminus \{i\}\), the probability that \(f(i)\not \in f(H\setminus \{i\})\) is at least \(\frac{k-(|H|-1)}{k}>1-|H|/k\ge 0.9\), where the inequality is due to \(|H|\le 3m=3c\cdot k\le 0.1\cdot k\). Hence, each \(i\in H\) contributes \(s(i)\le 1/2m\) to the sum (of s(i)’s with \(i\in H_f\)) with probability at least 0.9, also when conditioned on all other values assigned by f. It follows that \(\mathbf{Pr}_f\left[ \sum _{i\in H_f}s(i)> 0.2\delta \right] >1-c\), where the (typical) case of \(\delta =\omega (1/m)\) is straightforward (see Footnote 14).

  2. The second (good) event is that the function f does not map much p(i)-mass of i’s in \({\overline{H}}\) to the images occupied by H. Again, this is very likely given that \(|H|\ll k\). Specifically, observe that \(\mathbb E_f\left[ \sum _{i\in {\overline{H}}:f(i)\in f(H)}p(i)\right] \le \frac{|H|}{k}\cdot \sum _{i\in {\overline{H}}}p(i) < 3c\cdot \delta /2\), since \(p(i)=s(i)\) for every \(i\in {\overline{H}}\) (and \(|H|\le 3m\) and \(k=m/c\)). Letting \(S_f=\sum _{i\in {\overline{H}}:f(i)\in f(H)}p(i)\), we get \(\mathbf{Pr}_f[S_f<0.1\delta ]>1-\frac{3c\delta /2}{0.1\delta }=1-15c\).

Assuming that the two good events occur (which happens with probability at least \(1-16c\)), it follows that at least \(0.2\delta \) of the \(s(\cdot )\)-mass of H is mapped by f to distinct images and at most \(0.1\delta \) of the mass of \({\overline{H}}\) is mapped to these images. Hence, f(X) corresponds to a probability function \(p'\) such that \(r'(i)\,{\mathop {=}\limits ^\mathrm{def}}\, p'(i)-{\lfloor m\cdot p'(i)\rfloor }/m\) satisfies

$$\begin{aligned} \sum _{i\in H_f}\min (r'(i),(1/m)-r'(i)) \;\ge \; \sum _{i\in H_f}s(i) - \sum _{i\in {\overline{H}}:f(i)\in f(H)} p(i) \;>\; 0.2\delta -0.1\delta , \end{aligned}$$

where \(H_f=\{i\in H:f^{-1}(f(i))\cap H=\{i\}\}\) (as above). Hence, recalling that \(\delta >\epsilon \) and using Eq. (4), with probability at least \(1-16c\) over the choice of f, it holds that f(X) is \(0.05\epsilon \)-far from being m-grained.    \(\square \)
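The first good event is also easy to examine empirically; the sketch below (names ours) estimates, by Monte Carlo simulation, the probability that a random f maps more than \(0.2\delta \) of the s-mass of H to distinct images:

```python
import random
from collections import Counter

def estimate_event1(s_H, delta, k, trials=2000):
    """Estimate Pr_f[ sum_{i in H_f} s(i) > 0.2*delta ], where H_f is the
    set of i in H whose image under a random f: H -> [k] is not shared
    with any other element of H (cf. the first good event above)."""
    hits = 0
    for _ in range(trials):
        f = [random.randrange(k) for _ in s_H]
        counts = Counter(f)
        mass = sum(si for si, fi in zip(s_H, f) if counts[fi] == 1)
        hits += (mass > 0.2 * delta)
    return hits / trials

# Toy instance: 30 heavy elements, each with s(i) = 1/(2m) for m = 10,
# total heavy mass 1.5 = 0.5*delta, and k = m/c = 1000 for c = 0.01;
# the claim predicts a probability of at least 1 - c.
print(estimate_event1([0.05] * 30, 3.0, 1000))
```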

Claim A.1.2

(the second case): Suppose that \(\delta '\,{\mathop {=}\limits ^\mathrm{def}}\,\sum _{i\in {\overline{H}}}p(i)\ge 0.5\cdot \delta \). Then, with probability at least \(1-36c\) over the choice of f, it holds that f(X) is \(0.02\epsilon \)-far from being m-grained.

Proof: In this case \(\sum _{i\in {\overline{H}}}s(i) > 0.5\cdot \delta \), and we shall focus on the contribution of \(f({\overline{H}})\) to the distance of f(X) from being m-grained. We shall show that, for almost all functions f, much of this weight is mapped (by f) to \([k]\setminus f(H)\) and that the mass of the elements of \(\overline{H}\) is distributed almost uniformly among these images. Specifically, we first show that more than half of the probability mass of \(\overline{H}\) is mapped disjointly of f(H). That is,

$$\begin{aligned} \mathbf{Pr}_{f:[n]\rightarrow [k]}\left[ \sum _{i\in {\overline{H}}:f(i)\not \in f(H)} p(i)>0.5\cdot \delta '\right] \ge 1-6c \end{aligned}$$
(6)

where the probability is taken uniformly over all possible choices of f. The proof is similar to the analysis of the second event in the proof of Claim A.1.1. Specifically, we consider random variables \(\zeta _i\)’s such that \(\zeta _i=p(i)\) if \(f(i)\not \in f(H)\) and \(\zeta _i=0\) otherwise, and observe that \(\mathbb E[\zeta _i]\ge \frac{k-|H|}{k}\cdot p(i)\ge (1-3c)\cdot p(i)\) (since \(|H|\le 3m\) and \(m=ck\)). Thus, \(\mathbb E\left[ \sum _{i\in {\overline{H}}}\zeta _i\right] \ge (1-3c)\cdot \delta '\) and Eq. (6) follows by Markov’s Inequality, while using \(\sum _{i\in {\overline{H}}}\zeta _i\le \sum _{i\in {\overline{H}}}p(i)=\delta '\). This holds also if we fix the values of f on H and condition on them, which is what we do from this point on. Hence, we fix an arbitrary sequence of values for f(H), and consider the uniform distribution of f conditioned on this fixing as well as on the event in Eq. (6).

Actually, we decompose \(f:[n]\rightarrow [k]\) into three parts, denoted \(f',f''\) and \(f'''\), that represent its restrictions to the three-way partition of [n] into \((H,B,G)\) such that \(B=\{i\in {\overline{H}}:f(i)\in f(H)\}\) (and \(G=\{i\in {\overline{H}}:f(i)\not \in f(H)\}\)); indeed, \(f':H\rightarrow [k]\) is the restriction of f to H, whereas \(f'':B\rightarrow f(H)\) and \(f''':G\rightarrow [k]\setminus f(H)\) are its restrictions to the two parts of \(\overline{H}\). We fix arbitrary \(f':H\rightarrow [k]\) and \(f'':B\rightarrow f'(H)\), where \(B=\{i\in {\overline{H}}:f''(i)\in f'(H)\}\), such that \(\sum _{i\in B} p(i)<0.5\delta '\), while bearing in mind that such a fixing (of \(f'\) and \(f''\)) arises from the choice of a random f with probability at least \(1-6c\). Our aim will be to show that, with high probability over the choice of \(f''':G\rightarrow [k]\setminus f(H)\), it holds that

$$\begin{aligned} \sum _{i\in G:f'''(i)\in J(f''')}p(i)>0.4\delta ', \end{aligned}$$
(7)

where \(J(f''')\,{\mathop {=}\limits ^\mathrm{def}}\,\{j\in [k]:\sum _{i\in G:f'''(i)=j}p(i)\le 0.8/m\}\). (Recall that for any \(i\in G\subseteq {\overline{H}}\) it holds that \(p(i)=r(i)=s(i)<1/3m\).) This would imply that, with high probability, the distance of f(X) from being m-grained is at least

$$\begin{aligned} \sum _{j\in J(f''')} \min \left( \sum _{i:f'''(i)=j}p(i)\;,\;\frac{1}{m}-\frac{0.8}{m}\right) \;\ge \; \sum _{j\in J(f''')} 0.25\cdot \sum _{i:f'''(i)=j}p(i) \;>\; 0.25\cdot 0.4\delta ' \;\ge \; 0.05\delta , \end{aligned}$$

where the first inequality is due to the fact that \(p'(j)\,{\mathop {=}\limits ^\mathrm{def}}\,\mathbf{Pr}[f(X)\!=\!j]=\sum _{i\in G:f'''(i)=j}p(i)\le 0.8/m\) for every \(j\in J(f''')\) and so \(\min (p'(j),0.2/m)\ge p'(j)/4\). So all that remains is to show that Eq. (7) holds with high probability over the choice of \(f'''\).

Letting \(K'\,{\mathop {=}\limits ^\mathrm{def}}\,[k]\setminus f(H)\), we start by observing that, for every \(i\in G\), it holds that

$$\begin{aligned} \mathbf{Pr}_{f''':G\rightarrow K'}[f'''(i)\not \in J(f''')] \;\le \; \mathbf{Pr}_{f''':G\rightarrow K'} \left[ \sum _{\ell \in G\setminus \{i\}:f'''(\ell )=f'''(i)}p(\ell ) \;>\;\frac{0.8}{m}-\frac{1}{3m}\right] \;=\; \mathbf{Pr}_{f''':G\rightarrow K'} \left[ f'''(i)\in \left\{ j\in K': \sum _{\ell \in G\setminus \{i\}:f'''(\ell )=j} p(\ell ) \;>\;\frac{1.4}{3m}\right\} \right] \end{aligned}$$
(8)

where the equality can be seen by first fixing \(f'''\)-values for all elements in \(G\setminus \{i\}\) and then selecting \(f'''(i)\) uniformly in \(K'\). Upper-bounding the size of the set in Eq. (8) by \((1.4/3m)^{-1}\), and using \(m=ck\) and \(|K'|\ge k-3m\), we get

$$\begin{aligned} \mathbf{Pr}_{f''':G\rightarrow K'}[f'''(i)\not \in J(f''')] \;\le \; \frac{3m}{1.4}\cdot \frac{1}{|K'|} \;\le \; \frac{3ck}{1.4}\cdot \frac{1}{k-3ck} \;\le \; 3c, \end{aligned}$$

where the last inequality presupposes \(1.4\cdot (1-3c)\ge 1\) (equiv., \(c \le 2/21\)). It follows that

$$\begin{aligned} \mathbb E_{f''':G\rightarrow K'}\left[ \sum _{i\in G:f'''(i)\not \in J(f''')}p(i)\right] \;=\; \sum _{i\in G}\mathbf{Pr}_{f''':G\rightarrow K'}[f'''(i)\not \in J(f''')]\cdot p(i) \;\le \; \sum _{i\in G}3c\cdot p(i) \;\le \; 3c\cdot \delta ', \end{aligned}$$

since \(\sum _{i\in G}p(i)\le \sum _{i\in {\overline{H}}}p(i)=\delta '\). Hence,

$$\mathbf{Pr}_{f''':G\rightarrow K'} \left[ \sum _{i\in G:f'''(i)\not \in J(f''')}p(i)\ge 0.1\delta '\right] \;\le \;\frac{3c}{0.1} \;=\;30c.$$

Recalling that \(\sum _{i\in B} p(i)<0.5\delta '\), which implies \(\sum _{i\in G} p(i)>0.5\delta '\), it follows that Eq. (7) holds with probability at least \(1-30c\) (over the choice of \(f'''\)).

Lastly, recall that \(\sum _{i\in B} p(i)<0.5\delta '\), where \(B=\{i\in {\overline{H}}:f''(i)\in f'(H)\}\), holds with probability at least \(1-6c\) (over the choice of \(f'\) and \(f''\)). The claim follows, since (as argued above) Eq. (7) implies that \(\sum _{j\in J(f''')}\min (p'(j),(1/m)-p'(j))>0.1\delta '\) (whereas using \(\delta '\ge \delta /2\ge \epsilon /2\) and Eq. (4), it follows that f(X) is \(0.02\epsilon \)-far from being m-grained).    \(\square \)

Combining Claims A.1.1 and A.1.2, the lemma follows.    \(\blacksquare \)
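Putting the pieces together, one can sanity-check Lemma A.1 numerically on small instances; the sketch below (parameters illustrative, reusing the Eq. (4) lower bound from the earlier sketch) draws random filters f and measures how often f(X) remains \(0.02\epsilon \)-far from being m-grained:

```python
import random
from math import floor

def lower_bound(p, m):
    # Eq. (4): Delta_G(p) >= 0.5 * sum_i min(r(i), 1/m - r(i)).
    S = sum(min(pi - floor(m*pi)/m, 1/m - (pi - floor(m*pi)/m)) for pi in p)
    return 0.5 * S

def check_lemma(p, m, c=0.05, trials=500):
    """Fraction of random filters f: [n] -> [m/c] under which f(X) is
    still 0.02*eps-far from m-grained, where eps is taken to be the
    Eq. (4) lower bound on the distance of X itself."""
    n, k = len(p), int(m / c)
    eps = lower_bound(p, m)
    good = 0
    for _ in range(trials):
        f = [random.randrange(k) for _ in range(n)]
        q = [0.0] * k
        for i, pi in enumerate(p):
            q[f[i]] += pi            # probability vector of f(X)
        good += (lower_bound(q, m) > 0.02 * eps)
    return good / trials
```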


Copyright information

© 2020 Springer Nature Switzerland AG


Cite this chapter

Goldreich, O. (2020). The Uniform Distribution Is Complete with Respect to Testing Identity to a Fixed Distribution. In: Goldreich, O. (ed.) Computational Complexity and Property Testing. Lecture Notes in Computer Science, vol. 12050. Springer, Cham. https://doi.org/10.1007/978-3-030-43662-9_10
