# New algorithms and lower bounds for circuits with linear threshold gates

Ryan Williams\*

Stanford University

January 13, 2014

#### **Abstract**

Let $ACC \circ THR$ be the class of constant-depth circuits comprised of AND, OR, and MODm gates (for some constant m > 1), with a bottom layer of gates computing arbitrary linear threshold functions. This class of circuits can be seen as a "midpoint" between ACC (where we know nontrivial lower bounds) and depth-two linear threshold circuits (where nontrivial lower bounds remain open).

We give an algorithm for evaluating an arbitrary symmetric function of $2^{n^{o(1)}}$ ACC $\circ$ THR circuits of size $2^{n^{o(1)}}$, on all possible inputs, in $2^n \cdot \text{poly}(n)$ time. Several consequences are derived:

- The number of satisfying assignments to an ACC $\circ$ THR circuit of subexponential size can be computed in $2^{n-n^{\varepsilon}}$ time (where $\varepsilon > 0$ depends on the depth and modulus of the circuit).
- NEXP does not have quasi-polynomial size ACC $\circ$ THR circuits, and NEXP does not have quasi-polynomial size ACC $\circ$ SYM circuits. Nontrivial size lower bounds were not known even for AND $\circ$ OR $\circ$ THR circuits.
- Every 0-1 integer linear program with n Boolean variables and s linear constraints is solvable in $2^{n-\Omega(n/((\log M)(\log s)^5))} \cdot \operatorname{poly}(s,n,M)$ time with high probability, where M upper bounds the bit complexity of the coefficients. (For example, 0-1 integer programs with weights in $[-2^{\operatorname{poly}(n)}, 2^{\operatorname{poly}(n)}]$ and $\operatorname{poly}(n)$ constraints can be solved in $2^{n-\Omega(n/\log^6 n)}$ time.) Impagliazzo, Paturi, and Schneider [IPS13] recently gave an algorithm for $\tilde{O}(n)$ constraints; ours is the first asymptotic improvement over exhaustive search for up to subexponentially many constraints.
We also present an algorithm for evaluating depth-two linear threshold circuits (a.k.a., THR $\circ$ THR) with exponential weights and $2^{n/24}$ size on all $2^n$ input assignments, running in $2^n \cdot \text{poly}(n)$ time. This is evidence that non-uniform lower bounds for THR $\circ$ THR are within reach.

<sup>\*</sup>Supported by an Alfred P. Sloan Fellowship, a Microsoft Research Faculty Fellowship, a David Morgenthaler II Faculty Fellowship, and NSF CCF-1212372. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

# 1 Introduction

Recall that in the non-uniform Boolean circuit model, one designs an infinite family of logical circuits $\{C_n\}$, one for each input length n, in order to recognize a given binary language $L \subseteq \{0,1\}^*$. This model is notoriously powerful, even when the size of $C_n$ is bounded from above by a fixed polynomial in n, defining the complexity class P/poly. With polynomial size circuits, one can already "compute" some undecidable languages, such as $L' = \{1^n \mid \text{the } n\text{th Turing machine halts on blank tape}\}$. Nevertheless, it is strongly believed that NP $\not\subset$ P/poly, meaning that for even modestly-sized instances of NP-complete problems, the sizes of *computations* on such instances must be inevitably gigantic. However, knowledge of P/poly is rather poor, due to the "infinite" nature of the model: it is open if the huge complexity class *nondeterministic exponential time* (NEXP) is contained in P/poly. This containment would imply that problems verifiable with exponentially-long witnesses could be efficiently "solved" with small circuits. It looks obviously absurd; how can we rule it out?

In recent years, it has been demonstrated that the existence of nontrivial circuit-analysis algorithms is closely linked to the NEXP versus P/poly problem.
For instance, Impagliazzo, Kabanets, and Wigderson [IKW02] showed that NEXP $\not\subset$ P/poly follows if there is a $2^{n^{o(1)}}$ time algorithm that can approximate a given circuit's acceptance probability to within 1/10. They also proved a partial converse, in that NEXP $\not\subset$ P/poly implies a certain kind of derandomization. Subsequent work [Wil10] strengthened the algorithms-to-lower bounds implication, proving that a similar algorithm which (for every k) runs in $2^{n-\omega(\log n)}$ time on all n-input $n^k$-size circuits still implies NEXP $\not\subset$ P/poly. A variant of this implication (for circuit satisfiability algorithms) was combined with a satisfiability algorithm for a restricted circuit class called ACC, implying that NEXP does not have polynomial-size ACC circuits [Wil11b]. Recently, it was shown that NEXP $\not\subset$ P/poly is equivalent to establishing a "weak" form of natural proofs [Wil13b], building on Impagliazzo et al.<sup>1</sup>

To continue progress on circuit lower bounds for NEXP, it is imperative to understand algorithms for analyzing circuits, such as algorithms for circuit satisfiability, evaluating a circuit on all $2^n$ inputs, and approximating the acceptance probability of a circuit.<sup>2</sup> In this paper, we make this sort of algorithmic progress for circuits with arbitrary *linear threshold* gates: such a gate outputs 1 if and only if a certain linear inequality $\sum_i w_i x_i \ge t$ is true, where $w_i, t \in \mathbb{Z}$ are *weights* and $x_i \in \{0,1\}$ are inputs to the gate. Linear threshold functions have been studied for decades, coinciding with research on neural networks [MP69, Mur71]. Low-depth linear threshold circuits are powerful: many basic functions in arithmetic, algebra, and cryptography are known to be implementable with only *constant-depth* linear threshold circuits [RT92, SBKH93, SP94, MT99, NR04].
In terms of lower bounds for such circuits, very weak questions remain major open problems: for example, is all of NEXP solvable with polynomial-size depth-*two* linear threshold circuits with exponential-size weights?<sup>3</sup> Depth-two circuits correspond to *multilayer perceptrons* with only one hidden layer. Despite considerable study in neural networks and deep learning, we still lack understanding of the power of depth-two. In this paper, we report some new progress on understanding the power of linear threshold gates.

<sup>1</sup>In particular, NEXP $\not\subset$ P/poly if and only if there is a "constructive" property of Boolean functions that is "useful" against P/poly. The natural proofs barrier [RR97] states that if such a property is also "large" (true of a large fraction of functions) then strong cryptographic pseudorandom generators do not exist. Hence, assuming strong crypto, NEXP lower bounds must somehow confront the framework of natural proofs but sidestep the "large" condition.

<sup>2</sup>Recent surveys on these issues include [Wil11a, San12, Coh13, Oli13].

<sup>3</sup>Note that for thresholds with polynomially-bounded weights, depth-two lower bounds are known; however depth-three lower bounds are still open. The survey of Razborov [Raz92] is still relatively current on these points.

**Algorithms and lower bounds for ACC with threshold gates.** Let ACC $\circ$ THR denote the class of circuits consisting of AND, OR, MODm gates for some constant m,<sup>4</sup> and linear threshold gates, with unbounded fan-in and constant depth, such that the inputs of all linear threshold gates connect directly to the circuit's input variables. Let SYM $\circ$ ACC $\circ$ THR be the class of circuits where the output gate computes an arbitrary symmetric function, and its inputs connect to the outputs of ACC $\circ$ THR circuits.
We show that such circuits can be very efficiently evaluated on all $2^n$ inputs, even if they are of $2^{n^{o(1)}}$ size.

**Theorem 1.1** Given a SYM $\circ$ ACC $\circ$ THR circuit with n inputs and $2^{n^{o(1)}}$ size, we can produce its outputs on all $2^n$ inputs in $2^n \cdot poly(n)$ time. More generally, such a circuit of size s can be evaluated on all inputs in $2^n \cdot poly(\log s, n) + 2^{O(\log s)^c}$ time, for some $c \ge 1$ depending on the depth of the circuit and the modulus m of its MODm gates.

The proof of Theorem 1.1 also carries through for SYM $\circ$ ACC $\circ$ SYM, where the bottom layer gates compute arbitrary symmetric functions (i.e., functions which only depend on the number of true inputs) of $2^{n^{o(1)}}$ wires. This algorithm can be used to *count* the number of satisfying assignments to ACC $\circ$ THR circuits.

**Theorem 1.2** For every integer m > 1 and d > 0, there is an $\varepsilon > 0$ such that counting satisfying assignments to ACC $\circ$ THR circuits of size $2^{n^{\varepsilon}}$, depth d, and MODm gates can be done in $2^{n-n^{\varepsilon}}$ time.

By modifying prior arguments [Wil11b], we can conclude lower bounds for such circuits. The new argument shows that the ability to count SAT assignments entails non-uniform lower bounds for circuit classes with very weak closure properties.

**Theorem 1.3** NEXP *does not have non-uniform* ACC $\circ$ THR *circuits of quasi-polynomial size.*

As Theorem 1.1 also holds for SYM $\circ$ ACC $\circ$ SYM, it follows that NEXP does not have ACC $\circ$ SYM circuits of quasi-polynomial size as well.

Twenty years ago, Maciel and Therien [MT93] considered lower bounds for $AC^0 \circ MAJ$ circuits (which $ACC \circ THR$ subsumes), but nontrivial lower bounds have not been reported. Regan [Reg97] studied $MOD_2 \circ AND \circ THR$ circuits and also noted the absence of lower bounds. Lower bounds have been open even for the much weaker class $AND \circ OR \circ MAJ$ [HP13].
Theorem 1.3 moves a little closer to an "unconditional break" of the natural proofs barrier [RR97]. That is, it seems plausible that pseudorandom functions can be implemented with ACC $\circ$ THR circuits, in which case any lower bounds proved against such circuits must be non-naturalizing.<sup>5</sup> Plaku [Pla02] observed that the Naor-Reingold family of pseudorandom functions [NR04] can be implemented with quasi-polynomial size OR $\circ$ THR $\circ$ AND circuits; it follows that the natural proofs barrier already applies to this circuit class. It is an interesting open problem if ACC $\circ$ THR can efficiently simulate such depth-three circuits.

Building on Theorem 1.1, we also give a new method for solving 0-1 integer linear programs. In FOCS'13, Impagliazzo, Paturi, and Schneider [IPS13] showed that for each c > 1, there is a $\delta < 1$ such that 0-1 integer LPs with cn constraints can be solved in $2^{\delta n}$ time. We provide an improvement over exhaustive search for up to subexponentially many constraints:

**Theorem 1.4** Every 0-1 integer linear program with n variables and s constraints can be solved in time $2^{n-\Omega(n/((\log M)(\log s)^5))} \cdot poly(s,n,M)$ with high probability, where $M \leq 2^{o(n)}$ upper bounds the bit complexity of the coefficients in the program.

<sup>4</sup>A MODm gate outputs 1 if and only if the sum of its input bits is divisible by m.

<sup>5</sup>It is not completely settled whether the proof that NEXP $\not\subset$ ACC is "truly" non-naturalizing; it could be that the natural proofs barrier is irrelevant to the problem. (If pseudorandom functions cannot be implemented in ACC, then natural proofs considerations don't apply to ACC anyway; if such functions can be implemented in ACC, then the NEXP lower bound is indeed non-naturalizing.)

Notice that the theorem allows for enormous coefficients, of size up to $2^{2^{o(n)}}$.
The time bound compares favorably with the AC<sup>0</sup> circuit satisfiability bounds of Impagliazzo, Matthews, and Paturi [IMP12]: there, the authors use random restriction methods to solve satisfiability of AC<sup>0</sup> circuits with depth d and size s in $2^{n-n/(\log s)^{O(d)}}$ randomized time with zero error. Our algorithm shows that, using probabilistic polynomials and fast rectangular matrix multiplication, one can obtain similar running times for SAT of AC<sup>0</sup>[2] circuits with a layer of symmetric gates at the bottom.

**Depth-two linear threshold circuit evaluation.** We take an important step towards depth-two linear threshold circuit (a.k.a. THR $\circ$ THR) lower bounds for the case of exponential weights, by giving an efficient algorithm for evaluating such circuits on all possible assignments.

**Theorem 1.5** Let k > 1. Given a depth-two $2^{n/24}$-size linear threshold circuit C with integer weights in $[-2^{n^k}, 2^{n^k}]$, we can evaluate C on all $2^n$ input assignments in $2^n \cdot poly(n^k)$ time.

Theorem 1.5 follows from a more general result showing that any sufficiently large "combinatorial rectangle" of inputs can be evaluated in poly(n) amortized time per input. Noting that a similar statement for evaluating ACC circuits forms the heart of the proof of NEXP $\not\subset$ ACC [Wil11b], Theorem 1.5 suggests that large complexity classes (such as NEXP) cannot have small depth-two linear threshold circuits. However, we do not yet know how to turn Theorem 1.5 into depth-two linear threshold lower bounds.<sup>6</sup>

#### 1.1 Prior work

Considerable effort has been expended in proving lower bounds against circuits with linear threshold gates. Here we will provide some major highlights, in addition to the work already mentioned. It will help to introduce a little (standard) notation.
Define MAJ, AND, OR, THR, and SYM to be the classes of one-gate circuits corresponding to MAJORITY, AND, OR, linear threshold, and symmetric functions, respectively, with "free" NOT gates that can appear after the output or on the input wires to the gate. (Recall that a symmetric Boolean function's output only depends on the number of true inputs.) For classes of circuits $\mathscr C$ and $\mathscr D$, define $\mathscr C \circ \mathscr D$ to be the class of circuits formed by taking a circuit $C \in \mathscr C$, and feeding the outputs of circuits from $\mathscr D$ as inputs to C. That is, $\mathscr C \circ \mathscr D$ is simply the composition of circuits from $\mathscr C$ and $\mathscr D$, with the circuits from $\mathscr D$ receiving the input and the circuit from $\mathscr C$ giving the output. We will equate the *size* of a circuit with the number of wires, i.e., the number of directed arcs in the DAG defining the circuit. This is an important measure for circuits with symmetric gates, as the number of wires governs the size of the symmetric function representation.

Much work on depth-two threshold lower bounds has concentrated on lower bounds for inner product modulo 2, i.e., $IP2(x_1,\ldots,x_n,y_1,\ldots,y_n)=\sum_i x_i\cdot y_i \mod 2$. Note that IP2 is easy for ACC (being a MOD2 of AND gates). In groundbreaking work, Hajnal et al. $[HMP^+93]$ proved that every MAJ $\circ$ MAJ circuit requires $2^{\Omega(n)}$ gates to compute IP2. They also showed MAJ $\circ$ SYM circuits can be efficiently simulated by MAJ $\circ$ MAJ circuits, so small MAJ $\circ$ SYM circuits also cannot compute IP2. Nisan [Nis94] extended the lower bound to MAJ $\circ$ THR circuits, and Forster et al. $[FKL^+01]$ extended the lower bound to THR $\circ$ MAJ circuits.
More recently, Sherstov [She09] showed that $AC^0$ requires exponential-size MAJ $\circ$ THR circuits, and Beame and Huynh [BH12] showed that $AC^0$ requires $n^{\Omega(\log n)}$-size MAJ $\circ$ SYM $\circ$ AND circuits. Although superpolynomial-size lower bounds against MAJ $\circ$ AC<sup>0</sup>, THR $\circ$ AC<sup>0</sup>, MAJ $\circ$ MAJ $\circ$ AND and even MAJ $\circ$ MAJ $\circ$ AC<sup>0</sup> circuits are known [ABFR94, Gol97, RW93, HM04], and many lower bounds are known for $AC^0$ circuits augmented with a small number of threshold gates [Bei94, BS94, CH05, Vio06, Han07, GS10, LS11, Pod12], lower bounds for $AC^0 \circ MAJ$ circuits have remained open. Maciel and Therien [MT93] conjectured that the majority-of-majority function is not in $AC^0 \circ MAJ$. Recently, Hansen and Podolskii [HP13] have shown an intriguing reduction: superpolynomial-size THR $\circ$ THR lower bounds for a function f would follow from superlogarithmic lower bounds on the 3-party NOF unbounded-error communication complexity of f.

<sup>6</sup>The current theorems connecting circuit evaluation algorithms to circuit lower bounds require that, from the OR of a collection of circuits, we can generate an equivalent circuit in the same class. We do not know how to convert a large OR of THR $\circ$ THR circuits into an equivalent THR $\circ$ THR circuit, even assuming NEXP has small THR $\circ$ THR circuits. (In the case of ACC, this is trivial, because an OR of ACC circuits is still an ACC circuit.)

#### 1.2 Comparison and Intuition

It is instructive to discuss how this paper's approach relates to prior work on depth-two threshold lower bounds. A certain popular approach [FKL<sup>+</sup>01, Lok08, She09, RS10] applies ingredients from Fourier analysis of Boolean functions, linear algebra, communication complexity, discrepancy theory, *etc*. In particular, these works follow the general scheme: - 1.
Define some notion of "relaxed rank" of a $2^{n/2} \times 2^{n/2}$ Boolean matrix C. Intuitively, if C has "relaxed rank" r, then there are $2^{n/2} \times r$ and $r \times 2^{n/2}$ matrices A and B such that the entries of $A \cdot B$ correspond to the entries of C in a direct way. - 2. Show that every function $f:(\{0,1\}^{n/2}\times\{0,1\}^{n/2})\to\{0,1\}$ computable with a "small" $\mathscr C$ circuit has "small relaxed rank" when construed as an $2^{n/2}\times2^{n/2}$ Boolean matrix. - 3. Show that some explicit family of functions $g_n: (\{0,1\}^{n/2} \times \{0,1\}^{n/2}) \to \{0,1\}$ , construed as $2^{n/2} \times 2^{n/2}$ Boolean matrices, requires "high relaxed rank" asymptotically. Together, these steps prove that the family $g := \{g_n\}$ cannot have "small" $\mathscr C$ circuits. To prove ACC $\circ$ THR circuit lower bounds, we define a generalized rank notion we call the *symmetric rank*, informally measuring how efficiently a 0-1 matrix M can be decomposed into a sum of rank-one matrices such that, after applying a fixed symmetric function to each entry of the sum, we obtain the matrix M. Combining several elements from previous work, we show that for a Boolean matrix representing the truth table of a SYM $\circ$ ACC $\circ$ THR circuit of size s, its symmetric rank is $O(2^{\log^c s})$ for some constant $c \ge 1$ , depending on the depth d and modulus m of the MODm gates in the circuit. Moreover, given such a circuit we can efficiently compute a low-rank decomposition. However, we do not know how to use existing methods to prove that an explicit function *g* has high symmetric rank. Instead, we take a more *computational* approach that still exploits the low symmetric rank property. The idea is that, if we can efficiently compute a low-rank decomposition from a given circuit, then the circuit's truth table can be obtained faster than evaluating the circuit on all its inputs one-by-one. 
This in turn suggests that these circuits possess considerable structure that makes them unsuitable for simulating very complex functions, such as those in NEXP.

Suppose we are given a SYM $\circ$ ACC $\circ$ THR circuit C of size s with n inputs. Let M be a $2^{n/2} \times 2^{n/2}$ matrix defining the function computed by C. First we show how given any such C we can compute $2^{n/2} \times 2^{\log^c s}$ and $2^{\log^c s} \times 2^{n/2}$ matrices A and B (and a symmetric function f) giving a symmetric rank decomposition of M, in $2^{n/2} \cdot 2^{O(\log^c s)}$ time. By multiplying A and B and applying f to each entry of the output matrix, we can obtain M. When s is sufficiently small, a rectangular matrix multiplication of Coppersmith [Cop82] can be applied to compute the product of A and B, and the final matrix M is obtained in poly(n) time per entry. Hence, given a SYM $\circ$ ACC $\circ$ THR circuit C of size $2^{n^{o(1)}}$, we can evaluate C on all its $2^n$ inputs in only $2^n \cdot \text{poly}(n)$ time. This fast evaluation algorithm is combined with prior work [Wil10, Wil11b] along with some new tricks to exhibit a $g := \{g_n\} \in \text{NEXP}$ which does not have quasipolynomial-size ACC $\circ$ THR circuits.

Our evaluation algorithm for depth-two threshold circuits (Theorem 1.5) also uses Coppersmith's rectangular matrix multiplication as a subroutine, but the rest of the algorithm is rather different from the evaluation algorithm for $SYM \circ ACC \circ THR$. We reduce the problem of efficiently evaluating a depth-two threshold circuit on many inputs to a special type of matrix multiplication. Namely, for two matrices A and B over the integers, we compute a "weighted" matrix product

$$C[i,j] = \sum_{k} w_k \cdot \text{LEQ}(A[i,k], B[k,j]),$$

where LEQ(x, y) is a Boolean-valued function equal to 1 if and only if $x \le y$, and the $w_k$'s are arbitrary integer weights given as parameters to the problem. We show how Coppersmith's algorithm can be combined with a mild brute force search to efficiently compute a rectangular matrix product of the above form.

# 2 Algorithms and lower bounds for ACC with a layer of threshold gates

The main theorem of this section is:

**Reminder of Theorem 1.1** Given a SYM $\circ$ ACC $\circ$ THR circuit with n inputs and $2^{n^{o(1)}}$ size, we can produce its outputs on all $2^n$ inputs in $2^n \cdot poly(n)$ time. More generally, such a circuit of size s can be evaluated on all inputs in $2^n \cdot poly(\log s, n) + 2^{O(\log s)^c}$ time, for some $c \ge 1$ depending on the depth of the circuit and the modulus m of its MODm gates.

**Depth reduction.** The first stage of the proof is to convert an arbitrary SYM $\circ$ ACC $\circ$ THR circuit C of size s into a depth-two circuit C'' of symmetric gates, i.e., a SYM $\circ$ SYM circuit. The size of the depth-two circuit will be $O(2^{\log^c s})$ for a constant $c \ge 1$, depending on the (constant) depth and (constant) modulus of circuit C. This stage requires several different pieces from prior work.
**Lemma 2.1** There is an algorithm which, given a SYM $\circ$ ACC $\circ$ THR circuit C of size $s \ge n$, depth d, and MODm gates, outputs an equivalent SYM $\circ$ SYM circuit C'' with at most $2^{(\log s)^c}$ wires, and runs in time $O(2^{(\log s)^c})$, for $c \ge 1$ depending only on d and m.

The following paragraphs give the proof of Lemma 2.1. Let C be a SYM $\circ$ ACC $\circ$ THR circuit with inputs $x_1, \ldots, x_n$, size s, depth d, and MODm gates, for constants d > 2 and m > 1. In the proof, several constants arise; we will denote all of them by the same constant b which is assumed to be the maximum of these quantities.

The first step in Lemma 2.1 is to translate the THR layer of C into a SYM layer, by absorbing some of its complexity into the ACC part. Without loss of generality, we can assume that the weights of all threshold gates in C have absolute value at most $2^{bn\log_2 n}$ [MTT61, Mur71]. (Every THR function is equivalent to one with weights of bit-complexity at most $bn\log_2 n$.)<sup>7</sup> Maciel and Therien [MT98] provided several fairly tight low-depth circuits for various tasks. We need:

**Theorem 2.1** ([MT98], Theorem 3.3) Addition of n distinct n-bit numbers can be performed with polynomial-size AND $\circ$ OR $\circ$ SYM circuits. Furthermore, the circuits can be constructed in polynomial time.

We can therefore replace every THR gate of C with an $AC^0 \circ MAJ$ circuit, as follows. Fix a threshold gate of C, with weights $w_{i_1}, \ldots, w_{i_t}$ for $t \le n$, computing $\sum_{j=1}^{t-1} w_{i_j} x_{i_j} \ge w_{i_t}$ for some $i_j \in \{1, \ldots, n\}$. Note $|w_{i_j}| \le 2^{bn \log_2 n}$ for $j = 1, \ldots, t$. Set $W = bn \log_2 n$. Let D be a circuit for the addition of t-1 W-bit numbers, provided by Theorem 2.1. For $j=1,\ldots,t-1$, we connect to the jth W-bit input of D a circuit which, given $x_{i_j}$, feeds $w_{i_j}$ to D if the input bit $x_{i_j}=1$, and the all-zero W-bit string if $x_{i_j}=0$.
Note this extra circuit actually contains no gates: it simply has a wire from $x_{i_j}$ to all bits of the jth W-bit input where the corresponding bit of $w_{i_j}$ equals 1. Letting this new circuit be D', we have $D'(x_1, ..., x_n) = \sum_{j=1}^{t-1} w_{i_j} x_{i_j}$. This can be compared to the value $w_{i_t}$ with an AC<sup>0</sup> circuit, using the fact that the "less-than-or-equal-to" comparison of two integers can be performed in AC<sup>0</sup> [CSV84]. We now have an AC<sup>0</sup> $\circ$ SYM circuit D'' of size poly(W,t) $\leq n^b$ computing the given threshold gate. Applying this construction to each threshold gate in the THR layer of C, we obtain a SYM $\circ$ ACC $\circ$ SYM circuit C' of size at most $s \cdot n^b$.

<sup>7</sup>In fact, this "small-weight" representation can be efficiently obtained, by evaluating the large-weight representation at only n+1 points, then solving a linear system in n+1 variables to determine the weights. See [MTT61], Theorem 16.

The next step of Lemma 2.1 is to convert the SYM $\circ$ ACC part into a SYM $\circ$ AND circuit, using a reduction of Beigel-Tarui [BT94] (with important details on constructibility filled in by Allender-Gore [AG91]).

**Theorem 2.2** ([BT94, AG91]) Every SYM $\circ$ ACC circuit of size s can be simulated by a SYM $\circ$ AND circuit of $2^{(\log s)^{c'}}$ size for some constant c' depending only on the depth d and MODm gates of the ACC part. Moreover, the AND gates of the final circuit have only $(\log s)^{c'}$ fan-in, the final circuit can be constructed from the original in $2^{O((\log s)^{c'})}$ time, and the final symmetric function at the output can be computed in $2^{O((\log s)^{c'})}$ time.

Applying this reduction to the top SYM $\circ$ ACC part of the circuit C' results in an equivalent SYM $\circ$ AND$_{(\log(s \cdot n^b))^{c'}}$ $\circ$ SYM circuit C'' of size $s' = 2^{O((\log(s \cdot n^b))^{c'})}$ (where the subscript on the AND denotes the fan-in of each AND gate).
For simplicity of notation, let $t = (\log(s \cdot n^b))^{c'}$ in the following. Extending a trick of Beigel [Bei94] to symmetric gates, we can convert every AND$_t$ $\circ$ SYM subcircuit of C'' with $n^b$ wires into a single SYM gate with $O(n^{b \cdot t})$ wires. Let $S_1(x_1, \ldots, x_n) \wedge \cdots \wedge S_t(x_1, \ldots, x_n)$ be one such subcircuit, where $S_i$ denotes the ith symmetric gate. In particular, for $i = 1, \ldots, t$, let $f_i : \mathbb{Z} \to \{0, 1\}$ be such that $f_i(\sum_{j=1}^n c_{i,j}x_j) = S_i(x_1, \ldots, x_n)$, where $c_{i,j}$ denotes the number of copies of $x_j$ that feed into $S_i$. Let $B = 1 + \max_i(\sum_{j=1}^n c_{i,j})$; note that $B \leq n^b$. Consider the linear form

$$L(x_1,...,x_n) = \sum_{i=1}^t B^{i-1} \cdot \left(\sum_{j=1}^n c_{i,j}x_j\right).$$

For any Boolean assignment to the $x_j$'s, the number encoded by the linear form $L(x_1, \ldots, x_n)$ is an integer encoded in $O(t \cdot b \log n)$ bits. By construction, the bit representation of this integer contains, for every $i = 1, \ldots, t$, the number of wires input to $S_i$ which are set true, as a string of $(b \log n)$ bits. Therefore, from the linear form $L(x_1, \ldots, x_n)$ we can easily infer whether all $S_i(x_1, \ldots, x_n)$ output 1 or not, and hence output the value of $S_1 \wedge \cdots \wedge S_t$.

To implement this linear form with a single SYM gate, for all j = 1, ..., n we put $\sum_{i=1}^{t} B^{i-1} c_{i,j}$ wires from the input variable $x_j$ into the new SYM gate. Hence there are $O(n^{b \cdot t})$ wires from the inputs into this new SYM gate. By choosing the appropriate symmetric function (which outputs 1 if and only if $L(x_1, ..., x_n)$ encodes a number such that $S_1 \wedge \cdots \wedge S_t$ is true) we can simulate any AND$_t$ $\circ$ SYM circuit of $n^b$ wires with a single SYM gate of $O(n^{b \cdot t})$ wires.
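The packing step above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function name `pack_and_of_sym`, the list-of-lists encoding of the multiplicities $c_{i,j}$, and the representation of each $f_i$ as a Python callable are hypothetical choices made here, not notation from the paper.

```python
def pack_and_of_sym(c, fs):
    """Merge an AND of t symmetric gates into one symmetric gate.

    c[i][j] is the number of wires from x_j into gate S_i (hypothetical encoding),
    fs[i] maps the weighted count sum_j c[i][j] * x_j to the output bit of S_i.
    Returns the wire multiplicities of the new gate and its symmetric function.
    """
    t, n = len(c), len(c[0])
    B = 1 + max(sum(row) for row in c)      # every per-gate count is < B
    # x_j gets sum_i B^(i-1) * c[i][j] wires (i is 0-based here).
    wires = [sum(B**i * c[i][j] for i in range(t)) for j in range(n)]

    def new_sym(total):
        # The base-B digits of `total` are exactly the counts seen by S_1..S_t.
        for i in range(t):
            total, digit = divmod(total, B)
            if not fs[i](digit):
                return 0
        return 1

    return wires, new_sym


def and_of_sym_directly(c, fs, x):
    """Reference evaluation of S_1 AND ... AND S_t on the assignment x."""
    return int(all(f(sum(m * v for m, v in zip(row, x))) for row, f in zip(c, fs)))
```

Because each count is strictly less than $B$, no carries occur between base-$B$ digits, which is why the single sum determines every gate's individual count.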
Replacing each AND $\circ$ SYM subcircuit in this manner results in a SYM $\circ$ SYM circuit of size $O(s' \cdot n^{b \cdot t}) \le 2^{O(\log s)^c}$ for some constant $c \ge 1$. This concludes the proof of Lemma 2.1.

**Symmetric rank.** Next, we prove that the truth table of any SYM $\circ$ SYM circuit C'' of t wires and n inputs represents a $2^{n/2} \times 2^{n/2}$ matrix of *symmetric rank* at most poly(t), and this rank decomposition can be efficiently computed.

For given matrices A and B over the integers, let $A \cdot B$ denote their matrix product over the integers. Let $M \in \{0,1\}^{m \times n}$. We define the *symmetric rank of* M to be the minimum $r \in \mathbb{N}$ such that there are matrices $A \in \{0,1\}^{m \times r}$, $B \in \{0,1\}^{r \times n}$ and a function $f:\{0,1,\ldots,r\} \to \{0,1\}$ satisfying $M[i,j] = f((A \cdot B)[i,j])$ for all i,j. We call the triple (A,B,f) a *symmetric rank decomposition* of M.

The symmetric rank is similar to the typical notion of rank, except for the additional function f providing a "filter" from arbitrary integers back to $\{0,1\}$. This filter function could potentially lead to smaller rank decompositions than the typical notion. However, note the symmetric rank of M is not necessarily at most (for instance) the rank of M over $\mathbb{R}$, because A and B are required to have Boolean entries.

For simplicity let n be even, and let $z_1, \ldots, z_{2^{n/2}}$ be the list of all $2^{n/2}$ n/2-bit strings in lexicographical order. For a circuit C with n inputs, define the *truth table matrix* $M_C$ to be the $2^{n/2} \times 2^{n/2}$ matrix with $M_C[i,j]$ equal to the output of $C(z_i, z_j)$.

**Lemma 2.2** Given a SYM $\circ$ SYM circuit C with t wires and n inputs, its truth table matrix $M_C$ has symmetric rank $O(t^3)$, and a symmetric rank decomposition of $M_C$ can be computed from C in $2^{n/2} \cdot poly(t)$ time.

**Proof.** For simplicity we assume n is even; the case of odd n will be apparent.
Index the input variables of C by $x_1, \ldots, x_n$ . Let $g_1, \ldots, g_s$ be an indexing of the gates of C on the bottom layer (closest to the inputs) and let g' denote the output gate of C. (Note that $s \le t$ .) Let $f : \{0, 1, \ldots, s\} \to \{0, 1\}$ be the symmetric function of gate g': for all $a \in \{0, 1, \ldots, s\}$ , f(a) = b if and only if a true inputs make g' output b. We shall show how to efficiently construct matrices A and B with the appropriate properties. Let $z_1, \ldots, z_{2^{n/2}}$ be the list of all n/2-bit strings in lexicographical order, in the following. For every pair $(a,b) \in \{0,1,\ldots,t\}^2$ such that $a+b \le t$ , let $S_{a,b} \subseteq \{g_1,\ldots,g_s\}$ denote the subset of gates $g_j$ such that a+b true inputs makes gate $g_j$ output 1. The matrices A and B to be constructed show that the symmetric rank of $M_C$ is at most $$r = \sum_{a,b \in \{0,1,\dots,t\}: a+b \le t} |S_{a,b}| \le O(t^3).$$ In other words, each pair (a,b) will add $|S_{a,b}|$ additional components to the rows of A and the columns of B. For $i=1,\ldots,2^{n/2}$ , the ith row of A and ith column of B are defined as follows. For every pair (a,b), allocate $|S_{a,b}|$ additional components for the rows of A and columns of B. For $j = 1, ..., |S_{a,b}|$ , put a 1 in the *j*th additional component of the *i*th row of *A* if and only if there are *a* true wires going into the *j*th gate of $S_{a,b}$ when the input variables $x_1, ..., x_{n/2}$ are given assignment $z_i$ . That is, the *j*th component is 1 if and only if the contribution (from the first half of variables) to the overall sum for the *j*th gate is *a*. Similarly, for $j = 1, ..., |S_{a,b}|$ , put a 1 in the *j*th additional component of the *i*th column of *B* if and only if there are *b* true wires going into the *j*th gate of $S_{a,b}$ when the input variables $x_{n/2+1}, ..., x_n$ are given assignment $z_i$ . Note that each entry of A and B can be determined in poly(t) time. 
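The construction of A and B just described can be sketched in Python as follows. This is a minimal sketch under assumptions made here for illustration: a bottom gate is encoded as a hypothetical pair `(mult, g)`, where `mult[v]` is the number of wires from $x_v$ and `g` is the gate's symmetric function, and each bottom gate is assumed to feed the output gate with exactly one wire. As a small pruning, the pairs (a,b) are restricted to counts that the two halves of the input can actually produce.

```python
from itertools import product


def sym_rank_decomposition(n, gates):
    """Build 0-1 matrices A and B^T for a SYM-of-SYM circuit on n (even) inputs.

    gates: list of (mult, g) pairs (hypothetical encoding); mult[v] is the number
    of wires from x_v into the gate and g(count) is its 0/1 output.
    Row i of A and row k of Bt (= column k of B) are indexed by the
    lexicographically ordered n/2-bit strings zs.
    """
    h = n // 2
    zs = list(product([0, 1], repeat=h))
    # One component per (gate, a, b) with g(a + b) = 1, i.e. gate in S_{a,b}.
    comps = []
    for idx, (mult, g) in enumerate(gates):
        for a in range(sum(mult[:h]) + 1):
            for b in range(sum(mult[h:]) + 1):
                if g(a + b):
                    comps.append((idx, a, b))

    def left(idx, z):   # true wires into gate idx from x_1..x_{n/2}
        return sum(m * v for m, v in zip(gates[idx][0][:h], z))

    def right(idx, z):  # true wires into gate idx from x_{n/2+1}..x_n
        return sum(m * v for m, v in zip(gates[idx][0][h:], z))

    A = [[int(left(idx, z) == a) for (idx, a, b) in comps] for z in zs]
    Bt = [[int(right(idx, z) == b) for (idx, a, b) in comps] for z in zs]
    return zs, A, Bt


def evaluate_directly(gates, f, x):
    """Reference evaluation: apply f to the number of true bottom gates."""
    return f(sum(int(bool(g(sum(m * v for m, v in zip(mult, x))))) for mult, g in gates))
```

For each input pair $(z_i, z_k)$ and each bottom gate, at most one component (the one matching that gate's actual left and right counts) contributes 1 to the inner product, and it does so exactly when the gate outputs 1.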
For every fixed (a,b), the product of two *j*th components for the *i*th row of *A* and the *k*th column of *B* is either 0 or 1, and the product is 1 if and only if:

- the sum of true inputs into the *j*th gate of $S_{a,b}$ from the inputs $(x_1, \ldots, x_{n/2})$ equals a when the inputs $(x_1, \ldots, x_{n/2})$ are assigned $z_i$ ,
- the sum of true inputs into the same gate from $(x_{n/2+1},...,x_n)$ equals b when the inputs $(x_{n/2+1},...,x_n)$ are assigned $z_k$ , and
- the *j*th gate outputs 1 when its sum of true inputs equals a + b.

It follows that the *inner product* of the *i*th row of *A* and the *k*th column of *B* equals the total number $N_{i,k}$ of true wires going into the output gate of *C* on the variable assignment $(x_1, \ldots, x_n) \mapsto (z_i, z_k)$ . By definition, $f(N_{i,k})$ equals the output of *C* on that variable assignment.

We need one more lemma to complete the proof of Theorem 1.1:

**Lemma 2.3** For all sufficiently large N, and $\alpha \leq .172$ , multiplication of an $N \times N^{\alpha}$ matrix with an $N^{\alpha} \times N$ matrix can be done in $N^2 \cdot poly(\log N)$ arithmetic operations, over any field with $O(2^{poly(\log N)})$ elements.<sup>8</sup>

<sup>8</sup>See Appendix A for an exposition of this result.

**Proof of Theorem 1.1.** Given a SYM $\circ$ ACC $\circ$ THR circuit C of size s, convert C into a SYM $\circ$ SYM circuit C'' of $2^{(\log s)^c}$ size using Lemma 2.1. Using Lemma 2.2, compute a symmetric rank decomposition of the truth table matrix $M_{C''}$ into $2^{n/2} \times 2^{3(\log s)^c}$ and $2^{3(\log s)^c} \times 2^{n/2}$ 0-1 matrices A and B respectively, along with a function $f: [2^{3(\log s)^c}] \to \{0,1\}$ . Compute the product of A and B in $2^n \cdot \text{poly}(\log s, n)$ time, using Lemma 2.3. Finally, evaluate function f on all entries of the matrix product.
This can be done by numerically sorting the entries, replacing each entry v by f(v), then inverting the sorted order, in time $2^n \cdot \text{poly}(\log s, n) + 2^{O(\log s)^c}$ . For $s < 2^{n^{o(1)}}$ , the runtime is $2^n \cdot \text{poly}(n)$ . $\square$

#### 2.1 Counting satisfying assignments to ACC of linear thresholds

The evaluation algorithm of Theorem 1.1 is quite powerful, substantially extending the class of circuits for which we can perform non-trivial circuit analysis.

**Reminder of Theorem 1.2** For every m > 1 and d > 0, there is an $\varepsilon > 0$ such that counting satisfying assignments to ACC $\circ$ THR circuits of size $2^{n^{\varepsilon}}$ , depth d, and MODm gates can be done in $2^{n-n^{\varepsilon}}$ time.

**Proof.** For all $k \in \mathbb{N}$ and $i = 1, ..., 2k$ , define the function $\operatorname{Bit}_i^k$ on $2^{2k}$ inputs to output the *i*th bit of the sum of its input bits. Clearly, every $\operatorname{Bit}_i^k$ function is symmetric.

Suppose we are given an ACC $\circ$ THR circuit C of size s and n inputs, and we wish to count its satisfying assignments. Let $\ell < n/2$ be a parameter to set later. For every assignment $A_j \in \{0,1\}^{2\ell}$ to the last $2\ell$ inputs of C, make a copy of C with the assignment $A_j$ plugged into those $2\ell$ inputs, calling this copy $C_{A_j}$ . Note that each $C_{A_j}$ has (the same) $n - 2\ell$ inputs $x_1, \ldots, x_{n-2\ell}$ .

For every $i = 1, ..., 2\ell$ , define $B_i(x_1, ..., x_{n-2\ell}) := \operatorname{Bit}_i^{\ell}(C_{A_1}(x_1, ..., x_{n-2\ell}), ..., C_{A_{2^{2\ell}}}(x_1, ..., x_{n-2\ell}))$ . Each function $B_i$ can be implemented in $s' = 2^{2\ell} \cdot s$ size, as a SYM $\circ$ ACC $\circ$ THR circuit.
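The role of the $B_i$ functions in the count can be illustrated with a short sketch. This is only an illustration under stated assumptions: the circuit is modeled as an arbitrary Python function on bit tuples, brute-force enumeration stands in for the evaluation algorithm of Theorem 1.1, and the names `bit_tables`, `count_from_bit_tables`, and `threshold_circuit` are hypothetical. Bits are indexed from 0 here, and one extra bit is tabulated so that the value $2^{2\ell}$ itself is representable.

```python
from itertools import product

def bit_tables(circuit, n, ell):
    """For each bit position i, tabulate B_i over all assignments to the first n - 2*ell inputs."""
    heads = list(product((0, 1), repeat=n - 2 * ell))
    tails = list(product((0, 1), repeat=2 * ell))      # the assignments A_j
    # Number of copies C_{A_j} that accept each head (the sum fed into the Bit gates).
    sums = [sum(circuit(head + tail) for tail in tails) for head in heads]
    n_bits = 2 * ell + 1   # assumption: one extra bit so 2^(2*ell) is representable
    return [[(s >> i) & 1 for s in sums] for i in range(n_bits)]

def count_from_bit_tables(tables):
    """Recombine the bit tables into the total number of satisfying assignments."""
    return sum(bit << i for i, table in enumerate(tables) for bit in table)

def threshold_circuit(x):
    """Hypothetical example circuit: accepts when at least three inputs are true."""
    return int(sum(x) >= 3)

count = count_from_bit_tables(bit_tables(threshold_circuit, 6, 1))
```

Each table has one entry per partial assignment to the first $n - 2\ell$ variables, so the final sum touches $2^{n-2\ell}$ numbers of about $2\ell$ bits each.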
Applying Theorem 1.1, $B_i$ can be evaluated on all of its $2^{n-2\ell}$ possible assignments in time

$$2^{n-2\ell} \cdot \operatorname{poly}(n) + 2^{\operatorname{poly}(\log s')} \le 2^{n-2\ell} \cdot \operatorname{poly}(n) + 2^{\operatorname{poly}(\ell + \log s)}.$$

Repeating this evaluation for every $i = 1, \ldots, 2\ell$ produces $2\ell \cdot 2^{n-2\ell}$ bits: for each of the $2^{n-2\ell}$ partial assignments to $n-2\ell$ variables, we learn the number (in $2\ell$ bits) of partial assignments on the other $2\ell$ variables which result in satisfaction. The number of all satisfying assignments is obtained by simply summing all $2\ell$ -bit numbers obtained from the $2^{n-2\ell}$ assignments, in $2^{n-2\ell} \cdot \operatorname{poly}(\ell)$ time. Letting $\ell = n^{\varepsilon}/2$ for sufficiently small $\varepsilon > 0$ , we have a $2^{n-n^{\varepsilon}}$ time algorithm.

#### 2.2 Faster 0-1 linear programming

ACC $\circ$ THR circuits are definitely powerful enough to simulate 0-1 integer linear programming; a straightforward application of Theorem 1.2 would yield a faster algorithm for the problem. However, the improvement over exhaustive search would be rather minor, and tedious to calculate. By modifying the proof of Theorem 1.1 in appropriate places, we can derive a better algorithm in this case:

**Reminder of Theorem 1.4** Every 0-1 integer linear program with n variables and s constraints can be solved in time $2^{n-\Omega(n/((\log M)(\log s)^5))} \cdot poly(s,n,M)$ with high probability, where $M \leq 2^{o(n)}$ upper bounds the bit complexity of the coefficients in the program.

**Proof.** Consider a 0-1 linear program of the form $Ax \le b$ , along with a cost function $\langle c, x \rangle$ we wish to maximize, where $A \in \mathbb{Z}^{s \times n}$ , $b \in \mathbb{Z}^s$ , and $c \in ([-2^M, 2^M] \cap \mathbb{Z})^n$ by assumption on M.
First, reduce the optimization problem to one of feasibility, in a standard way: include $\langle c, x \rangle \ge v$ as an additional constraint for various $v \in \mathbb{Z}$ , and by binary searching on v, we maximize the value of v such that the s+1 constraint system remains feasible. Since the $x_i$ are Boolean valued, the binary search uses at most $O(M + \log n)$ calls to feasibility questions. Next, observe the feasibility questions can be viewed as a satisfiability question for a depth-two circuit D with an AND at the top gate, and linear threshold gates on the bottom layer, by directly translating each constraint in the program into a linear threshold gate. By Theorem 2.1 and the argument in Lemma 2.1, each threshold gate in the circuit D can be replaced with a polynomial-sized LEQ $\circ$ AND $\circ$ OR $\circ$ SYM circuit, where LEQ computes on n-bit integers a and b whether $a \le b$ . As LEQ has an OR $\circ$ AND $\circ$ XOR circuit of $O(n^2)$ size for n-bit inputs (see [CSV84] for a reference), the satisfiability question for the circuit D reduces to the SAT question for an AC $^0$ [2] $\circ$ SYM circuit C where the AC $^0$ [2] part has depth 5. Following the strategy of Theorem 1.2 (and the author's ACC SAT algorithm [Will11b]), the satisfiability question for C with n inputs and size poly(s) can be efficiently converted into the problem of evaluating a larger AC $^0$ [2] $\circ$ SYM circuit C', where C' has n' = n - k inputs, $2^k \cdot \text{poly}(s, M)$ size, k < n/2 is a parameter, and the AC $^0$ [2] part has depth 6. More precisely, C' is an OR of $2^k$ copies of the depth-5 circuit C, and each copy has its first k inputs assigned to a distinct string from $\{0,1\}^k$ . Clearly, this circuit C' is satisfiable if and only if C is satisfiable. Now we wish to evaluate C' on all $2^{n-k}$ inputs, efficiently. 
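Before turning to the evaluation step, the reduction from optimization to feasibility used at the start of this proof can be sketched as follows. This is a hedged illustration only: `is_feasible` is a brute-force stand-in for the feasibility procedure developed in the rest of the proof, and the function names and the convention of returning `None` for an infeasible program are assumptions.

```python
from itertools import product

def dot(u, v):
    """Inner product of two integer vectors."""
    return sum(a * b for a, b in zip(u, v))

def is_feasible(A, b, c, v):
    """Brute-force oracle: is there x in {0,1}^n with Ax <= b and <c,x> >= v?"""
    for x in product((0, 1), repeat=len(c)):
        if dot(c, x) >= v and all(dot(row, x) <= bound for row, bound in zip(A, b)):
            return True
    return False

def maximize(A, b, c):
    """Return max <c,x> subject to Ax <= b over x in {0,1}^n, or None if infeasible."""
    lo = sum(min(ci, 0) for ci in c)   # smallest value <c,x> can take
    hi = sum(max(ci, 0) for ci in c)   # largest value <c,x> can take
    if not is_feasible(A, b, c, lo):   # the extra constraint is vacuous at v = lo
        return None
    while lo < hi:                     # invariant: feasible at lo, optimum is at most hi
        mid = (lo + hi + 1) // 2
        if is_feasible(A, b, c, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Since $\langle c, x \rangle$ ranges over an interval of width at most $2n \cdot 2^M$, the loop makes $O(M + \log n)$ oracle calls, matching the bound stated in the proof.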
Rather than applying Beigel-Tarui at this point, as in Lemma 2.1, we instead apply the probabilistic polynomials of Smolensky [Smo87] to convert C' into a SYM $\circ$ SYM circuit C''. In particular, we use a slight modification of Smolensky's construction, as described by Kopparty and Srinivasan [KS12].

**Theorem 2.3** ([Smo87, KS12]) For every $AC^0$ circuit C of depth d, size s, and n inputs, and $\varepsilon > 0$ , there is a distribution of n-variate polynomials $\mathcal{D}_C$ over $\mathbb{F}_2$ with the following properties. Each p in the support of $\mathcal{D}_C$ has degree at most $(4\log s)^{d-1} \cdot (\log 1/\varepsilon)$ , a polynomial p can be sampled from $\mathcal{D}_C$ in $n^{O(\log s)^{d-1}(\log 1/\varepsilon)}$ time, and for every $x \in \{0,1\}^n$ , $\Pr_{p \sim \mathcal{D}_C}[p(x) = C(x)] \ge 1 - \varepsilon$ .

We apply Theorem 2.3 as follows. Recall that C' is an OR of some $AC^0[2] \circ SYM$ circuits $C_1, \ldots, C_{2^k}$ , each with (the same) n-k inputs. Moreover, the top $AC^0[2]$ part of each $C_i$ has depth 5, and each $C_i$ takes poly(s,M) inputs (coming from the outputs of SYM gates). For every i, we take the top $AC^0$ part of $C_i$ , and invoke Theorem 2.3 with $\varepsilon = 1/(10 \cdot 2^k)$ to sample $p_i \sim \mathcal{D}_{C_i}$ of degree at most $O(k(\log s)^4)$ and at most $\text{poly}(s,M)^{O(k \cdot (\log s)^4)}$ monomials. We replace the $AC^0$ part of $C_i$ with the XOR of ANDs circuit $p_i$ . Now the circuit C' is an OR of $2^k$ XOR $\circ$ AND $\circ$ SYM circuits; call them $C_1'', \ldots, C_{2^k}''$ . For every input $x \in \{0,1\}^{n-k}$ , by a union bound over the $2^k$ sampled polynomials, the circuits $C_1'', \ldots, C_{2^k}''$ output the same values as $C_1, \ldots, C_{2^k}$ on $x$ , with probability at least $1 - 1/10$ .
Now we randomly convert the topmost OR in C' to an XOR, with the usual Razborov-Smolensky subsum trick: we pick $r_{1,1}, r_{2,1}, r_{1,2}, r_{2,2}, \ldots, r_{1,2^k}, r_{2,2^k} \in \{0,1\}$ uniformly at random, and replace $C' = \mathsf{OR}(C''_1, \ldots, C''_{2^k})$ with

$$\begin{aligned} C''(x_{1},\ldots,x_{n-k}) &:= \left(\sum_{i=1}^{2^{k}} r_{1,i} \cdot C''_{i}(x_{1},\ldots,x_{n-k}) \bmod 2\right) \vee \left(\sum_{i=1}^{2^{k}} r_{2,i} \cdot C''_{i}(x_{1},\ldots,x_{n-k}) \bmod 2\right) \\ &= \sum_{i=1}^{2^{k}} r_{1,i} \cdot C''_{i}(x_{1},\ldots,x_{n-k}) + \sum_{i=1}^{2^{k}} r_{2,i} \cdot C''_{i}(x_{1},\ldots,x_{n-k}) \\ &\quad + \left(\sum_{i=1}^{2^{k}} r_{1,i} \cdot C''_{i}(x_{1},\ldots,x_{n-k})\right) \cdot \left(\sum_{i=1}^{2^{k}} r_{2,i} \cdot C''_{i}(x_{1},\ldots,x_{n-k})\right) \bmod 2, \end{aligned}$$

which means that C'' equals

$$\sum_{i=1}^{2^k} r_{1,i} \cdot C_i''(x_1, \dots, x_{n-k}) + \sum_{i=1}^{2^k} r_{2,i} \cdot C_i''(x_1, \dots, x_{n-k}) + \sum_{i,j=1}^{2^k} r_{1,i} \cdot r_{2,j} \cdot C_i''(x_1, \dots, x_{n-k}) \cdot C_j''(x_1, \dots, x_{n-k}) \bmod 2.$$

Now for every $x \in \{0,1\}^{n-k}$ ,

$$\begin{aligned} \Pr_{p_{i} \sim \mathcal{D}_{C_i},\, r_{i,j} \in \{0,1\}} [C''(x) \neq C'(x)] &\leq \Pr_{p_{1}, \dots, p_{2^k}} [\exists i,\ C''_{i}(x) \neq C_{i}(x)] \\ &\quad + \Pr_{r_{i,j} \in \{0,1\}} [C''(x) \neq \mathsf{OR}(C''_{1}(x), \dots, C''_{2^{k}}(x)) \mid \forall i,\ C''_{i}(x) = C_{i}(x)] \\ &\leq 1/10 + 1/4 = 7/20. \end{aligned}$$

That is, for every input $x \in \{0,1\}^{n-k}$ , the probability that C'(x) = C''(x) is at least 13/20, a constant strictly greater than 1/2.

Since each polynomial $p_i$ has degree at most $O(k \cdot (\log s)^4)$ , the AND gates representing the monomials of $p_i$ have $t \leq O(k \cdot (\log s)^4)$ fan-in. Applying another part of Lemma 2.1, the AND$_t$ $\circ$ SYM subcircuits of C'' with poly(s, M) wires can be replaced by a single SYM gate with $\text{poly}(s, M)^{O(t)}$ input wires. This results in an XOR $\circ$ SYM circuit C'' of $\text{poly}(s, M)^{O(k \cdot (\log s)^4)}$ total wires; this is also a SYM $\circ$ SYM circuit.
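The error of the subsum trick can be verified exhaustively for small fan-in. The sketch below is an illustration under assumptions, not part of the paper: the bit vector `values` plays the role of $(C''_1(x), \ldots, C''_{2^k}(x))$ for one fixed input x, and the function names are hypothetical. It enumerates every choice of the two random vectors and reports the exact failure probability of the two-subsum replacement for OR.

```python
from fractions import Fraction
from itertools import product

def subsum_or(values, r1, r2):
    """Two random parities of `values`, combined by the F_2 polynomial for OR."""
    s1 = sum(r * v for r, v in zip(r1, values)) % 2
    s2 = sum(r * v for r, v in zip(r2, values)) % 2
    return (s1 + s2 + s1 * s2) % 2    # equals (s1 OR s2)

def error_probability(values):
    """Exact probability, over uniform r1 and r2, that subsum_or differs from OR(values)."""
    target = int(any(values))
    choices = list(product((0, 1), repeat=len(values)))
    bad = sum(subsum_or(values, r1, r2) != target for r1 in choices for r2 in choices)
    return Fraction(bad, len(choices) ** 2)
```

When all values are 0 the replacement is always correct; otherwise each subsum is an unbiased bit, so both vanish with probability exactly 1/4, which is the second term in the bound above.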
Let $\varepsilon > 0$ be a parameter, and set $k := \max\{1, \frac{\varepsilon n}{(\log M)(\log s)^5}\}$ . (Note that if k = 1, the statement of Theorem 1.4 is trivially true.) Following the proof of Theorem 1.1, we can apply fast rectangular matrix multiplication to evaluate C'' on all $2^{n-k}$ inputs. For sufficiently small $\varepsilon > 0$ , the matrix multiplication runs in time

$$2^{n-k} \cdot \operatorname{poly}(O(k \cdot (\log s)^4), \log M, n-k) + \operatorname{poly}(s, M)^{O(k \cdot (\log s)^4)} \leq 2^{n-\Omega\left(\frac{n}{(\log M)(\log s)^5}\right)} \cdot \operatorname{poly}(s, M, n).$$

The output of this procedure is a $2^{n-k}$ -bit string which, for every $x \in \{0,1\}^{n-k}$ , contains the correct output C'(x) with probability at least some constant strictly greater than 1/2, by the error bound above.

Suppose we repeat the above randomized procedure $n^2$ times: that is, $n^2$ times, we independently sample a polynomial $p_i$ for each of the $2^k$ circuits $C_i$ and sample $r_{i,j} \in \{0,1\}$ , constructing $n^2$ different circuits $C_1'', \ldots, C_{n^2}''$ from C'. Then, standard tail bound arguments show that the majority value output by $C_1''(x), \ldots, C_{n^2}''(x)$ equals C'(x) for every $x \in \{0,1\}^{n-k}$ , with high probability. If some assignment $x^*$ has majority value 1, we conclude that the integer program is *feasible*; otherwise, we output *infeasible*.

#### 2.3 Non-uniform ACC $\circ$ THR lower bounds

We now turn to the main application of the evaluation algorithm:

**Reminder of Theorem 1.3** NEXP does not have non-uniform ACC $\circ$ THR circuits of quasi-polynomial size.

To set the context, let us discuss the prior connection between known circuit satisfiability algorithms and circuit lower bounds.

**Definition 2.1** Let $\mathscr{C}$ be a circuit class.
$\mathscr{C}$ is said to be typical if, given any circuit D from one of the classes $\mathscr{C} \circ \mathscr{C}$ , $\mathsf{AND} \circ \mathscr{C}$ , $\mathsf{OR} \circ \mathscr{C}$ , $\mathsf{NOT} \circ \mathscr{C}$ , an equivalent $D' \in \mathscr{C}$ can be produced in $\mathsf{poly}(\mathsf{size}(D))$ time.

That is, $\mathscr{C}$ is typical if it is *efficiently closed under composition*, unbounded fan-in AND, OR, and negations. Most well-studied circuit classes have this property. From prior work, we know there are connections between the existence of good SAT algorithms for typical circuit classes, and lower bounds against those classes:

**Theorem 2.4 ([Will1b])** Let $\mathscr{C}$ be typical. Suppose for every $c \geq 1$ , there is an $\varepsilon > 0$ and an algorithm for satisfiability of $\mathscr{C}$ circuits running in time $O(2^{n-n^{\varepsilon}})$ on circuits with n inputs and $n^{\log^c n}$ size. Then NEXP does not have quasi-polynomial size $\mathscr{C}$ circuits.

For example, the proof that NEXP $\not\subset$ ACC follows from giving a faster-than-exhaustive-search ACC satisfiability algorithm, noting that ACC is typical, and applying Theorem 2.4. This theorem cannot be directly applied to a class such as $ACC \circ THR$ , because it is not known whether $ACC \circ THR \circ ACC \circ THR$ can be efficiently simulated with $ACC \circ THR$ . However, by modifying the argument of Theorem 2.4 and using an algorithm for *counting* SAT assignments, we can extend the theorem to circuits with a very weak closure property.<sup>9</sup>

**Definition 2.2** Let $\mathscr{C}$ be a circuit class. We say $\mathscr{C}$ is weakly closed under AND if, given the AND of two circuits of $\mathscr{C}$ , an equivalent circuit in $\mathscr{C}$ can be produced in polynomial time.

Weak closure under AND is satisfied by strictly more circuit classes than the property of being typical.
To give an example, any class of the form $SYM \circ \cdots$ is weakly closed under AND, because an AND of t SYM gates with s wires can be collapsed into a single symmetric gate with $O(s^t)$ wires (as seen in the proof of Lemma 2.1). However, classes like $SYM \circ SYM$ are *not* known to be efficiently closed under composition or unbounded fan-in AND/OR, hence Theorem 2.4 does not apply to such classes. We prove:

**Theorem 2.5** Let $\mathscr{C}$ be weakly closed under AND. Suppose for every $c \geq 1$ , there is an $\varepsilon > 0$ and an algorithm for counting the satisfying assignments of $\mathscr{C}$ circuits in time $O(2^{n-n^{\varepsilon}})$ on circuits with n inputs and $n^{\log^c n}$ size. Then NEXP does not have quasi-polynomial size $\mathscr{C}$ circuits.

Note that Theorem 1.3 (the ACC $\circ$ THR lower bound) follows immediately from Theorem 2.5 and the counting algorithm of Theorem 1.2. It is our hope that Theorem 2.5 may be applicable in the future to depth-two classes, such as SYM $\circ$ SYM and depth-two *exact* threshold circuits [HP10]: a nontrivial counting SAT algorithm for one of these classes would entail new lower bounds.

**Proof of Theorem 2.5.** (Sketch) Let us start with the case where $\mathscr{C}$ is typical. We survey what is needed to conclude $\mathscr{C}$ lower bounds in the proof of Theorem 2.4, and show that the new hypothesis supplies these needs.

The idea is to show that NEXP $\subset \mathscr{C}$ and the hypothesis together imply that every $L \in \mathsf{NTIME}[2^n]$ can be simulated in nondeterministic $2^n/n$ time, contradicting the nondeterministic time hierarchy [Ž83]. In particular, the assumptions imply that the NEXP-complete problem SUCCINCT 3SAT on circuits of AND/OR/NOT with fan-in two, n inputs, and poly(n) size can be nondeterministically solved in $O(2^{n-n^{\varepsilon}})$ time, which is also provably false [Will1a].
Recall that SUCCINCT 3SAT is the problem: given an AND/OR/NOT circuit C of fan-in two, does the truth table of C encode a satisfiable 3-CNF formula? That is, SUCCINCT 3SAT is a "compressed" version of the 3SAT problem. Suppose we are given an (arbitrary) circuit C of size s and wish to determine if it is a yes-instance of SUCCINCT 3SAT. Assuming NEXP has quasipolynomial-size circuits, it is proved that for every C encoding a satisfiable 3-CNF F, there is a quasipolynomial-size circuit D which succinctly encodes a satisfying assignment for F: for all i, D(i) outputs the value of variable $x_i$ in the satisfying assignment. Our "fast" non-deterministic algorithm for SUCCINCT 3SAT guesses this circuit D, and uses it to construct a circuit E with n inputs and $n^{\log^c n}$ size for some c, which is unsatisfiable if and only if D encodes a satisfying assignment to the formula F encoded by C. Assuming NEXP has quasipolynomial-size $\mathscr C$ circuits and that there is an $O(2^{n-n^{\varepsilon}})$ time algorithm for $\mathscr C$ satisfiability, it is proved that there is a nondeterministic algorithm A running in $2^{n-\Omega(n^{\varepsilon})}$ time which, given an AND/OR/NOT of fan-in two circuit E of size s and n inputs, outputs an equivalent E' of $s^{\log^c s}$ size from the class $\mathscr C$ on at least one nondeterministic branch (and prints *no* on other branches). Running this algorithm A, obtaining E', then running the $\mathscr C$ satisfiability algorithm on E', we nondeterministically determine that C is a yes-instance of SUCCINCT 3SAT in $2^{n-\Omega(n^{\varepsilon})}$ time.

Now assume $\mathscr{C}$ is weakly closed under AND. The point where closure properties are relevant is precisely in the argument that the nondeterministic algorithm A exists. In fact, if our hypothesis and the assumption that NEXP has quasipolynomial-size $\mathscr{C}$ circuits imply such an algorithm, it can be observed that the rest of the proof carries over without modification. We now construct such an algorithm A.

<sup>9</sup>See also [JMV13, Oli13] which consider other (stronger) closure properties.

The algorithm A starts by guessing a $\mathscr C$ circuit E'' of $n^{\log^c n}$ size which takes as input a pair $(x,g) \in \{0,1\}^n \times \{0,1\}^{\log(\operatorname{size}(E))}$ , and outputs 1 if and only if the gate g in E outputs 1 when E is given the input x. (Such an E'' exists, assuming P has quasi-polynomial size $\mathscr C$ circuits.) Now we need to verify that for every gate g indexed by 1, 2, ..., size(E), E''(x, g) outputs what gate g of E(x) outputs, on all x. Each gate g is either an input, an AND of two previous gates $g_1$ and $g_2$ , or a NOT of a previous gate $g_1$ .

To aid this verification, we show how to efficiently check for arbitrary $\mathscr C$ circuits G and H whether G(x)=H(x) for all inputs x, using an algorithm for counting SAT assignments. Let #SAT(C) be the number of satisfying assignments to a circuit C. Observe that G(x)=H(x) for all x if and only if $\#SAT(G)=\#SAT(H)=\#SAT(G \wedge H)$ . (Note the third quantity can be efficiently computed, assuming $\mathscr C$ is weakly closed under AND.) Moreover, $G(x)\neq H(x)$ for all x if and only if $\#SAT(G)+\#SAT(H)=2^n$ and $\#SAT(G \wedge H)=0$ . Therefore, by counting SAT assignments, we have algorithms checking whether G is equivalent to H, and whether G is equivalent to the negation of H, both running in time $O(2^{n-n^{\varepsilon}})$ .

We claim that the verification problem for E'' can be reduced to a number of calls to the above kinds of checks. First, nondeterministically guess a circuit $E''_{not}$ , intended to satisfy $E''_{not}(x,g) = \neg E''(x,g)$ for all x and g. Verifying this condition can be done by counting SAT assignments, as described above. Checking E'' is correct on the input gates of E means that for all $i=1,\ldots,n,\ E''(x_1,\ldots,x_n,i)=x_i$ .
Both $E''(x_1,\ldots,x_n,i)$ and $I(x_1,\ldots,x_n)=x_i$ are $\mathscr C$ circuits, hence their equivalence can be verified by #SAT calls. Checking a NOT gate g of E with input gate $g_1$ is equivalent to checking that $E''_{not}(x,g_1)=E''(x,g)$ on all x. Checking an AND gate g of E with input gates $g_1$ and $g_2$ is equivalent to checking that $E''(x,g) = E''(x,g_1) \wedge E''(x,g_2)$ on all x; since $\mathscr C$ is weakly closed under AND, the right-hand side is itself a $\mathscr C$ circuit, so this is another equivalence check of the above kind.

On a circuit E with $s \le n^{\log^c n}$ gates, the above procedure runs in $O(2^{n-n^{\varepsilon}} \cdot s) \le 2^{n-\Omega(n^{\varepsilon})}$ time. When it concludes, we know that for all gates g and all x that E''(x,g) outputs the correct value. The circuit E'(x) output by A simply evaluates $E''(x,g^*)$ , where $g^*$ is the output gate of E.

# 3 Fast evaluation of depth-two threshold circuits

Finally, we show a strong sense in which depth-two threshold circuits are *weak*, by giving a fast algorithm for evaluating such circuits on many assignments in batch. The general theorem is:

**Theorem 3.1** Given a depth-two linear threshold circuit C with 2k inputs and at most $n^{1/12}$ gates with weights on the bottom layer of absolute value at most $W_b$ , weights on the output gate of absolute value at most $W_o$ , and given two sets $A, B \subseteq \{0,1\}^k$ where |A| = |B| = n, we can evaluate C on all $n^2$ points in $A \times B$ using $n^2 \cdot poly(\log W_o, \log n) + n^{1+1/12} \cdot poly(\log n, \log W_b)$ time.

The following is immediate from Theorem 3.1:

**Reminder of Theorem 1.5** Let k > 1. Given a depth-two $2^{n/24}$ -size linear threshold circuit C with integer weights in $[-2^{n^k}, 2^{n^k}]$ , we can evaluate C on all $2^n$ input assignments in $2^n \cdot poly(n^k)$ time.

While the proof of Theorem 3.1 also ultimately depends on Coppersmith's rectangular matrix multiplication, the rest of the algorithm is rather different from the evaluation algorithm of Theorem 1.1.

**Proof of Theorem 3.1.** We reduce the evaluation task to a special kind of matrix multiplication, then combine Coppersmith's matrix multiplication with a mild brute force to expedite the matrix multiply.
Define LEQ: $\mathbb{Z} \times \mathbb{Z} \to \{0,1\}$ to output 1 on (a,b) if and only if $a \leq b$ . Given a vector $w = (w_1, \dots, w_d) \in \mathbb{Z}^d$ , and given two matrices M and N which are $n \times d$ and $d \times n$ , define their w-weighted threshold product to be $(M \circledast_w N)[i,j] := \sum_{k=1}^d w_k \cdot \text{LEQ}(M[i,k],N[k,j])$ .

We shall show that the w-weighted threshold product of an $n \times n^{1/12}$ matrix and an $n^{1/12} \times n$ matrix can be computed in essentially $n^2 \cdot \text{poly}(\log n)$ time (with some additional but negligible overhead in terms of the weights). Let us postpone this algorithm for the moment, and first show how to embed the evaluation problem into the weighted threshold product.

Let C be a depth-two circuit of size s, with the 2k input variables $x_1, \ldots, x_k, y_1, \ldots, y_k$ . Let $w_1, \ldots, w_s$ be the weights of the top threshold gate of C, and let $\ell_1, t_1, \ldots, \ell_s, t_s$ be the corresponding linear forms and threshold values from the bottom layer of threshold gates: that is, the output of LEQ $(t_i, \ell_i)$ is multiplied by $w_i$ in the output gate. Without loss of generality, we may assume that all weights $w_i$ are multiplied by the output of some threshold gate at the bottom layer (there are at most n wires from the input directly to the output gate, and they can be replaced by O(n) dummy gates at the bottom layer with wires to the output gate). Let $A = \{A_1, \ldots, A_n\} \subseteq \{0, 1\}^k$ and $B = \{B_1, \ldots, B_n\} \subseteq \{0, 1\}^k$ .

We partition each linear form $\ell_j$ on the bottom layer into two sums $\ell_j^{(x)}$ and $\ell_j^{(y)}$ , such that $\ell_j^{(x)}$ involves only input variables $x_1, \ldots, x_k$ , $\ell_j^{(y)}$ involves only $y_1, \ldots, y_k$ , and $\ell_j^{(x)} + \ell_j^{(y)} = \ell_j$ .
Let $A_i(\ell_k^{(x)})$ and $B_j(\ell_k^{(y)})$ denote the value of the linear form $\ell_k^{(x)}$ (respectively, $\ell_k^{(y)}$ ) evaluated on assignment $A_i$ (respectively, $B_j$ ).

Define the matrix M with rows indexed by elements of A, and columns indexed by the bottom layer gates $1, \ldots, s$ . Set M[i,k] to the value $t_k - A_i(\ell_k^{(x)})$ . The matrix N has rows indexed by the bottom layer gates $1, \ldots, s$ , and columns indexed by elements of B. Set N[k,j] to the value $B_j(\ell_k^{(y)})$ .

Now consider the *w*-weighted threshold product $M \circledast_w N$ , where *w* is the same as above. The *i*, *j* entry of this product equals

$$\sum_{k=1}^{s} w_k \cdot \text{LEQ}\left(t_k - A_i(\ell_k^{(x)}), B_j(\ell_k^{(y)})\right) = \sum_{k=1}^{s} w_k \cdot \text{LEQ}\left(t_k, A_i(\ell_k^{(x)}) + B_j(\ell_k^{(y)})\right).$$

This is precisely the value of the linear form in the output gate of C, when $x_1, \ldots, x_k$ are given the assignment $A_i$ and $y_1, \ldots, y_k$ are assigned $B_j$ . The truth table of C on $A \times B$ can be recovered by simply checking which entries in $(M \circledast_w N)$ exceed the output gate's threshold.

Next, we shall show how to compute a weighted threshold matrix product efficiently. Let $\delta$ be a parameter, and let M and N be $n \times n^{\delta}$ and $n^{\delta} \times n$ matrices, respectively.

The first step is to reduce the weights significantly. For all $k = 1, \ldots, n^{\delta}$ , let $S_k$ be a list of all entries in the kth column of M, plus the kth row of N. Sort $S_k$ , obtaining a ranking of $2n$ items, and replace each entry in the $k$ th column of $M$ and the $k$ th row of $N$ by its rank in the sorted list $S_k$ . This step reduces the domains of $M$ and $N$ to $\{1, \ldots, 2n\}$ , and the $w$ -weighted threshold matrix product remains the same: all inequalities $M[i,k] \le N[k,j]$ are preserved. Note this step takes $n^{1+\delta} \cdot$ poly(log $n$ , log $W_b$ ) time.

In order to reduce to matrix multiplication, we perform two strategies with different advantages.
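As a concrete check of the steps so far (the embedding into a weighted threshold product, and the replacement of entries by ranks), here is a minimal sketch. It is not the fast algorithm of the theorem: the product is computed naively, and the gate encoding, the toy values in `GATES` and `W`, and all function names are assumptions chosen for illustration. Ranks start at 0 rather than 1, and equal values share a rank so that every inequality is preserved.

```python
from itertools import product

def dot(u, v):
    """Inner product of two integer vectors."""
    return sum(a * b for a, b in zip(u, v))

def weighted_threshold_product(M, N, w):
    """Naive product: P[i][j] = sum over k of w[k] * LEQ(M[i][k], N[k][j])."""
    n_cols = len(N[0])
    return [[sum(wk for wk, m, n_row in zip(w, m_row, N) if m <= n_row[j])
             for j in range(n_cols)] for m_row in M]

def embed(gates, A, B):
    """gates: (coefficients on x, coefficients on y, threshold t); a gate fires iff t <= l(x, y)."""
    M = [[t - dot(cx, a) for (cx, _, t) in gates] for a in A]
    N = [[dot(cy, b) for b in B] for (_, cy, _) in gates]
    return M, N

def rank_reduce(M, N):
    """Replace entries by their rank within S_k (column k of M together with row k of N)."""
    M2 = [row[:] for row in M]
    N2 = [row[:] for row in N]
    for k in range(len(N)):
        values = sorted({row[k] for row in M} | set(N[k]))
        rank = {v: r for r, v in enumerate(values)}   # equal values share a rank
        for row in M2:
            row[k] = rank[row[k]]
        N2[k] = [rank[v] for v in N[k]]
    return M2, N2

def output_form(gates, w, a, b):
    """Direct evaluation of the output gate's linear form on the assignment (a, b)."""
    return sum(wk for wk, (cx, cy, t) in zip(w, gates) if t <= dot(cx, a) + dot(cy, b))

# Hypothetical toy circuit with k = 2 variables per half and three bottom gates.
GATES = [([3, -1], [2, 5], 4), ([-2, -2], [1, 1], 0), ([1000, 7], [-900, 3], 50)]
W = [5, -3, 2]
A_SET = list(product((0, 1), repeat=2))
B_SET = list(product((0, 1), repeat=2))

M, N = embed(GATES, A_SET, B_SET)
P = weighted_threshold_product(M, N, W)
M2, N2 = rank_reduce(M, N)
```

After `rank_reduce`, every entry lies below 2n regardless of how large the original weights were, which is what lets the remaining steps run in time independent of $W_b$.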
(The reduction is inspired by work of Matousek [Mat91] on computing dominances in high dimensions.)

Let $s \in \{1, ..., n\}$ be a parameter. Partition each sorted list $S_k$ into $t = \lceil n/s \rceil$ contiguous buckets $T_1, ..., T_t$ , where each bucket $T_i$ contains at most s entries. (For all i < j, the largest entry in $T_i$ is at most the smallest entry in $T_j$ .)

Start with an $n \times n$ output matrix P that is all zeroes. For every $(i,k) \in [n] \times [n^{\delta}]$ , look up the bucket $T_{\ell}$ containing M[i,k] in the sorted list $S_k$ . For all N[k,j] contained in $T_{\ell}$ such that $M[i,k] \leq N[k,j]$ , add the weight $w_k$ to the entry P[i, j]. This loop adds to P all terms $w_k \cdot \text{LEQ}(M[i, k], N[k, j])$ such that M[i, k] and N[k, j] appear in the same bucket of $S_k$ . Observe that this step takes $\tilde{O}(n \cdot n^{\delta} \cdot s)$ time.

To handle the (M[i,k],N[k,j]) pairs that do not appear in the same bucket, we use matrix multiplication. For each $(i,k) \in [n] \times [n^{\delta}]$ , replace the entry M[i,k] with a row vector $v_{i,k} \in \{0,w_k\}^t$ , such that $v_{i,k}[\ell] := w_k$ if and only if M[i,k] is in bucket $T_{\ell}$ of $S_k$ . That is, $v_{i,k}$ has $w_k$ in exactly one entry, and zeroes elsewhere. This forms a matrix M' of dimensions $n \times (n^{\delta} \cdot t)$ . For $(k,j) \in [n^{\delta}] \times [n]$ , replace each entry N[k,j] with a column vector $u_{k,j} \in \{0,1\}^t$ , such that $u_{k,j}[\ell'] := 1$ if and only if N[k,j] is in bucket $T_{\ell}$ of $S_k$ and $\ell > \ell'$ . This forms a matrix N' of dimensions $(n^{\delta} \cdot t) \times n$ . The matrix product $M' \cdot N'$ over the integers computes a sum of inner products

$$(M'\cdot N')[i,j] = \sum_{k=1}^{n^{\delta}} \langle v_{i,k}, u_{k,j} \rangle.$$

If M[i,k] > N[k,j], or M[i,k] and N[k,j] are in the same bucket of $S_k$ , then $\langle v_{i,k}, u_{k,j} \rangle = 0$ .
If $M[i,k] \le N[k,j]$ but N[k,j] and M[i,k] are in different buckets of $S_k$ then $\langle v_{i,k}, u_{k,j} \rangle = w_k$ . Letting $P := P + (M' \cdot N')$ , this procedure adds to P all terms $w_k \cdot \text{LEQ}(M[i,k],N[k,j])$ such that M[i,k] and N[k,j] appear in different buckets of $S_k$ . Therefore P[i,j] contains the value of the linear form for the output gate of C, under variable assignment $(A_i,B_j)$ , for all i,j.

The above algorithm runs in time $O(n \cdot n^{\delta} \cdot s \log W_o + MM(n, n^{1+\delta}/s, n) \cdot poly(\log W_o))$ , where MM(a, b, c) is the running time for multiplying $a \times b$ and $b \times c$ matrices. If we set $n^{1+\delta}/s = n^{0.172}$ , then Coppersmith's algorithm (Lemma 2.3) can be applied to the second term of the running time, implementing it in $n^2 \cdot poly(\log n)$ time. Under this setting, $s = n^{\delta} \cdot n^{0.828}$ and the first term of the running time is $n^{1+2\delta+0.828}$ . Setting $\delta = 0.086 > 1/12$ , the first term becomes $n^2$ (note that $s = n^{.914}$ ).

It is easy to see that, since the above algorithm actually evaluates the linear form at the output gate of a depth-two threshold circuit, we can also efficiently evaluate large SYM $\circ$ THR circuits.

**Acknowledgements.** I thank Igor Carboni Oliveira for sending a preliminary version of his survey, which helped the ideas in the proof of Theorem 2.5 to congeal. I also thank Rahul Santhanam for helpful comments on an earlier draft.

#### References

- [ABFR94] James Aspnes, Richard Beigel, Merrick Furst, and Steven Rudich. The expressive power of voting polynomials. *Combinatorica*, 14(2):135–148, 1994.
- [ACPS09] Benny Applebaum, David Cash, Chris Peikert, and Amit Sahai. Fast cryptographic primitives and circular-secure encryption based on hard learning problems. In *CRYPTO*, pages 595–618, 2009.
- [AG91] Eric Allender and Vivek Gore. On strong separations from $AC^0$ . Fundamentals of Computation Theory, 8, 1991.
- [Bei94] Richard Beigel. When do extra majority gates help? polylog(n) majority gates are equivalent to one. *Computational Complexity*, 4:314–324, 1994. - [BH12] Paul Beame and Trinh Huynh. Multiparty communication complexity and threshold circuit size of AC0. 41(3):484–518, 2012. - [BP94] Dario Bini and Victor Pan. Polynomial and matrix computations. Birkhauser, 1994. - [BS94] David A. Mix Barrington and Howard Straubing. Complex polynomials and circuit lower bounds for modular counting. *Computational Complexity*, 4(4):325–338, 1994. - [BT94] Richard Beigel and Jun Tarui. On ACC. Computational Complexity, pages 350–366, 1994. - [CH05] Arkadev Chattopadhyay and Kristoffer Arnsfelt Hansen. Lower bounds for circuits with few modular and symmetric gates. In *ICALP*, pages 994–1005, 2005. - [CKY89] John F. Canny, Erich Kaltofen, and Lakshman Yagati. Solving systems of non-linear equations faster. In *Proc. ACM-SIGSAM International Symposium on Symbolic and Algebraic Computation*, pages 121–128, 1989. - [Coh13] Gil Cohen. A taste of circuit complexity pivoted at NEXP $\not\subset$ ACC (and more). Lecture Notes, Electronic Colloquium on Computational Complexity (ECCC), http://eccc.hpi-web.de/resources/pdf/cohen.pdf, 2013. - [Cop82] Don Coppersmith. Rapid multiplication of rectangular matrices. *SIAM J. Comput.*, 11(3):467–471, 1982. - [Cop97] D. Coppersmith. Rectangular matrix multiplication revisited. *Journal of Complexity*, 13:42–49, 1997. - [CSV84] Ashok K. Chandra, Larry Stockmeyer, and Uzi Vishkin. Constant depth reducibility. *SIAM Journal on Computing*, 13(2):423–439, 1984. - [FKL<sup>+</sup>01] Jürgen Forster, Matthias Krause, Satyanarayana V. Lokam, Rustam Mubarakzjanov, Niels Schmitt, and Hans Ulrich Simon. Relations between communication complexity, linear arrangements, and computational complexity. In *FSTTCS 2001: Foundations of Software Technology and Theoretical Computer Science*, pages 171–182. Springer, 2001. - [Gal12] François Le Gall. 
Faster algorithms for rectangular matrix multiplication. In *FOCS*, pages 514–523, 2012.
- [Gol97] Mikael Goldmann. On the power of a threshold gate at the top. *Information Processing Letters*, 63(6):287–293, 1997.
- [GS10] Parikshit Gopalan and Rocco A. Servedio. Learning and lower bounds for $AC^0$ with threshold gates. In *APPROX/RANDOM*, pages 588–601. Springer, 2010.
- [Han07] Kristoffer Arnsfelt Hansen. Computing symmetric boolean functions by circuits with few exact threshold gates. In *COCOON*, pages 448–458, 2007.
- [HM04] Kristoffer Arnsfelt Hansen and Peter Bro Miltersen. Some meet-in-the-middle circuit lower bounds. In *MFCS*, pages 334–345, 2004.
- [HMP<sup>+</sup>93] András Hajnal, Wolfgang Maass, Pavel Pudlák, Mario Szegedy, and György Turán. Threshold circuits of bounded depth. *J. Comput. Syst. Sci.*, 46(2):129–154, 1993.
- [HP98] X. Huang and V. Y. Pan. Fast rectangular matrix multiplication and applications. *J. of Complexity*, 14(2):257–299, 1998.
- [HP10] Kristoffer Arnsfelt Hansen and Vladimir V. Podolskii. Exact threshold circuits. In *IEEE Conf. Computational Complexity*, pages 270–279, 2010.
- [HP13] Kristoffer Arnsfelt Hansen and Vladimir V. Podolskii. Polynomial threshold functions and boolean threshold circuits. In *MFCS*, pages 516–527, 2013.
- [IKW02] Russell Impagliazzo, Valentine Kabanets, and Avi Wigderson. In search of an easy witness: Exponential time vs. probabilistic polynomial time. *JCSS*, 65(4):672–694, 2002.
- [IMP12] Russell Impagliazzo, William Matthews, and Ramamohan Paturi. A satisfiability algorithm for $AC^0$. In *SODA*, pages 961–972, 2012.
- [IPS13] Russell Impagliazzo, Ramamohan Paturi, and Stefan Schneider. A satisfiability algorithm for sparse depth two threshold circuits. In *FOCS*, pages 479–488, 2013.
- [JMV13] Hamid Jahanjou, Eric Miles, and Emanuele Viola. Local reductions. Technical Report TR13-099, Electronic Colloquium on Computational Complexity, July 2013.
- [KS12] Swastik Kopparty and Srikanth Srinivasan.
Certifying polynomials for $AC^0$ (parity) circuits, with applications. In *FSTTCS*, pages 36–47, 2012.
- [KZHP08] ShanXue Ke, BenSheng Zeng, WenBao Han, and Victor Y. Pan. Fast rectangular matrix multiplication and some applications. *Science in China Series A: Mathematics*, 51(3):389–406, 2008.
- [Lok08] Satyanarayana V. Lokam. Complexity lower bounds using linear algebra. *Foundations and Trends in Theoretical Computer Science*, 4(1-2):1–155, 2008.
- [LS11] Shachar Lovett and Srikanth Srinivasan. Correlation bounds for poly-size $AC^0$ circuits with $n^{1-o(1)}$ symmetric gates. In *APPROX/RANDOM*, pages 640–651. Springer, 2011.
- [Mat91] Jiri Matousek. Computing dominances in $E^n$. *Inf. Process. Lett.*, 38(5):277–278, 1991.
- [MP69] Marvin Minsky and Seymour Papert. *Perceptrons: An Introduction to Computational Geometry*. The MIT Press, 1969.
- [MT93] Alexis Maciel and Denis Thérien. Threshold circuits for iterated multiplication: Using $AC^0$ for free. In *STACS*, pages 545–565, 1993.
- [MT98] Alexis Maciel and Denis Thérien. Threshold circuits of small majority-depth. *Information and Computation*, 146(1):55–83, 1998.
- [MT99] Alexis Maciel and Denis Thérien. Efficient threshold circuits for power series. *Inf. Comput.*, 152(1):62–73, 1999.
- [MTT61] S. Muroga, I. Toda, and S. Takasu. Theory of majority decision elements. *Journal of the Franklin Institute*, 271:376–418, 1961.
- [Mur71] S. Muroga. *Threshold Logic and its Applications*. John Wiley & Sons, Inc., 1971.
- [Nis94] Noam Nisan. The communication complexity of threshold gates. In *Proceedings of "Combinatorics, Paul Erdős is Eighty"*, pages 301–315, 1994.
- [NR04] Moni Naor and Omer Reingold. Number-theoretic constructions of efficient pseudo-random functions. *JACM*, 51(2):231–262, 2004.
- [Oli13] Igor Oliveira. Algorithms versus circuit lower bounds. Technical Report TR13-117, Electronic Colloquium on Computational Complexity (ECCC), September 2013.
- [Pan84] Victor Y. Pan.
*How to multiply matrices faster*. Springer-Verlag Lecture Notes in Computer Science 179, 1984.
- [Pla02] Erion Plaku. Multiplicity automata, polynomials and the complexity of small-depth boolean circuits. Master's thesis, Clarkson University, Potsdam, NY, 2002.
- [Pod12] Vladimir V. Podolskii. Exponential lower bound for bounded depth circuits with few threshold gates. *Information Processing Letters*, 112:267–271, 2012.
- [Raz92] Alexander A. Razborov. On small depth threshold circuits. In *SWAT*, pages 42–52, 1992.
- [Reg97] Kenneth W. Regan. Polynomials and combinatorial definitions of languages. pages 261–293. Springer LNCS, 1997.
- [RR97] Alexander Razborov and Steven Rudich. Natural proofs. *JCSS*, 55(1):24–35, 1997.
- [RS10] Alexander A. Razborov and Alexander A. Sherstov. The sign-rank of $AC^0$. *SIAM Journal on Computing*, 39(5):1833–1855, 2010.
- [RT92] John H. Reif and Stephen R. Tate. On threshold circuits and polynomial computation. *SIAM J. Comput.*, 21:118–123, 1992.
- [RW93] Alexander Razborov and Avi Wigderson. $n^{\Omega(\log n)}$ lower bounds on the size of depth-3 threshold circuits with AND gates at the bottom. *Information Processing Letters*, 45(6):303–307, 1993.
- [San12] Rahul Santhanam. Ironic complicity: Satisfiability algorithms and circuit lower bounds. *Bulletin of the EATCS*, 106:31–52, 2012.
- [SBKH93] K-Y Siu, Jehoshua Bruck, Thomas Kailath, and Thomas Hofmeister. Depth efficient neural networks for division and related problems. *Information Theory, IEEE Transactions on*, 39(3):946–956, 1993.
- [Sch81] Arnold Schönhage. Partial and total matrix multiplication. *SIAM J. Comput.*, 10(3):434–455, 1981.
- [She09] Alexander A. Sherstov. Separating $AC^0$ from depth-2 majority circuits. *SIAM Journal on Computing*, 38(6):2113–2129, 2009.
- [SM83] Gadiel Seroussi and Fai Ma. On the arithmetic complexity of matrix kronecker powers. *Information Processing Letters*, 17(3):145–148, 1983.
- [Smo87] Roman Smolensky.
Algebraic methods in the theory of lower bounds for Boolean circuit complexity. In *STOC*, pages 77–82, 1987.
- [SP94] Kai-Yeung Siu and Vwani P. Roychowdhury. On optimal depth threshold circuits for multiplication and related problems. *SIAM Journal on Discrete Mathematics*, 7(2):284–292, 1994.
- [Vio06] Emmanuele Viola. Pseudorandom bits for constant-depth circuits with few arbitrary symmetric gates. *SIAM J. Comput.*, 36:1387–1403, 2006.
- [Wil11a] Ryan Williams. Guest column: a casual tour around a circuit complexity bound. *ACM SIGACT News*, 42(3):54–76, 2011.
- [Wil11b] Ryan Williams. Non-uniform ACC circuit lower bounds. In *IEEE Conf. Computational Complexity*, pages 115–125, 2011.
- [Wil13a] Ryan Williams. Faster all-pairs shortest paths via circuit complexity. Submitted, 2013.
- [Wil13b] Ryan Williams. Natural proofs versus derandomization. In *STOC*, pages 21–30, 2013.
- [Wil10] Ryan Williams. Improving exhaustive search implies superpolynomial lower bounds. *SIAM Journal on Computing*, 42(3):1218–1244, 2013. See also STOC'10.
- [Ž83] Stanislav Žák. A Turing machine time hierarchy. *Theoretical Computer Science*, 26(3):327–333, October 1983.

# A Appendix: An exposition of Coppersmith's algorithm

In 1982, Don Coppersmith proved that the rank (that is, the number of essential multiplications) of $N \times N^{0.172}$ and $N^{0.172} \times N$ matrix multiplication is at most $O(N^2\log^2 N)$. Prior work has observed that his algorithm can also be used to show that the total number of arithmetic operations for the same matrix multiply is $N^2 \cdot \text{poly}(\log N)$. However, the implication is not immediate, and uses specific properties of Coppersmith's algorithm. Because this result is so essential to this work and a recent algorithm for all-pairs shortest paths [Wil13a], we give here a self-contained exposition.
**Theorem A.1 (Coppersmith [Cop82])** For all sufficiently large N, the rank of $N \times N^{.172} \times N$ matrix multiplication is at most $O(N^2 \log^2 N)$.

We wish to derive the following consequence of Coppersmith's construction, which has been mentioned in the literature before [SM83, ACPS09, Wil11b]:

**Reminder of Lemma 2.3** For all sufficiently large N, and $\alpha \leq .172$, multiplication of an $N \times N^{\alpha}$ matrix with an $N^{\alpha} \times N$ matrix can be done in $N^2 \cdot poly(\log N)$ arithmetic operations, over any field with $O(2^{poly(\log N)})$ elements.

For brevity, we will use the notation "$\ell \times m \times n$ matrix multiply" to refer to the multiplication of $\ell \times m$ and $m \times n$ matrices (hence the above gives an algorithm for $N \times N^{\alpha} \times N$ matrix multiply).

Note that Lemma 2.3 has been "improved" in the sense that the upper bound on $\alpha$ has been increased mildly over the years [Cop97, HP98, KZHP08, Gal12]. However, these later developments only run in $N^{2+o(1)}$ time, not $N^2 \cdot \text{poly}(\log N)$ time (which we require). Our exposition will expand on the informal description given in recent work [Wil11b].

First, observe that the implication from Theorem A.1 to Lemma 2.3 is not immediate. For example, it could be that Coppersmith's algorithm is non-uniform, making it difficult to apply. As far as we know, one cannot simply take "constant size" arithmetic circuits implementing the algorithm of Theorem A.1 and recursively apply them: doing so would turn the poly(log N) factor in the running time into $N^{\varepsilon}$ for some constant $\varepsilon > 0$ (depending on the size of the constant-size circuit). To keep the overhead polylogarithmic, we have to unpack the algorithm and analyze it directly.
#### A.1 A short preliminary

Coppersmith's algorithm builds on many other tools from prior matrix multiplication algorithms, many of which can be found in the highly readable book of Pan [Pan84]. Here we will give a very brief tutorial of some of the aspects.

**Bilinear algorithms and trilinear forms.** Essentially all methods for matrix multiplication are bilinear (and if not, they can be converted into such algorithms), meaning that they can be expressed in the so-called trilinear form

$$\sum_{ijk} A_{ik} B_{kj} C_{ji} + p(x) = \sum_{\ell=1}^{5} \left( \sum_{ij} \alpha_{ij} A_{ij} \right) \cdot \left( \sum_{ij} \beta_{ij} B_{ij} \right) \cdot \left( \sum_{ij} \gamma_{ij} C_{ij} \right) \tag{1}$$

where $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$ are constant-degree polynomials in x over the field, and p(x) is a polynomial with constant coefficient 0. Such an algorithm can be converted into one with no polynomials and minimal extra overhead (as described in Coppersmith's paper). Typically one thinks of $A_{ik}$ and $B_{kj}$ as entries in the input matrices, and $C_{ji}$ as indeterminates, so the LHS of (1) corresponds to a polynomial whose $C_{ji}$ coefficient is the ij entry of the matrix product. Note the *transpose* of the third matrix C corresponds to the final matrix product.

To give an explicit example, we assume the reader is familiar with Strassen's famous method for $2 \times 2 \times 2$ matrix multiply.
Strassen's algorithm can be expressed in the form of (1) as follows: $$\sum_{i,j,k=0,1} A_{ik} B_{kj} C_{ji} = (A_{00} + A_{11})(B_{00} + B_{11})(C_{00} + C_{11}) + (A_{10} + A_{11})B_{00}(C_{01} - C_{11}) + A_{00}(B_{01} - B_{11})(C_{10} + C_{11}) + (A_{10} - A_{00})(B_{00} + B_{01})C_{11} + (A_{00} + A_{01})B_{11}(C_{10} - C_{00}) + A_{11}(B_{10} - B_{00})(C_{00} + C_{01}) + (A_{01} - A_{11})(B_{10} + B_{11})C_{00}.$$ (2) The LHS of (1) and (2) represents the trace of the product of three matrices A, B, and C (where the ij entry of matrix X is $X_{ij}$ ). It is well known that every bilinear algorithm naturally expresses multiple algorithms through this trace representation. Since $$tr(ABC) = tr(BCA) = tr(CAB) = tr((ABC)^T) = tr((BCA)^T) = tr((CAB)^T),$$ if we think of A as a symbolic matrix and consider (1), we obtain a new algorithm for computing a matrix A when given B and C. Similarly, we get an algorithm for computing a B when given A and C, and analogous statements hold for computing $A^T$ , $B^T$ , and $C^T$ . So the aforementioned algorithm for multiplying a sparse $2 \times 3$ and sparse $3 \times 2$ yields several other algorithms. **Schönhage's decomposition paradigm.** Coppersmith's algorithm follows a specific paradigm introduced by Schönhage [Sch81] which reduces arbitrary matrix products to slightly larger matrix products with "structured nonzeroes." The general paradigm has the following form. Suppose we wish to multiply two matrices A'' and B''. - 1. First we *preprocess* A'' and B'' in some efficient way, decomposing A'' and B'' into structured matrices A, A', B, B' so that $A'' \cdot B'' = A' \cdot A \cdot B \cdot B'$ . (Note, the dimensions of $A' \cdot A$ may differ from A'', and similarly for $B' \cdot B$ and B''.) The matrices A and B are sparse "partial" matrices directly based on A'' and B'', but they have larger dimensions, and only contain nonzeroes in certain structured parts. 
The matrices A' and B' are very simple and explicit matrices of scalar constants, chosen independently of A'' and B''. (In particular, A' and B' are Vandermonde-style matrices.)
- 2. Next, we apply a specialized constant-sized matrix multiplication algorithm in a recursive manner, to multiply the structured A and B essentially optimally. Recall that Strassen's famous matrix multiplication algorithm has an analogous form: it starts with a seven-multiplication product for $2 \times 2 \times 2$ matrix multiplication, and recursively applies this to obtain a general algorithm for $2^M \times 2^M \times 2^M$ matrix multiplication. Here, we will use an *optimal* algorithm for multiplying constant-sized matrices with zeroes in some of the entries; when this algorithm is recursively applied, it can multiply sparse A and B with nonzeroes in certain structured locations.
- 3. Finally, we *postprocess* the resulting product C to obtain our desired product $A'' \cdot B''$, by computing $A' \cdot C \cdot B'$. Using the simple structure of A' and B', the matrix products $D := A' \cdot C$ and $D \cdot B'$ can be performed very efficiently.

Our aim is to verify that each step of this process can be efficiently computed, for Coppersmith's full matrix multiplication algorithm.

#### A.2 The algorithm

The construction of Coppersmith begins by taking input matrices A'' of dimensions $2^{4M/5} \times \binom{M}{4M/5} 2^{4M/5}$ and B'' of dimensions $\binom{M}{4M/5} 2^{4M/5} \times 2^{M/5}$ where $M \approx \log N$, and obtains an $O(5^M \operatorname{poly}(M))$ algorithm for their multiplication. Later, he symmetrizes the construction to get an $N \times N \times N^{\alpha}$ matrix multiply. We will give this starting construction and show how standard techniques can be used to obtain an $N \times N^{\alpha} \times N$ matrix multiply from his basic construction.
The multiplication of A'' and B'' will be derived from an algorithm which computes the product of $2 \times 3$ and $3 \times 2$ matrices with zeroes in some entries. In particular the matrices have the form:

$$\left(\begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \end{array}\right), \quad \left(\begin{array}{cc} b_{11} & b_{12} \\ b_{21} & 0 \\ b_{31} & 0 \end{array}\right),$$

and the algorithm is given by the trilinear form

$$\begin{aligned} &(a_{11} + x^{2}a_{12})(b_{21} + x^{2}b_{11})(c_{11}) + (a_{11} + x^{2}a_{13})(b_{31})(c_{11} - xc_{21}) + (a_{11} + x^{2}a_{22})(b_{21} - xb_{12})(c_{12}) \\ &\quad + (a_{11} + x^{2}a_{23})(b_{31} + xb_{12})(c_{12} + xc_{21}) - (a_{11})(b_{21} + b_{31})(c_{11} + c_{12}) \\ &= x^{2}(a_{11}b_{11}c_{11} + a_{11}b_{12}c_{21} + a_{12}b_{21}c_{11} + a_{13}b_{31}c_{11} + a_{22}b_{21}c_{12} + a_{23}b_{31}c_{12}) + x^{3} \cdot P(a, b, c, x). \end{aligned} \tag{3}$$

That is, by performing the five products of the linear forms of $a_{ij}$ and $b_{k\ell}$ on the LHS, and using the $c_{ij}$ to determine how to add and subtract these products to obtain the output $2 \times 2$ matrix, we obtain a polynomial in each matrix entry whose $x^2$ coefficients yield the final matrix product $c_{ij}$.

When the algorithm given by (3) is applied recursively to $2^M \times 3^M$ and $3^M \times 2^M$ matrices (analogously to how Strassen's algorithm is applied to do $2^M \times 2^M \times 2^M$ matrix multiply), we obtain an algorithm that can multiply matrices A and B with dimensions $2^M \times 3^M$ and $3^M \times 2^M$, respectively, where A has $O(5^M)$ nonzeroes, B has $O(4^M)$ nonzeroes, and these nonzeroes appear in a highly regular pattern (which can be easily deduced). This recursive application of (3) will result in polynomials in x of degree O(M), and additions and multiplications on such polynomials increase the overall time by an $M \cdot \text{poly}(\log M)$ factor.
Therefore we can multiply these A and B with structured nonzeroes in $O(5^M \cdot \text{poly}(M))$ field operations. The decomposition of A'' and B'' is performed as follows. We choose A' and B' to have dimensions $2^{4M/5} \times 2^M$ and $2^M \times 2^{M/5}$ , respectively, and such that all $2^{4M/5} \times 2^{4M/5}$ submatrices of A' and $2^{M/5} \times 2^{M/5}$ submatrices of B' are non-singular. Following Schönhage, we pick A' and B' to be rectangular Vandermonde matrices: the i, j entry of A' is $(\alpha_j)^{i-1}$ , where $\alpha_1, \alpha_2, \ldots$ are distinct elements of the field; B' is defined analogously. Such matrices have three major advantages: (1) they can be succinctly described (with $O(2^M)$ field elements), (2) multiplying these matrices with arbitrary vectors can be done extremely efficiently, and (3) inverting an arbitrary square submatrix can be done extremely efficiently. More precisely, $n \times n$ Vandermonde matrices can be multiplied with arbitrary n-vectors in $O(n \cdot \operatorname{poly}(\log n))$ operations, and computing the inverse of an $n \times n$ Vandermonde matrix can be done in $O(n \cdot \operatorname{poly}(\log n))$ operations (for references, see [CKY89, BP94]). In general, operations on Vandermonde matrices, their transposes, their inverses, and the transposes of inverses can be reduced to fast multipoint computations on univariate polynomials. For example, multiplying an $n \times n$ Vandermonde matrix with a vector is equivalent to evaluating a polynomial (with coefficients given by the vector) on the n elements that comprise the Vandermonde matrix, which takes $O(n \log n)$ operations. This translates to $O(n \cdot \operatorname{poly}(\log n))$ arithmetic operations. 
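As a small illustration of the equivalence just described, the following Python sketch checks that multiplying a coefficient vector by a Vandermonde matrix gives the same values as evaluating the corresponding polynomial at the points defining the matrix. The helper names (`vandermonde`, `vec_times_matrix`, `horner`) and the sample points are assumptions made for illustration, not part of Coppersmith's construction. The sketch works over the integers rather than a finite field and uses a quadratic-time product, so it demonstrates only the equivalence, not the fast multipoint evaluation.

```python
def vandermonde(alphas, rows):
    # Entry (i, j) is alphas[j] ** i, matching the text's (alpha_j)^(i-1)
    # once rows are counted from 0 instead of 1.
    return [[a ** i for a in alphas] for i in range(rows)]


def vec_times_matrix(c, V):
    # Row vector c times matrix V (plain quadratic-time product).
    return [sum(c[i] * V[i][j] for i in range(len(c))) for j in range(len(V[0]))]


def horner(c, x):
    # Evaluate p(x) = c[0] + c[1]*x + c[2]*x^2 + ... by Horner's rule.
    acc = 0
    for coeff in reversed(c):
        acc = acc * x + coeff
    return acc


alphas = [2, 3, 5, 7]          # distinct evaluation points (hypothetical choice)
coeffs = [1, 0, 4, 2]          # p(x) = 1 + 4x^2 + 2x^3
V = vandermonde(alphas, len(coeffs))

via_matrix = vec_times_matrix(coeffs, V)
via_poly = [horner(coeffs, a) for a in alphas]
assert via_matrix == via_poly
```

With the convention that rows index powers and columns index points, it is the row-vector product (equivalently, the transpose applied to a column vector) that corresponds to evaluation.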
The matrices A and B have dimensions $2^M \times 3^M$ and $3^M \times 2^M$, respectively, where A has only $O(5^M)$ nonzeroes, B has only $O(4^M)$ nonzeroes, and there is an optimal algorithm for multiplying $2 \times 3$ matrices (with 5 nonzeroes) and $3 \times 2$ matrices (with 4 nonzeroes) that can be recursively applied to multiply A and B optimally, in $O(5^M \cdot \text{poly}(M))$ operations. Matrices A and B are constructed as follows: take any one-to-one mapping between the $\binom{M}{4M/5} 2^{4M/5}$ columns of the input A'' and columns of the sparse A with exactly $2^{4M/5}$ nonzeroes. For these columns q of A with $2^{4M/5}$ nonzeroes, we compute the inverse $A_q^{-1}$ of the $2^{4M/5} \times 2^{4M/5}$ minor $A_q$ of A' with rows corresponding to the nonzeroes in the column, and multiply $A_q^{-1}$ with column q (in $2^{4M/5} \cdot \text{poly}(M)$ time). After these columns are processed, the rest of A is zeroed out. Then, there is a one-to-one correspondence between columns of A'' and nonzero columns of $A' \cdot A$. Performing a symmetric procedure for B'' (with the same mapping on rows instead of columns), we can decompose it into B and B' such that there is a one-to-one correspondence between rows of B'' and nonzero rows of $B \cdot B'$. It follows that this decomposition takes only $O\left(\binom{M}{4M/5} 2^{4M/5} \cdot 2^{4M/5} \cdot \text{poly}(M)\right)$ time. Since $5^M \approx \binom{M}{4M/5}4^{4M/5}$ (within poly(M) factors), this quantity is upper bounded by $5^M \cdot \text{poly}(M)$.

After A and B are constructed, the constant-sized algorithm for $2 \times 3$ and $3 \times 2$ matrices mentioned above can be applied in the usual recursive way to multiply the sparse A and B in $O(5^M \cdot \operatorname{poly}(M))$ operations; call the resulting matrix Z. Because A' and B' are Vandermonde, the product $A' \cdot Z \cdot B'$ can be computed in $O(5^M \cdot \operatorname{poly}(M))$ operations.
Hence we have an algorithm for multiplying matrices of dimensions $2^{4M/5} \times \binom{M}{4M/5} 2^{4M/5}$ and $\binom{M}{4M/5} 2^{4M/5} \times 2^{M/5}$ that is explicit and takes $5^M \cdot \operatorname{poly}(M)$ operations. Call the above algorithm ALGORITHM 1. Observe ALGORITHM 1 also works when the entries of A'' and B'' are themselves matrices over the field. (The running time will surely increase in proportion to the sizes of the underlying matrices, but the bound on the number of *operations on the entries* remains the same.) Up to this point, we have simulated Coppersmith's construction completely, and have simply highlighted its efficiency. By exploiting the symmetries of matrix multiplication algorithms in a standard way, we can extract more algorithms from the construction. The trace identity tells us that $$tr(ABC) = tr(BCA),$$ implying that the expression (3) can also be used to partially multiply a $3^M \times 2^M$ matrix B with at most $4^M$ structured nonzeroes and "full" $2^M \times 2^M$ matrix C in $5^M \cdot \text{poly}(M)$ operations, obtaining a $3^M \times 2^M$ matrix $A^T$ with at most $5^M$ nonzeroes. In our ALGORITHM 1, we have a decomposition of A and B; in terms of the trace, we can derive: $$tr(A''B''\cdot C'') = tr(A'A\cdot BB'\cdot C'') = tr(B\cdot B'C''A'\cdot A).$$ This can be applied to obtain an algorithm for $\binom{M}{4M/5} 2^{4M/5} \times 2^{M/5} \times 2^{4M/5}$ matrix multiplication, as follows. Given input matrices B'' and C'' of the respective dimensions, decompose B'' into a $3^M \times 2^M$ B with $O(4^M)$ nonzeroes and $2^M \times 2^{M/5}$ Vandermonde B', as described above. Letting A' be a Vandermonde $2^{4M/5} \times 2^M$ matrix, compute the matrix $C := B' \cdot C'' \cdot A'$ in at most $4^M \cdot \operatorname{poly}(M)$ operations. Noting that C is $2^M \times 2^M$ , we can then multiply B and C in $5^M \cdot \operatorname{poly}(M)$ operations. 
This results in a $3^M \times 2^M$ matrix $A^T$ with at most $5^M$ nonzeroes. The final output A'' is obtained by using the one-to-one mapping to extract the appropriate $\binom{M}{4M/5} 2^{4M/5}$ rows from $A^T$, and multiplying each such row by the appropriate inverse minor of A' (corresponding to the nonzeroes of that row). This takes at most $\binom{M}{4M/5} 2^{4M/5} \cdot 2^{4M/5} \cdot \operatorname{poly}(M) \leq 5^M \cdot \operatorname{poly}(M)$ operations. Call this ALGORITHM 2.

From ALGORITHM 2 we immediately obtain an algorithm for $2^{4M/5} \times 2^{M/5} \times \binom{M}{4M/5} 2^{4M/5}$ matrix multiplication as well: given input matrices $(C'')^T$ and $(B'')^T$ of the respective dimensions, simply compute $B'' \cdot C''$ using ALGORITHM 2, and output the transpose of the answer. Call this ALGORITHM 3.

Finally, by "tensoring" ALGORITHM 2 with ALGORITHM 3, we derive an algorithm for matrix multiplication with dimensions

$$\binom{M}{4M/5} 2^{4M/5} \cdot 2^{4M/5} \times 2^{2M/5} \times \binom{M}{4M/5} 2^{4M/5} \cdot 2^{4M/5} \ge 5^M/M \times 4^{M/5} \times 5^M/M.$$

That is, we divide the two input matrices of large dimensions into blocks of $2^{4M/5} \times 2^{M/5}$ and $2^{M/5} \times \binom{M}{4M/5} 2^{4M/5}$ dimensions, respectively. We execute ALGORITHM 2 on the blocks, and call ALGORITHM 3 when the product of two blocks is needed.

As both ALGORITHM 2 and ALGORITHM 3 are explicit and efficient, their "tensorization" inherits these properties. ALGORITHM 2 uses $5^M \cdot \text{poly}(M)$ operations, and each operation can take up to $5^M \cdot \text{poly}(M)$ time (due to calls to ALGORITHM 3). Therefore, we can perform a $5^M \times 4^{M/5} \times 5^M$ matrix multiply over fields with $2^{\text{poly}(M)}$ elements, in $5^{2M} \cdot \text{poly}(M)$ time. Setting $M = \log(n)/\log(5)$ (so that $n = 5^M$), the algorithm runs in $n^2 \cdot \text{poly}(\log n)$ time for fields with $2^{\text{poly}(\log n)}$ elements.
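The arithmetic behind the final parameter setting can be checked numerically. The sketch below (function names are hypothetical, chosen for illustration) computes the exponent of the inner dimension when the outer dimension is $n = 5^M$, and checks that $\binom{M}{4M/5} 4^{4M/5}$ is within a factor of $M+1$ of $5^M$, which is the sense of "within poly(M) factors" used in the decomposition step. It assumes M is a multiple of 5 and Python 3.8 or later (for `math.comb`).

```python
import math


def inner_dimension_exponent():
    # With n = 5**M, the inner dimension 4**(M/5) equals n**(log_5(4) / 5).
    return math.log(4, 5) / 5


def dominant_term_ratio(M):
    # Ratio of binom(M, 4M/5) * 4**(4M/5) to 5**M, for M a multiple of 5.
    # By the binomial theorem, 5**M is the sum over k of binom(M, k) * 4**k,
    # and k = 4M/5 gives the largest of the M + 1 terms.
    k = 4 * M // 5
    return math.comb(M, k) * 4 ** k / 5 ** M


alpha = inner_dimension_exponent()
assert 0.172 < alpha < 0.173

for M in (5, 50, 500):
    ratio = dominant_term_ratio(M)
    # The largest of M + 1 nonnegative terms is at least their average.
    assert 1 / (M + 1) <= ratio <= 1
```

The first check recovers the constant 0.172 appearing in Theorem A.1 and Lemma 2.3; the second confirms that replacing $5^M$ by $\binom{M}{4M/5} 4^{4M/5}$ loses at most a factor polynomial in M.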