Appendix A
This appendix is meant to be a concise, self-contained guide to the family of constraint-satisfaction algorithms to which RRR belongs: the bare minimum of background needed to start using these methods for the training of networks. For an excellent and much more comprehensive review, see the article by Lindstrom and Sims [14].
A.1 A reinterpretation of ‘convergence’
Some of the optimization problems that arise in machine learning are known to be hard in a technical sense. For such problems, and problems of a similar nature whose difficulty status is unknown, how should algorithm ‘convergence’ be interpreted?
Nonnegative matrix factorization (NMF) is known to be NP-hard (Vavasis [20]), a property that can be appreciated already in the easiest nontrivial Euclidean-distance-matrix instance of Sect. 4.4.1, here without the normalization factor of Eq. (25):
$$\begin{aligned} \begin{pmatrix} 0 & 1 & 4 & 9 & 16 & 25 \\ 1 & 0 & 1 & 4 & 9 & 16 \\ 4 & 1 & 0 & 1 & 4 & 9 \\ 9 & 4 & 1 & 0 & 1 & 4 \\ 16 & 9 & 4 & 1 & 0 & 1 \\ 25 & 16 & 9 & 4 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 5 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 3 & 1 \\ 0 & 0 & 1 & 5 & 0 \end{pmatrix} \begin{pmatrix} 4 & 1 & 0 & 0 & 1 & 4 \\ 0 & 0 & 0 & 1 & 3 & 5 \\ 0 & 1 & 4 & 4 & 1 & 0 \\ 5 & 3 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 1 \end{pmatrix} \end{aligned}$$
The nonnegative \(6\times 6\) matrix on the left is thereby shown to have a rank-5 nonnegative factorization, the smallest possible. Just the subproblem of producing the zero diagonal elements on the left, from inner products of nonnegative vectors on the right, already has a combinatorial flavor. Finding suitable patterns of zero elements for the factors involves a ‘search’ in the same sense as finding the factors of a large integer, or a satisfying assignment of a complex logical formula. The term ‘convergence’ is normally not used for these tasks, nor should it be used in the machine-learning applications we consider, all of which are hard in some sense.
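As a quick numerical sanity check (our own NumPy sketch, not part of the original text), the factorization above can be verified directly. Note that the ordinary rank of this distance matrix is only 3, so its nonnegative rank of 5 is strictly larger:

```python
import numpy as np

# Squared distances between the points 0, 1, ..., 5 on a line.
M = np.array([[ 0,  1,  4,  9, 16, 25],
              [ 1,  0,  1,  4,  9, 16],
              [ 4,  1,  0,  1,  4,  9],
              [ 9,  4,  1,  0,  1,  4],
              [16,  9,  4,  1,  0,  1],
              [25, 16,  9,  4,  1,  0]])

# The two nonnegative factors quoted in the text (6x5 and 5x6).
W = np.array([[0, 5, 1, 0, 0],
              [0, 3, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 3, 1],
              [0, 0, 1, 5, 0]])
H = np.array([[4, 1, 0, 0, 1, 4],
              [0, 0, 0, 1, 3, 5],
              [0, 1, 4, 4, 1, 0],
              [5, 3, 1, 0, 0, 0],
              [1, 0, 1, 1, 0, 1]])

assert (W @ H == M).all()               # exact rank-5 nonnegative factorization
assert np.linalg.matrix_rank(M) == 3    # the ordinary rank is only 3
```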
Still, ‘convergence’ is a much-used term in related research, not just in numerical optimization but also in machine learning, where engineers look for provable performance guarantees. We address these two sides of ‘convergence’ in turn.
The preoccupation with convergence in optimization theory is explained by the fact that algorithms such as Douglas–Rachford are more typically applied to convex problems, for which general convergence results are indeed available (Lindstrom and Sims [14]). In all our applications (of the related RRR algorithm), at least one of the constraint sets A or B is not convex. Not only does this void general convergence results, but we should not hold out hope that such results are forthcoming, since relaxing convexity admits NP-hard problems such as NMF. Special cases of the nonconvex set-intersection problem, such as the sphere and line, have yielded to convergence analysis (Borwein and Sims [5]), including the global case by the construction of a Lyapunov function (Benoist [3]). However, it is unrealistic to expect that this approach will ever succeed for NP-hard applications such as NMF.
Why then have algorithms that exploit convexity found application in hard, nonconvex optimization? We suspect the answer to this question is not deep and can be traced to local behavior: the fixed-point property. In suitably small neighborhoods, even the magnitude and bilinear constraints we use in this work can be approximated as affine. The behavior of these algorithms in this ‘locally convex’ setting is then amenable to analysis and, in the case of RRR, is completely characterized by the two scenarios depicted in Fig. 1. The upshot is that these algorithms are useful simply because they provably terminate when we want them to terminate and, just as usefully, keep going (avoid traps) when a solution is nowhere near.
To address the machine-learning engineer’s convergence concerns we use a different line of argument, beginning with two questions. Is a mediocre solution (for prediction accuracy, etc.) found in a systematic, monotone fashion better than a superior solution that is realized more erratically? Should network architectures (activation and loss functions) be dictated by their gradient properties, even when the data to be learned has a strongly discrete character? Our answer to both of these questions is “no”, and we offer the following historical precedent to support this position.
Poincaré’s observation that even relatively simple dynamical systems are intractable (Holmes [12]) dimmed the prospects of predicting the future of the solar system, or the properties of even the simplest model gas (a box of billiard balls). However, the prevalence of nonintegrable systems (the physics cousins of nonconvex RRR) in the real world did not deter the architects of statistical mechanics (Boltzmann [4]). With the help of a basic hypothesis—ergodicity—even highly chaotic systems could be understood in practical terms (the thermodynamic basis of engines, etc.).
It is still too early to know whether a similar formal approach to nonconvex optimization by iterative algorithms will succeed, logically supported by suitable hypotheses, etc. On the other hand, researchers today have available digital computers for numerical experiments, which Poincaré and his cohort did not. And the technological incentives today, though in a completely different domain of human activity, have a similarly grand scale.
A.2 RRR as relaxed Douglas–Rachford
The RRR algorithm derives its name from the following expression for the update of the search vector x:
$$\begin{aligned} x'=(1-\beta /2)x+(\beta /2) R_{B} \bigl(R_{A}(x) \bigr), \end{aligned}$$
(44)
where
$$\begin{aligned} R_{A}(x)=2P_{A}(x)-x, \qquad R_{B}(x)=2P_{B}(x)-x \end{aligned}$$
are reflections through the sets A and B. The parameter β “relaxes” the “reflect–reflect–average” case (\(\beta =1\)) that the convex optimization literature refers to as the Douglas–Rachford iteration.
The projections, such as to set A,
$$\begin{aligned} P_{A}(x)=\mathop {\operatorname {arg\,min}}_{x'\in A} \Vert x'-x \Vert , \end{aligned}$$
define a unique point when the constraint sets are convex. When A is not convex there are special x for which \(P_{A}(x)\) is not unique. For example, when A is a sphere and x is its center, \(P_{A}(x)\) is the whole sphere. A similar situation arises in the case of the bilinear constraint (5) of neural networks, where again a set of measure zero must be excluded for the projection to be unique. We avoid these complications by adopting a model of computation where all real variables are interpreted as being subject to small random errors. This makes the projections unique with probability 1. The randomness in this “fuzzy” model of computation has a noticeable effect only when projecting points near the troublesome zero-measure sets and has no effect on fixed-point properties when the constraints are suitably formulated.
Rewriting the reflections in (44) in terms of projections,
$$\begin{aligned} x'=x+\beta \bigl(P_{B}\bigl(2 P_{A}(x)-x \bigr) - P_{A}(x) \bigr), \end{aligned}$$
(45)
we see that \(\beta \to 0\) corresponds to the flow interpretation. At a fixed point we have \(x'=x\), and therefore \(P_{B}(2 P_{A}(x)-x) = P_{A}(x)\) must be a solution, as it lies in both A and B. However, the fixed point itself is not in general a solution. The fixed-point/solution relationship and the attractive nature of fixed points are explained in Appendix A.6.
In the case where both A and B are closed and convex and have a nonempty intersection, it is known (see Theorem 26.11 of Bauschke et al. [2]) that RRR converges globally to a feasible point for \(\beta \in (0,2)\), and \(\beta =1\) achieves the fastest convergence rate when both sets are subspaces. However, as explained in Appendix A.1, the setting of β takes on a different role when even one of the sets is not convex. While this parameter still controls local convergence, that phase represents an insignificant fraction of the entire computation. As experiments with bit retrieval have shown (Elser [11]), the setting of β that optimizes the much longer ‘search’ phase of the computation may be quite different from the best choice for fixed-point convergence. Since small β (the flow limit) is best for bit retrieval, Appendix A.6 analyzes local convergence for that case.
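To make the update (45) concrete, here is a toy instance of RRR (our own sketch; the circle/line sets are illustrative choices, echoing the sphere-and-line case of Borwein and Sims cited above):

```python
import numpy as np

def P_A(v):
    """Project onto the unit circle A = {v : ||v|| = 1}."""
    n = np.linalg.norm(v)
    if n == 0.0:                      # the troublesome measure-zero point;
        return np.array([1.0, 0.0])   # any fixed resolution works here
    return v / n

def P_B(v):
    """Project onto the horizontal line B = {(t, 1/2)}."""
    return np.array([v[0], 0.5])

def rrr_step(x, beta):
    """One RRR update (45): x' = x + beta*(P_B(2 P_A(x) - x) - P_A(x))."""
    a = P_A(x)
    return x + beta * (P_B(2 * a - x) - a)

x = np.array([2.0, -1.0])
for _ in range(500):
    x = rrr_step(x, beta=0.5)

# The solution is read off as P_A(x), which lies in both sets.
sol = P_A(x)
assert abs(np.linalg.norm(sol) - 1.0) < 1e-6   # sol is on the circle
assert abs(sol[1] - 0.5) < 1e-6                # sol is on the line
```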
A.3 ADMM with indicator functions
By using the indicator functions of the sets A and B as the two objective functions in the ADMM formalism (Boyd et al. [6]), the ADMM algorithm also provides a way of finding an element (1) in their intersection. One iteration (Boyd et al. [6]) involves a cycle of updates on a triple of variables, \((x,z,u)\to (x',z',u')\):
$$\begin{aligned} &z'=P_{A}(x+u) , \end{aligned}$$
(46a)
$$\begin{aligned} &u'=u+\alpha \bigl(x-z'\bigr) , \end{aligned}$$
(46b)
$$\begin{aligned} &x'=P_{B}\bigl(z'-u'\bigr). \end{aligned}$$
(46c)
We have followed the variable-naming conventions in the ADMM review by Boyd et al. [6] (see their Eqs. 5.2), except in what we define to be the start and end of a cycle. Conventionally the final update is (46b), where the scaled dual variable u is incremented by the difference of the two projections. For showing RRR/ADMM equivalence (see below), the projection to B is the more convenient choice to end the cycle. This difference is irrelevant when interpreting a fixed point, \((x,z,u)=(x',z',u')\). Equation (46b) then implies \(x=z'\), and since \(x=x'\) (at the fixed point) we know that \(x'=z'\in A\cap B\) is a solution to (1). The constant \(\alpha \in (0,2)\) is a relaxation parameter, where \(\alpha <1\) corresponds to under-relaxation. To run ADMM the dual variables u must be initialized in addition to x; a standard choice is \(u=0\). With this initialization and \(\alpha =0\), ADMM reduces to the alternating-projection algorithm. That alternating projections often get stuck (cycling between a pair of proximal points) when ADMM does not shows that \(\alpha \to 0\) is a singular limit.
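The cycle (46a)–(46c) is easy to transcribe. The following sketch (our own, on an illustrative circle/line instance) runs indicator-function ADMM and reads off the solution at the fixed point:

```python
import numpy as np

def P_A(v):
    """Project onto the unit circle."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else np.array([1.0, 0.0])

def P_B(v):
    """Project onto the horizontal line y = 1/2."""
    return np.array([v[0], 0.5])

def admm_cycle(x, z, u, alpha=1.0):
    """One indicator-function ADMM cycle, Eqs. (46a)-(46c)."""
    z = P_A(x + u)               # (46a)
    u = u + alpha * (x - z)      # (46b)
    x = P_B(z - u)               # (46c)
    return x, z, u

x, z, u = np.array([2.0, -1.0]), np.zeros(2), np.zeros(2)
for _ in range(2000):
    x, z, u = admm_cycle(x, z, u)

# At a fixed point x = z' lies in the intersection of A and B.
assert np.linalg.norm(x - z) < 1e-6
assert abs(np.linalg.norm(x) - 1.0) < 1e-6 and abs(x[1] - 0.5) < 1e-6
```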
A.4 General properties
The following general properties distinguish RRR and indicator-function ADMM from other iterative algorithms.

- Problem instances are completely defined by a pair of projections.

- Attractive fixed points encode solutions but, in general, are not themselves solutions.

- The update rule respects Euclidean isometry.
The last property states that, if \(x_{0},x_{1},\ldots \) is a sequence of iterates generated by constraint sets A and B, then for any Euclidean transformation T, the constraint sets \(T(A)\) and \(T(B)\) would generate the sequence \(T(x_{0}),T(x_{1}),\ldots \) This follows from the Euclidean-norm-minimizing property of projections and the fact that the construction of new points from old is “geometric”. For example, the update rule
$$\begin{aligned} x'=x+\beta \bigl(P_{B}\bigl((1+\lambda )P_{A}(x)-\lambda x\bigr)-P_{A}\bigl((1-\lambda )P_{B}(x)+\lambda x\bigr) \bigr) \end{aligned}$$
(47)
generalizes RRR (beyond \(\lambda = 1\)) and also respects Euclidean isometry.
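Euclidean-isometry equivariance can be checked numerically. The sketch below (our own; the two lines and the rotation-plus-translation T are arbitrary choices) runs RRR on constraint sets A, B and on their images T(A), T(B), and confirms the transformed sets generate the transformed trajectory:

```python
import numpy as np

# An arbitrary rotation-plus-translation T (our choice, for illustration).
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
t = np.array([0.3, -1.2])
T = lambda v: R @ v + t

def proj_line(v, p, d):
    """Project v onto the line p + span(d), with d a unit vector."""
    return p + d * np.dot(d, v - p)

# Constraint sets: two lines A and B in the plane.
pA, dA = np.array([0.0, 0.0]), np.array([1.0, 0.0])
pB, dB = np.array([0.0, 1.0]), np.array([np.cos(0.4), np.sin(0.4)])

def rrr(x, PA, PB, beta=0.5, iters=40):
    """Iterate the RRR update (45), recording the trajectory."""
    traj = [x]
    for _ in range(iters):
        a = PA(x)
        x = x + beta * (PB(2*a - x) - a)
        traj.append(x)
    return traj

traj  = rrr(np.array([2.0, 3.0]),
            lambda v: proj_line(v, pA, dA),
            lambda v: proj_line(v, pB, dB))
trajT = rrr(T(np.array([2.0, 3.0])),
            lambda v: proj_line(v, T(pA), R @ dA),
            lambda v: proj_line(v, T(pB), R @ dB))

# The transformed sets generate the sequence T(x0), T(x1), ...
for v, vT in zip(traj, trajT):
    assert np.linalg.norm(T(v) - vT) < 1e-8
```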
A.5 Unrelaxed ADMM/RRR equivalence
RRR with \(\beta =1\) is equivalent to indicator-function ADMM with \(\alpha =1\). To see this, define a shifted x for ADMM by \(\tilde{x}=x+u\), and use the update rules (46a)–(46c) to determine \(\tilde{x}'=x'+u'\). By (46a) we have \(z'=P_{A}(\tilde{x})\), and from (46b) (with \(\alpha =1\))
$$\begin{aligned} u'&=(\tilde{x}-x)+\bigl(x-z'\bigr), \\ &=\tilde{x}-P_{A}(\tilde{x}). \end{aligned}$$
Finally, using (46c)
$$\begin{aligned} \tilde{x}'&=x'+u' \\ &=P_{B} \bigl(P_{A}(\tilde{x})+P_{A}(\tilde{x})-\tilde{x} \bigr)+ \bigl(\tilde{x}-P_{A}(\tilde{x}) \bigr) \\ &=\tilde{x}+P_{B} \bigl(2 P_{A}(\tilde{x})-\tilde{x} \bigr)-P_{A}(\tilde{x}), \end{aligned}$$
we see that the shifted x of ADMM has the same update rule as RRR with \(\beta =1\).
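The derivation can be confirmed numerically: starting ADMM (α=1) from (x,u) and RRR (β=1) from x̃=x+u, the two trajectories coincide. A sketch (our own toy circle/line instance):

```python
import numpy as np

def P_A(v):
    """Project onto the unit circle."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else np.array([1.0, 0.0])

def P_B(v):
    """Project onto the horizontal line y = 1/2."""
    return np.array([v[0], 0.5])

def rrr_step(xt):
    """RRR with beta = 1 (the Douglas-Rachford case)."""
    a = P_A(xt)
    return xt + P_B(2 * a - xt) - a

def admm_cycle(x, u):
    """Indicator-function ADMM with alpha = 1."""
    z = P_A(x + u)        # (46a)
    u = u + (x - z)       # (46b)
    x = P_B(z - u)        # (46c)
    return x, u

x, u = np.array([2.0, -1.0]), np.zeros(2)
xt = x + u                 # the shifted ADMM variable
for _ in range(50):
    x, u = admm_cycle(x, u)
    xt = rrr_step(xt)
    assert np.linalg.norm((x + u) - xt) < 1e-9   # trajectories coincide
```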
A.6 Local convergence of RRR
We consider sets A and B that are subsets of \(\mathbb{R}^{n}\) for some n. Further, let \(a\in A\) and \(b\in B\) be mutually proximal points, and suppose \(\Vert a-b\Vert \) is zero or sufficiently small that in a suitable neighborhood \(U\subset \mathbb{R}^{n}\) the sets A and B may be approximated as flats,
$$\begin{aligned} &A\approx \overline{A}+a, \\ &B\approx \overline{B}+b, \end{aligned}$$
where A̅ and B̅ are subspaces. For the local analysis that follows, we replace U by \(\mathbb{R}^{n}\) and consider the orthogonal decomposition \(U=Z_{\perp }\oplus Z\), where \(Z_{\perp }=\overline{A}+\overline{B}\) is the span of the two subspaces. Also decompose \(Z_{\perp }\) orthogonally, as \(Z_{\perp }=X\oplus Y\), where \(Y=\overline{A}\cap \overline{B}\). The two subspaces now orthogonally decompose as \(\overline{A}=C\oplus Y\) and \(\overline{B}=D\oplus Y\), where C and D are linearly independent subspaces of X, that is, \(C\cap D=\{0\}\). In the orthogonal decomposition \(U=X\oplus Y\oplus Z\), we can write down the most general pair of proximal points as
$$\begin{aligned} &a=(0,y,a_{z}), \end{aligned}$$
(48a)
$$\begin{aligned} &b=(0,y,b_{z}), \end{aligned}$$
(48b)
where \(y\in Y\) is arbitrary and \(a_{z},b_{z}\in Z\) are fixed by the two flats.
Projections from a general point \((x,y,z)\in U\) have the following formulas:
$$\begin{aligned} &P_{A}(x,y,z)=\bigl(P_{C}(x),y,a_{z}\bigr), \end{aligned}$$
(49a)
$$\begin{aligned} &P_{B}(x,y,z)=\bigl(P_{D}(x),y,b_{z}\bigr), \end{aligned}$$
(49b)
where \(P_{C}\) and \(P_{D}\) are the linear projections to the subspaces C and D. The \(\beta \to 0\) flow, now for the generalized RRR update (47), takes the following form:
$$\begin{aligned} &\dot{x}= \bigl((1+\lambda )P_{D} P_{C}-(1-\lambda )P_{C} P_{D}- \lambda (P_{D}+P_{C}) \bigr) (x) , \\ &\dot{y}=0 , \\ &\dot{z}=b_{z}-a_{z}. \end{aligned}$$
(50)
We have fixedpoint behavior only for \(b_{z}=a_{z}\), when the proximal points coincide. From (48a)–(48b) the space of solutions, or \(a=b\), is parameterized by \(y\in Y\). However, for each such solution point the flow is free to choose any \(z\in Z\) for its fixed point. To establish convergence to any of these fixed points we need to check that \(x\to 0\) under the RRR flow. The same check applies in the infeasible case, \(b_{z}\ne a_{z}\), since by (49a)–(49b) we see that \(x\to 0\) ensures the projections \(P_{A}\) and \(P_{B}\) converge to the two proximal points (48a)–(48b). To prove this result we need the following lemma.
Lemma A.1
If \(C \oplus C_{\perp }\) and \(D \oplus D_{\perp }\) are two orthogonal decompositions of X, where \(C+D=X\), then \(C_{\perp }\cap D_{\perp }=\{0\}\).
Proof
From
$$\begin{aligned} &C_{\perp }=\bigl\{ x\in X\colon u^{T} x=0, \forall u\in C\bigr\} , \\ &D_{\perp }=\bigl\{ x\in X\colon v^{T} x=0, \forall v\in D\bigr\} , \end{aligned}$$
it follows that, if \(x^{*}\in C_{\perp }\cap D_{\perp }\), then
$$\begin{aligned} (u+v)^{T} x^{*} =0,\quad \forall u\in C, v\in D. \end{aligned}$$
But this can only be true if \(x^{*}=0\) since
$$\begin{aligned} X=\{u+v\colon u\in C, v\in D\}. \end{aligned}$$
□
Theorem A.2
The distance \(\Vert x\Vert \) from the space of fixed points in the local RRR flow, for the generalized form (47), is strictly decreasing for \(x\ne 0\) and \(\lambda >0\).
Proof
Using the flow equation (50) in the time derivative of the squared distance,
$$\begin{aligned} \frac{d}{dt} \Vert x\Vert ^{2}=2 x^{T}\dot{x}, \end{aligned}$$
and the symmetry of projections under transpose,
$$\begin{aligned} x^{T}(P_{D} P_{C}-P_{C} P_{D})x=0, \end{aligned}$$
we obtain
$$\begin{aligned} \frac{d}{dt} \Vert x\Vert ^{2}=-2\lambda Q(x), \end{aligned}$$
where the result follows if we can show
$$\begin{aligned} Q(x)=x^{T} (P_{C}+P_{D}-P_{C} P_{D}-P_{D} P_{C})x \end{aligned}$$
is a positive definite quadratic form. From the idempotency of projections we have the identity
$$\begin{aligned} P_{C}+P_{D}-P_{C} P_{D}-P_{D} P_{C}&=P_{C}(\mathrm{Id}-P_{D})P_{C}+( \mathrm{Id}-P_{C})P_{D}(\mathrm{Id}-P_{C}) \\ &=P_{C} P_{D_{\perp }}P_{C}+P_{C_{\perp }}P_{D} P_{C_{\perp }}, \end{aligned}$$
where the last line is expressed in terms of projections to the orthogonal complements of C and D in X. Using this identity, the quadratic form can be expressed as a sum of squares:
$$\begin{aligned} Q(x)= \Vert P_{D_{\perp }}P_{C} x \Vert ^{2}+ \Vert P_{D} P_{C_{\perp }} x \Vert ^{2}. \end{aligned}$$
To show that Q has no nontrivial null vector \(x^{*}\), let \(u=P_{C} x^{*}\), so \(u\in C\). For the first square to vanish we must have \(u\in D\), and therefore \(u\in C\cap D=\{0\}\). From \(P_{C} x^{*}=u=0\) we then have \(x^{*}\in C_{\perp }\). Since now \(P_{C_{\perp }} x^{*}=x^{*}\), for the second square to vanish we must have \(x^{*}\in D_{\perp }\). Thus both squares vanish if and only if \(x^{*}\in C_{\perp }\cap D_{\perp }\) which, by the lemma, implies \(x^{*}=0\). □
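Both the idempotency identity and the positive definiteness of Q can be spot-checked with random projectors (our own sketch; the dimensions are arbitrary, chosen so that generically C+D=X and C∩D={0}):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

def proj(M):
    """Orthogonal projector onto the column space of M."""
    Q, _ = np.linalg.qr(M)
    return Q @ Q.T

# Random subspaces C (dim 2) and D (dim 4); generically C + D = R^n.
PC = proj(rng.standard_normal((n, 2)))
PD = proj(rng.standard_normal((n, 4)))
I = np.eye(n)

Q1 = PC + PD - PC @ PD - PD @ PC
Q2 = PC @ (I - PD) @ PC + (I - PC) @ PD @ (I - PC)
assert np.allclose(Q1, Q2)           # the idempotency identity

# Positive definite: all eigenvalues of the symmetric form are > 0.
eigs = np.linalg.eigvalsh((Q1 + Q1.T) / 2)
assert eigs.min() > 1e-8
```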
Appendix B
B.1 Bilinear constraint
The central constraint in our training method is applied at each neuron and involves the vector x of its inputs (from other neurons), the corresponding vector of weights w, and the inner product y of these vectors. In NMF y is set by the data and is not a variable, while in deep networks y is the neuron’s preactivation variable. The projection to the constraint for both of these cases can be treated in a unified way. Most generally we seek the map \((x,w,y)\to (x',w',y')\) that minimizes
$$\begin{aligned} \Vert x'-x \Vert ^{2}+ \Vert w'-w \Vert ^{2}+\gamma ^{2}\bigl(y'-y\bigr)^{2} \end{aligned}$$
subject to the constraint
$$\begin{aligned} x'\cdot w'=\omega y'. \end{aligned}$$
(51)
By taking the limit \(\gamma \to \infty \), the deep network case reduces to the simple bilinear constraint of NMF where y is not a variable and is set by the data (we also set \(\omega =1\) to be consistent with our conventions for that case). Introducing a scalar Lagrange multiplier variable u to impose (51), we obtain the following system of linear equations:
$$\begin{aligned} &0=x'-x-u w', \end{aligned}$$
(52a)
$$\begin{aligned} &0=w'-w-u x', \end{aligned}$$
(52b)
$$\begin{aligned} &0=y'-y+u \omega /\gamma ^{2}, \end{aligned}$$
(52c)
with solution (19a)–(19b) for \(x'\) and \(w'\) in the simple case and augmented by
$$\begin{aligned} y'=y-u \omega /\gamma ^{2} \end{aligned}$$
for deep networks.
Imposing (51) on the solution, we obtain the following equation for u:
$$\begin{aligned} 0=\frac{p(1+u^{2})+q u}{(1-u^{2})^{2}}-\omega y+(\omega /\gamma )^{2}u=h_{0}(u), \end{aligned}$$
(53)
where p and q are the scalars in (17). From (52a)–(52c) we note that \(u=0\) corresponds to the identity case of the projection and, as we will see in the following lemma, there is a unique solution for \(u\in (-1,1)\). We therefore let this u define the projection, as it is the solution connected to the identity.
The uniqueness of the solution for u and our method for computing it are based on the following lemma.
Lemma B.1
In the domain \(u\in (-1,1)\), and with parameters satisfying (18), the function \(h_{0}(u)\) is strictly increasing and has a unique zero \(u_{0}\) and a unique point of inflection \(u_{2}\).
Proof
In addition to \(h_{0}\), we will need its first (\(h_{1}\)), second (\(h_{2}\)), and third (\(h_{3}\)) derivatives:
$$\begin{aligned} &h_{1}(u)=\frac{q(1+3 u^{2})+2p u(3+u^{2})}{(1-u^{2})^{3}}+(\omega / \gamma )^{2}, \\ &h_{2}(u)=3 \frac{q u(4+4u^{2})+2p(1+6 u^{2}+u^{4})}{(1-u^{2})^{4}}, \\ &h_{3}(u)=12 \frac{q(1+10 u^{2}+5 u^{4})+2p u(5+10u^{2}+u^{4})}{(1-u^{2})^{5}}. \end{aligned}$$
Let \(t_{0}(u)=p(1+u^{2})+q u\) be the numerator of the first term in \(h_{0}(u)\). Using (18), \(t_{0}(-1)=-(q-2p)<0\) and \(t_{0}(1)=q+2p>0\), so that \(\lim_{u\to \pm 1}h_{0}(u)=\pm \infty \). Since \(h_{0}(u)\) is continuous, its range is \((-\infty,+\infty )\). Using the same arguments we can show that exactly the same conclusion applies to \(h_{2}(u)\).
Now consider the numerator \(t_{1}(u)\) of the first term in \(h_{1}(u)\). Again using (18), and \(\vert u\vert < 1\), we have \(2pu>-q \vert u\vert \) and the bound,
$$\begin{aligned} t_{1}(u)>q+3 q u^{2}-q \vert u\vert \bigl(3+u^{2} \bigr)=q\bigl(1- \vert u\vert \bigr)^{3}. \end{aligned}$$
This implies
$$\begin{aligned} h_{1}(u)>\frac{q}{(1+\vert u\vert )^{3}}+(\omega /\gamma )^{2}>0, \end{aligned}$$
and therefore \(h_{0}(u)\) is strictly increasing. We always have a zero \(u_{0}\in (-1,1)\) because \(h_{0}(u)\) has range \((-\infty,+\infty )\). By the same argument we find that
$$\begin{aligned} h_{3}(u)>\frac{12 q}{(1+\vert u\vert )^{5}}>0, \end{aligned}$$
so that \(h_{2}(u)\) has a unique zero, giving \(h_{0}(u)\) a unique inflection point \(u_{2}\in (-1,1)\). □
By Lemma B.1, there is a unique root \(u_{0}\) of \(h_{0}(u)\), and therefore a unique projection, whenever (18) holds. The other properties of \(h_{0}(u)\) motivate the following two-mode algorithm for finding \(u_{0}\).
Start with \(u_{a}=0\) as the “active bound” on \(u_{0}\); this will be the base point for a Newton iteration. Depending on the sign of \(h_{0}(u_{a})\), \(u_{b}=\pm 1\) will be the initial “bracketing bound” on \(u_{0}\). From the Newton update
$$\begin{aligned} u'=u_{a}\frac{h_{0}(u_{a})}{h_{1}(u_{a})}, \end{aligned}$$
(54)
we take one of two possible actions. If \(u'\) is in the interval bracketed by \(u_{b}\), we set \(u'_{a}=u'\) and reset the bracketing bound to \(u_{b}'=u_{a}\) if the sign of \(h_{0}(u_{a}')\) has changed (keeping \(u_{b}'=u_{b}\) otherwise). If, on the other hand, \(u'\) falls outside the interval bracketed by \(u_{b}\), the new active bound is obtained by bisection, \(u_{a}'=(u_{a}+u_{b})/2\), and \(u_{b}'\) is set to either of the previous bounds, \(u_{a}\) or \(u_{b}\), depending on the sign of \(h_{0}(u_{a}')\).
Whether it takes a Newton step or a bisection step, each iteration shrinks the interval bracketing \(u_{0}\). Since by the lemma the inflection point \(u_{2}\) is unique, eventually \(h_{2}\) has the same sign at both endpoints of the interval. The function \(h_{0}\) is then convex or concave on the interval, all subsequent iterations take the Newton step, and the iterates converge quadratically to the root \(u_{0}\). The case \(u_{0}=u_{2}\) presents an exception, but convergence by bisection steps will still be linear.
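A sketch of this two-mode root finder (our own transcription; the sample inputs are arbitrary, chosen to satisfy (18) in the form q > 2|p|, and we take the scalars of (17) to be p = x·w and q = ‖x‖²+‖w‖², consistent with solving (52a)–(52b)). After finding u₀, the projected triple is formed and the bilinear constraint (51) verified:

```python
import numpy as np

def solve_u(p, q, omega, y, gamma, tol=1e-12, max_iter=200):
    """Unique root u0 of h0, Eq. (53), in (-1,1), by safeguarded Newton."""
    h0 = lambda u: (p*(1+u*u) + q*u)/(1-u*u)**2 - omega*y + (omega/gamma)**2*u
    h1 = lambda u: (q*(1+3*u*u) + 2*p*u*(3+u*u))/(1-u*u)**3 + (omega/gamma)**2
    ua = 0.0                            # active bound (Newton base point)
    ub = 1.0 if h0(ua) < 0 else -1.0    # initial bracketing bound
    for _ in range(max_iter):
        un = ua - h0(ua)/h1(ua)         # Newton step (54)
        if not (min(ua, ub) < un < max(ua, ub)):
            un = (ua + ub)/2            # fall back to bisection
        # keep the root bracketed: flip the bracketing bound on a sign change
        ub = ua if h0(un)*h0(ua) < 0 else ub
        ua = un
        if abs(h0(ua)) < tol:
            break
    return ua

# Illustrative inputs (our choice) with q > 2|p|.
x = np.array([1.0, 2.0])
w = np.array([0.5, -1.0])
omega, y, gamma = 1.0, 2.0, 5.0
p, q = x @ w, x @ x + w @ w
u0 = solve_u(p, q, omega, y, gamma)

# Projected triple from (52a)-(52c); check the bilinear constraint (51).
xp = (x + u0*w)/(1 - u0*u0)
wp = (w + u0*x)/(1 - u0*u0)
yp = y - u0*omega/gamma**2
assert -1 < u0 < 1
assert abs(xp @ wp - omega*yp) < 1e-9
```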
Appendix C
C.1 Binary encoding with continuous weights and step activation
It is a straightforward exercise to completely characterize the weights and biases that solve the binary encoding problem of Sect. 6.3.1 when the network is trained with the step-activation constraint shown in Fig. 3. Combinatorially there are \((2^{n})!\) solutions, corresponding to how the 1-hot positions are mapped to the integers \(0,1,\ldots, 2^{n}-1\). Consider one such solution and let \(j\in C\) be the code node that codes a particular bit, and \(D_{1}(j)\subset D\) be the corresponding 1-hot positions/integers that have a 1 in their binary representation for that bit. Let \(D_{0}(j)\) be the complement of \(D_{1}(j)\), that is, the subset of input nodes which are assigned a 0 for bit j. If Δ is the gap in the step activation, and neglecting the weight-normalization constraint for now, a necessary and sufficient set of constraints on the parameters for correct encoding is
$$\begin{aligned} &\forall j\in C, \forall i\in D_{1}(j):\quad w[i\to j]/\omega -b[j] \ge \Delta /2, \end{aligned}$$
(55a)
$$\begin{aligned} &\forall j\in C, \forall i\in D_{0}(j):\quad w[i\to j]/\omega -b[j] \le -\Delta /2. \end{aligned}$$
(55b)
Combining these to eliminate the biases we obtain
$$\begin{aligned} \forall j\in C, \forall i\in D_{1}(j), \forall i'\in D_{0}(j):\quad w[i\to j]-w\bigl[i'\to j \bigr]\ge \omega \Delta. \end{aligned}$$
(56)
Now define
$$\begin{aligned} &w_{+}(j)=\min_{i\in D_{1}(j)}w[i\to j], \end{aligned}$$
(57a)
$$\begin{aligned} &w_{-}(j)=\min_{i\in D_{0}(j)}\bigl(-w[i\to j]\bigr). \end{aligned}$$
(57b)
Supposing our weights satisfy (56), then
$$\begin{aligned} \forall j\in C: \quad w_{+}(j)+w_{-}(j)\ge \omega \Delta \end{aligned}$$
(58)
and this guarantees that the constraints on the biases from (55a)–(55b)
$$\begin{aligned} \forall j\in C:\quad -w_{-}(j)/\omega +\Delta /2\le b[j]\le w_{+}(j)/ \omega -\Delta /2 \end{aligned}$$
(59)
always have a solution.
When the weights into node j have norm ω, the inequalities (56) will not have a solution when Δ is too large. To obtain the precise limit we use the following:
Lemma C.1
Suppose \((x,y)\in \mathbb{R}^{2}\) satisfy \(x+y\ge a> 0\). Then \(x^{2}+y^{2}\ge a^{2}/2\) and the equality case corresponds to \((x,y)=(a/2,a/2)\).
Proof
The minimum squared distance from the origin to the half-plane \(x+y\ge a\) is \(a^{2}/2\) and is uniquely attained for the stated assignment. □
Consider an arbitrary matching of the nodes in \(D_{1}(j)\) with the nodes in \(D_{0}(j)\), and apply Lemma C.1, with \(x=w[i\to j]\), \(y=-w[i'\to j]\) and \(a=\omega \Delta \), to the corresponding \(D/2\) instances of the constraints (56). Additively combining the resulting norm inequalities we obtain
$$\begin{aligned} \omega ^{2}\ge (D/2) (\omega \Delta )^{2}/2, \end{aligned}$$
(60)
or
$$\begin{aligned} \Delta \le \frac{2}{\sqrt{D}}. \end{aligned}$$
(61)
We only get equality when all \(D/2\) inequalities of the matching are equalities, and for that case the lemma specifies a unique solution:
$$\begin{aligned} &\forall j\in C, \forall i\in D_{1}(j):\quad w[i\to j]= \frac{\omega }{\sqrt{D}}, \end{aligned}$$
(62a)
$$\begin{aligned} &\forall j\in C, \forall i\in D_{0}(j):\quad w[i\to j]= -\frac{\omega }{\sqrt{D}}. \end{aligned}$$
(62b)
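A small check of the encoder analysis (our own sketch for n=2, so D=4; we read (57b) as the minimum of −w over D₀(j), consistent with (58)–(59)): with the extremal gap Δ=2/√D, the weights (62a)–(62b) are normalized, meet (58) with equality, and force the bias window (59) to the single point b[j]=0:

```python
import numpy as np

n = 2                                # code bits
D = 2**n                             # number of 1-hot input positions
omega = 1.0
delta = 2/np.sqrt(D)                 # the maximal gap, Eq. (61)

for j in range(n):                   # one code node per bit
    # Extremal weights (62a)-(62b): +omega/sqrt(D) on D1(j), negative on D0(j).
    w = np.array([omega/np.sqrt(D) if (i >> j) & 1 else -omega/np.sqrt(D)
                  for i in range(D)])
    assert np.isclose(np.linalg.norm(w), omega)       # weight normalization

    w_plus = min(w[i] for i in range(D) if (i >> j) & 1)        # (57a)
    w_minus = min(-w[i] for i in range(D) if not (i >> j) & 1)  # (57b)
    assert np.isclose(w_plus + w_minus, omega*delta)  # (58) with equality

    # The bias window (59) closes to the single point b[j] = 0.
    b_lo = -w_minus/omega + delta/2
    b_hi = w_plus/omega - delta/2
    assert np.isclose(b_lo, 0.0) and np.isclose(b_hi, 0.0)
```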
The analysis of the decoder is similar. For any \(i\in D\) let \(C_{1}(i)\subset C\) be the code nodes on which the corresponding integer assigned to i has a 1 in its binary representation. For the same integer i the nodes \(C_{0}(i)\) in the complement have a 0 bit. Now define
$$\begin{aligned} \forall i\in D:\quad w_{1}(i)=\sum_{j\in C_{1}(i)} w[j \to i]. \end{aligned}$$
(63)
The necessary and sufficient set of constraints on the parameters is now
$$\begin{aligned} &\forall i\in D:\quad w_{1}(i)/\omega -b[i]\ge \Delta /2, \end{aligned}$$
(64a)
$$\begin{aligned} &\forall i\in D, \forall j\in C_{1}(i):\quad \bigl(w_{1}(i)-w[j \to i]\bigr)/ \omega -b[i]\le -\Delta /2, \end{aligned}$$
(64b)
$$\begin{aligned} &\forall i\in D, \forall j\in C_{0}(i):\quad \bigl(w_{1}(i)+w[j \to i]\bigr)/ \omega -b[i]\le -\Delta /2, \end{aligned}$$
(64c)
where the last two inequalities cover, respectively, the case of a correct 1 bit flipping to 0 and a correct 0 bit flipping to 1. Comparing these inequalities with the first we infer
$$\begin{aligned} &\forall i\in D, \forall j\in C_{1}(i):\quad w[j\to i]\ge \omega \Delta, \end{aligned}$$
(65a)
$$\begin{aligned} &\forall i\in D, \forall j\in C_{0}(i):\quad w[j\to i]\le -\omega \Delta, \end{aligned}$$
(65b)
When these inequalities are satisfied we can always find biases that satisfy (64a)–(64c). Moreover, since the norm of the weights into node i is ω, from (65a)–(65b) we obtain the inequality
$$\begin{aligned} \omega ^{2}\ge C (\omega \Delta )^{2} \end{aligned}$$
(66)
or
$$\begin{aligned} \Delta \le \frac{1}{\sqrt{C}}. \end{aligned}$$
(67)
The equality case corresponds to only equalities in (65a)–(65b), that is, weights differing only in sign as dictated by membership of j in \(C_{1}(i)\) or \(C_{0}(i)\).