Random block-coordinate methods for inconsistent convex optimisation problems

We develop a novel randomised block-coordinate primal-dual algorithm for a class of non-smooth ill-posed convex programs. Lying midway between the celebrated Chambolle–Pock primal-dual algorithm and Tseng's accelerated proximal gradient method, we establish global convergence of the last iterate as well as optimal $O(1/k)$ and $O(1/k^{2})$ complexity rates in the convex and strongly convex case, respectively, k being the iteration count. Motivated by the increased complexity in the control of distribution-level electric-power systems, we test the performance of our method on a second-order cone relaxation of an AC-OPF problem. Distributed control is achieved via the distributed locational marginal prices (DLMPs), which are obtained as dual variables in our optimisation framework.


Introduction
In this paper we study non-smooth convex composite optimisation problems of the form min{Ψ(x) := Φ(x) + R(x) | x ∈ X}, where X := arg min{½∥Az − b∥² | z ∈ R^m}, and the functions Φ : R^m → R and R : R^m → (−∞, ∞] are convex and additively decomposable of the form Φ(x) := Σ_{i=1}^d ϕ_i(x_i) and R(x) := Σ_{i=1}^d r_i(x_i). We assume that each function ϕ_i : R^{m_i} → R is smooth, whereas r_i(·) is only required to be proper, convex and lower semi-continuous. We typically think of the smooth component ϕ_i(·) as a convex loss function, whereas r_i(·) can take on the role of a regularising or penalty function. In particular, r_i(·) can represent the indicator function of a closed convex set K_i ⊂ R^{m_i}, encoding individual membership constraints on the decision variable x_i ∈ R^{m_i}. Thus, problem (P) can include certain separable block constraints in addition to the non-separable constraints embodied in the set X. This setting gains relevance in distributed optimisation problems with joint coupling constraints, such as multi-agent optimisation problems [1]. Other examples include convex penalty functions, like the ℓ_1-norm on R^{m_i}. The matrix A = [A_1, …, A_d] is decomposed into q × m_i blocks A_i ∈ R^{q×m_i}. Accordingly, we use the notation x = [x_1; …; x_d] with each x_i ∈ R^{m_i} to represent the blocks of coordinates of x, and m = m_1 + … + m_d. The joint restriction x ∈ X states that a feasible decision variable is a solution to a linear least-squares problem, and X can equivalently be described as the set of solutions to the normal equations X = {x ∈ R^m | A^⊤Ax = A^⊤b}. When b is in the range of A, we can solve the linear system Ax = b exactly, and the optimisation problem reduces to the linearly constrained convex non-smooth minimisation problem min{Ψ(x) | Ax = b} (1). We call this the consistent case. Problems of type (1) are very general and include all generic conic optimisation problems, such as linear programming, second-order cone optimisation and semi-definite programming.
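As a quick numerical illustration of the distinction between the consistent and the inconsistent case, the following sketch (with made-up data; the names `A_blocks`, `b_cons`, `b_incons` and `residual_norm` are ours, not from the paper) checks whether b lies in the range of A by inspecting the least-squares residual:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of problem (P): d = 3 blocks of size m_i = 2, so m = 6, with q = 8 rows.
A_blocks = [rng.standard_normal((8, 2)) for _ in range(3)]
A = np.hstack(A_blocks)                      # A = [A_1, ..., A_d], shape (q, m)

x_true = rng.standard_normal(6)
b_cons = A @ x_true                          # b in the range of A: consistent case
b_incons = b_cons + rng.standard_normal(8)   # generic b: inconsistent when q > m

def residual_norm(A, b):
    """Norm of Ax* - b at a least-squares solution x* (zero iff Ax = b is solvable)."""
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.linalg.norm(A @ x_star - b)

print(residual_norm(A, b_cons))    # ~ 0: the constraint Ax = b can be met exactly
print(residual_norm(A, b_incons))  # > 0: only the normal equations A^T Ax = A^T b hold
```

In the second case only membership in X = {x | A^⊤Ax = A^⊤b} is meaningful, which is exactly the inconsistent regime the paper addresses.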
In particular, partitioning the matrix A into suitably defined blocks, problem (1) covers several important model classes. Example 1 (Consensus optimisation). Consider a collection of agents, each endowed with a local convex objective, who must agree on a common decision. This can be written as a linearly constrained optimisation problem with the matrix A corresponding to the Laplacian of a connected undirected graph and linear constraint Ax = 0. Example 2 (Distributed model fitting). Problem (P) can also cover composite minimisation problems which canonically arise in machine learning. Consider the problem min_{u∈R^p} ℓ(Ku − b) + ρ(u), where ℓ(·) is a convex and smooth statistical loss function of additive form, with ℓ_i : R → R the loss for the i-th training example, K_i ∈ R^p the feature vector for example i, and b_i the output or response for example i. Introducing an auxiliary variable for the predictions Ku, this problem can be cast in the form (P). Problem (P) is more general than the examples just presented, since we do not insist on b belonging to the range of A. Hence, the optimisation problems of interest to us are non-smooth convex problems with inconsistent linear restrictions, under which the exact condition Ax = b is replaced by the approximate condition Ax ≈ b. This assumption gains relevance in inverse problems and PDE-constrained optimisation, where problems of the form (P) appear in the method of residuals (Morozov regularisation) [2]. Another instance where ill-posed linear restrictions appear is studied in an application to power systems in Section 5 of this work. Motivated by all these different optimisation problems, this paper derives a unified random block-coordinate method which solves problem (P) under very general assumptions on the problem data.

Related Methods
Our algorithms are closely related to randomised coordinate descent methods, primal-dual coordinate update methods, and accelerated primal-dual methods. In this subsection, let us briefly review these three classes of methods and discuss their relations to our work.

Linear constrained minimisation
From the theoretical standpoint, one could formulate problem (P) as a linearly constrained optimisation problem of the type (1), with linear restriction A^⊤Ax = A^⊤b (the normal equations). Hence, one could approach problem (P) via primal-dual techniques directly. While we will show that our main algorithmic scheme (Algorithm 1) is indeed equivalent to a suitably defined primal-dual method, it can be argued that this connection is not always a recommended solution strategy in practice. First, a primal-dual implementation has to deal with the symmetric matrix A^⊤A, whose dimension is m×m. If q is much smaller than m, then primal-dual methods would have to process many more data points than direct approaches. Second, if A is a sparse matrix, then A^⊤A is usually not sparse anymore, which leads to heavy use of numerical linear algebra techniques. Finally, it is often not known a priori whether the linear system is de facto consistent with the given inputs. This is in particular the case in the application motivating the development of the numerical scheme presented in this paper, which is concerned with the distributed optimisation of an electrical distribution system within an AC-optimal power flow framework [3,4]. A basic stability desideratum for a numerical solution method (acting as a decentralised coordination mechanism in our application) is therefore that overall system convergence is guaranteed even if the linear constraints are not satisfied with equality. Our method achieves exactly this. Section 5 illustrates the performance of our method on a 15-bus AC-OPF problem taken from [5].
Primal penalty methods An alternative and popular approach to solve (P) is the penalty method. It consists in solving a sequence of unconstrained optimisation problems min_x Ψ(x) + (σ_k/2)∥Ax − b∥², where (σ_k)_k is a positive and increasing sequence of penalty parameters. Intuitively, the larger σ_k is chosen, the more importance the optimisation gives to the constraints. Since penalty methods are entirely primal, they do not use duality arguments, and hence they are in principle able to solve the inconsistent case as well. However, their implementation usually involves two loops: an inner loop solving the minimisation problem for a fixed parameter σ_k to some desired accuracy, followed by an outer loop describing how the penalty parameter is updated. Viewed from this angle, our algorithm performs these operations in a single-loop fashion.
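The two-loop structure of the penalty method can be sketched as follows. We use the hypothetical toy objective Ψ(x) = ½∥x∥² so that each penalised subproblem has a closed-form solution; in general the inner loop is itself an iterative solver run to some accuracy:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)        # generically inconsistent (q = 8 > m = 4)

x = np.zeros(4)
for sigma in [1.0, 10.0, 100.0, 1000.0]:          # outer loop: sigma_k increases
    # inner "loop": argmin_x 0.5||x||^2 + (sigma/2)||Ax - b||^2
    #               <=>  (I + sigma A^T A) x = sigma A^T b
    x = np.linalg.solve(np.eye(4) + sigma * A.T @ A, sigma * A.T @ b)

# For large sigma, x approaches the minimiser of Psi over X; here A has full
# column rank, so X reduces to the unique least-squares solution.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_ls))   # small for the final sigma
```

Note that the scheme converges even though Ax = b has no exact solution, which is the advantage of purely primal penalisation highlighted above.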

Randomised coordinate descent methods
In the absence of the linear constraints, our algorithm specialises to randomised coordinate descent (RCD), which was first proposed in [6] and later generalised to the non-smooth case in [7,8]. It was shown that RCD features sublinear convergence with rate O(1/k), k being the iteration counter. Acceleration to an O(1/k²) complexity, and even linear rates for strongly convex problems, has been obtained. Extensions to parallel computations, important for large-scale optimisation problems, were first proposed in [9].
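A minimal sketch of randomised coordinate descent on a smooth, strongly convex quadratic (an illustrative instance of our own making, not the paper's constrained setting) looks like this:

```python
import numpy as np

rng = np.random.default_rng(2)

# RCD on f(x) = 0.5 x^T M x - c^T x with coordinate Lipschitz constants L_i = M_ii.
G = rng.standard_normal((6, 6))
M = G.T @ G + np.eye(6)          # symmetric positive definite
c = rng.standard_normal(6)
L = np.diag(M)

x = np.zeros(6)
for _ in range(20000):
    i = rng.integers(6)          # uniform sampling of a single coordinate
    grad_i = M[i] @ x - c[i]     # partial gradient w.r.t. coordinate i
    x[i] -= grad_i / L[i]        # coordinate step with step size 1/L_i

x_star = np.linalg.solve(M, c)
print(np.linalg.norm(x - x_star))    # ~ 0 after sufficiently many iterations
```

Since the instance is strongly convex, the iterates contract towards the unique minimiser, in line with the linear rates cited above.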

Primal-dual coordinate update methods
To cope with linear constraints, a very popular approach is the alternating direction method of multipliers (ADMM). Originally, ADMM [10,11] was proposed for two-block structured problems with separable objective. The convergence and complexity analysis of this method is well documented in the literature [12]; see [13] for a survey. Direct extension of the ADMM to multi-block settings such as (P) is not straightforward, and it may indeed fail to converge [14]. Very recently, [15] proposed a randomised primal-dual coordinate (RPDC) update method, whose asynchronous parallel version was then studied in [16]. It was shown that RPDC converges with rate O(1/k) under a convexity assumption. Improved complexity statements for multi-block ADMM can be found in [17].
Accelerated primal-dual methods It is possible to accelerate the rate of convergence from O(1/k) to O(1/k²) for gradient-type methods. The first acceleration result was obtained by [18] for smooth unconstrained problems. The technique has been generalised to accelerated gradient-type methods for possibly non-smooth convex programs [19,20]. Primal-dual methods for solving linearly constrained problems can also be accelerated by similar techniques. Under a convexity assumption, the augmented Lagrangian method (ALM) is accelerated in [21].

Methodological contributions
We propose a block-coordinate implementation of the method developed by [22] for linearly constrained optimisation, lying midway between the celebrated primal-dual hybrid gradient algorithm [23] and Tseng's accelerated proximal gradient method [24]. Specifically, our proposed method is a distributed interpretation of the primal-dual algorithm of [23] which operates on randomly selected coordinate blocks. A parallel between the Chambolle-Pock method [23] and the accelerated proximal gradient of [24] was already drawn in [22,25]. Reducing the primal-dual algorithm to an implementation of Tseng's method enabled [25] to derive new convergence results based on primal arguments, thus departing from strong-duality requirements and the ergodic rates typically issued for primal-dual methods. Our developments extend the coordinate-descent implementation proposed in [26] for the basic algorithm of [22] to a block-coordinate descent setting featuring Nesterov-type acceleration. In particular, in the strongly convex case, we derive a new step-size policy that achieves an accelerated rate of O(k^{-2}). Besides the recent preprint [27], we are not aware of a similar algorithm achieving the accelerated convergence rate O(k^{-2}) in a fully distributed computational setting. Thus, our main contributions in this paper can be summarised as follows: (i) In the convex case, provided problem (P) possesses a saddle point (defined in Section 2.6), our main result is Theorem 6, which establishes an O(1/k) iteration complexity in terms of the objective function gap, together with convergence of the last iterate. (ii) In the strongly convex case, under uniform sampling of coordinate blocks, our main result is Theorem 8, which proves an accelerated O(k^{-2}) convergence rate in the primal objective function values.
We remark that RCD methods have been shown to exhibit linear convergence rates in the strongly convex regime [6]. Such fast convergence is, however, not to be expected in the presence of linear constraints: while strong convexity of the primal objective ensures smoothness of the Lagrangian dual function, it does not ensure its strong concavity.
Hence, in general, we do not expect to see linear convergence rates by only assuming strong convexity in the primal. However, we note that [28] obtain linear convergence rates in the consistent regime if there is one block variable that is independent of all others in the objective (but coupled in the linear constraint) and also the corresponding component function is smooth. Related to this paper is also the very recent work [29]. They consider a larger class of convex optimisation problems with linear constraints and design a new randomized primal-dual algorithm with single block activation in each iteration and similar complexity results as reported in the present work. However, our method is able to solve inconsistent convex optimisation problems with general sampling techniques.

Organisation of this paper
This paper is organised as follows. Section 2 describes our block-coordinate descent framework. Section 3 explains in detail our algorithmic approach. Section 4 contains all details of the asymptotic convergence and finite-time complexity statements. Section 5 describes a challenging application of our algorithm to a distributed optimisation formulation of an AC-OPF problem; we formulate the distribution grid model based on the second-order cone relaxation of [30,31]. Preliminary numerical results are reported to show the applicability of our method, using the 15-bus network studied in [5] as a concrete example.

Notation
We work in the space R^p composed of column vectors. For x, u ∈ R^p we denote the standard Euclidean inner product ⟨x, u⟩ = x^⊤u and the Euclidean norm ∥x∥ = ⟨x, x⟩^{1/2}. We let S^p_+ := {B ∈ R^{p×p} | B^⊤ = B, B ⪰ 0} denote the cone of symmetric positive semi-definite matrices. Given Λ ∈ S^p_+, we let ∥x∥_Λ := ⟨Λx, x⟩^{1/2} for x ∈ R^p. The identity matrix of dimension p is denoted by I_p; whenever the dimension is clear from the context, we will just write I. We denote by λ_max(A) the largest eigenvalue of a square p × p matrix A. We denote by Γ_0(R^p) the set of proper, convex, lower semi-continuous functions on R^p.

Block structure
We first describe the block setup which has become standard in the analysis of block-coordinate methods [8,9,32,33]. The block structure of (P) is given by a decomposition of the identity matrix I_m = [U_1, …, U_d] with U_i ∈ R^{m×m_i}, and we let x_i = U_i^⊤x be the block of coordinates corresponding to the columns of U_i. Any vector s ∈ R^m can be written as s = Σ_{i=1}^d U_i s_i. We denote the ℓ_2-norm on R^{m_i} as ∥·∥_i. If Q = blkdiag[Q_1; …; Q_d] is a block-diagonal matrix with Q_i ∈ S^{m_i}_{++}, we define a weighted norm on R^m by ∥x∥_Q := (Σ_{i=1}^d ∥x_i∥²_{Q_i})^{1/2}.
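The decomposition of the identity into column blocks can be illustrated numerically as follows (the block sizes are hypothetical):

```python
import numpy as np

# Block decomposition of the identity I_m = [U_1, U_2, U_3]; illustrative sizes.
sizes = [2, 3, 1]
m = sum(sizes)
offsets = np.cumsum([0] + sizes)
U = [np.eye(m)[:, offsets[i]:offsets[i + 1]] for i in range(len(sizes))]

s = np.arange(1.0, m + 1)            # an arbitrary vector s in R^m
blocks = [U_i.T @ s for U_i in U]    # s_i = U_i^T s: the i-th coordinate block
s_rebuilt = sum(U_i @ s_i for U_i, s_i in zip(U, blocks))
print(np.allclose(s, s_rebuilt))     # True: s = sum_i U_i s_i
```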

Smoothness of Φ
We assume throughout the paper that Φ : R^m → R is convex and possesses Lipschitz continuous partial gradients. Specifically, we assume that for each block i ∈ [d] there exists a matrix Λ_i ∈ S^{m_i}_{++} so that, for all x ∈ R^m and h_i ∈ R^{m_i}, Φ(x + U_i h_i) ≤ Φ(x) + ⟨∇_i Φ(x), h_i⟩ + ½∥h_i∥²_{Λ_i}. A typical situation is that Λ_i = L_i I_{m_i} for a scalar L_i > 0, so that the partial gradient ∇_i Φ is Lipschitz continuous with modulus L_i. Allowing for matrix-valued parameters increases generality and takes into account that the norms on the factors R^{m_i} might differ from block to block. We collect all the matrices Λ_i in one block-diagonal matrix Λ := blkdiag[Λ_1; …; Λ_d].

Properties of R
We assume that R : R^m → (−∞, ∞] is additively decomposable, R(x) = Σ_{i=1}^d r_i(x_i), where the functions r_i : R^{m_i} → (−∞, ∞] are µ_i-strongly convex and closed, with µ_i ≥ 0. Calling Υ_i = µ_i I_{m_i}, this gives, for all x_i, y_i ∈ R^{m_i} and g_i ∈ ∂r_i(x_i), r_i(y_i) ≥ r_i(x_i) + ⟨g_i, y_i − x_i⟩ + ½∥y_i − x_i∥²_{Υ_i}. Typical examples for the regulariser r_i are indicator functions of closed convex sets, i.e. r_i(x_i) = δ_{K_i}(x_i) for K_i ⊂ R^{m_i} convex and closed, or structure-imposing regularisers like the ℓ_p-penalty r_i(x_i) = ∥x_i∥_p^p (p ≥ 1), prominent in distributed estimation of high-dimensional signals and neural networks. We let Υ = blkdiag[Υ_1; …; Υ_d] be the m × m matrix collecting the strong-convexity parameters of the individual blocks.
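For concreteness, the proximal maps of two typical choices of r_i, the ℓ_1-penalty and the indicator of a box, are the standard soft-thresholding and projection operators:

```python
import numpy as np

def prox_l1(v, t):
    """Proximal map of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_box(v, lo, hi):
    """Proximal map of the indicator of K = [lo, hi]^n: Euclidean projection."""
    return np.clip(v, lo, hi)

v = np.array([-2.0, 0.3, 1.5])
print(prox_l1(v, 0.5))         # [-1.5, 0.0, 1.0]
print(prox_box(v, -1.0, 1.0))  # [-1.0, 0.3, 1.0]
```

Both maps are separable across blocks, which is what makes the forward-backward steps of the method implementable block by block.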

Quadratic Penalty function
Define h(x) := ½∥Ax − b∥² for x ∈ R^m, and let h* = min_{x∈R^m} h(x), so that X = arg min_{x∈R^m} h(x). Clearly, h* ≥ 0, with equality if and only if the linear system Ax = b is consistent.
Since h is quadratic, we have for all u, w ∈ R^m, h(u) = h(w) + ⟨∇h(w), u − w⟩ + ½∥A(u − w)∥². In particular, if x* ∈ X, then ∇h(x*) = 0 and the above implies, for all u ∈ R^m, h(u) = h* + ½∥A(u − x*)∥². We also denote by ∥A_i∥ the spectral norm of the matrix A_i, and Lemma 9 (proven in Appendix A.1) immediately implies a corresponding interpolation estimate for all t ∈ [0, 1].
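The quadratic expansion of h around a least-squares solution can be verified numerically; the sketch below uses random data and the identity h(u) = h* + ½∥A(u − x*)∥², which follows from ∇h(x*) = 0:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)                      # generically inconsistent

def h(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # some x* in X = argmin h
h_star = h(x_star)

# grad h(x*) = A^T(Ax* - b) = 0 and h is quadratic, hence
# h(u) = h* + 0.5 * ||A(u - x*)||^2 for every u:
u = rng.standard_normal(4)
lhs = h(u)
rhs = h_star + 0.5 * np.linalg.norm(A @ (u - x_star)) ** 2
print(abs(lhs - rhs))    # ~ 0 up to rounding
```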

On Saddle points
The optimisation problem (P) can be equivalently expressed as the linearly constrained optimisation problem min_{x=(x_1,…,x_d)} Ψ(x) s.t. A^⊤Ax = A^⊤b. The Lagrangian associated with this non-smooth convex optimisation problem is L(x, y) = Ψ(x) + ⟨y, A^⊤Ax − A^⊤b⟩. For convex programs, the conditions (8) are sufficient for x* to be a solution of (P). They are also necessary if a constraint qualification condition holds (e.g. the Slater condition, stating that there exists x in the interior of the domain of Ψ such that A^⊤Ax = A^⊤b).

Random Sampling
We next introduce our random sampling strategy. Our approach is very general, and allows for virtually all existing sampling strategies considered in the literature. We refer the reader to [34,35] for an in-depth systematic overview on this topic.
Let (Ω, F, P) be a probability space. By a sampling we mean a random set-valued mapping with values in 2^[d]. We will call the random variable I : Ω → 2^[d] a random sampling. I(ω) defines the selection of blocks employed in a single iteration of our method. The set I(Ω) = Σ is the set of all possible realisations of the random selection mechanism. Let {ι_k}_{k∈N} represent the stochastic process on 2^[d] in which each random variable ι_k is an i.i.d copy of I. We refer to such a sampling as an i.i.d sampling. We assume that the sampling I is proper: there exists a vector π = [π_1, …, π_d] ∈ R^d with π_i := P(i ∈ I) > 0 for all i ∈ [d]. With the sampling I, we associate the matrix Π ∈ R^{d×d} defined by Π_{ij} := P({i, j} ⊆ I). We note that Π ≻ 0 [35, Thm 3.1]. We further define the weighting matrix P := blkdiag[π_1^{-1} I_{m_1}; …; π_d^{-1} I_{m_d}]. We emphasise that the random sampling model we adopt here is capable of capturing many stationary randomised activation mechanisms. To illustrate this, consider the following activation mechanisms: • A special case of a uniform sampling is the popular class of m-nice samplings, arising when Σ is the set of all subsets of [d] containing exactly m elements, each of which is activated with uniform probability. Clearly, in this case one has π_i = m/d for all i ∈ [d]. • Full sampling: Σ = {{1, …, d}}, which means that all coordinates are updated in parallel.
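The m-nice sampling can be simulated directly; the following sketch (with illustrative sizes, writing `n_active` for the cardinality to avoid clashing with the dimension m) draws subsets of fixed cardinality and confirms empirically that each block is activated with probability n_active/d:

```python
import itertools
import numpy as np

d, n_active = 4, 2          # d blocks; "n_active-nice" sampling (m-nice in the text)
subsets = list(itertools.combinations(range(d), n_active))

rng = np.random.default_rng(4)
def draw():
    """One realisation of the sampling I: a uniformly chosen subset of size n_active."""
    return subsets[rng.integers(len(subsets))]

counts = np.zeros(d)
n_draws = 20000
for _ in range(n_draws):
    for i in draw():
        counts[i] += 1
print(counts / n_draws)     # each entry ~ pi_i = n_active / d = 0.5
```

The sampling is proper (every π_i > 0) and uniform, so it fits the framework above.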

Algorithm 1: Distributed Accelerated Proximal Gradient Algorithm
Parameters: step-size sequences (σ_k)_{k≥0}, (τ_k)_{k≥0}; block matrices B_1, …, B_d; initial points. At each iteration k: draw the block ι_k as an i.i.d copy of the sampling I, and update the active blocks i ∈ ι_k.

Parallel block coordinate algorithm
Our random block-coordinate algorithm for solving (P) recursively updates three sequences (x^k)_{k≥0}, (w^k)_{k≥0} and (z^k)_{k≥0}. Let (σ_k)_{k≥0} be a given sequence of positive numbers. At each iteration k ≥ 0, we define a weight sequence (S_k)_{k≥0} recursively by S_k = S_{k−1} + σ_k, with S_0 = 1. Consider the extrapolated point z^k together with the sequence (w^k)_{k≥0}; this reads in coordinates w_i^{k+1} for the active blocks i ∈ ι_k. To evaluate w^{k+1}, we need the primal update x^{k+1}, which is obtained by a weighted forward-backward step involving the first-order signal of the smooth part, and which reads explicitly as a block proximal-gradient step. We will choose the matrices B_i later on to adapt to the strong convexity present in the problem data. Putting all these tools together yields Algorithm 1.

Relation to primal-dual methods
Our method is very similar to the recent block-coordinate primal-dual update of [28], which focuses on the consistent case and uniform samplings [6,9,32,35]. We generalise this to arbitrary samplings and show that the sequences produced by Algorithm 1 are equivalent to a primal-dual process in the spirit of [26], formulated as Algorithm 2.
Let {(x^k, w^k, z^k)}_{k≥0} denote the sequences generated by running Algorithm 1. Let S_k be the cumulative step-size process S_k = 1 + Σ_{t=1}^k σ_t. Introduce a new dual variable sequence (y^k)_{k≥0}. In terms of this new dual variable y^k, we can reorganise the primal update so that the next iterate x^{k+1} is obtained by the block-coordinate update rule (A2). This gives line 3 of Algorithm 2. Next, after introducing the residual u^k = Ax^k − b, we obtain line 4 of Algorithm 2, as well as line 5 of Algorithm 2, where we have used in the last step the identity (12). This completes the verification that the sequence {(x^k, w^k, z^k)}_{k≥0} generated by Algorithm 1 is equivalent to the just constructed sequence {(x^k, u^k, y^k)}_{k≥0} corresponding to the iterates of Algorithm 2.

Remark 1. The implementation of Algorithm 2 is fully parallelisable, involving a computational architecture with d agents and a single central coordinator. The agents manage the coordinate blocks x_i in a fully decentralised way, using information about the centrally updated dual variable y^k only. A practical implementation of the computational scheme is as follows: 1. Given the current data (x^k, u^k, y^k), the coordinator realises a sampling ι_k. 2. All agents in ι_k receive the order to update their control variables in parallel, given their current position x_i^k, the data matrix A_i and the penalty of resource utilisation y^k. 3. Once all active agents have executed their computation, they report their updates back to the coordinator. Distributed primal-dual methods such as the one described have received enormous attention in the control and machine learning community; see e.g. [28,33,36-40].

Remark 2. Consider the special case with π_i = 1/d and d = 1, as well as σ_k ≡ σ. Then Algorithm 2 coincides with the primal-dual method of Chambolle-Pock [23]. In fact, in this case it follows that u^k = Ax^k − b, and the updates reduce to the classical primal-dual iteration.
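The special case of Remark 2 can be sketched numerically. We take the toy objective Ψ(x) = ½∥x∥², whose proximal map is closed-form, and a consistent system; the fixed steps τσ∥A∥² < 1 are a common sufficient condition for the Chambolle-Pock iteration, not the paper's step-size policy:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 6))
b = A @ rng.standard_normal(6)             # consistent system: b in the range of A

# Chambolle-Pock iteration for min 0.5*||x||^2  s.t.  Ax = b.
# prox of tau*0.5*||.||^2 is v / (1 + tau).
tau = sigma = 0.9 / np.linalg.norm(A, 2)   # tau*sigma*||A||^2 = 0.81 < 1
x = np.zeros(6); y = np.zeros(3); x_bar = x.copy()
for _ in range(20000):
    y = y + sigma * (A @ x_bar - b)        # dual ascent on the residual
    x_new = (x - tau * (A.T @ y)) / (1 + tau)
    x_bar = 2 * x_new - x                  # extrapolation step
    x = x_new

print(np.linalg.norm(A @ x - b))           # ~ 0: feasibility
x_mn = A.T @ np.linalg.solve(A @ A.T, b)   # minimum-norm solution of Ax = b
print(np.linalg.norm(x - x_mn))            # ~ 0: x converges to it
```

The limit is the minimum-norm feasible point, as optimality for this toy objective requires x ∈ range(A^⊤).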

Convergence Analysis
This section is concerned with the convergence properties of Algorithm 1. We start with a basic descent property of the primal forward-backward step. This will involve the identification of a Lyapunov function to obtain energy decay estimates in a variable metric. Building on this result, we investigate two important scenarios in isolation: First, we consider the merely convex case, which is obtained when Υ = 0. If Υ ≻ 0, then accelerated rates in the primal iterates can be obtained. This, however, requires the derivation of a suitable step-size policy that exploits strong convexity for boosting the performance of the method. Understanding the mechanics of this step-size regime involves a delicate analysis of the thus-constructed step-size policy, which is relegated to Appendix B.

Lyapunov function and key estimates
We start the analysis with a small extension of "Property 1" stated in [24] for Bregman proximal gradient algorithms. Lemma 1. For all k ≥ 0 and x ∈ R^m, define the quantities below. Then, for all x ∈ R^m the stated bound holds true. Proof. Write the forward-backward operator in coordinates. From the definition of the forward-backward operator (11), we obtain the block-wise optimality condition. Hence, for all x ∈ R^m and i ∈ [d], we obtain from eq. (3) the corresponding block-wise inequality. Summing over all blocks i ∈ [d], the summed inequality follows. Adding these two inequalities and rearranging gives the claimed result.
Lemma 11, together with Lemma 12, shows the bound below. Applying eq. (A5), (A6), (A9) and (A10), as well as the identity θ_k = σ_k/S_k, the above becomes the desired estimate. Finally, we need a characterisation of the sequence (w^k)_{k≥0}, which is quite standard in the analysis of randomised block-coordinate methods [32,34,35]. We therefore skip the straightforward proof. Lemma 2. Let (x^k, w^k)_{k≥0} be the iterates of Algorithm 1. Then, for all k ≥ 1, we have the representation (17), where for each i ∈ [d] the coefficients (γ_i^{k,t})_{t=0}^k are defined recursively by setting γ_i^{0,0} = 1. Moreover, for all k ≥ 0, the following identity holds. Moreover, if θ_0 ∈ (0, min_{i∈[d]} π_i] and (θ_k)_{k≥0} is a decreasing sequence, then for all k ≥ 0 and i ∈ [d], the coefficients (γ_i^{k,t})_{t=0}^k are all positive and add up to 1.
Using (14), we can continue with the estimate below. Consequently, for x* ∈ X = arg min_x h(x) as a reference point, the following bounds are obtained via eq. (A8). Next, we apply identity (5) to each inner product separately. Combined with the previous display, this shows the corresponding bound for x* ∈ arg min_{x∈R^m} h(x). Finally, applying identity (16), together with the defining identity for 2S_k, and plugging the resulting expression into the penultimate display, we arrive at the claimed estimate. Define the vector-valued function ⃗ψ. Thanks to Lemma 2, we have Ψ̄_k ≥ Ψ(w^k) for all k ≥ 0.
Proof. Eq. (21) can easily be seen by observing the block-wise identity and summing over all i ∈ {1, …, d}. To prove (22), we first observe the corresponding block-wise estimate; taking conditional expectations on both sides, the claim follows. The next result characterises a suitable Lyapunov function in an adapted variable metric. We set B := blkdiag[B_1; …; B_d]. Lemma 4. Assume that θ_0 ∈ (0, min_{i∈[d]} π_i], and that the sequences (σ_k)_{k≥0}, (τ_k)_{k≥0} satisfy the matrix inequalities (23) and (24). Define the matrix-valued sequence (W_k)_{k≥0} ⊂ S^m_{++} by W_k = σ_k τ_k P²B + σ_k(P − I)Υ, and introduce the functions (25) and (26). Then (27) holds for all x* ∈ X. Proof. Using identity (21) with M = Q_k + Υ, combining with the bound provided by (20), and adding the three resulting expressions together, we obtain the recursion (28). Recall that Q_k = (1/τ_k)PB. Condition (23) guarantees that Q_k ⪰ Λ + σ_k Ξ. Multiplying both sides of (28) by σ_k and invoking condition (24), we may exploit the resulting relation in the penultimate display; using definitions (25) and (26), one readily obtains the claim.

The Convex case
Suppose that Υ = 0, so that W_k = σ_k τ_k P²B. In this case, the matrix inequality (24) is satisfied for any sequences (σ_k)_{k≥0}, (τ_k)_{k≥0} satisfying σ_k τ_k ≥ σ_{k+1} τ_{k+1}. In particular, this can be realised with the policy σ_k = στ_k for all k ≥ 0 and some constant σ ∈ (0, min_{i∈[d]} π_i]. This implies θ_k = σ/(1 + kσ) for all k ≥ 0. This specification satisfies all conditions needed for Lemma 2 to hold. We further set τ_k ≡ 1. It only remains to see if matrix inequality (23) is satisfied. In fact, this condition gives us a restriction on σ, reading as PB ⪰ σΞ + Λ.
We show later in this section that this matrix inequality is achievable. In this regime, we will show that we get an O(1/k) iteration complexity estimate of the averaged sequence (w^k)_k in terms of the objective function value. This extends the results reported in [22,25,26] to general random block-coordinate activation schemes. The following assumptions shall be in place throughout this section: Assumption 1. Problem (P) admits a saddle point, i.e. a pair (x*, y*) ∈ R^m × R^m satisfying (8). Assumption 2. The solution set X* of problem (P) is nonempty. Thanks to eq. (17), we know that w^k ∈ Conv(x^0, …, x^k) ⊆ dom(r). Therefore, the bound below holds with δ := √2∥Ay*∥. We assume that τ_k ≡ 1 and σ_k ≡ σ > 0, so that the energy functions (25) and (26) simplify accordingly. Lemma 5. Suppose that Assumption 1 applies. Then the process (S_{k−1}(F_k − Ψ(x*)))_{k≥0} is almost surely bounded.
Proof. Note that the convex function t ↦ (S²/2)t² − δSt attains the global minimum value −δ²/2. Hence, the claimed bound follows. The last inequality uses the fact that x* ∈ X, so that h(w^k) ≥ h(x*). This completes the proof.
We are now in the position to give the proof of the first main result of this paper. Theorem 6. Suppose that Assumptions 1 and 2 apply. Consider the sequence (z^k, w^k, x^k)_{k≥0} generated by Algorithm 1 with σ_k ≡ σ and τ_k ≡ 1 for all k ≥ 1. Then, the following holds: Proof. (i) From Lemma 5, it follows that the process E_k := V_k(x*) + δ²/2 is non-negative. Eq. (27) shows that E_k satisfies a supermartingale-type recursion. Lemma 10 implies that (E_k)_{k≥0} converges a.s. to a limiting random variable E_∞ ∈ [0, ∞) and is bounded. By definition of the energy function V_k(x*), the sequence (x^k)_{k≥0} is bounded and (31) holds. Hence, by Lemma 2 and the Silverman-Toeplitz theorem, the sequence (w^k)_k converges to the same limit point as (x^k)_k, a.s.
By definition of the energy function V_k, the sequence (x^k)_{k≥0} is a.s. bounded. We show next that all accumulation points are contained in the set of solutions X* of (P).
(ii) We next show that cluster points are unique and thereby demonstrate almost sure convergence of the last iterate. We argue by contradiction. Suppose there are converging subsequences (x k ) k∈K1 and (x k ) k∈K2 with limit points x ′ and x ′′ , respectively. Hence, Ψ(x ′ ) = Ψ(x ′′ ) ≡ Ψ * . Following the above argument, we see that x ′ , x ′′ ∈ X * a.s. The point x * chosen in the previous argument was arbitrary, so we can replace it by x ′ . Let us simplify notation by setting a k := S k (F k − Ψ * ). Since V k (x ′ ) converges almost surely, we see Similarly, We conclude that lim k→∞,k∈K1 Repeating the same analysis for x ′′ instead of x ′ , we get lim k→∞,k∈K2 Combining these two equalities, we see that x ′ = x ′′ . Therefore, the whole sequence (x k ) k converges pointwise almost surely to a solution. Lemma 2 yields the same assertion for (w k ) k .
(iii) Taking expectations on both sides of (27) and iterating the thus obtained expression gives From here, we deduce To obtain a lower bound, it suffices to apply eq. (30) and use the results from (i) to obtain We conclude this section with a concrete example showing that the matrix inequality (29) can be satisfied.
In this case eq. (29) can be decomposed into the block-specific conditions π_i^{-1}B_i ⪰ Λ_i + (σ/π_i)A_i^⊤A_i, i ∈ [d]. Let us assume further that Λ_i = L_i I_{m_i} for a scalar L_i > 0, and choose B_i = π_i(λ_i + L_i)I_{m_i}, where λ_i = ∥A_i∥². Then a sufficient condition for satisfying the matrix inequality is σ ≤ π_i for all i ∈ [d]. This condition can be satisfied by choosing σ = min_{i∈[d]} π_i, which, for a given set of activation probabilities, is the largest value compatible with Lemma 2.
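The block-wise condition can be checked numerically. The sketch below uses hypothetical activation probabilities and Lipschitz moduli and verifies that σ = min_i π_i makes each block inequality hold:

```python
import numpy as np

rng = np.random.default_rng(6)
A_blocks = [rng.standard_normal((5, 2)) for _ in range(3)]
pi = np.array([0.5, 0.3, 0.2])   # hypothetical activation probabilities
L = np.array([1.0, 2.0, 0.5])    # hypothetical moduli: Lambda_i = L_i * I

lam = np.array([np.linalg.norm(Ai, 2) ** 2 for Ai in A_blocks])  # lambda_i = ||A_i||^2
sigma = pi.min()                 # sigma = min_i pi_i

# Check pi_i^{-1} B_i >= Lambda_i + (sigma/pi_i) A_i^T A_i
# with B_i = pi_i * (lambda_i + L_i) * I, block by block:
checks = []
for Ai, p, Li, li in zip(A_blocks, pi, L, lam):
    lhs = (li + Li) * np.eye(2)                      # pi_i^{-1} B_i
    rhs = Li * np.eye(2) + (sigma / p) * Ai.T @ Ai
    checks.append(bool(np.linalg.eigvalsh(lhs - rhs).min() >= -1e-12))
print(checks)   # [True, True, True]
```

The difference lhs − rhs equals λ_i I − (σ/π_i)A_i^⊤A_i, whose smallest eigenvalue is λ_i(1 − σ/π_i) ≥ 0 exactly when σ ≤ π_i, matching the derivation above.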

The strongly convex case
In this section we study the performance of Algorithm 1 in the strongly convex regime. Assumption 3. Υ ≻ 0.
The main challenge we need to overcome is to design step-size sequences which obey the matrix inequalities (23) and (24). Let us focus on (23) first and show how it can be restated in a more symmetric way. To that end, define the matrices B = P^{-2}Υ and M^{1/2} := P^{-1/2}Υ^{1/2}, so that M = P^{-1}Υ, and restate (23) in terms of this new parameter matrix. It is easy to see that Ξ is symmetric. Furthermore, for u ∈ R^m \ {0}, we see that u^⊤Ξu = E[∥At∥²] ≥ 0, where t := P E_I u is a random vector in R^m with mean u (E_I denoting the block-diagonal 0-1 matrix selecting the blocks activated by the sampling I). It follows that Ξ ∈ S^m_+. This suggests relating the step-size parameters to the spectra of the involved matrices. The rest of this section assumes the following parametric model on the coupling between σ_k and τ_k. Assumption 4. The sequence (τ_k)_k is non-increasing and positive, and the sequences (σ_k)_k, (τ_k)_k are related by the coupling equation σ_k τ_k = α − βτ_k (34), where α and β are positive numbers.
We note that this coupling relation automatically means that σ_k = α/τ_k − β is non-decreasing. A sufficient condition to satisfy eq. (33) is stated next; using the model (34), this boils down to (35). We make the choice β = αλ_max(M^{-1/2}(Λ + Υ)M^{-1/2}) (36), so that the coupling equation (34) gives rise to a step-size process satisfying matrix inequality (23). Remark 3. To get some intuition behind the derived conditions for α and β, consider the independent sampling case, under which the operator Ξ factorises to Ξ = blkdiag(π_1^{-1}A_1^⊤A_1; …; π_d^{-1}A_d^⊤A_d). Then the above choice yields an explicit expression for κ. Clearly, κ is related to the condition number of problem (P), and satisfies κ ≥ 1/π_i for all i ∈ [d]. We point out that the explicit expression for κ does not rely on the independent sampling assumption; only the quantification of the parameter α exploits this special structure. Let us fix these identified values for α and β, and continue our derivation with matrix inequality (24). Let us choose B = P^{-2}Υ = blkdiag[π_1²µ_1 I_{m_1}; …; π_d²µ_d I_{m_d}]. Using the block structure and the specification of the sequence (σ_k)_k, it can be verified that (24) reduces to a scalar-valued inequality, which defines a quadratic inequality in τ_{k+1} of the form (40). We study the qualitative properties of the sequence (τ_k)_k so produced in detail in Appendix B, under the following assumptions: Assumption 6. The sampling is uniform: π_1 = … = π_d. Uniform sampling is a very common sampling scheme employed in the literature. It contains as special cases the single-coordinate activation scheme (i.e. π_i = 1/d) and the full sampling scheme. In particular, the frequently employed m-nice sampling (cf. Section 2.7) is contained in this framework. Under this assumption we prove that if τ_0 is chosen sufficiently small, then the sequence (τ_k)_k exhibits the same qualitative behaviour as a Nesterov accelerated method [41,42]. This is summarised in Lemma 7, whose proof is provided in Appendix B. Lemma 7.
Let Assumptions 4, 5 and 6 hold, and consider the step-size sequences (σ_k)_k, (τ_k)_k constructed recursively via eq. (40), with initial condition τ_0 ∈ (0, 1/κ). Then (τ_k)_k is monotonically decreasing and satisfies τ_k = O(1/k). In particular, σ_k = O(k) and S_k = O(k²). We can therefore prove accelerated rates for our scheme in the strongly convex case.
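The qualitative behaviour described in Lemma 7 can be illustrated numerically. The sketch below uses a standard Nesterov-type recursion as a stand-in for the quadratic update (40), whose exact form is derived in Appendix B; the values of α, β and τ_0, and the recursion itself, are illustrative assumptions rather than the paper's formulas. The coupling σ_k = α/τ_k − β then produces linearly growing σ_k and quadratically growing partial sums S_k.

```python
import numpy as np

# Sketch of the step-size behaviour of Lemma 7, with a Nesterov-type
# recursion tau_{k+1}^2 = (1 - tau_{k+1}) tau_k^2 used as a stand-in
# for the paper's quadratic update (40). alpha, beta, tau0 are made up.

def step_sizes(tau0, alpha, beta, iters):
    taus = [tau0]
    for _ in range(iters - 1):
        t = taus[-1]
        # positive root of x^2 + t^2 x - t^2 = 0, i.e. x^2 = (1 - x) t^2
        taus.append(t * (np.sqrt(t**2 + 4) - t) / 2)
    taus = np.array(taus)
    sigmas = alpha / taus - beta   # coupling equation sigma_k = alpha/tau_k - beta
    return taus, sigmas

taus, sigmas = step_sizes(tau0=0.5, alpha=1.0, beta=1.0, iters=2000)
S = np.cumsum(sigmas)
# tau_k decays like O(1/k), sigma_k grows like O(k), S_k like O(k^2)
```

Here τ_k ≈ 2/(k + 2/τ_0), so σ_k grows linearly, matching the orders stated in the lemma.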
Theorem 8. Suppose that Assumptions 1-6 apply, and consider the sequence (z_k, w_k, x_k)_{k≥0} generated by Algorithm 1 with the step-size policy constructed via eqs. (34), (35) and (36). Then the following holds: Proof. (i) If a Lagrange multiplier exists, then Lemma 5 still applies and guarantees that V_k(x*) + δ²/2 ≥ 0 for all k ≥ 0, and the recursion (32) applies. Since P^{-1}Υ − τ_k(σ_k Ξ + Λ) ⪰ 0, we can upper bound the right-hand side of the resulting display accordingly. The supermartingale convergence theorem (Lemma 10) then implies almost sure convergence of the iterates; moreover, (w_k)_k and (x_k)_k share the same limit.
(ii) We can repeat the arguments of Theorem 6 to conclude that the accumulation points of (w_k)_k and (x_k)_k converge to the same random variable with values in X*. However, since the problem is strongly convex, we must have X* = {x̄} for some x̄ ∈ X. Hence, the sequences actually converge, and the common limit point is x̄ a.s. (iii) We next show convergence rates in terms of the objective-function gap. Iterating eq. (41) yields a telescoped bound, which can be rearranged into the displayed estimate. Furthermore, Ψ̄_k ≥ Ψ(w_k), so that via eq. (30) we obtain the claimed rate.

Application to power systems
In this section we apply Algorithm 1 to the distributed coordination of an energy grid in an AC optimal power flow (AC-OPF) formulation. Specifically, we illustrate how our block-coordinate descent method provides an efficient way to replicate the transmission-level locational marginal prices at the distribution level.

Power flow model
Consider a radial network with N buses N = {1, …, N}, excluding the slack node 0. The distribution network thus assumes the structure of a tree graph, and we identify each transmission line with the label of the node it feeds. The network is optimised over a time window T = {1, …, T}. We use p_n = (p_{n,1}, …, p_{n,T}) and q_n = (q_{n,1}, …, q_{n,T}) to denote the active and reactive net power consumption at bus n at each time point t ∈ T. We let i = √−1 and write a complex number z ∈ C as Re(z) + i Im(z). Thus, p_{n,t} < 0 means that energy is produced at bus n at time t. At the slack node n = 0, we assume that power is only generated and there is no consumption, i.e. p_{0,t} ≤ 0. At node n, we denote by f_n = (f_{n,t})_{t∈T} and g_n = (g_{n,t})_{t∈T} the real and reactive power flows, so that f_n + i g_n is the complex power injected into node n; that is, f_n = Re(V_n Ī_n) and g_n = Im(V_n Ī_n). We let ℓ_n = (ℓ_{n,t})_{t∈T} denote the squared current magnitude on the line feeding node n, while R_n and X_n denote the series resistance and reactance. Hence, z_n = R_n + i X_n is the series impedance and y_n = z_n^{-1} the corresponding admittance. The load-flow equations in the branch flow model (BFM) [44,45], without explicit consideration of transformer tap ratios, are as follows: where
• v_n = (v_{n,1}, …, v_{n,T}) and v_{n−} are the squared voltage magnitudes at buses n and n−,
• ℓ_n is the squared current magnitude on branch (n, n−),
• f_n and g_n are the active and reactive parts of the power flow over line (n, n−),
• R_n and X_n are the resistance and reactance of branch (n, n−),
• G_n and B_n are the line conductance and susceptance at n.
Equations (42a) and (42b) are the active and reactive flow conservation equations, (42c) is an expression of Ohm's law for the branch (n, n−), and (42d) is an SOCP relaxation of the power-flow definition, which in its exact form reads f²_{n,t} + g²_{n,t} = v_{n,t} ℓ_{n,t}. There exist sufficient conditions under which the optimisation problem subject to (42) remains exact; see [4,30,46]. Equations (42e) and (42f) are limits on the squared power-flow magnitude on (n, n−), and (42g) gives lower and upper bounds on the voltage at n. Dual variables are attached to the coupling flow conservation laws; these are the DLMPs corresponding to active and reactive power.
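The cone constraint (42d) can be checked numerically: the relaxation residual v·ℓ − (f² + g²) is non-negative on the feasible set and vanishes exactly when the relaxation is tight. The per-unit values below are made up for illustration, not taken from the paper's test case.

```python
import numpy as np

# Minimal sketch: evaluate the second-order-cone relaxation (42d),
# f^2 + g^2 <= v * l, on a single branch. Zero gap means the relaxation
# is tight, i.e. the exact power-flow identity f^2 + g^2 = v * l holds.

def socp_gap(f, g, v, l):
    """Relaxation residual v*l - (f^2 + g^2); >= 0 means (42d) is satisfied."""
    return v * l - (f**2 + g**2)

f, g = 0.3, 0.1          # active / reactive flow on the branch (illustrative, p.u.)
v = 1.02                 # squared voltage magnitude at the parent bus
l = (f**2 + g**2) / v    # current chosen consistent with the exact identity
assert socp_gap(f, g, v, l) >= -1e-12   # feasible, and in fact tight
```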
For later reference, we point out that the network flow constraints (42a)-(42b) can be compactly summarised as A_0 x_0 + Σ_{a∈A} A_a x_a = b for suitably defined matrices A_0, A_a and a right-hand-side vector b. However, for our computational scheme to work, we do not need to assume that this linear constraint holds exactly. Instead, we assume that the linear relation A_0 x_0 + Σ_{a∈A} A_a x_a = b holds only inexactly. Specifically, motivated by data-driven approaches to power-flow models [47], we assume that there exists a random variable ξ with bounded support perturbing this relation. In such inexact regimes, conventional deterministic OPF solvers are not applicable, whereas Algorithm 3 is designed to handle exactly such scenarios.
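The inconsistent regime can be illustrated in a few lines of NumPy: when b does not lie in the range of A, the system Ax = b has no solution, yet the least-squares solutions still satisfy the normal equations AᵀAx = Aᵀb that define the set X. The data below are random and purely illustrative.

```python
import numpy as np

# Sketch of the inconsistent case: b need not lie in range(A), so the
# feasible set X of problem (P) is the set of least-squares solutions,
# characterised by the normal equations A^T A x = A^T b.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # tall matrix: range(A) is a strict subspace of R^6
b = rng.standard_normal(6)        # generic b, almost surely not in range(A)

x_ls, _, _, _ = np.linalg.lstsq(A, b, rcond=None)

# x_ls satisfies the normal equations even though Ax = b has no solution
assert np.allclose(A.T @ A @ x_ls, A.T @ b)
assert np.linalg.norm(A @ x_ls - b) > 1e-6   # inconsistency: nonzero residual
```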

Load aggregators
The set of buses N is partitioned into a collection (N_a)_{a∈A} of subsets, such that each node subset N_a is managed by a load aggregator (LA) a ∈ A. Each LA controls the flexible net power consumption and generation at each node n ∈ N_a, given at time t by p_{n,t} = p^c_{n,t} − p^p_{n,t} and q_{n,t} = q^c_{n,t} − q^p_{n,t} for all n ∈ N and t ∈ T, where p^c_{n,t} ≥ 0 is the consumption part and p^p_{n,t} ≥ 0 the production part of the power profile. Power consumption and production at the nodes are made flexible by the presence of deferrable loads (electric vehicles, water heaters) and distributed energy resources (DERs). The consumption at each node n ∈ N must satisfy a global energy demand E_n over the full time window, Σ_{t∈T} p^c_{n,t} ≥ E_n, ∀n ∈ N.
Consumption and production are also constrained by power bounds and by the ratio of active to reactive power:
0 ≤ p^p_{n,t} ≤ P̄, ∀n ∈ N, ∀t ∈ T, (43e)
ρ̲^p_{n,t} p^p_{n,t} ≤ q^p_{n,t} ≤ ρ̄^p_{n,t} p^p_{n,t}, ∀n ∈ N, ∀t ∈ T. (43f)
Constraints (43a)-(43f) define the feasible set X_a of LA decisions, containing the vectors x_a = (p_n, q_n)_{n∈N_a}. Both consumption and production must be scheduled by the LA, taking into account the current spot-market prices and other specific local factors characterising the private objectives of the LA. Formally, there is a convex cost function ϕ_a(x_a) which the LA would like to unilaterally minimise, subject to the private feasibility constraint x_a ∈ X_a.
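A minimal feasibility check for the LA constraint set might look as follows. The bounds E, P_max and the ratio limits rho_lo, rho_hi are invented illustrative values, and the remaining constraints (43a)-(43d) are omitted from the sketch.

```python
import numpy as np

# Illustrative membership test for a load aggregator's feasible set X_a,
# covering the energy-demand, production-bound (43e) and reactive-ratio
# (43f) constraints. All numerical values below are made up.

def la_feasible(pc, pp, qp, E, P_max, rho_lo, rho_hi, tol=1e-9):
    """Check energy demand, production bounds and active/reactive ratio."""
    return (pc.sum() >= E - tol                    # total energy demand over T
            and np.all((0 <= pp) & (pp <= P_max))  # production bounds (43e)
            and np.all(rho_lo * pp - tol <= qp)    # ratio lower bound (43f)
            and np.all(qp <= rho_hi * pp + tol))   # ratio upper bound (43f)

pc = np.array([0.4, 0.6])    # active consumption over a two-period window
pp = np.array([0.2, 0.1])    # active production
qp = np.array([0.05, 0.02])  # reactive production
assert la_feasible(pc, pp, qp, E=1.0, P_max=0.5, rho_lo=0.0, rho_hi=0.5)
```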

The distribution system operator
In order to guarantee stability of the distribution network, the distribution system operator (DSO) takes the individual aggregators' decisions into account and adjusts the power flows so that the flow conservation constraints (42a)-(42b), together with the SOCP constraints (42c)-(42g), are satisfied. Let x_0 = (p_0, q_0, f, g, v, ℓ) denote the vector of variables controlled by the DSO, and define the DSO's feasible set X_0 = {x_0 | (42c)-(42g) hold for all n ∈ N}. The set of DSO decision variables inducing a physically meaningful network flow, for a given tuple of LA decisions, is then determined by the flow conservation constraints. Denoting the DSO cost function by ϕ_0(x_0), we arrive at the DSO's decision problem of minimising ϕ_0 over this set. Its value represents the smallest cost to the DSO, given the profile of flexible net consumption and generation at each affiliated node n ∈ N_a.
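For intuition, the DSO's adjustment can be sketched on a toy instance: given the aggregators' plans, the DSO picks feasible variables of minimal cost. The sketch below assumes the illustrative cost ϕ_0(x_0) = ‖x_0‖²/2 and keeps only the linear coupling constraint (the SOCP constraints defining X_0 are omitted), in which case the solution is the least-norm solution, computable via the pseudoinverse.

```python
import numpy as np

# Toy sketch of the DSO's decision problem: given an aggregator's plan x_a,
# choose DSO variables x_0 consistent with flow conservation
# A_0 x_0 = b - A_a x_a, at minimal (illustrative) cost ||x_0||^2 / 2.
rng = np.random.default_rng(1)
A0 = rng.standard_normal((3, 5))   # DSO coupling matrix (made up)
Aa = rng.standard_normal((3, 2))   # aggregator coupling matrix (made up)
b = rng.standard_normal(3)
xa = rng.standard_normal(2)        # the aggregator's reported plan

rhs = b - Aa @ xa
x0 = np.linalg.pinv(A0) @ rhs      # argmin ||x_0||^2/2 s.t. A_0 x_0 = rhs

assert np.allclose(A0 @ x0, rhs)   # flow conservation holds for the adjusted x_0
```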

A privacy-preserving DLMP solver
The privacy-preserving DLMP solver (PPDLMP), described in Algorithm 3, asks the DSO to adjust the DLMPs based on the prevailing plans reported by the LAs. Once the price update is completed, a single LA is selected at random to adapt the power profile within the subnetwork it manages. The local update of the LA results in a bid vector w_k, which is fed into the DSO's final computational step to perform the dispatch. From a practical point of view, it is important to point out that, while executing PPDLMP, the bus-specific data (such as cost functions and power profiles) remain private information; this applies equally to the DSO and to the LAs. Coordination of the system-wide behaviour is achieved by exchanging information only about the dual variables, which describe the DLMPs, and the expressed bids of the LAs.

Convergence of PPDLMP
We deduce the convergence of PPDLMP from the analysis of Algorithm 1. In order to recover the OPF problem, we identify each function ϕ_i with a cost function of the DSO or of an LA, and each r_i with the indicator function of the feasible set X_0 or X_a, respectively. The sampling technology is an example of a uniform sampling of pairs involving the DSO and one aggregator. Specifically, the set of realisations of the sampling is Σ = {{0, a} : a ∈ {1, …, |A|} ≡ {1, …, p}}. Each pair is realised with the uniform probability (Π)_{0,a} = 1/p. The marginal distribution of the sampling technology is thus given by π_0 = 1 and π_a = 1/p for a ∈ {1, …, p}. The optimisation problem is characterised by d = p + 1 blocks and the weighting matrix P = blkdiag[I_{m_0}; p I_{m_1}; …; p I_{m_p}], where m_0 is the dimension of the DSO's decision variable and m_a is the dimension of the decision variable of aggregator a ∈ A. Now define R(x) = r_0(x_0) + Σ_{a∈A} r_a(x_a), where r_0(x_0) = δ_{X_0}(x_0) and r_a(x_a) = δ_{X_a}(x_a), with δ_C the indicator function of a set C.
[Algorithm 3: privacy-preserving DLMP solver (PPDLMP). Parameters: p = |A|, σ > 0, B_0, B_a. Each aggregator a ∈ A initialises its plan and sends the bid u_a = A_a x_a^0 − b_a to the DSO; at each iteration, the DSO updates the prices, and an LA a drawn uniformly at random receives the DLMP y^k from the DSO and updates its profile.]
We now show that Algorithm 3 is a special instantiation of Algorithm 2, which in turn we know to be equivalent to Algorithm 1. If load aggregator a is chosen at step k, Line 5 in Algorithm 2 becomes a local update, where we have used the identity u^{k+1} = Ax^k − b. Exploiting Line 4 in Algorithm 2, we find that v^k can be computed locally and inductively by choosing the initial condition v^0 = σ Σ_{a∈A}(A_a x_a^0 − b_a) and performing sequential updates. This agrees with Lines 4, 7 and 8 of Algorithm 3.
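The interaction pattern just described, namely a price (dual) update at the coordinator followed by a primal update of one randomly drawn block, can be mimicked on a toy equality-constrained quadratic program. This is only a structural sketch: the objective, dimensions, step sizes and seed are invented, and the projections onto X_0 and X_a present in the actual algorithm are omitted.

```python
import numpy as np

# Toy sketch of the PPDLMP interaction pattern on
#   min ||x - c||^2 / 2  s.t.  A x = b,
# with two primal blocks ("aggregators") and a dual price vector y ("DSO").
rng = np.random.default_rng(4)
A = rng.standard_normal((2, 4))
b = A @ rng.standard_normal(4)      # consistent coupling constraint
c = rng.standard_normal(4)          # block targets (illustrative data)

x, y = np.zeros(4), np.zeros(2)
tau, sigma = 0.2, 0.2               # constant step sizes (illustrative)
init_res = np.linalg.norm(A @ x - b)
for _ in range(4000):
    a = rng.integers(2)             # draw one block uniformly at random
    lo, hi = 2 * a, 2 * a + 2
    # primal step of the drawn block: gradient of the Lagrangian in x_a
    x[lo:hi] -= tau * ((x[lo:hi] - c[lo:hi]) + A[:, lo:hi].T @ y)
    # dual (price) step at the coordinator, using the updated bids
    y += sigma * (A @ x - b)

final_res = np.linalg.norm(A @ x - b)   # primal residual shrinks over time
```

With these mild step sizes and a strongly convex objective, the coupling residual ‖Ax − b‖ contracts towards zero, mirroring the behaviour established for Algorithm 1.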

Numerical Results
We apply Algorithm 3 to a realistic 15-bus network example based on the instance proposed in [5], over a time horizon T = {0, 1}. The underlying parameter values are specified in Table 1: the line parameters (R_n, X_n, S_n, B_n, V_n), n ∈ N, are taken from [5], while the parameters (P̲_n, P̄_n, E_n, τ^c_n)_n for the flexible loads are generated based on the fixed-load values of [5]; in contrast to [5], we consider variable, flexible active and reactive loads rather than fixed ones. As in [5], bus 11 is the only bus offering renewable production. We consider a zero cost function for each LA (ϕ_a = 0 for each a ∈ A), meaning that the LAs are indifferent among consumption profiles as long as their feasibility constraints are satisfied; this is a reasonable assumption in practice for certain types of consumption flexibility (electric vehicles, batteries). We consider the DSO objective ϕ(x) = ϕ_0(x_0) = Σ_{t∈T} c_t(p^p_{0,t}) + k_loss Σ_{n,t} R_n ℓ_{n,t}, with loss-penalisation factor k_loss = 0.001 and with c_0 : p ↦ 2p + p², c_1 : p ↦ p, giving an expensive time period and a cheap one, which can be interpreted as peak and off-peak periods.
[Table 1: Parameters for the 15-bus network, based on [5].]
The solution obtained by Algorithm 3 after 2000 iterations is illustrated in Figure 1, which displays the active flow directions as well as the distribution locational marginal price (DLMP) values. The solutions show that the active (and reactive) DLMPs obtained for each time period are close to the DLMPs at the root node (y^p_0, y^q_0), with the following exceptions:
• For the branch composed of nodes 8, 7, 9, 10, 11, the active DLMPs are close to 0.0 due to the presence of renewable production (at zero cost) at node 11 and of negative load at node 7, which together fully compensate for the demand on this branch.
Since line (3, 8) is saturated, no energy can be exported further.
• The active DLMPs on the branch composed of nodes (12, 13, 14) at t = 1 are much larger than on the other nodes: this is explained by the congestion of line (0, 12).
• The DLMP for node 7 at t = 0 is strictly negative: the (negative) consumption at this node is at its upper bound p_{7,0} = P̄_{7,0} = −0.173. The negative DLMP suggests that the system would be better off if less power were injected by node 7.
The convergence behaviour is reported in Figures 2 and 3. Figure 2 displays the convergence of the last iterate with respect to various criteria: (i) convergence of ϕ(x_k) to the optimal cost ϕ(x*), where x* is computed with the CVXOPT Python library; (ii) convergence to zero of the primal residuals h(x_k) and ∥Ax_k − b∥; (iii) convergence to zero of the stopping criterion formed from the KKT residual R_KKT(x_k, y_k) = dist_∞(0, (∂_x, −∂_y)L(x_k, y_k)), where L(x, y) = Φ(x) + ⟨y, Ax − b⟩ denotes the Lagrangian of (P) in the feasible case; and (iv) convergence of the dual iterates. Figure 3 shows the convergence to zero of the primal infeasibility along the averaged trajectory w_k, as predicted by the theory.
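The KKT-residual stopping criterion R_KKT(x_k, y_k) monitored in Figure 2 can be reproduced on a toy problem. The sketch below assumes a smooth quadratic Φ, for which the residual reduces to the maximum of the stationarity and primal-feasibility residuals; all numerical data are made up.

```python
import numpy as np

# Sketch of the KKT-residual stopping criterion for the Lagrangian
# L(x, y) = Phi(x) + <y, Ax - b>, with the illustrative choice
# Phi(x) = ||x - c||^2 / 2 on an invented consistent instance.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
b = A @ rng.standard_normal(4)       # consistent right-hand side
c = rng.standard_normal(4)

def kkt_residual(x, y):
    grad_x = (x - c) + A.T @ y       # stationarity residual, nabla_x L
    grad_y = A @ x - b               # primal-feasibility residual
    return max(np.abs(grad_x).max(), np.abs(grad_y).max())

# At the exact primal-dual solution the residual vanishes:
# x* = c - A^T y* combined with A x* = b gives (A A^T) y* = A c - b.
y_star = np.linalg.solve(A @ A.T, A @ c - b)
x_star = c - A.T @ y_star
assert kkt_residual(x_star, y_star) < 1e-8
```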

Conclusion
In this paper we developed a novel random block-coordinate descent method for ill-posed convex non-smooth optimisation problems. Our scheme gives optimal iteration-complexity guarantees in terms of the last iterate of the sequence generated by the algorithm. As such, we directly generalise the results of [26] to a fully distributed optimisation setting. Motivated by the need to achieve distributed optimisation of the power system, we follow a data-driven approach leading to a potentially inconsistent AC-OPF formulation. We show that our algorithmic scheme is immediately applicable to this challenging and important setting, and that it achieves distributed control of the power grid via distributed locational marginal prices acting as price signals to independent agents (i.e. aggregators). Future work will consist in extending the method to non-convex optimisation problems, so that exact formulations of the power-flow constraints can be handled. We also plan to pursue extensions of this work in which the electric network is exposed to stochastic uncertainty.

Declarations
• Availability of data and materials: Not applicable.
• Authors' contributions: Paulin Jacquot contributed to the development of the algorithm and implemented the numerical example.

A.2 Connections among the iterates produced by Algorithms 1 and 2
To obtain a compact representation of the primal updates, let us introduce the iid Bernoulli process ϵ^k_i : Ω → {0, 1}, defined by ϵ^k_i = 1 if i ∈ ι^k and ϵ^k_i = 0 otherwise, together with the block-diagonal matrix E_k := blkdiag[ϵ^k_1 I_{m_1}; …; ϵ^k_d I_{m_d}]. This matrix corresponds to the identity for those blocks that are activated in round k, and to zero for those that are not; we therefore call it the activation matrix of the algorithm. By definition of the sampling process, we have E(ϵ^k_i) = P(i ∈ ι^k) = π_i for all i ∈ [d] and k ≥ 0, so that E[E_k] = P^{-1}.
Next, define Ξ := E[E_k P A^⊤ A P E_k], which is a symmetric, time-independent matrix thanks to the iid sampling. Proof. By definition of the sampling procedure, we have E_k[P E_k] = I_m. Some simple algebra then shows the claim. Remark 4. The matrix Ξ can be given a simple expression in terms of the probability matrix Π. A direct computation shows that Ξ is a block-symmetric matrix with d² blocks Ξ[i, j], 1 ≤ i, j ≤ d, each block having the form Ξ[i, j] = ((Π)_{ij}/(π_i π_j)) A_i^⊤ A_j. In special instances this matrix can be simplified even further: 1. If P(|I| = 1) = 1, the random sampling is a single-coordinate activation mechanism. In this case the associated matrix Π is diagonal with entries π_i. It follows that Ξ[i, i] = (1/π_i) A_i^⊤ A_i and Ξ[i, j] = 0 for i ≠ j. 2. We say that the matrix A has orthogonal design if A_i^⊤ A_j is the zero matrix for i ≠ j. In this case too, the matrix Ξ is block-diagonal, with the same coordinate expression as above.
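The single-coordinate activation case of Remark 4 is easy to verify numerically. The sketch below forms the activation matrices for an invented sampling with probabilities π, and checks E[E_k] = P^{-1}, the symmetry of Ξ and its block-diagonal structure; the expectations are computed exactly over the d possible activation patterns rather than by simulation.

```python
import numpy as np

# Verify the structure of the activation matrix E_k and of
# Xi = E[E_k P A^T A P E_k] under single-coordinate activation (|I| = 1).
rng = np.random.default_rng(3)
d, sizes = 3, [1, 2, 1]                  # three blocks, m = 4 (illustrative)
m = sum(sizes)
pi = np.array([0.5, 0.3, 0.2])           # activation probabilities, sum to 1
P = np.diag(np.repeat(1.0 / pi, sizes))  # P = blkdiag[pi_i^{-1} I_{m_i}]
A = rng.standard_normal((2, m))

def activation_matrix(i):
    """E_k realisation when only block i is activated."""
    e = np.zeros(m)
    lo = sum(sizes[:i])
    e[lo:lo + sizes[i]] = 1.0
    return np.diag(e)

# Exact expectations over the d single-coordinate activation patterns:
E_Ek = sum(pi[i] * activation_matrix(i) for i in range(d))
Xi = sum(pi[i] * activation_matrix(i) @ P @ A.T @ A @ P @ activation_matrix(i)
         for i in range(d))

assert np.allclose(E_Ek, np.linalg.inv(P))   # E[E_k] = P^{-1}
assert np.allclose(Xi, Xi.T)                 # Xi is symmetric, hence in S^m_+
```

In this case the off-diagonal blocks of Ξ vanish and Ξ[i, i] = (1/π_i) A_i^⊤ A_i, matching case 1 of Remark 4.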

A.3 Properties of the penalty function h
We collect some important identities involving the penalty function h in this subsection. A first identity follows directly from the definition of the iterate z^k; furthermore, the definition of the iterate w^{k+1} implies (A9). Lemma 13. We have the following identity. Proof. The Pythagorean identity gives an exact decomposition of the squared distances involved. Taking conditional expectations on both sides and using eq. (A3) establishes the claimed identity.