Stochastic approximation method using diagonal positive-definite matrices for convex optimization with fixed point constraints
Fixed Point Theory and Algorithms for Sciences and Engineering volume 2021, Article number: 10 (2021)
Abstract
This paper proposes a stochastic approximation method for solving a convex stochastic optimization problem over the fixed point set of a quasinonexpansive mapping. The proposed method is based on the existing adaptive learning rate optimization algorithms that use certain diagonal positive-definite matrices for training deep neural networks. This paper includes convergence analyses and convergence rate analyses for the proposed method under specific assumptions. Results show that any accumulation point of the sequence generated by the method with diminishing step-sizes almost surely belongs to the solution set of a stochastic optimization problem in deep learning. Additionally, we apply the learning methods based on the existing and proposed methods to classifier ensemble problems and conduct a numerical performance comparison showing that the proposed learning methods achieve high accuracies faster than the existing learning method.
1 Introduction
Convex stochastic optimization problems in which the objective function is the expectation of convex functions are considered important due to their occurrence in practical applications, such as machine learning and deep learning.
The classical method for solving these problems is the stochastic approximation (SA) method [1, (5.4.1)], [2, Algorithm 8.1], [3], which is applicable when unbiased estimates of (sub)gradients of an objective function are available. Modified versions of the SA method, such as the mirror descent SA method [4, Sects. 3 and 4], [5, Sect. 2.3] and the accelerated SA method [6, Sect. 3.1], have been reported as useful methods for solving these problems. Meanwhile, the rapid development of deep learning has led to the proposal of several stochastic optimization algorithms. For example, AdaGrad [7, Figs. 1 and 2] is an algorithm based on the mirror descent SA method, and Adam [8, Algorithm 1], [2, Algorithm 8.7] and AMSGrad [9, Algorithm 2] are well known as powerful tools for solving convex stochastic optimization problems in deep neural networks. These algorithms use the inverses of diagonal positive-definite matrices at each iteration to adapt the learning rates of all model parameters. Hence, these algorithms are called adaptive learning rate optimization algorithms.
The above-mentioned methods commonly assume that metric projection onto a given constraint set is computationally possible. However, although the metric projection onto a simple convex set, such as an affine subspace, half-space, or hyperslab, can be easily computed, the projection onto a complicated set, such as the intersection of simple convex sets, the set of minimizers of a convex function, or the solution set of a monotone variational inequality, cannot be easily computed. Accordingly, it is difficult to apply the above-mentioned methods to stochastic optimization problems with complicated constraints.
In order to solve a stochastic optimization problem over a complicated constraint set, we define a computable quasinonexpansive mapping whose fixed point set coincides with the constraint set, which is possible for the above-mentioned complicated convex sets (see Sect. 3.1 and Example 4.1 for examples of computable quasinonexpansive mappings). Accordingly, the present paper deals with a convex stochastic optimization problem over the fixed point set of a computable quasinonexpansive mapping.
Since useful fixed point algorithms have already been reported [10, Chap. 5], [11, Chaps. 2–9], [12–16], we can find fixed points of quasinonexpansive mappings, which are feasible points of the convex stochastic optimization problem. By combining the SA method with an existing fixed point algorithm, we could obtain algorithms [17, Algorithms 1 and 2] for solving convex stochastic optimization problems that can be applied to classifier ensemble problems [18, 19] (Example 4.1(ii)), which arise in the field of machine learning. However, the existing algorithms converge slowly [17] due to being stochastic first-order methods. In this paper, we propose an algorithm (Algorithm 1) for solving a convex stochastic optimization problem (Problem 3.1) that performs better than the algorithms in [17, Algorithms 1 and 2]. The algorithm proposed herein is based on useful adaptive learning rate optimization algorithms, such as Adam and AMSGrad, that use certain diagonal positive-definite matrices. The first contribution of the present study is an analysis of the convergence of the proposed algorithm (Theorem 5.1). This analysis finds that, if sufficiently small constant step-sizes are used, then the proposed algorithm approximates a solution to the problem (Theorem 5.2). Moreover, for sequences of diminishing step-sizes, the convergence rates of the proposed algorithm can be specified (Theorem 5.3 and Corollary 5.1).
We compare the proposed algorithm with the existing adaptive learning rate optimization algorithms for a constrained convex stochastic optimization problem in deep learning (Example 4.1(i)). Although the existing adaptive learning rate optimization algorithms achieve low regret, they cannot solve the problem. The second contribution of the present study is to show that, unlike the existing adaptive learning rate optimization algorithms, the proposed algorithm can solve the problem (Corollaries 5.2 and 5.3) (see Sect. 5.2 for details). The third contribution is to show that the proposed algorithm can solve classifier ensemble problems and that the learning methods based on it perform better numerically than the existing learning method based on the algorithms in [17]. In particular, the numerical results indicate that the learning methods based on the proposed algorithm with constant step-sizes or step-sizes computed by the Armijo line search algorithm can solve classifier ensemble problems faster than the existing learning method based on the algorithms in [17]. As a result, the proposed learning methods achieve high accuracies faster than the existing learning method.
2 Mathematical preliminaries
2.1 Definitions and propositions
Let \(\mathbb{N}\) be the set of all positive integers. Let \(\mathbb{R}^{N}\) be an N-dimensional Euclidean space with the inner product \(\langle \cdot , \cdot \rangle \) and the associated norm \(\|\cdot \|\), and let \(\mathbb{R}_{+}^{N} := \{(x_{i})_{i=1}^{N} \in \mathbb{R}^{N} \colon x_{i} \geq 0 \ (i=1,2,\ldots ,N) \}\). Let \(X^{\top }\) denote the transpose of matrix X, let I denote the identity matrix, and let Id denote the identity mapping on \(\mathbb{R}^{N}\). Let \(\mathbb{S}^{N}\) be the set of \(N \times N\) symmetric matrices, i.e., \(\mathbb{S}^{N} = \{ X \in \mathbb{R}^{N \times N} \colon X=X^{\top }\}\). Let \(\mathbb{S}^{N}_{++}\) denote the set of symmetric positive-definite matrices, i.e., \(\mathbb{S}^{N}_{++} = \{ X \in \mathbb{S}^{N} \colon X \succ O \}\). Given \(H \in \mathbb{S}_{++}^{N}\), the H-inner product of \(\mathbb{R}^{N}\) and the H-norm are defined for all \(x,y\in \mathbb{R}^{N}\) by \(\langle x,y \rangle _{H} := \langle x, H y \rangle \) and \(\|x \|_{H}^{2} := \langle x, Hx \rangle \). Let \(\mathsf{diag}(x_{i})\) be an \(N \times N\) diagonal matrix with diagonal components \(x_{i} \in \mathbb{R}\) (\(i=1,2,\ldots ,N\)), and let \(\mathbb{D}^{N}\) be the set of \(N \times N\) diagonal matrices, i.e., \(\mathbb{D}^{N} = \{ X \in \mathbb{R}^{N \times N} \colon X = \mathsf{diag}(x_{i}), x_{i} \in \mathbb{R}\ (i=1,2, \ldots ,N) \}\).
Let \(\mathbb{E}[X]\) denote the expectation of random variable X. The history of the process \(\xi _{0},\xi _{1},\ldots \) up to time n is denoted by \(\xi _{[n]} = (\xi _{0},\xi _{1},\ldots ,\xi _{n})\). Let \(\mathbb{E}[X|\xi _{[n]}]\) denote the conditional expectation of X given \(\xi _{[n]} = (\xi _{0},\xi _{1},\ldots ,\xi _{n})\). Unless stated otherwise, all relations between random variables are supposed to hold almost surely.
The subdifferential [10, Definition 16.1], [20, Sect. 23] of a convex function \(f \colon \mathbb{R}^{N} \to \mathbb{R}\) is defined for all \(x\in \mathbb{R}^{N}\) by
\(\partial f(x) := \{ u \in \mathbb{R}^{N} \colon f(y) \geq f(x) + \langle y - x, u \rangle \ (y\in \mathbb{R}^{N}) \}.\)
A point \(u \in \partial f(x)\) is called a subgradient of f at \(x\in \mathbb{R}^{N}\).
Proposition 2.1
([21, Theorem 4.1.3], [10, Propositions 16.14(ii), (iii)])
Let \(f \colon \mathbb{R}^{N} \to \mathbb{R}\) be convex. Then f is continuous and \(\partial f(x) \neq \emptyset \) for every \(x\in \mathbb{R}^{N}\). Moreover, for every \(x\in \mathbb{R}^{N}\), there exists \(\delta > 0\) such that \(\partial f(B(x;\delta ))\) is bounded, where \(B(x;\delta )\) is the closed ball with center x and radius δ.
When a mapping \(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is considered under the H-norm \(\|\cdot \|_{H}\), we denote it as \(Q_{H} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\). We define \(Q := Q_{I}\). A mapping \(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is said to be quasinonexpansive [10, Definition 4.1(iii)] if
\(\| Q(x) - y \| \leq \| x-y \|\)
for all \(x\in \mathbb{R}^{N}\) and all \(y\in \mathrm{Fix}(Q)\), where \(\mathrm{Fix}(Q)\) is the fixed point set of Q defined by \(\mathrm{Fix}(Q) := \{ x \in \mathbb{R}^{N} \colon x = Q(x) \}\). When a quasinonexpansive mapping has at least one fixed point, its fixed point set is closed and convex [22, Proposition 2.6]. Q is called a firmly quasinonexpansive mapping [23, Sect. 3] if \(\| Q(x) - y \|^{2} + \| (\mathrm{Id} - Q)(x) \|^{2} \leq \| x-y \|^{2}\) for all \(x\in \mathbb{R}^{N}\) and all \(y\in \mathrm{Fix}(Q)\). Q is firmly quasinonexpansive if and only if \(R:= 2Q - \mathrm{Id}\) is quasinonexpansive [10, Proposition 4.2]. This means that \((1/2)(\mathrm{Id} + R)\) is firmly quasinonexpansive when R is quasinonexpansive. Given \(H \in \mathbb{S}_{++}^{N}\), we define the subgradient projection relative to a convex function \(f \colon \mathbb{R}^{N} \to \mathbb{R}\) by
\(Q_{f,H}(x) := x - \frac{f(x)}{\| H^{-1} G(x) \|_{H}^{2}} H^{-1} G(x)\) \((x \notin \mathrm{lev}_{\leq 0} f)\), \(\quad Q_{f,H}(x) := x\) \((x \in \mathrm{lev}_{\leq 0} f)\), (1)
where \(G(x)\) is any point in \(\partial f(x)\) (\(x\in \mathbb{R}^{N}\)) and \(\mathrm{lev}_{\leq 0} f := \{ x\in \mathbb{R}^{N} \colon f(x) \leq 0 \} \neq \emptyset \). The following proposition holds.
Proposition 2.2
Let \(H \in \mathbb{S}_{++}^{N}\) and let \(f\colon \mathbb{R}^{N} \to \mathbb{R}\) be convex. Then \(Q_{f,H} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) defined by (1) satisfies the following:
-
(i)
\(Q_{f} := Q_{f,I}\) is firmly quasinonexpansive and \(\mathrm{Fix}(Q_{f}) = \mathrm{lev}_{\leq 0} f\);
-
(ii)
\(Q_{f,H}\) is firmly quasinonexpansive under the H-norm with \(\mathrm{Fix}(Q_{f,H}) = \mathrm{Fix}(Q_{f})\).
Proof
(i) This follows from Proposition 2.3 in [22].
(ii) We first show that \(\mathrm{lev}_{\leq 0} f = \mathrm{Fix}(Q_{f,H})\). From (1), we have that \(\mathrm{lev}_{\leq 0} f \subset \mathrm{Fix}(Q_{f,H})\). Let \(x \in \mathrm{Fix}(Q_{f,H})\) and assume that \(x \notin \mathrm{lev}_{\leq 0} f\). Then the definition of the H-inner product and \(G(x) \in \partial f(x)\) mean that, for all \(y\in \mathrm{lev}_{\leq 0} f\),
which implies that \(H^{-1} G(x) \neq 0\). From (1) and \(x \in \mathrm{Fix}(Q_{f,H})\), we also have that
which, together with \(f(x) > 0\), gives \(H^{-1} G(x)=0\), which is a contradiction. Hence, we have that \(\mathrm{lev}_{\leq 0} f \supset \mathrm{Fix}(Q_{f,H})\), i.e., \(\mathrm{lev}_{\leq 0} f = \mathrm{Fix}(Q_{f,H})\). Accordingly, (i) ensures that \(\mathrm{Fix}(Q_{f,H}) = \mathrm{lev}_{\leq 0} f = \mathrm{Fix}(Q_{f})\). For all \(x\in \mathbb{R}^{N} \backslash \mathrm{lev}_{\leq 0} f\) and all \(y\in \mathrm{lev}_{\leq 0} f\),
which, together with (2), implies that \(Q_{f,H}\) is firmly quasinonexpansive under the H-norm. □
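To make definition (1) concrete, the following is a minimal sketch of the subgradient projection for the Euclidean case \(H = I\); the affine test function, its subgradient oracle, and the test points are illustrative and are not taken from the paper.

```python
import numpy as np

def subgradient_projection(x, f, G):
    """Q_f(x): subgradient projection relative to f, Euclidean case H = I."""
    fx = f(x)
    if fx <= 0:            # x already lies in lev_{<=0} f, a fixed point of Q_f
        return x.copy()
    g = G(x)               # any subgradient of f at x
    return x - (fx / np.dot(g, g)) * g

# Example: f(x) = <a, x> - b, so lev_{<=0} f is a half-space and Q_f = P_C.
a, b = np.array([1.0, 1.0]), 1.0
f = lambda x: a @ x - b
G = lambda x: a            # gradient of the affine function f
x = subgradient_projection(np.array([1.0, 1.0]), f, G)
```

For this affine f, the subgradient projection coincides with the metric projection onto the half-space, so the output lands exactly on the boundary \(f = 0\).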
\(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is said to be Lipschitz continuous (L-Lipschitz continuous) if there exists \(L > 0\) such that \(\| Q(x) - Q(y) \| \leq L \| x-y \|\) for all \(x,y\in \mathbb{R}^{N}\). \(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is said to be nonexpansive [10, Definition 4.1(ii)] if Q is 1-Lipschitz continuous, i.e., \(\| Q(x) - Q(y) \| \leq \| x-y \|\) for all \(x,y\in \mathbb{R}^{N}\). Any nonexpansive mapping satisfies the quasinonexpansivity condition. The metric projection [10, Subchapter 4.2, Chap. 28] onto a nonempty, closed convex set C (\(\subset \mathbb{R}^{N}\)), denoted by \(P_{C}\), is defined for all \(x\in \mathbb{R}^{N}\) by \(P_{C}(x) \in C\) and \(\| x - P_{C}(x) \| = \mathrm{d}(x,C) := \inf_{y\in C} \| x-y \|\). \(P_{C}\) is firmly nonexpansive, i.e., \(\| P_{C}(x) - P_{C}(y) \|^{2} + \| (\mathrm{Id} - P_{C})(x) - ( \mathrm{Id} - P_{C})(y) \|^{2} \leq \| x-y \|^{2}\) for all \(x,y\in \mathbb{R}^{N}\), with \(\mathrm{Fix}(P_{C}) = C\) [10, Proposition 4.8, (4.8)]. The metric projection onto C under the H-norm is denoted by \(P_{C,H}\). When C is an affine subspace, half-space, or hyperslab, the projection onto C can be computed within a finite number of arithmetic operations [10, Chap. 28].
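As an illustration of the last point, the projection onto a hyperslab \(\{x \colon a \leq \langle c,x \rangle \leq b\}\) requires only a few arithmetic operations; the sketch below uses illustrative vectors and the Euclidean norm.

```python
import numpy as np

def project_hyperslab(x, c, a, b):
    """Metric projection onto {x : a <= <c, x> <= b} (c != 0, a <= b)."""
    t = c @ x
    if t < a:
        return x + ((a - t) / (c @ c)) * c   # move onto the lower boundary
    if t > b:
        return x + ((b - t) / (c @ c)) * c   # move onto the upper boundary
    return x.copy()                          # already in C, so P_C(x) = x

c = np.array([1.0, 0.0])
y = project_hyperslab(np.array([2.0, 3.0]), c, 0.0, 1.0)
```

Setting \(a = b\) recovers the projection onto a hyperplane, and omitting one of the two bounds recovers the projection onto a half-space.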
3 Convex stochastic optimization problem over fixed point set
This paper considers the following problem.
Problem 3.1
Assume that
-
(A0)
\((\mathsf{H}_{n})_{n\in \mathbb{N}}\) is a sequence in \(\mathbb{S}_{++}^{N} \cap \mathbb{D}^{N}\);
-
(A1)
\(Q_{\mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is quasinonexpansive under the \(\mathsf{H}_{n}\)-norm and \(X := \bigcap_{n\in \mathbb{N}} \mathrm{Fix}(Q_{\mathsf{H}_{n}})\) (⊂C) is nonempty, where \(C \subset \mathbb{R}^{N}\) is a nonempty, closed convex set onto which the projection can be easily computed;
-
(A2)
\(f \colon \mathbb{R}^{N} \to \mathbb{R}\) defined for all \(x \in \mathbb{R}^{N}\) by \(f(x) := \mathbb{E}[F({x},\xi )]\) is well defined and convex, where ξ is a random vector whose probability distribution P is supported on a set \(\Xi \subset \mathbb{R}^{M}\) and \(F \colon \mathbb{R}^{N} \times \Xi \to \mathbb{R}\).
Then
find \(x^{\star } \in X^{\star } := \{ x^{\star } \in X \colon f(x^{\star }) = \min_{x\in X} f(x) \},\)
where one assumes that \(X^{\star }\) is nonempty.
Examples of \(Q_{\mathsf{H}_{n}}\) satisfying (A0) and (A1) are described in Sect. 3.1 and Example 4.1.
The following conditions [5, (A1), (A2), (2.5)] are sufficient for solving Problem 3.1.
-
(C1)
There is an independent and identically distributed sample \(\xi _{0}, \xi _{1}, \ldots \) of realizations of the random vector ξ;
-
(C2)
There is an oracle which, for a given input point \((x, \xi ) \in \mathbb{R}^{N} \times \Xi \), returns a stochastic subgradient \(\mathsf{G}(x,\xi )\) such that \(\mathsf{g}(x) := \mathbb{E}[\mathsf{G}(x,\xi )]\) is well defined and is a subgradient of f at x, i.e., \(\mathsf{g}(x) \in \partial f (x)\);
-
(C3)
There exists a positive number M such that, for all \(x\in C\), \(\mathbb{E}[\|\mathsf{G}(x,\xi )\|^{2}] \leq M^{2}\).
Suppose that \(F(\cdot , \xi )\) (\(\xi \in \Xi \)) is convex and consider the oracle which returns a stochastic subgradient \(\mathsf{G}(x,\xi ) \in \partial _{x} F (x,\xi )\) for given \((x,\xi ) \in \mathbb{R}^{N} \times \Xi \). Then \(f (\cdot ) = \mathbb{E}[F(\cdot ,\xi )]\) is well defined and convex, and \(\partial f (x) = \mathbb{E}[\partial _{x} F(x,\xi )]\) [25, Theorem 7.51], [5, p.1575].
3.1 Related problems and their algorithms
Here, let us consider the following convex stochastic optimization problem [5, (1.1)]:
where \(C \subset \mathbb{R}^{N}\) is nonempty, bounded, closed, and convex. The classical method for problem (3) under (C1)–(C3) is the stochastic approximation (SA) method [1, (5.4.1)], [2, Algorithm 8.1], [3] defined as follows: given \(x_{0}\in \mathbb{R}^{N}\) and \((\lambda _{n})_{n\in \mathbb{N}} \subset (0,+\infty )\),
\(x_{n+1} := P_{C} ( x_{n} - \lambda _{n} \mathsf{G}(x_{n},\xi _{n}) ).\) (4)
The SA method requires the metric projection onto C, and hence can be applied only to cases where C is simple in the sense that \(P_{C}\) can be efficiently computed (e.g., C is a closed ball, half-space, or hyperslab [10, Chap. 28]). When C is not simple, the SA method requires solving the following subproblem at each iteration n:
The mirror descent SA method [4, Sects. 3 and 4], [5, Sect. 2.3] is useful for solving problem (3) and has been analyzed for the case of step-sizes that are constant or diminishing. For example, the mirror descent SA method [5, (2.32), (2.38), and (2.47)] with a constant step-size policy generates the following sequence \((\tilde{x}_{1}^{n})_{n\in \mathbb{N}}\): given \(x_{0}\in X^{o} := \{x\in \mathbb{R}^{N} \colon \partial \omega (x) \neq \emptyset \}\),
where \(\omega \colon C \to \mathbb{R}\) is differentiable and convex, \(V \colon X^{o} \times C \to \mathbb{R}_{+}\) is defined for all \((x,z) \in X^{o} \times C\) by \(V(x,z) := \omega (z) - [\omega (x) + \langle \nabla \omega (x),z-x \rangle ]\), and \(\gamma _{t}\) (\(t\in \mathbb{N}\)) is a constant step-size. When \(\omega (\cdot ) = (1/2)\|\cdot \|^{2}\), \(x_{n+1}\) in (5) coincides with \(x_{n+1}\) in (4). Under certain assumptions, method (5) satisfies \(\mathbb{E} [ f(\tilde{x}_{1}^{n}) - f^{\star }] = \mathcal{O}(1/\sqrt{n})\) [5, (2.48)] (see [5, (2.57)] for the rate of convergence of the mirror descent SA method with a diminishing step-size policy).
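A minimal sketch of the projected SA update (4), with a hypothetical quadratic loss \(F(x,\xi ) = \frac{1}{2}\|x-\xi \|^{2}\) and a box constraint; none of these concrete choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def projected_sa(x0, project, G, lam, steps):
    """x_{n+1} := P_C(x_n - lambda_n * G(x_n, xi_n)) with i.i.d. samples xi_n (C1)."""
    x = x0
    for n in range(steps):
        xi = 1.0 + rng.standard_normal(x.shape)   # xi ~ N(1, I)
        x = project(x - lam(n) * G(x, xi))
    return x

# F(x, xi) = 0.5 ||x - xi||^2 gives G(x, xi) = x - xi, and f(x) = E[F(x, xi)]
# is minimized over C = [0, 0.5]^2 at (0.5, 0.5) since E[xi] = (1, 1) lies outside C.
G = lambda x, xi: x - xi
project = lambda x: np.clip(x, 0.0, 0.5)
x = projected_sa(np.zeros(2), project, G, lam=lambda n: 0.5 / (n + 1), steps=200)
```

With the diminishing step-sizes \(\lambda _{n} = 0.5/(n+1)\), the iterates settle near the boundary point \((0.5, 0.5)\) of the box.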
As the field of deep learning has developed, it has produced some useful stochastic optimization algorithms, such as AdaGrad [7, Figs. 1 and 2], [2, Algorithm 8.4], RMSProp [2, Algorithm 8.5], and Adam [8, Algorithm 1], [2, Algorithm 8.7], for solving problem (3). The AdaGrad algorithm is based on the mirror descent SA method (5) (see also [7, (4)]), and the RMSProp algorithm is a variant of AdaGrad. The Adam algorithm is based on a combination of RMSProp and the momentum method [26, (9)], as follows: given \(x_{t}, m_{t-1}, v_{t-1} \in \mathbb{R}^{N}\),
where \(\beta _{i} > 0\) (\(i=1,2\)), \(\epsilon > 0\), \((\lambda _{n})_{n\in \mathbb{N}} \subset (0,1)\) is a diminishing step-size sequence, and \(A \odot B\) denotes the Hadamard product of A and B.
then the Adam algorithm (6) can be expressed as
Unfortunately, there exists an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution [9, Theorem 2]. To guarantee convergence and preserve the practical benefits of Adam, AMSGrad [9, Algorithm 2] was proposed as follows: for \((\beta _{1,t})_{t\in \mathbb{N}} \subset (0,+\infty )\),
The existing SA methods (4), (5), (6), and (9) (see also [6, 27], [2, Sect. 8.5], and [5, Sect. 2.3]) require minimizing a certain convex function over C at each iteration. Therefore, when C has a complicated form (e.g., C is expressed as the set of all minimizers of a convex function over a closed convex set, the solution set of a monotone variational inequality, or the intersection of closed convex sets), it is difficult to compute the point \(x_{n+1}\) generated by any of (4), (5), (6), and (9) at each iteration.
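The diagonal-matrix mechanism behind Adam and AMSGrad can be sketched in a few lines. This is an illustrative sketch, not the paper's Algorithm 1: bias correction is omitted, the ε safeguard is folded into the diagonal, and all constants and the toy objective are assumptions.

```python
import numpy as np

def amsgrad_step(x, g, state, beta1=0.9, beta2=0.999, lam=0.01, eps=1e-8):
    """One update of the form x_{t+1} := x_t - lam * H_t^{-1} m_t, H_t diagonal."""
    m, v, vhat = state
    m = beta1 * m + (1 - beta1) * g           # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * g * g       # second moment, Hadamard product g . g
    vhat = np.maximum(vhat, v)                # AMSGrad: vhat is nondecreasing
    h = np.sqrt(vhat) + eps                   # diagonal entries of H_t
    return x - lam * m / h, (m, v, vhat)

# Toy run on f(x) = 0.5 x^2, whose (deterministic) gradient is g = x.
x = np.array([1.0])
state = (np.zeros(1), np.zeros(1), np.zeros(1))
for _ in range(500):
    x, state = amsgrad_step(x, x.copy(), state)
```

Dropping the `np.maximum` line recovers an Adam-style update; keeping it makes the diagonal of \(\mathsf{H}_{t}\) nondecreasing, which is the modification used to restore convergence guarantees.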
Meanwhile, fixed point theory [10, 28–30] enables us to define a computable quasinonexpansive mapping whose fixed point set is equal to the complicated set. For example, let \(\mathrm{lev}_{\leq 0} f_{i}\) (\(i=1,2,\ldots ,I\)) be the level set of a convex function \(f_{i} \colon \mathbb{R}^{N} \to \mathbb{R}\), and let X be the intersection of \(\mathrm{lev}_{\leq 0} f_{i}\), i.e.,
Let \(n\in \mathbb{N}\) be fixed arbitrarily, and let \(\mathsf{H}_{n} \in \mathbb{S}^{N}_{++}\) (see (A0)). Let \(Q_{f_{i}, \mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) (\(i=1,2,\ldots ,I\)) be the subgradient projection defined by (1) with \(f:= f_{i}\) and \(H := \mathsf{H}_{n}\). Accordingly, Proposition 2.2 implies that \(Q_{f_{i}, \mathsf{H}_{n}}\) is firmly quasinonexpansive under the \(\mathsf{H}_{n}\)-norm and \(\mathrm{Fix}(Q_{f_{i}, \mathsf{H}_{n}}) = \mathrm{lev}_{\leq 0} f_{i}\) (\(i=1,2,\ldots ,I\)). Under the condition that the subgradients of \(f_{i}\) can be efficiently computed (see, e.g., [10, Chap. 16] for examples of convex functions with computable subgradients), \(Q_{f_{i}, \mathsf{H}_{n}}\) also can be computed. Here, let us define \(Q_{\mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) as
where \((\omega _{i})_{i=1}^{I} \subset (0,+\infty )\) satisfies \(\sum_{i=1}^{I} \omega _{i} = 1\). Then \(Q_{\mathsf{H}_{n}}\) is quasinonexpansive under the \(\mathsf{H}_{n}\)-norm [10, Exercise 4.11]. Moreover, we have that
where the second equality comes from Proposition 2.2(i) (i.e., \(\mathrm{Fix}(Q_{f_{i}}) = \mathrm{lev}_{\leq 0} f_{i}\) (\(i=1,2,\ldots ,I\))), the third equality comes from Proposition 2.2(ii) (i.e., \(\mathrm{Fix}(Q_{f_{i}}) = \mathrm{Fix}(Q_{f_{i}, \mathsf{H}_{n}})\) for all \(n\in \mathbb{N}\)), and the fourth equality comes from [10, Proposition 4.34]. Therefore, (10), (11), and (12) imply that we can define a computable mapping \(Q_{\mathsf{H}_{n}}\) satisfying (A1) whose fixed point set is equal to the intersection of the level sets. In the case where C is simple in the sense that \(P_{C} = P_{C,I}\) can be easily computed, \(I \succ O\) and \(Q := P_{C}\) obviously satisfy (A0) and (A1) with \(\mathrm{Fix}(P_{C}) =C=:X\). Accordingly, Problem 3.1 with \(Q := P_{C}\) coincides with problem (3), which implies that Problem 3.1 is a generalization of problem (3).
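A sketch of a mapping of the form (11) for two half-space level sets; the functions, weights, and starting point are illustrative, and \(\mathsf{H}_{n} = I\) is assumed for simplicity.

```python
import numpy as np

def sgp(x, f, G):
    """Subgradient projection Q_f relative to f (Euclidean case)."""
    fx = f(x)
    if fx <= 0:
        return x.copy()
    g = G(x)
    return x - (fx / np.dot(g, g)) * g

def Q(x, fs, Gs, w):
    """Q(x) := sum_i w_i Q_{f_i}(x); Fix(Q) is the intersection of the level sets."""
    return sum(wi * sgp(x, f, G) for f, G, wi in zip(fs, Gs, w))

# Level sets: x_1 <= 1 and x_1 >= 0, whose intersection is 0 <= x_1 <= 1.
fs = [lambda x: x[0] - 1.0, lambda x: -x[0]]
Gs = [lambda x: np.array([1.0, 0.0]), lambda x: np.array([-1.0, 0.0])]
x = np.array([5.0, 2.0])
for _ in range(60):               # simple fixed point iteration of Q
    x = Q(x, fs, Gs, [0.5, 0.5])
```

Starting outside both level sets, the iterates approach a point of the intersection, illustrating how a computable quasinonexpansive mapping encodes a constraint set that has no closed-form projection in general.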
Fixed point algorithms exist for searching for a fixed point of a nonexpansive mapping [10, Chap. 5], [11, Chaps. 2–9], [12–16]. The sequence \((x_{n})_{n\in \mathbb{N}}\) generated by the Halpern fixed point algorithm [11, Subchapter 6.5], [12, 16] is defined as follows: for all \(n\in \mathbb{N}\),
\(x_{n+1} := \alpha _{n} x_{0} + (1 - \alpha _{n}) Q(x_{n}),\) (13)
where \(x_{0}\in \mathbb{R}^{N}\), \((\alpha _{n})_{n\in \mathbb{N}} \subset (0,1)\) satisfies \(\lim_{n\to +\infty } \alpha _{n} = 0\) and \(\sum_{n=0}^{+\infty } \alpha _{n} = +\infty \), and \(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is nonexpansive with \(\mathrm{Fix}(Q) \neq \emptyset \). The sequence \((x_{n})_{n\in \mathbb{N}}\) in (13) converges to the minimizer of the specific convex function \(f_{0}(x) := (1/2)\|x - x_{0}\|^{2}\) (\(x\in \mathbb{R}^{N}\)) over \(\mathrm{Fix}(Q)\) (see, e.g., [11, Theorem 6.19]). From \(\nabla f_{0}(x) = x - x_{0}\) (\(x\in \mathbb{R}^{N}\)), the Halpern algorithm (13) can be expressed as follows (see [31, 32] for algorithms optimizing a general convex function):
Combining the SA method (4) with (14) naturally yields the following algorithm for solving Problem 3.1: for all \(n\in \mathbb{N}\),
where \(Q_{\alpha }:= \alpha \mathrm{Id} + (1-\alpha )Q\) (\(\alpha \in (0,1)\)). A convergence analysis of this algorithm for different step-size rules was performed in [17]. For example, algorithm (15) with a diminishing step-size was shown to converge in probability to a solution to Problem 3.1 with \(X = \mathrm{Fix}(Q)\) [17, Theorem III.2]. The advantage of algorithm (15) is that it allows convex stochastic optimization problems with complicated constraints to be solved (see also (12)). From the fact stated in [17, Problem II.1] that the classifier ensemble problem [18, 19], which is a central issue in machine learning, can be formulated as a convex stochastic optimization problem with complicated constraints, the classifier ensemble problem can be regarded as an example of Problem 3.1. This result implies that algorithm (15) can solve the classifier ensemble problem. However, this algorithm suffers from slow convergence, as shown in [17]. Specifically, although the learning methods based on algorithm (15) have higher accuracies than the previously proposed learning methods, they have longer elapsed times. Accordingly, we should consider developing stochastic optimization techniques to accelerate algorithm (15). This paper proposes an algorithm (Algorithm 1) based on useful stochastic gradient descent algorithms, such as Adam [8, Algorithm 1] and AMSGrad [9, Algorithm 2], for solving Problem 3.1, as a replacement for the existing stochastic first-order method [17].
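A sketch of the Halpern iteration (13) with \(\alpha _{n} := 1/(n+2)\) and Q taken to be the projection onto a box; all concrete choices here are illustrative.

```python
import numpy as np

def halpern(Q, x0, steps):
    """x_{n+1} := alpha_n x_0 + (1 - alpha_n) Q(x_n), with alpha_n = 1/(n+2)."""
    x = x0
    for n in range(steps):
        alpha = 1.0 / (n + 2)
        x = alpha * x0 + (1.0 - alpha) * Q(x)
    return x

# Q: (nonexpansive) projection onto the box [-1, 1]^2, so Fix(Q) is the box
# and the iterates approach the point of Fix(Q) nearest to x0.
Q = lambda x: np.clip(x, -1.0, 1.0)
x = halpern(Q, np.array([3.0, 4.0]), steps=1000)
```

Since \(\alpha _{n} \to 0\) and \(\sum \alpha _{n} = +\infty \), the iterates converge to \(P_{\mathrm{Fix}(Q)}(x_{0})\), here the corner \((1,1)\) of the box.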
4 Proposed algorithm
Before giving some examples, we first prove the following lemma listing the basic properties of Algorithm 1.
Lemma 4.1
Suppose that \(\mathsf{H}_{n} \in \mathbb{S}_{++}^{N}\) (\(n\in \mathbb{N}\)), (A1), (A2), (C1), and (C2) hold and consider the sequence \((x_{n})_{n\in \mathbb{N}}\) defined for all \(n\in \mathbb{N}\) by Algorithm 1. Then, for all \(x\in X\) and all \(n\in \mathbb{N}\),
Moreover, under (C3), \(\mathbb{E}[\|m_{n}\|^{2}] \leq \tilde{M}^{2} := \max \{ \|m_{-1}\|^{2},M^{2} \}\) holds for all \(n\in \mathbb{N}\). If
-
(A3)
\(h_{\star }:= \sup \{\max_{i=1,2,\ldots ,N} h_{n,i}^{-1/2} \colon n \in \mathbb{N} \}\) is finite, where \(\mathsf{H}_{n} := \mathsf{diag}(h_{n,i})\),
then \(\mathbb{E}[\|\mathsf{d}_{n}\|_{\mathsf{H}_{n}}^{2}] \leq h_{\star }^{2} \tilde{M}^{2}\) holds for all \(n\in \mathbb{N}\).
Proof
Let \(x\in X \subset C\) and \(n\in \mathbb{N}\) be fixed arbitrarily. The definition of \(x_{n+1}\) and the firm nonexpansivity of \(P_{C,\mathsf{H}_{n}}\) guarantee that, almost surely,
which, together with \(\| \alpha x + (1-\alpha ) y \|^{2} = \alpha \|x\|^{2} + (1-\alpha ) \|y\|^{2} - \alpha (1-\alpha )\|x-y\|^{2}\) (\(x,y\in \mathbb{R}^{N}\), \(\alpha \in \mathbb{R}\)), implies that
The definition of \(y_{n}\) and (A1) ensure that, almost surely,
The definitions of \(\mathsf{d}_{n}\) and \(m_{n}\) in turn ensure that
Hence, (16) implies that, almost surely,
Moreover, the condition \(x_{n} = x_{n}(\xi _{[n-1]})\) (\(n\in \mathbb{N}\)) and (C1) guarantee that
which, together with (C2), implies that
Therefore, taking the expectation of (17) gives the first assertion of Lemma 4.1.
The definition of \(m_{n}\) and (C3), together with the convexity of \(\|\cdot \|^{2}\), guarantee that, for all \(n\in \mathbb{N}\),
Induction thus ensures that, for all \(n\in \mathbb{N}\),
Given \(n\in \mathbb{N}\), \(\mathsf{H}_{n} \succ O\) ensures that there exists a unique matrix \(\overline{\mathsf{H}}_{n} \succ O\) such that \(\mathsf{H}_{n} = \overline{\mathsf{H}}_{n}^{2}\) [33, Theorem 7.2.6]. Since \(\|x\|_{\mathsf{H}_{n}}^{2} = \| \overline{\mathsf{H}}_{n} x \|^{2}\) holds for all \(x\in \mathbb{R}^{N}\), the definition of \(\mathsf{d}_{n}\) implies that, for all \(n\in \mathbb{N}\),
where \(\| \overline{\mathsf{H}}_{n}^{-1} \| = \| \mathsf{diag}(h_{n,i}^{-1/2}) \| = \max_{i=1,2,\ldots ,N} h_{n,i}^{-1/2}\) (\(n\in \mathbb{N}\)). From (18) and
(by (A3)), we have that, for all \(n\in \mathbb{N}\),
This completes the proof. □
The convergence analyses of Algorithm 1 in Sect. 5 depend on the following assumption:
-
(A4)
The set C in (A1) is bounded.
Let us consider the case where \(\mathsf{H}_{n}\) and \(v_{n}\) are defined for all \(n\in \mathbb{N}\) by
where \(\beta \in (0,1)\) and \(v_{-1} = \hat{v}_{-1} = 0 \in \mathbb{R}^{N}\) (see also (9)), and discuss the relationship between (A3) and (A4). Assumption (A4) implies that \((x_{n})_{n\in \mathbb{N}} \subset C\) generated by Algorithm 1 is almost surely bounded. In the standard case of \(\mathsf{G}(x_{n},\xi _{n}) \in \partial _{x} F(x_{n}, \xi _{n})\), Proposition 2.1 and (A4) imply that \((\mathsf{G}(x_{n},\xi _{n}))_{n\in \mathbb{N}}\) is almost surely bounded, i.e., \({M}_{1} := \sup_{n\in \mathbb{N}} \| \mathsf{G}(x_{n},\xi _{n}) \odot \mathsf{G}(x_{n},\xi _{n}) \| < + \infty \). Since the triangle inequality and (19) guarantee that, almost surely, \(\|v_{n} \| \leq \beta \|v_{n-1} \| + (1 - \beta ) \|\mathsf{G}(x_{n}, \xi _{n}) \odot \mathsf{G}(x_{n},\xi _{n}) \|\), induction shows that, for all \(n\in \mathbb{N}\), almost surely, \(\|v_{n}\| \leq {M}_{1} < + \infty \). Accordingly, (19) leads to the almost sure boundedness of \((\hat{v}_{n})_{n\in \mathbb{N}}\). Hence, \(h_{\star }:= \sup \{ \max_{i=1,2,\ldots ,N} \sqrt{\hat{v}_{n,i}} \colon n\in \mathbb{N} \}< + \infty \), which implies that (A3) holds. The above discussion shows that (A4) implies (A3) when \(\mathsf{H}_{n}\) and \(v_{n}\) are as follows (see also (6) and (7)):
We provide some examples of Problem 3.1 with (A0)–(A4) that can be solved by Algorithm 1 under (C1)–(C3).
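The boundedness argument above can be checked numerically: each \(v_{n}\) is a convex combination of \(v_{n-1}\) and \(\mathsf{G} \odot \mathsf{G}\), so \(\hat{v}_{n}\) never exceeds the componentwise maximum of the squared gradients. A sketch with synthetic gradients (all concrete choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def run_vhat(grads, beta=0.999):
    """Update (19): v_n = beta v_{n-1} + (1-beta) g.g and vhat_n = max(vhat_{n-1}, v_n)."""
    v = np.zeros_like(grads[0])
    vhat = v.copy()
    for g in grads:
        v = beta * v + (1.0 - beta) * g * g   # convex combination, so v stays bounded
        vhat = np.maximum(vhat, v)            # running maximum, so vhat is nondecreasing
    return vhat

grads = [rng.standard_normal(4) for _ in range(1000)]
vhat = run_vhat(grads)
bound = np.max([g * g for g in grads], axis=0)   # componentwise M_1-type bound
```

The componentwise inequality `vhat <= bound` mirrors the induction showing \(\|v_{n}\| \leq M_{1}\).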
Example 4.1
(i) Deep learning problem [9, p.2]: At each time step t, stochastic optimization algorithms used in training deep networks pick a point \(x_{t} \in X\) representing the parameters of the model to be learned, where \(X \subset \mathbb{R}^{N}\) is the simple, nonempty, bounded, closed convex feasible set of points, and then incur a loss \(f_{t} (x_{t})\), where \(f_{t} \colon \mathbb{R}^{N} \to \mathbb{R}\) is a convex loss function representing the loss of the model with the chosen parameters on the next minibatch. Accordingly, the stochastic optimization problem in deep networks can be formulated as follows:
where T is the total number of rounds in the learning process, and \((\mathsf{H}_{n})_{n\in \mathbb{N}} \subset \mathbb{S}_{++}^{N} \cap \mathbb{D}^{N}\) defined by each of (19) and (20) satisfies (A0). \(Q_{\mathsf{H}_{n}} := P_{X,\mathsf{H}_{n}}\) (\(n\in \mathbb{N}\)) satisfies (A1), and \(f(\cdot ) = \mathbb{E}[f_{\xi } (\cdot )]:= (1/T) \sum_{t=1}^{T} f_{t}( \cdot )\) satisfies (A2). Setting \(C := X\) ensures (A4), which implies (A3). Algorithm 1 for solving problem (21) is as follows:
(ii) Classifier ensemble problem [18, Sect. 2.2.2], [19, Sect. 3.2.2] (see also [17, Problem II.1]): Consider a training set \(S =\{ (z_{m}, l_{m})\}_{m=1}^{M} \subset \mathbb{R}^{N} \times \mathbb{R}\), where \(z_{m} := (z_{m}^{n})_{n=1}^{N}\) and \(z_{m}^{n}\) is the measure corresponding to the mth sample in the sample set and the nth classifier in an ensemble. The classifier ensemble problem with sparsity learning is the following:
where \(\|\cdot \|_{1}\) denotes the \(\ell _{1}\)-norm and \(t_{1}\) is the sparsity control parameter. Suppose that \(\mathsf{H}_{n}\) is defined by each of (19) and (20), which satisfies (A0), and define a mapping \(Q_{\mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) by
Since the projections \(P_{\mathbb{R}_{+}^{N}, \mathsf{H}_{n}}\) and \(P_{\{ x \in \mathbb{R}^{N} \colon \|x\|_{1} \leq t_{1} \}, \mathsf{H}_{n}}\) can be easily computed [34, Lemma 1.1], \(Q_{\mathsf{H}_{n}}\) defined by (24) can also be computed. Moreover, \(Q_{\mathsf{H}_{n}}\) defined by (24) is nonexpansive with \(X= \bigcap_{n\in \mathbb{N}} \mathrm{Fix}(Q_{\mathsf{H}_{n}})\), i.e., (A1) holds. Since \(\{ x \in \mathbb{R}^{N} \colon \|x\|_{1} \leq t_{1} \}\) is bounded, we can set a simple, bounded set C such that \(X \subset C\), i.e., (A4) holds. Moreover, f in problem (23) satisfies (A2).
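For instance, the Euclidean projection onto the \(\ell _{1}\) ball \(\{x \colon \|x\|_{1} \leq t_{1}\}\) can be computed by the standard sort-and-threshold procedure; the sketch below assumes \(\mathsf{H}_{n} = I\) and uses illustrative inputs.

```python
import numpy as np

def project_l1_ball(x, t):
    """Euclidean projection onto {y : ||y||_1 <= t} via sorting and soft-thresholding."""
    if np.abs(x).sum() <= t:
        return x.copy()                        # already feasible
    u = np.sort(np.abs(x))[::-1]               # |x| sorted in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    k = np.nonzero(u * idx > css - t)[0][-1]   # largest index kept active
    theta = (css[k] - t) / (k + 1.0)           # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

y = project_l1_ball(np.array([3.0, 1.0]), 2.0)  # -> [2. 0.]
```

The whole computation is a sort plus a linear scan, which is what makes \(\ell _{1}\)-constrained mappings such as (24) practical to evaluate at every iteration.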
The classifier ensemble problem with both sparsity and diversity learning is as follows:
where \(t_{2}\) is the diversity control parameter, \(f_{\mathrm{div}}(x) := \sum_{m=1}^{M} \{ \langle [z_{m}], x \rangle - \langle z_{m}, x \rangle ^{2} \}\) (\(x\in \mathbb{R}^{N}\)), and \([z_{m}] := ((z_{m}^{i})^{2})_{i=1}^{N} \in \mathbb{R}^{N}\). From the discussion regarding (10), (11), and (12), a mapping
with \((\mathsf{H}_{n})_{n\in \mathbb{N}} \subset \mathbb{S}_{++}^{N} \cap \mathbb{D}^{N}\) defined by each of (19) and (20), is quasinonexpansive under the \(\mathsf{H}_{n}\)-norm satisfying \(X = \bigcap_{n\in \mathbb{N}} \mathrm{Fix}(Q_{\mathsf{H}_{n}})\), i.e., (A1) holds. The discussion in the previous paragraph implies that (A0), (A2), and (A4) again hold.
Algorithm 1 for solving each of problems (23) and (25) is represented as follows:
In contrast to Adam (6) and AMSGrad (9), which can solve convex stochastic optimization problems only with simple constraints (see (3) and (21)), algorithm (27) can be applied to convex stochastic optimization problems with complicated constraints, such as problems (23) and (25).
(iii) Network utility maximization problem [35, (6), (7)] (see also [36, Problem II.1]): The network resource allocation problem is to determine the source rates that maximize the utility aggregated over all sources, subject to the link capacity constraints and source constraints. This problem can be formulated as the following network utility maximization problem:
where \(x_{s}\) denotes the transmission rate of source \(s \in \mathcal{S}\), \(u_{s}\) is a concave utility function of source s, \(\mathcal{S}(l)\) denotes the set of sources that use link \(l \in \mathcal{L}\), \(C_{l}\) is the capacity constraint set of link l having capacity \(c_{l} \in \mathbb{R}_{+}\) defined by \(C_{l} := \{ x= (x_{s})_{s\in \mathcal{S}} \colon \sum_{s \in \mathcal{S}(l)} x_{s} \leq c_{l} \}\), and \(C_{s}\) is the constraint set of source s having the maximum allowed rate \(M_{s}\) defined by \(C_{s} := \{ x= (x_{s})_{s\in \mathcal{S}} \colon x_{s} \in [0,M_{s}] \}\). Since \(C_{l}\) and \(C_{s}\) are half-spaces, the projections \(P_{C_{l}, \mathsf{H}_{n}}\) and \(P_{C_{s}, \mathsf{H}_{n}}\) are easily computed, where \((\mathsf{H}_{n})_{n\in \mathbb{N}} \subset \mathbb{S}_{++}^{N} \cap \mathbb{D}^{N}\) is defined by each of (19) and (20). For example, we can define a nonexpansive mapping \(Q_{\mathsf{H}_{n}} := \prod_{l\in \mathcal{L}} P_{C_{l},\mathsf{H}_{n}} \prod_{s\in \mathcal{S}} P_{C_{s}, \mathsf{H}_{n}}\) satisfying \(X = \bigcap_{n\in \mathbb{N}} \mathrm{Fix}(Q_{\mathsf{H}_{n}})\). The boundedness of \(\bigcap_{s\in \mathcal{S}} C_{s}\) allows us to set a simple, bounded set C satisfying \(C \supset \bigcap_{s\in \mathcal{S}} C_{s} \supset X\). Algorithm (27) with \(\mathsf{G}(x_{n},\xi _{n}) \in \partial (-u_{\xi _{n}})(x_{n})\) can be applied to problem (28).
5 Convergence analyses and comparisons
5.1 Convergence analyses of Algorithm 1
For convergence analyses of Algorithm 1, we prove the following theorem.
Theorem 5.1
Suppose that (A0)–(A4) and (C1)–(C3) hold and that \((\alpha _{n})_{n\in \mathbb{N}}\), \((\beta _{n})_{n\in \mathbb{N}}\), \((\lambda _{n})_{n\in \mathbb{N}}\), and \((\gamma _{n})_{n\in \mathbb{N}}\) defined by \(\gamma _{n} := (1-\alpha _{n})(1-\beta _{n})\lambda _{n}\) (\(n\in \mathbb{N}\)) satisfy
and that \(\mathsf{H}_{n} = \mathsf{diag}(h_{n,i})\) satisfies
Then Algorithm 1 is such that the following are satisfied for all \(n \geq 1\):
where \(\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k}\), \(\tilde{M}\) and \(h_{\star }\) are defined as in Lemma 4.1,
\((\alpha _{n})_{n\in \mathbb{N}} \subset [c,a] \subset (0,1)\), \((\beta _{n})_{n\in \mathbb{N}} \subset (0,b] \subset (0,1)\), \(\tilde{a} := 1 - a\), \(\tilde{b} := 1 - b\), \(\tilde{c} := 1-c\), and \(\hat{M} := \sup \{\mathbb{E}[f(x) - f(x_{n})] \colon n\in \mathbb{N} \} < + \infty \) for some \(x\in X\). If
-
(A1)’
\(Q_{\mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is nonexpansive under the \(\mathsf{H}_{n}\)-norm,
then, for all \(n \geq 1\),
Proof
Let \(x\in X\) be fixed arbitrarily. Lemma 4.1 guarantees that, for all \(k\in \mathbb{N}\),
Summing the above inequality ensures that, for all \(n \geq 1\),
where (29) implies that \(b > 0\) exists such that, for all \(n\in \mathbb{N}\), \(\beta _{n} \leq b < 1\) and \(\tilde{b} := 1 -b\). The definition of \(\Gamma _{n}\) and \(\mathbb{E} [ \| x_{n+1} - x \|_{\mathsf{H}_{n}}^{2}]/\gamma _{n} \geq 0\) imply that
Given \(k\in \mathbb{N}\), \(\mathsf{H}_{k} \succ O\) ensures that there exists a unique matrix \(\overline{\mathsf{H}}_{k} \succ O\) such that \(\mathsf{H}_{k} = \overline{\mathsf{H}}_{k}^{2}\) [33, Theorem 7.2.6]. Since \(\|x\|_{\mathsf{H}_{k}}^{2} = \| \overline{\mathsf{H}}_{k} x \|^{2}\) holds for all \(x\in \mathbb{R}^{N}\), we have that, for all \(k\in \mathbb{N}\),
Since \(\mathsf{H}_{k}\) (\(k\in \mathbb{N}\)) is diagonal, we can express \(\mathsf{H}_{k}\) as \(\mathsf{H}_{k} = \mathsf{diag}(h_{k,i})\), where \(h_{k,i} > 0\) (\(k\in \mathbb{N}\), \(i=1,2,\ldots ,N\)). Accordingly, for all \(k\in \mathbb{N}\) and all \(x := (x_{i})_{i=1}^{N} \in \mathbb{R}^{N}\),
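The two identities used here, \(\|x\|_{\mathsf{H}_{k}}^{2} = \|\overline{\mathsf{H}}_{k} x\|^{2}\) and, for diagonal \(\mathsf{H}_{k}\), \(\|x\|_{\mathsf{H}_{k}}^{2} = \sum_{i} h_{k,i} x_{i}^{2}\), can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.uniform(0.5, 2.0, size=4)        # h_{k,i} > 0, so H_k = diag(h) > O
x = rng.standard_normal(4)

H = np.diag(h)
H_bar = np.diag(np.sqrt(h))              # the unique H_bar > O with H = H_bar^2

norm_H   = x @ H @ x                     # ||x||_{H_k}^2
norm_bar = np.linalg.norm(H_bar @ x)**2  # ||H_bar x||^2
norm_sum = float(np.sum(h * x**2))       # sum_i h_{k,i} x_i^2

assert np.allclose(norm_H, norm_bar) and np.allclose(norm_H, norm_sum)
```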
Hence, (33) ensures that, for all \(n\in \mathbb{N}\),
From \(\gamma _{k} \leq \gamma _{k-1}\) (\(k\geq 1\)) (see (29)) and (30), we have that \(h_{k,i}/\gamma _{k} - h_{k-1,i}/\gamma _{k-1} \geq 0\) (\(k \geq 1\), \(i=1,2,\ldots ,N\)). Moreover, (A4) implies that \(D := \max_{i=1,2,\ldots ,N} \sup \{ (x_{n,i} - x_{i})^{2} \colon n \in \mathbb{N} \} < + \infty \). Accordingly, for all \(n\in \mathbb{N}\),
Hence, (32), together with \(\mathbb{E} [\| x_{1} - x\|_{\mathsf{H}_{1}}^{2}]/\gamma _{1} \leq D \mathbb{E} [ \sum_{i=1}^{N} h_{1,i}/\gamma _{1}]\), implies that, for all \(n\in \mathbb{N}\),
which, together with the existence of \(a > 0\) such that, for all \(n\in \mathbb{N}\), \(\alpha _{n} \leq a < 1\) (by (29)) and \(\tilde{a} := 1 -a\), implies that
The Cauchy–Schwarz inequality, together with \(D := \max_{i=1,2,\ldots ,N} \sup \{ (x_{n,i} - x_{i})^{2} \colon n \in \mathbb{N} \} < + \infty \) and \(\mathbb{E}[\|m_{n}\|] \leq \tilde{M} := \sqrt{\max \{ \|m_{-1}\|^{2}, M^{2} \}}\) (\(n\in \mathbb{N}\)) (by Lemma 4.1), guarantees that, for all \(n\in \mathbb{N}\),
Since \(\mathbb{E}[ \|\mathsf{d}_{n} \|_{\mathsf{H}_{n}}^{2} ] \leq h_{\star }^{2} \tilde{M}^{2}\) (\(n\in \mathbb{N}\)) holds (by Lemma 4.1), we have that, for all \(n\in \mathbb{N}\),
Therefore, (31), (35), (36), and (37), together with the convexity of f, imply that, for all \(n\in \mathbb{N}\),
Lemma 4.1 ensures that, for all \(n\in \mathbb{N}\),
A discussion similar to the one for obtaining (35) implies that
The continuity of f (see (A2)) and (A4) mean that \(\hat{M} := \sup \{\mathbb{E}[f(x) - f(x_{n})] \colon n\in \mathbb{N} \} < + \infty \). Accordingly, an argument similar to the one for obtaining (36) and (37) guarantees that, for all \(n\in \mathbb{N}\),
From (29), there exists \(c > 0\) such that, for all \(n\in \mathbb{N}\), \(c \leq \alpha _{n}\). Setting \(\tilde{c} := 1 -c\), it follows that, for all \(n\in \mathbb{N}\),
A discussion similar to the one for obtaining (38) ensures that, for all \(n\in \mathbb{N}\),
Suppose that (A1)’ holds. Then we have that, for all \(k\in \mathbb{N}\), almost surely \(\| y_{k} - Q_{\mathsf{H}_{k}}(x_{k}) \|_{\mathsf{H}_{k}} = \| Q_{ \mathsf{H}_{k}} (x_{k} + \lambda _{k} \mathsf{d}_{k}) - Q_{\mathsf{H}_{k}}(x_{k}) \|_{\mathsf{H}_{k}} \leq \lambda _{k} \|\mathsf{d}_{k}\|_{\mathsf{H}_{k}}\), which, together with \(\| x-y \|^{2} \leq 2 \|x\|^{2} + 2 \|y\|^{2}\) (\(x,y\in \mathbb{R}^{N}\)), implies that
Accordingly, (38) and (39) guarantee that, for all \(n\in \mathbb{N}\),
which completes the proof. □
5.1.1 Constant step-size rule
The following theorem indicates that sufficiently small constant step-sizes \(\beta _{n} := \beta \) and \(\lambda _{n} := \lambda \) allow a solution to the problem to be approximated.
Theorem 5.2
Suppose that the assumptions in Theorem 5.1 hold and also assume that, for all \(i=1,2,\ldots ,N\), there exists a positive number \(B_{i}\) such that
Then Algorithm 1 with \(\alpha _{n} := \alpha \), \(\beta _{n} := \beta \), and \(\lambda _{n} := \lambda \) (\(n\in \mathbb{N}\)) satisfies that
where \(\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k}\) and \(\tilde{\alpha } := 1 -\alpha \). Under (A1)’, we have
Proof
We first show that, for all \(\epsilon > 0\),
If (46) does not hold, then there exists \(\epsilon _{0} > 0\) such that
Let \(x\in X\) and \(\chi _{n} := \mathbb{E} [ \| x_{n} - x \|_{\mathsf{H}_{n}}^{2}]\) for all \(n\in \mathbb{N}\). Lemma 4.1, together with the proofs of (36) and (37), implies that, for all \(n\in \mathbb{N}\),
From (34) and (A4), for all \(n\in \mathbb{N}\),
Accordingly, (30) and (40) ensure that there exists \(n_{0} \in \mathbb{N}\) such that, for all \(n \geq n_{0}\),
Hence, (48) implies that, for all \(n \geq n_{0}\),
From (47), there exists \(n_{1} \in \mathbb{N}\) such that, for all \(n \geq n_{1}\),
Therefore, for all \(n \geq n_{2} := \max \{n_{0}, n_{1}\}\),
which is a contradiction since the right-hand side of the above inequality approaches minus infinity as n increases. Hence, (46) holds for all \(\epsilon > 0\), which implies that (41) holds. A discussion similar to the one for showing (46) leads to (42). We next show that, for all \(\epsilon > 0\),
If (50) does not hold for all \(\epsilon > 0\), then there exist \(\epsilon _{0} > 0\) and \(n_{3} \in \mathbb{N}\) such that, for all \(n \geq n_{3}\),
Lemma 4.1, together with (48) and (49), ensures that, for all \(n\geq n_{0}\),
Accordingly, for all \(n \geq n_{4} := \max \{n_{0}, n_{3}\}\),
which is a contradiction. Since (50) holds for all \(\epsilon > 0\), we have (43). Conditions (44) and (45) follow from Theorem 5.1, which completes the proof. □
5.1.2 Diminishing step-size rule
Lemma 4.1 and Theorem 5.1 give us the following theorem as a convergence analysis of Algorithm 1 with a diminishing step-size.
Theorem 5.3
Suppose that the assumptions in Theorem 5.1 and (40) hold. Let \((\beta _{n})_{n\in \mathbb{N}}\) and \((\lambda _{n})_{n\in \mathbb{N}}\) satisfy the following:
Then Algorithm 1 satisfies that
Moreover, if (A1)’ holds, then we have
Let \((\beta _{n})_{n\in \mathbb{N}}\) and \((\lambda _{n})_{n\in \mathbb{N}}\) satisfy the following:
Then the sequence \((\tilde{x}_{n})_{n\in \mathbb{N}}\) defined by \(\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k}\) satisfies
with
Moreover, if (A1)’ holds, then we have
with
Proof
We first show (52). Lemma 4.1, together with (36), (37), and (48), implies that, for all \(n\in \mathbb{N}\),
where \(\chi _{n}(x) := \mathbb{E}[\| x_{n} - x\|_{\mathsf{H}_{n}}^{2}]\) for all \(x\in X\) and all \(n\in \mathbb{N}\). Consider (Case 1): For all \(x\in X\), there exists \(m_{0} \in \mathbb{N}\) such that, for all \(n\in \mathbb{N}\), \(n \geq m_{0}\) implies \(\chi _{n+1}(x) \leq \chi _{n}(x)\). This case guarantees the existence of \(\lim_{n\to + \infty } \chi _{n}(x)\) for all \(x\in X\). From (30) and (40), we have that \(\lim_{n\to + \infty } \mathbb{E} [ \sum_{i=1}^{N} (h_{n+1,i} - h_{n,i})] = 0\). Moreover, (51) ensures that \(\lim_{n\to + \infty } \beta _{n} = \lim_{n\to + \infty } \lambda _{n} = 0\). Accordingly, (55) and \(0< \liminf_{n\to + \infty } \alpha _{n} \leq \limsup_{n\to + \infty } \alpha _{n} < 1\) (by (29)) imply that
Consider (Case 2): There exists \(x_{0} \in X\), for all \(m \in \mathbb{N}\), there exists \(n\in \mathbb{N}\) such that \(n \geq m\) and \(\chi _{n+1}(x_{0}) > \chi _{n}(x_{0})\). In this case, there exists \((x_{n_{i}})_{i\in \mathbb{N}} \subset (x_{n})_{n\in \mathbb{N}}\) such that, for all \(i\in \mathbb{N}\), \(\chi _{n_{i} +1}(x_{0}) > \chi _{n_{i}}(x_{0})\). From (55), we have that, for all \(i\in \mathbb{N}\),
A discussion similar to the one for showing (56) guarantees that
Therefore, we have (52). If (A1)’ holds, then Lemma 4.1 implies that, for all \(n\in \mathbb{N}\),
which implies that \(\lim_{n\to + \infty } \mathbb{E} [ \| y_{n} - Q_{\mathsf{H}_{n}} (x_{n}) \|_{\mathsf{H}_{n}} ] = 0\). In (Case 1), (56) and the triangle inequality mean that \(\lim_{n\to + \infty } \mathbb{E} [ \| x_{n} - y_{n} \|_{\mathsf{H}_{n}} ] = 0\). Accordingly, the triangle inequality and \(\lim_{n\to + \infty } \mathbb{E} [ \| y_{n} - Q_{\mathsf{H}_{n}} (x_{n}) \|_{\mathsf{H}_{n}} ] = 0\) imply that \(\lim_{n\to + \infty } \mathbb{E} [ \| x_{n} - Q_{\mathsf{H}_{n}} (x_{n}) \|_{\mathsf{H}_{n}} ] = 0\). In (Case 2), (57) and the triangle inequality mean that \(\lim_{i\to + \infty } \mathbb{E} [ \| x_{n_{i}} - y_{n_{i}} \|_{ \mathsf{H}_{n_{i}}} ] = 0\). Accordingly, the triangle inequality and \(\lim_{i\to + \infty } \mathbb{E} [ \| y_{n_{i}} - Q_{\mathsf{H}_{n_{i}}} (x_{n_{i}}) \|_{\mathsf{H}_{n_{i}}} ] = 0\) imply that \(\lim_{i\to + \infty } \mathbb{E} [ \| x_{n_{i}} - Q_{\mathsf{H}_{n_{i}}} (x_{n_{i}}) \|_{\mathsf{H}_{n_{i}}} ] = 0\). Thus, we have that
Next, we show (53). Lemma 4.1, together with (36) and (37), ensures that, for all \(x^{\star }\in X^{\star }\) and all \(k \in \mathbb{N}\),
where \(\chi _{n}^{\star }:= \chi _{n}(x^{\star })\) for all \(x^{\star }\in X^{\star }\) and all \(n\in \mathbb{N}\). Summing the above inequality from \(k=0\) to \(k = n\) gives that, for all \(n\in \mathbb{N}\),
which, together with (40) and (51), implies that
If (53) does not hold, then there exist \(\zeta > 0\) and \(m_{1} \in \mathbb{N}\) such that, for all \(k \geq m_{1}\), \(\mathbb{E} [ f(x_{k}) - f^{\star } ] \geq \zeta \). Hence, we have that
where the first equality follows from \(\limsup_{n \to + \infty } \alpha _{n} < 1\), \(\sum_{n=0}^{+\infty } \lambda _{n} = + \infty \), and \(\sum_{n=0}^{+\infty } \beta _{n} \lambda _{n} < + \infty \) (by (29) and (51)). Since we have a contradiction, (53) holds. Theorem 5.1, together with (40) and (54), ensures that
with the convergence rate in Theorem 5.3. □
Theorem 5.3 leads to the following corollary.
Corollary 5.1
Suppose that the assumptions in Theorem 5.3 and (A1)’ hold, and consider Algorithm 1 with \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in [1/2,1]\)) and \((\beta _{n})_{n\in \mathbb{N}}\) such that \(\sum_{n=1}^{+\infty } \beta _{n} < + \infty \). Under \(\eta \in (1/2,1]\), we have that
Under \(\eta \in [1/2,1)\), we have that
with the rate of convergence
Proof
The step-size \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in (1/2,1]\)) and \((\beta _{n})_{n\in \mathbb{N}}\) such that \(\sum_{n=1}^{+\infty } \beta _{n} < + \infty \) satisfy (51). Accordingly, Theorem 5.3 with (A1)’ implies that \(\liminf_{n \to + \infty } \mathbb{E}[ f(x_{n}) - f^{\star }] \leq 0\), and \(\liminf_{n \to + \infty } \mathbb{E}[ \|x_{n} - Q_{\mathsf{H}_{n}}(x_{n}) \|_{\mathsf{H}_{n}}] = 0\). The step-size \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in [1/2,1)\)) satisfies
Moreover, we have that
Hence, \(\lim_{n\to +\infty } (1/n) \sum_{k=1}^{n} \lambda _{k} = \lim_{n \to +\infty } (1/n) \sum_{k=1}^{n} \lambda _{k}^{2} = 0\). The condition \(\sum_{n=1}^{+\infty } \beta _{n} < + \infty \) implies that \(\lim_{n\to +\infty } (1/n) \sum_{k=1}^{n} \beta _{k} = 0\) and \(\lim_{n\to +\infty } (1/n) \sum_{k=1}^{n} \beta _{k} \lambda _{k} = 0\). Hence, (54) is satisfied. Accordingly, from Theorem 5.3 with (A1)’ and (58), we have the convergence rate of Algorithm 1 in Corollary 5.1. □
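The vanishing Cesàro averages used in this proof, e.g. \(\lim_{n\to +\infty } (1/n) \sum_{k=1}^{n} \lambda _{k} = 0\) for \(\lambda _{k} = 1/k^{\eta }\), are easy to observe numerically. For \(\eta = 1/2\), the bound \(\sum_{k=1}^{n} k^{-1/2} \leq 2\sqrt{n}\) gives an average of at most \(2/\sqrt{n}\), matching the \(\mathcal{O}(1/\sqrt{n})\) rate in Corollary 5.1:

```python
import math

def avg(seq_fn, n):
    """Cesaro average (1/n) * sum_{k=1}^{n} seq_fn(k)."""
    return sum(seq_fn(k) for k in range(1, n + 1)) / n

eta = 0.5
lam = lambda k: 1.0 / k**eta      # step-size lambda_k = 1/k^eta

for n in (100, 10_000, 1_000_000):
    # sum_{k<=n} k^{-1/2} <= 2*sqrt(n), so the average is at most 2/sqrt(n)
    print(n, avg(lam, n), 2 / math.sqrt(n))
```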
5.2 Comparisons of Algorithm 1 with the existing adaptive learning rate optimization algorithms
The main objective of the existing adaptive learning rate optimization algorithms is to minimize \(\sum_{t=1}^{T} f_{t} (x)\) subject to \(x\in X\), where T is the total number of rounds in the learning process, \(f_{t} \colon \mathbb{R}^{N} \to \mathbb{R}\) (\(t=1,2,\ldots ,T\)) is a differentiable, convex loss function, and \(X \subset \mathbb{R}^{N}\) is bounded, closed, and convex (see also problem (21) in Example 4.1(i)). We would like to achieve low regret on the sequence \((f_{t}(x_{t}))_{t=1}^{T}\), measured as
where \(x^{*} \in X\) is a minimizer of \(\sum_{t=1}^{T} f_{t} (x)\) over X, and \((x_{t})_{t=1}^{T} \subset X\) is the sequence generated by a learning algorithm. Although Theorem 4.1 in [8] indicates that Adam [8, Algorithm 1], [2, Algorithm 8.7] (algorithm (6)) is such that there exists a positive real number D such that \(R(T)/T \leq D/\sqrt{T}\), the proof of Theorem 4.1 in [8] is incomplete [9, Theorem 1]. AMSGrad [9, Algorithm 2] (algorithm (9)) is such that the following result holds [9, Theorem 4, Corollary 1]: Suppose that \(\beta _{1,t} := \beta _{1} \lambda ^{t-1}\) (\(\beta _{1}, \lambda \in (0,1)\)), \(\gamma := \beta _{1}/\sqrt{\beta _{2}} < 1\), and \(\lambda _{t} := \alpha /\sqrt{t}\) (\(\alpha > 0\)). Then there exist positive real numbers \(\hat{D}_{i}\) (\(i=1,2,3\)) such that
where \(\tilde{\beta }_{1} := 1 -\beta _{1}\), \(g_{t} := \nabla _{x} F(x_{t},\xi _{t})\), and \(\|g_{1:T,i} \| := \sqrt{\sum_{t=1}^{T} g_{t,i}^{2}} \leq \hat{D}_{3} \sqrt{T}\). Hence, with AMSGrad, there exists a positive real number \(\hat{D}\) such that
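To make the regret concrete, the sketch below runs online projected gradient descent with the step-size \(\lambda _{t} = \alpha /\sqrt{t}\) used in the AMSGrad analysis on simple one-dimensional quadratic losses. This is an illustrative setting of our own choosing, not the paper's experiments; it only shows how \(R(T)/T\) is computed against the best fixed decision in hindsight.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
c = rng.uniform(0.0, 1.0, size=T)   # loss f_t(x) = (x - c_t)^2 / 2 on X = [0,1]

# Online projected gradient descent with step-size lambda_t = alpha / sqrt(t).
x, cum_loss = 0.0, 0.0
for t in range(1, T + 1):
    cum_loss += 0.5 * (x - c[t - 1])**2        # suffer loss f_t(x_t)
    grad = x - c[t - 1]
    x = float(np.clip(x - (0.5 / np.sqrt(t)) * grad, 0.0, 1.0))

# The best fixed decision in hindsight minimizes sum_t f_t: the mean of c_t.
x_star = c.mean()
regret = cum_loss - 0.5 * np.sum((x_star - c)**2)
print(regret / T)   # averaged regret R(T)/T; small, consistent with O(1/sqrt(T))
```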
We apply Algorithm 1 with \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in [1/2,1)\)) (see also algorithm (22)) to Problem 3.1 for the special case where \(f(\cdot ) = \mathbb{E}[f_{\xi }(\cdot )] := (1/T) \sum_{t=1}^{T} f_{t}(\cdot )\), \(Q_{\mathsf{H}_{n}} := P_{X,\mathsf{H}_{n}}\) (\(n\in \mathbb{N}\)), \(\mathsf{H}_{n}\) is defined by either (19) or (20), and \(C =X\) (see also problem (21)). Then Theorem 5.2 has the following corollary.
Corollary 5.2
Consider problem (21) and suppose that the assumptions in Theorem 5.1 hold. Then algorithm (22) satisfies that
where \(\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k}\) and \((x_{n})_{n\in \mathbb{N}} \subset X\) is the sequence in algorithm (22).
In contrast to Adam and AMSGrad with diminishing step-sizes, Corollary 5.2 indicates that algorithm (22) with constant step-sizes may approximate a solution of problem (21).
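As a numerical illustration of this behavior, the following toy run is in the spirit of Algorithm 1. The update rule below (momentum term \(m_{n}\), an AMSGrad-type nondecreasing second-moment estimate defining \(\mathsf{H}_{n} = \mathsf{diag}(h_{n,i})\), an \(\mathsf{H}_{n}\)-norm projection, and a convex combination with \(\alpha _{n}\)) is an assumption modeled on the quantities \(m_{n}\), \(\mathsf{d}_{n}\), \(y_{n}\), and \(\mathsf{H}_{n}\) appearing in the proofs, not a verbatim transcription of algorithm (22); the objective and hyperparameters are illustrative.

```python
import numpy as np

# Toy objective: f(x) = (1/T) sum_t ||x - c_t||^2 / 2 over the box X = [0,1]^N,
# whose minimizer is the mean of the points c_t.
rng = np.random.default_rng(1)
T, N = 16, 2
targets = rng.uniform(0.0, 1.0, size=(T, N))
lo, hi = 0.0, 1.0

x = np.full(N, 0.9)
m = np.zeros(N)                      # momentum term m_n
v = np.zeros(N)                      # second-moment estimate defining H_n
delta = 0.99                         # decay rate (assumed, AMSGrad-style)

for n in range(1, 2001):
    t = rng.integers(T)
    g = x - targets[t]               # sampled gradient G(x_n, xi_n)
    beta = 0.9 * 0.5**n              # beta_n = 0.9/2^n, as in D1-D6 below
    lam = 0.1 / np.sqrt(n)           # lambda_n = 10^{-1}/sqrt(n)
    alpha = 0.5                      # alpha_n = 1/2
    m = beta * m + (1 - beta) * g
    v = np.maximum(v, delta * v + (1 - delta) * g**2)  # nondecreasing estimate
    h = np.sqrt(v) + 1e-8            # diagonal entries h_{n,i} of H_n
    # For a diagonal H_n and a box X, the H_n-norm projection separates
    # coordinatewise into a clip, so Q_{H_n}(x_n + lam_n d_n) is cheap:
    y = np.clip(x - lam * m / h, lo, hi)
    x = alpha * x + (1 - alpha) * y  # convex combination with alpha_n

print(np.linalg.norm(x - targets.mean(axis=0)))  # small: x near the minimizer
```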
Corollary 5.1 implies the following corollary.
Corollary 5.3
Suppose that the assumptions in Corollary 5.1 hold, \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in [1/2,1]\)), and \((\beta _{n})_{n\in \mathbb{N}}\) satisfies \(\sum_{n=1}^{+\infty } \beta _{n} < + \infty \). Under \(\eta \in (1/2,1]\), algorithm (22) satisfies that
Moreover, under \(\eta \in [1/2,1)\), any accumulation point of \((\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k})_{n\in \mathbb{N}}\) almost surely belongs to the solution set of problem (21), and algorithm (22) achieves the following rate of convergence:
Proof
For problem (21), Corollary 5.1 implies that \(0 \leq \liminf_{n\to + \infty } \mathbb{E}[f(x_{n}) - f^{\star }] \leq 0\) and \(0 \leq \limsup_{n\to + \infty } \mathbb{E}[f(\tilde{x}_{n}) - f^{\star }] \leq 0\), where \(f := (1/T) \sum_{t=1}^{T} f_{t}\). The second inequality guarantees that \(\lim_{n\to + \infty } \mathbb{E}[f(\tilde{x}_{n}) - f^{\star }] = 0\). Let \(\hat{x} \in X\) be an arbitrary accumulation point of \((\tilde{x}_{n})_{n\in \mathbb{N}} \subset X\). Since there exists \((\tilde{x}_{n_{i}})_{i\in \mathbb{N}} \subset (\tilde{x}_{n})_{n\in \mathbb{N}}\) such that \((\tilde{x}_{n_{i}})_{i\in \mathbb{N}}\) converges almost surely to \(\hat{x} \in X\), the continuity of f ensures that \(0 = \lim_{i\to + \infty } \mathbb{E}[f(\tilde{x}_{n_{i}}) - f^{\star }] = \mathbb{E}[f(\hat{x}) - f^{\star }]\), i.e., \(\hat{x} \in X^{\star }\). The rate of convergence of \((\tilde{x}_{n})_{n\in \mathbb{N}}\) is obtained from Corollary 5.1. □
It is not guaranteed that \(x_{T}\) defined by AMSGrad with \(\lambda _{t} : = \alpha /\sqrt{t}\) optimizes \(\sum_{t=1}^{T} f_{t}\) over X since (59) depends on a given parameter T, i.e.,
Meanwhile, Corollary 5.3 implies that any accumulation point of \((\tilde{x}_{n})_{n\in \mathbb{N}}\) defined by algorithm (22) with \(\lambda _{n} := 1/\sqrt{n}\) almost surely belongs to the set of minimizers of \(\sum_{t=1}^{T} f_{t}\) over X and \((\tilde{x}_{n})_{n\in \mathbb{N}}\) achieves an \(\mathcal{O}(1/\sqrt{n})\) convergence rate, i.e.,
5.3 Numerical comparisons
In this section, we consider the classifier ensemble problem [18, Sect. 2.2.2], [19, Sect. 3.2.2], [17, Problem II.1] (see problems (23) and (25) in Example 4.1 (ii)) and compare the performances of the learning methods based on the following algorithms, which commonly used \(\beta =0.99\) [9, Sect. 5] and \(\alpha _{n} = 1/2\) (\(n\in \mathbb{N}\)).
- SG: Stochastic gradient algorithm (15) with \(\lambda _{n} \in [10^{-3}/(n+1), 1/(n+1)]\) computed by the Armijo line search algorithm [17, Algorithms 2 and 3, LS].
- C1: Algorithm 1 with (19) and \(\beta _{n} = \lambda _{n} = 10^{-1}\).
- C2: Algorithm 1 with (19) and \(\beta _{n} = \lambda _{n} = 10^{-3}\).
- C3: Algorithm 1 with (20) and \(\beta _{n} = \lambda _{n} = 10^{-1}\).
- C4: Algorithm 1 with (20) and \(\beta _{n} = \lambda _{n} = 10^{-3}\).
- D1: Algorithm 1 with (19), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} = 10^{-1}/\sqrt{n+1}\).
- D2: Algorithm 1 with (19), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} = 10^{-3}/\sqrt{n+1}\).
- D3: Algorithm 1 with (19), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} \in [10^{-3}/\sqrt{n+1}, 1/\sqrt{n+1}]\) computed by the Armijo line search algorithm.
- D4: Algorithm 1 with (20), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} = 10^{-1}/\sqrt{n+1}\).
- D5: Algorithm 1 with (20), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} = 10^{-3}/\sqrt{n+1}\).
- D6: Algorithm 1 with (20), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} \in [10^{-3}/\sqrt{n+1}, 1/\sqrt{n+1}]\) computed by the Armijo line search algorithm.
The step-size \(\beta _{n} := 0.9/2^{n}\) used in D1–D6 was based on [9, Sect. 5]. The numerical results in [17] showed that the learning method based on SG performed better than the existing methods in [19, (18)]. Therefore, we compare the performance of the learning method based on SG with those of the learning methods based on C1–D6. See Corollary 1 in [17], Theorems 5.2 and 5.3, and Corollary 5.1 for convergence analyses of the above algorithms for solving problems (23) and (25).
The experiments used Mac Pro (Late 2013) with a 3.5 GHz 6-core Intel Xeon E5 CPU, 32 GB 1866 MHz DDR3 memory, and macOS Catalina version 10.15.1 operating system. The algorithms used in the experiments were written in Python 3.7.5 with the NumPy 1.17.4 package. The experiments used the datasets from LIBSVM [37] and the UCI Machine Learning Repository [38] for which information is shown in Table 1. In these experiments, stratified 10-fold cross-validation for the datasets was performed. For this validation, the StratifiedKFold class in the scikit-learn 0.21.3 package was used. Ensembles of support vector classifiers were constructed by the BaggingClassifier class in the scikit-learn 0.21.3 package. The number of base estimators was set as the default value of the scikit-learn package. For learning multiclass classification tasks with the classifiers used in the experiments, the one-vs-the-rest multiclass classification strategy implemented as the OneVsRestClassifier class in the scikit-learn 0.21.3 package was used. The stopping condition for the algorithms used in the experiments was \(n=100\).
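The validation pipeline described above can be sketched as follows. The scikit-learn classes are those named in the text; the synthetic dataset from `make_classification` and the hyperparameter values are illustrative stand-ins for the LIBSVM/UCI datasets of Table 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the LIBSVM/UCI datasets used in the experiments.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Ensemble of support vector classifiers (default number of base estimators),
# wrapped for one-vs-the-rest multiclass classification.
clf = OneVsRestClassifier(BaggingClassifier(SVC(), random_state=0))

# Stratified 10-fold cross-validation, as in the experimental setup.
accs = []
for train, test in StratifiedKFold(n_splits=10, shuffle=True,
                                   random_state=0).split(X, y):
    clf.fit(X[train], y[train])
    accs.append(clf.score(X[test], y[test]))
print(np.mean(accs))   # cross-validated accuracy of the ensemble
```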
Let us consider problem (23) and compare the performances of the sparsity learning methods based on the algorithms with \(Q_{\mathsf{H}_{n}}\) defined by (24). Although we can consider problem (25) and compare the performances of the sparsity and diversity learning methods based on the algorithms with \(Q_{\mathsf{H}_{n}}\) defined by (26), we omit the details due to lack of space.
Tables 2 and 3 show that the accuracy of the learning method based on SG was almost the same as that of the learning methods based on C1, C2, C3, C4, D3, D4, and D6. These tables also show that the elapsed times for the proposed learning methods were shorter than those for the learning method based on SG.
The average accuracies and elapsed times of the existing learning method (SG) were compared to the average accuracies and elapsed times of the proposed learning methods (C1–D6) by using an analysis of variance (ANOVA) test and Tukey–Kramer’s honestly significant difference (HSD) test. The scipy.stats.f_oneway method in the SciPy library was used as the implementation of the ANOVA test, and the statsmodels.stats.multicomp.pairwise_tukeyhsd method in the StatsModels package was used as the implementation of Tukey–Kramer’s HSD test. Recall that the ANOVA test examines whether the hypothesis that the given groups have the same population mean is rejected, whereas Tukey–Kramer’s HSD test can be used to find specifically which pair has a significant difference in groups. The significance level was set at 5% (0.05) for the ANOVA and Tukey–Kramer’s HSD tests. The p-value computed by the ANOVA test for the accuracies was about \(4.09 \times 10^{-19}\) (<0.05). Table 4 indicates that the adjusted p-value between each of the learning methods based on C1, C2, C3, C4, D3, D4, and D6 and the existing learning method based on SG was greater than 0.05. This implies that the existing and proposed methods based on C1, C2, C3, C4, D3, D4, and D6 had almost the same performances in the sense of accuracy. The p-value computed by the ANOVA test for the elapsed time was about \(2.67 \times 10^{-29}\) (<0.05). Table 5 indicates that there is a significant difference in the sense of the elapsed time between each of the proposed methods and the existing method based on SG. Therefore, the proposed methods ran significantly faster than the existing method based on SG.
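The statistical comparison above can be sketched with `scipy.stats.f_oneway`. The accuracy values below are hypothetical stand-ins (the real values are in Tables 2 and 3); the post-hoc Tukey–Kramer step would then be carried out with `pairwise_tukeyhsd` from StatsModels, as described in the text.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical per-fold accuracies for three learning methods
# (stand-ins for, e.g., the SG, C1, and D1 columns of Table 2).
sg = rng.normal(0.80, 0.02, size=10)
c1 = rng.normal(0.81, 0.02, size=10)
d1 = rng.normal(0.70, 0.02, size=10)

# One-way ANOVA: tests whether all groups share the same population mean.
stat, p = f_oneway(sg, c1, d1)
print(p < 0.05)   # a significant difference exists among the three groups

# A Tukey-Kramer HSD test (statsmodels.stats.multicomp.pairwise_tukeyhsd,
# as in the text) would then locate which specific pairs differ.
```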
6 Conclusion
In this paper, we proposed a stochastic approximation method based on adaptive learning rate optimization algorithms for solving a convex stochastic optimization problem over the fixed point set of a quasinonexpansive mapping. We also presented convergence analyses of the proposed method with constant and diminishing step-sizes. The analyses confirm that any accumulation point of the sequence generated by the proposed method almost surely belongs to the solution set of the stochastic optimization problem in deep learning. In addition, we compared the proposed algorithm with the existing adaptive learning rate optimization algorithms and showed that the proposed algorithm achieves an \(\mathcal{O}(1/\sqrt{n})\) convergence rate, which is not achieved by the existing adaptive learning rate optimization algorithms. Numerical results for the classifier ensemble problems demonstrated that the proposed learning methods achieve high accuracies faster than the existing learning method based on the first-order algorithm. In particular, the proposed methods with constant step-sizes or Armijo line search step-sizes solve the classifier ensemble problems faster than the existing method based on the first-order algorithm.
Availability of data and materials
Not applicable.
Notes
The projection \(P_{C,\mathsf{H}_{n}}\) onto a half-space \(C := \{ x\in \mathbb{R}^{N} \colon \langle a ,x \rangle \leq b \} = \mathrm{Fix}(P_{C}) = \mathrm{Fix}(P_{C,\mathsf{H}_{n}})\) under the \(\mathsf{H}_{n}\)-norm, where \(a\neq 0\) and \(b\in \mathbb{R}\), can be defined for all \(x\in \mathbb{R}^{N}\) by \(P_{C,\mathsf{H}_{n}}(x) := x + [(b - \langle a ,x \rangle _{ \mathsf{H}_{n}})/\|a\|_{\mathsf{H}_{n}}^{2}] a\) (\(x\notin C\)) or \(P_{C,\mathsf{H}_{n}}(x) := x\) (\(x\in C\)).
Since AMSGrad is applied to constrained convex optimization, in general, \(\lim_{T \to + \infty } \|g_{1:T,i}\| \neq 0\) and \(\|g_{1:T,i} \| \leq \hat{D}_{3} \sqrt{T}\) hold [8, Corollary 4.2].
References
Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge (2008)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Nedić, A., Lee, S.: On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM J. Optim. 24, 84–107 (2014)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM J. Optim. 22, 1469–1492 (2012)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations, pp. 1–15 (2015)
Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: Proceedings of the International Conference on Learning Representations, pp. 1–23 (2018)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
Berinde, V.: Iterative Approximation of Fixed Points. Springer, Berlin (2007)
Halpern, B.: Fixed points of nonexpanding maps. Bull. Am. Math. Soc. 73, 957–961 (1967)
Krasnosel’skiĭ, M.A.: Two remarks on the method of successive approximations. Usp. Mat. Nauk 10, 123–127 (1955)
Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4, 506–510 (1953)
Nakajo, K., Takahashi, W.: Strong convergence theorems for nonexpansive mappings and nonexpansive semigroups. J. Math. Anal. Appl. 279, 372–379 (2003)
Wittmann, R.: Approximation of fixed points of nonexpansive mappings. Arch. Math. 58, 486–491 (1992)
Iiduka, H.: Stochastic fixed point optimization algorithm for classifier ensemble. IEEE Trans. Cybern. 50, 4370–4380 (2020)
Yin, X.C., Huang, K., Hao, H.W., Iqbal, K., Wang, Z.B.: A novel classifier ensemble method with sparsity and diversity. Neurocomputing 134, 214–221 (2014)
Yin, X.C., Huang, K., Yang, C., Hao, H.W.: Convex ensemble learning with sparsity and diversity. Inf. Fusion 20, 49–58 (2014)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, New York (2000)
Bauschke, H.H., Combettes, P.L.: A weak-to-strong convergence principle for Fejér-monotone methods in Hilbert space. Math. Oper. Res. 26, 248–264 (2001)
Bauschke, H.H., Chen, J.: A projection method for approximating fixed points of quasi nonexpansive mappings without the usual demiclosedness condition. J. Nonlinear Convex Anal. 15, 129–135 (2014)
Vasin, V.V., Ageev, A.L.: Ill-Posed Problems with a Priori Information. V.S.P. Intl. Science, Utrecht (1995)
Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory, 2nd edn. MOS-SIAM Series on Optimization. SIAM, Philadelphia (2014)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: shrinking procedures and optimal algorithms. SIAM J. Optim. 23, 2061–2089 (2013)
Goebel, K., Kirk, W.A.: Topics in Metric Fixed Point Theory. Cambridge Studies in Advanced Mathematics. Cambridge University Press, New York (1990)
Goebel, K., Reich, S.: Uniform Convexity, Hyperbolic Geometry, and Nonexpansive Mappings. Dekker, New York (1984)
Takahashi, W.: Nonlinear Functional Analysis. Yokohama Publishers, Yokohama (2000)
Yamada, I.: The hybrid steepest descent method for the variational inequality problem over the intersection of fixed point sets of nonexpansive mappings. In: Butnariu, D., Censor, Y., Reich, S. (eds.) Inherently Parallel Algorithms for Feasibility and Optimization and Their Applications, pp. 473–504. Elsevier, New York (2001)
Yamada, I., Ogura, N.: Hybrid steepest descent method for variational inequality problem over the fixed point set of certain quasi-nonexpansive mappings. Numer. Funct. Anal. Optim. 25, 619–655 (2004)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985)
Wanka, G., Wilfer, O.: Formulae of epigraphical projection for solving minimax location problems. Pac. J. Optim. 16, 289–313 (2020)
Nedić, A., Ozdaglar, A.: Approximate primal solutions and rate analysis for dual subgradient methods. SIAM J. Optim. 19, 1757–1780 (2009)
Iiduka, H.: Distributed optimization for network resource allocation with nonsmooth utility functions. IEEE Trans. Control Netw. Syst. 6, 1354–1365 (2019)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27 (2011)
Dua, D., Graff, C.: UCI Machine Learning Repository. School Inf. Comput. Sci., Univ. California at Irvine, Irvine, CA, USA (2019)
Acknowledgements
The author would like to thank Professor Heinz Bauschke, Professor Yunier Bello-Cruz, Professor Radu Ioan Bot, Professor Robert Csetnek, and Professor Alexander Zaslavski for giving him a chance to submit his paper to this special issue. The author is sincerely grateful to Editor-in-Chief Yunier Bello-Cruz and the two anonymous reviewers for helping him improve the original manuscript. The author thanks Hiroyuki Sakai for his input on the numerical examples.
Funding
This work was supported by the Japan Society for the Promotion of Science (JSPS KAKENHI Grant Number JP18K11184).
Author information
Contributions
HI developed the mathematical methods. HI discussed the results and contributed to the final manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The author declares that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Iiduka, H. Stochastic approximation method using diagonal positive-definite matrices for convex optimization with fixed point constraints. Fixed Point Theory Algorithms Sci Eng 2021, 10 (2021). https://doi.org/10.1186/s13663-021-00695-3