Similar to quasi-Newton methods (but not inspired by them), the CMA-ES is a second order approach estimating a positive definite matrix within an iterative procedure (more precisely: a covariance matrix, that is, on convex-quadratic functions, closely related to the inverse Hessian). This makes the method feasible on non-separable and/or badly conditioned problems. In contrast to quasi-Newton methods, the CMA-ES does not use or approximate gradients and does not even presume or require their existence. This makes the method feasible on non-smooth and even non-continuous problems, as well as on multimodal and/or noisy problems. It turns out to be a particularly reliable and highly competitive evolutionary algorithm for local optimization (Hansen & Ostermeier 2001) and, surprising at first sight, also for global optimization (Hansen & Kern 2004, CEC 2005, Hansen 2009 in BBOB-2009).
The CMA-ES has several invariance properties. Two of them, inherited from the plain evolution strategy, are (i) invariance to order preserving (i.e. strictly monotonic) transformations of the objective function value (that is, e.g., ||x||2 and 3||x||0.2-100 are equivalent objective functions with identical performance of CMA-ES), and (ii) invariance to angle preserving (rigid) transformations of the search space (including rotation, reflection, and translation), if the initial search point is transformed accordingly. Invariances are highly desirable: they imply uniform behavior on classes of functions and therefore imply the generalization of empirical results.
The CMA-ES does not require a tedious parameter tuning for its application. In fact, the choice of strategy internal parameters is not left to the user (arguably with the exception of population size λ). Finding good (default) strategy parameters is considered as part of the algorithm design, and not part of its application—the aim is to have a well-performing algorithm as is. The default population size λ is comparatively small to allow for fast convergence. Restarts with increasing population size (Auger & Hansen 2005) improve the global search performance. For the application of the CMA-ES, an initial solution, an initial standard deviation (step-size, variables should be defined such that the same standard deviations can be reasonably applied to all variables, see also here) and, possibly, the termination criteria (e.g. a function tolerance) need to be set by the user. The most common applications are model calibration (e.g. curve fitting) and shape optimisation.
The generating set adaptation (GSA) is the first evolution strategy to our knowledge that learns a problem scaling and is invariant to coordinate system rotations. The GSA implements the principal idea of CMA, to increase the likelihood of selected steps, without using the formalism of a covariance matrix. Global step-size control is done by mutative self-adaptation. The paper also provides an algorithm that learns individual step-sizes and one direction (AII). AII has a linear number of strategy parameters and hence learns a less rich model quicker.
The first CMA paper, where the covariance matrix adaptation is introduced into the (1,λ)-ES (μ=1). The paper emphasizes on the evolution path and the differences to the generating set adaptation (Hansen et al 1995). Using the formalism of a covariance matrix has three important consequences. First, path length control (cumulative step-size adaptation, CSA) can be used to control the global step-size. This will become particularly relevant when μ is chosen to be greater than 1. Second, the eigensystem, determined by the covariance matrix, can give insights into the underlying problem structure. De facto, it turns out to be particularly useful to monitor the time evolution of eigenvalues of the covariance matrix. Third, the computational expense to generate a new individual reduces from O(n3) to O(n2). Because the eigendecomposition can be postponed for Ω(n) iterations, the overall internal expenses become O(n2).
Introduction of the "multi-membered" (μ/μ,λ)-CMA-ES with intermediate recombination of all μ parental solutions. Investigation of the effect of μ for small λ on unimodal test functions, some including distorted selection. The (3/3,12)-CMA-ES is recommended as the best compromise for being fast and robust.
A comprehensive article, where weighted recombination is introduced in the (μ/μ,λ)-CMA-ES. Objectives and rationales behind the strategy are discussed as well as (default) strategy parameter choices in detail. Comparisons with the so-called correlated mutations and scale-up investigations, up to 320-D, on a considerable number of unimodal test functions are shown. On the multimodal Rastrigin function the myth that (local) adaptation to the topography of the function jeopardizes global search properties is disproved. Local adaptation allows for larger step-sizes and consequently is likely to improve global search performance.
Extension with the so-called rank-μ update of the covariance matrix which scales up the efficiency of the CMA to large population sizes. The rank-μ update reduces the time complexity, i.e. the number of generations to adapt the complete matrix roughly from 10n2 to 20n. The other way around, the learning speed, with respect to the number of objective function evaluations, depends much less on the population size if the rank-μ update is used. Reasonable population sizes range between 5 and 5n. For larger population sizes a deficiency of the step-size adaptation is discovered (that has been solved in Hansen and Kern (2004)). Simulations on a number of unimodal test functions and scale-up investigations are presented.
In this paper 1) the weighted rank-μ update of the covariance matrix is introduced. 2) the step-size adaptation problem discovered in Hansen et al (2003) is addressed by a considerably improved setting of the step-size control parameters: for large population sizes the backward time horizon of the evolution path is diminished by increasing the parameter cσ and the step-size damping dσ is increased. 3) the influence of the population size on the performance is investigated on a number of multimodal functions, revealing that an increased population size can dramatically improve the global search performance. Comparisons to other optimization algorithms reveal that on additively decomposable (and therefore separable) functions the CMA-ES can be vastly outperformed, e.g., by differential evolution, while it reveals superior performance on non-separable functions.
The remaining strategy parameter population size, λ, that is most important for the global search performance, is removed by implementing a restart mechanism. Before each restart the population size is increased by a factor of two. Otherwise restarts are conducted independently. Tests on a benchmark function set of 25 functions provided for the session on real-parameter optimization reveal highly competitive performance on local and global optimization problems, compared to the other submitted algorithms.
The CMA is approached and analyzed from the view point of estimating a search distribution. Important differences to the so-called estimation of distribution algorithms (EDAs) are revealed. 1) in CMA the distribution of the selected steps is estimated, while in EDAs the selected points are modeled. This leads invariably to larger variances in CMA. In particular, a "pure EDA" converges prematurely on a linear function whereas a "pure CMA" (without step-size control) does not. 2) in EDAs the estimation is usually done from scratch, while in CMA the distribution is updated (at least for population size λ < 4n2). Therefore in CMA the estimation can be done reliably independent of the population size and, for a small population size, in a lesser number of function evaluations. 3) in CMA an additional global step-size adaptation is conducted, based on a different adaptation principle (cumulative step-size adaptation, or path length control) that often leads to a considerably larger population spread and prevents premature convergence even more effectively. Major concepts and design principles, namely change rates, invariance, and stationarity are discussed.
A negative rank-μ update of the covariance matrix is introduced using the μ worst of the λ individuals. All vectors for both, the original and the negative, rank-μ updates have the same weight. For population sizes up to λ=4n a notable speed-up compared to the original equally weighted rank-μ update can be observed. On the discus (or tablet) function the speed-up exceeds a factor of two for large dimensions and small population sizes. While positive definiteness of the covariance matrix cannot be guaranteed anymore, empirically the covariance matrix remains always positive definite. For a larger population size presumably the learning parameter β needs to be modified.
This paper introduces 1) the CMA into the (1+1)-ES using an improved variant of a success rule based step-size adaptation in place of the path length control, and 2) an incremental O(n2) Cholesky update of the covariance matrix replacing the original O(n2) covariance matrix update altogether with the iterative O(n3) eigendecomposition of the covariance matrix (the latter is usually postponed until after n/10 generations). The resulting algorithm is an elegant CMA variant and accomplishes what is expected. Nevertheless, its practical relevance is limited. The (1+1)-ES is slightly faster than its ","-counterpart, but more prone to converge to local optima. Applying the Cholesky update prohibits to utilize the evolution path for the covariance matrix update. This is a significant disadvantage as the evolution path considerably improves the performance on many functions and was addressed in Suttorp et.al. (2009).
MO-CMA-ES is a multi-objective algorithm maintaining several (1+1)-CMA-ES algorithms in parallel to approximate the pareto set. The algorithm is based on non-dominated sorting. Igel et al (2007) introduce steady-state selection into this algorithm. Voß et al (2010) improve the step-size adaptation.
A very simple change in the covariance matrix update restricts the changes to the diagonal elements of the matrix, AKA separable or diagonal CMA-ES. As the most important consequence, the learning rate for the covariance matrix update can be increased from roughly O(1/n2) to O(1/n). Also, internal time and memory complexity become linear in the dimension. However, the main feature of CMA-ES to learn dependencies between variables is abandoned. Yet, surprisingly, the sep-CMA-ES is competitive on the Rosenbrock function in dimension larger than a hundred.
This paper makes two somewhat significant algorithmic contributions. (i) The cumulation parameter setting cc is improved for large population size such that it approaches 1/2 instead of 4 / n when μ / n → ∞ and n ≫ 1. This improves the performance with very large population sizes roughly by a factor of two (not shown). (ii) The BIPOP restart scheme with two modes is invented: the original IPOP mode (Auger and Hansen 2005) and a mode with small population size. The resulting algorithm is a simple portfolio between the IPOP-CMA-ES and a CMA-ES with independent restarts where the initial conditions for mean and step-size vary. The latter mode allows to solve some weakly structured multi-modal functions that IPOP-CMA-ES fails to solve. The benchmarking data provide a performance baseline, however generally, algorithm portfolios may often not be the best baseline to choose from.
This seminal paper introduces the exponential Natural Evolution Strategy (xNES) and shows its close connection to CMA-ES. The additive update of the covariance matrix C with a weighted empirical covariance matrix √CM√C and learning rate 2η can be expressed as a multiplicative update of C or, in first order, of √C with I + ηM if the weights sum to zero or with I + η(M - I) if the weights sum to one. These multiplicative updates are in first order equivalent with the exponential multiplicative update by exp(ηM) or exp(η(M - I)). The exponential update guaranties C to remain positive definite. In xNES, all recombination weights are offset such that they sum to zero.
The idea of using different recombination weights (Hansen and Ostermeier 2001) is combined with the idea of active CMA-ES using negative weights for the covariance matrix update (Jastrebski and Arnold 2006). This combination is straightforward and has become the default implementation of CMA-ES.
This paper fully establishes the link between a natural gradient descent in the parametrized space of Gaussian distributions and the update equations in the CMA-ES algorithm. Remarkably, the update of the covariance matrix can be (and has been) expressed without explicit computation of the Fisher matrix, which would often be prohibitively expensive. This link provides a deep theoretical foundation to CMA-ES.
The initial step-size is an important parameter for the performance on multi-modal functions. Surprisingly, a relatively small initial step-size can significantly outperform the usual default choice on some multi-modal functions even when a large population is needed. This gives rise to consider more intricate variation schemes for the initial step-size using also smaller step-sizes not only when the population size is small.
While this paper is not stricly about CMA-ES, it introduces a polished and fully parallelizable version of two point adaptation (TPA) which has since become a secondary standard and default for step-size adaptation in CMA-ES. Two candidate solutions are sampled on the line of the last mean shift symmetrically about the current mean. Their ranking difference is used to update the step-size. When combined with CMA-ES, the norm in the denominator of line 10 in Algorithm 3 is the Mahalanobis norm from the current covariance matrix, see also Akimoto & Hansen (2016), Eq.(6).