Similar to quasi-Newton methods (but not inspired by them), the CMA-ES is a **second-order**
approach estimating a positive definite matrix within an
iterative procedure (more precisely: a covariance matrix that,
*on convex-quadratic functions*, is closely related to the inverse
Hessian). This makes the method feasible on non-separable and/or
badly conditioned problems. In contrast to quasi-Newton methods,
the CMA-ES neither uses nor approximates gradients and does not even
presume or require their existence. This makes the method feasible
on **non-smooth** and even non-continuous problems, as well as on
multimodal and/or noisy problems. It turns out to be a
particularly reliable and highly competitive evolutionary
algorithm for local optimization (Hansen & Ostermeier
2001) and, surprisingly at first sight, also for global
optimization (Hansen & Kern
2004, CEC 2005, Hansen 2009 in BBOB-2009).

The CMA-ES has several **invariance
properties**. Two of them, inherited from the plain
evolution strategy, are (i) invariance to order-preserving
(i.e. strictly monotonic) transformations of the objective
function value (for example, ||x||^{2} and
3||x||^{0.2}-100 are *equivalent* objective functions
on which the CMA-ES performs identically), and (ii) invariance to
angle-preserving (rigid) transformations of the search space
(including rotation, reflection, and translation), provided the initial
search point is transformed accordingly. Invariances are highly
desirable: they imply uniform behavior on classes of functions and
therefore imply the generalization of empirical results.
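Invariance (i) follows directly from the fact that selection in the CMA-ES is purely rank-based. The following toy sketch (not the full algorithm; function names are mine) demonstrates that rank-based selection cannot distinguish an objective from any strictly monotonic transformation of it:

```python
# Toy illustration (not the full CMA-ES): rank-based selection sees only
# the ordering of objective values, so any strictly monotonic
# transformation of f leaves the selected candidates -- and hence the
# algorithm's trajectory -- unchanged.

def norm2(x):
    return sum(xi * xi for xi in x)

def f(x):                          # ||x||^2
    return norm2(x)

def g(x):                          # 3 ||x||^0.2 - 100, strictly monotonic in f
    return 3 * norm2(x) ** 0.1 - 100

candidates = [(0.5, -1.2), (2.0, 0.1), (-0.3, 0.4), (1.0, 1.0)]

# Select the mu best candidates under each objective.
mu = 2
selected_f = sorted(candidates, key=f)[:mu]
selected_g = sorted(candidates, key=g)[:mu]

print(selected_f == selected_g)    # True: identical selection under f and g
```

Since every update of mean, step-size, and covariance matrix depends on the candidates only through this selection, the two runs coincide entirely.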

The CMA-ES does not require tedious
**parameter tuning** for its application. In fact, the choice
of strategy-internal parameters is not left to the user
(arguably with the exception of the population size λ).
Finding good (default) strategy parameters is considered part
of the algorithm design, not part of its
application: the aim is to have a well-performing algorithm
*as is*. The default population size λ is
comparatively small to allow for fast convergence. Restarts with
increasing population size (Auger & Hansen
2005) improve the global search performance. For the
**application** of the CMA-ES, the user needs to set an initial solution, an initial
standard deviation (step-size; the variables should be scaled such
that the same standard deviation can reasonably be applied to
all of them, see also here)
and, possibly, the termination criteria (e.g. a
function tolerance).
The most common applications are model calibration (e.g. curve fitting)
and shape optimization.
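The three user-set quantities above can be sketched in code. The following is a deliberately simplified toy, *not* the CMA-ES itself: it uses a (1+1)-ES with 1/5th-success-rule step-size control and omits covariance matrix adaptation entirely, and a simple target-value criterion stands in for the function tolerance; all names and constants are illustrative.

```python
import random

# Toy sketch of the user-facing inputs: initial solution x0, initial
# step-size sigma0, and a termination criterion. This is NOT the CMA-ES:
# covariance matrix adaptation is omitted, and step-size control is a
# plain 1/5th-success rule.

def toy_es(f, x0, sigma0, ftarget=1e-9, max_evals=10_000, seed=1):
    rng = random.Random(seed)
    x, fx, sigma = list(x0), f(x0), sigma0
    for _ in range(max_evals):
        if fx < ftarget:                   # termination criterion
            break
        y = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        fy = f(y)
        if fy < fx:                        # success: accept, enlarge the step
            x, fx, sigma = y, fy, sigma * 1.5
        else:                              # failure: shrink the step
            sigma *= 1.5 ** -0.25
    return x, fx

sphere = lambda x: sum(xi * xi for xi in x)
xbest, fbest = toy_es(sphere, x0=[2.0, 1.0, -0.5], sigma0=0.5)
```

In practice one would call an actual CMA-ES implementation (see the source code page below) with the same three inputs in place of this toy loop.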

- Four slides on randomized search and the CMA-ES (2004, pdf).
- A written tutorial.
- Tutorial slides (pdf, 1.8MB) presented at GECCO 2013.
- CMA-ES at Wikipedia.
- Comparison studies:
  - BBOB 2009: comparison with a wide variety of other algorithms on 24 benchmark functions.
  - CEC 2005: comparison with other evolutionary algorithms on a benchmark function set.
  - Kern et al (2004) compare the CMA-ES with two other evolution strategies, the (1+1)-ES and CSA-ES, and with two estimation of distribution algorithms, IDEA and MBOA, on 14 separable and non-separable benchmark functions (pdf, 827KB).

- Source code page (C, C++, Java, Matlab, Octave, Python, Scilab).
- A list of references to applications (pdf) of the CMA-ES (not quite exhaustive and *entirely* outdated).

- Hansen et al (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. In L. Eshelman (Ed.), *Proceedings of the Sixth International Conference on Genetic Algorithms*, Pittsburgh, pp. 57-64. Morgan Kaufmann.
  The generating set adaptation (GSA) is, to our knowledge, the first evolution strategy that learns a problem scaling *and* is invariant to coordinate system rotations. The GSA implements the principal idea of CMA, to increase the likelihood of selected steps, without using the formalism of a covariance matrix. Global step-size control is done by mutative self-adaptation. The paper also provides an algorithm that learns individual step-sizes and one direction (AII). AII has a linear number of strategy parameters and hence learns a less rich model more quickly.
- Hansen and Ostermeier (1996). Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In *Proceedings of the 1996 IEEE International Conference on Evolutionary Computation*, pp. 312-317.
  The first CMA paper, where the covariance matrix adaptation is introduced into the (1,λ)-ES (μ=1). The paper emphasizes the evolution path and the differences to the generating set adaptation (Hansen et al 1995). Using the formalism of a covariance matrix has three important consequences. First, path length control (cumulative step-size adaptation, CSA) can be used to control the global step-size; this becomes particularly relevant when μ is chosen greater than 1. Second, the eigensystem, determined by the covariance matrix, can give insights into the underlying problem structure; de facto, it turns out to be particularly useful to monitor the time evolution of the eigenvalues of the covariance matrix. Third, the computational expense to *generate* a new individual reduces from *O(n^{3})* to *O(n^{2})*. Because the eigendecomposition can be postponed for *Ω(n)* iterations, the overall internal expenses become *O(n^{2})*.
- Hansen and Ostermeier (1997). Convergence properties of evolution strategies with the derandomized covariance matrix adaptation: The (μ/μ_{I}, λ)-ES. In *EUFIT'97, 5th European Congress on Intelligent Techniques and Soft Computing, Proceedings*, pp. 650-654.
  Introduction of the "multi-membered" (μ/μ,λ)-CMA-ES with intermediate recombination of all μ parental solutions. Investigation of the effect of μ for small λ on unimodal test functions, some including distorted selection. The (3/3,12)-CMA-ES is recommended as the best compromise between speed and robustness.
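The complexity accounting in the 1996 entry above comes from how candidates are sampled from the search distribution. A sketch, assuming numpy (function names are mine): the eigendecomposition of C costs O(n^{3}) but can be postponed over many iterations, while each candidate draw x = m + σ·BDz costs only O(n^{2}).

```python
import numpy as np

# Sketch of CMA-ES-style sampling: one O(n^3) eigendecomposition of C,
# then O(n^2) per candidate x = m + sigma * B D z with z ~ N(0, I).

def sampler_from_cov(C):
    eigvals, B = np.linalg.eigh(C)   # C = B diag(D^2) B^T  -- the O(n^3) step
    D = np.sqrt(eigvals)             # standard deviations along the principal axes
    A = B * D                        # equals B @ diag(D), so A @ A.T == C

    def sample(m, sigma, rng):
        z = rng.standard_normal(len(m))   # z ~ N(0, I)
        return m + sigma * (A @ z)        # x ~ N(m, sigma^2 C), O(n^2) per draw

    return A, sample

C = np.array([[4.0, 1.0],
              [1.0, 2.0]])
A, sample = sampler_from_cov(C)
print(np.allclose(A @ A.T, C))  # True: the transform reproduces C exactly
x = sample(np.zeros(2), 0.3, np.random.default_rng(0))
```

Postponing the decomposition for Ω(n) iterations amortizes the O(n^{3}) cost to O(n^{2}) per iteration, as stated in the entry.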

- Hansen and Ostermeier (2001). Completely Derandomized Self-Adaptation in Evolution Strategies. *Evolutionary Computation*, 9(2), pp. 159-195.
  A comprehensive article, where weighted recombination is introduced into the (μ/μ,λ)-CMA-ES. Objectives and rationales behind the strategy, as well as (default) strategy parameter choices, are discussed in detail. Comparisons with the so-called *correlated mutations* and scale-up investigations, up to 320-D, on a considerable number of unimodal test functions are shown. On the multimodal Rastrigin function, the myth that (local) adaptation to the topography of the function jeopardizes global search properties is disproved: local adaptation allows for larger step-sizes and consequently is likely to improve global search performance.
- Hansen et al (2003). Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). *Evolutionary Computation*, 11(1), pp. 1-18.
  Extension with the so-called rank-μ update of the covariance matrix, which scales up the efficiency of the CMA to large population sizes. The rank-μ update reduces the time complexity, i.e. the number *of generations* to adapt the complete matrix, roughly from 10*n*^{2} to 20*n*. The other way around, the learning speed with respect to the number of objective function evaluations depends much less on the population size if the rank-μ update is used. Reasonable population sizes range between 5 and 5*n*. For larger population sizes a deficiency of the step-size adaptation is discovered (which has been solved in Hansen and Kern (2004)). Simulations on a number of unimodal test functions and scale-up investigations are presented.
- Hansen and Kern (2004). Evaluating the CMA Evolution Strategy on Multimodal Test Functions. In *Eighth International Conference on Parallel Problem Solving from Nature PPSN VIII, Proceedings*, pp. 282-291.
  In this paper, 1) the *weighted* rank-μ update of the covariance matrix is introduced; 2) the step-size adaptation problem discovered in Hansen et al (2003) is addressed by a considerably improved setting of the step-size control parameters: for large population sizes, the backward time horizon of the evolution path is diminished by increasing the parameter *c*_{σ}, and the step-size damping *d*_{σ} is increased; 3) the influence of the population size on the performance is investigated on a number of multimodal functions, revealing that an increased population size can dramatically improve the global search performance. Comparisons to other optimization algorithms reveal that on additively decomposable (and therefore separable) functions the CMA-ES can be vastly outperformed, e.g., by *differential evolution*, while it shows superior performance on non-separable functions.
- Auger and Hansen (2005). A Restart CMA Evolution Strategy With Increasing Population Size. In *IEEE Congress on Evolutionary Computation, CEC 2005, Proceedings*, pp. 1769-1776.
  The remaining strategy parameter, the *population size* λ, which is most important for the global search performance, is removed by implementing a restart mechanism. Before each restart the population size is increased by a factor of two; otherwise restarts are conducted independently. Tests on a benchmark set of 25 functions provided for the session on real-parameter optimization reveal highly competitive performance on local *and* global optimization problems, compared to the other submitted algorithms.
- N. Hansen (2006). The CMA Evolution Strategy: A Comparing Review. In J.A. Lozano, P. Larrañaga, I. Inza and E. Bengoetxea (Eds.), Towards a new evolutionary computation. Advances in estimation of distribution algorithms. Springer, pp. 75-102.
  The CMA is approached and analyzed from the viewpoint of *estimating a search distribution*. Important differences to the so-called *estimation of distribution algorithms* (EDAs) are revealed. 1) In CMA the distribution of the selected *steps* is estimated, while in EDAs the selected *points* are modeled. This invariably leads to larger variances in CMA. In particular, a "pure EDA" converges prematurely on a linear function whereas a "pure CMA" (without step-size control) does not. 2) In EDAs the estimation is usually done from scratch, while in CMA the distribution is updated (at least for population size λ < 4*n*^{2}). Therefore in CMA the estimation can be done reliably *independent of the population size* and, for a small population size, in fewer function evaluations. 3) In CMA an additional global step-size adaptation is conducted, based on a different adaptation principle (cumulative step-size adaptation, or path length control), that often leads to a considerably larger population spread and prevents premature convergence even more effectively. Major concepts and design principles, namely change rates, invariance, and stationarity, are discussed.
- Jastrebski and Arnold (2006). Improving evolution strategies through active covariance matrix adaptation. In *2006 IEEE World Congress on Computational Intelligence, Proceedings*, pp. 9719-9726.
  A *negative* rank-μ update of the covariance matrix is introduced, using the μ worst of the λ individuals. All vectors in both the original and the negative rank-μ update have the same weight. For population sizes up to λ=4*n*, a notable speed-up compared to the original equally weighted rank-μ update can be observed. On the discus (or tablet) function, the speed-up exceeds a factor of two for large dimensions and small population sizes. While positive definiteness of the covariance matrix can no longer be guaranteed, empirically the covariance matrix always remains positive definite. For a larger population size, the learning parameter β presumably needs to be modified.
- Igel et al (2006). A Computational Efficient Covariance Matrix Update and a (1+1)-CMA for Evolution Strategies. In *Genetic and Evolutionary Computation Conference, GECCO 2006, Proceedings*, pp. 453-460.
  This paper introduces 1) the CMA into the (1+1)-ES, using an improved variant of a success-rule-based step-size adaptation in place of the path length control, and 2) an incremental *O(n^{2})* Cholesky update of the covariance matrix, replacing the original *O(n^{2})* covariance matrix update together with the iterative *O(n^{3})* eigendecomposition of the covariance matrix (the latter is usually postponed until after *n/10* generations). The resulting algorithm is an elegant CMA variant and accomplishes what is expected. Nevertheless, its practical relevance is limited: the (1+1)-ES is slightly faster than its ","-counterpart, but more prone to converge to local optima. Applying the Cholesky update prohibits utilizing the evolution path for the covariance matrix update. This is a significant disadvantage, as the evolution path considerably improves the performance on many functions; it was addressed in Suttorp et al (2009).
- Igel et al (2007). Covariance Matrix Adaptation for Multi-objective Optimization. *Evolutionary Computation*, 15(1), pp. 1-28.
  MO-CMA-ES is a multi-objective algorithm maintaining several (1+1)-CMA-ES algorithms in parallel to approximate the Pareto set. The algorithm is based on non-dominated sorting. Igel et al (2007) introduce steady-state selection into this algorithm. Voß et al (2010) improve the step-size adaptation.
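In the notation common to these papers, the covariance matrix update combining the rank-one term (evolution path p_c) with the rank-μ term can be written roughly as follows; this is a sketch, with y_{i:λ} = (x_{i:λ} − m)/σ denoting the i-th best step of the current generation:

```latex
C \leftarrow (1 - c_1 - c_\mu)\, C
  \;+\; c_1\, p_c\, p_c^{\mathsf{T}}
  \;+\; c_\mu \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}\, y_{i:\lambda}^{\mathsf{T}}
```

Active CMA-ES (Jastrebski and Arnold 2006) extends the sum over all λ individuals with negative weights w_i for the worst ones.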

- Ros and Hansen (2008). A Simple Modification in CMA-ES Achieving Linear Time and Space Complexity. In *Tenth International Conference on Parallel Problem Solving from Nature PPSN X, Proceedings*, pp. 296-305.
  A very simple change in the covariance matrix update restricts the changes to the diagonal elements of the matrix, also known as separable or diagonal CMA-ES. As the most important consequence, the learning rate for the covariance matrix update can be increased from roughly *O(1/n^{2})* to *O(1/n)*. Also, internal time and memory complexity become linear in the dimension. However, the main feature of CMA-ES, learning dependencies between variables, is abandoned. Yet, surprisingly, sep-CMA-ES is competitive on the Rosenbrock function in dimensions larger than one hundred.
- Hansen (2009). Benchmarking a BI-Population CMA-ES on the BBOB-2009 Function Testbed. *Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers*, pp. 2389-2396.
  This paper makes two somewhat significant algorithmic contributions. (i) The cumulation parameter setting *c*_{c} is improved for large population sizes such that it approaches 1/2 instead of 4/*n* when μ/*n* → ∞ and *n* ≫ 1. This improves the performance with very large population sizes roughly by a factor of two (not shown). (ii) The BIPOP restart scheme with two modes is invented: the original IPOP mode (Auger and Hansen 2005) and a mode with small population size. The resulting algorithm is a simple portfolio between the IPOP-CMA-ES and a CMA-ES with independent restarts, where the initial conditions for mean and step-size vary. The latter mode allows solving some weakly structured multimodal functions that IPOP-CMA-ES fails to solve. The benchmarking data provide a performance baseline; generally, however, algorithm portfolios may often not be the best baseline to choose.
- Glasmachers et al (2010). Exponential natural evolution strategies. In *Proceedings Companion of the 12th Annual Conference Companion on Genetic and Evolutionary Computation Conference*, pp. 393-400.
  This seminal paper introduces the exponential Natural Evolution Strategy (xNES) and shows its close connection to CMA-ES. The additive update of the covariance matrix C with a weighted empirical covariance matrix √C M √C and learning rate 2η can be expressed as a multiplicative update of C or, in first order, of √C with I + ηM if the weights sum to zero, or with I + η(M − I) if the weights sum to one. These multiplicative updates are in first order equivalent to the exponential multiplicative update by exp(ηM) or exp(η(M − I)). The exponential update guarantees that C remains positive definite. In xNES, all recombination weights are offset such that they sum to zero.
- Hansen and Ros (2010). Benchmarking a weighted negative covariance matrix update on the BBOB-2010 noiseless testbed. In *Proceedings Companion of the 12th Annual Conference Companion on Genetic and Evolutionary Computation Conference*, pp. 1673-1680.
  The idea of using different recombination weights (Hansen and Ostermeier 2001) is combined with the idea of active CMA-ES, using negative weights for the covariance matrix update (Jastrebski and Arnold 2006). This combination is straightforward and has become the default implementation of CMA-ES.
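In formulas, the first-order equivalence of additive, multiplicative, and exponential covariance updates discussed in the Glasmachers et al (2010) annotation (zero-sum-weights case, with M the weighted empirical covariance matrix expressed in the coordinate system of √C) reads, as a sketch:

```latex
C + 2\eta\, \sqrt{C}\, M \sqrt{C}
  \;=\; \sqrt{C}\,\bigl(I + 2\eta M\bigr)\,\sqrt{C}
  \;\approx\; \sqrt{C}\, \exp(2\eta M)\, \sqrt{C}
```

The right-hand side is positive definite for any symmetric M, which is the key property of the exponential update.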

- Akimoto et al (2010). Bidirectional relation between CMA evolution strategies and natural evolution strategies. In *11th International Conference on Parallel Problem Solving from Nature PPSN XI, Proceedings*, pp. 154-163.
  This paper fully establishes the link between a natural gradient descent in the parametrized space of Gaussian distributions and the update equations of the CMA-ES algorithm. Remarkably, the update of the covariance matrix can be (and has been) expressed without explicit computation of the Fisher matrix, which would often be prohibitively expensive. This link provides a deep theoretical foundation for CMA-ES.

- Loshchilov et al (2012). Alternative Restart Strategies for CMA-ES. In *12th International Conference on Parallel Problem Solving from Nature PPSN XII, Proceedings*, pp. 296-305.
  The initial step-size is an important parameter for the performance on multimodal functions. Surprisingly, a relatively small initial step-size can significantly outperform the usual default choice on some multimodal functions, even when a large population size is needed. This motivates more intricate variation schemes for the initial step-size that also use smaller step-sizes, not only when the population size is small.

- Hansen et al (2014). How to assess step-size adaptation mechanisms in randomised search. In *13th International Conference on Parallel Problem Solving from Nature PPSN XIII, Proceedings*, pp. 60-69.
  While this paper is not strictly about CMA-ES, it introduces a polished and fully parallelizable version of two-point adaptation (TPA), which has since become a secondary standard and default for step-size adaptation in CMA-ES. Two candidate solutions are sampled on the line of the last mean shift, symmetrically about the current mean. Their ranking difference is used to update the step-size. When combined with CMA-ES, the norm in the denominator of line 10 in Algorithm 3 is the Mahalanobis norm from the current covariance matrix; see also Akimoto & Hansen (2016), Eq. (6).
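The TPA mechanism described in the 2014 entry above can be sketched as follows. This is an illustrative toy, not the published algorithm: the function name, the constants c_s and d_s, and the normalization of the ranking difference are my own choices.

```python
import math

# Hedged sketch of the two-point step-size adaptation (TPA) idea: two
# probes are placed symmetrically along the last mean shift; if the
# probe in the direction of the shift ranks better, the step-size grows.
# Constants and the exact normalization are illustrative, not canonical.

def tpa_update(sigma, s, rank_fwd, rank_bwd, lam, c_s=0.3, d_s=2.0):
    # Normalized ranking difference in [-1, 1]: negative when the
    # forward probe (along the previous mean shift) ranks better.
    z = (rank_fwd - rank_bwd) / (lam - 1)
    s = (1 - c_s) * s + c_s * z           # smoothed ranking-difference path
    sigma = sigma * math.exp(-s / d_s)    # rank_fwd < rank_bwd => sigma grows
    return sigma, s

# If the forward probe ranks best (0) and the backward probe worst
# (lam - 1), the step-size increases:
sigma, s = tpa_update(sigma=1.0, s=0.0, rank_fwd=0, rank_bwd=9, lam=10)
print(sigma > 1.0)  # True
```

The smoothing path s plays the same role as the evolution path in cumulative step-size adaptation: it damps the reaction to a single noisy ranking.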

- 2014- to be continued.