Covariant Policy Search
J. Andrew Bagnell and Jeff Schneider
Inspired by the work of [Amari and Nagaoka, 2000],
Kakade proposed a “natural gradient” algorithm. In particu-
We investigate the problem of non-covariant behav-
lar, [Kakade, 2002] proposed a scheme for generating a met-
ior of policy gradient reinforcement learning algo-
ric on parameter space that has interesting theoretical prop-
rithms. The policy gradient approach is amenable
erties. Most convincingly, Kakade showed strong empiri-
to analysis by information geometric methods. This
cal evidence for the algorithm’s usefulness. In particular,
leads us to propose a natural metric on controller
[Kakade, 2002] applies the algorithm to the Tetris problem
parameterization that results from considering the
of [Bertsekas and Tsitsiklis, 1996]. This problem is partic-
manifold of probability distributions over paths in-
ularly interesting as value function methods (as for example
described in [Bertsekas and Tsitsiklis, 1996] ) demonstrate
of this approach leads to a covariant gradient as-
non-monotone performance; first policy iteration improves
cent rule. Interesting properties of this rule are
the policy dramatically, but after a small number of iterations
discussed, including its relation with actor-critic
the policy gets much worse. Normal gradient methods, in-
style reinforcement learning algorithms. The al-
cluding second-order and conjugate methods also prove very
gorithms discussed here are computationally quite
ineffective on this problem; even after a tremendous number
efficient and on some interesting problems lead
of rounds they only mildly increase the performance of the
to dramatic performance improvement over non-
game-player. The method presented in [Kakade, 2002] shows
rapid performance improvement and achieves a significantlybetter policy than value-function methods (at their peak) in
Much recent work in reinforcement learning and stochastic
However, despite recognizing an interesting defect in the
optimal control has focused on algorithms that search directly
general approach and intuiting an apparently powerful algo-
through a space of policies rather than building approximate
rithm, Kakade concludes that his method also fails to be co-
value functions. Policy search has numerous advantages: do-
variant leaving the problem open. We present here what we
main knowledge may be easily encoded in a policy, the policy
believe to be an appropriate solution.
may require less representational power than a value-function
In Kakade’s work, there is no proper probability manifold,
approximation, there are simple extensions to the multi-agent
but rather a collection of such (one for each state) based on
domain, and convergent algorithms are known. Furthermore,
the policy. As such, Kakade must rely on an ad-hoc method
policy search approaches have recently scored some encour-
for generating a metric. Here instead we take a proper mani-
aging successes [Bagnell and Schneider, 2001] [Baxter et al.
fold, the distribution over paths induced by the controller, and
1999] including helicopter control and game-playing.
compute a metric based on that. In the special case appropri-
One interesting aspect of existing gradient search algo-
ate to the average reward RL formulation, Kakade’s scheme
rithms, however, is that they are non-covariant;
that is, a
is shown to give rise to a bona-fide natural metric, despite the
simple re-parameterization of the policy typically leads to a
observation in the paper that the learning was non-covariant.
different gradient direction. (Not that found by applying the
We believe this is an artifact– perhaps of step-length. Fur-
Jacobian of the re-parameterization to the gradient). This
ther, we note that parametric invariance does not
is a odd result; it is intuitively difficult to justify the gradi-
this metric– there are numerous covariant metrics. Rather,
ent computation as actually indicating the direction of steep-
a stronger probabilistically natural invariance demands the
est descent. This problem is well recognized in the pattern-
metric used. We describe this result to motivate our method.
recognition and statistics communities and is a very active
Importantly, the notion of the metric on the path-
area of research. [Kakade, 2002] was, to the best of our
distribution allows us to provide very simple and natural ex-
knowledge the first to identify this problem in reinforcement
tensions to Kakade’s algorithm that cover the finite horizon,
learning and to suggest that techniques from information ge-
discounted start-state, and partially-observed reinforcement
ometry may prove valuable in its solution.
learning problems as well as the average reward one. Fi-
nally, for completeness, we describe the result discovered by
Kakade relating the invariant metric and compatible actor-
critic methods [Sutton et al.
Problem Setup and Notation
Figure 1: A two-state MDP. Each state has two controls one which
A stochastic control problem consists of paths
self transitions (and earns the reward that labels the state) and an-
other that simply transitions to the other state. Rewards occur only
, that is a function of a sequence of (finite) controls
on the transitions between different states.
, indexed by time, from a space A. Throughout
this paper we will be considering Partially Observed MarkovDecision Problems, in which there are state-variables
state and next state transition probabilities
We also typically have a sequence of outputs (observations)
strategy that maps the history of observations
be given in terms of memoryless stochastic controllers thatmap the current state to a distribution over actions, although
Figure 2: Log odds of actions for policies ÐÓ Ô´
they can be easily extended to finite window or recurrent rep-
on the two-state MDP (horizontal axis corresponds to state 0).
resentations that operate on only the observations. The goal
Different curves correspond to varying . Notice the non-covariant
in a control problem is to maximize the expected reinforce-
always assumed to exist for all controllers. The reward func-
state 1. In state 1, action 0 returns to state 1 and gets reward
tion on paths in a POMDP is additive in time (or in the infi-nite time case, discounted or averaged), and a function
2. Action 1 transits to state 0 and achieves no reward.
Consider parameterized probabilistic policies of the form
of state, although the algorithms discussed here are not neces-
sarily predicated on that assumption. Reinforcement learning
is the adaptive version of the control problem where we at-
parameter that we use to demonstrate that even mildly differ-
tept to maximize the expected reinforcment by sampling tra-
ent parameterization lead to dramatically different behavior.
jectories and then improving our policy. We use the notation
Below we plot the resulting track through the space of poli-
cies using the log ratio of probabilities of the actions for each
of the possible states. We start from a policy that makes it
to denote the inner product with respect to metric
somewhat more likely to choose action 0 in state 0 and action
odds of the policy from state 1 (Figure 2).
The non-covariant behavior is quite clear in this graph.
Covariance and Riemannian manifolds
Further, we note that it is very
difficult to achieve a goodpolicy using this algorithm from this starting point as the
Meaning of steepest ascent
action becomes overwhelmingly likely to be chosen
For direct policy search methods, we are interested in finding
from state 0. If sampling were used to compute the gradient,
the direction of steepest ascent
with respect to our current pa-
we would nearly never get samples from state 1– which we
rameters. That is, we wish to maximize the reward function
Path distribution manifolds
reparameterize our controller in terms of, e.g. , and expressthe same effective infinitesimal policy change in terms of
The control problem is essentially a coupled one of optimiza-
tion and integration over the space of possible paths,
ever, if we measure lengths using the naive dot product, the
motivates the idea that instead of considering the (arbitrary)
same effective change will not have the same size. It is this
distance in terms of parameterization of the policy, we may
problem that leads to the odd behavior of “steepest descent”
consider distance in terms of the change in distribution over
paths resulting from the policy change. We may view thedistribution over paths
Non-covariant learning rule
To demonstrate the problem we first consider a simple two
some work to visualize; as an example consider a space of
state, two action MDP described in [Kakade02]. (Figure 1)
three possible paths. All possible distributions over these
In state 0 executing action 0 returns to the same state and gets
three paths can be smoothly represented with two parameters.
a reward of 1. Executing action 1 get no reward but transits to
In Figure 3 (left) we see one visualization with the embedding
and take derivatives with respect to each
Figure 3: An example probability manifold over paths. Each
axis represents the log probability of one of the three possible
paths. The manifold is formed as we vary the parametersdefining it throughout their domain. On the right we attach
the tangent space at a particular point in parameter space.
That is, the direction of steepest descent is simply the nor-
mal gradient times the inverse of the metric evaluated at the
point of tangency. We call this the “natural gradient” for the
parameters, assuming no redundancy, we generate
Reimannian manifold. [Amari and Nagaoka, 2000]
dimensional manifold. In the case pictured, it is the set
of all distributions on 3 paths, but in general, the dimension-ality of the path probability manifold will be tremendously
Invariance and Chentsov’s Theorem
less than the number of possible paths. It is important to un-
Some confusion seems to surround the choice of metric on
derstand the manifold under consideration is that of the prob-
probability manifolds. There is no
unique answer for this
ability distribution over paths and not of paths themselves.
metric, even under the supposition of parametric invariance
The study of parameterized manifolds like that pictured
as suggested by some authors. The issue is rather more sub-
above is the domain of differential geometry. Here we are
tle. Any metric that transforms under parameter change ac-
interested in establishing a Riemannian
structure on the man-
cording to the Jacobian of the function connecting parameters
ifold of paths. By that we mean we wish to establish a metric
meets this requirement. This type of parametric covariance is
on the tangent space (the local linear approximation to the
a minimum requirement for us. It is natural to suggest a met-
manifold at a point ), so we can measure small parameter
ric that preserves essential probabilistic aspects of the mani-
gent space (which is spanned by the partials with respect to
congruent embeddings, or sufficient statistics, depending on
your viewpoint) that carry one distribution over paths to an-
definite matrix. This is a very natural thing to do– instead
other in a natural and recoverable way. For example, con-
of just the standard dot product, we have a dot product that
sider the mapping between the two distributions
allows us to represent rotations and scalings (exactly what
a positive definite matrix can represent) and that can vary
throughout the manifold. In Figure 3 (right) we depict thetangent space at a point in our parameterization as the locallinear approximation at that point.
Steepest ascent on Riemannian manifold
There are two questions that then naturally arise– what is
steepest descent on a function defined over a Riemannian
space, and is there a Riemannian metric on the manifold ofthe paths that is in some sense natural. We quickly answer
the first question, and pursue the second in the next two sec-
and splits one path into two (with probability 1/2 for each
tions. A Lagrange multiplier argument (given schematically
In a probabilistically fundamental way, the mani-
here) makes it easy to see the form of the steepest descent
a smoothly parameterized set of distributions) are similarto each other. For each path¼ we can uniquely recover theoriginal probability. In this way, a parameterized distribu-
tion over paths can be embedded into a different path space.
(This equivalence can actually be phrased in a category the-
oretic way where the morphisms are these congruent embed-
dings.) [Chentsov, 1980] General congruent embeddings can
be thought of as simply generalizations of the example de-
picted above to allow arbitrary permutations and arbitrary
probabilities for different paths stemming from a single path,
as well as compositions of these two. In the control problem
this could arise by simple permutation of the state variables
We use the Markov property and the invariance of the tran-
or change of co-ordinates to ones with redundant information.
sition probabilities (given the actions) in the algorithm above
It is natural to require that with our metric the congruent em-
bedding is an isometry on the tangent space. (i.e. preserves
tracted from the proofs below, as well as other, potentially
the length of the vectors). That is, if we make a change of
To compute the natural gradient, we simply invert this ma-
trix and multiply it with the gradient computed using standard
with respect to congruent embeddings leads to a unique (up toscale) metric on the manifold. [Chentsov, 1980] This metric
Limiting metric for infinite horizon problems
is well-known in statistical inference as the Fisher-Rao metric
A well-known result in statistics [DeGroot, 1970] gives a dif-
and it can be written as the Fisher information matrix [DeG-
ferent form for the Fisher metric under appropriate regularity
conditions. We quickly derive that result, and then show howthis gives us a simple form for the path probability metric as
Another way to derive the metric is to think about “dis-
tances” on probability spaces. The KL-divergence (or rel-
ative entropy) between two distributions is a natural diver-
gence on changes in a distribution. It is also manifestly in-
variant to re-parameterization. If we think about derivatives
as small changes in parameters (differentials) then we dis-
cover that we also get a unique metric as the second-orderTaylor expansion of the KL-divergence. This too agrees with
the Fisher information (up to scale). We note that the di-
rection of the KL-divergence is irrelevant as to second-order
Fisher-Rao Metric on the Path-space
The issue now becomes how to derive the Fisher metric on
The third line follows from the second by integrating by
the space of path distributions. It turns out in the case of
parts and the fifth follows by observing that the total proba-
processes with underlying Markovian state to be rather easy,
and involves only computations we already make in the likeli-
Now we show that in the limit this leads to a simple metric
hood ratio approach standard in gradient-based reinforcement
for the infinite horizon case. To get a convergent metric, we
must normalize the metric, in particular by the total lengthof the path denoted . Since the metric is defined only up to
Derivation of finite-time path metric
Theorem 1 (Infinite-Horizon Metric) For
Markov process the Fisher information matrix limits to the
do. The essential algorithm within gradient methods like
expected Fisher information of the policy for each state and
REINFORCE and GPOMDP is a clever method of computing
control under stationary distribution of states and actions.
, the expected score function. Thus, while
to indicate the t-step finite-horizon metric.
the expected score is the gradient, the correlation of thescore is the Fisher matrix. The following simple algorithm
is unbiased and converges almost surely (under the same
regularity conditions as in [Baxter et al.
, 1999] as the number
Algorithm 1 (Finite-horizon metric computation) For in
For a Markov process we can write the likelihood ratio
Using the chain rule we continue the derivation with
indicating the stationary distribution:
bution of states. Thus the infinite horizon (undiscounted) andstart-state metrics give essentially the same metric, with only
the effective weighting by states differing.
Metrics for Partially Observed Problems
For policies that map the observation space of a partially-
observed markov decision process into distributions over ac-tions, it is just as easy to derive appropriate metric using our
with only subtle changes to the arguments above we end up
with the same metric except using the limiting distribution of
Relation to Compatible Value Function
Kakade noticed a fascinating connection between the limit-
ing metric given in Theorem 1 and the improvement directioncomputed by a class of actor-critic methods that use a special
where the second equality follows by noting that the like-
compatible function approximation technique.[Sutton et al.
lihood ratio is independent of transition probability matrix
1999; Konda and Tsitsiklis, 2002] Following Kakade we let
given the action probability, and by application of the ergodic
theorem. The last line follows (with
tion for the policy at a given state) as the second term in line3 vanishes since the total probability is constant with respect
It is also interesting to consider what happens in the start-
This type of value function approximator was initially sug-
state, discounted formulation of the problem. In this case we
gested as it may be used to compute the true gradient. In prac-
would like our metric, using the general definition over path
tice, it has not been clear what advantage this brings over the
distributions given above, to naturally weigh the start start
gradient estimation routines of [Baxter et al.
, 1999]. How-
more than it necessarily is in the infinite horizon average case.
ever, “folk-wisdom” has it that performing an that infinitesi-
mal policy iteration (moving in the direction of the best policy
undiscounted problem where each trajectory terminates with
at each step. We can use this fact to derive
good properties and often significantly outperforms standard
a metric more appropriate for the discounted formalism.
gradient methods. The natural gradient provides insight intothis behavior. Let
Theorem 2 (Start-State Metric) For a discounted Markov
process the Fisher information matrix equals the Fisher in-formation of the policy for each state and control under the
limiting distribution of states and actions.
proof is very similiar to the infinite horizon case and so wesimply sketch it:
is the exact advantage function. [Sutton et
, 1999] It is easy to check that ([Kakade, 2002], Theo-
policy improvement is exactly the natural gradient direction.
This can be seen by simply differentiating ( ; ) to minimize
and noting the result of [Sutton et al.
As a consequence of the results of section 5 and [Kakade,
2002], experimental results already exist demonstrating the
effectiveness of the natural gradient method. That is, all re-
sults using policy improvement with compatible function ap-
proximators are implicitly computing this result. As a demon-
stration, we computed analytically the natural gradient for the
ing. Unfortunately, working in the space of policies, it was
difficult to generate such an algorithm. Here we proposed
considering the induced path-distribution manifold and used
notions from information geometry to propose a natural co-
variant algorithm. This leads to interesting insight and a prac-
tical algorithm. Fortunately, it agrees with the heuristic sug-
gested by Kakade (despite the suggestion in the paper that
the algorithm there was actually not covariant) in the infinite
horizon case and extends to cover new problems. [Peters et
, 2002] independently developed theory related to ours (inparticular the theorems in Section 4.2) and presented results
Figure 4: Performance comparison of covariant (nat and scaled)
in the context of robot dynamic control.
and non-covariant (grad) on the 2-state MDP. The covariant learning
Further work will provide more investigation into the ex-
algorithm dramatically outperforms the standard gradient algorithm.
perimental behavior of the algorithms presented. Future ef-
The standard method achieves J=2 after more than 1000 iterations.
fort may also yield deeper insight into the relationship be-tween the method presented here and value-function approx-
[Amari and Nagaoka, 2000] S. Amari and H. Nagaoka.
Methods of Information Geometry
[Bagnell and Schneider, 2001] J. Bagnell and J. Schneider.
Autonomous helicopter control by policy-search based re-
inforcement learning. In Proceedings of the 2001 IEEEInt. Conference on Robotics and Automation
. IEEE, 2001.
Figure 5: Path through policy space (showing log-odds of the ac-
[Baxter et al.
, 1999] J. Baxter, L. Weaver, and P. Bartlett.
tions with state 0 along the horizontal axis) of covariant (nat and
Direct-gradient-based reinforcement learning I: Gradient
scaled) and non-covariant (grad) algorithms on the 2-state MDP.
estimation algorithms. Technical report, Computer Sci-
Note that, as required, the path is independent of scaling for the
ences Lab, Australian National University, 1999.
[Bertsekas and Tsitsiklis, 1996] D. Bertsekas and J. Tsitsik-
problem given in Section 2.2. This problem is interesting as it
reveals some of the properties of the natural gradient method.
First it is easily checked that for a complete Boltzmann policy
[Chentsov, 1980] N.N. Chentsov. Statistical Decision Rules
and Optimal Inference
. American Mathematical Society,1980.
[DeGroot, 1970] M. DeGroot. Optimal Statistical Decisions
This means that it is very similiar to the standard gradient
[Kakade, 2002] S. Kakade. A natural policy gradient. In Ad-
except that it removes the weighting of states by the stationary
vances in Neural Information Processing Systems (NIPS-
distribution. Rather, each state is treated equally. This leads
to much more reasonable results in the problem discussed as
[Konda and Tsitsiklis, 2002] V. Konda and J. Tsitsiklis.
the partial derivative component for state 1 does not shrink as
Actor-critic algorithms. to appear in the SIAM Journal
on Control and Optimization
In Figure 4 we plot the performance of the natural gradi-
ent (using two different scalings) and the standard gradient
[Peters et al.
, 2002] Jan Peters, Sethu Vijaykumar, and Ste-
methods in optimizing the policy for this problem.
Policy gradient methods for robot control.
It is interesting to note that in this graph the behavior of
the natural gradient descent algorithm appears to be non-
[Sutton et al.
, 1999] R. Sutton, D. McAllester, S. Singh, and
covariant. This is simply due to the step size heuristic not
Y. Mansour. Policy gradient methods for reinforcement
computing equivalent steps in the policy space. The direction
learning with function approximation. In Neural Informa-
s constant as illustrated in Figure 5.
tion Processing Systems 12
[Kakade, 2002] suggested that covariant behavior of gradientalgorithms is an important desiderata in reinforcement learn-
Beximco Pharmaceuticals Ltd. ( Beximco Pharma) belongs to Beximco Group, the largest privat va e sector industr ial conglomerat e in Bangladesh. The company is a leading man uf u act urer of pharmaceut icals - bot h finished formulat ions and Act ive Pharmaceut ical I ngredient s ( APIs). The st at e-of-t he-art man uf u act uring facilit ili ies of t he comp
The Review of Diabetic Studies Cardiovascular Disease, Lipids, and Diabetes Herewith, I cordially invite you participating in the Special Issue Project for The Review of Diabetic Studies (http://www.the-rds.org) on Cardiovascular Disease, Lipids, and Diabetes . There are many controversies and unanswered questions relating to the care of diabetic patients with dyslipidemia.