autonlab.org

Autonlab.org

Covariant Policy Search
J. Andrew Bagnell and Jeff Schneider
Abstract
Inspired by the work of [Amari and Nagaoka, 2000], Kakade proposed a “natural gradient” algorithm. In particu- We investigate the problem of non-covariant behav- lar, [Kakade, 2002] proposed a scheme for generating a met- ior of policy gradient reinforcement learning algo- ric on parameter space that has interesting theoretical prop- rithms. The policy gradient approach is amenable erties. Most convincingly, Kakade showed strong empiri- to analysis by information geometric methods. This cal evidence for the algorithm’s usefulness. In particular, leads us to propose a natural metric on controller [Kakade, 2002] applies the algorithm to the Tetris problem parameterization that results from considering the of [Bertsekas and Tsitsiklis, 1996]. This problem is partic- manifold of probability distributions over paths in- ularly interesting as value function methods (as for example described in [Bertsekas and Tsitsiklis, 1996] ) demonstrate of this approach leads to a covariant gradient as- non-monotone performance; first policy iteration improves cent rule. Interesting properties of this rule are the policy dramatically, but after a small number of iterations discussed, including its relation with actor-critic the policy gets much worse. Normal gradient methods, in- style reinforcement learning algorithms. The al- cluding second-order and conjugate methods also prove very gorithms discussed here are computationally quite ineffective on this problem; even after a tremendous number efficient and on some interesting problems lead of rounds they only mildly increase the performance of the to dramatic performance improvement over non- game-player. The method presented in [Kakade, 2002] shows rapid performance improvement and achieves a significantlybetter policy than value-function methods (at their peak) in Introduction
Much recent work in reinforcement learning and stochastic However, despite recognizing an interesting defect in the optimal control has focused on algorithms that search directly general approach and intuiting an apparently powerful algo- through a space of policies rather than building approximate rithm, Kakade concludes that his method also fails to be co- value functions. Policy search has numerous advantages: do- variant leaving the problem open. We present here what we main knowledge may be easily encoded in a policy, the policy believe to be an appropriate solution.
may require less representational power than a value-function In Kakade’s work, there is no proper probability manifold, approximation, there are simple extensions to the multi-agent but rather a collection of such (one for each state) based on domain, and convergent algorithms are known. Furthermore, the policy. As such, Kakade must rely on an ad-hoc method policy search approaches have recently scored some encour- for generating a metric. Here instead we take a proper mani- aging successes [Bagnell and Schneider, 2001] [Baxter et al., fold, the distribution over paths induced by the controller, and 1999] including helicopter control and game-playing.
compute a metric based on that. In the special case appropri- One interesting aspect of existing gradient search algo- ate to the average reward RL formulation, Kakade’s scheme rithms, however, is that they are non-covariant; that is, a
is shown to give rise to a bona-fide natural metric, despite the simple re-parameterization of the policy typically leads to a observation in the paper that the learning was non-covariant.
different gradient direction. (Not that found by applying the We believe this is an artifact– perhaps of step-length. Fur- Jacobian of the re-parameterization to the gradient). This ther, we note that parametric invariance does not require
is a odd result; it is intuitively difficult to justify the gradi- this metric– there are numerous covariant metrics. Rather, ent computation as actually indicating the direction of steep- a stronger probabilistically natural invariance demands the est descent. This problem is well recognized in the pattern- metric used. We describe this result to motivate our method.
recognition and statistics communities and is a very active Importantly, the notion of the metric on the path- area of research. [Kakade, 2002] was, to the best of our distribution allows us to provide very simple and natural ex- knowledge the first to identify this problem in reinforcement tensions to Kakade’s algorithm that cover the finite horizon, learning and to suggest that techniques from information ge- discounted start-state, and partially-observed reinforcement ometry may prove valuable in its solution.
learning problems as well as the average reward one. Fi-

nally, for completeness, we describe the result discovered by Kakade relating the invariant metric and compatible actor- critic methods [Sutton et al., 1999].
Problem Setup and Notation
Figure 1: A two-state MDP. Each state has two controls one which A stochastic control problem consists of paths self transitions (and earns the reward that labels the state) and an- other that simply transitions to the other state. Rewards occur only , that is a function of a sequence of (finite) controls on the transitions between different states.
, indexed by time, from a space A. Throughout this paper we will be considering Partially Observed MarkovDecision Problems, in which there are state-variables state and next state transition probabilities We also typically have a sequence of outputs (observations) strategy that maps the history of observations be given in terms of memoryless stochastic controllers thatmap the current state to a distribution over actions, although Figure 2: Log odds of actions for policies ÐÓ Ô´ they can be easily extended to finite window or recurrent rep- on the two-state MDP (horizontal axis corresponds to state 0).
resentations that operate on only the observations. The goal Different curves correspond to varying . Notice the non-covariant in a control problem is to maximize the expected reinforce- always assumed to exist for all controllers. The reward func- state 1. In state 1, action 0 returns to state 1 and gets reward tion on paths in a POMDP is additive in time (or in the infi-nite time case, discounted or averaged), and a function 2. Action 1 transits to state 0 and achieves no reward.
Consider parameterized probabilistic policies of the form of state, although the algorithms discussed here are not neces- sarily predicated on that assumption. Reinforcement learning is the adaptive version of the control problem where we at- parameter that we use to demonstrate that even mildly differ- tept to maximize the expected reinforcment by sampling tra- ent parameterization lead to dramatically different behavior.
jectories and then improving our policy. We use the notation Below we plot the resulting track through the space of poli- cies using the log ratio of probabilities of the actions for each of the possible states. We start from a policy that makes it to denote the inner product with respect to metric somewhat more likely to choose action 0 in state 0 and action odds of the policy from state 1 (Figure 2).
The non-covariant behavior is quite clear in this graph.
Covariance and Riemannian manifolds
Further, we note that it is very difficult to achieve a goodpolicy using this algorithm from this starting point as the Meaning of steepest ascent
wrong action becomes overwhelmingly likely to be chosen For direct policy search methods, we are interested in finding from state 0. If sampling were used to compute the gradient, the direction of steepest ascent with respect to our current pa- we would nearly never get samples from state 1– which we rameters. That is, we wish to maximize the reward function Path distribution manifolds
reparameterize our controller in terms of, e.g. , and expressthe same effective infinitesimal policy change in terms of The control problem is essentially a coupled one of optimiza- tion and integration over the space of possible paths, ever, if we measure lengths using the naive dot product, the motivates the idea that instead of considering the (arbitrary) same effective change will not have the same size. It is this distance in terms of parameterization of the policy, we may problem that leads to the odd behavior of “steepest descent” consider distance in terms of the change in distribution over paths resulting from the policy change. We may view thedistribution over paths Non-covariant learning rule
To demonstrate the problem we first consider a simple two some work to visualize; as an example consider a space of state, two action MDP described in [Kakade02]. (Figure 1) three possible paths. All possible distributions over these In state 0 executing action 0 returns to the same state and gets three paths can be smoothly represented with two parameters.
a reward of 1. Executing action 1 get no reward but transits to In Figure 3 (left) we see one visualization with the embedding

and take derivatives with respect to each Figure 3: An example probability manifold over paths. Each axis represents the log probability of one of the three possible paths. The manifold is formed as we vary the parametersdefining it throughout their domain. On the right we attach the tangent space at a particular point in parameter space.
That is, the direction of steepest descent is simply the nor- mal gradient times the inverse of the metric evaluated at the point of tangency. We call this the “natural gradient” for the parameters, assuming no redundancy, we generate Reimannian manifold. [Amari and Nagaoka, 2000] dimensional manifold. In the case pictured, it is the set of all distributions on 3 paths, but in general, the dimension-ality of the path probability manifold will be tremendously Invariance and Chentsov’s Theorem
less than the number of possible paths. It is important to un- Some confusion seems to surround the choice of metric on derstand the manifold under consideration is that of the prob- probability manifolds. There is no unique answer for this ability distribution over paths and not of paths themselves.
metric, even under the supposition of parametric invariance The study of parameterized manifolds like that pictured as suggested by some authors. The issue is rather more sub- above is the domain of differential geometry. Here we are tle. Any metric that transforms under parameter change ac- interested in establishing a Riemannian structure on the man- cording to the Jacobian of the function connecting parameters ifold of paths. By that we mean we wish to establish a metric meets this requirement. This type of parametric covariance is on the tangent space (the local linear approximation to the a minimum requirement for us. It is natural to suggest a met- manifold at a point ), so we can measure small parameter ric that preserves essential probabilistic aspects of the mani- gent space (which is spanned by the partials with respect to congruent embeddings, or sufficient statistics, depending on your viewpoint) that carry one distribution over paths to an- definite matrix. This is a very natural thing to do– instead other in a natural and recoverable way. For example, con- of just the standard dot product, we have a dot product that sider the mapping between the two distributions allows us to represent rotations and scalings (exactly what a positive definite matrix can represent) and that can vary throughout the manifold. In Figure 3 (right) we depict thetangent space at a point in our parameterization as the locallinear approximation at that point.
Steepest ascent on Riemannian manifold
There are two questions that then naturally arise– what is steepest descent on a function defined over a Riemannian Paths’
space, and is there a Riemannian metric on the manifold ofthe paths that is in some sense natural. We quickly answer the first question, and pursue the second in the next two sec- and splits one path into two (with probability 1/2 for each tions. A Lagrange multiplier argument (given schematically In a probabilistically fundamental way, the mani- here) makes it easy to see the form of the steepest descent a smoothly parameterized set of distributions) are similarto each other. For each path¼ we can uniquely recover theoriginal probability. In this way, a parameterized distribu- tion over paths can be embedded into a different path space.
(This equivalence can actually be phrased in a category the- oretic way where the morphisms are these congruent embed- dings.) [Chentsov, 1980] General congruent embeddings can

be thought of as simply generalizations of the example de- – Compute
picted above to allow arbitrary permutations and arbitrary probabilities for different paths stemming from a single path, as well as compositions of these two. In the control problem this could arise by simple permutation of the state variables We use the Markov property and the invariance of the tran- or change of co-ordinates to ones with redundant information.
sition probabilities (given the actions) in the algorithm above It is natural to require that with our metric the congruent em- bedding is an isometry on the tangent space. (i.e. preserves tracted from the proofs below, as well as other, potentially the length of the vectors). That is, if we make a change of To compute the natural gradient, we simply invert this ma- trix and multiply it with the gradient computed using standard with respect to congruent embeddings leads to a unique (up toscale) metric on the manifold. [Chentsov, 1980] This metric Limiting metric for infinite horizon problems
is well-known in statistical inference as the Fisher-Rao metric A well-known result in statistics [DeGroot, 1970] gives a dif- and it can be written as the Fisher information matrix [DeG- ferent form for the Fisher metric under appropriate regularity conditions. We quickly derive that result, and then show howthis gives us a simple form for the path probability metric as Another way to derive the metric is to think about “dis- tances” on probability spaces. The KL-divergence (or rel- ative entropy) between two distributions is a natural diver- gence on changes in a distribution. It is also manifestly in- variant to re-parameterization. If we think about derivatives as small changes in parameters (differentials) then we dis- cover that we also get a unique metric as the second-orderTaylor expansion of the KL-divergence. This too agrees with the Fisher information (up to scale). We note that the di- rection of the KL-divergence is irrelevant as to second-order Fisher-Rao Metric on the Path-space
Manifold
The issue now becomes how to derive the Fisher metric on The third line follows from the second by integrating by the space of path distributions. It turns out in the case of parts and the fifth follows by observing that the total proba- processes with underlying Markovian state to be rather easy, and involves only computations we already make in the likeli- Now we show that in the limit this leads to a simple metric hood ratio approach standard in gradient-based reinforcement for the infinite horizon case. To get a convergent metric, we must normalize the metric, in particular by the total lengthof the path denoted . Since the metric is defined only up to Derivation of finite-time path metric
Theorem 1 (Infinite-Horizon Metric) For
Markov process the Fisher information matrix limits to the do. The essential algorithm within gradient methods like expected Fisher information of the policy for each state and REINFORCE and GPOMDP is a clever method of computing control under stationary distribution of states and actions. , the expected score function. Thus, while to indicate the t-step finite-horizon metric. the expected score is the gradient, the correlation of thescore is the Fisher matrix. The following simple algorithm is unbiased and converges almost surely (under the same regularity conditions as in [Baxter et al., 1999] as the number Algorithm 1 (Finite-horizon metric computation) For in
For a Markov process we can write the likelihood ratio

Using the chain rule we continue the derivation with indicating the stationary distribution: bution of states. Thus the infinite horizon (undiscounted) andstart-state metrics give essentially the same metric, with only the effective weighting by states differing.
Metrics for Partially Observed Problems
For policies that map the observation space of a partially- observed markov decision process into distributions over ac-tions, it is just as easy to derive appropriate metric using our with only subtle changes to the arguments above we end up with the same metric except using the limiting distribution of Relation to Compatible Value Function
Actor Critic
Kakade noticed a fascinating connection between the limit- ing metric given in Theorem 1 and the improvement directioncomputed by a class of actor-critic methods that use a special where the second equality follows by noting that the like- compatible function approximation technique.[Sutton et al., lihood ratio is independent of transition probability matrix 1999; Konda and Tsitsiklis, 2002] Following Kakade we let given the action probability, and by application of the ergodic theorem. The last line follows (with tion for the policy at a given state) as the second term in line3 vanishes since the total probability is constant with respect It is also interesting to consider what happens in the start- This type of value function approximator was initially sug- state, discounted formulation of the problem. In this case we gested as it may be used to compute the true gradient. In prac- would like our metric, using the general definition over path tice, it has not been clear what advantage this brings over the distributions given above, to naturally weigh the start start gradient estimation routines of [Baxter et al., 1999]. How- more than it necessarily is in the infinite horizon average case.
ever, “folk-wisdom” has it that performing an that infinitesi- mal policy iteration (moving in the direction of the best policy undiscounted problem where each trajectory terminates with at each step. We can use this fact to derive good properties and often significantly outperforms standard a metric more appropriate for the discounted formalism.
gradient methods. The natural gradient provides insight intothis behavior. Let Theorem 2 (Start-State Metric) For a discounted Markov
process the Fisher information matrix equals the Fisher in-formation of the policy for each state and control under the limiting distribution of states and actions. proof is very similiar to the infinite horizon case and so wesimply sketch it: is the exact advantage function. [Sutton et al., 1999] It is easy to check that ([Kakade, 2002], Theo- policy improvement is exactly the natural gradient direction.
This can be seen by simply differentiating ( ; ) to minimize and noting the result of [Sutton et al., Demonstration
As a consequence of the results of section 5 and [Kakade, 2002], experimental results already exist demonstrating the effectiveness of the natural gradient method. That is, all re- sults using policy improvement with compatible function ap- proximators are implicitly computing this result. As a demon- stration, we computed analytically the natural gradient for the ing. Unfortunately, working in the space of policies, it was difficult to generate such an algorithm. Here we proposed considering the induced path-distribution manifold and used notions from information geometry to propose a natural co- variant algorithm. This leads to interesting insight and a prac- tical algorithm. Fortunately, it agrees with the heuristic sug- gested by Kakade (despite the suggestion in the paper that the algorithm there was actually not covariant) in the infinite horizon case and extends to cover new problems. [Peters et al., 2002] independently developed theory related to ours (inparticular the theorems in Section 4.2) and presented results Figure 4: Performance comparison of covariant (nat and scaled) in the context of robot dynamic control.
and non-covariant (grad) on the 2-state MDP. The covariant learning Further work will provide more investigation into the ex- algorithm dramatically outperforms the standard gradient algorithm.
perimental behavior of the algorithms presented. Future ef- The standard method achieves J=2 after more than 1000 iterations.
fort may also yield deeper insight into the relationship be-tween the method presented here and value-function approx- References
[Amari and Nagaoka, 2000] S. Amari and H. Nagaoka.
Methods of Information Geometry.
[Bagnell and Schneider, 2001] J. Bagnell and J. Schneider.
Autonomous helicopter control by policy-search based re- inforcement learning. In Proceedings of the 2001 IEEEInt. Conference on Robotics and Automation. IEEE, 2001.
Figure 5: Path through policy space (showing log-odds of the ac- [Baxter et al., 1999] J. Baxter, L. Weaver, and P. Bartlett.
tions with state 0 along the horizontal axis) of covariant (nat and Direct-gradient-based reinforcement learning I: Gradient scaled) and non-covariant (grad) algorithms on the 2-state MDP.
estimation algorithms. Technical report, Computer Sci- Note that, as required, the path is independent of scaling for the ences Lab, Australian National University, 1999.
[Bertsekas and Tsitsiklis, 1996] D. Bertsekas and J. Tsitsik- problem given in Section 2.2. This problem is interesting as it reveals some of the properties of the natural gradient method.
First it is easily checked that for a complete Boltzmann policy [Chentsov, 1980] N.N. Chentsov. Statistical Decision Rules and Optimal Inference. American Mathematical Society,1980.
[DeGroot, 1970] M. DeGroot. Optimal Statistical Decisions.
This means that it is very similiar to the standard gradient [Kakade, 2002] S. Kakade. A natural policy gradient. In Ad- except that it removes the weighting of states by the stationary vances in Neural Information Processing Systems (NIPS- distribution. Rather, each state is treated equally. This leads to much more reasonable results in the problem discussed as [Konda and Tsitsiklis, 2002] V. Konda and J. Tsitsiklis.
the partial derivative component for state 1 does not shrink as Actor-critic algorithms. to appear in the SIAM Journal on Control and Optimization, 2002.
In Figure 4 we plot the performance of the natural gradi- ent (using two different scalings) and the standard gradient [Peters et al., 2002] Jan Peters, Sethu Vijaykumar, and Ste- methods in optimizing the policy for this problem.
Policy gradient methods for robot control.
It is interesting to note that in this graph the behavior of the natural gradient descent algorithm appears to be non- [Sutton et al., 1999] R. Sutton, D. McAllester, S. Singh, and covariant. This is simply due to the step size heuristic not Y. Mansour. Policy gradient methods for reinforcement computing equivalent steps in the policy space. The direction learning with function approximation. In Neural Informa- however is constant as illustrated in Figure 5.
tion Processing Systems 12, 1999.
Conclusions
[Kakade, 2002] suggested that covariant behavior of gradientalgorithms is an important desiderata in reinforcement learn-

Source: http://www.autonlab.org/autonweb/14625/version/3/part/5/data/bagnellCovariant.pdf?branch=main&language=en

Untitled-4

Beximco Pharmaceuticals Ltd. ( Beximco Pharma) belongs to Beximco Group, the largest privat va e sector industr ial conglomerat e in Bangladesh. The company is a leading man uf u act urer of pharmaceut icals - bot h finished formulat ions and Act ive Pharmaceut ical I ngredient s ( APIs). The st at e-of-t he-art man uf u act uring facilit ili ies of t he comp

Microsoft word - editors note_lipoproteins and cvd in diabetes.doc

The Review of Diabetic Studies Cardiovascular Disease, Lipids, and Diabetes Herewith, I cordially invite you participating in the Special Issue Project for The Review of Diabetic Studies (http://www.the-rds.org) on Cardiovascular Disease, Lipids, and Diabetes . There are many controversies and unanswered questions relating to the care of diabetic patients with dyslipidemia.