Expectation-maximization Policy Search with Parameter-based Exploration
Abstract: Stochastic exploration in action space tends to produce gradient estimates with large variance. To address this problem, an expectation-maximization (EM) policy search method for reinforcement learning with parameter-based exploration is proposed. First, the policy is defined as a probability distribution over the parameters of a controller. Second, samples are collected by drawing controller parameters directly from this distribution several times. Because the actions selected within each episode are deterministic, sampling introduces less variance, which in turn reduces the variance of the gradient estimate. Finally, based on the collected samples, the policy parameters are updated iteratively by maximizing a lower bound of the expected return. To reduce sampling time and cost, importance sampling is used to reuse samples collected during the policy update process. Simulation results on two continuous-space control problems show that, compared with policy search reinforcement learning methods based on stochastic exploration in action space, the proposed method learns a better policy and converges faster, and therefore has better learning performance.
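The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 1-D plant, the cost function, the scalar Gaussian over a single controller gain `theta`, the `exp(-cost)` return, and all names (`episode_return`, `gauss_pdf`, `em_policy_search`) are assumptions chosen for illustration. The update shown is a generic return-weighted maximum-likelihood fit, and the sample reuse is a simple heuristic that reweights only the previous batch by the density ratio.

```python
import math
import random


def episode_return(theta, steps=20):
    """Roll out the deterministic controller a = -theta * x on a toy 1-D plant."""
    x, cost = 1.0, 0.0
    for _ in range(steps):
        a = -theta * x                # deterministic action: no action noise
        cost += x * x + 0.01 * a * a  # quadratic state and control cost (assumed)
        x += 0.1 * a                  # toy linear dynamics (assumed)
    return math.exp(-cost)            # non-negative return, usable as an EM weight


def gauss_pdf(theta, mu, sigma):
    """Density of N(mu, sigma^2) at theta."""
    z = (theta - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_policy_search(iters=15, n_samples=20, seed=0):
    rng = random.Random(seed)
    mu, sigma = 0.0, 2.0              # policy = Gaussian over the controller gain
    old_batch = []                    # (theta, return, sampling density) of previous batch
    for _ in range(iters):
        # Exploration happens in parameter space: one draw of theta per episode.
        batch = []
        for _ in range(n_samples):
            theta = rng.gauss(mu, sigma)
            batch.append((theta, episode_return(theta), gauss_pdf(theta, mu, sigma)))
        # Fresh samples keep weight 1; reused samples are scaled by
        # p(theta | current policy) / p(theta | policy that generated them).
        weighted = [(th, ret) for th, ret, _ in batch]
        weighted += [(th, ret * gauss_pdf(th, mu, sigma) / q)
                     for th, ret, q in old_batch if q > 0.0]
        total = sum(w for _, w in weighted)
        if total > 0.0:
            # Return-weighted maximum-likelihood fit of the Gaussian.
            new_mu = sum(w * th for th, w in weighted) / total
            var = sum(w * (th - new_mu) ** 2 for th, w in weighted) / total
            mu, sigma = new_mu, max(math.sqrt(var), 1e-3)  # floor keeps sigma positive
        old_batch = batch
    return mu, sigma
```

Calling `em_policy_search()` returns the final mean and standard deviation of the distribution over `theta`. Because each rollout is deterministic given `theta`, the only randomness in a sample comes from the single parameter draw at the start of the episode.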