[1] Busoniu L, Babuska R, De Schutter B. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2008, 38(2): 156-172
[2] Kaelbling L P, Littman M L, Moore A W. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 1996, 4: 237-285
[3] Chen Xue-Song, Yang Yi-Min. Reinforcement learning: survey of recent work. Application Research of Computers, 2010, 27(8): 2834-2838, 2844 (陈学松, 杨宜民. 强化学习研究综述. 计算机应用研究, 2010, 27(8): 2834-2838, 2844)
[4] Cheng Yu-Hu, Feng Huan-Ting, Wang Xue-Song. Policy iteration reinforcement learning based on geodesic Gaussian basis defined on state-action graph. Acta Automatica Sinica, 2011, 37(1): 44-51 (程玉虎, 冯涣婷, 王雪松. 基于状态-动作图测地高斯基的策略迭代强化学习. 自动化学报, 2011, 37(1): 44-51)
[5] Xu Xin, Shen Dong, Gao Yan-Qing, Wang Kai. Learning control of dynamical systems based on Markov decision processes: research frontiers and outlooks. Acta Automatica Sinica, 2012, 38(5): 673-687 (徐昕, 沈栋, 高岩青, 王凯. 基于马氏决策过程模型的动态系统学习控制: 研究前沿与展望. 自动化学报, 2012, 38(5): 673-687)
[6] Busoniu L, De Schutter B, Babuska R. Approximate dynamic programming and reinforcement learning. In: Interactive Collaborative Information Systems. Studies in Computational Intelligence, vol. 281. Berlin, Heidelberg: Springer, 2010. 3-44
[7] Wang Xue-Song, Tian Xi-Lan, Cheng Yu-Hu, Yi Jian-Qiang. Q-learning system based on cooperative least squares support vector machine. Acta Automatica Sinica, 2009, 35(2): 214-219 (王雪松, 田西兰, 程玉虎, 易建强. 基于协同最小二乘支持向量机的Q学习. 自动化学报, 2009, 35(2): 214-219)
[8] Busoniu L, Ernst D, De Schutter B, Babuska R. Online least-squares policy iteration for reinforcement learning control. In: Proceedings of the 2010 American Control Conference. Baltimore, USA: IEEE, 2010. 486-491
[9] Rasmussen C E, Kuss M. Gaussian processes in reinforcement learning. In: Proceedings of the 17th Annual Conference on Neural Information Processing Systems. Vancouver, Canada: MIT Press, 2003. 751-759
[10] Jung T, Stone P. Gaussian processes for sample efficient reinforcement learning with RMAX-like exploration. In: Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases, Part I. Berlin, Heidelberg: Springer-Verlag, 2010. 601-616
[11] Deisenroth M P, Rasmussen C E. PILCO: a model-based and data-efficient approach to policy search. In: Proceedings of the 28th International Conference on Machine Learning. Washington, USA, 2011. 465-472
[12] Deisenroth M P, Rasmussen C E, Peters J. Gaussian process dynamic programming. Neurocomputing, 2009, 72(7-9): 1508-1524
[13] Wu Jun, Xu Xin, Wang Jian, He Han-Gen. Recent advances of reinforcement learning in multi-robot systems: a survey. Control and Decision, 2011, 26(11): 1601-1610, 1615 (吴军, 徐昕, 王健, 贺汉根. 面向多机器人系统的增强学习研究进展综述. 控制与决策, 2011, 26(11): 1601-1610, 1615)
[14] Hu J L, Wellman M P. Nash Q-learning for general-sum stochastic games. The Journal of Machine Learning Research, 2003, 4: 1039-1069
[15] Greenwald A, Hall K. Correlated Q-learning. In: Proceedings of the 20th International Conference on Machine Learning. Washington D.C., USA: AAAI Press, 2003. 242-249
[16] Conitzer V, Sandholm T. AWESOME: a general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 2007, 67(1-2): 23-43
[17] Weinberg M, Rosenschein J S. Best-response multiagent learning in non-stationary environments. In: Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems. Washington D.C., USA: IEEE, 2004. 506-513
[18] Chen C L, Li H X, Dong D Y. Hybrid control for robot navigation: a hierarchical Q-learning algorithm. IEEE Robotics and Automation Magazine, 2008, 15(2): 37-47
[19] Dai Zhao-Hui, Yuan Jiao-Hong, Wu Min, Chen Xin. Dynamic hierarchical reinforcement learning based on probability model. Control Theory and Applications, 2011, 28(11): 1595-1600, 1606 (戴朝晖, 袁姣红, 吴敏, 陈鑫. 基于概率模型的动态分层强化学习. 控制理论与应用, 2011, 28(11): 1595-1600, 1606)
[20] Shoham Y, Powers R, Grenager T. Multi-agent Reinforcement Learning: a Critical Survey. Technical Report, Computer Science Department, Stanford University, 2003
[21] Rasmussen C E, Williams C K I. Gaussian Processes for Machine Learning. Cambridge, MA, USA: The MIT Press, 2006
[22] Florian R V. Correct Equations for the Dynamics of the Cart-pole System. Technical Report, Center for Cognitive and Neural Studies, 2007