Entropy-guided MiniMax Value Decomposition for Reinforcement Learning in Two-team Zero-sum Games
-
Abstract: In a two-team zero-sum Markov game, one group of players cooperates to compete against another group. Because of the uncertainty of opponent behavior and the complex cooperative relations within a team, quickly finding an advantageous distributed policy in tasks with high sampling costs remains challenging. This paper proposes the entropy-guided minimax factorization (EGMF) reinforcement learning method, which learns intra-team cooperative and inter-team competitive policies online. First, a multi-agent actor-critic framework based on minimax value decomposition is proposed to improve optimization efficiency and game performance in tasks with high sampling costs and unrestricted action spaces. Second, maximum entropy is introduced so that agents can explore the state space more thoroughly and the online learning process avoids converging to local optima. The policy entropy accumulated along the time horizon is used to evaluate a policy's entropy and is combined with the factorized individual Q functions for policy improvement. Finally, the method is validated in several simulated game scenarios and on a physical robot platform and compared with baseline methods; the results show that EGMF learns more competitive two-team game policies from fewer samples.
-
Keywords:
- Multi-agent deep reinforcement learning /
- Two-team zero-sum Markov games /
- Maximum entropy /
- Value decomposition
Abstract: In two-team zero-sum Markov games, a group of players collaborates to confront a team of adversaries. Due to the uncertainty of opponent behavior and the complex cooperation within teams, quickly identifying advantageous distributed policies in high-sampling-cost tasks remains challenging. This paper introduces the entropy-guided minimax factorization (EGMF) reinforcement learning method, which enables online learning of cooperative policies within teams and competitive policies between teams. First, a multi-agent actor-critic framework based on minimax value decomposition is proposed to enhance optimization efficiency and game performance in tasks with high sampling costs and unrestricted action spaces. Second, maximum entropy is introduced to allow agents to explore the state space more thoroughly, preventing the online learning process from converging to local optima. In addition, the policy entropy is summed along the time horizon for policy evaluation and combined with the factorized individual Q functions for policy improvement. Finally, experiments are conducted in several simulated scenarios and on a real-robot platform, comparing EGMF with baseline methods. The results demonstrate that EGMF achieves superior performance with fewer samples.
1) The red side is referred to as the Proponents (Pros) and the blue side as the Antagonists (Ants).
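To connect the two ingredients above, the display below gives one plausible formalization in the spirit of maximum-entropy reinforcement learning [33, 35]; the symbols $\alpha$, $f_{\rm mix}$ and $\mathcal{P}$ (the set of Pros) are introduced here for illustration and need not match the exact form used in the paper:
$$\max_{\pi^{\rm pro}}\ \min_{\pi^{\rm ant}}\ \mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}\left(r_{t}+\alpha\sum_{i\in\mathcal{P}}\mathcal{H}\big(\pi^{\rm pro}_{i}(\cdot\mid o^{i}_{t})\big)\right)\right],\qquad Q_{\rm tot}(s,\boldsymbol{a})\approx f_{\rm mix}\big(Q_{1}(o^{1},a^{1}),\ldots,Q_{N}(o^{N},a^{N});\,s\big),$$
where $r_{t}$ is the Pros' team reward (the Ants receive $-r_{t}$), the entropy terms are accumulated along the time horizon, and each agent improves its decentralized policy with respect to its own factorized $Q_{i}$.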
Table 1 Important hyperparameters of all methods in experiments

| Algorithm | Hyperparameter | Description | Wimblepong | MPE | RoboMaster |
|---|---|---|---|---|---|
| Shared | n_episodes | Number of episodes | 13000 | 13000 | 80000 |
| Shared | n_seeds | Number of random seeds | 8 | 8 | 8 |
| Shared | $\gamma$ | Discount factor | 0.99 | 0.98 | 0.99 |
| Shared | hidden_layers | Hidden layers | [64, 64] | [64, 64] | [128, 128] |
| Shared | mix_hidden_dim | Mixing-network hidden dimension | 32 | 32 | 32 |
| Shared | learning_rate | Learning rate | 0.0005 | 0.0005 | 0.0005 |
| EGMF (ours) | buffer_size | Replay buffer size | 4e6 | 4e5 | 4e6 |
| RADAR[15]/Team-PSRO[16]/NXDO[44] | n_genes | Number of iterations | 13 | 13 | 10 |
| RADAR[15]/Team-PSRO[16]/NXDO[44] | ep_per_gene | Episodes per iteration | 1000 | 1000 | 80000 |
| RADAR[15]/Team-PSRO[16]/NXDO[44] | batch_size | Batch size | 1000 | 1000 | 2000 |
| RADAR[15]/Team-PSRO[16]/NXDO[44] | buffer_size | Replay buffer size | 200000 | 20000 | 200000 |
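For quick reference, the settings in Table 1 can be collected into a plain configuration dictionary, as in the sketch below; the key names simply mirror the table and are not taken from the authors' released code.

```python
# Hypothetical experiment configuration assembled from Table 1.
# Key names mirror the table; they are not taken from the authors' code.
SHARED = {
    "Wimblepong": {"n_episodes": 13000, "n_seeds": 8, "gamma": 0.99,
                   "hidden_layers": [64, 64], "mix_hidden_dim": 32, "learning_rate": 5e-4},
    "MPE":        {"n_episodes": 13000, "n_seeds": 8, "gamma": 0.98,
                   "hidden_layers": [64, 64], "mix_hidden_dim": 32, "learning_rate": 5e-4},
    "RoboMaster": {"n_episodes": 80000, "n_seeds": 8, "gamma": 0.99,
                   "hidden_layers": [128, 128], "mix_hidden_dim": 32, "learning_rate": 5e-4},
}

# Method-specific settings (EGMF vs. the population-based baselines).
EGMF = {"buffer_size": {"Wimblepong": 4e6, "MPE": 4e5, "RoboMaster": 4e6}}
BASELINES = {  # RADAR / Team-PSRO / NXDO
    "n_genes":     {"Wimblepong": 13,     "MPE": 13,    "RoboMaster": 10},
    "ep_per_gene": {"Wimblepong": 1000,   "MPE": 1000,  "RoboMaster": 80000},
    "batch_size":  {"Wimblepong": 1000,   "MPE": 1000,  "RoboMaster": 2000},
    "buffer_size": {"Wimblepong": 200000, "MPE": 20000, "RoboMaster": 200000},
}
```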
Table 2 Performance of all methods at the end of training when playing against the scripted bots, and the cross-play round-robin returns
| Metric | Algorithm | Pong-D | MPE-D | RM-D | Pong-C | MPE-C | RM-C |
|---|---|---|---|---|---|---|---|
| Against fixed scripted bots | EGMF (ours) | 0.95(±0.01) | 32.3(±1.0) | 0.63(±0.03) | 0.95(±0.02) | 23.0(±0.5) | 0.62(±0.03) |
| Against fixed scripted bots | RADAR[15] | 0.52(±0.11) | 16.3(±5.2) | 0.35(±0.02) | 0.58(±0.03) | 12.5(±5.1) | 0.52(±0.02) |
| Against fixed scripted bots | Team-PSRO[16] | 0.71(±0.04) | 21.2(±3.4) | 0.33(±0.01) | 0.71(±0.06) | 22.1(±2.9) | 0.54(±0.03) |
| Against fixed scripted bots | NXDO[44] | 0.71(±0.10) | 24.1(±1.6) | 0.45(±0.02) | 0.80(±0.05) | 23.0(±0.4) | 0.61(±0.01) |
| Round-robin results | EGMF (ours) | 0.92(±0.01) | 12.1(±0.3) | 0.90(±0.02) | 0.91(±0.02) | 7.8(±2.2) | 0.72(±0.01) |
| Round-robin results | RADAR[15] | 0.45(±0.02) | −2.4(±2.5) | 0.45(±0.04) | 0.43(±0.02) | −1.8(±1.9) | 0.50(±0.01) |
| Round-robin results | Team-PSRO[16] | 0.53(±0.02) | 1.9(±1.9) | 0.49(±0.01) | 0.56(±0.04) | −3.7(±2.8) | 0.55(±0.01) |
| Round-robin results | NXDO[44] | 0.51(±0.02) | 2.5(±1.2) | 0.51(±0.02) | 0.63(±0.02) | 2.9(±1.9) | 0.58(±0.02) |

Note: Bold indicates the best result in each scenario.
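The round-robin rows of Table 2 come from cross-play, where every method's final policy plays every other method's policy and the returns are averaged. The sketch below illustrates such a protocol; play_match and n_matches are assumptions for the sketch, not the authors' evaluation code.

```python
# Illustrative round-robin cross-play evaluation; play_match and n_matches are
# assumptions for this sketch, not the authors' evaluation code.
from itertools import combinations
from statistics import mean

def round_robin_returns(policies, play_match, n_matches=100):
    """policies: {method name: trained team policy}.
    play_match(p, q) -> return of team p in one match against team q.
    Returns each method's return averaged over all opponents and matches."""
    results = {name: [] for name in policies}
    for (name_p, pol_p), (name_q, pol_q) in combinations(policies.items(), 2):
        results[name_p].append(mean(play_match(pol_p, pol_q) for _ in range(n_matches)))
        results[name_q].append(mean(play_match(pol_q, pol_p) for _ in range(n_matches)))
    return {name: mean(r) for name, r in results.items()}
```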
Table 3 Performance of EGMF and FM3Q when playing against the scripted bots
| Method | Pong-D episodes (to 0.8) | Pong-D performance | MPE-D episodes (to 25) | MPE-D performance | RM-D episodes (to 0.6) | RM-D performance |
|---|---|---|---|---|---|---|
| EGMF (ours) | 3.0k | 0.95(±0.01) | 2.8k | 32.3(±1.0) | 35k | 0.63(±0.03) |
| FM3Q[17] | 3.1k | 0.96(±0.03) | 3.6k | 29.9(±1.2) | 19k | 0.68(±0.03) |

Note: Bold indicates the best result in each scenario.
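The "episodes (to X)" columns in Table 3 report how quickly each method reaches a target score against the scripted bots. A sample-efficiency statistic of this kind can be computed from a training curve as in the sketch below; the moving-average window is an assumption, since the paper's exact smoothing is not given here.

```python
# One way to compute "episodes needed to reach a target score" from a training curve.
# The moving-average window is an assumption, not taken from the paper.
def episodes_to_reach(scores, threshold, window=100):
    """scores: per-episode evaluation returns.
    Returns the first episode count at which the trailing moving average
    reaches the threshold, or None if it never does."""
    running = 0.0
    for i, s in enumerate(scores):
        running += s - (scores[i - window] if i >= window else 0.0)
        avg = running / min(i + 1, window)
        if avg >= threshold:
            return i + 1
    return None
```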
[1] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484−489 doi: 10.1038/nature16961
[2] TANG Z T, SHAO K, ZHAO D B, et al. Recent progress of deep reinforcement learning: from AlphaGo to AlphaGo Zero. Control Theory & Applications, 2017, 34(12): 1529−1546 doi: 10.7641/CTA.2017.70808 (in Chinese)
[3] SANDHOLM T. Solving imperfect-information games. Science, 2015, 347(6218): 122−123 doi: 10.1126/science.aaa4614
[4] TANG Z, ZHU Y, ZHAO D, et al. Enhanced rolling horizon evolution algorithm with opponent model learning. IEEE Transactions on Games, 2023, 15(1): 5−15 doi: 10.1109/TG.2020.3022698
[5] GUAN Y, AFSHARI M, TSIOTRAS P. Zero-sum games between mean-field teams: Reachability-based analysis under mean-field sharing[C]//AAAI Conference on Artificial Intelligence (AAAI): Vol. 38. 2024: 9731−9739.
[6] MATHIEU M, OZAIR S, SRINIVASAN S, et al. StarCraft II Unplugged: Large scale offline reinforcement learning[C]//Deep Reinforcement Learning Workshop, Advances in Neural Information Processing Systems (NeurIPS). 2021.
[7] YE D, LIU Z, SUN M, et al. Mastering complex control in MOBA games with deep reinforcement learning[C]//AAAI Conference on Artificial Intelligence (AAAI): Vol. 34. 2020: 6672−6679.
[8] LITTMAN M L. Markov games as a framework for multi-agent reinforcement learning[M]//Machine Learning Proceedings. Elsevier, 1994: 157−163.
[9] HU J, WELLMAN M P. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 2003, 4: 1039−1069
[10] ZHU Y, ZHAO D. Online minimax Q network learning for two-player zero-sum Markov games. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(3): 1228−1241 doi: 10.1109/TNNLS.2020.3041469
[11] LANCTOT M, ZAMBALDI V, GRUSLYS A, et al. A unified game-theoretic approach to multiagent reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2017, 30: 1−35
[12] CHAI J, CHEN W, ZHU Y, et al. A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(9): 5417−5429 doi: 10.1109/TSMC.2023.3270444
[13] LI W, ZHU Y, ZHAO D. Missile guidance with assisted deep reinforcement learning for head-on interception of maneuvering target. Complex and Intelligent Systems, 2022, 8(2): 1205−1216 doi: 10.1007/s40747-021-00577-6
[14] HAARNOJA T, MORAN B, LEVER G, et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. Science Robotics, 2024, 9(89): eadi8022 doi: 10.1126/scirobotics.adi8022
[15] PHAN T, BELZNER L, GABOR T, et al. Resilient multi-agent reinforcement learning with adversarial value decomposition[C]//AAAI Conference on Artificial Intelligence (AAAI): Vol. 35. 2021: 11308−11316.
[16] MCALEER S, FARINA G, ZHOU G, et al. Team-PSRO for learning approximate TMECor in large team games via cooperative reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2023, 36.
[17] HU G, ZHU Y, LI H, et al. FM3Q: Factorized multi-agent minimax Q-learning for two-team zero-sum Markov game[J/OL]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024. DOI: 10.1109/TETCI.2024.3383454.
[18] BAI Y, JIN C. Provable self-play algorithms for competitive reinforcement learning[C]//International Conference on Machine Learning (ICML). PMLR, 2020: 551−560.
[19] PEREZ-NIEVES N, YANG Y, SLUMBERS O, et al. Modelling behavioural diversity for learning in open-ended games[C]//International Conference on Machine Learning (ICML). PMLR, 2021: 8514−8524.
[20] BALDUZZI D, GARNELO M, BACHRACH Y, et al. Open-ended learning in symmetric zero-sum games[C]//International Conference on Machine Learning (ICML). 2019: 434−443.
[21] MCALEER S, LANIER J B, FOX R, et al. Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games. Advances in Neural Information Processing Systems (NeurIPS), 2020, 33: 20238−20248
[22] MULLER P, OMIDSHAFIEI S, ROWLAND M, et al. A generalized training approach for multiagent learning[C]//International Conference on Learning Representations (ICLR). 2020.
[23] MARRIS L, MULLER P, LANCTOT M, et al. Multi-agent training beyond zero-sum with correlated equilibrium meta-solvers[C]//International Conference on Machine Learning (ICML). PMLR, 2021: 7480−7491.
[24] FENG X, SLUMBERS O, WAN Z, et al. Neural auto-curricula in two-player zero-sum games. Advances in Neural Information Processing Systems (NeurIPS), 2021, 34: 3504−3517
[25] ANAGNOSTIDES I, KALOGIANNIS F, PANAGEAS I, et al. Algorithms and complexity for computing Nash equilibria in adversarial team games[C]//Proceedings of the 24th ACM Conference on Economics and Computation (EC). 2023: 89−89.
[26] ZHU Y, LI W, ZHAO M, et al. Empirical policy optimization for n-player Markov games. IEEE Transactions on Cybernetics, 2023, 53(10): 6443−6455 doi: 10.1109/TCYB.2022.3179775
[27] LUO G, ZHANG H, HE H, et al. Multiagent adversarial collaborative learning via mean-field theory. IEEE Transactions on Cybernetics, 2020, 51(10): 4994−5007
[28] LOWE R, WU Y I, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems (NeurIPS), 2017, 30: 6382−6393
[29] SUNEHAG P, LEVER G, GRUSLYS A, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward[C]//International Conference on Autonomous Agents and Multiagent Systems (AAMAS). 2018.
[30] RASHID T, SAMVELYAN M, DE WITT C S, et al. Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research, 2020, 21(1): 7234−7284
[31] CHAI J, LI W, ZHU Y, et al. UNMAS: Multiagent reinforcement learning for unshaped cooperative scenarios. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(4): 2093−2104 doi: 10.1109/TNNLS.2021.3105869
[32] PENG B, RASHID T, SCHROEDER DE WITT C, et al. FACMAC: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems (NeurIPS), 2021, 34: 12208−12221
[33] ZHANG T, LI Y, WANG C, et al. FOP: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning[C]//International Conference on Machine Learning (ICML). PMLR, 2021: 12491−12500.
[34] HAARNOJA T, TANG H, ABBEEL P, et al. Reinforcement learning with deep energy-based policies[C]//International Conference on Machine Learning (ICML). PMLR, 2017: 1352−1361.
[35] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]//International Conference on Machine Learning (ICML). PMLR, 2018: 1861−1870.
[36] DUAN J, GUAN Y, LI S E, et al. Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors. IEEE Transactions on Neural Networks and Learning Systems, 2021, 33(11): 6584−6598
[37] KALOGIANNIS F, PANAGEAS I, VLATAKIS-GKARAGKOUNIS E V. Towards convergence to Nash equilibria in two-team zero-sum games[C]//International Conference on Learning Representations (ICLR). 2021.
[38] WANG J, REN Z, LIU T, et al. QPLEX: Duplex dueling multi-agent Q-learning[C]//International Conference on Learning Representations (ICLR). 2020.
[39] CONDON A. On algorithms for simple stochastic games. Advances in Computational Complexity Theory, 1990, 13: 51−72
[40] ZHOU M, LIU Z, SUI P, et al. Learning implicit credit assignment for cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2020, 33: 11853−11864
[41] ZIEBART B D, MAAS A L, BAGNELL J A, et al. Maximum entropy inverse reinforcement learning[C]//AAAI Conference on Artificial Intelligence (AAAI): Vol. 8. Chicago, IL, USA, 2008: 1433−1438.
[42] BELLMAN R. On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 1952, 38(8): 716−719 doi: 10.1073/pnas.38.8.716
[43] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529−533 doi: 10.1038/nature14236
[44] MCALEER S, WANG K A, BALDI P, et al. XDO: A double oracle algorithm for extensive-form games. Advances in Neural Information Processing Systems (NeurIPS), 2021, 34: 23128−23139
[45] TERRY J, BLACK B, GRAMMEL N, et al. PettingZoo: Gym for multi-agent reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2021, 34: 15032−15043
[46] HU G, LI H, LIU S, et al. NeuronsMAE: A novel multi-agent reinforcement learning environment for cooperative and competitive multi-robot tasks[C]//International Joint Conference on Neural Networks (IJCNN). IEEE, 2023: 1−8.
[47] SAMVELYAN M, KHAN A, DENNIS M D, et al. MAESTRO: Open-ended environment design for multi-agent reinforcement learning[C]//International Conference on Learning Representations (ICLR). 2023.
[48] TIMBERS F, BARD N, LOCKHART E, et al. Approximate exploitability: Learning a best response[C]//International Joint Conference on Artificial Intelligence (IJCAI). 2022: 3487−3493.
[49] COHEN A, YU L, WRIGHT R. Diverse exploration for fast and safe policy improvement[C]//AAAI Conference on Artificial Intelligence (AAAI): Vol. 32. 2018: 2876−2883.
[50] TSAI Y Y, XU H, DING Z, et al. DROID: Minimizing the reality gap using single-shot human demonstration. IEEE Robotics and Automation Letters, 2021, 6: 3168−3175 doi: 10.1109/LRA.2021.3062311