
A Meta-evolutionary Learning Algorithm for Opponent Adaptation in Two-player Zero-sum Games

Wu Zhe, Li Kai, Xu Hang, Xing Jun-Liang

Citation: Wu Zhe, Li Kai, Xu Hang, Xing Jun-Liang. A meta-evolutionary learning algorithm for opponent adaptation in two-player zero-sum games. Acta Automatica Sinica, 2022, 48(10): 2462−2473 doi: 10.16383/j.aas.c211003

doi: 10.16383/j.aas.c211003

Funds: Supported by National Key Research and Development Program of China (2020AAA0103401), National Natural Science Foundation of China (62076238, 61902402), Strategic Priority Research Program of Chinese Academy of Sciences (XDA27000000), and CCF-Tencent Open Research Fund (RAGR20200104)
    Author Bio:

    WU Zhe  Master student at the Center for Research on Intelligent System and Engineering, Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree in engineering from Shandong University in 2019. His research interests cover computer games and reinforcement learning. E-mail: wuzhe2019@ia.ac.cn

    LI Kai  Associate professor at the Center for Research on Intelligent System and Engineering, Institute of Automation, Chinese Academy of Sciences. He received his Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2018. His research interests cover computer games and reinforcement learning. E-mail: kai.li@ia.ac.cn

    XU Hang  Master student at the Center for Research on Intelligent System and Engineering, Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree in engineering from Wuhan University in 2020. His research interests cover computer games and reinforcement learning. E-mail: xuhang2020@ia.ac.cn

    XING Jun-Liang  Professor in the Department of Computer Science and Technology, Tsinghua University. He received his Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, in 2012. His main research interest is computer games. Corresponding author of this paper. E-mail: jlxing@nlpr.ia.ac.cn

  • Abstract: Research on two-player zero-sum games has achieved milestone breakthroughs in recent years on problems such as Go and Texas Hold'em poker. Most existing solution schemes for two-player zero-sum games are built around the Nash equilibrium under the assumption of a rational opponent; such a strategy is conservative, aiming above all not to lose, and in actual play it cannot guarantee maximal payoff, for instance when the opponent behaves irrationally. Opponent modeling offers a new route to maximizing the game payoff, but it still faces difficulties such as constructing accurate opponent models. Drawing on the idea of meta-learning, this paper proposes a meta-strategy evolutionary learning framework that can rapidly adapt to an opponent's strategy. In the training phase, a population-evolution procedure continually generates opponents with diverse playing styles as training data, and a meta-strategy update method then adjusts the network weights of the meta-model so that it acquires the ability to adapt quickly. Extensive experiments on Leduc poker, heads-up limit Texas Hold'em (LHE), and RoboSumo show that the proposed algorithm effectively overcomes the drawbacks of existing methods and achieves fast adaptation to opponents of unknown style, thereby providing a new approach to payoff-maximizing solutions of two-player zero-sum games.
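The two training stages summarized above, population-based opponent generation followed by a meta-update of the meta-model, can be read as a single loop. The sketch below is only an illustrative rendering of that loop under stated assumptions, not the authors' implementation: the names (MetaPolicy, rollout_fn, mutate, pg_loss), the plain REINFORCE surrogate, the fitness used to rank opponents, and the first-order treatment of the meta-gradient are all stand-ins for whatever the paper actually uses (its ablations include a "+PPO" variant, suggesting policy-gradient inner updates).

```python
import copy
import random
import torch


class MetaPolicy(torch.nn.Module):
    """Small MLP policy over discrete actions (sizes are illustrative)."""

    def __init__(self, n_states=36, n_actions=4, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_states, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_actions))

    def forward(self, states):
        return torch.distributions.Categorical(logits=self.net(states))


def pg_loss(policy, states, actions, returns):
    # Plain REINFORCE surrogate; a stand-in for the paper's actual
    # policy-gradient objective.
    return -(policy(states).log_prob(actions) * returns).mean()


def mutate(policy, std):
    # Gaussian parameter perturbation used to refill the opponent population.
    child = copy.deepcopy(policy)
    with torch.no_grad():
        for p in child.parameters():
            p.add_(std * torch.randn_like(p))
    return child


def meta_train(rollout_fn, meta_policy, iterations=100, alpha=0.1, beta=0.01,
               pop_size=10, elite_ratio=0.4, mutation_std=0.1):
    """Meta-evolutionary training loop (first-order meta-gradient sketch).

    `rollout_fn(policy, opponent)` must be supplied by the caller and return
    (states, actions, returns) tensors collected by `policy` while playing
    against `opponent` in the target game (Leduc, LHE or RoboSumo in the paper).
    """
    population = [mutate(meta_policy, mutation_std) for _ in range(pop_size)]

    for _ in range(iterations):
        # Population evolution: rank opponents by their return against the
        # current meta-policy, keep the elites, refill the pool by mutation.
        fitness = [rollout_fn(opp, meta_policy)[2].mean().item() for opp in population]
        ranked = [p for _, p in sorted(zip(fitness, population),
                                       key=lambda fp: fp[0], reverse=True)]
        elites = ranked[:max(1, int(elite_ratio * pop_size))]
        population = elites + [mutate(random.choice(elites), mutation_std)
                               for _ in range(pop_size - len(elites))]

        # Meta-update: adapt a copy of the meta-policy to each opponent with one
        # inner gradient step, then average the post-adaptation gradients and
        # apply them to the meta-parameters (first-order MAML approximation).
        meta_grads = [torch.zeros_like(p) for p in meta_policy.parameters()]
        for opponent in population:
            adapted = copy.deepcopy(meta_policy)
            s, a, r = rollout_fn(adapted, opponent)
            inner = pg_loss(adapted, s, a, r)
            grads = torch.autograd.grad(inner, list(adapted.parameters()))
            with torch.no_grad():
                for p, g in zip(adapted.parameters(), grads):
                    p.sub_(alpha * g)                    # inner step, size alpha
            s, a, r = rollout_fn(adapted, opponent)      # fresh post-adaptation data
            outer = pg_loss(adapted, s, a, r)
            outer_grads = torch.autograd.grad(outer, list(adapted.parameters()))
            for acc, g in zip(meta_grads, outer_grads):
                acc.add_(g / pop_size)
        with torch.no_grad():
            for p, g in zip(meta_policy.parameters(), meta_grads):
                p.sub_(beta * g)                         # outer (meta) step, size beta
    return meta_policy
```

At test time, the same inner-update rule would be reused: starting from the trained meta-model, a few gradient steps (the test-time update step size and step count in Table 1) are taken against the unknown opponent before evaluation.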
  • Fig. 1  The meta-model's training process

    Fig. 2  Comparison of our method with the baseline algorithm in RoboSumo

    Fig. 3  Ablation study

    Fig. 4  Visualization of the styles of the opponent strategies generated by the population evolution module

    Fig. 5  Effect of hyperparameter settings on model performance

    Table 1  Hyperparameter settings in different environments

    Parameter                        Leduc      LHE         RoboSumo
    State-space dimension            36         72          120
    Action-space dimension           4          4           8
    Network size                     [64, 64]   [128, 128]  [128, 128]
    Training step size $\alpha$      0.100      0.050       0.003
    Training step size $\beta$       0.0100     0.0100      0.0006
    Discount factor $\gamma$         0.990      0.995       0.995
    Gradient update steps            1          1           1
    Population size                  10         10          10
    Elite ratio $\vartheta$          0.2        0.4         0.4
    Mutation rate                    0.3        0.1         0.2
    Mutation strength                0.1        0.1         0.1
    Test-time update step size       0.100      0.050       0.001
    Test-time update steps           3          3           3
    Evaluation games                 50         100         50
    Storage (GB)                     ~0.3       ~2.0        ~1.5
    Iterations (T)                   100        400         300
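Read column-wise, each column of Table 1 is one training configuration. A minimal sketch of how the Leduc column might be captured in code is given below; the field names are ours, not the paper's, and only the values are taken from the table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetaEvoConfig:
    """Hypothetical container for the Table 1 hyperparameters (Leduc column)."""
    state_dim: int = 36
    action_dim: int = 4
    hidden_sizes: List[int] = field(default_factory=lambda: [64, 64])
    alpha: float = 0.100            # training step size (inner update)
    beta: float = 0.0100            # training step size (meta update)
    gamma: float = 0.990            # discount factor
    inner_steps: int = 1            # gradient update steps
    population_size: int = 10
    elite_ratio: float = 0.2
    mutation_rate: float = 0.3
    mutation_std: float = 0.1       # mutation strength
    test_alpha: float = 0.100       # test-time update step size
    test_steps: int = 3             # test-time update steps
    eval_games: int = 50
    iterations: int = 100           # T


# e.g. leduc_cfg = MetaEvoConfig(); the LHE and RoboSumo columns would only
# change the default values (state_dim=72/120, alpha=0.050/0.003, ...).
```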

    Table 2  Average return of our method and the baseline methods in Leduc

    Method        Random opponent   Call opponent    Bluff opponent   CFR opponent      NFSP opponent
    Ours          1.359 ± 0.023     0.646 ± 0.069    0.576 ± 0.043    −0.162 ± 0.032    0.325 ± 0.096
    CFR           0.749 ± 0.014     0.364 ± 0.010    0.283 ± 0.028     0.010 ± 0.024    0.144 ± 0.007
    DRON          1.323 ± 0.014     0.418 ± 0.011    0.409 ± 0.052    −0.347 ± 0.031    0.212 ± 0.080
    EOM           1.348 ± 0.015     0.635 ± 0.007    0.444 ± 0.024    −0.270 ± 0.042    −0.012 ± 0.023
    NFSP          0.780 ± 0.019     0.132 ± 0.024    0.029 ± 0.022    −0.412 ± 0.040    0.011 ± 0.027
    MAML          1.372 ± 0.028     0.328 ± 0.013    0.323 ± 0.044    −0.409 ± 0.010    0.089 ± 0.051
    Ours + PPO    1.353 ± 0.011     0.658 ± 0.005    0.555 ± 0.017    −0.159 ± 0.041    0.314 ± 0.012
    Ours − EA     0.994 ± 0.042     0.611 ± 0.021    0.472 ± 0.038    −0.224 ± 0.016    0.203 ± 0.029
    EA            0.535 ± 0.164     0.422 ± 0.108    0.366 ± 0.113    −0.365 ± 0.094    0.189 ± 0.102
    Oracle        1.373 ± 0.007     0.662 ± 0.014    0.727 ± 0.012    −0.089 ± 0.016    0.338 ± 0.041

    Table 3  Average return of our method and the baseline methods in LHE

    Method        Random opponent    LA opponent       TA opponent       LP opponent
    Ours          2.594 ± 0.089       0.335 ± 0.012     0.514 ± 0.031    0.243 ± 0.102
    DRON          2.131 ± 0.672      −0.609 ± 0.176     0.294 ± 0.057    0.022 ± 0.028
    EOM           2.555 ± 0.020      −0.014 ± 0.013     0.237 ± 0.023    0.144 ± 0.128
    NFSP          1.342 ± 0.033      −0.947 ± 0.012    −0.352 ± 0.094    0.203 ± 0.089
    MAML          2.633 ± 0.035       0.037 ± 0.047     0.231 ± 0.057    0.089 ± 0.051
    Ours + PPO    2.612 ± 0.058       0.327 ± 0.011     0.478 ± 0.042    0.246 ± 0.070
    Ours − EA     2.362 ± 0.023       0.185 ± 0.049     0.388 ± 0.012    0.119 ± 0.015
    EA            2.193 ± 0.158       0.096 ± 0.087     0.232 ± 0.097    0.091 ± 0.009
    Oracle        2.682 ± 0.033       0.513 ± 0.009     0.624 ± 0.011    0.270 ± 0.026
Publication history
  • Received: 2021-10-22
  • Accepted: 2022-03-14
  • Published online: 2022-05-05
  • Issue date: 2022-10-14
