

State-Action-Reward Prediction Representation Learning Based on Transformer

LIU Min-Song    ZHU Yuan-Heng    ZHAO Dong-Bin

Citation: Liu Min-Song, Zhu Yuan-Heng, Zhao Dong-Bin. State-action-reward prediction representation learning based on Transformer. Acta Automatica Sinica, 2025, 51(1): 117−132. doi: 10.16383/j.aas.c240230

doi: 10.16383/j.aas.c240230 cstr: 32138.14.j.aas.c240230


Funds: Supported by Strategic Priority Research Program of Chinese Academy of Sciences (XDA27030400), National Natural Science Foundation of China (62136008, 62293541), and Beijing Natural Science Foundation (4232056)
More Information
    Author Bio:

    LIU Min-Song Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree from the University of Science and Technology Beijing in 2018. His research interest covers deep reinforcement learning and contrastive learning. E-mail: liuminsong2018@ia.ac.cn

    ZHU Yuan-Heng Associate professor at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree in automation from Nanjing University in 2010, and his Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences in 2015. His research interest covers deep reinforcement learning, game theory, game intelligence, and multi-agent learning. Corresponding author of this paper. E-mail: yuanheng.zhu@ia.ac.cn

    ZHAO Dong-Bin Professor at the Institute of Automation, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. He received his bachelor's, master's, and Ph.D. degrees from Harbin Institute of Technology in 1994, 1996, and 2000, respectively. His research interest covers deep reinforcement learning, computational intelligence, autonomous driving, game artificial intelligence, and robotics. E-mail: dongbin.zhao@ia.ac.cn

  • Abstract: To improve the performance and sample efficiency of complex continuous control tasks with high-dimensional action spaces, this paper proposes a Transformer-based state-action-reward prediction representation learning framework (TSAR). Specifically, TSAR formulates a Transformer-based sequence prediction task that fuses state, action, and reward information. The prediction task preprocesses the sequence data with random masking and learns state and action representations jointly by maximizing the mutual information between the state features predicted from the masked sequence and the actual target state features. To further strengthen the relevance of the state and action representations to the reinforcement learning (RL) policy, TSAR introduces action prediction learning and reward prediction learning as additional constraints that guide representation learning. TSAR also explicitly incorporates both the state and action representations into the optimization of the RL policy, which markedly strengthens the contribution of the representations to policy learning. Experimental results show that TSAR surpasses state-of-the-art methods in both performance and sample efficiency on nine challenging hard environments from DMControl.
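    As a concrete reading of the abstract, the sketch below illustrates the masked state-action-reward sequence prediction objective in PyTorch. It is a minimal sketch under our own assumptions: the module and function names are illustrative, state/action/reward tokens are simply interleaved into one sequence (positional embeddings omitted for brevity), the mutual-information term is instantiated as an InfoNCE loss, and the target state features are assumed to come from an EMA copy of the state encoder, consistent with the τ entry in Table 2. It is not the authors' released implementation.

```python
# Illustrative sketch of a masked state-action-reward sequence prediction objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSARPredictor(nn.Module):
    def __init__(self, state_dim, action_dim, embed_dim=128, n_layers=2, n_heads=4):
        super().__init__()
        self.state_embed = nn.Linear(state_dim, embed_dim)    # state feature -> token
        self.action_embed = nn.Linear(action_dim, embed_dim)  # action representation -> token
        self.reward_embed = nn.Linear(1, embed_dim)           # scalar reward -> token
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(embed_dim, embed_dim)           # prediction head

    def forward(self, s_feat, a_feat, r, mask_ratio=0.5):
        # s_feat: (B, K, state_dim), a_feat: (B, K, action_dim), r: (B, K, 1); K is the sequence length.
        s_tok = self.state_embed(s_feat)
        a_tok = self.action_embed(a_feat)
        r_tok = self.reward_embed(r)
        B, K, D = s_tok.shape
        # Randomly replace a fraction of the state tokens with a learned mask token
        # (Table 2 lists a 50% mask ratio).
        mask = torch.rand(B, K, device=s_tok.device) < mask_ratio
        s_tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, K, D), s_tok)
        # Interleave state/action/reward tokens into one sequence of length 3K.
        seq = torch.stack([s_tok, a_tok, r_tok], dim=2).reshape(B, 3 * K, D)
        out = self.transformer(seq).reshape(B, K, 3, D)[:, :, 0]  # recovered state positions
        return self.proj(out), mask

def infonce_state_loss(pred, target, mask, temperature=0.1):
    """Maximize mutual information between predicted and target state features
    at the masked positions via an InfoNCE (contrastive) loss."""
    pred = F.normalize(pred[mask], dim=-1)      # (N, D) predictions at masked slots
    target = F.normalize(target[mask], dim=-1)  # (N, D) targets from the EMA state encoder
    logits = pred @ target.t() / temperature    # other targets in the batch act as negatives
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

    Under this reading, the other target features in the batch serve as negatives, so minimizing the InfoNCE loss tightens a lower bound on the mutual information between the predicted and actual target state features.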
  • Fig. 1  Traditional state prediction model and the TSAR state prediction model

    Fig. 2  The learning framework of TSAR

    Fig. 3  The framework of state prediction learning

    Fig. 4  The framework of action prediction learning and reward prediction learning

    Fig. 5  Performance of TSAR and the comparison algorithms

    Fig. 6  Visualization of the action representations

    Fig. 7  Ablation study on the three prediction tasks

    Fig. 8  Ablation study on action representations in RL policy learning

    Fig. 9  Ablation study on the input tokens of the state prediction model

    Fig. 10  Ablation study on the mask ratio

    Fig. 11  Ablation study on the sequence length

    Table 1  Basic information of the nine challenging environments

    Environment         Action space dimension    Difficulty
    Quadruped Walk      12                        Hard
    Quadruped Run       12                        Hard
    Reach Duplo         9                         Hard
    Walker Run          6                         Hard
    Cheetah Run         6                         Hard
    Hopper Hop          4                         Hard
    Finger Turn Hard    2                         Hard
    Reacher Hard        2                         Hard
    Acrobot Swingup     1                         Hard
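    The action-space dimensions in Table 1 can be checked directly against the dm_control suite that underlies DMControl. The snippet below is only an illustrative sanity check and is not part of TSAR; Reach Duplo comes from dm_control's manipulation module rather than the suite loader and is omitted here.

```python
# Print the action-space dimension of each suite-based task listed in Table 1.
from dm_control import suite

tasks = [
    ("quadruped", "walk"), ("quadruped", "run"),
    ("walker", "run"), ("cheetah", "run"), ("hopper", "hop"),
    ("finger", "turn_hard"), ("reacher", "hard"), ("acrobot", "swingup"),
]
for domain, task in tasks:
    env = suite.load(domain_name=domain, task_name=task)
    print(domain, task, env.action_spec().shape[0])  # e.g. quadruped walk -> 12
```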

    Table 2  Additional hyperparameters for TSAR

    Hyperparameter    Meaning                          Value
    $ \lambda_1 $     State prediction loss weight     1
    $ \lambda_2 $     Action prediction loss weight    1
    $ \lambda_3 $     Reward prediction loss weight    1
    batch_size        Training batch size              256
    mask_ratio        Mask ratio                       50%
    $ K $             Sequence length                  16
    $ L $             Number of attention layers       2
    $ n $             Reward prediction horizon        2 for Hopper Hop and Reacher Hard; 1 otherwise
    $ \tau $          EMA decay rate                   0.95
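    For orientation, the hyperparameters in Table 2 can be collected into a small configuration object. The sketch below (illustrative names, not the authors' code) also shows the two places such values usually enter training: a weighted sum of the three auxiliary prediction losses added to the RL objective, which is how the "loss weight" entries above are most naturally read, and an EMA update of the target encoder with decay rate τ. The exact combination used by TSAR is defined in the paper body.

```python
# Illustrative grouping of the Table 2 hyperparameters and their typical use.
from dataclasses import dataclass

import torch

@dataclass
class TSARConfig:
    lambda_state: float = 1.0    # state prediction loss weight (lambda_1)
    lambda_action: float = 1.0   # action prediction loss weight (lambda_2)
    lambda_reward: float = 1.0   # reward prediction loss weight (lambda_3)
    batch_size: int = 256
    mask_ratio: float = 0.5
    seq_len: int = 16            # K
    n_layers: int = 2            # L, number of attention layers
    reward_horizon: int = 1      # n; 2 for Hopper Hop and Reacher Hard
    ema_decay: float = 0.95      # tau

def auxiliary_loss(cfg, l_state, l_action, l_reward):
    # Weighted sum of the three prediction losses, added to the RL objective.
    return (cfg.lambda_state * l_state
            + cfg.lambda_action * l_action
            + cfg.lambda_reward * l_reward)

@torch.no_grad()
def ema_update(online_net, target_net, tau=0.95):
    # One common EMA convention: target <- tau * target + (1 - tau) * online.
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(tau).add_((1.0 - tau) * p_o)
```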

    Table 3  Scores achieved by TSAR and the comparison algorithms at 1M time steps

    Environment           TSAR (ours)   TACO[16]   DrQ-v2[22]   CURL[8]   Dreamer-v3[44]   TD-MPC[27]
    Quadruped Run         657±25        541±38     407±21       181±14    331±42           397±37
    Hopper Hop            293±41        261±52     189±35       152±34    369±21           195±18
    Walker Run            699±22        637±11     517±43       387±24    765±32           600±28
    Quadruped Walk        837±23        793±8      680±52       123±11    353±27           435±16
    Cheetah Run           835±32        821±48     691±42       657±35    728±32           565±61
    Finger Turn Hard      636±24        632±75     220±21       215±17    810±58           400±113
    Acrobot Swingup       318±19        241±21     128±8        5±1       210±12           224±20
    Reacher Hard          937±18        883±63     572±51       400±29    499±51           485±31
    Reach Duplo           247±11        234±21     206±32       8±1       119±30           117±12
    Mean performance      606.6         560.3      226.4        236.4     464.9            379.8
    Median performance    657           632        179          181       369              400
    Note: Bold indicates the best result among the algorithms in each environment.

    Table 4  Comparison with other representation learning objectives

    Environment           TSAR (ours)   TACO[16]   M-CURL[11]   SPR[10]   ATC[39]   DrQ-v2[22]
    Quadruped Run         657±25        541±38     536±45       448±79    432±54    407±21
    Hopper Hop            293±41        261±52     248±61       154±10    112±98    192±41
    Walker Run            699±22        637±21     623±39       560±71    502±171   517±43
    Quadruped Walk        837±23        793±8      767±29       701±25    718±27    680±52
    Cheetah Run           835±32        821±48     794±61       725±49    710±51    691±42
    Finger Turn Hard      636±24        632±75     624±102      573±88    526±95    220±21
    Acrobot Swingup       318±19        241±21     234±22       198±21    206±61    210±12
    Reacher Hard          937±18        883±63     865±72       711±92    863±12    572±51
    Reach Duplo           247±11        234±21     229±34       217±25    219±27    206±32
    Mean performance      606.6         560.3      546.7        476.3     475.4     226.4
    Median performance    657           632        623          560       502       179

    Table 5  Comparison of state prediction accuracy

    Environment          TSAR (ours)            TACO[16]               M-CURL[11]
                         Error    Performance   Error    Performance   Error    Performance
    Quadruped Run        0.097    657±25        0.157    541±38        0.124    536±45
    Walker Run           0.081    699±22        0.145    637±21        0.111    623±39
    Hopper Hop           0.206    293±41        0.267    261±52        0.245    248±61
    Reacher Hard         0.052    937±18        0.142    883±63        0.107    865±72
    Acrobot Swingup      0.063    318±19        0.101    241±21        0.082    234±22
  • [1] Shao K, Zhu Y H, Zhao D B. StarCraft micromanagement with reinforcement learning and curriculum transfer learning. IEEE Transactions on Emerging Topics in Computational Intelligence, 2019, 3(1): 73−84 doi: 10.1109/TETCI.2018.2823329
    [2] Hu G Z, Li H R, Liu S S, Zhu Y H, Zhao D B. NeuronsMAE: A novel multi-agent reinforcement learning environment for cooperative and competitive multi-robot tasks. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN). Gold Coast, Australia: IEEE, 2023. 1−8
    [3] Wang J J, Zhang Q C, Zhao D B. Highway lane change decision-making via attention-based deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 2022, 9(3): 567−569 doi: 10.1109/JAS.2021.1004395
    [4] Yarats D, Kostrikov I, Fergus R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
    [5] Liu M S, Zhu Y H, Chen Y R, Zhao D B. Enhancing reinforcement learning via Transformer-based state predictive representations. IEEE Transactions on Artificial Intelligence, 2024, 5(9): 4364−4375
    [6] Liu M S, Li L T, Hao S, Zhu Y H, Zhao D B. Soft contrastive learning with Q-irrelevance abstraction for reinforcement learning. IEEE Transactions on Cognitive and Developmental Systems, 2023, 15(3): 1463−1473 doi: 10.1109/TCDS.2022.3218940
    [7] Chen L L, Lu K, Rajeswaran A, Lee K, Grover A, Laskin M, et al. Decision Transformer: Reinforcement learning via sequence modeling. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2021. Article No. 1156
    [8] Laskin M, Srinivas A, Abbeel P. CURL: Contrastive unsupervised representations for reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning. Virtual Event: JMLR.org, 2020. Article No. 523
    [9] van den Oord A, Li Y Z, Vinyals O. Representation learning with contrastive predictive coding. arXiv preprint arXiv: 1807.03748, 2018.
    [10] Schwarzer M, Anand A, Goel R, Hjelm R D, Courville A C, Bachman P. Data-efficient reinforcement learning with self-predictive representations. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
    [11] Zhu J H, Xia Y C, Wu L J, Deng J J, Zhou W G, Qin T, et al. Masked contrastive representation learning for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3421−3433
    [12] Yu T, Zhang Z Z, Lan C L, Lu Y, Chen Z B. Mask-based latent reconstruction for reinforcement learning. In: Proceedings of the 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA: NeurIPS, 2022. 25117−25131
    [13] Ye W R, Liu S H, Kurutach T, Abbeel P, Gao Y. Mastering Atari games with limited data. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2021. Article No. 1951
    [14] Kim M, Rho K, Kim Y D, Jung K. Action-driven contrastive representation for reinforcement learning. PLoS One, 2022, 17(3): Article No. e0265456 doi: 10.1371/journal.pone.0265456
    [15] Fujimoto S, Chang W D, Smith E J, Gu S S, Precup D, Meger D. For SALE: State-action representation learning for deep reinforcement learning. In: Proceedings of the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA: NeurIPS, 2023. 61573−61624
    [16] Zheng R J, Wang X Y, Sun Y C, Ma S, Zhao J Y, Xu H Z, et al. TACO: Temporal latent action-driven contrastive loss for visual reinforcement learning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, LA, USA: Curran Associates Inc., 2024. Article No. 2092
    [17] Zhang A, Mcallister R, Calandra R, Gal Y, Levine S. Learning invariant representations for reinforcement learning without reconstruction. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
    [18] Chai J J, Li W F, Zhu Y H, Zhao D B, Ma Z, Sun K W, et al. UNMAS: Multiagent reinforcement learning for unshaped cooperative scenarios. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(4): 2093−2104 doi: 10.1109/TNNLS.2021.3105869
    [19] Hansen N, Su H, Wang X L. Stabilizing deep Q-learning with ConvNets and vision transformers under data augmentation. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2021. Article No. 281
    [20] Gelada C, Kumar S, Buckman J, Nachum O, Bellemare M G. DeepMDP: Learning continuous latent space models for representation learning. In: Proceedings of the 36th International Conference on Machine Learning. Long Beach, California, USA: PMLR, 2019. 2170−2179
    [21] Lee A X, Nagabandi A, Abbeel P, Levine S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates Inc., 2020. Article No. 63
    [22] Yarats D, Fergus R, Lazaric A, Pinto L. Mastering visual continuous control: Improved data-augmented reinforcement learning. In: Proceedings of the 10th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2022.
    [23] Park S, Levine S. Predictable MDP abstraction for unsupervised model-based RL. In: Proceedings of the 40th International Conference on Machine Learning. Honolulu, HI, USA: PMLR, 2023. 27246−27268
    [24] Yarats D, Fergus R, Lazaric A, Pinto L. Reinforcement learning with prototypical representations. In: Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR, 2021. 11920−11931
    [25] Yarats D, Zhang A, Kostrikov I, Amos B, Pineau J, Fergus R. Improving sample efficiency in model-free reinforcement learning from images. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. Virtual Event: AAAI, 2021. 10674−10681
    [26] Schwarzer M, Rajkumar N, Noukhovitch M, Anand A, Charlin L, Hjelm D, et al. Pretraining representations for data-efficient reinforcement learning. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2021. Article No. 971
    [27] Hansen N A, Su H, Wang X L. Temporal difference learning for model predictive control. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, MD, USA: PMLR, 2022. 8387−8406
    [28] Hansen N, Wang X L. Generalization in reinforcement learning by soft data augmentation. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Xi'an, China: IEEE, 2021. 13611−13617
    [29] Ma Y J, Sodhani S, Jayaraman D, Bastani O, Kumar V, Zhang A. VIP: Towards universal visual reward and representation via value-implicit pre-training. In: Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net, 2023.
    [30] Parisi S, Rajeswaran A, Purushwalkam S, Gupta A. The unsurprising effectiveness of pre-trained vision models for control. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, MD, USA: PMLR, 2022. 17359−17371
    [31] Hua P, Chen Y B, Xu H Z. Simple emergent action representations from multi-task policy training. In: Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net, 2023.
    [32] Chandak Y, Theocharous G, Kostas J, Jordan S, Thomas P. Learning action representations for reinforcement learning. In: Proceedings of the 36th International Conference on Machine Learning. Long Beach, California, USA: PMLR, 2019. 941−950
    [33] Allshire A, Martín-Martín R, Lin C, Manuel S, Savarese S, Garg A. LASER: Learning a latent action space for efficient reinforcement learning. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Xi'an, China: IEEE, 2021. 6650−6656
    [34] Eysenbach B, Zhang T J, Levine S, Salakhutdinov R. Contrastive learning as goal-conditioned reinforcement learning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, LA, USA: NeurIPS, 2022. Article No. 2580
    [35] Grill J B, Strub F, Altché F, Tallec C, Richemond P H, Buchatskaya E, et al. Bootstrap your own latent: A new approach to self-supervised learning. In: Proceedings of the 34th Conference on Neural Information Processing Systems. Vancouver, Canada: NeurIPS, 2020. 21271−21284
    [36] Mazoure B, Tachet Des Combes R, Doan T L, Bachman P, Hjelm R D. Deep reinforcement and InfoMax learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates Inc., 2020. Article No. 311
    [37] Rakelly K, Gupta A, Florensa C, Levine S. Which mutual-information representation learning objectives are sufficient for control? In: Proceedings of the 35th Conference on Neural Information Processing Systems. Virtual Event: NeurIPS, 2021. 26345−26357
    [38] Anand A, Racah E, Ozair S, Bengio Y, Côté M A, Hjelm R D. Unsupervised state representation learning in Atari. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates Inc., 2019. Article No. 787
    [39] Stooke A, Lee K, Abbeel P, Laskin M. Decoupling representation learning from reinforcement learning. In: Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR, 2021. 9870−9879
    [40] Zhu Y H, Zhao D B. Online minimax Q network learning for two-player zero-sum Markov games. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(3): 1228−1241 doi: 10.1109/TNNLS.2020.3041469
    [41] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, USA: Curran Associates Inc., 2012. 1097−1105
    [42] Haynes D, Corns S, Venayagamoorthy G K. An exponential moving average algorithm. In: Proceedings of the IEEE Congress on Evolutionary Computation. Brisbane, QLD, Australia: IEEE, 2012. 1−8
    [43] Li N N, Chen Y R, Li W F, Ding Z X, Zhao D B, Nie S. BViT: Broad attention-based vision Transformer. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(9): 12772−12783 doi: 10.1109/TNNLS.2023.3264730
    [44] Shakya A K, Pillai G, Chakrabarty S. Reinforcement learning algorithms: A brief survey. Expert Systems With Applications, 2023, 231: Article No. 120495 doi: 10.1016/j.eswa.2023.120495
Publication history
  • Received:  2024-04-30
  • Accepted:  2024-09-25
  • Published online:  2024-12-12
  • Issue date:  2025-01-16
