State-Action-Reward Prediction Representation Learning Based on Transformer

LIU Min-Song, ZHU Yuan-Heng, ZHAO Dong-Bin

Citation: Liu Min-Song, Zhu Yuan-Heng, Zhao Dong-Bin. State-action-reward prediction representation learning based on transformer. Acta Automatica Sinica, 2025, 51(1): 1−16 doi: 10.16383/j.aas.c240230

doi: 10.16383/j.aas.c240230 cstr: 32138.14.j.aas.c240230
Funds: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA27030400), the National Natural Science Foundation of China (62136008, 62293541), and the Beijing Natural Science Foundation (4232056)
    Author Bio:

    LIU Min-Song: Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree from the University of Science and Technology Beijing in 2018. His research interests include deep reinforcement learning and contrastive learning. E-mail: liuminsong2018@ia.ac.cn

    ZHU Yuan-Heng: Associate professor at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor's degree in automation from Nanjing University in 2010, and his Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences in 2015. His research interests include deep reinforcement learning, game theory, game intelligence, and multiagent learning. Corresponding author of this paper. E-mail: yuanheng.zhu@ia.ac.cn

    ZHAO Dong-Bin: Professor at the Institute of Automation, Chinese Academy of Sciences, and at the University of Chinese Academy of Sciences. He received his bachelor's, master's, and Ph.D. degrees from Harbin Institute of Technology in 1994, 1996, and 2000, respectively. His research interests include deep reinforcement learning, computational intelligence, autonomous driving, game artificial intelligence, and robotics. E-mail: dongbin.zhao@ia.ac.cn

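  • Abstract: To improve performance and sample efficiency on complex continuous control tasks with high-dimensional action spaces, this paper proposes a Transformer-based state-action-reward prediction representation learning framework (TSAR). Specifically, TSAR formulates a Transformer-based sequence prediction task that fuses state, action, and reward information. The task preprocesses sequence data with random masking and learns state and action representations jointly by maximizing the mutual information between the predicted state features of the masked sequence and the actual target state features. To further strengthen the relevance of the state and action representations to the reinforcement learning (RL) policy, TSAR introduces action prediction learning and reward prediction learning as additional learning constraints that guide state and action representation learning. TSAR also explicitly incorporates both the state representation and the action representation into the optimization of the RL policy, which significantly increases the benefit of the representations for policy learning. Experimental results show that, on nine challenging hard environments in DMControl, TSAR surpasses existing state-of-the-art methods in both performance and sample efficiency.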
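The masking step and the mutual-information objective described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive loss as the mutual-information lower bound, which the abstract does not confirm. The names `random_mask` and `info_nce` and the `[MASK]` placeholder are hypothetical, and plain Python lists stand in for the Transformer's feature vectors. The sequence length of 16 and the mask ratio of 50% follow Table 2.

```python
import math
import random


def random_mask(tokens, mask_ratio, mask_token, rng):
    """Replace a random subset of positions with mask_token.

    Returns the masked copy and the sorted list of masked indices.
    """
    num_masked = int(round(mask_ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), num_masked))
    masked = list(tokens)
    for i in masked_idx:
        masked[i] = mask_token
    return masked, masked_idx


def info_nce(predictions, targets, temperature=1.0):
    """InfoNCE loss over a batch of feature vectors (assumed objective).

    Prediction i is the positive pair of target i; every other target
    acts as a negative. Minimizing the loss maximizes a lower bound on
    the mutual information between predictions and targets.
    """
    total = 0.0
    for i, pred in enumerate(predictions):
        logits = [sum(p * t for p, t in zip(pred, tgt)) / temperature
                  for tgt in targets]
        top = max(logits)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_norm - logits[i]
    return total / len(predictions)


rng = random.Random(0)
sequence = [f"s{t}" for t in range(16)]  # K = 16 placeholder state tokens
masked_sequence, masked_idx = random_mask(sequence, 0.5, "[MASK]", rng)

targets = [[1.0, 0.0], [0.0, 1.0]]  # toy target state features
aligned = info_nce([[5.0, 0.0], [0.0, 5.0]], targets)  # predictions match
swapped = info_nce([[0.0, 5.0], [5.0, 0.0]], targets)  # predictions mismatched
```

In this toy batch the loss is lower when each prediction aligns with its own target than when the predictions are swapped, which is the behavior a contrastive objective of this kind is meant to reward.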
  • Fig. 1  Traditional state prediction model and TSAR state prediction model

    Fig. 2  The learning framework of TSAR

    Fig. 3  The framework of state prediction learning

    Fig. 4  The framework of action prediction learning and reward prediction learning

    Fig. 5  Performance of TSAR and comparison algorithms

    Fig. 6  Visualization of the action representation

    Fig. 7  Ablation study on the three prediction tasks

    Fig. 8  Ablation study of the action representation in RL policy learning

    Fig. 9  Ablation study on the input tokens of the state prediction model

    Fig. 10  Ablation study on mask ratio

    Fig. 11  Ablation study on sequence length

    Table 1  The fundamental information of nine challenging environments

    | Environment | Action space dimension | Difficulty |
    | --- | --- | --- |
    | Quadruped Walk | 12 | Hard |
    | Quadruped Run | 12 | Hard |
    | Reach Duplo | 9 | Hard |
    | Walker Run | 6 | Hard |
    | Cheetah Run | 6 | Hard |
    | Hopper Hop | 4 | Hard |
    | Finger Turn Hard | 2 | Hard |
    | Reacher Hard | 2 | Hard |
    | Acrobot Swingup | 1 | Hard |

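    Table 2  Additional hyperparameters for TSAR

    | Hyperparameter | Meaning | Value |
    | --- | --- | --- |
    | $\lambda_1$ | State prediction loss weight | 1 |
    | $\lambda_2$ | Action prediction loss weight | 1 |
    | $\lambda_3$ | Reward prediction loss weight | 1 |
    | batch_size | Training batch size | 256 |
    | mask_ratio | Mask ratio | 50% |
    | $K$ | Sequence length | 16 |
    | $L$ | Number of attention layers | 2 |
    | $n$ | Reward prediction step length | 2 for Hopper Hop and Reacher Hard; 1 for all other environments |
    | $\tau$ | EMA decay rate | 0.95 |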
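The hyperparameters in Table 2 can be gathered into a small configuration object, as a minimal sketch. The class and function names (`TSARConfig`, `reward_horizon`, `auxiliary_loss`, `ema_update`) are hypothetical. Two further assumptions go beyond the table: that the three prediction losses are combined as a weighted sum, and that the EMA decay rate multiplies the old target parameters. How this auxiliary loss interacts with the RL objective is not stated on this page and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TSARConfig:
    """Values copied from Table 2; field names are illustrative."""
    lambda_state: float = 1.0      # lambda_1, state prediction loss weight
    lambda_action: float = 1.0     # lambda_2, action prediction loss weight
    lambda_reward: float = 1.0     # lambda_3, reward prediction loss weight
    batch_size: int = 256
    mask_ratio: float = 0.5
    seq_len: int = 16              # K
    num_attention_layers: int = 2  # L
    ema_decay: float = 0.95        # tau


def reward_horizon(env_name):
    """Reward prediction step length n: 2 for two environments, else 1."""
    return 2 if env_name in {"Hopper Hop", "Reacher Hard"} else 1


def auxiliary_loss(cfg, state_loss, action_loss, reward_loss):
    """Weighted sum of the three prediction losses (assumed combination)."""
    return (cfg.lambda_state * state_loss
            + cfg.lambda_action * action_loss
            + cfg.lambda_reward * reward_loss)


def ema_update(target, online, decay):
    """Move target parameters toward online ones (assumed convention):
    new_target = decay * target + (1 - decay) * online."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


cfg = TSARConfig()
total = auxiliary_loss(cfg, state_loss=0.5, action_loss=0.25, reward_loss=0.125)
new_target = ema_update([1.0, 0.0], [0.0, 1.0], cfg.ema_decay)
```

With all three weights equal to 1, the weighted sum reduces to a plain sum of the three losses.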

    Table 3  Scores achieved by TSAR and comparison algorithms at 1M time steps

    | Environment | TSAR (ours) | TACO[16] | DrQ-v2[22] | CURL[8] | Dreamer-v3[44] | TDMPC[27] |
    | --- | --- | --- | --- | --- | --- | --- |
    | Quadruped Run | **657±25** | 541±38 | 407±21 | 181±14 | 331±42 | 397±37 |
    | Hopper Hop | 293±41 | 261±52 | 189±35 | 152±34 | **369±21** | 195±18 |
    | Walker Run | 699±22 | 637±11 | 517±43 | 387±24 | **765±32** | 600±28 |
    | Quadruped Walk | **837±23** | 793±8 | 680±52 | 123±11 | 353±27 | 435±16 |
    | Cheetah Run | **835±32** | 821±48 | 691±42 | 657±35 | 728±32 | 565±61 |
    | Finger Turn Hard | 636±24 | 632±75 | 220±21 | 215±17 | **810±58** | 400±113 |
    | Acrobot Swingup | **318±19** | 241±21 | 128±8 | 5±1 | 210±12 | 224±20 |
    | Reacher Hard | **937±18** | 883±63 | 572±51 | 400±29 | 499±51 | 485±31 |
    | Reach Duplo | **247±11** | 234±21 | 206±32 | 8±1 | 119±30 | 117±12 |
    | Average performance | **606.6** | 560.3 | 226.4 | 236.4 | 464.9 | 379.8 |
    | Median performance | **657** | 632 | 179 | 181 | 369 | 400 |

    Note: Bold marks the best result among the algorithms in each environment.
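As a worked check of the summary rows in Table 3, the sketch below recomputes the average and median for the TSAR column from its nine per-environment mean scores. It uses only the Python standard library; the variable names are illustrative, and the standard deviations are ignored.

```python
from statistics import mean, median

# Per-environment mean scores of TSAR at 1M steps, copied from Table 3.
tsar_scores = {
    "Quadruped Run": 657,
    "Hopper Hop": 293,
    "Walker Run": 699,
    "Quadruped Walk": 837,
    "Cheetah Run": 835,
    "Finger Turn Hard": 636,
    "Acrobot Swingup": 318,
    "Reacher Hard": 937,
    "Reach Duplo": 247,
}

values = list(tsar_scores.values())
average_performance = round(mean(values), 1)  # 5459 / 9, rounded to one decimal
median_performance = median(values)           # middle of the nine sorted scores

print(average_performance, median_performance)  # prints: 606.6 657
```

Both values agree with the TSAR entries in the last two rows of the table.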

    Table 4  Comparison with other representation learning objectives

    | Environment | TSAR (ours) | TACO[16] | M-CURL[11] | SPR[10] | ATC[39] | DrQ-v2[22] |
    | --- | --- | --- | --- | --- | --- | --- |
    | Quadruped Run | 657±25 | 541±38 | 536±45 | 448±79 | 432±54 | 407±21 |
    | Hopper Hop | 293±41 | 261±52 | 248±61 | 154±10 | 112±98 | 192±41 |
    | Walker Run | 699±22 | 637±21 | 623±39 | 560±71 | 502±171 | 517±43 |
    | Quadruped Walk | 837±23 | 793±8 | 767±29 | 701±25 | 718±27 | 680±52 |
    | Cheetah Run | 835±32 | 821±48 | 794±61 | 725±49 | 710±51 | 691±42 |
    | Finger Turn Hard | 636±24 | 632±75 | 624±102 | 573±88 | 526±95 | 220±21 |
    | Acrobot Swingup | 318±19 | 241±21 | 234±22 | 198±21 | 206±61 | 210±12 |
    | Reacher Hard | 937±18 | 883±63 | 865±72 | 711±92 | 863±12 | 572±51 |
    | Reach Duplo | 247±11 | 234±21 | 229±34 | 217±25 | 219±27 | 206±32 |
    | Average performance | 606.6 | 560.3 | 546.7 | 476.3 | 475.4 | 226.4 |
    | Median performance | 657 | 632 | 623 | 560 | 502 | 179 |

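    Table 5  Comparison of state prediction accuracy

    | Environment | TSAR (ours) error | TSAR (ours) performance | TACO[16] error | TACO[16] performance | M-CURL[11] error | M-CURL[11] performance |
    | --- | --- | --- | --- | --- | --- | --- |
    | Quadruped Run | 0.097 | 657±25 | 0.157 | 541±38 | 0.124 | 536±45 |
    | Walker Run | 0.081 | 699±22 | 0.145 | 637±21 | 0.111 | 623±39 |
    | Hopper Hop | 0.206 | 293±41 | 0.267 | 261±52 | 0.245 | 248±61 |
    | Reacher Hard | 0.052 | 937±18 | 0.142 | 883±63 | 0.107 | 865±72 |
    | Acrobot Swingup | 0.063 | 318±19 | 0.101 | 241±21 | 0.082 | 234±22 |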
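To make the comparison in Table 5 easier to read, the sketch below averages each method's state prediction error over the five listed environments and expresses TSAR's error relative to each baseline. The error values are copied from the table. The aggregation itself (an unweighted mean across environments, and the helper `relative_reduction`) is an illustrative assumption and not a statistic reported by the authors.

```python
# State prediction errors from Table 5, in the order: Quadruped Run,
# Walker Run, Hopper Hop, Reacher Hard, Acrobot Swingup.
errors = {
    "TSAR": [0.097, 0.081, 0.206, 0.052, 0.063],
    "TACO": [0.157, 0.145, 0.267, 0.142, 0.101],
    "M-CURL": [0.124, 0.111, 0.245, 0.107, 0.082],
}

# Unweighted mean error per method across the five environments.
mean_error = {name: sum(vals) / len(vals) for name, vals in errors.items()}


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


reduction_vs_taco = relative_reduction(mean_error["TACO"], mean_error["TSAR"])
reduction_vs_mcurl = relative_reduction(mean_error["M-CURL"], mean_error["TSAR"])

# True when TSAR has the lowest error in every listed environment.
tsar_lowest_everywhere = all(
    t < min(a, b)
    for t, a, b in zip(errors["TSAR"], errors["TACO"], errors["M-CURL"])
)
```

In the table, the method with the lower prediction error also has the higher score in each of the five environments, which is the relationship the table is meant to show.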
  • [1] SHAO K, ZHU Y, ZHAO D. StarCraft micromanagement with reinforcement learning and curriculum transfer learning. IEEE Transactions on Emerging Topics in Computational Intelligence, 2019, 3(1): 73−84 doi: 10.1109/TETCI.2018.2823329
    [2] HU G, LI H, LIU S, et al. NeuronsMAE: a novel multi-agent reinforcement learning environment for cooperative and competitive multi-robot tasks[C]//2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023: 1−8.
    [3] WANG J, ZHANG Q, ZHAO D. Highway lane change decision-making via attention-based deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 2022, 9(3): 567−569 doi: 10.1109/JAS.2021.1004395
    [4] KOSTRIKOV I, YARATS D, FERGUS R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels[C]//International Conference on Learning Representations. 2021.
    [5] LIU M, ZHU Y, CHEN Y, et al. Enhancing reinforcement learning via Transformer-based state predictive representations. IEEE Transactions on Artificial Intelligence, 2024, 5(9): 4364−4375
    [6] LIU M, LI L, HAO S, et al. Soft contrastive learning with Q-irrelevance abstraction for reinforcement learning. IEEE Transactions on Cognitive and Developmental Systems, 2023, 15(3): 1463−1473 doi: 10.1109/TCDS.2022.3218940
    [7] CHEN L, LU K, RAJESWARAN A, et al. Decision Transformer: Reinforcement learning via sequence modeling[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 15084−15097.
    [8] LASKIN M, SRINIVAS A, ABBEEL P. CURL: Contrastive unsupervised representations for reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2020: 5639−5650.
    [9] OORD A V D, LI Y, VINYALS O. Representation learning with contrastive predictive coding[C]//International Conference on Learning Representations. 2021.
    [10] SCHWARZER M, ANAND A, GOEL R, et al. Data-efficient reinforcement learning with self-predictive representations[C]//International Conference on Learning Representations. 2021.
    [11] ZHU J, XIA Y, WU L, et al. Masked contrastive representation learning for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3421−3433
    [12] YU T, ZHANG Z, LAN C, et al. Mask-based latent reconstruction for reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 35. 2022: 25117−25131.
    [13] YE W, LIU S, KURUTACH T, et al. Mastering Atari games with limited data[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 25476−25488.
    [14] KIM M, RHO K, KIM Y D, et al. Action-driven contrastive representation for reinforcement learning. PLOS ONE, 2022, 17(3): e0265456 doi: 10.1371/journal.pone.0265456
    [15] FUJIMOTO S, CHANG W D, SMITH E, et al. For SALE: State-action representation learning for deep reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 36. 2023: 61573−61624.
    [16] ZHENG R, WANG X, SUN Y, et al. TACO: Temporal latent action-driven contrastive loss for visual reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 36. 2024.
    [17] ZHANG A, MCALLISTER R, CALANDRA R, et al. Learning invariant representations for reinforcement learning without reconstruction[C]//International Conference on Learning Representations. 2021.
    [18] CHAI J, LI W, ZHU Y, et al. UNMAS: Multiagent reinforcement learning for unshaped cooperative scenarios. IEEE Transactions on Neural Networks and Learning Systems, 2021, 34(4): 2093−2104
    [19] HANSEN N, SU H, WANG X. Stabilizing deep Q-learning with convnets and vision transformers under data augmentation[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 3680−3693.
    [20] GELADA C, KUMAR S, BUCKMAN J, et al. DeepMDP: Learning continuous latent space models for representation learning[C]//International Conference on Machine Learning. PMLR, 2019: 2170−2179.
    [21] LEE A X, NAGABANDI A, ABBEEL P, et al. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model[C]//Advances in Neural Information Processing Systems: Vol. 33. 2020: 741−752.
    [22] YARATS D, FERGUS R, LAZARIC A, et al. Mastering visual continuous control: Improved data-augmented reinforcement learning[C]//International Conference on Learning Representations. 2022.
    [23] PARK S, LEVINE S. Predictable MDP abstraction for unsupervised model-based RL[C]//International Conference on Machine Learning. PMLR, 2023: 27246−27268.
    [24] YARATS D, FERGUS R, LAZARIC A, et al. Reinforcement learning with prototypical representations[C]//International Conference on Machine Learning. PMLR, 2021: 11920−11931.
    [25] YARATS D, ZHANG A, KOSTRIKOV I, et al. Improving sample efficiency in model-free reinforcement learning from images[C]//Proceedings of the AAAI Conference on Artificial Intelligence: Vol. 35. 2021: 10674−10681.
    [26] SCHWARZER M, RAJKUMAR N, NOUKHOVITCH M, et al. Pretraining representations for data-efficient reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 12686−12699.
    [27] HANSEN N A, SU H, WANG X. Temporal difference learning for model predictive control[C]//International Conference on Machine Learning. PMLR, 2022: 8387−8406.
    [28] HANSEN N, WANG X. Generalization in reinforcement learning by soft data augmentation[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 13611−13617.
    [29] MA Y J, SODHANI S, JAYARAMAN D, et al. VIP: Towards universal visual reward and representation via value-implicit pre-training[C]//International Conference on Learning Representations. 2022.
    [30] PARISI S, RAJESWARAN A, PURUSHWALKAM S, et al. The unsurprising effectiveness of pre-trained vision models for control[C]//International Conference on Machine Learning. PMLR, 2022: 17359−17371.
    [31] HUA P, CHEN Y, XU H. Simple emergent action representations from multi-task policy training[C]//International Conference on Learning Representations. 2023.
    [32] CHANDAK Y, THEOCHAROUS G, KOSTAS J, et al. Learning action representations for reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2019: 941−950.
    [33] ALLSHIRE A, MARTÍN-MARTÍN R, LIN C, et al. LASER: Learning a latent action space for efficient reinforcement learning[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 6650−6656.
    [34] EYSENBACH B, ZHANG T, LEVINE S, et al. Contrastive learning as goal-conditioned reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 35. 2022: 35603−35620.
    [35] GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: A new approach to self-supervised learning[C]//Advances in Neural Information Processing Systems: Vol. 33. 2020: 21271−21284.
    [36] MAZOURE B, TACHET DES COMBES R, DOAN T L, et al. Deep reinforcement and infomax learning[C]//Advances in Neural Information Processing Systems: Vol. 33. 2020: 3686−3698.
    [37] RAKELLY K, GUPTA A, FLORENSA C, et al. Which mutual-information representation learning objectives are sufficient for control?[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 26345−26357.
    [38] ANAND A, RACAH E, OZAIR S, et al. Unsupervised state representation learning in Atari[C]//Advances in Neural Information Processing Systems: Vol. 32. 2019: 8766−8779.
    [39] STOOKE A, LEE K, ABBEEL P, et al. Decoupling representation learning from reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2021: 9870−9879.
    [40] ZHU Y, ZHAO D. Online minimax Q network learning for two-player zero-sum Markov games. IEEE Transactions on Neural Networks and Learning Systems, 2020, 33(3): 1228−1241
    [41] KRIZHEVSKY A, SUTSKEVER I, HINTON G. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems: Vol. 25. 2012: 1097−1105.
    [42] HAYNES D, CORNS S, VENAYAGAMOORTHY G K. An exponential moving average algorithm[C]//2012 IEEE Congress on Evolutionary Computation. IEEE, 2012: 1−8.
    [43] LI N, CHEN Y, LI W, et al. BViT: Broad attention-based vision Transformer. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(9): 12772−12783 doi: 10.1109/TNNLS.2023.3264730
    [44] SHAKYA A K, PILLAI G, CHAKRABARTY S. Reinforcement learning algorithms: A brief survey. Expert Systems with Applications, 2023, 231: 120495 doi: 10.1016/j.eswa.2023.120495
Publication history
  • Received: 2024-04-30
  • Accepted: 2024-09-25
  • Published online: 2024-12-12
