Abstract: To enhance the performance and sample efficiency of complex continuous control tasks with high-dimensional action spaces, this paper introduces a Transformer-based state-action-reward prediction representation learning framework (TSAR). Specifically, TSAR formulates a sequence prediction task that fuses state, action, and reward information with a Transformer architecture. The prediction task preprocesses sequence data with random masking and maximizes the mutual information between the predicted state features of the masked sequence and the actual target state features, thereby learning state and action representations jointly. To further strengthen the relevance of these representations to the reinforcement learning (RL) policy, TSAR adds action prediction (inverse dynamics) learning and reward prediction learning as auxiliary constraints that guide representation learning. TSAR also incorporates the state and action representations explicitly into the optimization of the RL policy, which markedly amplifies the benefit of the learned representations for policy learning. Experimental results show that, across nine challenging DMControl environments, TSAR surpasses existing state-of-the-art methods in both performance and sample efficiency.

Keywords:
- Deep reinforcement learning
- Representation learning
- Self-supervised contrastive learning
- Transformer
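To make the training objective concrete, below is a minimal, hypothetical sketch of a TSAR-style update in PyTorch, written only from the description above: it masks a state-action sequence, runs a small Transformer encoder, contrasts the predicted masked-state features against features from an EMA target encoder (an InfoNCE-style mutual-information bound), and adds action and reward prediction losses weighted by $\lambda_1$, $\lambda_2$, $\lambda_3$. All module and variable names, the temperature, and the network sizes are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a TSAR-style loss (assumed PyTorch implementation, not the authors' code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

D, K, B = 64, 16, 256          # latent dim, sequence length K, batch size
state_dim, action_dim = 24, 6  # illustrative environment sizes

state_enc  = nn.Linear(state_dim, D)      # online state encoder (stand-in)
action_enc = nn.Linear(action_dim, D)     # action encoder (stand-in)
target_enc = copy.deepcopy(state_enc)     # EMA target encoder, no gradients
for p in target_enc.parameters():
    p.requires_grad_(False)

transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)
mask_token  = nn.Parameter(torch.zeros(D))
action_head = nn.Linear(D, action_dim)    # action (inverse-dynamics) prediction head
reward_head = nn.Linear(D, 1)             # reward prediction head
proj        = nn.Linear(D, D)             # projection for the InfoNCE term

def tsar_loss(states, actions, rewards, mask_ratio=0.5, lambdas=(1.0, 1.0, 1.0)):
    """states: (B,K,state_dim), actions: (B,K,action_dim), rewards: (B,K)."""
    z_s = state_enc(states)
    z_a = action_enc(actions)

    # Randomly mask a fraction of the state tokens.
    mask = torch.rand(z_s.shape[:2]) < mask_ratio                      # (B,K) bool
    z_s_masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(z_s), z_s)

    # Interleave state and action tokens and encode the sequence.
    tokens = torch.stack([z_s_masked, z_a], dim=2).reshape(z_s.size(0), -1, D)
    h = transformer(tokens)
    h_s, h_a = h[:, 0::2], h[:, 1::2]                                  # state/action slots

    # (1) State prediction: InfoNCE between predicted masked states and EMA targets.
    with torch.no_grad():
        tgt = target_enc(states)
    q = F.normalize(proj(h_s[mask]), dim=-1)
    k = F.normalize(tgt[mask], dim=-1)
    logits = q @ k.t() / 0.1                                           # temperature assumed
    l_state = F.cross_entropy(logits, torch.arange(len(q)))

    # (2) Action prediction and (3) reward prediction as auxiliary constraints.
    l_action = F.mse_loss(action_head(h_s), actions)
    l_reward = F.mse_loss(reward_head(h_a).squeeze(-1), rewards)

    l1, l2, l3 = lambdas
    return l1 * l_state + l2 * l_action + l3 * l_reward

@torch.no_grad()
def ema_update(online, target, tau=0.95):
    # Target encoder tracks the online encoder with an exponential moving average.
    for p, tp in zip(online.parameters(), target.parameters()):
        tp.mul_(tau).add_(p, alpha=1 - tau)

# Example usage with random data:
loss = tsar_loss(torch.randn(B, K, state_dim), torch.randn(B, K, action_dim), torch.randn(B, K))
loss.backward()
ema_update(state_enc, target_enc)
```

In the full method, the encoders would be the agent's own state and action encoders, so that the learned representations are shared with the RL policy, and the target encoder would be refreshed with an EMA update (decay rate $\tau$) after each gradient step.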
Table 1 The fundamental information of nine challenging environments

| Environment | Action space dimension | Difficulty |
| --- | --- | --- |
| Quadruped Walk | 12 | Hard |
| Quadruped Run | 12 | Hard |
| Reach Duplo | 9 | Hard |
| Walker Run | 6 | Hard |
| Cheetah Run | 6 | Hard |
| Hopper Hop | 4 | Hard |
| Finger Turn Hard | 2 | Hard |
| Reacher Hard | 2 | Hard |
| Acrobot Swingup | 1 | Hard |
Table 2 Additional hyperparameters for TSAR
| Hyperparameter | Meaning | Value |
| --- | --- | --- |
| $\lambda_1$ | State prediction loss weight | 1 |
| $\lambda_2$ | Action prediction loss weight | 1 |
| $\lambda_3$ | Reward prediction loss weight | 1 |
| batch_size | Training batch size | 256 |
| mask_ratio | Mask ratio | 50% |
| $K$ | Sequence length | 16 |
| $L$ | Number of attention layers | 2 |
| $n$ | Reward prediction steps | 2 (Hopper Hop, Reacher Hard); 1 (others) |
| $\tau$ | EMA decay rate | 0.95 |
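For reference, the Table 2 settings could be bundled into a single configuration object. The sketch below is a hypothetical arrangement (field names are our own, not the authors'); it also encodes the per-environment reward prediction horizon $n$.

```python
# Hypothetical grouping of the Table 2 hyperparameters into one config object.
from dataclasses import dataclass

@dataclass
class TSARConfig:
    lambda_state: float = 1.0   # weight of the state prediction loss
    lambda_action: float = 1.0  # weight of the action prediction loss
    lambda_reward: float = 1.0  # weight of the reward prediction loss
    batch_size: int = 256       # training batch size
    mask_ratio: float = 0.5     # fraction of masked tokens
    seq_len: int = 16           # sequence length K
    n_attn_layers: int = 2      # number of attention layers L
    reward_horizon: int = 1     # reward prediction steps n (overridden per environment)
    ema_tau: float = 0.95       # EMA decay rate for the target encoder

# Per Table 2, n is 2 for Hopper Hop and Reacher Hard, and 1 for all other environments.
REWARD_HORIZON_OVERRIDES = {"hopper_hop": 2, "reacher_hard": 2}

def make_config(env_name: str) -> TSARConfig:
    return TSARConfig(reward_horizon=REWARD_HORIZON_OVERRIDES.get(env_name, 1))
```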
Table 3 Scores achieved by TSAR and comparison algorithms at 1 M time steps
| Environment | TSAR (ours) | TACO[16] | DrQ-v2[22] | CURL[8] | Dreamer-v3[44] | TDMPC[27] |
| --- | --- | --- | --- | --- | --- | --- |
| Quadruped Run | 657±25 | 541±38 | 407±21 | 181±14 | 331±42 | 397±37 |
| Hopper Hop | 293±41 | 261±52 | 189±35 | 152±34 | 369±21 | 195±18 |
| Walker Run | 699±22 | 637±11 | 517±43 | 387±24 | 765±32 | 600±28 |
| Quadruped Walk | 837±23 | 793±8 | 680±52 | 123±11 | 353±27 | 435±16 |
| Cheetah Run | 835±32 | 821±48 | 691±42 | 657±35 | 728±32 | 565±61 |
| Finger Turn Hard | 636±24 | 632±75 | 220±21 | 215±17 | 810±58 | 400±113 |
| Acrobot Swingup | 318±19 | 241±21 | 128±8 | 5±1 | 210±12 | 224±20 |
| Reacher Hard | 937±18 | 883±63 | 572±51 | 400±29 | 499±51 | 485±31 |
| Reach Duplo | 247±11 | 234±21 | 206±32 | 8±1 | 119±30 | 117±12 |
| Mean score | 606.6 | 560.3 | 226.4 | 236.4 | 464.9 | 379.8 |
| Median score | 657 | 632 | 179 | 181 | 369 | 400 |

Note: Bold indicates the best result among the algorithms in each environment.
Table 4 Comparison with other representation learning objectives
| Environment | TSAR (ours) | TACO[16] | M-CURL[11] | SPR[10] | ATC[39] | DrQ-v2[22] |
| --- | --- | --- | --- | --- | --- | --- |
| Quadruped Run | 657±25 | 541±38 | 536±45 | 448±79 | 432±54 | 407±21 |
| Hopper Hop | 293±41 | 261±52 | 248±61 | 154±10 | 112±98 | 192±41 |
| Walker Run | 699±22 | 637±21 | 623±39 | 560±71 | 502±171 | 517±43 |
| Quadruped Walk | 837±23 | 793±8 | 767±29 | 701±25 | 718±27 | 680±52 |
| Cheetah Run | 835±32 | 821±48 | 794±61 | 725±49 | 710±51 | 691±42 |
| Finger Turn Hard | 636±24 | 632±75 | 624±102 | 573±88 | 526±95 | 220±21 |
| Acrobot Swingup | 318±19 | 241±21 | 234±22 | 198±21 | 206±61 | 210±12 |
| Reacher Hard | 937±18 | 883±63 | 865±72 | 711±92 | 863±12 | 572±51 |
| Reach Duplo | 247±11 | 234±21 | 229±34 | 217±25 | 219±27 | 206±32 |
| Mean score | 606.6 | 560.3 | 546.7 | 476.3 | 475.4 | 226.4 |
| Median score | 657 | 632 | 623 | 560 | 502 | 179 |
Table 5 Comparison of state prediction accuracy
| Environment | TSAR (ours) error | TSAR (ours) score | TACO[16] error | TACO[16] score | M-CURL[11] error | M-CURL[11] score |
| --- | --- | --- | --- | --- | --- | --- |
| Quadruped Run | 0.097 | 657±25 | 0.157 | 541±38 | 0.124 | 536±45 |
| Walker Run | 0.081 | 699±22 | 0.145 | 637±21 | 0.111 | 623±39 |
| Hopper Hop | 0.206 | 293±41 | 0.267 | 261±52 | 0.245 | 248±61 |
| Reacher Hard | 0.052 | 937±18 | 0.142 | 883±63 | 0.107 | 865±72 |
| Acrobot Swingup | 0.063 | 318±19 | 0.101 | 241±21 | 0.082 | 234±22 |
References

[1] SHAO K, ZHU Y, ZHAO D. StarCraft micromanagement with reinforcement learning and curriculum transfer learning. IEEE Transactions on Emerging Topics in Computational Intelligence, 2019, 3(1): 73−84. doi: 10.1109/TETCI.2018.2823329
[2] HU G, LI H, LIU S, et al. NeuronsMAE: A novel multi-agent reinforcement learning environment for cooperative and competitive multi-robot tasks[C]//2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023: 1−8.
[3] WANG J, ZHANG Q, ZHAO D. Highway lane change decision-making via attention-based deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 2022, 9(3): 567−569. doi: 10.1109/JAS.2021.1004395
[4] KOSTRIKOV I, YARATS D, FERGUS R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels[C]//International Conference on Learning Representations. 2021.
[5] LIU M, ZHU Y, CHEN Y, et al. Enhancing reinforcement learning via Transformer-based state predictive representations. IEEE Transactions on Artificial Intelligence, 2024, 5(9): 4364−4375.
[6] LIU M, LI L, HAO S, et al. Soft contrastive learning with Q-irrelevance abstraction for reinforcement learning. IEEE Transactions on Cognitive and Developmental Systems, 2023, 15(3): 1463−1473. doi: 10.1109/TCDS.2022.3218940
[7] CHEN L, LU K, RAJESWARAN A, et al. Decision Transformer: Reinforcement learning via sequence modeling[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 15084−15097.
[8] LASKIN M, SRINIVAS A, ABBEEL P. CURL: Contrastive unsupervised representations for reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2020: 5639−5650.
[9] OORD A V D, LI Y, VINYALS O. Representation learning with contrastive predictive coding[C]//International Conference on Learning Representations. 2021.
[10] SCHWARZER M, ANAND A, GOEL R, et al. Data-efficient reinforcement learning with self-predictive representations[C]//International Conference on Learning Representations. 2021.
[11] ZHU J, XIA Y, WU L, et al. Masked contrastive representation learning for reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3421−3433.
[12] YU T, ZHANG Z, LAN C, et al. Mask-based latent reconstruction for reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 35. 2022: 25117−25131.
[13] YE W, LIU S, KURUTACH T, et al. Mastering Atari games with limited data[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 25476−25488.
[14] KIM M, RHO K, KIM Y D, et al. Action-driven contrastive representation for reinforcement learning. PLOS ONE, 2022, 17(3): e0265456. doi: 10.1371/journal.pone.0265456
[15] FUJIMOTO S, CHANG W D, SMITH E, et al. For SALE: State-action representation learning for deep reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 36. 2023: 61573−61624.
[16] ZHENG R, WANG X, SUN Y, et al. TACO: Temporal latent action-driven contrastive loss for visual reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 36. 2024.
[17] ZHANG A, MCALLISTER R, CALANDRA R, et al. Learning invariant representations for reinforcement learning without reconstruction[C]//International Conference on Learning Representations. 2021.
[18] CHAI J, LI W, ZHU Y, et al. UNMAS: Multiagent reinforcement learning for unshaped cooperative scenarios. IEEE Transactions on Neural Networks and Learning Systems, 2021, 34(4): 2093−2104.
[19] HANSEN N, SU H, WANG X. Stabilizing deep Q-learning with convnets and vision transformers under data augmentation[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 3680−3693.
[20] GELADA C, KUMAR S, BUCKMAN J, et al. DeepMDP: Learning continuous latent space models for representation learning[C]//International Conference on Machine Learning. PMLR, 2019: 2170−2179.
[21] LEE A X, NAGABANDI A, ABBEEL P, et al. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model[C]//Advances in Neural Information Processing Systems: Vol. 33. 2020: 741−752.
[22] YARATS D, FERGUS R, LAZARIC A, et al. Mastering visual continuous control: Improved data-augmented reinforcement learning[C]//International Conference on Learning Representations. 2022.
[23] PARK S, LEVINE S. Predictable MDP abstraction for unsupervised model-based RL[C]//International Conference on Machine Learning. PMLR, 2023: 27246−27268.
[24] YARATS D, FERGUS R, LAZARIC A, et al. Reinforcement learning with prototypical representations[C]//International Conference on Machine Learning. PMLR, 2021: 11920−11931.
[25] YARATS D, ZHANG A, KOSTRIKOV I, et al. Improving sample efficiency in model-free reinforcement learning from images[C]//Proceedings of the AAAI Conference on Artificial Intelligence: Vol. 35. 2021: 10674−10681.
[26] SCHWARZER M, RAJKUMAR N, NOUKHOVITCH M, et al. Pretraining representations for data-efficient reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 12686−12699.
[27] HANSEN N A, SU H, WANG X. Temporal difference learning for model predictive control[C]//International Conference on Machine Learning. PMLR, 2022: 8387−8406.
[28] HANSEN N, WANG X. Generalization in reinforcement learning by soft data augmentation[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 13611−13617.
[29] MA Y J, SODHANI S, JAYARAMAN D, et al. VIP: Towards universal visual reward and representation via value-implicit pre-training[C]//International Conference on Learning Representations. 2022.
[30] PARISI S, RAJESWARAN A, PURUSHWALKAM S, et al. The unsurprising effectiveness of pre-trained vision models for control[C]//International Conference on Machine Learning. PMLR, 2022: 17359−17371.
[31] HUA P, CHEN Y, XU H. Simple emergent action representations from multi-task policy training[C]//International Conference on Learning Representations. 2023.
[32] CHANDAK Y, THEOCHAROUS G, KOSTAS J, et al. Learning action representations for reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2019: 941−950.
[33] ALLSHIRE A, MARTÍN-MARTÍN R, LIN C, et al. LASER: Learning a latent action space for efficient reinforcement learning[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 6650−6656.
[34] EYSENBACH B, ZHANG T, LEVINE S, et al. Contrastive learning as goal-conditioned reinforcement learning[C]//Advances in Neural Information Processing Systems: Vol. 35. 2022: 35603−35620.
[35] GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: A new approach to self-supervised learning[C]//Advances in Neural Information Processing Systems: Vol. 33. 2020: 21271−21284.
[36] MAZOURE B, TACHET DES COMBES R, DOAN T L, et al. Deep reinforcement and infomax learning[C]//Advances in Neural Information Processing Systems: Vol. 33. 2020: 3686−3698.
[37] RAKELLY K, GUPTA A, FLORENSA C, et al. Which mutual-information representation learning objectives are sufficient for control?[C]//Advances in Neural Information Processing Systems: Vol. 34. 2021: 26345−26357.
[38] ANAND A, RACAH E, OZAIR S, et al. Unsupervised state representation learning in Atari[C]//Advances in Neural Information Processing Systems: Vol. 32. 2019: 8766−8779.
[39] STOOKE A, LEE K, ABBEEL P, et al. Decoupling representation learning from reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2021: 9870−9879.
[40] ZHU Y, ZHAO D. Online minimax Q network learning for two-player zero-sum Markov games. IEEE Transactions on Neural Networks and Learning Systems, 2020, 33(3): 1228−1241.
[41] KRIZHEVSKY A, SUTSKEVER I, HINTON G. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems: Vol. 25. 2012: 1097−1105.
[42] HAYNES D, CORNS S, VENAYAGAMOORTHY G K. An exponential moving average algorithm[C]//2012 IEEE Congress on Evolutionary Computation. IEEE, 2012: 1−8.
[43] LI N, CHEN Y, LI W, et al. BViT: Broad attention-based vision Transformer. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(9): 12772−12783. doi: 10.1109/TNNLS.2023.3264730
[44] SHAKYA A K, PILLAI G, CHAKRABARTY S. Reinforcement learning algorithms: A brief survey. Expert Systems with Applications, 2023, 231: 120495. doi: 10.1016/j.eswa.2023.120495