

基于表征学习的离线强化学习方法研究综述

王雪松 王荣荣 程玉虎

杨峰, 郑丽涛, 王家琦, 潘泉. 双层无迹卡尔曼滤波. 自动化学报, 2019, 45(7): 1386-1391. doi: 10.16383/j.aas.c180349
引用本文: 王雪松, 王荣荣, 程玉虎. 基于表征学习的离线强化学习方法研究综述. 自动化学报, 2024, 50(6): 1104−1128 doi: 10.16383/j.aas.c230546
YANG Feng, ZHENG Li-Tao, WANG Jia-Qi, PAN Quan. Double Layer Unscented Kalman Filter. ACTA AUTOMATICA SINICA, 2019, 45(7): 1386-1391. doi: 10.16383/j.aas.c180349
Citation: Wang Xue-Song, Wang Rong-Rong, Cheng Yu-Hu. A review of offline reinforcement learning based on representation learning. Acta Automatica Sinica, 2024, 50(6): 1104−1128 doi: 10.16383/j.aas.c230546

基于表征学习的离线强化学习方法研究综述

doi: 10.16383/j.aas.c230546
基金项目: 国家自然科学基金(62373364, 62176259), 江苏省重点研发计划项目(BE2022095)资助
    作者简介:

    王雪松:中国矿业大学信息与控制工程学院教授. 2002年获得中国矿业大学博士学位. 主要研究方向为机器学习与模式识别. E-mail: wangxuesongcumt@163.com

    王荣荣:中国矿业大学信息与控制工程学院博士研究生. 2021年获得济南大学硕士学位. 主要研究方向为深度强化学习. E-mail: wangrongrong1996@126.com

    程玉虎:中国矿业大学信息与控制工程学院教授. 2005年获得中国科学院自动化研究所博士学位. 主要研究方向为机器学习与智能系统. 本文通信作者. E-mail: chengyuhu@163.com

A Review of Offline Reinforcement Learning Based on Representation Learning

Funds: Supported by National Natural Science Foundation of China (62373364, 62176259) and Key Research and Development Program of Jiangsu Province (BE2022095)
    Author Bio:

    WANG Xue-Song Professor at the School of Information and Control Engineering, China University of Mining and Technology. She received her Ph.D. degree from China University of Mining and Technology in 2002. Her research interest covers machine learning and pattern recognition

    WANG Rong-Rong Ph.D. candidate at the School of Information and Control Engineering, China University of Mining and Technology. She received her master's degree from University of Jinan in 2021. Her main research interest is deep reinforcement learning

    CHENG Yu-Hu Professor at the School of Information and Control Engineering, China University of Mining and Technology. He received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2005. His research interest covers machine learning and intelligent system. Corresponding author of this paper

  • 摘要: 强化学习(Reinforcement learning, RL)通过智能体与环境在线交互来学习最优策略, 近年来已成为解决复杂环境下感知决策问题的重要手段. 然而, 在线收集数据的方式可能会引发安全、时间或成本等问题, 极大限制了强化学习在实际中的应用. 与此同时, 原始数据的维度高且结构复杂, 解决复杂高维数据输入问题也是强化学习面临的一大挑战. 幸运的是, 基于表征学习的离线强化学习能够仅从历史经验数据中学习策略, 而无需与环境产生交互. 它利用表征学习技术将离线数据集中的特征表示为低维向量, 然后利用这些向量来训练离线强化学习模型. 这种数据驱动的方式为实现通用人工智能提供了新契机. 为此, 对近期基于表征学习的离线强化学习方法进行全面综述. 首先给出离线强化学习的形式化描述, 然后从方法、基准数据集、离线策略评估与超参数选择3个层面对现有技术进行归纳整理, 进一步介绍离线强化学习在工业、推荐系统、智能驾驶等领域中的研究动态. 最后, 对全文进行总结, 并探讨基于表征学习的离线强化学习未来所面临的关键挑战与发展趋势, 以期为后续的研究提供有益参考.
  • 状态估计在信号处理、计算机视觉、自动控制、目标跟踪、导航、金融、通信等领域[1-6]有着广泛应用.在高斯噪声环境下, 卡尔曼滤波(Kalman filter, KF)[7]及其次优滤波算法可以很好地解决该问题.在非高斯噪声环境下, KF算法及其次优滤波算法不再适用, 因此需要粒子滤波(Particle filter, PF)[8]及其衍生滤波算法来解决状态估计问题.

    基于无迹变换(Unscented transform, UT)的无迹卡尔曼滤波(Unscented Kalman filter, UKF)[9-11]是一种计算非线性变换均值和协方差的次优卡尔曼滤波算法.相比于扩展卡尔曼滤波(Extended Kalman filter, EKF), UKF不需要计算雅可比矩阵, 且可以达到非线性函数二阶泰勒展开式的精度[9], 因此其在导航制导、目标跟踪、信号处理和图像跟踪等方面有着广泛应用.但UKF算法在某些情况下也存在估计效果差的问题.

    目前, 针对UKF算法估计值不准确的问题, 已有众多改进方法.为了解决UKF在工程应用中因舍入误差导致数值不稳定的问题, 提出了求根UKF (Square-root unscented Kalman filter, SRUKF)[12]算法.在加性噪声条件下, 为了降低UKF算法的计算复杂度, 提出了简化UKF (Simplified unscented Kalman filter, SUKF)[13]算法.在先验信息不确定性大而量测精度高的情况下, 只利用一次量测值的UKF算法的估计效果较差, 因此提出了多次利用量测值的迭代UKF (Iterated unscented Kalman filter, IUKF)[14], 以及递归更新滤波器(递归更新扩展卡尔曼滤波(Recursive update extended Kalman filter, RUEKF)[15]、递归更新容积卡尔曼滤波(Recursive update cubature Kalman filter, RUCKF)[16])等算法.基于二阶UT变换的UKF算法滤波估计精度只能达到二阶, 为了提高滤波精度, 提出了基于高阶UT变换和高阶容积变换(Cubature transform, CT)的高阶UKF[17-18]和高阶容积卡尔曼滤波(Cubature Kalman filter, CKF)[19-21]等算法.

    UKF及其改进算法虽然可以较好地处理UKF算法估计不准确的问题, 但在非线性程度高的环境下仍然存在估计效果差等问题.文献[22-23]提出将UKF算法作为PF算法的建议分布, 即将UKF算法的估计值作为重要性密度函数, 这就是无迹粒子滤波(Unscented particle filter, UPF)[22-23]算法.从理论上讲, 随着随机采样粒子数量的增加, UPF算法的精度可以逐渐提高.但UPF算法也存在一些问题, 如运算时间长、时效性较差.且UPF算法的效果并不总是好于UKF算法: 在量测噪声较大时, UPF算法的估计精度会不如UKF算法.

    为了在低计算负载的情况下获得高的滤波估计精度, 本文提出了双层无迹卡尔曼滤波器(Double layer unscented Kalman filter, DLUKF)算法.其核心思想是用带有权值的采样点表示前一时刻的后验密度函数; 而后用内层的UKF算法对每个带权值的采样点进行更新, 并用最新的量测值对采样点的权值进行更新; 然后将各个采样点进行加权融合, 得到初始估计值; 最后用外层UKF算法的更新机制对初始估计值进行更新, 得到最终估计值.

    假设非线性函数为$\mathit{\boldsymbol{y}} = \mathit{\boldsymbol{f}}(\mathit{\boldsymbol{x}}) $. UT变换不直接近似非线性函数本身, 而是通过近似状态的概率密度分布来近似非线性变换的结果.其在得到先验均值$\bar {\mathit{\boldsymbol{x}}}$和协方差$ {\mathit{\boldsymbol{P}}_{xx}}$的基础上, 用采样策略选取一组确定性采样点集, 而后得到这些采样点经非线性变换后的点集, 进而求得经非线性变换后的均值$\bar {\mathit{\boldsymbol{y}}}$和协方差${\mathit{\boldsymbol{P}}_{yy}}$.

    UT变换算法可以归纳为以下三步:

    1) 根据先验均值$\bar {\mathit{\boldsymbol{x}}}$和协方差${\mathit{\boldsymbol{P}}_{xx}}$, 用采样策略得到$N$个确定性采样点$\{ {\mathit{\boldsymbol{x}}_i}\} _{i = 1}^N$.定义$w_i^m$为均值加权作用的权值, $w_i^c$为协方差加权所用的权值.

    2) 将确定性采样点$\{ {\mathit{\boldsymbol{x}}_i}\} _{i = 1}^N$进行非线性$\mathit{\boldsymbol{f}}(\cdot) $变换, 得到$N$个经非线性变换后的采样点集$\{ {\mathit{\boldsymbol{y}}_i}\} _{i = 1}^N = \mathit{\boldsymbol{f}}(\{ {\mathit{\boldsymbol{x}}_i}\} _{i = 1}^N) $.

    3) 通过对采样点集$\{ {\mathit{\boldsymbol{y}}_i}\} _{i = 1}^N$加权, 得到经非线性变换后的均值$\bar {\mathit{\boldsymbol{y}}} = \sum\nolimits_{i = 1}^N {w_i^m{\mathit{\boldsymbol{y}}_i}} $和协方差${\mathit{\boldsymbol{P}}_{yy}} = \sum\nolimits_{i = 1}^N {w_i^c({\mathit{\boldsymbol{y}}_i} - \bar {\mathit{\boldsymbol{y}}}){{({\mathit{\boldsymbol{y}}_i} - \bar {\mathit{\boldsymbol{y}}})}^{\rm T}}} $.
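上述UT变换的三个步骤可以用如下Python代码示意(非原文实现, 采用对称采样策略, 函数名与默认参数均为笔者自拟):

```python
import numpy as np

def unscented_transform(f, x_mean, P_xx, kappa=1.0):
    """对称采样UT变换: 返回 y = f(x) 的近似均值与协方差."""
    n = x_mean.size
    S = np.linalg.cholesky((n + kappa) * P_xx)   # 平方根矩阵, 列即 (sqrt((n+κ)P))_i
    # 1) 选取 2n+1 个确定性采样点及其权值
    pts = np.vstack([x_mean, x_mean + S.T, x_mean - S.T])
    w = np.full(2 * n + 1, 1.0 / (2 * n + 2 * kappa))
    w[0] = kappa / (n + kappa)
    # 2) 对采样点做非线性变换
    ys = np.array([f(p) for p in pts])
    # 3) 加权求经非线性变换后的均值与协方差
    y_mean = w @ ys
    d = ys - y_mean
    P_yy = (w[:, None] * d).T @ d
    return y_mean, P_yy
```

对线性函数, UT变换给出的均值与协方差是精确的, 可以据此检验实现的正确性.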

    考虑典型的非线性系统, 其状态方程和量测方程分别为:

    $ \begin{equation} {\mathit{\boldsymbol{x}}_{k + 1}} = \mathit{\boldsymbol{f}}({\mathit{\boldsymbol{x}}_k}) + {\mathit{\boldsymbol{w}}_k} \end{equation} $

    (1)

    $ \begin{equation} {\mathit{\boldsymbol{z}}_{k + 1}} = \mathit{\boldsymbol{h}}({\mathit{\boldsymbol{x}}_{k + 1}}) + {\mathit{\boldsymbol{v}}_{k + 1}} \end{equation} $

    (2)

    ${\mathit{\boldsymbol{x}}_k}$为$k$时刻$n$维的状态向量, ${\mathit{\boldsymbol{z}}_{k + 1}}$为$k + 1$时刻的量测向量. ${\mathit{\boldsymbol{w}}_k}$为$m$维的过程噪声, 其服从均值为0方差为$\mathit{\boldsymbol{Q}}$的高斯分布. ${\mathit{\boldsymbol{v}}_{k + 1}}$为$q$维的量测噪声, 其服从均值为0方差为$\mathit{\boldsymbol{R}}$的高斯分布.滤波算法的目的就是从带有噪声的量测值${\mathit{\boldsymbol{z}}_{k + 1}}$中估计出真实值${\mathit{\boldsymbol{x}}_{k + 1}}$.

    UKF[9-10]算法是基于UT变换的一种滤波算法, 其思想是在一步预测时用UT变换进行均值和协方差的传递.在UKF算法中, 因为存在噪声项, 需要对状态进行扩维, 扩维后的状态向量可以表示为$\mathit{\boldsymbol{x}}^a = {[{\mathit{\boldsymbol{x}}^{\rm T}}\;\;{\mathit{\boldsymbol{w}}^{\rm T}}\;\;{\mathit{\boldsymbol{v}}^{\rm T}}]^{\rm T}}$. UKF算法流程为:

    1) 在$k$时刻由UT变换中的采样策略得到$N$个采样点集$\{ \mathit{\boldsymbol{x}}_k^i\} _{i = 1}^N$.

    2) 采样点集$\{ \mathit{\boldsymbol{x}}_k^i\} _{i = 1}^N$经非线性变换$\mathit{\boldsymbol{f}}(\cdot) $后得到采样点集$\{ \mathit{\boldsymbol{x}}_{k + 1|k}^i\} _{i = 1}^N$.

    3) 由采样点集$\{ \mathit{\boldsymbol{x}}_{k + 1|k}^i\} _{i = 1}^N$加权求得预测值${\hat {\mathit{\boldsymbol{x}}}_{k + 1|k}}$和预测协方差${\hat {\mathit{\boldsymbol{P}}}_{k + 1|k}}$.

    4) 采样点集$\{ \mathit{\boldsymbol{x}}_{k + 1|k}^i\} _{i = 1}^N$经非线性变换$\mathit{\boldsymbol{h}}(\cdot) $后得到采样点集$\{ \mathit{\boldsymbol{z}}_{k + 1|k}^i\} _{i = 1}^N$.

    5) 由采样点集$\{ \mathit{\boldsymbol{z}}_{k + 1|k}^i\} _{i = 1}^N$加权求得预测的量测值${\hat {\mathit{\boldsymbol{z}}}_{k + 1|k}}$及其协方差${\mathit{\boldsymbol{P}}_{zz}}$和互协方差${\mathit{\boldsymbol{P}}_{xz}}$.

    6) 求得$k + 1$时刻的估计值${\hat {\mathit{\boldsymbol{x}}}_{k + 1}}$及协方差${\hat {\mathit{\boldsymbol{P}}}_{k + 1}}$.
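上述流程1) ~ 6)可以用如下Python代码示意(非原文实现; 为简明起见采用加性噪声形式, 未对状态扩维, 函数名为笔者自拟):

```python
import numpy as np

def sigma_points(x, P, kappa=1.0):
    """对称采样: 生成 2n+1 个确定性采样点及其权值."""
    n = x.size
    S = np.linalg.cholesky((n + kappa) * P)
    pts = np.vstack([x, x + S.T, x - S.T])
    w = np.full(2 * n + 1, 1.0 / (2 * n + 2 * kappa))
    w[0] = kappa / (n + kappa)
    return pts, w

def ukf_step(f, h, x, P, Q, R, z, kappa=1.0):
    """一步UKF: 时间更新(步骤1~3)与量测更新(步骤4~6)."""
    # 时间更新: 采样点经 f(·) 传播后加权求预测均值与协方差
    pts, w = sigma_points(x, P, kappa)
    xp = np.array([f(p) for p in pts])
    x_pred = w @ xp
    dx = xp - x_pred
    P_pred = (w[:, None] * dx).T @ dx + Q
    # 量测更新: 基于预测分布重新采样, 经 h(·) 传播
    pts, w = sigma_points(x_pred, P_pred, kappa)
    zp = np.array([h(p) for p in pts])
    z_pred = w @ zp
    dz = zp - z_pred
    dx = pts - x_pred
    Pzz = (w[:, None] * dz).T @ dz + R
    Pxz = (w[:, None] * dx).T @ dz
    K = Pxz @ np.linalg.inv(Pzz)
    return x_pred + K @ (z - z_pred), P_pred - K @ Pzz @ K.T
```

对线性系统, 该UKF步骤与标准卡尔曼滤波的结果一致, 可据此验证实现.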

    在实际应用中, 受初始误差的影响, UKF算法存在收敛速度慢、精度不高等问题.基于此, 文献[14]提出了IUKF算法, 文献[15]提出了RUEKF算法, 文献[16]提出了RUCKF算法.这三种算法的核心思想都是多次利用量测值对估计值进行更新, 以获得更好的滤波估计效果.

    UPF[22-23]算法是在PF算法的基础上, 用UKF算法的滤波估计值作为PF算法的建议密度函数.这虽然可以解决UKF算法不适用于非高斯环境等问题, 但由于要选取大量的随机采样点来逼近密度函数, UPF算法面临着计算量大的问题. UPF算法具体步骤如下:

    1) 由$p({\mathit{\boldsymbol{x}}_0}) $得到$N$个粒子点$\{ \mathit{\boldsymbol{x}}_0^{(i)}\} _{i = 1}^N$, 初始权值为$\mathit{\boldsymbol{w}}_0^{(i)} = 1/N$.

    2) 用UKF算法对每一粒子进行状态更新.

    3) 计算粒子点对应的权值$\mathit{\boldsymbol{w}}_k^{(i)} = \mathit{\boldsymbol{w}}_{k - 1}^{(i)}\frac{{p({\mathit{\boldsymbol{z}}_k}|\mathit{\boldsymbol{x}}_k^{(i)})p(\mathit{\boldsymbol{x}}_k^{(i)}|\mathit{\boldsymbol{x}}_{k - 1}^{(i)})}}{{q(\mathit{\boldsymbol{x}}_k^{(i)}|{\mathit{\boldsymbol{z}}_{1:k}})}}$并对其归一化.

    4) 当粒子退化严重时, 对粒子进行重采样.

    5) 计算每个粒子点${\mathit{\boldsymbol{x}}^{(i)}}$对应的协方差.

    6) 重复步骤2)~5).

    最后得到$k$时刻状态量的估计为${\hat {\mathit{\boldsymbol{x}}}_k} = \sum\nolimits_{i = 1}^N {\tilde {\mathit{\boldsymbol{w}}}_k^{(i)}\mathit{\boldsymbol{x}}_k^{(i)}} $.

    UPF需要用大量的粒子点去逼近状态的后验密度函数, 因此其有着运算量大的问题.本文所提的DLUKF算法用带权值的采样点去表征状态的后验密度函数, 其核心思想为用内层的UKF对每个带权值的采样点进行更新, 而后用最新的量测值对每个采样点的权值进行更新, 并对更新后的采样点进行加权求和得到下一时刻初始估计值, 然后将该初始估计值作为预测值运行外层UKF算法, 从而得到最终估计值.

    DLUKF算法由外层UKF算法和内层UKF算法组成, 其算法流程如下:

    状态初始条件为初始值${\hat {\mathit{\boldsymbol{x}}}_0} = {\rm E}({\mathit{\boldsymbol{x}}_0}) $, 初始协方差${\hat {\mathit{\boldsymbol{P}}}_0} = {\rm E}(({\mathit{\boldsymbol{x}}_0} - {\hat {\mathit{\boldsymbol{x}}}_0}){({\mathit{\boldsymbol{x}}_0} - {\hat {\mathit{\boldsymbol{x}}}_0})^{\rm T}}) $.因为存在噪声项, 需要对初始的状态进行扩维处理.其可以表示为

    $ \begin{equation} \hat {\mathit{\boldsymbol{x}}}_0^a = {\left[ {\begin{array}{*{20}{c}} {{{\hat {\mathit{\boldsymbol{x}}}}_0}}&0&0 \end{array}} \right]^{\rm T}} \end{equation} $

    (3)

    $ \begin{equation} \mathit{\boldsymbol{P}}_0^a = \left[ {\begin{array}{*{20}{c}} {{\mathit{\boldsymbol{P}}_0}}&0&0\\ 0&\mathit{\boldsymbol{Q}}&0\\ 0&0&\mathit{\boldsymbol{R}} \end{array}} \right] \end{equation} $

    (4)

    内层UKF算法:

    在$k$时刻, 用采样策略选取$N$个采样点$\{ {\hat {\mathit{\boldsymbol{x}}}_{i, k}}\} _{i = 1}^N$, 并求取其权值对应的一阶矩$w_{i, k}^m$和二阶矩$w_{i, k}^c$.而后用内层UKF算法对每个采样点进行更新.

    对每个采样点, 用采样策略选取$M$个采样点$ \{ {\hat {\mathit{\boldsymbol{x}}}_{j, i, k}}\} _{j = 1}^M$, 并取其对应的一阶矩$w_{j, i, k}^m$和二阶矩$w_{j, i, k}^c$.

    时间更新:

    $ \begin{equation} \hat {\mathit{\boldsymbol{x}}}_{j, i, k + 1|k}^x = \mathit{\boldsymbol{f}}(\hat {\mathit{\boldsymbol{x}}}_{j, i, k}^x, \hat {\mathit{\boldsymbol{x}}}_{j, i, k}^w) \end{equation} $

    (5)

    $ \begin{equation} {\hat {\mathit{\boldsymbol{x}}}_{i, k + 1|k}} = \sum\limits_{j = 1}^M { w_{j, i, k}^m\hat {\mathit{\boldsymbol{x}}}_{j, i, k + 1|k}^x} \end{equation} $

    (6)

    $ \begin{align} {{\hat {\mathit{\boldsymbol{P}}}}_{i, k + 1|k}} = &\sum\limits_{j = 1}^M {w_{j, i}^c(\hat {\mathit{\boldsymbol{x}}}_{j, i, k + 1|k}^x - {{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1|k}})}\times\nonumber\\& {(\hat {\mathit{\boldsymbol{x}}}_{j, i, k + 1|k}^x - {{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1|k}})^{\rm T}} + \mathit{\boldsymbol{Q}} \end{align} $

    (7)

    量测更新:

    基于预测值${\hat {\mathit{\boldsymbol{x}}}_{i, k + 1|k}}$和预测协方差${\hat {\mathit{\boldsymbol{P}}}_{i, k + 1|k}}$产生新的$M$个带权值的采样点.

    $ \begin{equation} {\mathit{\boldsymbol{z}}_{j, i, k + 1|k}} = \mathit{\boldsymbol{h}}(\mathit{\boldsymbol{x}}_{j, i, k + 1|k}^x, \mathit{\boldsymbol{x}}_{j, i, k + 1|k}^v) \end{equation} $

    (8)

    $ \begin{equation} {\hat {\mathit{\boldsymbol{z}}}_{i, k + 1|k}} = \sum\limits_{j = 1}^M {w_{j, i}^m{\mathit{\boldsymbol{z}}_{j, i, k + 1|k}}} \end{equation} $

    (9)

    $ \begin{align} {\mathit{\boldsymbol{P}}_{i, zz}} = & \sum\limits_{j = 1}^M {w_{j, i}^c({\mathit{\boldsymbol{z}}_{j, i, k + 1|k}} - {{\hat {\mathit{\boldsymbol{z}}}}_{i, k + 1|k}})}\times \nonumber\\& {({\mathit{\boldsymbol{z}}_{j, i, k + 1|k}} - {{\hat {\mathit{\boldsymbol{z}}}}_{i, k + 1|k}})^{\rm T}} + \mathit{\boldsymbol{R}} \end{align} $

    (10)

    $ \begin{align} {\mathit{\boldsymbol{P}}_{i, xz}} = &\sum\limits_{j = 1}^M w_{j, i}^c(\hat {\mathit{\boldsymbol{x}}}_{j, i, k + 1|k}^x - {{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1|k}})\times \nonumber\\&{{({\mathit{\boldsymbol{z}}_{j, i, k + 1|k}} - {{\hat {\mathit{\boldsymbol{z}}}}_{i, k + 1|k}})}^{\rm T}} \end{align} $

    (11)

    $ \begin{equation} {\mathit{\boldsymbol{K}}_{i, k + 1}} = {\mathit{\boldsymbol{P}}_{i, xz}}\mathit{\boldsymbol{P}}_{i, zz}^{ - 1} \end{equation} $

    (12)

    $ \begin{equation} {\hat {\mathit{\boldsymbol{x}}}_{i, k + 1}} = {\hat {\mathit{\boldsymbol{x}}}_{i, k + 1|k}} + {\mathit{\boldsymbol{K}}_{i, k + 1}}({\mathit{\boldsymbol{z}}_{k + 1}} - {\hat {\mathit{\boldsymbol{z}}}_{i, k + 1|k}}) \end{equation} $

    (13)

    $ \begin{equation} {\hat {\mathit{\boldsymbol{P}}}_{i, k + 1}} = {\hat {\mathit{\boldsymbol{P}}}_{i, k + 1|k}} - {\mathit{\boldsymbol{K}}_{i, k + 1}}{\mathit{\boldsymbol{P}}_{i, zz}}\mathit{\boldsymbol{K}}_{i, k + 1}^{\rm T} \end{equation} $

    (14)

    在采样点用内层UKF算法更新后, 类似于UPF算法, 表示一阶矩的权值和表示二阶矩的权值的更新可以表示为:

    $ \begin{equation} \left\{ {\begin{array}{*{20}{c}} {w_i^m = w_i^m\frac{{p({\mathit{\boldsymbol{z}}_{k + 1}}|{{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}})p({{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}}|{{\hat {\mathit{\boldsymbol{x}}}}_{i, k}})}}{{q({{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}}|{\mathit{\boldsymbol{z}}_{1:k}})}}}\\ {w_i^c = w_i^c\frac{{p({\mathit{\boldsymbol{z}}_{k + 1}}|{{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}})p({{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}}|{{\hat {\mathit{\boldsymbol{x}}}}_{i, k}})}}{{q({{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}}|{\mathit{\boldsymbol{z}}_{1:k}})}}} \end{array}} \right. \end{equation} $

    (15)

    在得到权值更新的基础上, 对权值进行归一化处理, 有

    $ \begin{equation} \left\{ {\begin{array}{*{20}{c}} {w_i^m = \frac{{w_i^m}}{{\sum\limits_{i = 1}^N {w_i^m} }}}\\ {w_i^c = \frac{{w_i^c}}{{\sum\limits_{i = 1}^N {w_i^c} }}} \end{array}} \right. \end{equation} $

    (16)

    $k + 1$时刻的初始估计值及其协方差可以表示为

    $ \begin{equation} \hat {\mathit{\boldsymbol{x}}}_{k + 1}^I = \sum\limits_{i = 1}^N {w_i^m{{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}}} \end{equation} $

    (17)

    $ \begin{equation} \hat {\mathit{\boldsymbol{P}}}_{k + 1}^I = \sum\limits_{i = 1}^N {w_i^c({{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}} - \hat {\mathit{\boldsymbol{x}}}_{k + 1}^I){{({{\hat {\mathit{\boldsymbol{x}}}}_{i, k + 1}} - \hat {\mathit{\boldsymbol{x}}}_{k + 1}^I)}^{\rm T}}} + \mathit{\boldsymbol{Q}} \end{equation} $

    (18)

    外层UKF算法:

    基于$\hat {\mathit{\boldsymbol{x}}}_{k + 1}^I$和$\hat {\mathit{\boldsymbol{P}}}_{k + 1}^I$, 用采样策略选取$N$个带权值的采样点$ \{ \mathit{\boldsymbol{x}}_{i, k + 1}^I\} _{i = 1}^N$.而后再次对粒子点进行量测更新, 可以表示为:

    $ \begin{equation} \mathit{\boldsymbol{z}}_{i, k + 1}^I = \mathit{\boldsymbol{h}}(\mathit{\boldsymbol{x}}_{i, k + 1}^{I, x}, \mathit{\boldsymbol{x}}_{i, k + 1}^{I, v}) \end{equation} $

    (19)

    $ \begin{equation} \hat {\mathit{\boldsymbol{z}}}_{k + 1}^I = \sum\limits_{i = 1}^N {w_i^m\mathit{\boldsymbol{z}}_{i, k + 1}^I} \end{equation} $

    (20)

    $ \begin{equation} \mathit{\boldsymbol{P}}_{zz}^I = \sum\limits_{i = 1}^N {w_i^c(\mathit{\boldsymbol{z}}_{i, k + 1}^I - \hat {\mathit{\boldsymbol{z}}}_{k + 1}^I){{(\mathit{\boldsymbol{z}}_{i, k + 1}^I - \hat {\mathit{\boldsymbol{z}}}_{k + 1}^I)}^{\rm T}}} + \mathit{\boldsymbol{R}} \end{equation} $

    (21)

    $ \begin{equation} \mathit{\boldsymbol{P}}_{xz}^I = \sum\limits_{i = 1}^N {w_i^c(\mathit{\boldsymbol{x}}_{i, k + 1}^{I, x} - \hat {\mathit{\boldsymbol{x}}}_{k + 1}^I){{(\mathit{\boldsymbol{z}}_{i, k + 1}^I - \hat {\mathit{\boldsymbol{z}}}_{k + 1}^I)}^{\rm T}}} \end{equation} $

    (22)

    $ \begin{equation} \mathit{\boldsymbol{K}}_{k + 1}^I = \mathit{\boldsymbol{P}}_{xz}^I{(\mathit{\boldsymbol{P}}_{zz}^I)^{ - 1}} \end{equation} $

    (23)

    $ \begin{equation} {\hat {\mathit{\boldsymbol{x}}}_{k + 1}} = \hat {\mathit{\boldsymbol{x}}}_{k + 1}^I + \mathit{\boldsymbol{K}}_{k + 1}^I({\mathit{\boldsymbol{z}}_{k + 1}} - \hat {\mathit{\boldsymbol{z}}}_{k + 1}^I) \end{equation} $

    (24)

    $ \begin{equation} {\hat {\mathit{\boldsymbol{P}}}_{k + 1}} = \hat {\mathit{\boldsymbol{P}}}_{k + 1}^I - \mathit{\boldsymbol{K}}_{k + 1}^I\mathit{\boldsymbol{P}}_{zz}^I{(\mathit{\boldsymbol{K}}_{k + 1}^I)^{\rm T}} \end{equation} $

    (25)

    不断重复方程(5)~(25), 即可求得DLUKF算法在每个时刻的估计值${\hat {\mathit{\boldsymbol{x}}}_k}$.

    DLUKF算法的流程图如图 1所示.

    图 1  DLUKF算法流程图
    Fig. 1  The flow-chart of DLUKF

    根据选取粒子点的采样策略不同, 又可以得到多种DLUKF算法.在UT变换中, 目前的采样策略方法包括对称采样、单形采样、3阶矩偏度采样和高斯分布4阶矩对称采样[8]等.还有为了保证经过非线性变换后协方差${\mathit{\boldsymbol{P}}_{yy}}$的正定性而提出的对基本采样策略进行比例修正的算法框架.

    下面主要详细介绍对称采样策略.

    考虑均值$\bar {\mathit{\boldsymbol{x}}}$和协方差${\mathit{\boldsymbol{P}}_{xx}}$的情况下, 通过对称采样的策略选取$N = 2n + 1$个采样点.采样点及其权值可以表示为:

    $ \begin{equation} \left\{ \begin{array}{l} {\mathit{\boldsymbol{x}}^{(1)}} = \bar {\mathit{\boldsymbol{x}}}\\ \{ {\mathit{\boldsymbol{x}}^{(i)}}\} _{i = 2}^{N - n} = \bar {\mathit{\boldsymbol{x}}} + \sqrt {(n + \kappa )} {(\sqrt {{\mathit{\boldsymbol{P}}_{xx}}} )_{i - 1}}\\ \{ {\mathit{\boldsymbol{x}}^{(i)}}\} _{i = N - n + 1}^N = \bar {\mathit{\boldsymbol{x}}} - \sqrt {(n + \kappa )} {(\sqrt {{\mathit{\boldsymbol{P}}_{xx}}} )_{i - n - 1}} \end{array} \right. \end{equation} $

    (26)

    $ \begin{equation} \left\{ \begin{array}{l} w_1^m = w_1^c = \frac{{\kappa }}{{n + \kappa }}\\ \{ w_i^m\} _{i = 2}^N = \{ w_i^c\} _{i = 2}^N = \frac{{1}}{{2n + 2\kappa }} \end{array} \right. \end{equation} $

    (27)

    式(26)中的$n$表示均值$\bar {\mathit{\boldsymbol{x}}}$的维数. $\kappa $为比例参数, 可调节采样点与均值$\bar {\mathit{\boldsymbol{x}}}$间的距离, 仅影响二阶以后高阶矩带来的误差. $ {(\sqrt {{\mathit{\boldsymbol{P}}_{xx}}})_i}$表示平方根矩阵的第$i$列或行.

    在对称采样策略中, 采样点除了中心点外, 其他的采样点的权值是相同的.这说明除中心点外, 其他采样点的重要性是相同的.从采样点的分布可以看出, 采样点是关于中心点呈中心对称的.
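式(26)~(27)的对称采样点及其权值恰好精确复现均值与协方差, 下面用一段Python代码验证这一性质(示意实现, 函数名为笔者自拟):

```python
import numpy as np

def symmetric_sample(x_mean, P, kappa=1.0):
    """按式(26)~(27)生成 2n+1 个对称采样点及其权值."""
    n = x_mean.size
    # Cholesky分解给出 sqrt((n+κ)P), 其列对应式(26)中的 (sqrt(P))_i 项
    S = np.linalg.cholesky((n + kappa) * P)
    pts = np.vstack([x_mean, x_mean + S.T, x_mean - S.T])
    w = np.full(2 * n + 1, 1.0 / (2 * n + 2 * kappa))  # 式(27)第二行
    w[0] = kappa / (n + kappa)                          # 式(27)第一行
    return pts, w
```

权值之和为1, 加权均值等于$\bar {\mathit{\boldsymbol{x}}}$, 加权散布矩阵等于${\mathit{\boldsymbol{P}}_{xx}}$, 这正是UT变换能达到二阶精度的原因.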

    基于对称采样的DLUKF算法就是在产生粒子点时用对称采样策略产生粒子点, 其具体的算法流程如下:

    1) $k$时刻的估计值为${\hat {\mathit{\boldsymbol{x}}}_k}$, 协方差为${\hat {\mathit{\boldsymbol{P}}}_k}$.

    2) 基于${\hat {\mathit{\boldsymbol{x}}}_k}$和${\hat {\mathit{\boldsymbol{P}}}_k}$, 通过式(26)和式(27)求得外层UKF算法$N$个采样点$\{ {\hat {\mathit{\boldsymbol{x}}}_{i, k}}\} _{i = 1}^N$, 及其权值对应的一阶矩$w_{i, k}^m$和二阶矩$w_{i, k}^c$.

    3) 通过方程(5) $ \sim $ (14)得到每个粒子经内层UKF更新后的粒子点$\{ {\hat {\mathit{\boldsymbol{x}}}_{i, k + 1}}\} _{i = 1}^N$及其协方差$\{ {\hat {\mathit{\boldsymbol{P}}}_{i, k + 1}}\} _{i = 1}^N$.

    4) 通过方程(15) $ \sim $ (16)得到外层UKF更新后的权值$\{ w_i^m\} _{i = 1}^N$和$\{ w_i^c\} _{i = 1}^N$.

    5) 通过方程(17) $ \sim $ (18), 得到$k + 1$时刻的初始估计值$\hat {\mathit{\boldsymbol{x}}}_{k + 1}^I$及其协方差$\hat {\mathit{\boldsymbol{P}}}_{k + 1}^I$.

    6) 基于$\hat {\mathit{\boldsymbol{x}}}_{k + 1}^I$和$\hat {\mathit{\boldsymbol{P}}}_{k + 1}^I$, 通过式(26)和式(27)求得$N$个采样点$\{ \mathit{\boldsymbol{x}}_{i, k + 1}^{I}\} _{i = 1}^N$.

    7) 通过方程(19) $ \sim $ (25), 得到$k + 1$时刻的估计值为${\hat {\mathit{\boldsymbol{x}}}_{k + 1}}$, 协方差为${\hat {\mathit{\boldsymbol{P}}}_{k + 1}}$.
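上述步骤1) ~ 7)可以用如下Python代码给出一个简化的示意实现(非原文实现; 内层UKF采用加性噪声形式且各采样点共用协方差${\hat {\mathit{\boldsymbol{P}}}_k}$, 式(15)中含建议分布比值的权值更新在此简化为按量测似然加权, 这些均为笔者的简化假设):

```python
import numpy as np

def sigma_points(x, P, kappa=1.0):
    """按式(26)~(27)生成 2n+1 个对称采样点及其权值."""
    n = x.size
    S = np.linalg.cholesky((n + kappa) * P)
    pts = np.vstack([x, x + S.T, x - S.T])
    w = np.full(2 * n + 1, 1.0 / (2 * n + 2 * kappa))
    w[0] = kappa / (n + kappa)
    return pts, w

def gauss_logpdf(z, mean, cov):
    """多元高斯对数密度, 用于似然加权."""
    d = np.atleast_1d(z - mean)
    cov = np.atleast_2d(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d)
                   + np.log(np.linalg.det(2 * np.pi * cov)))

def ukf_step(f, h, x, P, Q, R, z, kappa=1.0):
    """一步UKF (加性噪声形式), 对应内层更新式(5)~(14)."""
    pts, w = sigma_points(x, P, kappa)
    xp = np.array([f(p) for p in pts])           # 时间更新
    x_pred = w @ xp
    dx = xp - x_pred
    P_pred = (w[:, None] * dx).T @ dx + Q
    pts, w = sigma_points(x_pred, P_pred, kappa)  # 量测更新
    zp = np.array([h(p) for p in pts])
    z_pred = w @ zp
    dz = zp - z_pred
    dx = pts - x_pred
    Pzz = (w[:, None] * dz).T @ dz + R
    Pxz = (w[:, None] * dx).T @ dz
    K = Pxz @ np.linalg.inv(Pzz)
    return x_pred + K @ (z - z_pred), P_pred - K @ Pzz @ K.T

def dlukf_step(f, h, x, P, Q, R, z, kappa=1.0):
    """一步DLUKF: 内层UKF更新各采样点 -> 似然加权融合 -> 外层量测更新."""
    pts, w = sigma_points(x, P, kappa)                          # 步骤2)
    xs = np.array([ukf_step(f, h, p, P, Q, R, z, kappa)[0]
                   for p in pts])                               # 步骤3)
    # 步骤4): 权值更新与归一化 (式(15)~(16)简化为量测似然加权)
    logw = np.array([gauss_logpdf(z, h(xi), R) for xi in xs])
    w = w * np.exp(logw - logw.max())
    w = w / w.sum()
    x_I = w @ xs                                                # 步骤5), 式(17)~(18)
    d = xs - x_I
    P_I = (w[:, None] * d).T @ d + Q
    pts, wv = sigma_points(x_I, P_I, kappa)                     # 步骤6)
    zp = np.array([h(p) for p in pts])                          # 步骤7), 式(19)~(25)
    z_pred = wv @ zp
    dz = zp - z_pred
    dx = pts - x_I
    Pzz = (wv[:, None] * dz).T @ dz + R
    Pxz = (wv[:, None] * dx).T @ dz
    K = Pxz @ np.linalg.inv(Pzz)
    return x_I + K @ (z - z_pred), P_I - K @ Pzz @ K.T
```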

    将本文所提的基于对称采样策略的DLUKF算法与UKF算法、IUKF算法、RUEKF算法、RUCKF算法、高阶UKF算法、高阶CKF算法和UPF算法分别在一维和二维仿真场景下进行仿真对比分析, 用滤波算法估计值与真实值间的均方根误差(Root mean square error, RMSE)来衡量滤波算法的估计效果.

    假设有下述状态空间模型, 其状态方程和量测方程分别可以表示为:

    $ \begin{equation} {x_{k + 1}} = 0.5{x_k} + \sin (0.04\pi k) + 1 + {w_k} \end{equation} $

    (28)

    $ \begin{equation} {z_{k + 1}} = 0.2x_{k + 1}^2 + {v_{k + 1}} \end{equation} $

    (29)

    式(28)中${w_k}$表示过程噪声, 其服从$Ga(3, 2) $的伽马分布.式(29)中的${v_{k + 1}}$表示量测噪声, 其服从均值为0、方差为$R$的高斯分布.初始位置为${x_0} = 3$. IUKF算法、RUEKF算法和RUCKF算法的迭代次数都为10次. UPF算法粒子数量为100个, DLUKF算法产生粒子的方法是对称采样策略.仿真时间为30 s, 蒙特卡洛仿真次数为100次.其仿真结果如图 2所示.
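式(28)~(29)的仿真模型可以用如下Python代码生成数据(示意实现; $Ga(3, 2)$按形状参数3、尺度参数2理解, 量测噪声方差r_var为笔者假设的参数, 原文未给出具体数值):

```python
import numpy as np

def simulate(steps=30, x0=3.0, r_var=1.0, seed=0):
    """按式(28)~(29)生成一维状态序列与量测序列."""
    rng = np.random.default_rng(seed)
    xs, zs = [x0], []
    x = x0
    for k in range(steps):
        # 式(28): 状态转移, 过程噪声 w_k ~ Ga(3, 2) (形状3, 尺度2)
        x = 0.5 * x + np.sin(0.04 * np.pi * k) + 1 + rng.gamma(3.0, 2.0)
        # 式(29): 量测方程, 量测噪声方差 r_var 为假设值
        z = 0.2 * x ** 2 + rng.normal(0.0, np.sqrt(r_var))
        xs.append(x)
        zs.append(z)
    return np.array(xs), np.array(zs)
```

生成的轨迹可直接用于对比各滤波算法在该非线性、非高斯场景下的RMSE.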

    图 2  300次蒙特卡洛仿真的RMSE
    Fig. 2  The RMSE of 300 Monte Carlo simulations

    通过图 2可以看出, IUKF算法、RUEKF算法、RUCKF算法、高阶UKF算法、高阶CKF算法和UPF算法的滤波估计效果都略好于UKF算法, 这是因为这些算法都对UKF算法进行了改进.本文所提的DLUKF算法在每个时刻的估计效果都好于其他滤波算法, 说明DLUKF算法对UKF算法的改进效果比其他经典算法更加显著.这是因为DLUKF算法用两层UKF对状态进行估计, 从而获得了更好的滤波估计效果.

    将UPF算法的粒子数由100逐渐增加到500, 其与UKF算法、IUKF算法、RUEKF算法、RUCKF算法、高阶UKF算法、高阶CKF算法和DLUKF算法的单次运行时间以及平均RMSE如表 1所示.

    表 1  各算法计算时间及RMSE对比分析表
    Table 1  The calculation time and RMSE of each algorithm
    算法 运行时间(s) 平均RMSE
    UKF 0.0002 0.1566
    IUKF 0.0014 0.0881
    RUEKF 0.0006 0.0378
    RUCKF 0.0031 0.0337
    高阶UKF 0.0006 0.1434
    高阶CKF 0.0006 0.1437
    UPF (100) 0.1032 0.1153
    UPF (200) 0.2097 0.0714
    UPF (300) 0.3200 0.0626
    UPF (400) 0.4296 0.0564
    UPF (500) 0.5416 0.0476
    DLUKF 0.0016 0.0297

    通过表 1可以看出, UKF算法、RUEKF算法、高阶UKF算法和高阶CKF算法的用时都很少. IUKF算法、RUCKF算法和DLUKF算法的用时略长, 这是由于这三种算法都进行了多次滤波计算. UPF算法用时最长, 且随着粒子数目的增多, 用时逐渐增加.在RMSE方面, DLUKF算法比另外7种方法小很多.在UPF算法中, 随着粒子数增多, RMSE也逐渐变小, 但即使使用500个粒子点, UPF算法的RMSE (0.0476)仍明显高于DLUKF算法(0.0297).这说明基于带权值的采样点表征后验分布的方法优于用随机点表征后验分布的方法.

    考虑一个二维匀速直线运动的例子, 其状态方程和量测方程分别为:

    $ \begin{equation} {\mathit{\boldsymbol{{\rm{X}}}}_{k + 1}} = \mathit{\boldsymbol{F}}{\mathit{\boldsymbol{X}}_k} + {\mathit{\boldsymbol{w}}_k} \end{equation} $

    (30)

    $ \begin{equation} {\mathit{\boldsymbol{Z}}_{k + 1}} = \mathit{\boldsymbol{h}}({\mathit{\boldsymbol{X}}_{k + 1}}) + {\mathit{\boldsymbol{v}}_{k + 1}} \end{equation} $

    (31)

    式(30)中, ${{\mathit{\boldsymbol{X}}_k} = [{x_k}, {\dot x_k}, {y_k}, {\dot y_k}]^{\rm T}}$是状态变量, 分别表示$x$轴和$y$轴方向的位置和速度. ${\mathit{\boldsymbol{w}}_k}$为过程噪声, 其服从均值为0, 方差为$\mathit{\boldsymbol{Q}}$的高斯分布.其中, $\mathit{\boldsymbol{F}}$和$\mathit{\boldsymbol{Q}}$分别可以表示为

    $ \begin{equation} \mathit{\boldsymbol{F}} = \left[ {\begin{array}{*{20}{c}} 1&T&0&0\\ 0&1&0&0\\ 0&0&1&T\\ 0&0&0&1 \end{array}} \right] \end{equation} $

    (32)

    $ \begin{equation} \mathit{\boldsymbol{Q}} = {q^2}\left[ {\begin{array}{*{20}{c}} {\frac{T^3}{3}}&{\frac{T^2}{2}}&0&0\\ {\frac{T^2}{2}}&T&0&0\\ 0&0&{\frac{T^3}{3}}&{\frac{T^2}{2}}\\ 0&0&{\frac{T^2}{2}}&T \end{array}} \right] \end{equation} $

    (33)

    式(31)中, ${\mathit{\boldsymbol{Z}}_{k + 1}} = {[{r_{k + 1}}, {\theta _{k + 1}}]^{\rm T}}$为观测变量, 分别表示对目标的径向距和方位角. ${\mathit{\boldsymbol{v}}_{k + 1}}$为量测噪声, 其为闪烁噪声, 可以表示为:

    $ \begin{align} p({\mathit{\boldsymbol{v}}_{k + 1}}) = &(1 - \varepsilon ){p_1}({\mathit{\boldsymbol{v}}_{k + 1}}) + \varepsilon {p_2}({\mathit{\boldsymbol{v}}_{k + 1}}) = \nonumber\\& (1 - \varepsilon )N({\mathit{\boldsymbol{v}}_{k + 1}};0, {\mathit{\boldsymbol{R}}_1}) + \varepsilon N({\mathit{\boldsymbol{v}}_{k + 1}};0, {\mathit{\boldsymbol{R}}_2}) \end{align} $

    (34)

    量测方程$\mathit{\boldsymbol{h}}(\cdot) $可以表示为:

    $ \begin{equation} \mathit{\boldsymbol{h}}({\mathit{\boldsymbol{X}}_{k + 1}}) = {\left[ {\begin{array}{*{20}{c}} {\sqrt {x_{k + 1}^2 + y_{k + 1}^2} }&{\arctan (\frac{{{y_{k + 1}}}}{{{x_{k + 1}}}})} \end{array}} \right]^{\rm{T}}} \end{equation} $

    (35)

    式(34)中, ${\mathit{\boldsymbol{R}}_1}$和${\mathit{\boldsymbol{R}}_2}$分别可以表示为

    $ \begin{equation} {\mathit{\boldsymbol{R}}_1} = \left[ {\begin{array}{*{20}{c}} {\sigma _{1r}^2}&0\\ 0&{\sigma _{1\varepsilon }^2} \end{array}} \right] \end{equation} $

    (36)

    $ \begin{equation} {\mathit{\boldsymbol{R}}_2} = \left[ {\begin{array}{*{20}{c}} {\sigma _{2r}^2}&0\\ 0&{\sigma _{2\varepsilon }^2} \end{array}} \right] \end{equation} $

    (37)

    仿真中, 仿真时间为100 s, 蒙特卡洛仿真次数为300次.目标初始位置为(20 000 m, 40 000 m), 初始速度为(-160 m/s, -150 m/s). IUKF算法、RUEKF算法和RUCKF算法的迭代次数都为10次. UPF算法粒子数量为300个, DLUKF算法产生粒子的方法是对称采样策略.

    其他参数设置如表 2所示:

    表 2  仿真参数设置
    Table 2  The Simulation parameters
    参数 $T$ $q$ ${\sigma _{1r}}$ ${\sigma _{1\varepsilon }}$ ${\sigma _{2r}}$ ${\sigma _{2\varepsilon }}$ $\varepsilon $
    数值 1 1 20 m 0.2$^{\circ}$ 200 m 0.2$^{\circ}$ 0.1

    位置的RMSE可以表示为${\rm RMSE} = \sqrt {{\rm RMSE}_x^2 + {\rm RMSE}_y^2} $.进行仿真分析, 其结果如图 3所示.

    图 3  位置的RMSE
    Fig. 3  The RMSE of position

    图 3给出了各个算法在位置方面的RMSE.可以看出, RUEKF算法、RUCKF算法、高阶UKF算法、高阶CKF算法和UKF算法的估计效果基本相同, 而IUKF算法和UPF算法的估计效果优于UKF算法.本文算法的性能最好, 这是因为本文算法用带权值的采样点表征后验分布, 比用随机粒子点表征后验分布更有优势, 故DLUKF算法的RMSE小于其他滤波算法.这也说明所提的DLUKF算法对匀速直线运动有着很好的滤波估计效果.

    在匀速直线运动中, 将UPF算法的粒子数由300逐渐增加到1 000, 其与其他算法的单次运行时间以及位置和速度的平均RMSE如表 3所示.

    表 3  各个算法的性能
    Table 3  The performance of each algorithm
    算法 运行时间(s) 平均RMSE
    UKF 0.0059 99.8709
    IUKF 0.0424 85.0107
    RUEKF 0.0150 100.2616
    RUCKF 0.0397 99.8704
    高阶UKF 0.0193 100.4763
    高阶CKF 0.0191 99.7558
    UPF (300) 3.5953 88.2638
    UPF (400) 4.8406 86.5004
    UPF (500) 6.0552 85.8206
    UPF (600) 7.2596 85.1056
    UPF (700) 8.4211 84.6700
    UPF (800) 9.6178 83.2706
    UPF (900) 10.8389 82.9057
    UPF (1 000) 12.0105 82.4258
    DLUKF 0.0757 78.5559

    由表 3可以看出, 本文算法的运算时间虽然略长于UKF算法、IUKF算法、RUEKF算法、RUCKF算法、高阶UKF算法和高阶CKF算法, 却远远小于UPF算法.由于DLUKF算法的外层UKF算法选取了9个确定性采样点, 其运算时间大约是UKF算法的9倍.在UPF算法中, 随着粒子数目的增多, 运算时间也逐渐增加.在RMSE方面, DLUKF算法是最好的.在UPF算法中, 随着粒子数目的增多, RMSE逐渐减小, 但比起DLUKF算法, UPF算法的RMSE依然很大.这说明基于双层采样的DLUKF算法在多维目标跟踪中有着很好的滤波估计效果.

    本文所提的DLUKF算法是在双层UKF算法的基础上, 用采样策略选取带权值的采样点, 而后用内层UKF算法对每个采样点进行更新, 同时用最新的量测对采样点的权值进行更新, 最后通过外层UKF算法的更新机制得到每个时刻的滤波估计值.仿真结果表明, 在一维和二维的仿真场景中, 相比于现有经典算法, 本文所提的DLUKF算法可以在较短的时间内获得很好的滤波估计效果.

  • 图  1  基于表征学习的离线强化学习总体框架

    Fig.  1  The overall framework of offline reinforcement learning based on representation learning

    图  2  基于动作表征的离线强化学习框架

    Fig.  2  The framework of offline reinforcement learning based on action representation

    图  3  基于状态表征的离线强化学习框架

    Fig.  3  The framework of offline reinforcement learning based on state representation

    图  4  基于状态−动作对表征的离线强化学习框架

    Fig.  4  The framework of offline reinforcement learning based on state-action pairs representation

    图  5  基于轨迹表征的离线强化学习框架

    Fig.  5  The framework of offline reinforcement learning based on trajectory representation

    图  6  基于任务(环境)表征的离线强化学习框架

    Fig.  6  The framework of offline reinforcement learning based on task (environment) representation

    表  1  基于表征学习的离线强化学习方法对比

    Table  1  Comparison of offline reinforcement learning based on representation learning

    表征对象 参考文献 表征网络架构 环境建模方式 应用场景 特点 缺点
    动作表征 [15−21] VAE 无模型 机器人控制、导航 状态条件下生成动作, 将目标策略限制在行为策略范围内, 缓解分布偏移 不适用于离散动作空间
     [22−23] 流模型
     [24−25] 扩散模型
    状态表征 [26−27] VAE 无模型 基于视觉的机器人控制 压缩高维观测状态, 减少冗余信息, 提高泛化能力 限定于图像(像素)输入
     [28] VAE 基于模型
     [29] GAN 基于模型
     [30] 编码器架构 基于模型
     [31−32] 编码器架构 无模型
    状态−动作对表征 [33] 自编码器 基于模型 基于视觉的机器人控制、游戏、自动驾驶 学习状态−动作联合表征, 捕捉两者交互关系, 指导后续决策任务 限定于图像(像素)输入
     [34] VAE 基于模型
     [35−36] 编码器架构 无模型
     [37−38] 编码器架构 基于模型
    轨迹表征 [39−44] Transformer 序列模型 机器人控制、导航、游戏 将强化学习视为条件序列建模问题, 用于预测未来轨迹序列 轨迹生成速度慢, 调优成本高
     [45−47] 扩散模型
    任务表征 [48−49] 编码器架构 无模型 机器人控制、导航 借助元学习思想, 使智能体快速适应新任务 泛化能力依赖于任务或环境之间的相似性
    环境表征 [50−51] 编码器架构 基于模型

    表  2  离线强化学习基准数据集对比

    Table  2  Comparison of benchmarking datasets for offline reinforcement learning

    名称 领域 应用领域 数据集特性
    RL Unplugged DeepMind控制套件 机器人连续控制 连续域, 探索难度由易到难
    DeepMind运动套件 模拟啮齿动物的运动 连续域, 探索难度大
    Atari 2600 视频游戏 离散域, 探索难度适中
    真实世界强化学习套件 机器人连续控制 连续域, 探索难度由易到难
    D4RL Maze2D 导航 非马尔科夫策略, 不定向与多任务数据
    MiniGrid-FourRooms 导航, Maze2D的离散模拟 非马尔科夫策略, 不定向与多任务数据
    AntMaze 导航 非马尔科夫策略, 稀疏奖励, 不定向与多任务数据
    Gym-MuJoCo 机器人连续控制 次优数据, 狭窄数据分布
    Adroit 机器人操作 非表示性策略, 狭窄数据分布, 稀疏奖励, 现实领域
    Flow 交通流量控制管理 非表示性策略, 现实领域
    FrankaKitchen 厨房机器人操作 不定向与多任务数据, 现实领域
    CARLA 自动驾驶车道跟踪与导航 部分可观测性, 非表示性策略, 不定向与多任务数据, 现实领域
    NeoRL Gym-MuJoCo 机器人连续控制 保守且数据量有限
    工业基准 工业控制任务 高维连续状态和动作空间, 高随机性
    FinRL 股票交易市场 高维连续状态和动作空间, 高随机性
    CityLearn 不同类型建筑的储能控制 高维连续状态和动作空间, 高随机性
    SalesPromotion 商品促销 由人工操作员与真实用户提供的数据

    表  3  基于表征学习的离线强化学习应用综述

    Table  3  Summarization of the applications for offline reinforcement learning based on representation learning

    应用领域 文献 表征对象 表征网络架构 环境建模方式 所解决的实际问题 策略学习方法
    工业 [68] 任务表征 编码器架构 无模型 工业连接器插入 从离线数据中元学习自适应策略
     [104] 任务表征 编码器架构 无模型 工业连接器插入 利用域对抗神经网络的域不变性和变分信息瓶颈的域特定信息流控制来实现策略泛化
     [67] 轨迹表征 Transformer 序列模型 工业芯片布局 采用因果自注意力掩码并通过自回归输入标记来预测动作
    推荐系统 [57] 动作表征 VAE 基于模型 快速适应冷启动用户 利用逆强化学习从少量交互中恢复出用户策略与奖励
     [60] 状态表征 编码器架构 基于模型 数据稀疏性 利用群体偏好注入的因果用户模型训练策略
     [61] 状态表征 编码器架构 无模型 离线交互推荐 利用保守的Q函数来估计策略
    智能驾驶 [58] 动作表征 VAE 无模型 交叉口生态驾驶控制 利用VAE生成动作
     [69] 环境表征 VAE 基于模型 长视域任务 利用VAE生成动作
    医疗 [63] 状态−动作对表征 编码器架构 基于模型 个性化诊断 使用在线模型预测控制方法选择策略
    能源管理 [59] 动作表征 VAE 无模型 油电混动汽车能源利用效率 利用VAE生成动作
    量化交易 [70] 环境表征 编码器架构 无模型 最优交易执行的过拟合问题 利用时序差分误差或策略梯度法来学习策略
  • [1] Sutton R S, Barto A G. Reinforcement Learning: An Introduction (Second edition). Cambridge: The MIT Press, 2018.
    [2] 孙悦雯, 柳文章, 孙长银. 基于因果建模的强化学习控制: 现状及展望. 自动化学报, 2023, 49(3): 661−677

    Sun Yue-Wen, Liu Wen-Zhang, Sun Chang-Yin. Causality in reinforcement learning control: The state of the art and prospects. Acta Automatica Sinica, 2023, 49(3): 661−677
    [3] Silver D, Huang A, Maddison C J, Guez A, Sifre L, van den Driessche G, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484−489 doi: 10.1038/nature16961
    [4] Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 2020, 588(7839): 604−609 doi: 10.1038/s41586-020-03051-4
    [5] Senior A W, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, et al. Improved protein structure prediction using potentials from deep learning. Nature, 2020, 577(7792): 706−710 doi: 10.1038/s41586-019-1923-7
    [6] Li Y J, Choi D, Chung J, Kushman N, Schrittwieser J, Leblond R, et al. Competition-level code generation with AlphaCode. Science, 2022, 378(6624): 1092−1097 doi: 10.1126/science.abq1158
    [7] Degrave J, Felici F, Buchli J, Neunert M, Tracey B, Carpanese F, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 2022, 602(7897): 414−419 doi: 10.1038/s41586-021-04301-9
    [8] Fawzi A, Balog M, Huang A, Hubert T, Romera-Paredes B, Barekatain M, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 2022, 610(7930): 47−53 doi: 10.1038/s41586-022-05172-4
    [9] Fang X, Zhang Q C, Gao Y F, Zhao D B. Offline reinforcement learning for autonomous driving with real world driving data. In: Proceedings of the 25th IEEE International Conference on Intelligent Transportation Systems (ITSC). Macao, China: IEEE, 2022. 3417−3422
    [10] 刘健, 顾扬, 程玉虎, 王雪松. 基于多智能体强化学习的乳腺癌致病基因预测. 自动化学报, 2022, 48(5): 1246−1258

    Liu Jian, Gu Yang, Cheng Yu-Hu, Wang Xue-Song. Prediction of breast cancer pathogenic genes based on multi-agent reinforcement learning. Acta Automatica Sinica, 2022, 48(5): 1246−1258
    [11] Levine S, Kumar A, Tucker G, Fu J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv: 2005.01643, 2020.
    [12] Prudencio R F, Maximo M R O A, Colombini E L. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2023.3250269
    [13] 程玉虎, 黄龙阳, 侯棣元, 张佳志, 陈俊龙, 王雪松. 广义行为正则化离线Actor-Critic. 计算机学报, 2023, 46(4): 843−855 doi: 10.11897/SP.J.1016.2023.00843

    Cheng Yu-Hu, Huang Long-Yang, Hou Di-Yuan, Zhang Jia-Zhi, Chen Jun-Long, Wang Xue-Song. Generalized offline actor-critic with behavior regularization. Chinese Journal of Computers, 2023, 46(4): 843−855 doi: 10.11897/SP.J.1016.2023.00843
    [14] 顾扬, 程玉虎, 王雪松. 基于优先采样模型的离线强化学习. 自动化学报, 2024, 50(1): 143−153

    Gu Yang, Cheng Yu-Hu, Wang Xue-Song. Offline reinforcement learning based on prioritized sampling model. Acta Automatica Sinica, 2024, 50(1): 143−153
    [15] Fujimoto S, Meger D, Precup D. Off-policy deep reinforcement learning without exploration. In: Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: PMLR, 2019. 2052−2062
    [16] He Q, Hou X W, Liu Y. POPO: Pessimistic offline policy optimization. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore: IEEE, 2022. 4008−4012
    [17] Wu J L, Wu H X, Qiu Z H, Wang J M, Long M S. Supported policy optimization for offline reinforcement learning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. Article No. 2268
    [18] Lyu J F, Ma X T, Li X, Lu Z Q. Mildly conservative Q-learning for offline reinforcement learning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. Article No. 125
    [19] Rezaeifar S, Dadashi R, Vieillard N, Hussenot L, Bachem O, Pietquin O, et al. Offline reinforcement learning as anti-exploration. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. Virtual Event: AAAI Press, 2022. 8106−8114
    [20] Zhou W X, Bajracharya S, Held D. PLAS: Latent action space for offline reinforcement learning. In: Proceedings of the 4th Conference on Robot Learning. Cambridge, USA: PMLR, 2020. 1719−1735
    [21] Chen X, Ghadirzadeh A, Yu T H, Wang J H, Gao A, Li W Z, et al. LAPO: Latent-variable advantage-weighted policy optimization for offline reinforcement learning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. Article No. 2674
    [22] Akimov D, Kurenkov V, Nikulin A, Tarasov D, Kolesnikov S. Let offline RL flow: Training conservative agents in the latent space of normalizing flows. In: Proceedings of Offline Reinforcement Learning Workshop at Neural Information Processing Systems. New Orleans, USA: OpenReview.net, 2022.
    [23] Yang Y Q, Hu H, Li W Z, Li S Y, Yang J, Zhao Q C, et al. Flow to control: Offline reinforcement learning with lossless primitive discovery. In: Proceedings of the 37th AAAI Conference on Artificial Intelligence. Washington, USA: AAAI Press, 2023. 10843−10851
    [24] Wang Z D, Hunt J J, Zhou M Y. Diffusion policies as an expressive policy class for offline reinforcement learning. In: Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net, 2023.
    [25] Chen H Y, Lu C, Ying C Y, Su H, Zhu J. Offline reinforcement learning via high-fidelity generative behavior modeling. In: Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net, 2023.
    [26] Zhang H C, Shao J Z, Jiang Y H, He S C, Zhang G W, Ji X Y. State deviation correction for offline reinforcement learning. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. Virtual Event: AAAI Press, 2022. 9022−9030
    [27] Weissenbacher M, Sinha S, Garg A, Kawahara Y. Koopman Q-learning: Offline reinforcement learning via symmetries of dynamics. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR, 2022. 23645−23667
    [28] Rafailov R, Yu T H, Rajeswaran A, Finn C. Offline reinforcement learning from images with latent space models. In: Proceedings of the 3rd Annual Conference on Learning for Dynamics and Control. Zurich, Switzerland: PMLR, 2021. 1154−1168
    [29] Cho D, Shim D, Kim H J. S2P: State-conditioned image synthesis for data augmentation in offline reinforcement learning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. Article No. 838
    [30] Gieselmann R, Pokorny F T. An expansive latent planner for long-horizon visual offline reinforcement learning. In: Proceedings of the RSS 2023 Workshop on Learning for Task and Motion Planning. Daegu, South Korea: OpenReview.net, 2023.
    [31] Zang H Y, Li X, Yu J, Liu C, Islam R, Combes R T D, et al. Behavior prior representation learning for offline reinforcement learning. In: Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net, 2023.
    [32] Mazoure B, Kostrikov I, Nachum O, Tompson J. Improving zero-shot generalization in offline reinforcement learning using generalized similarity functions. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. Article No. 1819
    [33] Kim B, Oh M H. Model-based offline reinforcement learning with count-based conservatism. In: Proceedings of the 40th International Conference on Machine Learning. Honolulu, USA: PMLR, 2023. 16728−16746
    [34] Tennenholtz G, Mannor S. Uncertainty estimation using Riemannian model dynamics for offline reinforcement learning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. Article No. 1381
    [35] Ada S E, Oztop E, Ugur E. Diffusion policies for out-of-distribution generalization in offline reinforcement learning. IEEE Robotics and Automation Letters, 2024, 9(4): 3116−3123 doi: 10.1109/LRA.2024.3363530
    [36] Kumar A, Agarwal R, Ma T Y, Courville A C, Tucker G, Levine S. DR3: Value-based deep reinforcement learning requires explicit regularization. In: Proceedings of the 10th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2022.
    [37] Lee B J, Lee J, Kim K E. Representation balancing offline model-based reinforcement learning. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
    [38] Chang J D, Wang K W, Kallus N, Sun W. Learning Bellman complete representations for offline policy evaluation. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR, 2022. 2938−2971
    [39] Chen L L, Lu K, Rajeswaran A, Lee K, Grover A, Laskin M, et al. Decision transformer: Reinforcement learning via sequence modeling. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates, Inc., 2021. 15084−15097
    [40] Janner M, Li Q Y, Levine S. Offline reinforcement learning as one big sequence modeling problem. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates, Inc., 2021. 1273−1286
    [41] Furuta H, Matsuo Y, Gu S S. Generalized decision transformer for offline hindsight information matching. In: Proceedings of the 10th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2022.
    [42] Liu Z X, Guo Z J, Yao Y H, Cen Z P, Yu W H, Zhang T N, et al. Constrained decision transformer for offline safe reinforcement learning. In: Proceedings of the 40th International Conference on Machine Learning. Honolulu, USA: PMLR, 2023. Article No. 893
    [43] Wang Y Q, Xu M D, Shi L X, Chi Y J. A trajectory is worth three sentences: Multimodal transformer for offline reinforcement learning. In: Proceedings of the 39th Conference on Uncertainty in Artificial Intelligence. Pittsburgh, USA: PMLR, 2023. Article No. 208
    [44] Zeng Z L, Zhang C, Wang S J, Sun C. Goal-conditioned predictive coding for offline reinforcement learning. arXiv preprint arXiv: 2307.03406, 2023.
    [45] Janner M, Du Y L, Tenenbaum J B, Levine S. Planning with diffusion for flexible behavior synthesis. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR, 2022. 9902−9915
    [46] Ajay A, Du Y L, Gupta A, Tenenbaum J B, Jaakkola T S, Agrawal P. Is conditional generative modeling all you need for decision making? In: Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: OpenReview.net, 2023.
    [47] Liang Z X, Mu Y, Ding M Y, Ni F, Tomizuka M, Luo P. AdaptDiffuser: Diffusion models as adaptive self-evolving planners. In: Proceedings of the 40th International Conference on Machine Learning. Honolulu, USA: PMLR, 2023. Article No. 854
    [48] Yuan H Q, Lu Z Q. Robust task representations for offline meta-reinforcement learning via contrastive learning. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR, 2022. 25747−25759
    [49] Zhao C Y, Zhou Z H, Liu B. On context distribution shift in task representation learning for online meta RL. In: Proceedings of the 19th Advanced Intelligent Computing Technology and Applications. Zhengzhou, China: Springer, 2023. 614−628
    [50] Chen X H, Yu Y, Li Q Y, Luo F M, Qin Z W, Shang W J, et al. Offline model-based adaptable policy learning. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates, Inc., 2021. 8432−8443
    [51] Sang T, Tang H Y, Ma Y, Hao J Y, Zheng Y, Meng Z P, et al. PAnDR: Fast adaptation to new environments from offline experiences via decoupling policy and environment representations. In: Proceedings of the 31st International Joint Conference on Artificial Intelligence. Vienna, Austria: IJCAI, 2022. 3416−3422
    [52] Lou X Z, Yin Q Y, Zhang J G, Yu C, He Z F, Cheng N J, et al. Offline reinforcement learning with representations for actions. Information Sciences, 2022, 610: 746−758 doi: 10.1016/j.ins.2022.08.019
    [53] Kingma D P, Welling M. Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations. Banff, Canada: ICLR, 2014.
    [54] Mark M S, Ghadirzadeh A, Chen X, Finn C. Fine-tuning offline policies with optimistic action selection. In: Proceedings of NeurIPS Workshop on Deep Reinforcement Learning. Virtual Event: OpenReview.net, 2022.
    [55] 张博玮, 郑建飞, 胡昌华, 裴洪, 董青. 基于流模型的缺失数据生成方法在剩余寿命预测中的应用. 自动化学报, 2023, 49(1): 185−196

    Zhang Bo-Wei, Zheng Jian-Fei, Hu Chang-Hua, Pei Hong, Dong Qing. Missing data generation method based on flow model and its application in remaining life prediction. Acta Automatica Sinica, 2023, 49(1): 185−196
    [56] Yang L, Zhang Z L, Song Y, Hong S D, Xu R S, Zhao Y, et al. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 2023, 56(4): Article No. 105
    [57] Wang Y N, Ge Y, Li L, Chen R, Xu T. Offline meta-level model-based reinforcement learning approach for cold-start recommendation. arXiv preprint arXiv: 2012.02476, 2020.
    [58] 张健, 姜夏, 史晓宇, 程健, 郑岳标. 基于离线强化学习的交叉口生态驾驶控制. 东南大学学报(自然科学版), 2022, 52(4): 762−769 doi: 10.3969/j.issn.1001-0505.2022.04.018

    Zhang Jian, Jiang Xia, Shi Xiao-Yu, Cheng Jian, Zheng Yue-Biao. Offline reinforcement learning for eco-driving control at signalized intersections. Journal of Southeast University (Natural Science Edition), 2022, 52(4): 762−769 doi: 10.3969/j.issn.1001-0505.2022.04.018
    [59] He H W, Niu Z G, Wang Y, Huang R C, Shou Y W. Energy management optimization for connected hybrid electric vehicle using offline reinforcement learning. Journal of Energy Storage, 2023, 72: Article No. 108517 doi: 10.1016/j.est.2023.108517
    [60] Nie W Z, Wen X, Liu J, Chen J W, Wu J C, Jin G Q, et al. Knowledge-enhanced causal reinforcement learning model for interactive recommendation. IEEE Transactions on Multimedia, 2024, 26: 1129−1142 doi: 10.1109/TMM.2023.3276505
    [61] Zhang R Y, Yu T, Shen Y L, Jin H Z. Text-based interactive recommendation via offline reinforcement learning. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. Virtual Event: AAAI Press, 2022. 11694−11702
    [62] Rigter M, Lacerda B, Hawes N. RAMBO-RL: Robust adversarial model-based offline reinforcement learning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. 16082−16097
    [63] Agarwal A, Alomar A, Alumootil V, Shah D, Shen D, Xu Z, et al. PerSim: Data-efficient offline reinforcement learning with heterogeneous agents via personalized simulators. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates, Inc., 2021. 18564−18576
    [64] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc., 2017. 6000−6010
    [65] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, et al. An image is worth 16×16 words: Transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2021.
    [66] 王雪松, 王荣荣, 程玉虎. 安全强化学习综述. 自动化学报, 2023, 49(9): 1813−1835

    Wang Xue-Song, Wang Rong-Rong, Cheng Yu-Hu. Safe reinforcement learning: A survey. Acta Automatica Sinica, 2023, 49(9): 1813−1835
    [67] Lai Y, Liu J X, Tang Z T, Wang B, Hao J Y, Luo P. ChiPFormer: Transferable chip placement via offline decision transformer. In: Proceedings of the 40th International Conference on Machine Learning. Honolulu, USA: PMLR, 2023. 18346−18364
    [68] Zhao T Z, Luo J L, Sushkov O, Pevceviciute R, Heess N, Scholz J, et al. Offline meta-reinforcement learning for industrial insertion. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Philadelphia, USA: IEEE, 2022. 6386−6393
    [69] Li Z N, Nie F, Sun Q, Da F, Zhao H. Boosting offline reinforcement learning for autonomous driving with hierarchical latent skills. arXiv preprint arXiv: 2309.13614, 2023.
    [70] Zhang C H, Duan Y T, Chen X Y, Chen J Y, Li J, Zhao L. Towards generalizable reinforcement learning for trade execution. In: Proceedings of the 32nd International Joint Conference on Artificial Intelligence. Macao, China: IJCAI, 2023. Article No. 553
    [71] Gulcehre C, Wang Z Y, Novikov A, Le Paine T, Colmenarejo S G, Żołna K, et al. RL unplugged: A suite of benchmarks for offline reinforcement learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. Article No. 608
    [72] Fu J, Kumar A, Nachum O, Tucker G, Levine S. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv: 2004.07219, 2020.
    [73] Qin R J, Zhang X Y, Gao S Y, Chen X H, Li Z W, Zhang W N, et al. NeoRL: A near real-world benchmark for offline reinforcement learning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. Article No. 1795
    [74] Song H F, Abdolmaleki A, Springenberg J T, Clark A, Soyer H, Rae J W, et al. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, 2020.
    [75] Merel J, Hasenclever L, Galashov A, Ahuja A, Pham V, Wayne G, et al. Neural probabilistic motor primitives for humanoid control. In: Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: OpenReview.net, 2019.
    [76] Merel J, Aldarondo D, Marshall J, Tassa Y, Wayne G, Olveczky B. Deep neuroethology of a virtual rodent. In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, 2020.
    [77] Machado M C, Bellemare M G, Talvitie E, Veness J, Hausknecht M, Bowling M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 2018, 61: 523−562 doi: 10.1613/jair.5699
    [78] Dulac-Arnold G, Levine N, Mankowitz D J, Li J, Paduraru C, Gowal S, et al. An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint arXiv: 2003.11881, 2020.
    [79] Abdolmaleki A, Springenberg J T, Tassa Y, Munos R, Heess N, Riedmiller M A. Maximum a posteriori policy optimisation. In: Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: OpenReview.net, 2018.
    [80] Pomerleau D A. ALVINN: An autonomous land vehicle in a neural network. In: Proceedings of the 1st International Conference on Neural Information Processing Systems. Denver, USA: MIT Press, 1988. 305−313
    [81] Mnih V, Kavukcuoglu K, Silver D, Rusu A A, Veness J, Bellemare M G, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529−533 doi: 10.1038/nature14236
    [82] Barth-Maron G, Hoffman M W, Budden D, Dabney W, Horgan D, Dhruva T B, et al. Distributed distributional deterministic policy gradients. In: Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: OpenReview.net, 2018.
    [83] Dabney W, Ostrovski G, Silver D, Munos R. Implicit quantile networks for distributional reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018. 1104−1113
    [84] Wu Y F, Tucker G, Nachum O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv: 1911.11361, 2019.
    [85] Siegel N, Springenberg J T, Berkenkamp F, Abdolmaleki A, Neunert M, Lampe T, et al. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, 2020.
    [86] Agarwal A, Schuurmans D, Norouzi M. An optimistic perspective on offline reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning. Virtual Event: PMLR, 2020. 104−114
    [87] Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018. 1856−1865
    [88] Kumar A, Fu J, Soh M, Tucker G, Levine S. Stabilizing off-policy Q-learning via bootstrapping error reduction. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates, Inc., 2019. 11761−11771
    [89] Peng X B, Kumar A, Zhang G, Levine S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv: 1910.00177, 2019.
    [90] Kumar A, Zhou A, Tucker G, Levine S. Conservative Q-learning for offline reinforcement learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. Article No. 100
    [91] Nachum O, Dai B, Kostrikov I, Chow Y, Li L H, Schuurmans D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv: 1912.02074, 2019.
    [92] Wang Z Y, Novikov A, Żołna K, Springenberg J T, Reed S, Shahriari B, et al. Critic regularized regression. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. Article No. 651
    [93] Matsushima T, Furuta H, Matsuo Y, Nachum O, Gu S X. Deployment-efficient reinforcement learning via model-based offline optimization. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
    [94] Yu T H, Thomas G, Yu L T, Ermon S, Zou J, Levine S, et al. MOPO: Model-based offline policy optimization. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. Article No. 1185
    [95] Le H M, Voloshin C, Yue Y S. Batch policy learning under constraints. In: Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: PMLR, 2019. 3703−3712
    [96] Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge: MIT Press, 2009.
    [97] 王硕汝, 牛温佳, 童恩栋, 陈彤, 李赫, 田蕴哲, 等. 强化学习离线策略评估研究综述. 计算机学报, 2022, 45(9): 1926−1945 doi: 10.11897/SP.J.1016.2022.01926

    Wang Shuo-Ru, Niu Wen-Jia, Tong En-Dong, Chen Tong, Li He, Tian Yun-Zhe, et al. Research on off-policy evaluation in reinforcement learning: A survey. Chinese Journal of Computers, 2022, 45(9): 1926−1945 doi: 10.11897/SP.J.1016.2022.01926
    [98] Fu J, Norouzi M, Nachum O, Tucker G, Wang Z Y, Novikov A, et al. Benchmarks for deep off-policy evaluation. In: Proceedings of the 9th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2021.
    [99] Schweighofer K, Dinu M, Radler A, Hofmarcher M, Patil V P, Bitto-Nemling A, et al. A dataset perspective on offline reinforcement learning. In: Proceedings of the 1st Conference on Lifelong Learning Agents. McGill University, Canada: PMLR, 2022. 470−517
    [100] Konyushkova K, Chen Y T, Paine T, Gülçehre C, Paduraru C, Mankowitz D J, et al. Active offline policy selection. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates, Inc., 2021. 24631−24644
    [101] Kurenkov V, Kolesnikov S. Showing your offline reinforcement learning work: Online evaluation budget matters. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR, 2022. 11729−11752
    [102] Lu C, Ball P J, Parker-Holder J, Osborne M A, Roberts S J. Revisiting design choices in offline model based reinforcement learning. In: Proceedings of the 10th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2022.
    [103] Hu H, Yang Y Q, Zhao Q C, Zhang C J. On the role of discount factor in offline reinforcement learning. In: Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR, 2022. 9072−9098
    [104] Nair A, Zhu B, Narayanan G, Solowjow E, Levine S. Learning on the job: Self-rewarding offline-to-online finetuning for industrial insertion of novel connectors from vision. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). London, United Kingdom: IEEE, 2023. 7154−7161
    [105] Kostrikov I, Nair A, Levine S. Offline reinforcement learning with implicit Q-learning. In: Proceedings of the 10th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2022.

Figures (6) / Tables (3)
Publication history
  • Received:  2023-09-04
  • Accepted:  2023-11-09
  • Available online:  2024-04-30
  • Issue date:  2024-06-27
