
数据驱动自适应评判控制研究进展

王鼎 赵明明 刘德荣 乔俊飞 宋世杰

李凤岐, 金佳玉, 杜学峰, 张鑫, 徐凤强, 王德广. 基于域对抗自适应学习的旋翼无人机姿态稳定方法. 自动化学报, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240186
引用本文: 王鼎, 赵明明, 刘德荣, 乔俊飞, 宋世杰. 数据驱动自适应评判控制研究进展. 自动化学报, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240706
Li Feng-Qi, Jin Jia-Yu, Du Xue-Feng, Zhang Xin, Xu Feng-Qiang, Wang De-Guang. Domain adversarial adaptive learning based attitude stabilization method for rotary wing unmanned aerial vehicles. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240186
Citation: Wang Ding, Zhao Ming-Ming, Liu De-Rong, Qiao Jun-Fei, Song Shi-Jie. Advances in data-driven adaptive critic control. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240706


doi: 10.16383/j.aas.c240706 cstr: 32138.14.j.aas.c240706
基金项目: 国家自然科学基金(62222301, 62473012, 62021003), 国家科技重大专项(2021ZD0112302, 2021ZD0112301), 北京市自然科学基金(JQ19013)资助
    作者简介:

    王鼎:北京工业大学信息科学技术学院教授. 2009年获得东北大学硕士学位, 2012年获得中国科学院自动化研究所博士学位. 主要研究方向为强化学习与智能控制. 本文通信作者. E-mail: dingwang@bjut.edu.cn

    赵明明:北京工业大学信息科学技术学院博士研究生. 主要研究方向为强化学习和智能控制. E-mail: zhaomm@emails.bjut.edu.cn

    刘德荣:南方科技大学自动化与智能制造学院教授. 主要研究方向为强化学习和智能控制. E-mail: liudr@sustech.edu.cn

    乔俊飞:北京工业大学信息科学技术学院教授. 主要研究方向为污水处理过程智能控制和神经网络结构设计与优化. E-mail: adqiao@bjut.edu.cn

    宋世杰:西南交通大学智慧城市与交通学院讲师. 主要研究方向为强化学习和智能控制. E-mail: shijie.song@swjtu.edu.cn

Advances in Data-Driven Adaptive Critic Control

Funds: Supported by National Natural Science Foundation of China (62222301, 62473012, 62021003), National Science and Technology Major Project (2021ZD0112302, 2021ZD0112301), and Beijing Natural Science Foundation (JQ19013)
    Author Bio:

    WANG Ding Professor at the School of Information Science and Technology, Beijing University of Technology. He received his master's degree from Northeastern University in 2009 and Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2012. His research interest covers reinforcement learning and intelligent control. Corresponding author of this paper

    ZHAO Ming-Ming Ph.D. candidate at the School of Information Science and Technology, Beijing University of Technology. His research interest covers reinforcement learning and intelligent control

    LIU De-Rong Professor at the School of Automation and Intelligent Manufacturing, Southern University of Science and Technology. His research interest covers reinforcement learning and intelligent control

    QIAO Jun-Fei Professor at the School of Information Science and Technology, Beijing University of Technology. His research interest covers intelligent control of wastewater treatment processes as well as structure design and optimization of neural networks

    SONG Shi-Jie Lecturer at the Institute of Smart City and Intelligent Transportation, Southwest Jiaotong University. His research interest covers reinforcement learning and intelligent control

  • 摘要: 最优控制与人工智能两个领域的融合发展产生了一类以执行-评判设计为主要思想的自适应动态规划(Adaptive dynamic programming, ADP)方法. 通过集成动态规划理论、强化学习机制、神经网络技术、函数优化算法, ADP在求解大规模复杂非线性系统的决策和调控问题上取得了重要进展. 然而, 实际系统的未知参数和不确定扰动经常导致难以建立精确的数学模型, 给最优控制器的设计构成了挑战. 近年来, 具有强大自学习和自适应能力的数据驱动ADP方法受到了广泛关注, 它能够在不依赖动态模型的情况下, 仅利用系统的输入输出数据为复杂非线性系统设计出稳定、安全、可靠的最优控制器, 符合智能自动化的发展潮流. 通过对数据驱动ADP方法的算法实现、理论特性、相关应用等方面进行梳理, 着重介绍了最新的研究进展, 包括在线Q学习、值迭代Q学习、策略迭代Q学习、加速Q学习、迁移Q学习、跟踪Q学习、安全Q学习、博弈Q学习, 并涵盖数据学习范式、稳定性、收敛性以及最优性的分析. 此外, 为了提高学习效率和控制性能, 设计了一些改进的评判机制和效用函数. 最后, 以污水处理过程为背景, 总结了数据驱动ADP方法在实际工业系统中的应用效果和存在问题, 并展望了一些未来值得研究的方向.
  • 旋翼无人机在海上作业时常面临复杂变化的海风环境. 由于无人机体积小、重量轻, 容易产生机身剧烈摇晃, 导致控制失效[1−4]. 尽管增加机身尺寸和提升配置可以增强抗风性能, 但这会增加载重、降低无人机的灵活性[5−6]. 因此, 在保证无人机轻便灵活的前提下提升抗风能力, 成为当前研究的重要方向. 部分研究集中于开发和优化控制算法来提升无人机的性能[7], 但在海上环境的应用中仍存在显著挑战. 首先, 这些算法难以适应海风的突变特性. 其次, 增强抗风能力通常以增加计算负载和降低机动性为代价[8]. 以上问题表明, 提高无人机在复杂海上任务中的稳定性和安全性十分重要.

    为实现无人机海上稳定飞行, 亟需解决现有算法难以有效响应风力变化且缺乏自适应调节的问题. 传统的气动力建模主要依赖明确的气动模型[9], 在动态环境中存在误差较大、迁移能力差的问题. 此外, 这类模型缺乏自适应控制策略. 为此, 文献[10−11]提出基于扰动观测器 (Disturbance observer-based, DOB) 和自抗扰控制 (Active disturbance rejection control, ADRC) 的算法, 以补偿系统扰动, 提高输入与输出之间的匹配度. 然而, 这些方法高度依赖于对扰动的精确估计, 估计误差可能导致控制性能下降. 此外, 为提升无人机的抗风能力, 多项研究工作致力于优化风感知、风阻抗调整算法和风场预测策略. Sziroczak 等[12] 提出轻量化风感知方案, 但其在复杂风场中的学习效果不佳. Xue 等[13] 结合深度强化学习和域随机化方法, 但在多变风场中的适应性有限. 此外, Song 等[14] 通过模型预测控制 (Model predictive control, MPC) 实时调整正则化参数, 提升无人机的抗风能力. 然而, 这些方法仍未充分解决风场变化的预测问题. 为此, Li 等[15] 提出 FlowDrone 模型, 通过热线传感器和风感知残差控制器进行风场估计, 但其泛化能力有限. 此外, Zha 等[16] 采用门控循环单元 (Gated recurrent unit, GRU) 和时间卷积网络 (Temporal convolutional network, TCN) 组合优化风场预测性能, 但其模型复杂且需精确调参. 综上所述, 现有研究虽在风场感知和抗风控制方面取得进展, 但在应对复杂海风环境时仍存在旋翼无人机控制算法适应性不足、计算复杂度较高和飞行姿态不稳定的问题. 因此, 亟需提升算法的适应性、优化风场预测性能, 并降低计算负担, 以确保无人机在复杂海上环境中的稳定飞行.

    为此, 本文提出一种具有抗风特性的域对抗自适应姿态稳定方法, 旨在提升无人机对复杂海上环境的适应能力, 确保飞行姿态稳定性. 风场概念图如图1所示. 该方法包含离线的对称式时序域对抗自适应学习算法 SymTAL 和在线风场预测模型 POP. SymTAL 利用域对抗学习捕捉气动共享特征和全面信息, 采用对称性网络层结合双向时序网络的设计, 有效减少网络层的复杂度, 提升学习效率和算法性能. 为提升计算效率, 引入深度学习加速技术及改进的 Adam 自适应优化器, 以降低损失率, 实现精准的飞行姿态控制. 此外, 通过正则化超参数调整优化对抗性框架, 进一步增强抗风能力. 在线风场预测模型 POP 通过变分模态分解技术 (Variational mode decomposition, VMD) 对风速信号进行分解, 以便准确捕捉风速特征. 模型池根据风速特征进行决策, 选择 GRU-TCN 或支持向量回归 (Support vector regression, SVR) 模型来建模风速信号, 从而提升预测准确性. 该方法增强无人机在不同风力条件下的适应性, 实现海风下的自适应稳定飞行.

    图 1  风场概念图
    Fig. 1  Wind field concept map

    本文的主要创新点:

    1) 提出一种具有抗风特性的自适应飞行控制方法: 将离线域对抗自适应学习算法 SymTAL 和在线风场预测模型 POP 相结合, 解决海面风力变化导致无人机不稳定的问题.

    2) 优化离线学习算法: SymTAL 在域对抗学习算法中引入对称性网络和双向长短期记忆网络 (Bi-directional long short-term memory, BiLSTM), 并通过深度学习加速技术改进 Adam 优化器, 优化对抗性框架, 提升算法性能和计算效率.

    3) 提升在线风场预测能力: POP 使用变分模态分解技术分解风速信号, 并根据特性选择适当的预测模型, 大幅度提高预测的精确度.

    为确保无人机在海风环境下稳定飞行, 研究者在不扩大体积和调整配置的情况下, 致力于提升控制算法的性能. 文献[17−18]表明, 对称性神经网络层结构在无人机几何自适应控制中展现优异性能, 特别在处理多级数据改善和抑制风扰动方面表现出色. Shi 等[19] 研究结合具有对称特性的深度神经网络的控制算法, 通过考虑风向的对称性, 明显减小跟踪误差. 除此之外, 域对抗学习被应用于提高无人机在不同风域中的性能. Michael 等[20] 提出的 Neural-fly 使用域对抗不变元学习减小跟踪误差, 实现精确的飞行控制, 使无人机能更好地适应复杂多变的风场环境. Sun 等[21] 研究引入具有对称性的网络层, 结合鲸鱼优化算法 (Whale optimization algorithm, WOA) 和多层感知机模型, 提高风速预测的精度, 从而控制无人机稳定飞行.

    Kim 等[22] 提出一种 MemN2N 架构, 引入 BiLSTM 以更好地捕捉时间特征. Ma 等[23] 研究基于堆叠长短期记忆网络 (Long short-term memory, LSTM) 的多输出迭代预测模型, 利用多输出迭代预测策略有效减少长期时间序列预测的误差积累和误差传播问题. Guo 等[24] 研究集成 BiLSTM 预测模型, 通过获得稳定的时间序列作为模型输入数据, 有效消除大范围特征值的影响, 从而提升算法性能.

    利用深度学习能够显著提高算法性能. Samadianfard 等[25] 提出基于深度学习的层次化自适应策略, 该策略通过优化大批量训练过程有效减少深度神经网络的训练数据需求; Dhakal 等[26] 开发 GSTuner 自适应调优框架, 可自动确定 GPU 上模型全局优化空间的最优参数设置, 有效加快任务速度. 此外, Filippini 等[27] 提出一种 GPU 服务系统, 利用深度学习训练方法得到最佳速度曲线, 最小化能耗并有效减小预测偏差. 此外, You 等[28] 使用的混合深度模型采用不同的编码器–解码器 LSTM, 展现优于其他学习模型的任务完成速度. 在此基础上, 改进的 Adam 自适应优化器能够进一步自动调整学习率. Zhang 等[29] 采用 Adam 优化算法来解决大规模数据和参数的优化问题, 可以针对不同参数设定不同的学习率, 以实现自适应优化. Liu 等[30] 将基于深度学习的轨迹映射网络 (Trajectory mapping network, TMN) 与 L2 正则参数调整相结合, 以提高规划效率与鲁棒轨迹跟踪能力. L2 正则化方式可有效降低损失函数值, 减小预测误差.

    海风环境是多变的, 为应对复杂的海风, 需要进一步提高算法的适应性, 深化对风场的理解. 当前, 无人机风场预测和建模的方法主要包括水平空速归零法、解析测风方法、航位推算法以及皮托−静压管测风法等. Ma 等[31] 通过改进低延迟 K-means 聚类算法优化无人机数据收集任务, 实现对不同风速的快速分类. Lv 等[32] 提出基于优化变分模式分解和非线性加权组合深度学习算法的风电功率预测方法, 显著提升预测精度. Dudukcu 等[33] 提出混合深度神经网络架构, 利用变分模态分解和长短期记忆技术, 减少误差传播并提升模型的泛化能力. Zhang 等[34] 提出循环神经网络 (Recurrent neural network, RNN) 架构的 TCN 来解决混沌时间序列预测问题, 应对环境复杂性和不可预测性. Zhang 等[35] 采用时空深度学习的方法开发 MT-DETrajGRU 模型, 可同时修正风速和风向. 针对不使用风传感器的四旋翼无人机如何进行风估计和修正的问题, Kobayashi 等[36] 提出新颖的 VMD-A-LSTM-SVR 模型以解决超前预测问题. N.L.C 等[37] 提供两种不同的旋翼无人机在线风估计方法, 能够仅使用少量数据来准确地预测风速分量. 以上风估计方法为风场预测研究提供基础, 而风场预测模型则为控制算法的优化提供支持.

    VMD 被广泛用于分解具有不同震动模式的信号[32]. 具体表达式如下:

    $$ \begin{aligned} s_{i}(t)=A_{i}(t) \cos \left(2 \pi f_{i} t+\phi_{i}(t)\right) \end{aligned} $$ (1)

    其中$ A_i(t) $是幅度; $ f_i $是中心频率; $ \phi_i(t) $是相位. 将 VMD 用于处理实时风速信号, 通过选择合适调谐参数和带宽参数, 对信号调谐分解, 进而产生一系列窄带信号. 通过对每个窄带信号进行带通滤波, 提取出该频带内的有效信息:

    $$ \begin{aligned} y(t)=F^{-1}\left[F[x(t)] \cdot F[b(t)]\right] \end{aligned} $$ (2)

    其中, $ F[x(t)] $和$ F[b(t)] $分别表示$ x(t) $和$ b(t) $的傅里叶变换; $ x(t) $是时域信号; $ b(t) $是带通滤波器. 在完成滤波处理之后, 采用迭代优化策略对每个风信号的窄带分量进行调整. 重复上述过程, 不断逼近最优解, 直至达到收敛状态. 最终, 通过将所有经过调谐分解的风信号分量累加重构出风速信号.

    $$ \begin{aligned} \hat{x}(t)=\sum_{i=1}^N x_i(t) \end{aligned} $$ (3)
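    以下给出式 (2)~(3) 所述频域带通滤波与分量累加重构的一个最小化示意代码 (Python, 非原文实现; 采样率与频带划分均为示例性假设):

```python
import numpy as np

def bandpass_reconstruct(x, bands, fs):
    """示意: 对信号 x 在频域做理想带通滤波 (对应式(2)),
    再将各窄带分量累加重构信号 (对应式(3)). bands 为示例性频带划分."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    modes = []
    for f_lo, f_hi in bands:
        mask = (freqs >= f_lo) & (freqs < f_hi)      # 理想带通滤波器的频率响应
        modes.append(np.fft.irfft(X * mask, n=len(x)))
    return modes, np.sum(modes, axis=0)              # 各窄带分量及重构信号

# 用法示例: 50 Hz 采样的模拟风速信号
fs = 50.0
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 0.2 * t) + 0.5 * np.sin(2 * np.pi * 2.0 * t)
modes, x_hat = bandpass_reconstruct(x, bands=[(0.0, 1.0), (1.0, 5.0)], fs=fs)
```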

    TCN 擅长捕捉长期依赖关系[16], 通过因果卷积处理风速序列, 从前序风速$ x_1,\; x_2,\; \cdots ,\; x_t $预测未来风速$ y_1,\; y_2,\; \cdots ,\; y_t $, 实现长期风速预测. 风速$ x_t $处的因果卷积为$ (F * X)_{x_t}=\sum_{k=1}^{K} f_k x_{t-K+k} $, 其中, $ K $为卷积核大小, $ X $为输入序列, 滤波器为$ F= \left(f_1,\; f_2\right) $. 计算未来风速公式如下:

    $$ \begin{aligned} y_t=f_1 x_{t-1}+f_2 x_t \end{aligned} $$ (4)

    因果卷积施加严格的时间约束, 其结构是单向的, 从而确保在每个时间点仅能访问当前及之前的风速信息, 未来数据则被屏蔽. 针对风速数据海量且非线性的特征, TCN 网络采用残差连接技术, 可以有效缓解梯度消失问题, 提升网络训练的稳健性.
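    下面给出式 (4) 因果卷积的一个最小化示意代码 (Python; 滤波器取值为假设的示例):

```python
import numpy as np

def causal_conv(x, f):
    """示意: 核大小为 K 的因果卷积 (对应式(4)),
    y_t 仅依赖 x_{t-K+1}, ..., x_t, 未来数据被屏蔽."""
    K = len(f)
    x_pad = np.concatenate([np.zeros(K - 1), x])      # 左侧补零保证因果性
    return np.array([np.dot(f, x_pad[t:t + K]) for t in range(len(x))])

# 用法示例: K = 2 时即 y_t = f_1 * x_{t-1} + f_2 * x_t
y = causal_conv(np.array([1.0, 2.0, 3.0, 4.0]), f=np.array([0.3, 0.7]))
```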

    GRU 是循环神经网络的一种高效变体[16]. GRU 的门控机制有效增强网络对时间序列中长期依赖关系的捕捉和记忆能力, 以解决风速中的梯度消失问题, 同时减少参数数量. 并且 GRU 能够在每个时间步根据当前输入和前一个隐藏状态来更新隐藏状态, 同时通过重置门和更新门来控制风速信息的流动. 这种结构使得 GRU 在处理风速时间序列数据时, 既能够捕捉到长期依赖关系, 又能够有效地传递信息, 与 LSTM 相比, GRU 在保持相当效果的同时, 训练更为简便, 显著提升训练效率.

    面对复杂海风环境下无人机飞行控制误差较大的挑战, 本研究实现一种具有抗风特性的自适应姿态稳定控制方法. 该方法包含融合域对抗学习、对称性网络和双向时序网络的无人机姿态稳定算法 (Symmetric temporal domain adversarial learning, SymTAL), 以及一个在线实时风场预测模型 (Partitioned online prediction model, POP). 整体架构示意图详见图2. 首先, 无人机通过 POP 模型实时处理风速数据: 使用 K-means 聚类和 VMD 技术对风速信号进行分解, 并根据风速信号特征从模型池中选取合适的模型, 即选择 TCN-GRU 模型捕捉长短期依赖关系, 或使用 SVR 处理非线性关系, 获取预测风速$ W_1 $. 同时, 传感器收集当前风况下无人机的状态数据, 包括速度、四元数和脉冲宽度调制 (Pulse width modulation, PWM), 并将上述数据传输至离线 SymTAL 算法中. SymTAL 将得到的共性基函数$ {Q} $与风相关线性系数$ {a} $相乘, 计算出风效应力. 基于此结果, 确定期望力, 并将其发送至无人机飞行控制器, 从而实现无人机在线自适应稳定飞行. 此外, 由于风速和阻力之间存在复杂的关系, 且不同海拔下风速动态变化, 无人机根据风速计算公式得到当前高度下的风速$ k_z $, 从而选择最优飞行区域.

    图 2  无人机抗风自适应总体框架图
    Fig. 2  Overall framework diagram of UAVs wind resistance adaptation
    3.1.1   无人机气动理论建模

    传统的气动力建模方法虽然在物理意义上表现出色, 但其在迁移性方面的不足及参数获取过程中存在误差的问题, 使其实时气动力 (矩) 预测变得困难.

    因此, 本研究提出基于对称式时序域对抗自适应学习的无人机控制算法. 该方法使用深度神经网络, 不仅具有强大的学习能力和自动提取物理特征能力, 并且通过域对抗自适应学习方法提升模型的迁移性和泛化能力. 这种设计不仅适用于不同风力条件, 且能支持实时气动力建模, 从而克服传统方法在多场景中的局限性和测量误差问题. 无人机动力学模型如下:

    $$ \begin{aligned} M(q) \ddot{q}+C(q,\; \dot{q}) \dot{q}+g(q)=u+f(q,\; \dot{q},\; w) \end{aligned} $$ (5)

    其中$ {q},\; \dot{q},\; \ddot{q} \in {\bf{R}}^{n} $表示$ n $维的位置、速度和加速度矢量; $ M(q) $为对称的正定惯性矩阵; $ C(q,\; \dot{q}) $是科里奥利矩阵; $ g(q) $是重力矢量; $ u \in {\bf{R}}^{n} $是控制力; $ f(q,\; \dot{q},\; w) $是未建模的动力学, 其中$ w \in {\bf{R}}^{n} $是一个未知的隐藏状态, 用于表示时序变化的潜在环境条件. 本文中$ w $是风速矢量. 根据多元函数的切比雪夫级数理论, 多元函数可分解为不同变量的多项式乘积之和. 文献[15]证明

    $$ \begin{aligned} f(q,\; \dot{q},\; w) \approx Q(q,\; \dot{q}) a(w) \end{aligned} $$ (6)

    其中$ {Q}(\cdot) $是所有风条件共享的基函数或表示函数, 用于捕获未建模动力学对机器人状态的依赖性; $ a(\omega) $是针对每个条件更新的一组线性系数. 对于任何解析函数$ f(q,\; \dot{q},\; w) $, 这样的分解$ Q(q,\; \dot{q}) a(w) $都存在. 四旋翼无人机的状态由全局位置$ p \in {\bf{R}}^{3} $、速度$ v \in {\bf{R}}^{3} $、姿态旋转矩阵$ R \in SO(3) $ (特殊正交群) 以及机体角速度$ \omega \in {\bf{R}}^{3} $给出. 四旋翼飞行器的动力学如下:

    $$ \begin{aligned} \left\{\begin{aligned} &\dot{p}=v \\ &m\dot{v}=m g+R f_u+f \end{aligned}\right. \end{aligned} $$ (7)
    $$ \begin{aligned} \left\{\begin{aligned} &\dot{R}=R S(\omega) \\& J\dot{\omega}=J \omega \times \omega+\tau_u \end{aligned}\right. \end{aligned} $$ (8)

    其中$ m $为质量; $ R $为旋转矩阵; $ J $为四旋翼的惯性矩阵; $ S(\cdot) $为斜对称映射; $ g $为重力矢量; $ f_{u}=[0,\; 0,\; T]^{T} $和$ \tau_u=[\tau_x,\; \tau_y,\; \tau_z]^{T} $为由标称模型预测的四旋翼总推力和机体力矩; $ f=\left[f_{x},\; f_{y},\; f_{z}\right]^{T} $是不同风条件下未建模空气动力效应所产生的力. 将上述无人机位置动态 (7)、(8) 与动力学模型 (5) 相结合, 其中$ M(q)=m I,\; C(q,\; \dot{q}) \equiv 0,\; u=R f_u $. 因此在位置控制中, 通过该方法计算期望的力$ u_d $, 再将其分解为期望的姿态$ R_d $和期望的推力$ T_d $, 并发送到无人机的飞行控制器中, 以控制无人机飞行. 除无人机速度$ v $外, 气动效应还取决于无人机姿态和旋翼转速, 因此基函数$ {Q} $的设计考虑这些因素: 算法中深度神经网络的输入状态$ {X} $是一个 11 维向量, 由 3 维的无人机速度$ v $、4 维的姿态四元数和 4 维的飞行控制电机速度指令 PWM 组成. 此外, 风效应力的三个分量$ f_x $、$ f_y $和$ f_z $高度相关并具有共同的特征, 因此风效应力$ f $可近似为

    $$ \begin{aligned} f \approx\left[\begin{array}{ccc} Q(X) & 0 & 0 \\ 0 & Q(X) & 0 \\ 0 & 0 & Q(X) \end{array}\right]\left[\begin{array}{l} a_x \\ a_y \\ a_z \end{array}\right] \end{aligned} $$ (9)

    其中$ a_{x},\; \; a_{y},\; \; a_{z} \in R^{4} $是风效应力的每个分量的线性系数.
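    按式 (9) 由共享基函数输出与线性系数计算风效应力的过程, 可用如下示意代码表示 (Python; 数值均为假设的示例):

```python
import numpy as np

def wind_effect_force(q_x, a_x, a_y, a_z):
    """示意: 式(9) — 共享基函数输出 Q(X) (4 维) 分别与
    a_x, a_y, a_z (各 4 维) 相乘, 得到风效应力 f 的三个分量."""
    return np.array([q_x @ a_x, q_x @ a_y, q_x @ a_z])

# 用法示例 (数值为假设值): q_x 为基函数网络输出, a_* 由在线自适应更新得到
q_x = np.array([0.2, -0.1, 0.4, 0.05])
f = wind_effect_force(q_x,
                      a_x=np.array([1.0, 0.5, -0.2, 0.1]),
                      a_y=np.array([0.3, -0.4, 0.2, 0.0]),
                      a_z=np.array([-0.1, 0.2, 0.6, 0.3]))
```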

    本文采用对称式时序域对抗自适应学习算法, 对共性基函数$ {Q} $进行离线提取, 并应用到实时风况$ \omega $条件下进行训练. 在此基础上, 将计算所得的期望力传输至无人机的飞行控制器中. 该算法不仅增强了迁移能力, 也确保了无人机姿态的稳定控制.

    3.1.2   无人机控制算法

    在这一节中, 我们将详细讨论上文提到的对称式时序域对抗自适应学习算法, 见算法 1. 与其他自适应学习算法不同, SymTAL 的目标是学习一个共性基函数$ {Q} $, 以确保在海面不同风况下都能保持其普适性. 该算法以通过域对抗学习提取气动共享特征的网络作为生成器, 并在网络结构中结合具有对称性的网络结构和双向长短期记忆网络 (BiLSTM), 以共享参数来提升速度和鲁棒性. 同时, 双向时序网络能够捕捉更全面的信息, 从而适应不同环境下的数据. 判别器则作为域分类器, 用于判别不同的风况类型, 从而在基函数提取中保持物理特征的不变性. 以下是算法的细节及优化方法.

    在给定的数据集下, 对抗自适应学习的目标是: 对任意风力条件$ \omega $, 都存在一个潜在变量$ a(\omega) $, 使得$ Q(x) a(\omega) $能够较好地近似真实值$ f(x,\; w) $, 即

    $$ \begin{aligned} \min_{Q,\; a_1,\; \cdots a_K} \sum_{k=1}^K \sum_{i=1}^{N_k}\left\|y_k^{(i)}-Q\left(x_k^{(i)}\right) a_k\right\|^2 \end{aligned} $$ (10)

    其中$ {y} $是气动力 (矩); $ {x} $是由 3 维的无人机速度$ {v} $、4 维的姿态四元数和 4 维的飞行控制电机速度指令 PWM 组成的输入向量, 也是深度神经网络 (Deep neural network, DNN) 的输入数据; $ a_{k} \in {\bf{R}}^{h} $是潜在的线性系数. 最佳权重$ a_{k} $特定于每种风力条件, 而最优的基函数$ {Q} $由所有风力条件共享. 在该算法中, 使用 DNN 来训练$ {Q} $: 只要拥有足够多的神经元, $ Q(x) a(\omega) $就能以任意精度逼近真实值$ f(x,\; w) $, 从而得到风效应力$ f $. 其中$ {Q} $的结构是一个深度神经网络, 具有输入层、对称网络层、BiLSTM 层和输出层. 输入层接收 11 维的输入数据; 对称网络层由两个对称的分支组成, 每个分支由一个包含 25 个神经元的全连接层和 ReLU 激活函数构成; BiLSTM 层使用 50 个 LSTM 单元. 输出层包含 4 个神经元, 对应于 4 维输出. 其中对称层通过共享权重和拼接操作来减少参数数量, 以此降低训练量, 进而提升训练的稳定性.
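    基于上述结构描述, 共性基函数 $ Q $ 的网络可以用如下示意代码表示 (PyTorch; 对称分支的具体连接方式原文未给出, 这里假设两个共享权重的分支分别作用于输入及其取反后拼接, 仅作参考):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """共性基函数 Q 的示意实现: 11 维输入 -> 共享权重的对称全连接分支
    (25 神经元, ReLU) 拼接 -> BiLSTM (50 单元) -> 4 维输出."""
    def __init__(self, in_dim=11, branch_dim=25, lstm_units=50, out_dim=4):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(in_dim, branch_dim), nn.ReLU())
        self.bilstm = nn.LSTM(2 * branch_dim, lstm_units,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * lstm_units, out_dim)

    def forward(self, x):                      # x: (batch, seq_len, 11)
        sym = torch.cat([self.branch(x), self.branch(-x)], dim=-1)  # 对称分支共享权重 (镜像方式为假设)
        h, _ = self.bilstm(sym)                # 前向与后向隐藏状态拼接, 对应式(13)~(16)
        return self.out(h[:, -1, :])           # 取最后时间步, 得到 4 维基函数输出 Q(X)

q_net = QNet()
phi = q_net(torch.randn(8, 20, 11))            # 用法示例: batch=8, 序列长度=20
```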

    $ {w} $的影响会导致产生域偏移. 在真实的场景中, 风的影响也会导致飞行轨迹改变. 例如, 无人机顶风飞行时会产生俯仰倾斜, 并且其俯仰倾斜的平均程度取决于风的条件. 俯仰角是状态$ {x} $的一个分量, 整个状态$ x $中的域偏移会更加剧烈, 这给深度学习带来巨大挑战. DNN 可能记忆$ {x} $在不同风况下的分布, 使得动力学$ \{f\left(x,\; w_{1}\right),\; f\left(x,\; w_{2}\right),\; \cdots ,\; f(x,\; w_{k})\} $不是通过风的状况$ \left\{w_{1},\; w_{2},\; \cdots ,\; w_{k}\right\} $, 而是通过$ {x} $的分布来反映. 这样就导致过度拟合, 并且可能无法准确找到适合不同风况下无人机相对稳定的飞行姿态. 为解决上述问题, 受到 Ganin 等[38] 的工作启发, 本文提出对抗性优化框架.

    $$ \begin{split} &\max _h \min _{Q,\; a_1,\;\cdots,\;a_K} \sum_{j=1}^{K} \sum_{i=1}^{N_j}\left\|y_j^{(i)}-\text{BiLSTM}\left(\text{sym}\left(Q\left(x_{c j}^{(i)}\right)\right),\; a_{j}\right)\right\|^2 -\\ &\qquad\alpha \cdot \text{Loss}\left(R\left(Q\left(x_{c j}^{(i)}\right)\right),\; j\right) \end{split}$$ (11)

    其中, $ R $是另一个 DNN, 作为判别器来预测样本所属的风条件 (环境) 指数, 其网络包含三层: 输入层有 4 个神经元, 隐藏层有 126 个神经元, 输出层有 6 个神经元, 同样使用 ReLU 激活函数; $ {\text{Loss}}(\cdot) $是分类损失函数; $ \alpha $为控制正则化程度的超参数, 本文中$ \alpha $为 0.1; $ j $是风条件指数; 公式中的$ y^{(i)}_j $表示第$ j $个风况类别中第$ i $个样本的目标值; $ \text{sym}(Q(x_{c j}^{(i)})) $表示具有对称性的网络层结构, 具体解释在算法细节中; $ {\text{BiLSTM}}(\text{sym}(Q(x_{c j}^{(i)})),\; a_{ j}) $表示通过双向长短时记忆网络 (BiLSTM) 提取对称化特征. 直观上, $ R $和$ Q $是对抗学习和 GAN 思想的结合: $ Q $是生成器, $ {R} $是判别器. 生成器$ {Q} $最小化第一项$ \|f(x)- Q(x) a\|_{2}^{2} $, 试图生成与目标函数$ f(x) $相似的输出, 作为提取的基函数; 同时$ Q $希望$ R(Q(x)) $在分类任务中难以区分风况, 从而达到特征不变性, 便于在线风场预测模型传递不同风况时进行判别, 实现不同风况的迁移. 判别器$ R $则试图正确区分$ {Q} $生成的特征所对应的风况, 即使$ -\alpha \cdot \text{Loss}(R(Q(x_{c j}^{(i)})),\; j) $最大化. 该实验中, $ \text{{Loss}}(\cdot) $使用的是交叉熵损失, 具体如下:

    $$ \begin{split} &\text { CrossEntropy }\left(R\left(Q\left(x_{c j}^{(i)}\right)\right),\; j\right)=\\ &\qquad-\sum_c y_{c j} \log \left(R_c\left(Q\left(x_{c j}^{(i)}\right)\right)\right) \end{split}$$ (12)
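    式 (11)~(12) 的对抗目标可以用如下示意代码表示 (PyTorch; 判别器结构按正文取 4-126-6, 交替更新的组织方式为假设):

```python
import torch
import torch.nn as nn

R = nn.Sequential(nn.Linear(4, 126), nn.ReLU(), nn.Linear(126, 6))  # 判别器 R
ce = nn.CrossEntropyLoss()                     # 式(12)的交叉熵损失
alpha = 0.1                                    # 正则化超参数, 与正文一致

def generator_loss(phi, a_j, y, wind_label):
    """示意: 式(11)中生成器一侧的目标 — 最小化预测力误差,
    同时最大化判别器的分类损失以获得风况不变特征."""
    pred = phi @ a_j                           # Q(x) (batch,4) 与 a_j (4,3) 相乘得预测力
    fit = ((y - pred) ** 2).sum(dim=-1).mean()
    adv = ce(R(phi), wind_label)               # 判别器越难分类, 该项越大
    return fit - alpha * adv

def discriminator_loss(phi, wind_label):
    """示意: 判别器一侧的目标 — 正确区分不同风况."""
    return ce(R(phi.detach()), wind_label)

# 用法示例 (数据为随机示例): 交替更新生成器与判别器
phi = torch.randn(16, 4, requires_grad=True)   # 基函数网络输出 Q(x)
a_j = torch.randn(4, 3)                        # 当前风况的线性系数
y = torch.randn(16, 3)                         # 真实气动力
label = torch.randint(0, 6, (16,))             # 风况类别标签
g_loss, d_loss = generator_loss(phi, a_j, y, label), discriminator_loss(phi, label)
```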

    算法 1. 对称式时序域对抗自适应学习算法

    输入. 数据集 $ P=\{P_{f_{1}},\; \cdots,\; P_{f k}\} $; $ \alpha=0.001 $, $ 0< \eta \leq 1 $, $ \mu>0 $; 生成器 $ \text{Q} $, 判别器 $ \text{R} $.

    输出. 训练后的 Q 和 R.

    1) 从数据库中随机抽取样本集 $ P_{f k} $;

    2) 从样本集 $ P_{f k} $ 随机划分适应集 $ \text{D}^{\text{a}} $ 和训练集 $ \text{D} $;

    3) 使用最小二乘法求解

    $ \quad\quad a(\omega )=\arg\min_{a_k} \sum_{i\in D_a}\| y_{k}^{(i)}-Q\left( x_{k}^{(i)} \right) a_k\| ^2 $

    4) 当 $ \left\|a^*\right\|>\gamma,\; a^* \leftarrow \gamma \cdot \frac{a^*}{\left\|a^*\right\|} $;

    5) 训练生成器 $ \text{Q} $ 和 $ \text{R} $, 使用 BiLSTM 提取特征值

    $ \quad \quad lstm_{\text{feature}}=\operatorname{BiLSTM}\left(sym_{\text{features}},\; a_{j}\right) $

    6) 使用改进 Adam 优化器, 引入梯度的二阶矩估计;

    7) 用随机梯度下降 (SGD) 和谱归一化, 计算损失

    $ \quad \quad \max _{h} \min_{Q,\;\{a_j\}} \sum_{j} \sum_{i}\left\|\operatorname{BiLSTM}\left(\operatorname{sym}\left(Q\left(x_{c j}^{(i)}\right)\right),\; a_{j}\right)-f\left(x_{c j}^{(i)},\; c_{j}\right)\right\|^{2}-\alpha \cdot \operatorname{Loss}\left(R\left(Q\left(x_{c j}^{(i)}\right)\right),\; j\right) $

    8) 如果 $ \operatorname{rand}() \leq \eta $, 训练判别器 $ \text{R} $, 最小化 $ \sum_{i \in B}\operatorname{Loss}\left(R\left(Q\left(x_{k}^{(i)}\right)\right),\;j\right) $;

    9) 重复步骤 1) $ \sim $ 步骤 8) 直到损失收敛.
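    算法 1 中步骤 3)~4) 的自适应更新可用如下示意代码说明 (Python; $ \gamma $ 与数据均为假设的示例):

```python
import numpy as np

def adapt_coefficients(Phi, Y, gamma=1.0):
    """示意: 算法 1 步骤 3)~4) — 在适应集上用最小二乘求解线性系数 a(w),
    并在范数超过阈值 gamma 时做投影裁剪."""
    a, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # min ||Y - Phi a||^2
    norm = np.linalg.norm(a)
    if norm > gamma:
        a = gamma * a / norm                      # a* <- gamma * a* / ||a*||
    return a

# 用法示例: Phi 为适应集上的基函数输出 (N x 4), Y 为对应气动力分量 (N,)
Phi = np.random.randn(100, 4)
Y = Phi @ np.array([0.5, -0.2, 0.1, 0.3]) + 0.01 * np.random.randn(100)
a_hat = adapt_coefficients(Phi, Y, gamma=1.0)
```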

    式 (12) 和算法 1 中, $ y_{c j} $为独热编码 (即标准基向量): 如果$ j=c $, 则$ y_{c j}=1 $, 否则为 0. 算法 1 在生成器中使用$ \text{BiLSTM}\left(\text{sym} \left(Q\left(x_{c j}^{(i)}\right)\right),\; a_{j}\right) $, 通过双向长短时记忆网络 (BiLSTM) 提取对称化特征, 具体公式如下:

    $$ \begin{split} \text{BiLSTM}&\left( \text{sym}\left( Q\left( x_{cj}^{(i)} \right) \right) ,\;a_j \right) =\\ &\qquad\text{Concat}\left( \overset{\rightarrow }{h}_t,\;\overset{\gets}{h}_t \right) \end{split} $$ (13)

    在时刻$ {t} $处, $ \vec{h}_{t} $表示向前传播, LSTM 隐藏状态的计算公式如下:

    $$ \begin{aligned} \begin{cases} \vec{i}_t =\sigma\left(\vec{W}_{i i} x_t+\vec{b}_{i i}+\vec{W}_{h i} \vec{h}_{t-1}+\vec{b}_{h i}\right) \\ \vec{f}_t =\sigma\left(\vec{W}_{i f} x_t+\vec{b}_{i f}+\vec{W}_{h f} \vec{h}_{t-1}+\vec{b}_{h f}\right) \\ \vec{g}_t =\tanh \left(\vec{W}_{i g} x_t+\vec{b}_{i g}+\vec{W}_{h g} \vec{h}_{t-1}+\vec{b}_{h g}\right) \\ \vec{o}_t =\sigma\left(\vec{W}_{i o} x_t+\vec{b}_{i o}+\vec{W}_{h o} \vec{h}_{t-1}+\vec{b}_{h o}\right) \\ \vec{c}_t =\vec{f}_t \odot \vec{c}_{t-1}+\vec{i}_t \odot \vec{g}_t \\ \vec{h}_t =\vec{o}_t \odot \tanh \left(\vec{c}_t\right) \end{cases} \end{aligned} $$ (14)

    其中, $ x_t $是当前时间步的输入向量; $ h_{t-1} $是上一时间步的隐藏状态; $ c_{t-1} $是上一时间步的记忆单元状态; $\vec{W}_{ii} $、$\vec{W}_{if} $、$\vec{W}_{ig} $、$ \vec{W}_{io} $和$ \vec{W}_{hi} $、$\vec{W}_{hf} $、$ \vec{W}_{hg} $、$\vec{W}_{ho} $分别是输入和隐藏状态的权重矩阵; $ \vec{b}_{ii} $、$\vec{b}_{if} $、$\vec{b}_{ig} $、$\vec{b}_{io} $和$ \vec{b}_{hi} $、$\vec{b}_{hf} $、$\vec{b}_{hg} $、$\vec{b}_{ho} $是对应的偏置项; $ \vec{i}_t $是输入门, $ \vec{f}_t $是遗忘门, $ \vec{g}_t $是候选记忆单元, $ \vec{o}_t $是输出门; $ \vec{c}_t $是当前时间步的记忆单元状态; $ \vec{h}_t $是当前时间步的隐藏状态; $ \tanh $为双曲正切激活函数; $ \sigma $表示 Sigmoid 函数; $ \odot $表示元素相乘. 向后 LSTM 更新方程如下:

    $$ \begin{aligned} \begin{cases} \overleftarrow{i}_t=\sigma\left(\overleftarrow{W}_{i i} x_t+\overleftarrow{b}_{i i}+\overleftarrow{W}_{h i} \overleftarrow{h}_{t-1}+\overleftarrow{b}_{h i}\right) \\ \overleftarrow{f}_t=\sigma\left(\overleftarrow{W}_{i f} x_t+\overleftarrow{b}_{i f}+\overleftarrow{W}_{h f} \overleftarrow{h}_{t-1}+\overleftarrow{b}_{h f}\right) \\ \overleftarrow{g}_t=\tanh \left(\overleftarrow{W}_{i g} x_t+\overleftarrow{b}_{i g}+\overleftarrow{W}_{h g} \overleftarrow{h}_{t-1}+\overleftarrow{b}_{h g}\right) \\ \overleftarrow{o}_t=\sigma\left(\overleftarrow{W}_{i o} x_t+\overleftarrow{b}_{i o}+\overleftarrow{W}_{h o} \overleftarrow{h}_{t-1}+\overleftarrow{b}_{h o}\right) \\ \overleftarrow{c}_t=\overleftarrow{f}_t \odot \overleftarrow{c}_{t-1}+\overleftarrow{i}_t \odot \overleftarrow{g}_t \\ \overleftarrow{h}_t=\overleftarrow{o}_t \odot \tanh \left(\overleftarrow{c}_t\right) \end{cases} \end{aligned} $$ (15)

    合并前后的隐藏状态, 使其连接成一个单一的隐藏状态.

    $$ \begin{aligned} h_t=\left[ \overset{\rightarrow }{h}_t;\overset{\gets}{h}_t \right] \end{aligned} $$ (16)

    训练中使用改进的 Adam 优化器, 其参数更新表示为:

    $$ \begin{aligned} \begin{cases} m_t=\beta_1 m_{t-1}+\left(1-\beta_1\right) g_t \\ v_t=\beta_2 v_{t-1}+\left(1-\beta_2\right) g_t^2 \\ \hat{m}_t=m_t /\left(1-\beta_1^t\right) \\ \hat{v}_t=v_t /\left(1-\beta_2^t\right) \\ \theta_{t+1}=\theta_t-\dfrac{\alpha}{\sqrt{\hat{v}_t+\varepsilon}} \hat{m}_t \end{cases} \end{aligned} $$ (17)

    其中, $ {m}_{t} $和$ {v}_{t} $分别是梯度的一阶和二阶矩估计; $ \beta_{1} $和$ \beta_{2} $是衰减率, 分别设置为 0.9 和 0.999; $ \beta_{1}^{t} $和$ \beta_{2}^{t} $是$ \beta_{1} $和$ \beta_{2} $的$ t $次方; $ g_{t} $是当前步骤的梯度; $ \hat{m}_t $和$ \hat{v}_t $是修正后的一阶和二阶矩估计; 学习率$ \alpha $控制参数更新的步长; $ \varepsilon $是一个很小的常数, 用于避免分母为零. 模型参数$ \theta_t $通过更新规则$ \theta_{t+1} = \theta_t - \alpha \hat{m}_t / \sqrt{\hat{v}_t + \varepsilon} $进行迭代优化. 通过修正 $ v_t $ 使其始终保持为历史的梯度平方最大值

    $$ \begin{aligned} v_t=\max \left(v_{t-1},\; g_t^2\right) \end{aligned} $$ (18)

    来减少学习率过大的问题, 从而提升算法的稳定性. 步骤 7) 通过调整正则化超参数, 并使用随机梯度下降和谱归一化, 旨在提升算法鲁棒性. 此外, 步骤 8) 在每次迭代中以概率$ \eta \leq 1 $更新判别器, 以提高算法的收敛性.
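    式 (17)~(18) 的改进 Adam 更新规则可用如下示意代码表示 (Python; 变量维数与梯度数值均为示例):

```python
import numpy as np

def improved_adam_step(theta, grad, state, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """示意: 式(17)的 Adam 更新, 并按式(18)保留历史梯度平方的最大值,
    以抑制学习率过大带来的震荡."""
    state['t'] += 1
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad ** 2
    state['v_max'] = np.maximum(state['v_max'], state['v'])       # 式(18)
    m_hat = state['m'] / (1 - beta1 ** state['t'])
    v_hat = state['v_max'] / (1 - beta2 ** state['t'])
    return theta - lr * m_hat / np.sqrt(v_hat + eps)               # 式(17)最后一行

# 用法示例
state = {'t': 0, 'm': np.zeros(4), 'v': np.zeros(4), 'v_max': np.zeros(4)}
theta = improved_adam_step(np.zeros(4), np.array([0.1, -0.2, 0.3, 0.0]), state)
```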

    鉴于复杂海风环境下离线训练模型难以涵盖所有可能的风况场景, 本文引入实时在线风场预测模型. 该模型能够动态预测风速, 并将预测风速数据传输至离线 SymTAL 算法中, 以更新算法中的参数$ w $. 通过这种方式, 确保无人机能够及时响应风速变化, 实现自适应稳定控制.

    3.2.1   在线风场预测模型

    本节详细解释上文提到的在线风场预测模型. 首先通过 K-means 算法进行数据预处理: 风的时间序列$ v_{{date }} $对应数据点$ x_{i} $, 将其分割成$ {K} $个簇, 每个簇的中心为$ c_{j} $, 簇内包含相似度较高的数据点, 并用下列公式计算数据点与簇中心之间的距离.

    $$ \begin{aligned} d\left(x_i,\; c_j\right)=\sqrt{\sum_{k=1}^n\left(x_{i,\; k}-c_{j,\; k}\right)^2} \end{aligned} $$ (19)

    其中$ {n} $是风的数据点$ x_{i} $的特征维度. 之后将数据点$ x_{i} $分配到离它最近的簇中心$ c_{j} $所属的簇.

    $$ \begin{aligned} \text{Cluster}\left(x_i\right)=\arg \min _j d\left(x_i,\; c_j\right) \end{aligned} $$ (20)
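    式 (19)~(20) 的距离计算与簇分配可用如下示意代码表示 (Python; 数据点与簇中心为示例值):

```python
import numpy as np

def assign_clusters(X, centers):
    """示意: 式(19)~(20) — 计算各风速数据点到簇中心的欧氏距离, 并就近分配."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)  # (N, K) 距离矩阵
    return np.argmin(d, axis=1)

# 用法示例: 将一维风速数据点分配到 K = 2 个簇
X = np.array([[5.1], [5.3], [9.8], [10.2]])
centers = np.array([[5.0], [10.0]])
labels = assign_clusters(X, centers)           # -> array([0, 0, 1, 1])
```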

    而后使用 VMD 对$ x_{i} $进行分解, 将信号分解成多个本征模态函数. 首先对输入风信号$ x(t) $进行周期性的镜像扩展, 形成镜像序列$ f(t) $:

    $$ \begin{aligned} f(t)= \begin{cases}x\left(t-\dfrac{T}{2}\right),\; & t \in\left[0,\; \dfrac{T}{2}\right) \\ x(t),\; & t \in\left[\dfrac{T}{2},\; \dfrac{3 T}{2}\right) \\ x\left(t+\dfrac{T}{2}\right),\; & t \in\left[\dfrac{3 T}{2},\; {2 {\; T}}\right)\end{cases} \end{aligned} $$ (21)

    其中$ {T} $是信号的长度. 对 VMD 进行初始化, 初始化平衡参数$ {\alpha} $和频谱镜像$ f_{{hat\ plus}} $:

    $$ \begin{aligned} \begin{cases} \alpha=\alpha \times 1_K \\ f_{{hat\ plus }}= { fftshift }(f f t(f)) \end{cases} \end{aligned} $$ (22)

    其中$ 1_{K} $是大小为$ {K} $的全一向量, 通过 VMD 的迭代调整模态和中心的频率, 不断优化能量函数:

    $$ \begin{aligned} \min _{u_1,\; \cdots\; ,\; u_k} \sum_{i=1}^K\left\|u_i\right\|_1+\alpha_i \sum_{i=1}^K \frac{\left\|u_i\right\|_2^2}{2} \end{aligned} $$ (23)

    其中$ u_{i} $是第$ {i} $个模态; $ \alpha_{i} $是平衡参数. 最后进行风场速度重构, 通过反傅里叶变换将模态的频谱转换回时域, 得到每个模态对应的风场速度信号:

    $$ \begin{aligned} u_i(t)=\text{Re}\left( { ifft }\left({ ifftshift }\left(u_{i,\; { hat }}(t)\right)\right)\right) \end{aligned} $$ (24)

    最后输出$ u_{i}(t) $、频谱$ u_{i,\; { hat }}(t) $以及估计的中心频率$ \omega_i $. $ u_i(t) $输入到模型池, 在模型池中通过风速信号的赫斯特指数 (Hurst exponent, HURST) 来选择模型. 由于连续风速具有强烈的连贯性, 通过时间序列的 R/S 分析法[39] 计算得到的 HURST 阈值取为 0.9. 当 HURST 指数小于 0.9 时, 进入 TCN-GRU 模型, 其输入表示为$ X=\left\{x_1,\; x_2,\; \cdots ,\; x_{seq_{len}}\right\} $, $ seq_{len} $代表输入序列的长度; 通过 TCN 捕捉风速时间序列的长程依赖关系, 输出$ H=\{H_{1},\; H_{2},\; \cdots ,\; H_{seq_{len}}\} $, 训练公式如下:

    $$ \begin{aligned} h_t=f\left(W_k \times X_{t-k}+b_k\right) \text {,\; } \end{aligned} $$ (25)

    其中$ W_k $是卷积核, 卷积核数量为 10 且大小为 2; $ b_k $是偏差; $ f $是激活函数. 卷积操作的参数通过训练学习短期依赖关系. 随后将卷积输出送入隐藏单元数量为 32 的 GRU, 返回每个时间步的输出, 公式如下:

    $$ \begin{aligned} g_t = \text{GRU}\left(h_t\right) \end{aligned} $$ (26)

    输出$ G=\left\{G_{1},\; G_{2},\;\; \cdots\; ,\; G_{s e q_{l e n}}\right\} $, 该模型使用三个全连接层, 并且每层都具有 ReLU 激活函数. 这些层的输出分别为:

    $$ \begin{aligned} \begin{cases} F C_1=\text{ReLU} \left(W_1 \times G+b_1\right) \\ F C_2=\text{ReLU} \left(W_2 \times F C_1+b_2\right) \\ F C_3=\text{ReLU} \left(W_3 \times F C_2+b_3\right) \end{cases} \end{aligned} $$ (27)

    其中$ W_{1},\; W_{2},\; W_{3} $是全连接层的权重矩阵; $ b_{1},\; b_{2},\; b_{3} $是偏差; ReLU 表示修正线性单元激活函数; 每一层的神经元数量分别为 16、8、4; $ FC_1 $、$ FC_2 $和$ FC_3 $分别表示三个全连接层的输出. 最后通过如下公式计算, 输出最终的风速序列结果$ final_{{result }} $

    $$ \begin{aligned} final_{{result }}=W_{ {out }} \times F C_3+b_{ {out }} \end{aligned} $$ (28)

    其中, $ W_{ {out }} $是输出层的矩阵权重; $ b_{ {out }} $是偏差.
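    式 (25)~(28) 描述的 TCN-GRU 预测分支可用如下示意代码表示 (PyTorch; 层数与单元数按正文取值, 其余细节为假设):

```python
import torch
import torch.nn as nn

class TCNGRUPredictor(nn.Module):
    """示意: 因果卷积 (10 个卷积核, 核大小 2) -> GRU (32 隐藏单元)
    -> 三层全连接 (16, 8, 4, ReLU) -> 1 维风速输出, 对应式(25)~(28)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 10, kernel_size=2)
        self.gru = nn.GRU(10, 32, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(),
                                nn.Linear(16, 8), nn.ReLU(),
                                nn.Linear(8, 4), nn.ReLU(),
                                nn.Linear(4, 1))

    def forward(self, x):                              # x: (batch, seq_len, 1)
        z = nn.functional.pad(x.transpose(1, 2), (1, 0))   # 左侧补零实现因果卷积
        h = torch.relu(self.conv(z)).transpose(1, 2)       # (batch, seq_len, 10)
        g, _ = self.gru(h)
        return self.fc(g[:, -1, :])                    # 取最后时间步预测下一时刻风速

model = TCNGRUPredictor()
y_hat = model(torch.randn(4, 24, 1))                   # 用法示例: 24 步历史风速
```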

    当 HURST 指数大于 0.9 时, 进入支持向量回归 (Support vector regression, SVR) 模型, 并使用网格搜索方式选择超参数. 将$ u_{i}(t) $划分为$ {X} $和$ {Y} $数据集: $ X= \{ x_{1},\; x_{2},\; \cdots ,\; x_{seq_{len}} \} $是时间序列风速数据, $ Y=\{y_{1},\; y_{2},\; \cdots ,\; y_{seq_{len}}\} $, 其中$ y_{i} $是$ t_{i+1} $处的风速值, 即要预测的下一个时间点的风速. 定义一组候选的$ C $和$ \gamma $值, 对于每个组合执行以下步骤: 将数据集$ X $和$ Y $划分为训练集和测试集, 使用$ X $来训练 SVR, 其中核函数使用 RBF:

    $$ \begin{aligned} K\left(x,\; x^{\prime}\right)=\text{e}^{\left(-\frac{\left\|x-x^{\prime}\right\|^2}{2 \gamma^2}\right)} \end{aligned} $$ (29)

    其中$ K\left(x,\; x^{\prime}\right) $是核函数的值; $ x $和$ x^{\prime} $是输入空间中的两个样本点; $ \left\|x-x^{\prime}\right\| $是这两个样本点之间的欧氏距离; $ \gamma $是控制核宽度的参数, 初始设定$ \gamma $为 0.1. SVR 可以表述为:

    $$ \begin{aligned} \min _{\omega,\; b,\; \xi,\; \xi^*} \frac{1}{2}\|\omega\|^2+C \sum_{i=1}^N\left(\xi_i+\xi_i^*\right) \end{aligned} $$ (30)

    使得

    $$ \begin{aligned} \begin{cases} y_i-\langle\omega,\; \phi\left(x_i\right)\rangle-b \leq \varepsilon+\xi_i \\ \langle\omega,\; \phi\left(x_i\right)\rangle+b-y_i \leq \varepsilon+\xi_i^*,\; \xi_i,\; \xi_i^* \geq 0 \end{cases} \end{aligned} $$ (31)

    其中, $ y_i $表示第 $ i $个样本的真实标签(目标值), 是回归问题中需要拟合的目标; $ \omega $是模型的权重向量, 用于定义回归超平面的方向; $ \phi(x_i) $是将输入特征 $ x_i $映射到高维特征空间的函数, 通过核函数实现非线性映射; $ b $是模型的偏置项, 用于调整回归超平面的位置; $ \varepsilon $是误差带的宽度, 表示允许的误差范围; $ \xi_i $和 $ \xi_i^* $分别是第$ i $个样本的上界和下界松弛变量, 用于衡量样本点超出误差带的程度, 且均为非负值; $ \langle \omega,\; \phi(x_i) \rangle $表示权重向量 $ \omega $与映射后的特征向量 $ \phi(x_i) $的内积, 即模型对第 $ i $个样本的预测值.

    通过寻找$ C $和$ \gamma $的最优组合, 并使用$ Y $来评估模型性能, 从而得到最优模型. 最后对新的时间点$ t_{new} $, 使用学习到的最优模型进行预测:

    $$ \begin{aligned} \hat{y}_{\text {new }}=\langle\omega,\; \phi\left(t_{\text {new }}\right)\rangle+b \end{aligned} $$ (32)

    从而得到最终的风速序列结果$ final_{{result}} $. 上述流程见算法 2.
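    SVR 分支 (式 (29)~(32)) 结合网格搜索的流程可用如下示意代码表示 (Python/scikit-learn; 候选超参数与滑窗长度均为假设的示例):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

def fit_svr_predictor(u, seq_len=12):
    """示意: 将风速分量 u 滑窗构造 (X, Y), 并用网格搜索选取
    RBF 核 SVR 的 C 与 gamma (对应式(29)~(31))."""
    X = np.array([u[i:i + seq_len] for i in range(len(u) - seq_len)])
    Y = u[seq_len:]                                    # 下一时刻风速作为回归目标
    grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1]}
    search = GridSearchCV(SVR(kernel='rbf'), grid,
                          cv=TimeSeriesSplit(n_splits=3),
                          scoring='neg_mean_squared_error')
    search.fit(X, Y)
    return search.best_estimator_

# 用法示例: 对某一 VMD 分量训练并按式(32)预测下一时刻风速
u = np.sin(np.linspace(0, 20, 300)) + 0.1 * np.random.randn(300)
svr = fit_svr_predictor(u)
next_wind = svr.predict(u[-12:].reshape(1, -1))
```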

    算法 2. 在线风场预测模型

    输入. 风的时间序列 $ v_{{date}} $, VMD 分解的模态数量 $ v m d_k $, 序列长度 $ s e q_{l e n} $.

    输出. 预测出的风的时间序列 $ final_{{result }} $.

    1) K-means 聚类: $ perform_{{kmeans}} \leftarrow v_{{date }} $;

    2) VMD 分解:

     $ function vmd(signal,\; \alpha,\; tau,\; K,\; DC,\; init,\;tol) $

    3) for $ i=1 $ to $ v m d_{k} $

      根据 R/S 分析法计算 Hurst 指数, 得到 $ hurst_{value} $;

    4) 判断 $ hurst_{value} $, 选择模型

    $\quad\quad if\ hurst_{{value}}> threshold: flag=1 $

    $\quad\quad\quad\quad model = function\ svr(\; ) $

    $\quad\quad else: flag=2 $

    $\quad\quad\quad\quad model = function\ {tcn}_{gru}(\; ) $

    5) 结果组合

    $ \quad\quad final_{{result }}= combine_{{r}}\left(date,\; predicted_{{date}}\right) $
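    算法 2 中基于 Hurst 指数的模型选择逻辑可用如下示意代码表示 (Python; 这里的 R/S 计算是简化版本, 实际 R/S 分析通常需在多个窗口长度上回归):

```python
import numpy as np

def hurst_rs(x):
    """示意: 简化的 R/S 法估计 Hurst 指数 (对整段序列只计算一次 R/S)."""
    x = np.asarray(x, dtype=float)
    y = np.cumsum(x - x.mean())                 # 去均值后的累积偏差
    rs = (y.max() - y.min()) / x.std()          # 极差与标准差之比 R/S
    return np.log(rs) / np.log(len(x))          # 由 (R/S) ~ n^H 粗略反推 H

def select_model(u, threshold=0.9):
    """示意: 算法 2 步骤 4) — Hurst 指数大于阈值时选择 SVR 分支,
    否则选择 TCN-GRU 分支 (与正文 3.2.1 节描述一致)."""
    return 'svr' if hurst_rs(u) > threshold else 'tcn_gru'

# 用法示例
u = np.cumsum(np.random.randn(500))             # 持续性较强的示例序列
branch = select_model(u)
```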

    为验证 SymTAL-POP 无人机飞行控制方法在抑制大风扰动方面的有效性, 设计以下实验, 实验包括以下内容: 1) SymTAL 离线部分实验, 评估其在遭受大风干扰时保持姿态稳定的能力; 2) POP 在线部分实验, 对风速预测算法进行验证, 评估其在实时预测风速变化方面的准确性; 3) 整体方法实验, 定量比较 SymTAL-POP 与非线性基线方法和两种最先进的自适应飞行控制方法的轨迹跟踪性能, 评估本文方法在大风中的控制能力.

    本实验的数据集由两部分组成: 离线数据和在线数据. 离线数据来源于文献[20], 涵盖无人机在 0 至$ 45\;\text{km} / \text{h} $范围内均匀且稳定的风速条件下的性能参数, 给定输入包括体轴速度$ v $ (单位: $ \text{m} / \text{s} $)、角速度$ w $ (单位: $ \text{rad} / \text{s} $) 以及飞控系统输出的电机速度 PWM 信号. 这些数据以$ 50\; \text{Hz} $的频率采集, 总计 36 000 个数据点. 在线数据则来源于美国南德克萨斯州近海的风能数据资源. 在此选取 2014 年 1 月 1 日至 2 月 3 日期间的数据作为训练集, 而 2 月 4 日至 2 月 11 日期间的数据则用作测试集. 在此期间, 每隔 1 小时进行一次数据采集和测试.

    为训练获得共性基函数的深度神经网络模型, 将离线数据集分割为适应集与测试集两部分. SymTAL 算法通过其对称的网络层结构和权重共享有效减少模型参数数量, 进而显著提升模型的收敛速度, 使得训练过程更为迅速且易于达到收敛状态, 从而更有效地适配训练数据.

    该实验旨在验证 SymTAL 算法中对称式对抗网络层结构的稳定性. 为此将算法 1 中的判别器 R 输出的联合对抗损失与 Neural-Fly 方法进行比较, 结果如图3所示. 使用交叉熵误差 (Cross-entropy loss) 作为性能评估指标, 以有效衡量模型性能.

    图 3  对抗损失 Loss_c
    Fig. 3  Against loss (Loss_c)

    实验结果显示, SymTAL 算法在训练初期迅速收敛, 前 300 轮即达到较低损失值 1.778, 并在后续训练中持续稳定下降, 至 1000 轮时损失值为 1.772. 相比之下, Neural-Fly 算法的损失值在训练的第 150 至 450 轮之间波动, 范围约为 1.797 到 1.779, 后期在 750 轮损失在 1.791 至 1.762 之间剧烈波动. 综上所述, SymTAL 算法在较少的迭代次数内便能学习到有效的控制策略, 并且随着训练轮数的增加, 其性能呈现稳定提升的趋势.

    为验证算法使用改进 Adam 优化器后, $ Q(x) a(\omega) $是否足够逼近真实的数值$ f(x,\; w) $, 以均方误差 (Mean squared error, MSE) 为衡量指标, 比较 SymTAL 与 Neural-Fly 方法的 Loss_f, 结果如图4所示.

    图 4  预测力损失Loss_f
    Fig. 4  Predict loss (Loss_f)

    在训练的起始阶段, SymTAL 与 Neural-fly 两种算法的损失值表现出相似的趋势. 对于 SymTAL 算法, 其生成器 Q 的损失在前 200 轮训练中迅速下降, 并最终稳定在 0.485. 而 Neural-fly 算法的损失在大约 170 轮训练后开始趋于稳定. 为深入比较这两种算法, 进一步分析它们的收敛值以及完成 1 000 轮训练所需的运行时间, 结果如表1所示.

    表 1  Loss_f 值分析表
    Table 1  Loss_f value analysis table
    方法 收敛轮次 (轮) 收敛值 运行时间 (s)
    Loss_f N-F 150 $ 0.8638 \pm 0.044 $ $ 149.2 \pm 1.37 $
    SymTAL 160 $ 0.4818 \pm 0.045 $ $ 116.6 \pm 1.15 $

    根据图4和表1的数据分析, 在训练前 150 轮, Loss_f 的损失值迅速降低, 随后下降速度逐渐放缓并趋于稳定. 在此期间, Neural-fly 算法展现较快的收敛速度. 然而, Neural-fly 算法的最终收敛值为 0.863 8, 而 SymTAL 算法的收敛值则为 0.481 8, 这意味着 SymTAL 算法在收敛阶段的性能提升大约$ 44 \% $. 这一结果证实改进的 Adam 优化器在训练初期采用较高学习率的策略是有效的. 随着训练轮数的增加, 学习率逐渐降低, 有效缓解梯度消失的问题. 此外, 尽管 Neural-fly 算法在收敛速度上领先, 但 SymTAL 算法由于采用深度学习加速技术, 在训练时间上的表现更出色, 其速度比 Neural-fly 快$ 21.8\% $. 综合各项指标, 改进后的 Adam 优化器不仅提升 SymTAL 算法的精度, 使其预测值更接近真实值, 而且在训练效率上也实现显著增强. 这些成果共同表明, SymTAL 算法在精确度和效率两个维度上均展现出明显的优势.

    为评估 SymTAL 算法中神经网络学习效果对损失的影响, 采用均方误差作为评价指标, 分别在不同的风速下对比平均输出损失、N-F 学习后损失以及 SymTAL 学习后损失, 如图5所示. 通过拟合曲线可以明显看出, 在不同的风速下, 未学习状态下的输出损失最高, 经过神经网络学习后损失均有减少. 特别值得注意的是, 在最高风速 45 km/h 的条件下, SymTAL 算法的均方误差降低至 2.8, 而 Neural-fly 算法的损失误差为 3.2, 相比之下, SymTAL 算法相较于 Neural-fly 算法误差降低 12.5%.

    图 5  抗风算法学习损失对比
    Fig. 5  Comparison of learning loss of anti-wind algorithms

    在海面上, 风场受到多种外界条件的影响, 如地形、大气压力等, 导致风速呈现出连续风、湍流风和间歇性风等不确定性特征. 为应对这些复杂情况, 本实验设计一个基于近海风场数据的预测模型. 该模型通过训练集数据进行训练, 旨在预测未来 200 小时内的风速状况, 如图6所示. 实验结果表明, 该模型在处理包含多种复杂特征的海面风速场景时表现出良好的适应性和较低的预测误差. 即便当风速在 70$ \; \sim\; $120 小时区间内发生突变, 且伴随着不规则的气流波动, 模型仍能维持有效的预测性能.

    图 6  风速预测结果
    Fig. 6  Wind speed prediction results

    为精确评估算法在风速预测方面的准确性, 本研究针对连续风、湍流风和间歇性风这三种不同的风速模式进行预测分析. 在评估风速预测精度时, 采用结合信号分解和深度学习技术的方法[40−43], 这一方法通过一系列误差度量指标对不同算法进行定量的评估. 具体的数据和分析结果见表2, 这些数据为该模型在不同风速条件下的预测性能提供科学的量化依据.

    表 2  连续风场算法性能比较
    Table 2  Performance comparison of continuous wind field algorithms
    评估指标 cnn-lstm[40] vmd-am-lstm vmd-cnn[41] vmd-cnn-lstm[42] vmd-gru[43] vmd-lstm vmd-tcn-lstm POP(ours)
    MSE Mean 0.0653 0.0654 0.0596 0.0618 0.0625 0.0545 0.0616 $ \boldsymbol{0.0519} $
    MSE SD 0.0490 0.0490 0.0400 0.0420 0.0460 0.0390 0.0390 $ \boldsymbol{0.0280} $
    MAE Mean 0.1999 0.2000 0.1914 0.1955 0.1954 0.1827 0.1954 $ \boldsymbol{0.1752} $
    MAE SD 0.0801 0.0804 0.0694 0.0727 0.0762 0.0673 0.0679 $ \boldsymbol{0.0508} $
    RMSE Mean 0.2434 0.2435 0.2346 0.2384 0.2388 0.2243 0.2391 $ \boldsymbol{0.2194} $
    RMSE SD 0.0780 0.0780 0.0670 0.0700 0.0740 0.0650 0.0660 $ \boldsymbol{0.0500} $
    MAPE Mean 66.9373 66.1924 61.2980 61.0475 63.0942 $ \boldsymbol{51.9177} $ 65.2285 56.0368
    MAPE SD 23.0140 22.0060 17.5740 17.0630 19.8450 15.6650 19.3100 $ \boldsymbol{9.1550} $
    MAXE Mean 0.6491 0.6416 0.6421 0.6444 0.6413 0.6418 0.6481 $ \boldsymbol{0.6106} $
    MAXE SD 0.1120 0.1100 0.1100 0.1090 0.1090 0.1090 0.1051 $ \boldsymbol{0.0960} $
    ARE Mean 0.6163 0.6441 0.6491 0.5558 0.6330 0.6340 0.6116 $ \boldsymbol{0.5406} $
    ARE SD 0.1810 0.2020 0.2110 0.0850 0.2020 0.2050 0.1540 $ \boldsymbol{0.0780} $

    均方误差 (MSE) 通过平方误差放大较大的预测偏差, 其中较低的 MSE 值意味着预测值与真实值之间的平均差异较小. POP 模型在风速预测中展现出的低 MSE 均值 (0.051 9) 表明其具有较高的预测精度, 且优于其他算法. 平均绝对误差 (Mean absolute error, MAE) 衡量的是预测值与真实值之间绝对差的平均值, 反映预测值与真实值的接近程度. POP 算法的 MAE 均值 (0.175 2) 不仅是所有算法中最低的, 而且其标准差与 vmd-lstm 算法相比有约 24% 的提升, 这进一步验证其预测结果的稳定性. 均方根误差 (Root mean square error, RMSE) 是误差平方的平均值的平方根, 提供对风速预测误差的一种直观度量. POP 算法的 RMSE 均值 (0.219 4) 最小, 说明其预测结果偏差最小, 进一步证实其精准的预测能力. 平均绝对百分比误差 (Mean absolute percentage error, MAPE) 指标以百分比形式表现预测误差相对于真实值的绝对大小. 尽管 vmd-lstm 算法在平均误差上表现最佳, 但 POP 算法在标准差上提升 41.6%, 表明其在预测精度相当的情况下, 具有更高的稳定性. 最大绝对误差 (Maximum absolute error, MAXE) 指标反映在最坏情况下的预测误差, POP 算法的低 MAXE 均值 (0.610 6) 表明即使在极端情况下, 预测误差也能得到有效控制. 平均相对误差 (Absolute relative error, ARE) 指标, 即平均绝对误差与真实值的比率, 衡量预测误差占真实值的比例, 体现模型的相对准确度. POP 算法在 ARE 上的表现也优于 vmd-tcn-lstm, 证明其对风速变化快速响应的能力. 平均值 (Mean) 是描述数据集中趋势的一种常用方法, 而低平均值表示预测结果能较好地捕捉风速变化规律, 从整体看 POP 的平均值低于其他预测模型. 标准差 (Standard deviation, SD) 是衡量预测值变异程度的关键指标, 低的标准差说明预测结果稳定. POP 算法在 MAE 和 RMSE 上的低标准差进一步证实其结果的稳定性和可靠性.

    综上所述, POP 算法在风速预测上的表现通过各项性能指标得到全面验证. 其不仅在平均误差上表现优异, 而且在误差波动和极端误差控制方面也展现出色的性能, 这充分证明 POP 算法在风速预测领域的优越性和稳健性.

    间歇风是指风速间歇性地增减, 即风速会突然增加然后又突然减小, 风速变化剧烈且频繁. 湍流风是指风速和风向在空间和时间上均呈现不规则的变动, 风速变化随机且无规律. 表3展示 POP 算法与其他对比方法在间歇性风场中的风速预测结果. 表4展示 POP 算法与其他对比方法在湍流风风场中的风速预测结果. 通过观察表格数据, 在极端风速条件下, POP 算法相比其他方法具有更高的预测精度. 在间歇风场中, 由于风速会突然降至接近零的情况, 导致 MAPE 和 ARE 的计算分母为零, 因此这两种误差评估指标在这种情况下不适用, 并未记录. 通过使用不同的风况和误差衡量指标来定量和全面地评估 POP 算法的预测准确性. 这些结果进一步验证在线风场预测模型 POP 算法的有效性和精确性, 表明 POP 算法在复杂多变的风场环境中具有优越的预测能力.

    表 3  间歇风风场算法性能比较
    Table 3  Performance comparison of intermittent wind field algorithms
    评估指标 cnn-lstm[40] vmd-am-lstm vmd-cnn[41] vmd-cnn-lstm[42] vmd-gru[43] vmd-lstm vmd-tcn-lstm POP(ours)
    MSE Mean 0.0482 0.0514 0.0521 0.0465 0.0530 0.0497 0.0514 $ \boldsymbol{0.0449} $
    MSE SD 0.0220 0.0270 0.0340 0.0180 0.0230 0.0230 0.0180 $ \boldsymbol{0.0140} $
    MAE Mean 0.1756 0.1723 0.1747 0.1712 0.1748 0.1703 0.1703 $ \boldsymbol{0.1574} $
    MAE SD 0.0461 0.0531 0.0569 0.0436 0.0448 0.0448 0.0375 $ \boldsymbol{0.0239} $
    RMSE Mean 0.2242 0.2185 0.2216 0.2191 0.2256 0.2154 0.2231 $ \boldsymbol{0.1944} $
    RMSE SD 0.0470 0.0500 0.0540 0.0440 0.0450 0.0450 0.0410 $ \boldsymbol{0.0150} $
    MAPE Mean 66.3108 73.1167 72.4697 68.6956 64.5798 72.0918 59.1044 $ \boldsymbol{58.5418} $
    MAPE SD 25.3310 32.7970 33.5990 26.1330 24.6470 30.2890 14.5280 $ \boldsymbol{12.1350} $
    MAXE Mean 0.7156 $ \boldsymbol{0.6843} $ 0.6926 0.7032 0.7355 0.7039 0.7537 0.6882
    MAXE SD 0.1130 0.0970 0.1060 0.1070 0.1080 0.1040 0.1014 $ \boldsymbol{0.0630} $
    ARE Mean 0.6843 0.7312 0.7343 0.6870 0.6744 0.7209 0.5910 $ \boldsymbol{0.5854} $
    ARE SD 0.2620 0.3280 0.3480 0.2610 0.2550 0.3030 0.1452 $ \boldsymbol{0.1210} $
    表 4  湍流风风场算法性能比较
    Table 4  Turbulent wind field algorithms
    评估指标 cnn-lstm[40] vmd-am-lstm vmd-cnn[41] vmd-cnn-lstm[42] vmd-gru[43] vmd-lstm vmd-tcn-lstm POP(ours)
    MSE Mean 0.2799 0.2974 0.2760 0.2741 0.2943 0.2966 0.2966 $ \boldsymbol{0.2494} $
    MSE SD 0.0950 0.0720 0.0740 0.0820 0.0720 0.0770 0.0770 $ \boldsymbol{0.0690} $
    MAE Mean 0.3963 0.4064 0.4072 0.3965 0.4083 0.4117 0.4038 $ \boldsymbol{0.3482} $
    MAE SD 0.0605 0.0581 0.0611 0.0618 0.0616 0.0622 0.0631 $ \boldsymbol{0.0125} $
    RMSE Mean 0.5215 0.5283 0.5334 0.5085 0.5346 0.5337 0.5337 $ \boldsymbol{0.4742} $
    RMSE SD 0.0890 0.0670 0.0720 0.0780 0.0720 0.0730 0.0680 $ \boldsymbol{0.0460} $
    MAPE Mean
    MAPE SD
    MAXE Mean 0.8268 0.8610 0.8482 0.8261 0.8760 0.8774 0.8700 $ \boldsymbol{0.7614} $
    MAXE SD 0.1420 0.1240 0.1310 0.1440 0.1230 0.1300 0.1080 $ \boldsymbol{0.0710} $
    ARE Mean
    ARE SD

    为验证多变海风环境下整体 SymTAL-POP 姿态稳定方法的性能优势, 本研究设计一个实验, 通过量化评估不同控制算法在 0 km/h、15 km/h、30 km/h 和 45 km/h 恒定风速下的飞行轨迹跟踪性能. 实验中, 无人机飞行时间为 2 min, 完成 6 圈闭环绕圈飞行. 本文选取非线性跟踪控制 N-T[44]、自适应非线性预测控制 N-MPC[45]、增量非线性动力学反演控制 INDI[46]、L1 自适应控制 L1-A[47] 以及本文 SymTAL-POP 控制算法 S-P 进行性能比较.

    首先对各算法的轨迹跟踪误差 RMS 进行分析, 如图7所示. 定量结果表明, 随着风速的增加, 各种控制算法所受的气动力变化更加明显, 其性能也更易受到风速的影响. 在所有测试的控制算法中, 随着风速增加、气动力变化增大, 非线性 N-T 算法性能大幅下降. L1-A、N-MPC 和 INDI 在风速低于 12 km/h 时相比 N-T 有所改善, 但是在风速达到 20 km/h 时, L1-A 抗风控制算法的 RMS 误差高达 17, 说明相较于 N-MPC 和 INDI 算法个别点误差较大, 飞行控制逐渐不稳定. 此时 S-P 算法在该风速下的误差仅为 4.5, 相比 N-MPC 算法提高 55%. 在最具挑战的 45 km/h 的风速下, S-P 算法的误差仍然保持在 10 以下. 此外, 测量得到该方法的执行时间和通信延迟在 15 至 30 ms 之间. 这证明本算法相较于其他算法在不同风况条件下仍能保持很好的稳定性, 同时实现快速响应.

    图 7  抗风算法跟踪误差
    Fig. 7  Wind resistance algorithm tracking error

    各种算法的平均位置误差 MEAN 如表5所示, 单位为厘米. 在 15 km/h 的风速下, N-MPC、INDI 和 S-P 相较其他算法平均位置误差较小; 但当风速达到 30 km/h 时, N-MPC 和 INDI 误差逐渐增大, 此时 S-P 的平均跟踪误差比 INDI 算法降低 15.6%. 当风速达到最大 45 km/h 时, S-P 平均跟踪误差仍低于 10 cm, 相较于其他算法稳定性较好. 轨迹跟踪实验中, 均方根误差 RMS 和平均位置误差 MEAN 都得出相同的结论: 自适应控制算法大大优于依赖积分控制的非线性跟踪控制, 而在自适应控制算法中 SymTAL-POP 算法优于其他最佳控制算法, 性能提升 23.5%.

    表 5  抗风算法的平均位置误差 (cm)
    Table 5  The average position error of wind resistance algorithm (cm)
    风速 (km/h) N-T N-MPC INDI L1-A S-P
    0 10.8 4.5 6.8 4.2 2.9
    15 13.6 7.6 8.1 11.1 4.1
    30 22.6 11.3 10.3 21.4 8.7
    45 34.7 16.7 12.6 28.6 8.9

    为便于分析, 如表6所示, 本文采用三个等级来评估算法的抗风性能: 优、中、差, 以误差值 9 cm 和 15 cm 作为区分这些等级的分界点[48]. 在无风条件下, 除 N-T 控制算法外, 所有算法的误差都低于 9 cm, 因此被评定为优等级. 然而, 随着风速增加到微风级别, N-T 算法的误差增加到 15.4 cm, 降级为差等级. L1-A 算法的性能也有所下降, 误差达到 12.1 cm, 被评定为中等级. 其他算法尽管风速增加, 但误差依然保持在 9 cm 以下, 显示出良好的控制能力. 当风速进一步增加到劲风水平时, 除本研究提出的 S-P 算法外, 其他算法的控制性能均有所下降. L1-A 算法的误差上升至 22.7 cm, 评级为差等级. N-MPC 和 INDI 算法的误差分别为 11.6 cm 和 10.9 cm, 评级为中等级. 在强风条件下, 所有算法的误差都超过 10 cm, 性能都降至中等水平以下. 但是, S-P 算法相比 INDI 算法误差低 23.6%, 这进一步验证本文算法在飞行控制方面的显著优势.

    表 6  抗风算法可控范围风速等级
    Table 6  Wind resistance algorithm controllable range wind speed level
    方法 无风 (0~1 级) 微风 (2~3 级) 劲风 (4~5 级) 强风 (>5 级)
    N-T
    N-MPC
    INDI
    L1-A
    S-P

    本文引入轨迹跟踪偏移误差阈值: 一旦飞行误差超过阈值, 则需调整飞行高度, 确保在稳定可控的风场下正常执行任务. 考虑到旋翼无人机的最大上升速率, 将阈值设定为 5 m. 根据海拔高度对风速的影响, 使用风速随高度变化的关系进行计算[49]:

    $$ \begin{aligned} V(z)=V_0\left(\frac{z}{z_0}\right)^\alpha \end{aligned} $$ (33)

    其中$ V(z) $是高度$ z $处的风速; $ V_0 $是参考高度$ z_0 $处的风速; $ z $是观测高度; $ \alpha $是幂指数, 描述风速随高度变化的速率. 已知的参数是$ \alpha=0.18 $, 并且通过在线风场数据计算得出参考高度$ z_0=50 $ m 处的平均风速$ V_0=8.23 $ m/s. 不同海拔高度的风速结果如表7所示. 结合 4.4 节实验中的轨迹偏移误差值与风速信息, 得出结论: 当飞行高度超过 60 m 时, 无人机无法稳定执行任务, 为确保稳定性, 必须降低飞行高度; 当飞行高度低于 60 m 且该高度的风速不超过 8.5 m/s 时, 无人机无需进行高度调整.
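    式 (33) 的风速幂律可用如下示意代码计算, 其结果与表 7 中的数值一致 (Python; 参数取正文给出的值):

```python
def wind_speed_at(z, v0=8.23, z0=50.0, alpha=0.18):
    """按式(33)计算高度 z (m) 处的风速 (m/s), 参数取自正文."""
    return v0 * (z / z0) ** alpha

# 用法示例: 复现表 7 中的若干高度
for z in (10, 60, 100, 120):
    print(z, round(wind_speed_at(z), 2))   # 约 6.16, 8.50, 9.32, 9.63
```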

    表 7  高度与风速的关系
    Table 7  Relationship between height and wind speed
    高度 $ (\text{m}) $ 风速 $ (\text{m} / \text{s}) $
    10 6.16
    20 6.97
    30 7.50
    50 8.23
    60 8.50
    70 8.74
    80 8.95
    90 9.14
    100 9.32
    110 9.48
    120 9.63

    本文介绍一种具有抗风特性的域对抗自适应飞行控制方法 SymTAL-POP, 结合离线对称式时序域对抗学习算法与在线风场预测技术, 提升无人机在复杂海上风况中的稳定性和适应性.

    论文的主要贡献包括:

    1) 提出 SymTAL-POP 无人机飞行控制方法, 其中包括离线域对抗自适应学习和在线风场预测两个关键部分, 旨在增强无人机在复杂海洋环境中的稳定性和抗风性能.

    2) 设计离线 SymTAL 算法, 结合域对抗学习、对称性网络和双向时序网络, 以提取气动共享特征并提升模型的学习能力.

    3) 设计在线风场预测模型 POP, 使用变分模态分解技术处理风场速度信号, 并根据风信号特性选择模型池中的 GRU-TCN 或 SVR 进行风速预测.

    4) 通过实验验证, SymTAL-POP 算法较其他算法在闭环轨迹跟踪性能上有更好的表现.

    未来将继续优化该方法, 重点研究变风速和变风向条件下如何实现旋翼无人机的姿态稳定控制, 并将该方法应用到实际无人机系统中.

  • 图  1  在线Q学习算法结构图

    Fig.  1  The architecture of the online Q-learning algorithm

    图  2  确定的值迭代Q学习算法结构图

    Fig.  2  The architecture of the deterministic value iteration-based Q-learning algorithm

    图  3  确定的策略迭代Q学习算法结构图

    Fig.  3  The architecture of the deterministic policy iteration-based Q-learning algorithm

  • [1] 张化光, 张欣, 罗艳红, 杨珺. 自适应动态规划综述. 自动化学报, 2013, 39(4): 303−311 doi: 10.1016/S1874-1029(13)60031-2

    Zhang Hua-Guang, Zhang Xin, Luo Yan-Hong, Yang Jun. An overview of research on adaptive dynamic programming. Acta Automatica Sinica, 2013, 39(4): 303−311 doi: 10.1016/S1874-1029(13)60031-2
    [2] Lewis F L, Vrabie D, Vamvoudakis K G. Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Systems Magazine, 2012, 32(6): 76−105 doi: 10.1109/MCS.2012.2214134
    [3] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
    [4] Werbos P J. Approximate dynamic programming for real-time control and neural modeling. In: Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. New York, USA: Van Nostrand Reinhold, 1992.
    [5] 刘德荣, 李宏亮, 王鼎. 基于数据的自学习优化控制: 研究进展与展望. 自动化学报, 2013, 39(11): 1858−1870 doi: 10.3724/SP.J.1004.2013.01858

    Liu De-Rong, Li Hong-Liang, Wang Ding. Data-based self-learning optimal control: research progress and prospects. Acta Automatica Sinica, 2013, 39(11): 1858−1870 doi: 10.3724/SP.J.1004.2013.01858
    [6] Mao R Q, Cui R X, Chen C L P. Broad learning with reinforcement learning signal feedback: Theory and applications. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(7): 2952−2964 doi: 10.1109/TNNLS.2020.3047941
    [7] Shi Y X, Hu Q L, Li D Y, Lv M L. Adaptive optimal tracking control for spacecraft formation flying with event-triggered input. IEEE Transactions on Industrial Informatics, 2023, 19(5): 6418−6428 doi: 10.1109/TII.2022.3181067
    [8] 孙长银, 穆朝絮. 多智能体深度强化学习的若干关键科学问题. 自动化学报, 2020, 46(7): 1301−1312

    Sun Chang-Yin, Mu Chao-Xu. Important scientific problems of multi-agent deep reinforcement learning. Acta Automatica Sinica, 2020, 46(7): 1301−1312
    [9] Wei Q L, Liao Z H, Shi G. Generalized actor-critic learning optimal control in smart home energy management. IEEE Transactions on Industrial Informatics, 2021, 17(10): 6614−6623 doi: 10.1109/TII.2020.3042631
    [10] 王鼎, 赵明明, 哈明鸣, 乔俊飞. 基于折扣广义值迭代的智能最优跟踪及应用验证. 自动化学报, 2022, 48(1): 182−193

    Wang Ding, Zhao Ming-Ming, Ha Ming-Ming, Qiao Jun-Fei. Intelligent optimal tracking with application verifications via discounted generalized value iteration. Acta Automatica Sinica, 2022, 48(1): 182−193
    [11] Sun J Y, Dai J, Zhang H G, Yu S H, Xu S, Wang J J. Neural-network-based immune optimization regulation using adaptive dynamic programming. IEEE Transactions on Cybernetics, 2023, 53(3): 1944−1953 doi: 10.1109/TCYB.2022.3179302
    [12] Liu D R, Ha M M, Xue S. State of the art of adaptive dynamic programming and reinforcement learning. CAAI Artificial Intelligence Research, 2022, 1(2): 93−110 doi: 10.26599/AIR.2022.9150007
    [13] Wang D, Gao N, Liu D R, Li J N, Lewis F L. Recent progress in reinforcement learning and adaptive dynamic programming for advanced control applications. IEEE/CAA Journal of Automatica Sinica, 2024, 11(1): 18−36 doi: 10.1109/JAS.2023.123843
    [14] 王鼎, 赵明明, 哈明鸣, 任进. 智能控制与强化学习: 先进值迭代评判设计. 北京: 人民邮电出版社, 2024.

    Wang Ding, Zhao Ming-Ming, Ha Ming-Ming, Ren Jin. Intelligent Control and Reinforcement Learning: Advanced Value Iteration Critic Design. Beijing: Posts and Telecommunications Press, 2024.
    [15] 孙景亮, 刘春生. 基于自适应动态规划的导弹制导律研究综述. 自动化学报, 2017, 43(7): 1101−1113

    Sun Jing-Liang, Liu Chun-Sheng. An overview on the adaptive dynamic programming based missile guidance law. Acta Automatica Sinica, 2017, 43(7): 1101−1113
    [16] Zhao M M, Wang D, Qiao J F, Ha M M, Ren J. Advanced value iteration for discrete-time intelligent critic control: A survey. Artificial Intelligence Review, 2023, 56: 12315−12346 doi: 10.1007/s10462-023-10497-1
    [17] Wang D, Ha M M, Zhao M M. The intelligent critic framework for advanced optimal control. Artificial Intelligence Review, 2022, 55(1): 1−22 doi: 10.1007/s10462-021-10118-9
    [18] Al-Tamimi A, Lewis F L, Abu-Khalaf M. Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof. IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics, 2008, 38(4): 943−949 doi: 10.1109/TSMCB.2008.926614
    [19] Li H L, Liu D R. Optimal control for discrete-time affine non-linear systems using general value iteration. IET Control Theory and Applications, 2012, 6(18): 2725−2736 doi: 10.1049/iet-cta.2011.0783
    [20] Wei Q L, Liu D R, Lin H Q. Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems. IEEE Transactions on Cybernetics, 2016, 46(3): 840−853 doi: 10.1109/TCYB.2015.2492242
    [21] Wang D, Zhao M M, Ha M M, Qiao J F. Stability and admissibility analysis for zero-sum games under general value iteration formulation. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(11): 8707−8718 doi: 10.1109/TNNLS.2022.3152268
    [22] Wang D, Ren J, Ha M M, Qiao J F. System stability of learning-based linear optimal control with general discounted value iteration. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(9): 6504−6514 doi: 10.1109/TNNLS.2021.3137524
    [23] Heydari A. Stability analysis of optimal adaptive control under value iteration using a stabilizing initial policy. IEEE Transactions on Neural Networks and Learning Syetems, 2018, 29(9): 4522−4527 doi: 10.1109/TNNLS.2017.2755501
    [24] Wei Q L, Lewis F L, Liu D R, Song R Z, Lin H Q. Discrete-time local value iteration adaptive dynamic programming: Convergence analysis. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018, 48(6): 875−891 doi: 10.1109/TSMC.2016.2623766
    [25] Zhao M M, Wang D, Ha M M, Qiao J F. Evolving and incremental value iteration schemes for nonlinear discrete-time zero-sum games. IEEE Transactions on Cybernetics, 2023, 53(7): 4487−4499 doi: 10.1109/TCYB.2022.3198078
    [26] Ha M M, Wang D, Liu D R. Neural-network-based discounted optimal control via an integrated value iteration with accuracy guarantee. Neural Networks, 2021, 144: 176−186 doi: 10.1016/j.neunet.2021.08.025
    [27] Luo B, Liu D R, Huang T W, Yang X, Ma H W. Multi-step heuristic dynamic programming for optimal control of nonlinear discrete-time systems. Information Sciences, 2017, 411: 66−83 doi: 10.1016/j.ins.2017.05.005
    [28] Wang D, Wang J Y, Zhao M M, Xin P, Qiao J F. Adaptive multi-step evaluation design with stability guarantee for discrete-time optimal learning control. IEEE/CAA Journal of Automatica Sinica, 2023, 10(9): 1797−1809 doi: 10.1109/JAS.2023.123684
    [29] Rao J, Wang J C, Xu J H, Zhao S W. Optimal control of nonlinear system based on deterministic policy gradient with eligibility traces. Nonlinear Dynamics, 2023, 111: 20041−20053 doi: 10.1007/s11071-023-08909-6
    [30] Yu L Y, Liu W B, Liu Y R, Alsaadi F E. Learning-based T-sHDP(λ) for optimal control of a class of nonlinear discrete-time systems. International Journal of Robust and Nonlinear Control, 2022, 32(5): 2624−2643 doi: 10.1002/rnc.5847
    [31] Al-Dabooni S, Wunsch D. An improved N-step value gradient learning adaptive dynamic programming algorithm for online learning. IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(4): 1155−1169 doi: 10.1109/TNNLS.2019.2919338
    [32] Wang J Y, Wang D, Li X, Qiao J F. Dichotomy value iteration with parallel learning design towards discrete-time zero-sum games. Neural Networks, 2023, 167: 751−762 doi: 10.1016/j.neunet.2023.09.009
    [33] Wei Q L, Wang L X, Lu J W, Wang F Y. Discrete-time self-learning parallel control. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2022, 52(1): 192−204 doi: 10.1109/TSMC.2020.2995646
    [34] Ha M M, Wang D, Liu D R. A novel value iteration scheme with adjustable convergence rate. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(10): 7430−7442 doi: 10.1109/TNNLS.2022.3143527
    [35] Ha M M, Wang D, Liu D R. Novel discounted adaptive critic control designs with accelerated learning formulation. IEEE Transactions on Cybernetics, 2024, 54(5): 3003−3016 doi: 10.1109/TCYB.2022.3233593
    [36] Wang D, Huang H M, Liu D R, Zhao M M, Qiao J F. Evolution-guided adaptive dynamic programming for nonlinear optimal control. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(10): 6043−6054 doi: 10.1109/TSMC.2024.3417230
    [37] Liu D R, Wei Q L. Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25(3): 621−634 doi: 10.1109/TNNLS.2013.2281663
    [38] Liu D R, Wei Q L. Generalized policy iteration adaptive dynamic programming for discrete-time nonlinear systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2015, 45(12): 1577−1591 doi: 10.1109/TSMC.2015.2417510
    [39] Liang M M, Wang D, Liu D R. Neuro-optimal control for discrete stochastic processes via a novel policy iteration algorithm. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(11): 3972−3985 doi: 10.1109/TSMC.2019.2907991
    [40] Luo B, Yang Y, Wu H N, Huang T W. Balancing value iteration and policy iteration for discrete-time control. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(11): 3948−3958 doi: 10.1109/TSMC.2019.2898389
    [41] Li T, Wei Q L, Wang F Y. Multistep look-ahead policy iteration for optimal control of discrete-time nonlinear systems with isoperimetric constraints. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(3): 1414−1426 doi: 10.1109/TSMC.2023.3327492
    [42] Yang Y L, Kiumarsi B, Modares H, Xu C Z. Model-free λ-policy iteration for discrete-time linear quadratic regulation. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(2): 635−649 doi: 10.1109/TNNLS.2021.3098985
    [43] Huang H M, Wang D, Wang H, Wu J L, Zhao M M. Novel generalized policy iteration for efficient evolving control of nonlinear systems. Neurocomputing, 2024, 608: Article No. 128418 doi: 10.1016/j.neucom.2024.128418
    [44] Dierks T, Jagannathan S. Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Transactions on Neural Networks and Learning Systems, 2012, 23(7): 1118−1129 doi: 10.1109/TNNLS.2012.2196708
    [45] Wang D, Xin P, Zhao M M, Qiao J F. Intelligent optimal control of constrained nonlinear systems via receding-horizon heuristic dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(1): 287−299 doi: 10.1109/TSMC.2023.3306338
    [46] Moghadam R, Natarajan P, Jagannathan S. Online optimal adaptive control of partially uncertain nonlinear discrete-time systems using multilayer neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(9): 4840−4850 doi: 10.1109/TNNLS.2021.3061414
    [47] Zhang H G, Qin C B, Jiang B, Luo Y H. Online adaptive policy learning algorithm for H∞ state feedback control of unknown affine nonlinear discrete-time systems. IEEE Transactions on Cybernetics, 2014, 44(12): 2706−2718 doi: 10.1109/TCYB.2014.2313915
    [48] Ming Z Y, Zhang H G, Yan Y Q, Zhang J. Tracking control of discrete-time system with dynamic event-based adaptive dynamic programming. IEEE Transactions on Circuits and Systems II: Express Briefs, 2022, 69(8): 3570−3574
    [49] 罗彪, 欧阳志华, 易昕宁, 刘德荣. 基于自适应动态规划的移动机器人视觉伺服跟踪控制. 自动化学报, 2023, 49(11): 2286−2296

    Luo Biao, Ouyang Zhi-Hua, Yi Xin-Ning, Liu De-Rong. Adaptive dynamic programming based visual servoing tracking control for mobile robots. Acta Automatica Sinica, 2023, 49(11): 2286−2296
    [50] Ha M M, Wang D, Liu D R. Discounted iterative adaptive critic designs with novel stability analysis for tracking control. IEEE/CAA Journal of Automatica Sinica, 2022, 9(7): 1262−1272 doi: 10.1109/JAS.2022.105692
    [51] Dong L, Zhong X N, Sun C Y, He H B. Adaptive event-triggered control based on heuristic dynamic programming for nonlinear discrete-time systems. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(7): 1594−1605 doi: 10.1109/TNNLS.2016.2541020
    [52] Wang D, Hu L Z, Zhao M M, Qiao J F. Dual event-triggered constrained control through adaptive critic for discrete-time zero-sum games. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(3): 1584−1595 doi: 10.1109/TSMC.2022.3201671
    [53] Yang X, Wang D. Reinforcement learning for robust dynamic event-driven constrained control. IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2024.3394251
    [54] 王鼎. 基于学习的鲁棒自适应评判控制研究进展. 自动化学报, 2019, 45(6): 1031−1043

    Wang Ding. Research progress on learning-based robust adaptive critic control. Acta Automatica Sinica, 2019, 45(6): 1031−1043
    [55] Ren H, Jiang B, Ma Y J. Zero-sum differential game-based fault-tolerant control for a class of affine nonlinear systems. IEEE Transactions on Cybernetics, 2024, 54(2): 1272−1282 doi: 10.1109/TCYB.2022.3215716
    [56] Zhang S C, Zhao B, Liu D R, Zhang Y W. Event-triggered decentralized integral sliding mode control for input-constrained nonlinear large-scale systems with actuator failures. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(3): 1914−1925 doi: 10.1109/TSMC.2023.3331150
    [57] Wei Q L, Zhu L, Song R Z, Zhang P J, Liu D R, Xiao J. Model-free adaptive optimal control for unknown nonlinear multiplayer nonzero-sum game. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(2): 879−892 doi: 10.1109/TNNLS.2020.3030127
    [58] Ye J, Bian Y G, Luo B, Hu M J, Xu B, Ding R. Costate-supplement ADP for model-free optimal control of discrete-time nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(1): 45−59 doi: 10.1109/TNNLS.2022.3172126
    [59] Li Y Q, Yang C Z, Hou Z S, Feng Y J, Yin C K. Data-driven approximate Q-learning stabilization with optimality error bound analysis. Automatica, 2019, 103: 435−442
    [60] Al-Dabooni S, Wunsch D C. Online model-free n-step HDP with stability analysis. IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(4): 1255−1269 doi: 10.1109/TNNLS.2019.2919614
    [61] Ni Z, He H B, Zhong X N, Prokhorov D V. Model-free dual heuristic dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(8): 1834−1839 doi: 10.1109/TNNLS.2015.2424971
    [62] Wang D, Ha M M, Qiao J F. Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Transactions on Automatic Control, 2020, 65(3): 1272−1279 doi: 10.1109/TAC.2019.2926167
    [63] Wang D, Ha M M, Qiao J F. Data-driven iterative adaptive critic control toward an urban wastewater treatment plant. IEEE Transactions on Industrial Electronics, 2021, 68(8): 7362−7369 doi: 10.1109/TIE.2020.3001840
    [64] Wang D, Hu L Z, Zhao M M, Qiao J F. Adaptive critic for event-triggered unknown nonlinear optimal tracking design with wastewater treatment applications. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(9): 6276−6288 doi: 10.1109/TNNLS.2021.3135405
    [65] Zhu L, Wei Q L, Guo P. Synergetic learning neuro-control for unknown affine nonlinear systems with asymptotic stability guarantees. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(2): 3479−3489 doi: 10.1109/TNNLS.2023.3347663
    [66] Pang B, Jiang Z P. Adaptive optimal control of linear periodic systems: An off-policy value iteration approach. IEEE Transactions on Automatic Control, 2021, 66(2): 888−894 doi: 10.1109/TAC.2020.2987313
    [67] Xu Y S, Zhao Z G, Yin S. Performance optimization and fault-tolerance of highly dynamic systems via Q-learning with an incrementally attached controller gain system. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(11): 9128−9138 doi: 10.1109/TNNLS.2022.3155876
    [68] Yang X, Xu M M, Wei Q L. Adaptive dynamic programming for nonlinear-constrained H∞ control. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(7): 4393−4403 doi: 10.1109/TSMC.2023.3247888
    [69] Werbos P. Neural networks for control and system identification. In: Proceedings of the 28th IEEE Conference on Decision and Control. 1989. 260−265
    [70] Prokhorov D V, Wunsch D C. Adaptive critic designs. IEEE Transactions on Neural Networks, 1997, 8(5): 997−1007 doi: 10.1109/72.623201
    [71] Watkins C. Learning from Delayed Rewards [Ph.D. dissertation], King's College, University of Cambridge, 1989.
    [72] Al-Tamimi A, Lewis F L, Abu-Khalaf M. Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica, 2007, 43(3): 473−481 doi: 10.1016/j.automatica.2006.09.019
    [73] Kiumarsi B, Lewis F L, Modares H, Karimpour A, Naghibi-Sistani M. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica, 2014, 50(4): 1167−1175 doi: 10.1016/j.automatica.2014.02.015
    [74] Jiang Y, Jiang Z P. Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 2012, 48(10): 2699−2704 doi: 10.1016/j.automatica.2012.06.096
    [75] Kiumarsi B, Lewis F L, Jiang Z P. H∞ control of linear discrete-time systems: Off-policy reinforcement learning. Automatica, 2017, 78: 144−152 doi: 10.1016/j.automatica.2016.12.009
    [76] Farjadnasab M, Babazadeh M. Model-free LQR design by Q-function learning. Automatica, 2022, 137: Article No. 110060 doi: 10.1016/j.automatica.2021.110060
    [77] Lopez V G, Alsalti M, Müller M A. Efficient off-policy Q-learning for data-based discrete-time LQR problems. IEEE Transactions on Automatic Control, 2023, 68(5): 2922−2933 doi: 10.1109/TAC.2023.3235967
    [78] Nguyen H, Dang H B, Dao P N. On-policy and off-policy Q-learning strategies for spacecraft systems: An approach for time-varying discrete-time without controllability assumption of augmented system. Aerospace Science and Technology, 2024, 146: Article No. 108972 doi: 10.1016/j.ast.2024.108972
    [79] Skach J, Kiumarsi B, Lewis F L, Straka O. Actor-critic off-policy learning for optimal control of multiple-model discrete-time systems. IEEE Transactions on Cybernetics, 2018, 48(1): 29−40 doi: 10.1109/TCYB.2016.2618926
    [80] Wen Y L, Zhang H G, Ren H, Zhang K. Off-policy based adaptive dynamic programming method for nonzero-sum games on discrete-time system. Journal of the Franklin Institute, 2020, 357(12): 8059−8081 doi: 10.1016/j.jfranklin.2020.05.038
    [81] Xu Y, Wu Z G. Data-efficient off-policy learning for distributed optimal tracking control of HMAS with unidentified exosystem dynamics. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(3): 3181−3190 doi: 10.1109/TNNLS.2022.3172130
    [82] Cui L L, Pang B, Jiang Z P. Learning-based adaptive optimal control of linear time-delay systems: A policy iteration approach. IEEE Transactions on Automatic Control, 2024, 69(1): 629−636 doi: 10.1109/TAC.2023.3273786
    [83] Amirparast A, Sani S K H. Off-policy reinforcement learning algorithm for robust optimal control of uncertain nonlinear systems. International Journal of Robust and Nonlinear Control, 2024, 34(8): 5419−5437 doi: 10.1002/rnc.7278
    [84] Qasem O, Gao W N, Vamvoudakis K G. Adaptive optimal control of continuous-time nonlinear affine systems via hybrid iteration. Automatica, 2023, 157: Article No. 111261 doi: 10.1016/j.automatica.2023.111261
    [85] Jiang H Y, Zhou B, Duan G R. Modified λ-policy iteration based adaptive dynamic programming for unknown discrete-time linear systems. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(3): 3291−3301 doi: 10.1109/TNNLS.2023.3244934
    [86] Zhao J G, Yang C Y, Gao W N, Park J H. Novel single-loop policy iteration for linear zero-sum games. Automatica, 2024, 163: Article No. 111551 doi: 10.1016/j.automatica.2024.111551
    [87] 肖振飞, 李金娜. 基于非策略Q学习方法的两个个体优化控制. 控制工程, 2022, 29(10): 1874−1880

    Xiao Zhen-Fei, Li Jin-Na. Two-player optimization control based on off-policy Q-learning algorithm. Control Engineering of China, 2022, 29(10): 1874−1880
    [88] Liu Y, Zhang H G, Yu R, Xing Z X. H∞ tracking control of discrete-time system with delays via data-based adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(11): 4078−4085 doi: 10.1109/TSMC.2019.2946397
    [89] Zhang H G, Liu Y, Xiao G Y, Jiang H. Data-based adaptive dynamic programming for a class of discrete-time systems with multiple delays. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(2): 432−441 doi: 10.1109/TSMC.2017.2758849
    [90] Tan X F, Li Y, Liu Y. Stochastic linear quadratic optimal tracking control for discrete-time systems with delays based on Q-learning algorithm. AIMS Mathematics, 2023, 8(5): 10249−10265 doi: 10.3934/math.2023519
    [91] Zhang L L, Zhang H G, Sun J Y, Yue X. ADP-based fault-tolerant control for multiagent systems with semi-Markovian jump parameters. IEEE Transactions on Cybernetics, 2024, 54(10): 5952−5962 doi: 10.1109/TCYB.2024.3411310
    [92] Li Y, Zhang H, Wang Z P, Huang C, Yan H C. Data-driven decentralized control for large-scale systems with sparsity and communication delays. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(9): 5614−5624 doi: 10.1109/TSMC.2023.3274292
    [93] Shen X Y, Li X J. Data-driven output-feedback LQ secure control for unknown cyber-physical systems against sparse actuator attacks. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2021, 51(9): 5708−5720 doi: 10.1109/TSMC.2019.2957146
    [94] Qasem O, Davari M, Gao W N, Kirk D R, Chai T Y. Hybrid iteration ADP algorithm to solve cooperative, optimal output regulation problem for continuous-time, linear, multiagent systems: Theory and application in islanded modern microgrids with IBRs. IEEE Transactions on Industrial Electronics, 2024, 71(1): 834−845 doi: 10.1109/TIE.2023.3247734
    [95] Zhang H G, Liang H J, Wang Z S, Feng T. Optimal output regulation for heterogeneous multiagent systems via adaptive dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(1): 18−29 doi: 10.1109/TNNLS.2015.2499757
    [96] Wang W, Chen X. Model-free optimal containment control of multi-agent systems based on actor-critic framework. Neurocomputing, 2018, 314(7): 242−250
    [97] Cui L L, Wang S, Zhang J F, Zhang D S, Lai J, Zheng Y, Zhang Z Y, Jiang Z P. Learning-based balance control of wheel-legged robots. IEEE Robotics and Automation Letters, 2021, 6(4): 7667−7674 doi: 10.1109/LRA.2021.3100269
    [98] Liu T, Cui L L, Pang B, Jiang Z P. A unified framework for data-driven optimal control of connected vehicles in mixed traffic. IEEE Transactions on Intelligent Vehicles, 2023, 8(8): 4131−4145 doi: 10.1109/TIV.2023.3287131
    [99] Davari M, Gao W N, Aghazadeh A, Blaabjerg F, Lewis F L. An optimal synchronization control method of PLL utilizing adaptive dynamic programming to synchronize inverter-based resources with unbalanced, low-inertia, and very weak grids. IEEE Transactions on Automation Science and Engineering, 2025, 22: 24−42 doi: 10.1109/TASE.2023.3329479
    [100] Wang Z Y, Wang Y Q, Davari M, Blaabjerg F. An effective PQ-decoupling control scheme using adaptive dynamic programming approach to reducing oscillations of virtual synchronous generators for grid connection with different impedance types. IEEE Transactions on Industrial Electronics, 2024, 71(4): 3763−3775 doi: 10.1109/TIE.2023.3279564
    [101] Si J, Wang Y T. Online learning control by association and reinforcement. IEEE Transactions on Neural Networks, 2001, 12(2): 264−276 doi: 10.1109/72.914523
    [102] Liu F, Sun J, Si J, Guo W T, Mei S W. A boundedness result for the direct heuristic dynamic programming. Neural Networks, 2012, 32: 229−235 doi: 10.1016/j.neunet.2012.02.005
    [103] Sokolov Y, Kozma R, Werbos L D, Werbos P J. Complete stability analysis of a heuristic approximate dynamic programming control design. Automatica, 2015, 59: 9−18 doi: 10.1016/j.automatica.2015.06.001
    [104] Malla N, Ni Z. A new history experience replay design for model-free adaptive dynamic programming. Neurocomputing, 2017, 266(29): 141−149
    [105] Luo B, Wu H N, Huang T W, Liu D R. Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design. Automatica, 2014, 50(12): 3281−3290 doi: 10.1016/j.automatica.2014.10.056
    [106] Zhao D B, Xia Z P, Wang D. Model-free optimal control for affine nonlinear systems with convergence analysis. IEEE Transactions on Automation Science and Engineering, 2015, 12(4): 1461−1468 doi: 10.1109/TASE.2014.2348991
    [107] Xu J H, Wang J C, Rao J, Zhong Y J, Wu S Y, Sun Q F. Parallel cross entropy policy gradient adaptive dynamic programming for optimal tracking control of discrete-time nonlinear systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(6): 3809−3821 doi: 10.1109/TSMC.2024.3373456
    [108] Wei Q L, Lewis F L, Sun Q Y, Yan P F, Song R Z. Discrete-time deterministic Q-learning: A novel convergence analysis. IEEE Transactions on Cybernetics, 2017, 47(5): 1224−1237 doi: 10.1109/TCYB.2016.2542923
    [109] 王鼎, 王将宇, 乔俊飞. 融合自适应评判的随机系统数据驱动策略优化. 自动化学报, 2024, 50(5): 980−990

    Wang Ding, Wang Jiang-Yu, Qiao Jun-Fei. Data-driven policy optimization for stochastic systems involving adaptive critic. Acta Automatica Sinica, 2024, 50(5): 980−990
    [110] Qiao J F, Zhao M M, Wang D, Ha M M. Adjustable iterative Q-learning schemes for model-free optimal tracking control. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(2): 1202−1213 doi: 10.1109/TSMC.2023.3324215
    [111] Ni Z, Malla N, Zhong X N. Prioritizing useful experience replay for heuristic dynamic programming-based learning systems. IEEE Transactions on Cybernetics, 2019, 49(11): 3911−3922 doi: 10.1109/TCYB.2018.2853582
    [112] Al-Dabooni S, Wunsch D. The boundedness conditions for model-free HDP(λ). IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(7): 1928−1942 doi: 10.1109/TNNLS.2018.2875870
    [113] Zhao Q T, Si J, Sun J. Online reinforcement learning control by direct heuristic dynamic programming: From time-driven to event-driven. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(8): 4139−4144 doi: 10.1109/TNNLS.2021.3053037
    [114] Wei Q L, Liao Z H, Song R Z, Zhang P J, Wang Z, Xiao J. Self-learning optimal control for ice-storage air conditioning systems via data-based adaptive dynamic programming. IEEE Transactions on Industrial Electronics, 2021, 68(4): 3599−3608 doi: 10.1109/TIE.2020.2978699
    [115] Zhao J, Wang T Y, Pedrycz W, Wang W. Granular prediction and dynamic scheduling based on adaptive dynamic programming for the blast furnace gas system. IEEE Transactions on Cybernetics, 2021, 51(4): 2201−2214 doi: 10.1109/TCYB.2019.2901268
    [116] Wang D, Li X, Zhao M M, Qiao J F. Adaptive critic control design with knowledge transfer for wastewater treatment applications. IEEE Transactions on Industrial Informatics, 2024, 20(2): 1488−1497 doi: 10.1109/TII.2023.3278875
    [117] Qiao J F, Zhao M M, Wang D, Li M H. Action-dependent heuristic dynamic programming with experience replay for wastewater treatment processes. IEEE Transactions on Industrial Informatics, 2024, 20(4): 6257−6265 doi: 10.1109/TII.2023.3344130
    [118] Luo B, Liu D R, Wu H N. Adaptive constrained optimal control design for data-based nonlinear discrete-time systems with critic-only structure. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(6): 2099−2111 doi: 10.1109/TNNLS.2017.2751018
    [119] Zhao M M, Wang D, Qiao J F. Stabilizing value iteration Q-learning for online evolving control of discrete-time nonlinear systems. Nonlinear Dynamics, 2024, 112: 9137−9153 doi: 10.1007/s11071-024-09524-9
    [120] Xiang Z R, Li P C, Zou W C, Ahn C K. Data-based optimal switching and control with admissibility guaranteed Q-learning. IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2024.3405739.
    [121] Li X F, Dong L, Xue L, Sun C Y. Hybrid reinforcement learning for optimal control of non-linear switching system. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(11): 9161−9170 doi: 10.1109/TNNLS.2022.3156287
    [122] Li J N, Chai T Y, Lewis F L, Ding Z T, Jiang Y. Off-policy interleaved Q-learning: Optimal control for affine nonlinear discrete-time systems. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(5): 1308−1320 doi: 10.1109/TNNLS.2018.2861945
    [123] Song S J, Zhao M M, Gong D W, Zhu M L. Convergence and stability analysis of value iteration Q-learning under non-discounted cost for discrete-time optimal control. Neurocomputing, 2024, 606: Article No. 128370 doi: 10.1016/j.neucom.2024.128370
    [124] Song S J, Zhu M L, Dai X L, Gong D W. Model-free optimal tracking control of nonlinear input-affine discrete-time systems via an iterative deterministic Q-learning algorithm. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(1): 999−1012 doi: 10.1109/TNNLS.2022.3178746
    [125] Wei Q L, Liu D R. A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems. Science China Information Sciences, 2015, 58(12): 1−15
    [126] Yan P F, Wang D, Li H L, Liu D R. Error bound analysis of Q-function for discounted optimal control problems with policy iteration. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2017, 47(7): 1207−1216 doi: 10.1109/TSMC.2016.2563982
    [127] Wang W, Chen X, Fu H, Wu M. Model-free distributed consensus control based on actor-critic framework for discrete-time nonlinear multiagent systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(11): 4123−4134 doi: 10.1109/TSMC.2018.2883801
    [128] Luo B, Liu D R, Wu H N, Wang D, Lewis F L. Policy gradient adaptive dynamic programming for data-based optimal control. IEEE Transactions on Cybernetics, 2017, 47(10): 3341−3354 doi: 10.1109/TCYB.2016.2623859
    [129] Zhang Y W, Zhao B, Liu D R. Deterministic policy gradient adaptive dynamic programming for model-free optimal control. Neurocomputing, 2020, 387: 40−50 doi: 10.1016/j.neucom.2019.11.032
    [130] Xu J H, Wang J C, Rao J, Zhong Y J, Zhao S W. Twin deterministic policy gradient adaptive dynamic programming for optimal control of affine nonlinear discrete-time systems. International Journal of Control, Automation, and Systems, 2022, 20(9): 3098−3109 doi: 10.1007/s12555-021-0473-6
    [131] Xu J H, Wang J C, Rao J, Wu S Y, Zhong Y J. Adaptive dynamic programming for optimal control of discrete-time nonlinear systems with trajectory-based initial control policy. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(3): 1489−1501 doi: 10.1109/TSMC.2023.3327450
    [132] Lin M D, Zhao B. Policy optimization adaptive dynamic programming for optimal control of input-affine discrete-time nonlinear systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(7): 4339−4350 doi: 10.1109/TSMC.2023.3247466
    [133] Lin M D, Zhao B, Liu D R. Policy gradient adaptive critic designs for model-free optimal tracking control with experience replay. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2022, 52(6): 3692−3703 doi: 10.1109/TSMC.2021.3071968
    [134] Luo B, Yang Y, Liu D R. Adaptive Q-learning for data-based optimal output regulation with experience replay. IEEE Transactions on Cybernetics, 2018, 48(12): 3337−3348 doi: 10.1109/TCYB.2018.2821369
    [135] Qasem O, Gutierrez H, Gao W N. Experimental validation of data-driven adaptive optimal control for continuous-time systems via hybrid iteration: An application to rotary inverted pendulum. IEEE Transactions on Industrial Electronics, 2024, 71(6): 6210−6220 doi: 10.1109/TIE.2023.3292873
    [136] 李满园, 罗飞, 顾春华, 罗勇军, 丁炜超. 基于自适应动量更新策略的Adams算法. 上海理工大学学报, 2023, 45(2): 112−119

    Li Man-Yuan, Luo Fei, Gu Chun-Hua, Luo Yong-Jun, Ding Wei-Chao. Adams algorithm based on adaptive momentum update strategy. Journal of University of Shanghai for Science and Technology, 2023, 45(2): 112−119
    [137] 姜志侠, 宋佳帅, 刘宇宁. 一种改进的自适应动量梯度下降算法. 华中科技大学学报(自然科学版), 2023, 51(5): 137−143

    Jiang Zhi-Xia, Song Jia-Shuai, Liu Yu-Ning. An improved adaptive momentum gradient descent algorithm. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2023, 51(5): 137−143
    [138] 姜文翰, 姜志侠, 孙雪莲. 一种修正学习率的梯度下降算法. 长春理工大学学报(自然科学版), 2023, 46(6): 112−120

    Jiang Wen-Han, Jiang Zhi-Xia, Sun Xue-Lian. A gradient descent algorithm with modified learning rate. Journal of Changchun University of Science and Technology (Natural Science Edition), 2023, 46(6): 112−120
    [139] Zhao B, Shi G, Liu D R. Event-triggered local control for nonlinear interconnected systems through particle swarm optimization-based adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(12): 7342−7353 doi: 10.1109/TSMC.2023.3298065
    [140] Zhang L J, Zhang K, Xie X P, Chadli M. Adaptive critic control with knowledge transfer for uncertain nonlinear dynamical systems: A reinforcement learning approach. IEEE Transactions on Automation Science and Engineering, doi: 10.1109/TASE.2024.3453926
    [141] Gao X, Si J, Huang H. Reinforcement learning control with knowledge shaping. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(3): 3156−3167 doi: 10.1109/TNNLS.2023.3243631
    [142] Gao X, Si J, Wen Y, Li M H, Huang H. Reinforcement learning control of robotic knee with human-in-the-loop by flexible policy iteration. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(10): 5873−5887 doi: 10.1109/TNNLS.2021.3071727
    [143] Guo W T, Liu F, Si J, He D W, Harley R, Mei S W. Online supplementary ADP learning controller design and application to power system frequency control with large-scale wind energy integration. IEEE Transactions on Neural Networks and Learning Systems, 2016, 27(8): 1748−1761 doi: 10.1109/TNNLS.2015.2431734
    [144] Zhao M M, Wang D, Ren J, Qiao J F. Integrated online Q-learning design for wastewater treatment processes. IEEE Transactions on Industrial Informatics, 2025, 21(2): 1833−1842 doi: 10.1109/TII.2024.3488790
    [145] Zhang H G, Wei Q L, Luo Y H. A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2008, 38(4): 937−942 doi: 10.1109/TSMCB.2008.920269
    [146] Song S J, Gong D W, Zhu M L, Zhao Y Y, Huang C. Data-driven optimal tracking control for discrete-time nonlinear systems with unknown dynamics using deterministic ADP. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(1): 1184−1198 doi: 10.1109/TNNLS.2023.3323142
    [147] Luo B, Liu D R, Huang T W, Wang D. Model-free optimal tracking control via critic-only Q-learning. IEEE Transactions on Neural Networks and Learning Systems, 2016, 27(10): 2134−2144 doi: 10.1109/TNNLS.2016.2585520
    [148] Li C, Ding J L, Lewis F L, Chai T Y. A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. Automatica, 2021, 129: Article No. 109687 doi: 10.1016/j.automatica.2021.109687
    [149] Wang D, Gao N, Ha M M, Zhao M M, Wu J L, Qiao J F. Intelligent-critic-based tracking control of discrete-time input-affine systems and approximation error analysis with application verification. IEEE Transactions on Cybernetics, 2024, 54(8): 4690−4701 doi: 10.1109/TCYB.2023.3312320
    [150] Liang Z T, Ha M M, Liu D R, Wang Y H. Stable approximate Q-learning under discounted cost for data-based adaptive tracking control. Neurocomputing, 2024, 568: Article No. 127048 doi: 10.1016/j.neucom.2023.127048
    [151] Wang Y, Wang D, Zhao M M, Liu A, Qiao J F. Adjustable iterative Q-learning for advanced neural tracking control with stability guarantee. Neurocomputing, 2024, 584: Article No. 127592 doi: 10.1016/j.neucom.2024.127592
    [152] Zhao M M, Wang D, Li M H, Gao N, Qiao J F. A new Q-function structure for model-free adaptive optimal tracking control with asymmetric constrained inputs. International Journal of Adaptive Control and Signal Processing, 2024, 38(5): 1561−1578 doi: 10.1002/acs.3761
    [153] Wang T, Wang Y J, Yang X B, Yang J. Further results on optimal tracking control for nonlinear systems with nonzero equilibrium via adaptive dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(4): 1900−1910 doi: 10.1109/TNNLS.2021.3105646
    [154] Li D D, Dong J X. Approximate optimal robust tracking control based on state error and derivative without initial admissible input. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(2): 1059−1069 doi: 10.1109/TSMC.2023.3320653
    [155] Zhang H G, Luo Y H, Liu D R. Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints. IEEE Transactions on Neural Networks, 2009, 20(9): 1490−1503 doi: 10.1109/TNN.2009.2027233
    [156] Marvi Z, Kiumarsi B. Reinforcement learning with safety and stability guarantees during exploration for linear systems. IEEE Open Journal of Control Systems, 2022, 1: 322−334 doi: 10.1109/OJCSYS.2022.3209945
    [157] Zanon M, Gros S. Safe reinforcement learning using robust MPC. IEEE Transactions on Automatic Control, 2021, 66(8): 3638−3652 doi: 10.1109/TAC.2020.3024161
    [158] Yang Y L, Vamvoudakis K G, Modares H, Yin Y X, Wunsch D C. Safe intermittent reinforcement learning with static and dynamic event generators. IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(12): 5441−5455 doi: 10.1109/TNNLS.2020.2967871
    [159] Yazdani N M, Moghaddam R K, Kiumarsi B, Modares H. A safety-certified policy iteration algorithm for control of constrained nonlinear systems. IEEE Control Systems Letters, 2020, 4(3): 686−691 doi: 10.1109/LCSYS.2020.2990632
    [160] Yang Y L, Vamvoudakis K G, Modares H. Safe reinforcement learning for dynamical games. International Journal of Robust and Nonlinear Control, 2020, 30(9): 3521−3800 doi: 10.1002/rnc.4942
    [161] Song R Z, Liu L, Xia L N, Lewis F L. Online optimal event-triggered H∞ control for nonlinear systems with constrained state and input. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(1): 131−141 doi: 10.1109/TSMC.2022.3173275
    [162] Fan B, Yang Q M, Tang X Y, Sun Y X. Robust ADP design for continuous-time nonlinear systems with output constraints. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(6): 2127−2138 doi: 10.1109/TNNLS.2018.2806347
    [163] Liu S H, Liu L J, Yu Z. Safe reinforcement learning for affine nonlinear systems with state constraints and input saturation using control barrier functions. Neurocomputing, 2023, 518: 562−576 doi: 10.1016/j.neucom.2022.11.006
    [164] Farzanegan B, Jagannathan S. Continual reinforcement learning formulation for zero-sum game-based constrained optimal tracking. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(12): 7744−7757 doi: 10.1109/TSMC.2023.3299556
    [165] Marvi Z, Kiumarsi B. Safe reinforcement learning: A control barrier function optimization approach. International Journal of Robust and Nonlinear Control, 2021, 31(6): 1923−1940 doi: 10.1002/rnc.5132
    [166] Qin C B, Qiao X P, Wang J G, Zhang D H, Hou Y D, Hu S L. Barrier-critic adaptive robust control of nonzero-sum differential games for uncertain nonlinear systems with state constraints. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(1): 50−63 doi: 10.1109/TSMC.2023.3302656
    [167] Xu J H, Wang J C, Rao J, Zhong Y J, Wang H Y. Adaptive dynamic programming for optimal control of discrete-time nonlinear system with state constraints based on control barrier function. International Journal of Robust and Nonlinear Control, 2022, 32(6): 3408−3424
    [168] Jha M S, Kiumarsi B. Off-policy safe reinforcement learning for nonlinear discrete-time systems. Neurocomputing, 2024, 611: Article No. 128677
    [169] Zhang L Z, Xie L, Jiang Y, Li Z S, Liu X Q, Su H Y. Optimal control for constrained discrete-time nonlinear systems based on safe reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(1): 854−865 doi: 10.1109/TNNLS.2023.3326397
    [170] Cohen M H, Belta C. Safe exploration in model-based reinforcement learning using control barrier functions. Automatica, 2023, 147: Article No. 110684 doi: 10.1016/j.automatica.2022.110684
    [171] Liu S H, Liu L J, Yu Z. Fully cooperative games with state and input constraints using reinforcement learning based on control barrier functions. Asian Journal of Control, 2024, 26(2): 888−905 doi: 10.1002/asjc.3226
    [172] Zhao M M, Wang D, Song S J, Qiao J F. Safe Q-learning for data-driven nonlinear optimal control with asymmetric state constraints. IEEE/CAA Journal of Automatica Sinica, 2024, 11(12): 2408−2422 doi: 10.1109/JAS.2024.124509
    [173] Liu D R, Li H L, Wang D. Neural-network-based zero-sum game for discrete-time nonlinear systems via iterative adaptive dynamic programming algorithm. Neurocomputing, 2013, 110: 92−100 doi: 10.1016/j.neucom.2012.11.021
    [174] Luo B, Yang Y, Liu D R. Policy iteration Q-learning for data-based two-player zero-sum game of linear discrete-time systems. IEEE Transactions on Cybernetics, 2021, 51(7): 3630−3640 doi: 10.1109/TCYB.2020.2970969
    [175] Zhong X N, He H B, Wang D, Ni Z. Model-free adaptive control for unknown nonlinear zero-sum differential game. IEEE Transactions on Cybernetics, 2018, 48(5): 1633−1646 doi: 10.1109/TCYB.2017.2712617
    [176] Wang Y, Wang D, Zhao M M, Liu N, Qiao J F. Neural Q-learning for discrete-time nonlinear zero-sum games with adjustable convergence rate. Neural Networks, 2024, 175: Article No. 106274 doi: 10.1016/j.neunet.2024.106274
    [177] Zhang Y W, Zhao B, Liu D R, Zhang S C. Event-triggered control of discrete-time zero-sum games via deterministic policy gradient adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2022, 52(8): 4823−4835 doi: 10.1109/TSMC.2021.3105663
    [178] Lin M D, Zhao B, Liu D R. Policy gradient adaptive dynamic programming for nonlinear discrete-time zero-sum games with unknown dynamics. Soft Computing, 2023, 27: 5781−5795 doi: 10.1007/s00500-023-07817-6
    [179] 王鼎, 赵慧玲, 李鑫. 基于多目标粒子群优化的污水处理系统自适应评判控制. 工程科学学报, 2024, 46(5): 908−917

    Wang Ding, Zhao Hui-Ling, Li Xin. Adaptive critic control for wastewater treatment systems based on multiobjective particle swarm optimization. Chinese Journal of Engineering, 2024, 46(5): 908−917
    [180] Yang Q M, Cao W W, Meng W C, Si J. Reinforcement-learning-based tracking control of waste water treatment process under realistic system conditions and control performance requirements. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2022, 52(8): 5284−5294 doi: 10.1109/TSMC.2021.3122802
    [181] Yang R Y, Wang D, Qiao J F. Policy gradient adaptive critic design with dynamic prioritized experience replay for wastewater treatment process control. IEEE Transactions on Industrial Informatics, 2022, 18(5): 3150−3158 doi: 10.1109/TII.2021.3106402
    [182] Qiao J F, Yang R Y, Wang D. Offline data-driven adaptive critic design with variational inference for wastewater treatment process control. IEEE Transactions on Automation Science and Engineering, 2024, 21(4): 4987−4998 doi: 10.1109/TASE.2023.3305615
    [183] Sun B, Kampen E J V. Incremental model-based global dual heuristic programming with explicit analytical calculations applied to flight control. Engineering Applications of Artificial Intelligence, 2020, 89: Article No. 103425 doi: 10.1016/j.engappai.2019.103425
    [184] Zhou Y, Kampen E J V, Chu Q P. Incremental model based online heuristic dynamic programming for nonlinear adaptive tracking control with partial observability. Aerospace Science and Technology, 2020, 105: Article No. 106013 doi: 10.1016/j.ast.2020.106013
    [185] 赵振根, 程磊. 基于增量式Q学习的固定翼无人机跟踪控制性能优化. 控制与决策, 2024, 39(2): 391−400

    Zhao Zhen-Gen, Cheng Lei. Performance optimization for tracking control of fixed-wing UAV with incremental Q-learning. Control and Decision, 2024, 39(2): 391−400
    [186] Cao W W, Yang Q M, Meng W C, Xie S Z. Data-based robust adaptive dynamic programming for balancing control performance and energy consumption in wastewater treatment process. IEEE Transactions on Industrial Informatics, 2024, 20(4): 6622−6630 doi: 10.1109/TII.2023.3346468
    [187] Fu Y, Hong C W, Fu J, Chai T Y. Approximate optimal tracking control of nondifferentiable signals for a class of continuous-time nonlinear systems. IEEE Transactions on Cybernetics, 2022, 52(6): 4441−4450 doi: 10.1109/TCYB.2020.3027344
Publication history
  • Received:  2024-10-31
  • Accepted:  2025-01-17
  • Published online:  2025-03-24

目录

/

返回文章
返回