近年来, 无人机集群在各个领域都受到了极大的关注, 获得了较大的发展[1−3]. 其中, 无人机的相变控制在无人机集群控制中已经引起了一定的关注[4]. 目前, 在多智能体研究领域, 相变还没有一个准确的定义, 一般而言, 是指在一个集群系统中, 由于内因、外因、系统内外因相互作用等的改变, 在集群内部的相互作用调节下, 导致群体行为从一种运动模态转变为另一种运动模态的过程. 在无人机集群中引入相变控制, 能够使无人机集群适应日益复杂的任务环境, 提升集群对复杂环境的适应能力[5].
目前, 集群的相变控制已经成为了研究的前沿领域, 学者们对不同的模型中蕴含的相变现象进行了深入的讨论. Vicsek模型是一个经典的能够产生相变的模型. Romero等[6]基于Vicsek模型开展研究, 引入了层级之间的控制衰减因子, 讨论了衰减因子和噪声的指数临界值, 并据此研究了等级机制对集群聚集带相态的影响. 相关文献则基于自推进粒子模型而非Vicsek模型, 研究了三维空间中自推进粒子集群的运动过程, 并运用数值求解的方法计算了集群从聚集到分散的临界参数[7]. 相比于Vicsek模型而言, 自推进粒子模型以自然界的鸟群运动为基础, 能够容纳更多类型的集群交互形式, 蕴含着更丰富的集群运动相态可能性, 因此被广泛作为多智能体相变的研究对象. Cheng等[8]研究了自推进粒子在不同形式的势能项和邻居交互距离条件下集群的运动相态, 讨论了不同势能作用参数下集群产生的运动相态, 并引入了两种序参量来衡量集群的旋转运动和直线运动相态; 文献[9]研究了集群存在通信时延的情况下集群的运动相态, 对集群的中心和个体相对集群中心的向量差分别进行研究, 借助混沌学科中的分岔理论得到了集群时延状态下的稳定运动相态, 探讨了存在通信延迟情况下自推进粒子集群的运动规律并给出了一些重要参数的解析解. 除了描述不同参数下集群从初始随机状态开始所形成的运动相态, 相关文献进一步讨论了某些运动相态是如何瓦解并转为另外一个相态的过程. Zhang等[10]通过大量的仿真模拟, 讨论了在存在一个外界捕食者的情况下, 集群因为躲避而产生的从涡旋到晶格运动状态的单向转变, 分析了速度对齐比例系数和吸引排斥力比例系数对临界危险半径的影响. Edwards等[11]研究了两个具有同样控制规律的群体产生的从集群平移相态到集群涡旋相态的转变过程, 并研究了两个群体中心距离不同对最终稳定集群状态的影响.
自推进粒子的相变理论已经初步应用到了无人集群中. 在考虑通信延迟的集群相变研究[9]的基础上, Edwards等[12]使用虚实混合方法, 在实际的小车上测试相关理论, 验证了自推进小车集群在通信延迟和集群交互强度变化时产生的不同的运动相态. Lei等[13]用小车对相变理论进行实际验证, 研究了小车集群在不同的交互作用强度下不同的集群特性, 在不同的参数下测试了集群的一致性收敛性和面对外界威胁时集群的反应速度. Xie等[14]研究了环境中磁场强度对微型机器人集群运动形态的影响, 并实现了群体在多种形态之间的快速可逆切换. Hao等[15]则研究了局部交互规则对微型机器人集群的影响, 通过改变振幅和频率, 实现了对集群的聚集和分散行为的控制.
鸟类行为是生物界中最为普遍的群体运动之一, 吸引着大量学者的研究[16−17]. 鸟群相比于其他集群运动有一个比较明显的特征, 即鸟群倾向于和邻居的速度保持一致, 这也是最经典的鸟群运动模型——Vicsek模型[18]的基本原理. 由于鸟群中存在大量个体, 对鸟群的观察和数据记录一直是相关研究的一个难点. 直到近期, 随着GPS技术的进步, 人们可以通过数据比较精确地记录鸟类的位置和速度信息, 并基于相关数据为鸟群建立一系列模型, 研究集群中信息的传递[19]、信息交互机制[20]或者飞行时的轨迹形成机制[21−22].
总之, 目前集群相变问题已经得到了广泛研究, 然而相关研究主要聚焦于集群的参数变化而导致的集群稳态解的不同, 为了得到不同的集群运动相态需要调整所有个体的控制参数, 而较少考虑同一控制方程不同稳定运动相态之间的转换[8, 23]. 考虑同一方程不同相态变化的一些文献中所讨论的运动相态转变也仅局限于从涡旋态到晶格平移相态的单向转变[10], 或是某种运动相态的崩溃[24], 针对两种乃至多种稳定相态之间的相互转换的研究仍然处于起步阶段. 此外, 目前相变控制理论主要针对空间中的自由粒子, 实际物理系统中应用相变理论进行控制的实例仍然较少, 目前引入相变理论的实际系统主要基于无人小车[12−13]和各种微型机器人[14,25], 这些实际物理系统受到的硬件限制较少. 而由于无人机较为复杂的动力学特性, 基于无人机集群进行的相变仿真分析尚存在技术空白.
基于上述情况, 本文对全连通交互拓扑下的无人机集群相变控制方法进行研究. 首先, 从鸟群飞行规律中得到启发, 基于自推进粒子模型, 考虑集群系统中的无人机满足实际飞行条件限制, 设计无人机仿鸟群相变控制律, 使无人机集群形成稳定的集群运动相态; 设计简单的相变控制项, 使无人机能够在两种不同的稳定集群运动相态之间进行互相转换. 此外, 基于无人机集群仿鸟群相变控制律, 进一步分析集群的稳定运动相态, 讨论部分重要的集群运动参数, 并通过仿真验证了所设计的集群相变控制律能够使集群实现稳定的集群平移和涡旋运动相态; 通过调节简单的相变控制项, 集群能够在两个运动相态之间进行转换, 完成无人机集群的相变控制. 最后, 在其他社会力模型中引入相变控制项进行仿真对比测试, 结果表明本文提出的相变控制方法更加灵活, 更适合实际无人集群的控制.
1. 问题描述
考虑在三维欧氏空间中飞行的由$ N $架无人机组成的无人机集群系统. 假设集群中的每架无人机均配有一阶速度保持自动驾驶仪、一阶航向保持自动驾驶仪以及二阶高度保持自动驾驶仪, 则每架无人机的动力学模型可以表示为[26]:
$$ \begin{split} & {{{\dot{x}}}_{i}}={{V}_{i}}\text{cos}{{\psi }_{i}},\; \\ & {{{\dot{y}}}_{i}}={{V}_{i}}\text{sin}{{\psi }_{i}},\; \\ & {{{\dot{z}}}_{i}}={{h}_{i}},\; \\ & {{{\dot{V}}}_{i}}=\tau _{{{V}_{i}}}^{-1}(V_{i}^{C}-{{V}_{i}}),\; \\ & {{{\dot{\psi }}}_{i}}=\tau _{{{\psi }_{i}}}^{-1}(\psi _{i}^{C}-{{\psi }_{i}}),\; \\ & {{{\dot{h}}}_{i}}=\tau _{{{z}_{i}}}^{-1}(z_{i}^{C}-{{z}_{i}})-\tau _{{{h}_{i}}}^{-1}{{h}_{i}},\; \end{split} $$ (1) 其中, 下标$ i\in \left\{ 1,\;2,\;\cdots ,\;N \right\} $表示不同的无人机, $ {{\mathbf{x}}_{i}}={{\left[ {{x}_{i}},\;{{y}_{i}},\;{{z}_{i}} \right]}^{\text{T}}} $表示无人机$ i $的位置向量, $ {{\mathbf{v}}_{i}}=[ {{{\dot{x}}}_{i}}, {{{\dot{y}}}_{i}},\;{{{\dot{z}}}_{i}} ]^{\text{T}} $为无人机$ i $在惯性坐标系三个坐标轴上的速度分量, $ {{V}_{i}} $, $ {{\psi }_{i}} $和$ {{h}_{i}} $分别为无人机的速度、航向和高度变化率. $ {{\tau }_{{{V}_{i}}}} $, $ {{\tau }_{{{\psi }_{i}}}} $, $ {{\tau }_{{{z}_{i}}}} $, $ {{\tau }_{{{h}_{i}}}} $分别为无人机的速度保持, 航向角保持和高度保持常数, 与自动驾驶仪和无人机本身的性能有关, $ V_{i}^{C} $, $ \psi _{i}^{C} $, $ z_{i}^{C} $为自动驾驶仪的输入指令.
考虑到实际无人机飞行受到无人机的速度、水平过载和爬升速度的限制, 无人机的飞行状态需要满足如下约束条件:
$$ \begin{aligned}[b] & {{V}_{\text{min}}} \le {{V}_{i}} \le {{V}_{\text{max}}},\; \\ & \left| {{{\dot{\psi }}}_{i}} \right| \le V_{i}^{-1}{{n}_{\text{max}}}g,\; \\ & {{h}_{\min }} \le {{h}_{i}} \le {{h}_{\max }},\; \end{aligned} $$ (2) 其中, $ g=9.8\,\;\text{m/}{{\text{s}}^{2}} $为重力加速度, $ {{V}_{\text{min}}} $, $ {{V}_{\text{max}}} $分别为无人机最小和最大飞行速度, $ {{n}_{\text{max}}} $为最大水平过载, $ {{h}_{\min }} $, $ {{h}_{\max }} $为最小和最大的高度变化率.
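作为式(1)动力学模型与式(2)约束条件的一个直观说明, 下面给出单架无人机做一步欧拉积分并限幅的Python示意代码(函数名、积分步长与默认参数均为示意性假设, 并非本文的正式实现):

```python
import math

def uav_step(state, cmd, tau, dt=0.01,
             v_min=10.0, v_max=150.0, n_max=6.0,
             h_min=-5.0, h_max=5.0, g=9.8):
    """按式(1)对单架无人机做一步欧拉积分, 并按式(2)对状态限幅.

    state: (x, y, z, V, psi, h); cmd: (V_c, psi_c, z_c);
    tau: (tau_V, tau_psi, tau_z, tau_h).
    """
    x, y, z, V, psi, h = state
    V_c, psi_c, z_c = cmd
    tau_V, tau_psi, tau_z, tau_h = tau

    # 式(1): 位置运动学与自动驾驶仪动力学
    x += V * math.cos(psi) * dt
    y += V * math.sin(psi) * dt
    z += h * dt
    V_dot = (V_c - V) / tau_V
    psi_dot = (psi_c - psi) / tau_psi
    h_dot = (z_c - z) / tau_z - h / tau_h

    # 式(2): 速度、航向角速率(水平过载)和高度变化率限幅
    psi_lim = n_max * g / V
    psi_dot = max(-psi_lim, min(psi_lim, psi_dot))
    V = max(v_min, min(v_max, V + V_dot * dt))
    psi += psi_dot * dt
    h = max(h_min, min(h_max, h + h_dot * dt))
    return (x, y, z, V, psi, h)
```

当指令与当前状态一致时, 无人机保持匀速直线平飞, 可用于检查积分与限幅逻辑.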
本文设定集群的交互拓扑为全连通的, 即无人机能够和集群中的所有个体进行交互, 无人机$ i $的交互邻居集合为$ {{N}_{i}}=\left\{ j|j=1,\;2,\;\cdots ,\;N,\;j\ne i \right\} $. 受到Cheng等[8]的启发, 引入集群运动序参量定义, 并定义集群的运动相态如下.
定义 1. (集群平移序参量与平移运动相态) 给定某一时刻无人机集群的位置矢量$ \mathbf{p}=[ {{\mathbf{x}}_{1}},\;{{\mathbf{x}}_{2}}, \;\cdots , {{\mathbf{x}}_{N}} ]^{\text{T}} $和惯性坐标系下的速度矢量$ \mathbf{v}=[\dot{\mathbf{x}}_1,\; \dot{\mathbf{x}}_2,\; \cdots, \dot{\mathbf{x}}_N]^{\mathrm{T}} $, 则可定义无人机集群平移序参量$ {{V}_{m}} $如下[8]:
$$ \begin{aligned} {{V}_{m}}=\frac{1}{N}\left\| \sum\limits_{i=1}^{N}{\frac{{{{\dot{\mathbf{x}}}}_{i}}}{\left\| {{{\dot{\mathbf{x}}}}_{i}} \right\|}} \right\| \end{aligned} $$ (3) 当无人机集群满足$ {{V}_{m}}=1 $时, 称无人机集群处于平移运动相态.
定义2. (集群涡旋序参量与涡旋运动相态) 给定某一时刻无人机集群的位置矢量$ \mathbf{p} $, 惯性坐标系下的速度矢量$ \mathbf{v} $和集群的邻居集合$ {{N}_{a}}=\{ {{N}_{1}},\;{{N}_{2}}, \cdots ,\;{{N}_{N}} \} $, 则定义无人机集群涡旋序参量$ {{V}_{c}} $如下[8]:
$$ \begin{aligned} {{V}_{c}}=\frac{1}{N}\left\| \sum\limits_{i=1}^{N}{\frac{{{\mathbf{r}}_{i}}\times {{{\dot{\mathbf{x}}}}_{i}}}{\left\| {{\mathbf{r}}_{i}} \right\|\left\| {{{\dot{\mathbf{x}}}}_{i}} \right\|}} \right\| \end{aligned} $$ (4) 其中, $ {{\mathbf{r}}_{i}}={{\mathbf{x}}_{i}}-\frac{1}{\left| {{N}_{i}} \right|}\sum\nolimits_{j\in {{N}_{i}}}{{{\mathbf{x}}_{j}}} $为无人机$ i $相对其所有邻居中心位置的矢量差, $ \left| {{N}_{i}} \right| $表示无人机$ i $邻居集合中元素的个数, $ \times $表示两个向量的叉乘.
当无人机集群满足$ {{V}_{c}}=1 $时, 称无人机集群处于涡旋运动相态.
注意到, 集群的平移序参量和涡旋序参量不会同时为1, 因此, 在本文的序参量定义下, 无人机集群仅有可能处于三种运动相态: 平移运动相态, 涡旋运动相态和无序运动相态.
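根据定义1和定义2, 两个序参量可以由集群某一时刻的位置和速度直接计算. 下面给出全连通拓扑下的一个纯Python示意实现(函数与变量名为示意性假设; 其中$ {{\mathbf{r}}_{i}} $按个体相对其邻居中心的位置矢量计算):

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def order_parameters(X, V):
    """计算平移序参量 V_m (式(3)) 与涡旋序参量 V_c (式(4)).

    X, V: N 个三维位置、速度向量; 全连通拓扑下
    无人机 i 的邻居为除自身外的所有个体.
    """
    N = len(X)
    # 式(3): 各无人机速度单位向量之和的模长除以 N
    s = [sum(V[i][k] / norm(V[i]) for i in range(N)) for k in range(3)]
    V_m = norm(s) / N
    # 式(4): 相对邻居中心的位置与速度的归一化叉乘之和
    total = [sum(X[i][k] for i in range(N)) for k in range(3)]
    acc = [0.0, 0.0, 0.0]
    for i in range(N):
        center = [(total[k] - X[i][k]) / (N - 1) for k in range(3)]  # 邻居中心
        r = [X[i][k] - center[k] for k in range(3)]  # 相对邻居中心的位置矢量
        c = cross(r, V[i])
        d = norm(r) * norm(V[i])
        for k in range(3):
            acc[k] += c[k] / d
    V_c = norm(acc) / N
    return V_m, V_c
```

对均匀分布在圆周上、速度沿切向的集群, 该实现给出$ {{V}_{c}}\approx 1 $; 对速度方向一致的集群, 给出$ {{V}_{m}}\approx 1 $, 与定义一致.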
2. 仿鸟群无人机相变控制方法设计
2.1 鸟群运动规律启发
鸟群的集群行为是一种常见的集体运动, 学者们已经建立了许多模型来解释鸟群所展现出的复杂行为. 对鸟类飞行数据的采集和分析表明, 鸟群的飞行几乎都处于同一个高度, 也因此, 对鸟群运动的讨论主要基于二维平面[27−28].
Vicsek模型[18]作为最经典的仿鸟群运动的模型, 受到了广泛的关注. 对于Vicsek模型的研究进一步揭示了即使简单如Vicsek的模型也同样能够产生相变的现象.
自推进粒子模型作为一个经典的受到鸟群启发的模型, 由Reynolds提出其基本思想[29]. 自推进粒子模型的核心在于设计集群中的个体存在速度自推进项, 以保持自身速度, 并通过粒子之间的集群交互作用调整集群的行为. 在此基础上, Couzin进一步发展了相关模型. Couzin模型[30]参考鸟群的交互机制, 设计了集群势能作用, 认为集群中的交互存在三个区域: 排斥区、对齐区和吸引区, 处于相应交互范围内的粒子分别会受到排斥力、速度对齐力和吸引力. 本文受到类似启发, 按照相关讨论的一般做法, 将排斥力和吸引力统一为集群势能梯度作用, 集群中的个体通过势能梯度进行交互. 在没有其余因素的影响下, 两个个体总是倾向于位于互相的势能最低点附近. 通过设计不同的集群交互势函数, 集群的运动行为相比Vicsek模型的三种运动相态有了更多的可能性. 由于自推进粒子模型能够实现多种集群运动相态, 其相变特征目前仍是相关研究的热点.
鸟类归巢行为是一种特殊的鸟群运动行为. 以鸽群归巢行为为例, 鸽群在远离巢穴时, 具有多个个体的集群会选择一条从出发点到终点的较短路径, 群体的运动轨迹近似一条直线[31]; 而在靠近巢穴的位置, 集群的运动发生了变化, 鸟群倾向于在巢穴附近做盘旋运动[32]. 鸽群如何实现两种相态的切换存在着较大的研究空间.
相变控制能够较好地在不同模态之间进行切换, 灵活适应不同作战场景; 而鸟群归巢行为体现了灵活的模态转换, 对无人机集群的相变控制有着较大的启发. 通过给出目标位置, 作为无人机集群的归巢目标, 仅将部分无人机设置为信息个体, 引入相变控制项, 即可控制集群到达敌方目标; 在到达敌方目标后, 巢穴吸引力降低, 相变控制项逐渐不发生作用, 无人机集群变为涡旋运动模态, 围绕在敌方目标周围开展任务; 如果遇到敌方威胁时, 集群中的部分个体感受到威胁, 产生逃逸行为, 进而促使集群所有个体远离敌方威胁; 完成任务后, 将巢穴设置为起飞点, 产生相变控制项, 集群就能够通过部分信息个体, 控制集群返回出发点.
综上, 本文借助自推进粒子模型模拟鸟群的基础群体行为, 进一步引入相变控制项来模拟巢穴的吸引对鸟群的调节作用, 设计无人机集群仿鸟群相变控制律, 实现集群运动相态的切换.
2.2 仿鸟群相变控制律设计
从鸟群运动中得到启发, 采用自推进粒子模型作为无人机集群仿鸟群控制律的基础, 可以得到无人机集群的相变控制的速度保持控制项$ \boldsymbol{\Gamma}_i^v $:
$$ \begin{aligned} \boldsymbol{\Gamma}_{i}^{v}=a\left( {{v}_{0}}-\alpha {{\left\| {{{\dot{\mathbf{x}}}}_{i}} \right\|}^{2}} \right){{\dot{\mathbf{x}}}_{i}} \end{aligned} $$ (5) 其中, $ a>0 $为速度保持项的控制增益, $ \alpha >0 $为阻尼比, $ {{v}_{0}} $为速度保持项的基准速度, $ \left\| {{{\dot{\mathbf{x}}}}_{i}} \right\| $为无人机$ i $位置变化率的范数.
基于Couzin模型的基本原理, 采用保守势函数梯度作用作为仿鸟群控制律的集群交互项, 设计得到集群无人机之间交互的集群势能梯度项$ \Gamma _{i}^{U} $:
$$ \begin{aligned} \boldsymbol{\Gamma} _{i}^{U}=b\sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)} \end{aligned} $$ (6) 其中, $ b>0 $为势能梯度项的控制增益, $ {{N}_{i}} $为无人机$ i $的交互邻居集合. 在本文中, $ {{\mathbf{x}}_{ij}}={{\mathbf{x}}_{i}}-{{\mathbf{x}}_{j}} $为无人机$ i $和$ j $之间的坐标向量差, $ {{U}_{ij}} $为无人机集群交互势函数, 设计势函数的梯度满足:
$$ \begin{aligned} \nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)=\frac{\left( \left\| {{\mathbf{x}}_{ij}} \right\|-d \right){{\mathbf{x}}_{ij}}}{\left\| {{\mathbf{x}}_{ij}} \right\|} \end{aligned} $$ (7) 其中, $ d $为设计的势能梯度项的平衡距离, 为一个定值. 势能梯度在$ \left\| {{\mathbf{x}}_{ij}} \right\|=d $时为0, 意味着势能在该点处取得最小值. 势能梯度项同样起到无人机集群内部避障的作用. 当集群中任意两个个体的距离小于平衡距离时, 势能梯度项的作用会让两者互相远离; 而当两个个体距离大于平衡距离时, 势能梯度项的作用会将两者互相靠近, 防止距离过远或发生碰撞. 在仅存在两个个体时, 个体之间的距离将会保持在$ d $.
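式(7)的势能梯度可以按向量运算直接实现. 下面给出一个示意函数(函数名为示意性假设, $ d $的默认值取自后文仿真设置), 并演示梯度在平衡距离处为零向量的性质:

```python
import math

def potential_gradient(x_i, x_j, d=50.0):
    """式(7): 个体 i 与 j 之间的势能梯度向量.

    x_i, x_j: 三维位置向量; d: 平衡距离,
    两机间距恰为 d 时梯度为零向量.
    """
    x_ij = [a - b for a, b in zip(x_i, x_j)]
    dist = math.sqrt(sum(c * c for c in x_ij))
    scale = (dist - d) / dist
    return [scale * c for c in x_ij]
```

间距等于$ d $时返回零向量, 间距偏离$ d $时梯度的模长为$ \left| \left\| {{\mathbf{x}}_{ij}} \right\|-d \right| $, 与式(7)一致.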
通过引入速度自推进项和势能梯度项, 就能够实现无人机集群的自组织等行为, 使集群实现一个固定的相态.
为了实现无人机集群的相变控制, 完成无人机集群在两种运动相态的切换, 在上述两项的基础上, 引入仿鸟群相变控制项$ \boldsymbol{\Gamma} _{i}^{p} $来模拟巢穴等对鸟群中部分个体的吸引, $ \boldsymbol{\Gamma} _{i}^{p}=\lambda\left( t \right) {{\mathbf{F}}_{i}} $. 其中, $ \lambda\left( t \right) \in \mathbb{R} $为相变控制项的强度, $ {{\mathbf{F}}_{i}} $为相变控制力. 实际上, 相变控制力可以沿任意方向, 但为了探究无人机集群的相变控制并简化讨论, 本文设计相变控制力为沿某一坐标轴方向的单位向量, 以模拟一个固定位置的巢穴的吸引. 其具体表达式为:
$$ \begin{aligned} {{\mathbf{F}}_{i}}={{\left[ 1,\;0,\;0 \right]}^{\text{T}}} \end{aligned} $$ (8) 综上所述, 对于无人机集群, 为了使无人机能够维持自身的速度并保持群体的一致性, 设计如下的仿鸟群无人机集群相变控制律:
$$ \begin{aligned}[b] {{\mathbf{u}}_{i}} =\;&a\left( {{v}_{0}}-\alpha {{\left\| {{{\dot{\mathbf{x}}}}_{i}} \right\|}^{2}} \right){{\dot{\mathbf{x}}}_{i}}+\\ &b\sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)}+\lambda {{\mathbf{F}}_{i}}\left( t \right) \end{aligned} $$ (9) 其中, $ {{\mathbf{x}}_{ij}}={{\mathbf{x}}_{i}}-{{\mathbf{x}}_{j}} $为两个邻居个体的位置矢量差, $ a,\;b,\;\lambda $分别为粒子自推进项, 集群势能项, 相变控制力的控制增益, $ {{v}_{0}} $为速度保持项的基准速度量, $ \alpha $为粒子运动的阻尼.
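将式(5) ~ 式(8)组合即得式(9)的控制律. 下面给出全连通拓扑下的一个示意实现(函数名与参数名为示意性假设, 并非本文的正式实现):

```python
import math

def swarm_control(i, X, Xdot, a, alpha, b, v0, d, lam, F=(1.0, 0.0, 0.0)):
    """按式(9)计算无人机 i 的控制量 u_i.

    X, Xdot: N 个三维位置与速度向量; lam: 相变控制强度 λ;
    F: 相变控制力方向, 式(8)中取 x 轴单位向量.
    """
    xi, vi = X[i], Xdot[i]
    speed2 = sum(c * c for c in vi)
    # 式(5): 速度保持项 a*(v0 - alpha*||v||^2)*v
    u = [a * (v0 - alpha * speed2) * c for c in vi]
    # 式(6)(7): 集群势能梯度项, 全连通拓扑下对所有邻居求和
    for j in range(len(X)):
        if j == i:
            continue
        x_ij = [xi[k] - X[j][k] for k in range(3)]
        dist = math.sqrt(sum(c * c for c in x_ij))
        for k in range(3):
            u[k] += b * (dist - d) * x_ij[k] / dist
    # 式(8): 相变控制项 λ F_i
    return [u[k] + lam * F[k] for k in range(3)]
```

可验证: 当两机间距恰为$ d $、速度大小为$ \sqrt{{{v}_{0}}/\alpha } $且$ \lambda =0 $时控制量为零, 与后文的平衡解分析一致.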
将由式(9)计算得到的控制律代入如下控制指令转换器, 可以得到无人机$ i $的自动驾驶仪控制指令输入为:
$$ \begin{aligned}[b] & V_{i}^{C}={{\tau }_{{{V}_{i}}}}({{\mathbf{u}}_{i,\;1}}\text{cos}{{\psi }_{i}}+{{\mathbf{u}}_{i,\;2}}\text{sin}{{\psi }_{i}})+{{V}_{i}},\; \\ & \psi _{i}^{C}=\frac{{{\tau }_{{{\psi }_{i}}}}}{{{V}_{i}}}({{\mathbf{u}}_{i,\;2}}\text{cos}{{\psi }_{i}}-{{\mathbf{u}}_{i,\;1}}\text{sin}{{\psi }_{i}})+{{\psi }_{i}},\; \\ & z_{i}^{C}=\frac{{{\tau }_{{{z}_{i}}}}}{{{\tau }_{{{h}_{i}}}}}{{h}_{i}}+{{\tau }_{{{z}_{i}}}}{{\mathbf{u}}_{i,\;3}}+{{z}_{i}}. \end{aligned} $$ (10)
3. 集群运动相态分析
本部分将证明, 使用控制律式(9)和控制律转化式(10), 无人机集群能够实现两种稳定运动相态, 并实现两种运动相态的转化.
定理 1. 给定一个无人机集群, 若集群交互拓扑全连通且集群中的无人机均处在同一高度, 无人机在控制律式(9)和控制律转化式(10)的作用下, 在相变控制强度$ \lambda =0 $时, 集群存在两个运动相态, 分别为平移运动相态$ {{V}_{m}}=1 $和涡旋运动相态$ {{V}_{c}}=1 $.
证明: 为了描述方便, 对集群的中心坐标和每个个体相对集群中心的坐标向量差分别进行处理[12]. 引入集群中心坐标$ \mathbf{R} $和每个个体相对集群中心坐标的差$ \delta {{\mathbf{r}}_{i}} $, 有:
$$ \begin{aligned}[b] & \mathbf{R}=\frac{1}{N}\sum\limits_{i=1}^{N}{{{\mathbf{x}}_{i}}} \\ & {{\mathbf{x}}_{i}}=\mathbf{R}+\delta {{\mathbf{r}}_{i}} \end{aligned} $$ (11) 由上述定义式, 显然有:
$$ \begin{aligned} \sum\limits_{i=1}^{N}{\delta {{\mathbf{r}}_{i}}}=\mathbf{0} \end{aligned} $$ (12) 将上式代入到集群的控制律(9)中, 有:
$$ \begin{aligned} {{{\ddot{\mathbf{x}}}}_{i}}=a\left( {{v}_{0}}-\alpha {{\left\| {{{\dot{\mathbf{x}}}}_{i}} \right\|}^{2}} \right){{\dot{\mathbf{x}}}_{i}}+b\sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)}+\lambda {{\mathbf{F}}_{i}} \end{aligned} $$ (13) $$ \begin{aligned}[b] {\ddot{\mathbf{R}}}+\delta {{{\ddot{\mathbf{r}}}}_{i}} =\; & a\left( {{v}_{0}}-\alpha {{\left\| \dot{\mathbf{R}}+\delta {{{\dot{\mathbf{r}}}}_{i}} \right\|}^{2}} \right)\left( \dot{\mathbf{R}}+\delta {{{\dot{\mathbf{r}}}}_{i}} \right)+\\ & b\sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)}+\lambda {{\mathbf{F}}_{i}} \end{aligned} $$ (14) 将式(14)对集群中的所有个体进行求和, 有:
$$ \begin{aligned}[b] N{\ddot{\mathbf{R}}} =\; & \sum\limits_{j=1}^{N}{a\left[ {{v}_{0}}-\alpha {{\left( \dot{\mathbf{R}}+\delta {{{\dot{\mathbf{r}}}}_{j}} \right)}^{2}} \right]\left( \dot{\mathbf{R}}+\delta {{{\dot{\mathbf{r}}}}_{j}} \right)}+\\ & \sum\limits_{j=1}^{N}{b\sum\limits_{k\in {{N}_{j}}}{\nabla {{U}_{jk}}\left( {{\mathbf{x}}_{jk}} \right)}}+\sum\limits_{j=1}^{N}{\lambda {{\mathbf{F}}_{j}}} \end{aligned} $$ (15) 考虑到集群的交互拓扑是全连通的, 且势能梯度项$ \boldsymbol{\Gamma} _{i}^{U} $为保守力, 对于集群中的任意两个个体都具有对称性, 该性质与势能梯度项的具体表达式无关, 因此有:
$$ \begin{aligned} \sum\limits_{j=1}^{N}{b\sum\limits_{k\in {{N}_{j}}}{\nabla {{U}_{jk}}\left( {{\mathbf{x}}_{jk}} \right)}}=0 \end{aligned} $$ (16) 展开求和式(15)并将式(16)代入, 有:
$$ \begin{split} & N{\ddot{\mathbf{R}}} =\sum\limits_{j=1}^{N}{a\left[ {{v}_{0}}-\alpha \left( {{{\dot{\mathbf{R}}}}^{2}}+2\dot{\mathbf{R}}\delta {{{\dot{\mathbf{r}}}}_{j}}+\delta {{{\dot{\mathbf{r}}}}_{j}}^{2} \right) \right] \left( \dot{\mathbf{R}}+\delta {{{\dot{\mathbf{r}}}}_{j}} \right)}+\\ & \quad\sum\limits_{j=1}^{N}{\lambda {{\mathbf{F}}_{j}}} =a\sum\limits_{j=1}^{N}{\left[{{v}_{0}}-\alpha \left( {{{\dot{\mathbf{R}}}}^{2}}+2\dot{\mathbf{R}}\delta {{{\dot{\mathbf{r}}}}_{j}}+\delta {{{\dot{\mathbf{r}}}}_{j}}^{2} \right) \right]\dot{\mathbf{R}}}+\\ &\quad a\sum\limits_{j=1}^{N}{\left[ {{v}_{0}}-\alpha \left( {{{\dot{\mathbf{R}}}}^{2}}+2\dot{\mathbf{R}}\delta {{{\dot{\mathbf{r}}}}_{j}}+\delta {{{\dot{\mathbf{r}}}}_{j}}^{2} \right) \right]\delta {{{\dot{\mathbf{r}}}}_{j}}}+ \sum\limits_{j=1}^{N}{\lambda {{\mathbf{F}}_{j}}} \end{split} $$ (17) 考虑到式(12), 且$ \dot{\mathbf{R}} $与指标$ j $无关, 因此有:
$$ \begin{aligned}[b] N{\ddot{\mathbf{R}}} =\; & a\sum\limits_{j=1}^{N}{\left[ {{v}_{0}}-\alpha \left( {{{\dot{\mathbf{R}}}}^{2}}+\delta {{{\dot{\mathbf{r}}}}_{j}}^{2} \right) \right]\dot{\mathbf{R}}}+\\ & a\sum\limits_{j=1}^{N}{\left[ {{v}_{0}}-\alpha \left( 2\dot{\mathbf{R}}\delta {{{\dot{\mathbf{r}}}}_{j}}+\delta {{{\dot{\mathbf{r}}}}_{j}}^{2} \right) \right]\delta {{{\dot{\mathbf{r}}}}_{j}}}+\\ & \sum\limits_{j=1}^{N}{\lambda {{\mathbf{F}}_{j}}} \end{aligned} $$ (18) 将式(18)同除以集群中个体的数量$ N $:
$$ \begin{aligned}[b] {\ddot{\mathbf{R}}} =\; & a\dot{\mathbf{R}}\left( {{v}_{0}}-\alpha {{{\dot{\mathbf{R}}}}^{2}}-\frac{\alpha}{N}\sum\limits_{j=1}^{N}{\delta {{{\dot{\mathbf{r}}}}_{j}}^{2}} \right)+\\ & \frac{a}{N}\sum\limits_{j=1}^{N}{\left[ {{v}_{0}}-\alpha \left( 2\dot{\mathbf{R}}\delta {{{\dot{\mathbf{r}}}}_{j}}+\delta {{{\dot{\mathbf{r}}}}_{j}}^{2} \right) \right]\delta {{{\dot{\mathbf{r}}}}_{j}}}+\frac{1}{N}\sum\limits_{j=1}^{N}{\lambda {{\mathbf{F}}_{j}}} \end{aligned} $$ (19) 将式(19)代入式(14), 可以得到:
$$ \begin{aligned}[b] \delta {{{{\ddot{\mathbf{r}}}}}_{i}} =\; & a\alpha \left( -2\dot{\mathbf{R}}\delta {{{\dot{\mathbf{r}}}}_{i}}-\delta {{{\dot{\mathbf{r}}}}_{i}}^{2}+\frac{1}{N}\sum\limits_{j=1}^{N}{\delta {{{\dot{\mathbf{r}}}}_{j}}^{2}} \right)\dot{\mathbf{R}}+\\ & a\left( {{v}_{0}}-\alpha {{\left\| \dot{\mathbf{R}}+\delta {{{\dot{\mathbf{r}}}}_{i}} \right\|}^{2}} \right)\delta {{{\dot{\mathbf{r}}}}_{i}}- \\ & \frac{a}{N}\sum\limits_{j=1}^{N}{\left[ {{v}_{0}}-\alpha \left( 2\dot{\mathbf{R}}\delta {{{\dot{\mathbf{r}}}}_{j}}+\delta {{{\dot{\mathbf{r}}}}_{j}}^{2} \right) \right]\delta {{{\dot{\mathbf{r}}}}_{j}}} +\\ & b\sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)}+\lambda {{\mathbf{F}}_{i}}-\frac{1}{N}\sum\limits_{j=1}^{N}{\lambda {{\mathbf{F}}_{j}}} \end{aligned} $$ (20) 通过上述处理, 我们将原本关于$ {{\mathbf{x}}_{i}} $的$ N $个方程转变为了关于集群中心坐标$ \mathbf{R} $(式(19))和坐标矢量差$ \delta {{\mathbf{r}}_{i}} $(式(20))的$ N+1 $个方程. 考虑到式(12)的隐含关系, 实际上转化后只有$ N $个方程, 转化前后方程约束条件一致, 因此可以通过研究转化后的方程来研究转化前方程的相关特性.
受到平均场理论的启发, 为了研究集群的运动相态, 令$ N\to \infty $, 并忽略式(19)中$ \delta \dot{\mathbf{r}} $的波动, 因此可以得到与集群空间维数的取值无关的集群中心运动方程:
$$ \begin{aligned} {\ddot{\mathbf{R}}}=a\dot{\mathbf{R}}\left( {{v}_{0}}-\alpha {{{\dot{\mathbf{R}}}}^{2}} \right)+\frac{1}{N}\sum\limits_{j=1}^{N}{\lambda {{\mathbf{F}}_{j}}} \end{aligned} $$ (21) 在公式中, 令相变控制强度$ \lambda =0 $, 可以得到简化后的集群运动方程:
$$ \begin{aligned} {\ddot{\mathbf{R}}}=a\dot{\mathbf{R}}\left( {{v}_{0}}-\alpha {{{\dot{\mathbf{R}}}}^{2}} \right) \end{aligned} $$ (22) 研究微分方程式(22), 可以得到方程的两个平衡解: 一个静态平衡解$ \mathbf{R}={{\mathbf{R}}_{0}} $, 即集群的中心坐标为一个常矢量, 集群的中心点不变; 和一个恒速运动解$ \mathbf{R}={{\mathbf{V}}_{0}}t+{{\mathbf{R}}_{0}} $, 其中$ \left\| {{\mathbf{V}}_{0}} \right\|=\sqrt{{{v}_{0}}/\alpha } $. 在控制所有无人机处于同一个高度的情况下, 两个平衡解分别对应集群涡旋相态和集群平移相态, 并可以据此实现无人机集群的相变控制.
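式(22)的恒速运动解可以用数值积分做一个简单验证: 任取非零初速度, 一维情形下中心速度应收敛到$ \sqrt{{{v}_{0}}/\alpha } $. 下面给出示意代码(参数取自后文仿真设置, 步长与步数为示意性取值, 仅作验证用途):

```python
import math

def center_speed(v_init, a=1 / 55, alpha=0.01, v0=20.0, dt=0.01, steps=200000):
    """对一维简化的式(22) dv/dt = a*v*(v0 - alpha*v^2) 做欧拉积分."""
    v = v_init
    for _ in range(steps):
        v += a * v * (v0 - alpha * v * v) * dt
    return v

# 理论平衡速度为 sqrt(v0/alpha), 即 sqrt(20/0.01) ≈ 44.72 m/s
```

任意非零初速度(无论大于还是小于平衡速度)都收敛到同一值, 说明该恒速解是稳定的.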
(1) 集群涡旋相态稳定解
当$ \mathbf{R}={{\mathbf{R}}_{0}} $时, 即$ \dot{\mathbf{R}}=0 $, 代入集群运动式(14), 有:
$$ \begin{aligned} \delta {{{\ddot{\mathbf{r}}}}_{i}}=a\left( {{v}_{0}}-\alpha \delta {{{\dot{\mathbf{r}}}}_{i}}^{2} \right)\delta {{\dot{\mathbf{r}}}_{i}}+b\sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)} \end{aligned} $$ (23) 方程(23)存在一个稳定解, 即$ \left\| \delta {{{\dot{\mathbf{r}}}}_{i}} \right\|=\sqrt{{{v}_{0}}/\alpha } $, $ \delta {{{\ddot{\mathbf{r}}}}_{i}}=b\sum\nolimits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)} $. 当$ \left\| \delta {{{\dot{\mathbf{r}}}}_{i}} \right\| $不变时, 存在两种情况, $ \left\| \delta {{{\ddot{\mathbf{r}}}}_{i}}\right\|=0 $或$ \delta {{{\ddot{\mathbf{r}}}}_{i}} $垂直于$ \delta {{\dot{\mathbf{r}}}_{i}} $. 若无人机势能梯度项$ \delta {{{\ddot{\mathbf{r}}}}_{i}}=b\sum\nolimits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)}=0 $, 即无人机加速度为0, 此时无人机将进行直线平移运动, 与集群中心$ \dot{\mathbf{R}}=0 $矛盾. 因此, 仅有可能无人机的加速度始终与无人机的速度垂直, 无人机围绕集群中心做匀速圆周运动, 运动方程中的势能梯度项$ \boldsymbol{\Gamma} _{i}^{U} $提供粒子运动的向心加速度, 因此可以得到单个无人机速度大小$ {{v}_{i}} $和圆周运动半径$ {{r}_{i}} $满足的关系式:
$$ \begin{aligned} \frac{{{v}_{i}}^{2}}{{{r}_{i}}}=b\left\| \sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)} \right\| \end{aligned} $$ (24) 对势能梯度项进行分析, 在集群稳定状态下, 可以近似认为集群的所有个体分布在以集群中心为圆心的一个圆周上, 如图1所示:
稳定状态下, 可以认为集群中的所有个体是在圆周上的均匀分布, 角密度为$ \frac{N}{2\pi } $, 每一小段上的无人机数量为$ \frac{N}{2\pi }\text{d}\theta $. 由于无人集群的交互拓扑为全连通的, 因此需要对无人机集群中的所有个体进行积分. 由于圆的对称性, 无人机的加速度指向圆心, 因此集群圆弧上每一点的邻居对无人机$ i $的作用分量为$ \left( 2{{r}_{i}}\cos \theta -d \right)\cos \theta $. 根据圆周角和圆心角关系, 代入式(24), 可以得到积分式:
$$ \begin{aligned}[b] \frac{{{v}_{i}}^{2}}{{{r}_{i}}}=\;&b\left\| \sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)} \right\|= \\ & b\int_{-\pi /2}^{\pi /2}{\frac{N}{2\pi }2\cos \theta \left( 2{{r}_{i}}\cos \theta -d \right)\text{d}\theta } \end{aligned} $$ (25) 对式(25)进行积分运算, 可以得到:
$$ \begin{aligned} \frac{{{v}_{i}}^{2}}{{{r}_{i}}}=b\left( Nr_i-2\frac{N}{\pi }d \right) \end{aligned} $$ (26) 式(26)是关于半径$ {{r}_{i}} $的二次方程. 求解上述方程, $ bN{{r}_{i}}^{2}-2b\frac{N}{\pi }d{{r}_{i}}-{{v}_{i}}^{2}=0 $, 并舍去没有实际意义的负数解, 可以解得涡旋相的旋转半径:
$$ \begin{aligned}[b] r_i =\; & \frac{2b\frac{N}{\pi }d+\sqrt{{{\left( 2b\frac{N}{\pi }d \right)}^{2}}+4bN{{v}_{i}}^{2}}}{2bN} =\\ & \frac{d+\sqrt{{{d}^{2}}+{{v}_{i}}^{2}{{\pi }^{2}}/\left( bN \right)}}{\pi } \end{aligned} $$ (27) 当$ N\to \infty $时, $ {{v}_{i}}^{2}{{\pi }^{2}}/\left( bN \right)\to 0 $, 因此有
$$ \begin{aligned} r_i=\frac{2}{\pi }d \end{aligned} $$ (28) 此时, 所有无人机绕集群中心做圆周运动, 集群涡旋序参量$ {{V}_{c}}=1 $且集群平移序参量$ {{V}_{m}}=0 $.
经过讨论, 本文得到了一个稳定的涡旋状态解. 在这种状态下, 集群能够实现涡旋状态, 集群中的无人机为绕集群中心点的匀速圆周运动, 并且旋转半径与集群内无人机数目无关, 仅与势能项中设计的平衡距离有关.
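式(27)与极限式(28)的关系可以数值验证: 固定$ {{v}_{i}} $与$ d $, 当$ N $增大时, 式(27)给出的半径应单调趋于$ 2d/\pi $. 示意代码如下(参数为示意性取值):

```python
import math

def vortex_radius(v, d, b, N):
    """式(27): 有限 N 时的涡旋旋转半径."""
    return (d + math.sqrt(d * d + v * v * math.pi ** 2 / (b * N))) / math.pi

# N -> ∞ 时, v^2*pi^2/(b*N) -> 0, 半径趋于式(28)的极限 2d/pi
```

有限$ N $时半径略大于极限值$ 2d/\pi $, 且随$ N $增大而减小, 与上文的推导一致.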
(2) 集群平移相态稳定解
在这种情况下, 式(22)具有稳定运动解$ \mathbf{R}= {{\mathbf{V}}_{0}}t+{\mathbf{R}}_{0} $, 其中$ \left\| {{\mathbf{V}}_{0}} \right\|=\sqrt{{{v}_{0}}/\alpha } $, 代入式(14), 有:
$$ \begin{aligned}[b] \delta {{{\ddot{\mathbf{r}}}}_{i}} =\; & -a\left[ \alpha \left( 2\dot{\mathbf{R}}\delta {{{\dot{\mathbf{r}}}}_{i}}+\delta {{{\dot{\mathbf{r}}}}_{i}}^{2} \right) \right]\left( \dot{\mathbf{R}}+\delta {{{\dot{\mathbf{r}}}}_{i}} \right)+\\ & b\sum\limits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}}\left( {{\mathbf{x}}_{ij}} \right)}+\lambda {{\mathbf{F}}_{i}} \end{aligned} $$ (29) 此时具有稳定解$ \delta {{\dot{\mathbf{r}}}_{i}}=0 $, 且$ b\sum\nolimits_{j\in {{N}_{i}}}{\nabla {{U}_{ij}} \left( {{\mathbf{x}}_{ij}} \right)} =0 $. 可以关于无人机位置列出$ 3N $个方程求解无人机集群的$ 3N $个坐标分量, 因此, 能够得到集群中所有无人机的稳定位置, 实现集群的稳定平移运动. 在这种情况下, 集群具有稳定的运动速度, 所有无人机以同样的速度进行直线运动. 此时, 集群平移序参量$ {{V}_{m}}=1 $, 涡旋序参量为$ {{V}_{c}}=0 $.
由上述讨论可知, 集群存在两个运动相态, 涡旋运动相态和平移运动相态, 定理1得证.
当相变控制项的强度$ \lambda \ne 0 $时, 能够通过对无人机集群中的部分无人机施加相变控制项$ \boldsymbol{\Gamma} _{i}^{p} $, 使无人机集群的中心速度发生变化, 进而使集群在平移运动相态和涡旋运动相态之间进行转化. 根据集群中心运动方程(21)可知, 集群中心速度存在两个解, 在稳定状态下, 集群中心速度不会发生变化. 而通过调节无人机个体的相变控制强度, 能够调节集群中心的速度.
当集群处于涡旋运动相态时, 随着集群中心的速度逐渐增大, 无人机相对于集群中心的速度逐渐发生较大的偏差, 在势能梯度项$ \boldsymbol{\Gamma} _{i}^{U} $没有较大变化时, 势能梯度项提供的向心控制项远远超出无人机圆周运动的需要, 因此集群的涡旋相态逐渐消失. 随着集群中心速度的增加, 集群个体的速度逐渐增加, 最终达到一致的平移速度; 随后势能梯度项将集群的所有个体位置控制到平衡位置附近, 实现集群的平移运动相态.
当集群处于平移运动相态时, 随着集群中心速度的逐渐减小, 集群势能项依旧保持在0附近波动, 此时无人机的速度自推进项开始发生作用, 促使单个无人机保持一定的运动速度. 因此, 无人机的速度开始成对变化以保持集群中心速度不变, 并在势能梯度项的作用下逐渐变为绕集群中心点的圆周运动.
定理1中, 假设集群的通信拓扑为全连通的, 这个假设对集群稳定解的证明存在一定的影响. 一直到式(22), 全连通假设都没有对结果造成影响. 并且, 对于平移运动模态所对应的稳定解, 全连通假设并不会影响该稳定解的特性. 影响较大的主要是涡旋状态稳定解. 在全连通通信假设条件下, 集群的稳定涡旋解能够通过对集群中所有个体求积分得到; 而如果集群的通信拓扑是非连通的, 不妨讨论一种比较简化的情况, 即假设邻居的选取原则为通信半径范围内的所有邻居, 则此时的积分范围只能是无人机附近通信半径内的个体. 在这种情况下, 一方面积分的对象发生了变化, 需要在以该无人机为圆心的圆中进行积分, 另一方面也需要对不同半径处无人机的分布密度有一定的先验知识, 分析起来较为困难. 因此, 本文引入了全连通假设来简化相应的分析.
4. 仿真实验结果
考虑由$ N $架($ N=45 $) 无人机组成的无人机集群, 在三维空间中生成无人集群中无人机的初始飞行状态, 包括无人机的空速$ {{V}_{i}} $, 航向角$ {{\psi }_{i}} $, 高度变化率$ {{h}_{i}} $和位置$ {{\mathbf{p}}_{i}}=\left[ {{x}_{i}},\;{{y}_{i}},\;{{z}_{i}} \right]^{\text{T}} $. 其中, 下标$ i $用于区分不同的无人机. 除了高度和高度变化率分别设置为$ {{z}_{i}}=30\,\;\text{m} $和$ {{h}_{i}}=0\,\;\text{m/s} $, 其余飞行状态均为随机生成. 无人机自动驾驶仪的时间常数分别设置为$ {{\tau }_{{{V}_{i}}}}=2\,\;\text{s} $, $ {{\tau }_{{{\psi }_{i}}}}=2\,\;\text{s} $, $ {{\tau }_{{{z}_{i}}}}=2\,\;\text{s} $和$ {{\tau }_{{{h}_{i}}}}=2\,\;\text{s} $. 无人机飞行速度、最大水平过载和高度变化率限制分别设置为$ {{V}_{\min }}=10\,\;\text{m/s} $, $ {{V}_{\max }}=150\,\;\text{m/s} $, $ {{n}_{\max }}=6\,\;\text{m/}{{\text{s}}^{2}} $, $ {{h}_{\min }}=-5\,\;\text{m/s} $和$ {{h}_{\max }}=5\,\;\text{m/s} $. 仿鸟群相变控制律的控制增益为$ a=1/55 $, $ \alpha = 1/100 $, $ b=1 $, 基准速度$ v_0=20 $, 势能平衡距离$ d=50 $.
4.1 集群稳定运动相态
在集群中不存在相变控制项, 即$ \lambda \left( t \right)=0 $时, 设置无人机集群位于不同的初始位置, 进行多次实验.
图2给出了$ N=45 $时某次集群的运动相态变化情况, 集群初始时刻在空间中随机分布. 可以看出, 所提出的控制律在没有相变控制项的情况下, 能够使初始时刻空间中随机分布的无人机集群自发地形成涡旋相态, 整个集群围绕集群的中心点做圆周运动, 并且集群的中心位置几乎没有变化.
图3展示了平移序参量和涡旋序参量两种序参量指标的变化情况. 在初始阶段, 无人机集群的平移序参量和涡旋序参量都比较低, 代表集群在运动的一开始处于无序的状态, 没有处于优势的运动相态. 经过一段时间, 集群的涡旋序参量开始逐渐上升, 而集群的平移序参量依旧没有太多的变化, 代表集群开始逐渐转变为涡旋状态. 随着系统进一步演化, 最后集群的涡旋序参量逐渐趋向于1, 而平移序参量依旧在0左右浮动, 表示集群的运动相态已经达到了涡旋的运动相态, 并且能够稳定保持在该运动相态.
改变初始条件进行多次测试, 集群最终的涡旋序参量均在1附近, 而集群的平移序参量均在0左右波动, 表明集群的涡旋状态是一个比较稳定的运动相态, 集群能够在控制律(9)和控制律转化式(10)的作用下实现从无序到涡旋有序的相态转变.
图4展示了无人机集群中的一个个体从无序到有序过程中的无人机运动速度、航向角速率的变化曲线. 可以发现, 在开始阶段, 由于无人机的初始状态是随机分布的, 其可能会出现比较大的速度变化, 以快速达到所设计的期望速度. 在速度快速下降后, 无人机通过调整自身的航向来适应集群的变化. 最终, 无人机的速度和航向角变化率均不再出现太大变化, 集群的运动趋于稳定. 除此之外可以看出, 无人机的速度和航向角速率均满足式(2)的约束条件, 验证了本文所设计的相变控制律能够对无人机集群产生有效的控制效果.
调整不同的势能平衡距离, 作出集群圆周运动半径和势能平衡距离的关系如图5所示. 图5中圆点为仿真得到的集群平均半径, 直线为式(28)得到的集群运动半径理论值. 可以看到, 涡旋半径的仿真值与理论值拟合得很好, 并且随着势能平衡距离的增加, 涡旋半径与理论值符合得越来越好, 验证了本文方法理论分析的有效性.
4.2 集群相态转换
无人机集群存在两个稳定的运动相态, 即涡旋运动相态和平移运动相态. 本节测试无人机仿鸟群相变控制律能否仅通过调整相变控制项, 完成从一个相态到另外一个相态的转换.
仿真测试每200 s为一个阶段, 总共分为三个阶段. 在第一阶段, 在空间中随机生成$ N=45 $架无人机, 设置集群相变控制强度为$ \lambda =0 $, 使集群自行演化为涡旋状态.
在第二阶段, 对处于涡旋状态的无人机集群施加一定的相变控制, 设置相变控制强度为$ \lambda =100 $, 集群中有13个无人机个体受到相变控制的作用, 进行集群涡旋相到平移相的相变测试, 使集群形成平移运动相态, 设计具体的相变控制强度$ \lambda $如下:
$$ \begin{aligned} \lambda =\left\{ \begin{aligned} & 100&& \| {\dot{\mathbf{R}}} \| \le 40\ \text{m/s} \\ & 0&& \| {\dot{\mathbf{R}}} \| > 40\ \text{m/s} \end{aligned} \right. \end{aligned} $$ (30) 第三阶段, 在已经形成直线相态的无人机集群中, 根据集群平均运动速度, 简单设计相变控制强度, 进行无人机集群从平移态到涡旋态的仿真实验, 观察集群是否能从平移运动相态转化为涡旋运动相态. 在这一阶段, 设计具体的相变控制强度$ \lambda $如下:
$$ \begin{aligned} \lambda =\left\{ \begin{aligned} & -100&& \| {\dot{\mathbf{R}}} \|>10\ \text{m/s} \\ & 0&& \| {\dot{\mathbf{R}}} \|\le 10\ \text{m/s} \end{aligned} \right. \end{aligned} $$ (31) 仿真的算法流程图如图6所示:
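式(30)和式(31)的相变控制强度调度可以写成一个简单的分段函数, 示意如下(阶段编号为示意性约定, 阈值取自上文设计):

```python
def lambda_schedule(stage, speed):
    """按式(30)和式(31)给出相变控制强度 λ.

    stage 2: 涡旋 -> 平移, 中心速度不超过 40 m/s 时施加 λ = 100;
    stage 3: 平移 -> 涡旋, 中心速度高于 10 m/s 时施加 λ = -100;
    其余情况(含第一阶段)不施加相变控制.
    """
    if stage == 2:
        return 100.0 if speed <= 40.0 else 0.0
    if stage == 3:
        return -100.0 if speed > 10.0 else 0.0
    return 0.0
```

达到速度阈值后 λ 自动归零, 集群随后依靠式(9)中前两项自行收敛到目标相态.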
图7给出了仿真过程中集群序参量和集群运动相态的变化情况, 图7(a)中展示了在相变控制项的作用下, 集群能够很快从一个运动相转换到另外一个相态. 图中的虚线标明了相变控制项的作用时间范围.
图 7 集群相态转换结果. (a) 集群序参量变化情况(第1、2条垂直虚线之间和第3、4条垂直虚线之间为相变控制项不为0的时间段. 第3、4条虚线由于距离过近在显示上略有重合, 在小图中进行了放大); (b) ~ (f) $t=180,\;205,\;300,\;405,\;500\;\text{s} $时的集群运动相态
Fig. 7 Results of phase transition. (a) Order parameter in phase transition process. (b) ~ (f) Group motion phase at $ t=180,\;205,\;300,\;405,\;500\;\text{s}$
在第一阶段, 集群初始处于无序的状态, 在相变控制律的作用下逐渐收敛到稳定的涡旋相态; 在第二阶段, 通过相变控制项, 集群中心速度迅速增加, 在增加到$ 40\;\text{m/s} $后, 停止施加相变控制项, 集群能够通过集群相互之间的作用逐渐收敛至稳定的直线运动相态; 在第三阶段, 同样通过相变控制项, 利用部分个体对集群的速度进行控制, 集群中心的速度很快降低. 在速度降低至$ 10\; \text{m/s} $后, 停止施加相变控制项, 集群最终仍然能够回到集群涡旋相态.
从仿真结果中可以看出, 在集群相态转换的过程中, 在无人机相变控制律的作用下, 通过相变控制项的作用, 无人机从各自的位置出发, 不断调整自身的运动状态和周围邻居保持一致, 使集群逐渐从一种有序的运动相态转换到另一种有序运动相态.
图8中展示了三个阶段中集群中心速度的变化曲线. 可以看出, 集群的中心速度在涡旋相态时较小; 在相变控制作用下集群中心速度逐渐增大, 当集群达到较高速度后停止相变控制, 无人机集群也能够逐渐收敛到平移运动相态; 反向施加相变控制时, 集群中心速度逐渐减小, 集群重新回到涡旋相态. 通过对集群中部分个体进行控制, 调整集群的平均速度, 实现了无人集群两种稳定相态的转换, 验证了本文所提出的无人机集群相变控制律的有效性.
4.3 相变模型对比
为了体现本文方法的优势, 将本文的算法和何亚琦[33]第六章中提到的社会力控制器进行对比, 将本文提出的相变控制项作用添加到文献的社会力控制器中, 测试集群的稳定运动模态和相态转换能力, 并与本文所提出的相变控制器进行对比.
使用文献第六章中使用的社会力模型, 设计模型参数如下: $ \beta =40/11 $, $ \gamma +\sigma =1/5500 $, $ {{l}_{r}}=400, {{l}_{a}}= 600,\;{{C}_{r}}=1,\;{{C}_{a}}=0.5,\;{{\alpha }_{p}}=50 $. 相变控制项的形式如式(30)(31)所示, 相变控制强度的数值$ \left| \lambda \right|=1000 $. 得到的仿真结果如图9所示.
从图中可以看出, 在社会力控制器作用下, 能够形成一个类似的涡旋状态的圆周运动的运动模态, 并且在本文提出的相变控制项的作用下, 能够实现类涡旋到平移运动模态的转换. 但是, 该控制器相比于本文的相变控制器而言还存在一些不足, 主要有如下两点:
其一, 在所给参数下, 给出的社会力模型的确存在一个类似的涡旋模态, 但是集群中无人机的旋转方向却并不完全一致, 在应用于实际的无人机集群中可能会导致碰撞. 这一点从序参量的变化中也能够看出;
其二, 在使用相变控制项对集群的相态进行控制时, 无人机集群的速度方向并不可控. 在给出一个方向的相变控制作用后, 集群的运动并没有达到一致, 不同个体之间的速度仍然有所差别. 一种可能的原因是集群中不同无人机个体的旋转方向不一致, 导致在运动模态转换的过程中与期望的平移运动速度出现了偏移, 形成了不同的运动方向. 因此, 相较于文献[33]给出的社会力模型, 可以认为本文所提出的相变控制器能够比较好地适应无人机集群相变应用场景.
5. 结论
本文针对自由环境中无人机集群的相变控制问题, 设计了基于仿鸟群自推进粒子模型的无人机集群相变控制方法, 通过序参量指标度量无人机集群的运动一致性程度并进而确定集群所处的运动相态, 实现了集群在两种运动相态之间的相互转换. 根据理论证明和仿真结果得出以下结论:
(1) 受现实中鸟群运动规律的启发, 设计了仿鸟群无人机集群相变控制律. 在相变控制律的作用下, 无人机集群在自组织原则的基础上能够形成两种稳定的集群运动相态, 包括平移相态和涡旋相态, 形成无人机集群稳定的一致性运动, 并分析了相关相态的一些重要参数.
(2) 通过调节具有简单形式的相变控制项, 能够仅对集群中部分个体进行控制, 实现集群中两种相态的自由切换, 通过仿真验证了无人机仿鸟群相变控制律的有效性.
128418 doi: 10.1016/j.neucom.2024.128418 [44] Dierks T, Jagannathan S. Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Transactions on Neural Networks and Learning Systems, 2012, 23(7): 1118−1129 doi: 10.1109/TNNLS.2012.2196708 [45] Wang D, Xin P, Zhao M M, Qiao J F. Intelligent optimal control of constrained nonlinear systems via receding-horizon heuristic dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(1): 287−299 doi: 10.1109/TSMC.2023.3306338 [46] Moghadam R, Natarajan P, Jagannathan S. Online optimal adaptive control of partially uncertain nonlinear discrete-time systems using multilayer neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(9): 4840−4850 doi: 10.1109/TNNLS.2021.3061414 [47] Zhang H G, Qin C B, Jiang B, Luo Y H. Online adaptive policy learning algorithm for H∞ state feedback control of unknown affine nonlinear discrete-time systems. IEEE Transactions on Cybernetics, 2014, 44(12): 2706−2718 doi: 10.1109/TCYB.2014.2313915 [48] Ming Z Y, Zhang H G, Yan Y Q, Zhang J. Tracking control of discrete-time system with dynamic event-based adaptive dynamic programming. IEEE Transactions on Circuits and Systems II: Express Briefs, 2022, 69(8): 3570−3574 [49] Luo Biao, Ouyang Zhi-Hua, Yi Xin-Ning, Liu De-Rong. Adaptive dynamic programming based visual servoing tracking control for mobile robots. Acta Automatica Sinica, 2023, 49(11): 2286−2296 [50] Ha M M, Wang D, Liu D R. Discounted iterative adaptive critic designs with novel stability analysis for tracking control. IEEE/CAA Journal of Automatica Sinica, 2022, 9(7): 1262−1272 doi: 10.1109/JAS.2022.105692 [51] Dong L, Zhong X N, Sun C Y, He H B. Adaptive event-triggered control based on heuristic dynamic programming for nonlinear discrete-time systems.
IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(7): 1594−1605 doi: 10.1109/TNNLS.2016.2541020 [52] Wang D, Hu L Z, Zhao M M, Qiao J F. Dual event-triggered constrained control through adaptive critic for discrete-time zero-sum games. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(3): 1584−1595 doi: 10.1109/TSMC.2022.3201671 [53] Yang X, Wang D. Reinforcement learning for robust dynamic event-driven constrained control. IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2024.3394251 [54] Wang Ding. Research progress on learning-based robust adaptive critic control. Acta Automatica Sinica, 2019, 45(6): 1031−1043 [55] Ren H, Jiang B, Ma Y J. Zero-sum differential game-based fault-tolerant control for a class of affine nonlinear systems. IEEE Transactions on Cybernetics, 2024, 54(2): 1272−1282 doi: 10.1109/TCYB.2022.3215716 [56] Zhang S C, Zhao B, Liu D R, Zhang Y W. Event-triggered decentralized integral sliding mode control for input-constrained nonlinear large-scale systems with actuator failures. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(3): 1914−1925 doi: 10.1109/TSMC.2023.3331150 [57] Wei Q L, Zhu L, Song R Z, Zhang P J, Liu D R, Xiao J. Model-free adaptive optimal control for unknown nonlinear multiplayer nonzero-sum game. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(2): 879−892 doi: 10.1109/TNNLS.2020.3030127 [58] Ye J, Bian Y G, Luo B, Hu M J, Xu B, Ding R. Costate-supplement ADP for model-free optimal control of discrete-time nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(1): 45−59 doi: 10.1109/TNNLS.2022.3172126 [59] Li Y Q, Yang C Z, Hou Z S, Feng Y J, Yin C K. Data-driven approximate Q-learning stabilization with optimality error bound analysis. Automatica, 2019, 103: 435−442 [60] Al-Dabooni S, Wunsch D C.
Online model-free n-step HDP with stability analysis. IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(4): 1255−1269 doi: 10.1109/TNNLS.2019.2919614 [61] Ni Z, He H B, Zhong X N, Prokhorov D V. Model-free dual heuristic dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(8): 1834−1839 doi: 10.1109/TNNLS.2015.2424971 [62] Wang D, Ha M M, Qiao J F. Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Transactions on Automatic Control, 2020, 65(3): 1272−1279 doi: 10.1109/TAC.2019.2926167 [63] Wang D, Ha M M, Qiao J F. Data-driven iterative adaptive critic control toward an urban wastewater treatment plant. IEEE Transactions on Industrial Electronics, 2021, 68(8): 7362−7369 doi: 10.1109/TIE.2020.3001840 [64] Wang D, Hu L Z, Zhao M M, Qiao J F. Adaptive critic for event-triggered unknown nonlinear optimal tracking design with wastewater treatment applications. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(9): 6276−6288 doi: 10.1109/TNNLS.2021.3135405 [65] Zhu L, Wei Q L, Guo P. Synergetic learning neuro-control for unknown affine nonlinear systems with asymptotic stability guarantees. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(2): 3479−3489 doi: 10.1109/TNNLS.2023.3347663 [66] Pang B, Jiang Z P. Adaptive optimal control of linear periodic systems: An off-policy value iteration approach. IEEE Transactions on Automatic Control, 2021, 66(2): 888−894 doi: 10.1109/TAC.2020.2987313 [67] Xu Y S, Zhao Z G, Yin S. Performance optimization and fault-tolerance of highly dynamic systems via Q-learning with an incrementally attached controller gain system. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(11): 9128−9138 doi: 10.1109/TNNLS.2022.3155876 [68] Yang X, Xu M M, Wei Q L. Adaptive dynamic programming for nonlinear-constrained H∞ control. 
IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(7): 4393−4403 doi: 10.1109/TSMC.2023.3247888 [69] Werbos P. Neural networks for control and system identification. Proceedings of the 28th IEEE Conference on Decision and Control, 1989: 260−265 [70] Prokhorov D V, Wunsch D C. Adaptive critic designs. IEEE Transactions on Neural Networks, 1997, 8(5): 997−1007 doi: 10.1109/72.623201 [71] Watkins C. Learning from Delayed Rewards [Ph.D. dissertation], King’s College, University of Cambridge, 1989. [72] Al-Tamimi A, Lewis F L, Abu-Khalaf M. Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica, 2007, 43(3): 473−481 doi: 10.1016/j.automatica.2006.09.019 [73] Kiumarsi B, Lewis F L, Modares H, Karimpour A, Naghibi-Sistani M. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica, 2014, 50(4): 1167−1175 doi: 10.1016/j.automatica.2014.02.015 [74] Jiang Y, Jiang Z P. Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 2012, 48(10): 2699−2704 doi: 10.1016/j.automatica.2012.06.096 [75] Kiumarsi B, Lewis F L, Jiang Z P. H∞ control of linear discrete-time systems: Off-policy reinforcement learning. Automatica, 2017, 78: 144−152 doi: 10.1016/j.automatica.2016.12.009 [76] Farjadnasab M, Babazadeh M. Model-free LQR design by Q-function learning. Automatica, 2022, 137: Article No. 110060 doi: 10.1016/j.automatica.2021.110060 [77] Lopez V G, Alsalti M, Müller M A. Efficient off-policy Q-learning for data-based discrete-time LQR problems. IEEE Transactions on Automatic Control, 2023, 68(5): 2922−2933 doi: 10.1109/TAC.2023.3235967 [78] Nguyen H, Dang H B, Dao P N. On-policy and off-policy Q-learning strategies for spacecraft systems: An approach for time-varying discrete-time without controllability assumption of augmented system. Aerospace Science and Technology, 2024, 146: Article No.
108972 doi: 10.1016/j.ast.2024.108972 [79] Skach J, Kiumarsi B, Lewis F L, Straka O. Actor-critic off-policy learning for optimal control of multiple-model discrete-time systems. IEEE Transactions on Cybernetics, 2018, 48(1): 29−40 doi: 10.1109/TCYB.2016.2618926 [80] Wen Y L, Zhang H G, Ren H, Zhang K. Off-policy based adaptive dynamic programming method for nonzero-sum games on discrete-time system. Journal of the Franklin Institute, 2020, 357(12): 8059−8081 doi: 10.1016/j.jfranklin.2020.05.038 [81] Xu Y, Wu Z G. Data-efficient off-policy learning for distributed optimal tracking control of HMAS with unidentified exosystem dynamics. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(3): 3181−3190 doi: 10.1109/TNNLS.2022.3172130 [82] Cui L L, Pang B, Jiang Z P. Learning-based adaptive optimal control of linear time-delay systems: A policy iteration approach. IEEE Transactions on Automatic Control, 2024, 69(1): 629−636 doi: 10.1109/TAC.2023.3273786 [83] Amirparast A, Sani S K H. Off-policy reinforcement learning algorithm for robust optimal control of uncertain nonlinear systems. International Journal of Robust and Nonlinear Control, 2024, 34(8): 5419−5437 doi: 10.1002/rnc.7278 [84] Qasem O, Gao W N, Vamvoudakis K G. Adaptive optimal control of continuous-time nonlinear affine systems via hybrid iteration. Automatica, 2023, 157: Article No. 111261 doi: 10.1016/j.automatica.2023.111261 [85] Jiang H Y, Zhou B, Duan G R. Modified λ-policy iteration based adaptive dynamic programming for unknown discrete-time linear systems. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(3): 3291−3301 doi: 10.1109/TNNLS.2023.3244934 [86] Zhao J G, Yang C Y, Gao W N, Park J H. Novel single-loop policy iteration for linear zero-sum games. Automatica, 2024, 163: Article No. 111551 doi: 10.1016/j.automatica.2024.111551 [87] Xiao Zhen-Fei, Li Jin-Na.
Two-player optimization control based on off-policy Q-learning algorithm. Control Engineering of China, 2022, 29(10): 1874−1880 [88] Liu Y, Zhang H G, Yu R, Xing Z X. H∞ tracking control of discrete-time system with delays via data-based adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(11): 4078−4085 doi: 10.1109/TSMC.2019.2946397 [89] Zhang H G, Liu Y, Xiao G Y, Jiang H. Data-based adaptive dynamic programming for a class of discrete-time systems with multiple delays. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(2): 432−441 doi: 10.1109/TSMC.2017.2758849 [90] Tan X F, Li Y, Liu Y. Stochastic linear quadratic optimal tracking control for discrete-time systems with delays based on Q-learning algorithm. AIMS Mathematics, 2023, 8(5): 10249−10265 doi: 10.3934/math.2023519 [91] Zhang L L, Zhang H G, Sun J Y, Yue X. ADP-based fault-tolerant control for multiagent systems with semi-Markovian jump parameters. IEEE Transactions on Cybernetics, 2024, 54(10): 5952−5962 doi: 10.1109/TCYB.2024.3411310 [92] Li Y, Zhang H, Wang Z P, Huang C, Yan H C. Data-driven decentralized control for large-scale systems with sparsity and communication delays. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(9): 5614−5624 doi: 10.1109/TSMC.2023.3274292 [93] Shen X Y, Li X J. Data-driven output-feedback LQ secure control for unknown cyber-physical systems against sparse actuator attacks. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2021, 51(9): 5708−5720 doi: 10.1109/TSMC.2019.2957146 [94] Qasem O, Davari M, Gao W N, Kirk D R, Chai T Y. Hybrid iteration ADP algorithm to solve cooperative, optimal output regulation problem for continuous-time, linear, multiagent systems: Theory and application in islanded modern microgrids with IBRs. IEEE Transactions on Industrial Electronics, 2024, 71(1): 834−845 doi: 10.1109/TIE.2023.3247734 [95] Zhang H G, Liang H J, Wang Z S, Feng T.
Optimal output regulation for heterogeneous multiagent systems via adaptive dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(1): 18−29 doi: 10.1109/TNNLS.2015.2499757 [96] Wang W, Chen X. Model-free optimal containment control of multi-agent systems based on actor-critic framework. Neurocomputing, 2018, 314(7): 242−250 [97] Cui L L, Wang S, Zhang J F, Zhang D S, Lai J, Zheng Y, Zhang Z Y, Jiang Z P. Learning-based balance control of wheel-legged robots. IEEE Robotics and Automation Letters, 2021, 6(4): 7667−7674 doi: 10.1109/LRA.2021.3100269 [98] Liu T, Cui L L, Pang B, Jiang Z P. A unified framework for data-driven optimal control of connected vehicles in mixed traffic. IEEE Transactions on Intelligent Vehicles, 2023, 8(8): 4131−4145 doi: 10.1109/TIV.2023.3287131 [99] Davari M, Gao W N, Aghazadeh A, Blaabjerg F, Lewis F L. An optimal synchronization control method of PLL utilizing adaptive dynamic programming to synchronize inverter-based resources with unbalanced, low-inertia, and very weak grids. IEEE Transactions on Automation Science and Engineering, 2025, 22: 24−42 doi: 10.1109/TASE.2023.3329479 [100] Wang Z Y, Wang Y Q, Davari M, Blaabjerg F. An effective PQ-decoupling control scheme using adaptive dynamic programming approach to reducing oscillations of virtual synchronous generators for grid connection with different impedance types. IEEE Transactions on Industrial Electronics, 2024, 71(4): 3763−3775 doi: 10.1109/TIE.2023.3279564 [101] Si J, Wang Y T. Online learning control by association and reinforcement. IEEE Transactions on Neural Networks, 2001, 12(2): 264−276 doi: 10.1109/72.914523 [102] Liu F, Sun J, Si J, Guo W T, Mei S W. A boundedness result for the direct heuristic dynamic programming. Neural Networks, 2012, 32: 229−235 doi: 10.1016/j.neunet.2012.02.005 [103] Sokolov Y, Kozma R, Werbos L D, Werbos P J. Complete stability analysis of a heuristic approximate dynamic programming control design. 
Automatica, 2015, 59: 9−18 doi: 10.1016/j.automatica.2015.06.001 [104] Malla N, Ni Z. A new history experience replay design for model-free adaptive dynamic programming. Neurocomputing, 2017, 266(29): 141−149 [105] Luo B, Wu H N, Huang T W, Liu D R. Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design. Automatica, 2014, 50(12): 3281−3290 doi: 10.1016/j.automatica.2014.10.056 [106] Zhao D B, Xia Z P, Wang D. Model-free optimal control for affine nonlinear systems with convergence analysis. IEEE Transactions on Automation Science and Engineering, 2015, 12(4): 1461−1468 doi: 10.1109/TASE.2014.2348991 [107] Xu J H, Wang J C, Rao J, Zhong Y J, Wu S Y, Sun Q F. Parallel cross entropy policy gradient adaptive dynamic programming for optimal tracking control of discrete-time nonlinear systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(6): 3809−3821 doi: 10.1109/TSMC.2024.3373456 [108] Wei Q L, Lewis F L, Sun Q Y, Yan P F, Song R Z. Discrete-time deterministic Q-learning: A novel convergence analysis. IEEE Transactions on Cybernetics, 2017, 47(5): 1224−1237 doi: 10.1109/TCYB.2016.2542923 [109] Wang Ding, Wang Jiang-Yu, Qiao Jun-Fei. Data-driven policy optimization for stochastic systems involving adaptive critic. Acta Automatica Sinica, 2024, 50(5): 980−990 [110] Qiao J F, Zhao M M, Wang D, Ha M M. Adjustable iterative Q-learning schemes for model-free optimal tracking control. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(2): 1202−1213 doi: 10.1109/TSMC.2023.3324215 [111] Ni Z, Malla N, Zhong X N. Prioritizing useful experience replay for heuristic dynamic programming-based learning systems. IEEE Transactions on Cybernetics, 2019, 49(11): 3911−3922 doi: 10.1109/TCYB.2018.2853582 [112] Al-Dabooni S, Wunsch D. The boundedness conditions for model-free HDP(λ).
IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(7): 1928−1942 doi: 10.1109/TNNLS.2018.2875870 [113] Zhao Q T, Si J, Sun J. Online reinforcement learning control by direct heuristic dynamic programming: From time-driven to event-driven. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(8): 4139−4144 doi: 10.1109/TNNLS.2021.3053037 [114] Wei Q L, Liao Z H, Song R Z, Zhang P J, Wang Z, Xiao J. Self-learning optimal control for ice-storage air conditioning systems via data-based adaptive dynamic programming. IEEE Transactions on Industrial Electronics, 2021, 68(4): 3599−3608 doi: 10.1109/TIE.2020.2978699 [115] Zhao J, Wang T Y, Pedrycz W, Wang W. Granular prediction and dynamic scheduling based on adaptive dynamic programming for the blast furnace gas system. IEEE Transactions on Cybernetics, 2021, 51(4): 2201−2214 doi: 10.1109/TCYB.2019.2901268 [116] Wang D, Li X, Zhao M M, Qiao J F. Adaptive critic control design with knowledge transfer for wastewater treatment applications. IEEE Transactions on Industrial Informatics, 2024, 20(2): 1488−1497 doi: 10.1109/TII.2023.3278875 [117] Qiao J F, Zhao M M, Wang D, Li M H. Action-dependent heuristic dynamic programming with experience replay for wastewater treatment processes. IEEE Transactions on Industrial Informatics, 2024, 20(4): 6257−6265 doi: 10.1109/TII.2023.3344130 [118] Luo B, Liu D R, Wu H N. Adaptive constrained optimal control design for data-based nonlinear discrete-time systems with critic-only structure. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(6): 2099−2111 doi: 10.1109/TNNLS.2017.2751018 [119] Zhao M M, Wang D, Qiao J F. Stabilizing value iteration Q-learning for online evolving control of discrete-time nonlinear systems. Nonlinear Dynamics, 2024, 112: 9137−9153 doi: 10.1007/s11071-024-09524-9 [120] Xiang Z R, Li P C, Zou W C, Ahn C K. Data-based optimal switching and control with admissibility guaranteed Q-learning. 
IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2024.3405739 [121] Li X F, Dong L, Xue L, Sun C Y. Hybrid reinforcement learning for optimal control of non-linear switching system. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(11): 9161−9170 doi: 10.1109/TNNLS.2022.3156287 [122] Li J N, Chai T Y, Lewis F L, Ding Z T, Jiang Y. Off-policy interleaved Q-learning: Optimal control for affine nonlinear discrete-time systems. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(5): 1308−1320 doi: 10.1109/TNNLS.2018.2861945 [123] Song S J, Zhao M M, Gong D W, Zhu M L. Convergence and stability analysis of value iteration Q-learning under non-discounted cost for discrete-time optimal control. Neurocomputing, 2024, 606: Article No. 128370 doi: 10.1016/j.neucom.2024.128370 [124] Song S J, Zhu M L, Dai X L, Gong D W. Model-free optimal tracking control of nonlinear input-affine discrete-time systems via an iterative deterministic Q-learning algorithm. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(1): 999−1012 doi: 10.1109/TNNLS.2022.3178746 [125] Wei Q L, Liu D R. A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems. Science China Information Sciences, 2015, 58(12): 1−15 [126] Yan P F, Wang D, Li H L, Liu D R. Error bound analysis of Q-function for discounted optimal control problems with policy iteration. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2017, 47(7): 1207−1216 doi: 10.1109/TSMC.2016.2563982 [127] Wang W, Chen X, Fu H, Wu M. Model-free distributed consensus control based on actor-critic framework for discrete-time nonlinear multiagent systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(11): 4123−4134 doi: 10.1109/TSMC.2018.2883801 [128] Luo B, Liu D R, Wu H N, Wang D, Lewis F L. Policy gradient adaptive dynamic programming for data-based optimal control.
IEEE Transactions on Cybernetics, 2017, 47(10): 3341−3354 doi: 10.1109/TCYB.2016.2623859 [129] Zhang Y W, Zhao B, Liu D R. Deterministic policy gradient adaptive dynamic programming for model-free optimal control. Neurocomputing, 2020, 387: 40−50 doi: 10.1016/j.neucom.2019.11.032 [130] Xu J H, Wang J C, Rao J, Zhong Y J, Zhao S W. Twin deterministic policy gradient adaptive dynamic programming for optimal control of affine nonlinear discrete-time systems. International Journal of Control, Automation, and Systems, 2022, 20(9): 3098−3109 doi: 10.1007/s12555-021-0473-6 [131] Xu J H, Wang J C, Rao J, Wu S Y, Zhong Y J. Adaptive dynamic programming for optimal control of discrete-time nonlinear systems with trajectory-based initial control policy. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2024, 54(3): 1489−1501 doi: 10.1109/TSMC.2023.3327450 [132] Lin M D, Zhao B. Policy optimization adaptive dynamic programming for optimal control of input-affine discrete-time nonlinear systems. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(7): 4339−4350 doi: 10.1109/TSMC.2023.3247466 [133] Lin M D, Zhao B, Liu D R. Policy gradient adaptive critic designs for model-free optimal tracking control with experience replay. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2022, 52(6): 3692−3703 doi: 10.1109/TSMC.2021.3071968 [134] Luo B, Yang Y, Liu D R. Adaptive Q-learning for data-based optimal output regulation with experience replay. IEEE Transactions on Cybernetics, 2018, 48(12): 3337−3348 doi: 10.1109/TCYB.2018.2821369 [135] Qasem O, Gutierrez H, Gao W N. Experimental validation of data-driven adaptive optimal control for continuous-time systems via hybrid iteration: An application to rotary inverted pendulum. IEEE Transactions on Industrial Electronics, 2024, 71(6): 6210−6220 doi: 10.1109/TIE.2023.3292873 [136] Li Man-Yuan, Luo Fei, Gu Chun-Hua, Luo Yong-Jun, Ding Wei-Chao. Adams algorithm based on adaptive momentum update strategy. Journal of University of Shanghai for Science and Technology, 2023, 45(2): 112−119 [137] Jiang Zhi-Xia, Song Jia-Shuai, Liu Yu-Ning. An improved adaptive momentum gradient descent algorithm. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2023, 51(5): 137−143 [138] Jiang Wen-Han, Jiang Zhi-Xia, Sun Xue-Lian. A gradient descent algorithm with modified learning rate. Journal of Changchun University of Science and Technology (Natural Science Edition), 2023, 46(6): 112−120 [139] Zhao B, Shi G, Liu D R. Event-triggered local control for nonlinear interconnected systems through particle swarm optimization-based adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(12): 7342−7353 doi: 10.1109/TSMC.2023.3298065 [140] Zhang L J, Zhang K, Xie X P, Chadli M. Adaptive critic control with knowledge transfer for uncertain nonlinear dynamical systems: A reinforcement learning approach. IEEE Transactions on Automation Science and Engineering, doi: 10.1109/TASE.2024.3453926 [141] Gao X, Si J, Huang H. Reinforcement learning control with knowledge shaping. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(3): 3156−3167 doi: 10.1109/TNNLS.2023.3243631 [142] Gao X, Si J, Wen Y, Li M H, Huang H. Reinforcement learning control of robotic knee with human-in-the-loop by flexible policy iteration. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(10): 5873−5887 doi: 10.1109/TNNLS.2021.3071727 [143] Guo W T, Liu F, Si J, He D W, Harley R, Mei S W. Online supplementary ADP learning controller design and application to power system frequency control with large-scale wind energy integration.
IEEE Transactions on Neural Networks and Learning Systems, 2016, 27(8): 1748−1761 doi: 10.1109/TNNLS.2015.2431734 [144] Zhao M M, Wang D, Ren J, Qiao J. Integrated online Q-learning design for wastewater treatment processes. IEEE Transactions on Industrial Informatics, 2025, 21(2): 1833−1842 doi: 10.1109/TII.2024.3488790 [145] Zhang H G, Wei Q L, Luo Y H. A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2008, 38(4): 937−942 doi: 10.1109/TSMCB.2008.920269 [146] Song S J, Gong D W, Zhu M L, Zhao Y Y, Huang C. Data-driven optimal tracking control for discrete-time nonlinear systems with unknown dynamics using deterministic ADP. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(1): 1184−1198 doi: 10.1109/TNNLS.2023.3323142 [147] Luo B, Liu D R, Huang T W, Wang D. Model-free optimal tracking control via critic-only Q-learning. IEEE Transactions on Neural Networks and Learning Systems, 2016, 27(10): 2134−2144 doi: 10.1109/TNNLS.2016.2585520 [148] Li C, Ding J L, Lewis F L, Chai T Y. A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. Automatica, 2021, 129: Article No. 109687 doi: 10.1016/j.automatica.2021.109687 [149] Wang D, Gao N, Ha M M, Zhao M M, Wu J L, Qiao J F. Intelligent-critic-based tracking control of discrete-time input-affine systems and approximation error analysis with application verification. IEEE Transactions on Cybernetics, 2024, 54(8): 4690−4701 doi: 10.1109/TCYB.2023.3312320 [150] Liang Z T, Ha M M, Liu D R, Wang Y H. Stable approximate Q-learning under discounted cost for data-based adaptive tracking control. Neurocomputing, 2024, 568: Article No. 127048 doi: 10.1016/j.neucom.2023.127048 [151] Wang Y, Wang D, Zhao M M, Liu A, Qiao J F. 