两团队零和博弈下熵引导的极小极大值分解强化学习方法

胡光政 朱圆恒 赵冬斌

李振兴, 庄娇娇, 杨成东, 邱建龙, 曹进德. 异构不确定二阶非线性多智能体系统事件触发状态趋同. 自动化学报, 2025, 51(4): 1−9 doi: 10.16383/j.aas.c240423
引用本文: 胡光政, 朱圆恒, 赵冬斌. 两团队零和博弈下熵引导的极小极大值分解强化学习方法. 自动化学报, 2025, 51(4): 1−14 doi: 10.16383/j.aas.c240258
Li Zhen-Xing, Zhuang Jiao-Jiao, Yang Cheng-Dong, Qiu Jian-Long, Cao Jin-De. Event-triggered state consensus of heterogeneous uncertain second-order nonlinear multi-agent systems. Acta Automatica Sinica, 2025, 51(4): 1−9 doi: 10.16383/j.aas.c240423
Citation: Hu Guang-Zheng, Zhu Yuan-Heng, Zhao Dong-Bin. Entropy-guided minimax value decomposition for reinforcement learning in two-team zero-sum games. Acta Automatica Sinica, 2025, 51(4): 1−14 doi: 10.16383/j.aas.c240258


doi: 10.16383/j.aas.c240258 cstr: 32138.14.j.aas.c240258
基金项目: 国家自然科学基金(62293541, 62136008), 北京市自然科学基金(4232056), 北京市科技新星计划(20240484514), 中国科学院“全球共性挑战专项”(104GJHZ2022013GC)资助
详细信息
    作者简介:

    胡光政:阿里巴巴集团控股有限公司高级算法工程师. 2016年获得北京理工大学学士学位. 2019年获得北京理工大学硕士学位. 2024年获得中国科学院大学博士学位. 主要研究方向为深度强化学习和多机器人博弈. E-mail: hugaungzheng2019@ia.ac.cn

    朱圆恒:中国科学院自动化研究所副研究员. 2010年获得南京大学自动化专业学士学位. 2015年获得中国科学院自动化研究所控制理论和控制工程专业博士学位. 主要研究方向为深度强化学习, 博弈理论, 博弈智能和多智能体学习. E-mail: yuanheng.zhu@ia.ac.cn

    赵冬斌:中国科学院自动化研究所研究员, 中国科学院大学教授. 分别于1994年、1996年和2000年获得哈尔滨工业大学学士学位、硕士学位和博士学位. 主要研究方向为深度强化学习, 计算智能, 自动驾驶, 游戏人工智能, 机器人. 本文通信作者. E-mail: dongbin.zhao@ia.ac.cn

Entropy-guided MiniMax Value Decomposition for Reinforcement Learning in Two-team Zero-sum Games

Funds: Supported by National Natural Science Foundation of China (62293541, 62136008), Beijing Natural Science Foundation (4232056), Beijing Nova Program (20240484514), and International Partnership Program of Chinese Academy of Sciences (104GJHZ2022013GC)
More Information
    Author Bio:

    HU Guang-Zheng Senior algorithm engineer at Alibaba Group Holding Limited. He received his bachelor degree from Beijing Institute of Technology in 2016, master degree from Beijing Institute of Technology in 2019, and Ph.D. degree from the University of Chinese Academy of Sciences in 2024. His research interest covers deep reinforcement learning and multi-robot game

    ZHU Yuan-Heng Associate professor at the Institute of Automation, Chinese Academy of Sciences. He received his bachelor degree in automation from Nanjing University in 2010, and Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences in 2015. His research interest covers deep reinforcement learning, game theory, game intelligence, and multiagent learning

    ZHAO Dong-Bin Professor at the Institute of Automation, Chinese Academy of Sciences and the University of Chinese Academy of Sciences. He received his bachelor degree, master degree, and Ph.D. degree from Harbin Institute of Technology, in 1994, 1996, and 2000, respectively. His research interest covers deep reinforcement learning, computational intelligence, autonomous driving, game artificial intelligence, and robotics. Corresponding author of this paper

  • 摘要: 在两团队零和马尔科夫博弈中, 一组玩家通过合作与另一组玩家进行对抗. 由于对手行为的不确定性和复杂的团队内部合作关系, 在高采样成本的任务中快速识别优势的分布式策略仍然具有挑战性. 鉴于此, 提出一种熵引导的极小极大值分解(Entropy-guided minimax factorization, EGMF)强化学习方法, 在线学习队内合作和队间对抗的策略. 首先, 提出基于极小极大值分解的多智能体执行器−评估器框架, 在高采样成本的、不限动作空间的任务中, 提升优化效率和博弈性能; 其次, 引入最大熵使智能体可以更充分地探索状态空间, 避免在线学习过程收敛到局部最优; 此外, 策略在时间域累加的熵值用于评估策略的熵, 并将其与分解的个体独立Q函数结合用于策略改进; 最后, 在多种博弈仿真场景和一个实体任务平台上进行方法验证, 并与其他基线方法进行比较. 结果显示EGMF可以在更少样本下学到更具有对抗性能的两团队博弈策略.
    近20年, 多智能体系统的协同控制因其在无人机编队[1]、传感器网络同步[2]、多机器人协作[3]等工程中的广泛应用, 越来越受到控制理论领域学者们的关注. 传统的协同控制算法依赖智能体间的连续信息传输, 即使信息变化很小或没有变化仍然会进行信息传输, 这会造成电能、通信带宽、网络链路的低效利用[4]. 由于事件触发通信机制可以有效地节约能源和通信带宽, 基于事件触发机制的协同控制成为多智能体系统协同控制领域的研究热点[5−6]. 文献[7]给出一些基于事件触发通信机制的多智能体系统协同控制的研究成果.

    多智能体系统事件触发协同控制领域的研究成果大多要求系统模型是精确可知的, 然而实际多智能体系统不可避免地存在未知参数、模型不确定、外部噪声等不确定因素. 文献[8]对无向网络的一类不确定非线性多智能体系统的事件触发趋同问题进行了研究. 文献[9]研究无向网络拓扑下一类二阶非线性多智能体系统的自适应事件触发趋同控制问题. 针对未知二阶非线性多智能体系统, 文献[10]利用自适应事件触发控制方法研究完全分布式控制问题. 文献[11]对网络拓扑信息未知的一般线性多智能体系统的完全分布式事件触发趋同问题进行研究. 针对控制方向未知的高阶多智能体系统, 文献[12]利用虚拟控制律设计自适应事件触发跟踪控制器. 文献[13]研究具有时滞和输入饱和的异构多智能体系统, 并给出基于观测器的事件触发趋同算法. 文献[14]利用组合测量事件触发机制, 研究拓扑结构为无向图的未知非线性二阶时滞多智能体系统的自适应趋同控制. 虽然文献[8−10, 14]研究的系统模型与本文相似, 但都采用基于组合测量的事件触发机制, 这种事件触发机制需要连续不断地监测邻居智能体的状态信息用以判断下一次触发时刻, 即算法依赖智能体间连续信息传输. 文献[15−16]利用输出调节理论, 对异构线性多智能体系统的事件触发输出同步问题进行研究. 文献[17]利用分布式内模设计, 研究一类非线性多智能体系统的事件触发全局鲁棒输出调节问题.

    上述文献的分布式控制器虽然采用了事件触发机制进行设计, 但是所给的事件触发趋同算法依然依赖智能体间的连续信息传输. 触发函数对邻居智能体状态信息连续监测问题引起了研究人员的注意. 文献[18]利用基于反步法的分布式自适应输出反馈控制策略研究不确定异构线性多智能体系统的事件触发输出同步问题. 针对由一类高阶不确定非线性系统构成的无领导型异构多智能体系统, 文献[19]给出基于事件触发机制的分布式自适应趋同算法. 文献[20]分别对同构和异构线性多智能体系统的事件触发平均跟踪算法进行研究. 针对异构领导−跟随者型多智能体系统, 文献[21]分别给出基于模型和基于数据的事件触发趋同算法. 文献[22]基于动态事件触发机制, 对一般线性多智能体系统的编队包含控制问题进行研究. 针对拓扑为有向网络的不确定下三角非线性多智能体系统, 文献[23]利用神经网络设计分布式自适应异步事件触发趋同算法. 基于输出调节理论, 文献[24]研究异构线性多智能体系统的自适应事件触发输出趋同控制, 文献[25]研究一类异构非线性多智能体系统的分布式事件触发输出趋同控制问题, 文献[26]研究严格反馈非线性多智能体系统的半全局周期事件触发输出调节问题.

    受上述文献启发, 本文研究异构不确定二阶非线性多智能体系统的事件触发状态趋同问题, 主要贡献有如下$ 3 $点: 1)本文研究领导−跟随者型异构不确定多智能体系统的状态趋同问题, 不仅跟随智能体的动力学方程存在不确定参数, 领导智能体也存在不确定参数. 文献[10, 15−16, 24−26]中的领导智能体均为完全已知的, 并未考虑领导智能体存在不确定参数的情形. 2)本文基于邻居智能体的观测状态设计事件触发趋同算法, 由于对邻居智能体的状态进行观测, 避免了事件触发函数对邻居智能体的连续监测, 做到控制器与触发函数都不依赖智能体间的连续信息传输. 同样研究异构不确定二阶非线性多智能体系统事件触发控制的文献[9−10], 其事件触发函数需要对邻居智能体的状态进行连续监测. 3)本文不确定参数为矩阵形式而非向量形式, 不同于以往将矩阵转变为向量的处理方法, 本文直接利用矩阵迹的不等式对矩阵自适应参数估计的收敛性进行证明.

    为方便表示, 本文使用如下向量与矩阵的符号: $ ||\cdot||_{\rm{F}} $和$ ||\cdot|| $分别表示向量或矩阵的Frobenius范数和2范数, $ \otimes $为矩阵的克罗内克积, $ \mathrm{diag}\{a_1,\;\cdots,\;a_N\} $表示对角元素为$ a_i $的对角矩阵, $ \mathrm{tr}\{A\} $表示方阵$ A $的迹, $ 1_N $表示每个元素都为$ 1 $的$ N $维常向量, $ I $表示单位矩阵, $ \lambda_{1X} $和$ \lambda_{NX} $分别表示$ N $阶对称矩阵$ X $的最小和最大特征根, $ {\cal{A}}(t) $表示渐近收敛到$ \boldsymbol 0 $的函数集合.

    本文研究领导−跟随者型异构不确定二阶非线性多智能体系统事件触发趋同控制问题. 第$ i $个跟随智能体的动力学方程为:

    $$ \begin{equation} \left\{\begin{aligned} \dot{x}_i(t)& = y_i(t)\\ \dot{y}_i(t)& = \theta_i^{\mathrm{T}}\phi_i(x_i(t),\;y_i(t))+u_i(t) \end{aligned}\right. \end{equation} $$ (1)

    式中, $ x_i,\;y_i,\;u_i\in {\bf{R}}^n $分别表示第$ i $个智能体的位置、速度和控制输入; $ \theta_i\in {\bf{R}}^{n_i\times n} $为不确定常矩阵; $ \phi_i: {\bf{R}}^n\times {\bf{R}}^n\rightarrow {\bf{R}}^{n_i} $为已知向量函数.
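    作为直观示意(非原文内容), 下面给出对式(1)做前向欧拉数值积分的最小 Python 草图, 其中取 $ n = 1 $, 非线性项 $ \phi_i(x,\;y) = \sin(x) $ 与参数 $ \theta_i = -9.8 $ 均为假设的演示值:

```python
import numpy as np

def follower_step(x, y, theta, phi, u, dt):
    """对跟随智能体动力学 (1) 做一步前向欧拉积分:
    x' = y, y' = theta^T * phi(x, y) + u (此处 n = 1, theta 为标量)."""
    x_next = x + dt * y
    y_next = y + dt * (theta * phi(x, y) + u)
    return x_next, y_next

# 示例: phi_i(x, y) = sin(x), theta_i = -9.8 (假设值), u = 0, 模拟 1 s
phi = lambda x, y: np.sin(x)
x, y = 0.5, 0.0
for _ in range(1000):
    x, y = follower_step(x, y, -9.8, phi, 0.0, 1e-3)
```

    该草图对应一个无控制输入的单摆式系统, 状态保持有界振荡.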

    领导智能体标记为$ 0 $号智能体, 其动力学方程为含有未知输入的二阶积分器型系统:

    $$ \begin{equation} \left\{\begin{aligned} \dot{x}_0(t)& = y_0(t)\\ \dot{y}_0(t)& = \theta_0^{\mathrm{T}} \phi(t) \end{aligned}\right. \end{equation} $$ (2)

    式中, $ x_0,\;y_0\in {\bf{R}}^n $分别为领导智能体的位置和速度; $ \theta_0\in {\bf{R}}^{n_0\times n} $为不确定常矩阵; $ \phi(t):[0,\;\infty)\rightarrow {\bf{R}}^{n_0} $为已知向量函数.

    本文的目标是设计基于事件触发机制的趋同控制算法, 使得$ \lim_{t\rightarrow\infty}(x_i(t) - x_0(t)) = \boldsymbol 0,\;\lim_{t\rightarrow\infty}(y_i(t) - y_0(t)) = \boldsymbol 0 $.

    领导−跟随者型多智能体系统(1)、(2)的网络拓扑用有向图$ {\cal{G}} = \{{\cal{V}},\;{\cal{E}}\} $描述, 其中$ {\cal{V}} = \{0,\;1,\; \cdots, N\} $为智能体集合, $ {\cal{E}} \subseteq {\cal{V}}\times{\cal{V}} $为边集. $ (i,\;j)\in{\cal{E}} $表示一条从智能体$ j $到智能体$ i $的有向边, 相应的邻接权重$ a_{ij}>0 $, 否则$ a_{ij} = 0 $. 有向边序列$ (i_l,\; i_{l-1}), l=1,\; \cdots,\; k\,\; $表示从智能体$ i_0 $到智能体$ i_k $的一条路径. 图$ {\cal{G}} $的拉普拉斯矩阵$ {\cal{L}} $定义为$ l_{ii} = \sum_{j \,\;=\,\; 0}^Na_{ij}, l_{ij} = -a_{ij},\;i\neq j $.
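    上述由邻接权重构造拉普拉斯矩阵的定义可示意如下(Python 草图, 邻接矩阵取任意假设的示例权重):

```python
import numpy as np

def laplacian(A):
    """由邻接矩阵 A (a_ij > 0 表示存在从 j 到 i 的有向边) 构造拉普拉斯矩阵:
    l_ii = sum_j a_ij, l_ij = -a_ij (i != j)."""
    L = -np.array(A, dtype=float)
    np.fill_diagonal(L, 0.0)              # 先清零对角线
    np.fill_diagonal(L, -L.sum(axis=1))   # 对角元取为本行邻接权重之和
    return L

# 示例: 3 个智能体, 权重为假设值
A = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.2],
     [0.0, 0.3, 0.0]]
L = laplacian(A)   # 每行元素之和为 0
```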

    注1. 由于领导智能体不能接收到跟随智能体的信息, 有向图$ {\cal{G}} $的拉普拉斯矩阵$ {\cal{L}} $可表示为:

    $$ \begin{equation*} {\cal{L}} = \left[\begin{array}{cc}0&{\bf 0}_{1\times N}\\ *&L \end{array}\right],\; \; L\in {\bf{R}}^{N\times N},\; \; *\in {\bf{R}}^{N} \end{equation*} $$

    由文献[27]的引理3可知, 当假设1成立时, 矩阵$ L $是非奇异的, 并且存在矩阵$ Q = \mathrm{diag}\{1/ q_1, \cdots,\;1/ q_N\} $, $ H = (QL+L^{\mathrm{T}}Q) /{2}$为正定矩阵, 其中$ [q_1,\;\cdots,\; q_N]^{\mathrm{T}} = L^{-1}1_N $.

    为证明算法的稳定性, 需要以下假设和引理.

    假设1. 对于任意跟随智能体$ i,\;i = 1,\;\cdots,\;N $, 至少存在一条由领导智能体到跟随智能体$ i $的有向路径.

    假设2. $ \phi(t) $和$ \phi_i(x_i(t),\;y_i(t)) $为不恒等于$ \bf 0 $的有界向量函数.

    假设3. 在不确定输入$ \theta_0^{\mathrm{T}}\phi(t) $的作用下, 领导智能体的状态有界.

    引理1[28]. 考虑如下系统:

    $$ \begin{equation} \dot{x}(t) = f(t,\;x(t),\;u(t)) \end{equation} $$ (3)

    式中, $ f:[0,\;\infty)\times {\bf{R}}^n\times {\bf{R}}^m\rightarrow {\bf{R}}^n $对$ t $是分段连续的, 对$ x(t) $和$ u(t) $满足局部Lipschitz条件. 输入$ u(t) $对所有$ t\geq0 $是分段连续且有界的函数. 如果系统(3)是输入状态稳定的且$ u(t)\in{\cal{A}}(t) $, 则亦有状态$ x(t) \in {\cal{A}}(t) $.
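    引理1的含义可用一个标量例子直观验证(非原文内容, 参数均为假设的演示值): 系统 $ \dot{x} = -x + u $ 是输入状态稳定的, 当输入 $ u(t) = \mathrm{e}^{-t}\in{\cal{A}}(t) $ 时, 状态同样渐近收敛到 $ 0 $:

```python
import numpy as np

dt, T = 1e-3, 20.0
x = 1.0
traj = []
for k in range(int(T / dt)):
    t = k * dt
    u = np.exp(-t)         # u(t) ∈ A(t): 渐近收敛到 0 的输入
    x = x + dt * (-x + u)  # ISS 系统 x' = -x + u 的欧拉积分
    traj.append(x)
```

    解析解为 $ x(t) = \mathrm{e}^{-t}(x(0) + t) $, 与数值结果一致地趋于 $ 0 $.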

    由于领导智能体的参数$ \theta_0 $不确定, 首先为领导智能体设计如下参数观测器:

    $$ \begin{equation} \left\{\begin{aligned} \dot{\hat{y}}_0& =( \hat{\theta}_0^0)^{\mathrm{T}}\phi(t)-s_0(\hat{y}_0-y_0)\\ \dot{\hat{\theta}}_0^0& = -\phi(t)(\hat{y}_0-y_0)^{\mathrm{T}} \end{aligned}\right. \end{equation} $$ (4)

    式中, $\hat{y}_0 $为领导智能体速度状态的观测值, $s_0>0 $为正数, $ \hat{\theta}_0^0(t) $用以估计参数$ \theta_0 $. 跟随智能体的参数$ \theta_i $同样不确定, 设计如下参数观测器:

    $$ \begin{equation} \left\{\begin{aligned} \dot{\hat{y}}_i& = \hat{\theta}_i^{\mathrm{T}}\phi_i(x_i,\;y_i)+u_i-s_i(\hat{y}_i-y_i)\\ \dot{\hat{\theta}}_i& = -\phi_i(x_i,\;y_i)(\hat{y}_i-y_i)^{\mathrm{T}} \end{aligned}\right. \end{equation} $$ (5)

    式中, $\hat{y}_i $为第i个智能体速度状态的观测值, $ s_i>0 $为正数, $ \hat{\theta}_i(t) $用以估计参数$ \theta_i $.
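    以观测器(4)为例, 下面的数值草图(非原文内容)演示 $ \hat{\theta}_0^0 $ 对 $ \theta_0 $ 的渐近估计, 其中取 $ n = 1 $, $ \phi(t) = [\sin t,\;\cos 2t]^{\mathrm{T}} $(与下文仿真一致), 真值 $ \theta_0 $ 与增益 $ s_0 $ 为假设的演示值:

```python
import numpy as np

dt, T, s0 = 1e-3, 40.0, 2.0
theta0 = np.array([1.0, -0.5])                 # 待估计的未知参数 (假设真值)
phi = lambda t: np.array([np.sin(t), np.cos(2 * t)])

y0 = 0.0                                       # 领导智能体速度 (标量, n = 1)
y_hat, theta_hat = 0.0, np.zeros(2)            # 观测器 (4) 的状态
for k in range(int(T / dt)):
    t = k * dt
    e = y_hat - y0
    # 观测器 (4): y_hat' = theta_hat^T phi - s0 (y_hat - y0),
    #             theta_hat' = -phi (y_hat - y0)
    y_hat += dt * (theta_hat @ phi(t) - s0 * e)
    theta_hat += dt * (-phi(t) * e)
    y0 += dt * (theta0 @ phi(t))               # 领导智能体动力学 (2) 的速度分量

err = np.linalg.norm(theta_hat - theta0)
```

    由于 $ \sin t $ 与 $ \cos 2t $ 满足持续激励条件, 估计误差随时间明显衰减, 与命题1的结论一致.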

    由于领导智能体含有不确定控制输入$ \theta_0^{\mathrm{T}}\phi(t) $, 为了使跟随智能体跟踪上领导智能体, 为跟随智能体$ i $设计如下$ \theta_0 $参数的观测器:

    $$ \begin{equation} \dot{\hat{\theta}}_0^i(t) = -\mu\sum\limits_{j = 0}^Na_{ij}(\hat{\theta}_0^i(t_k^i)-\hat{\theta}_0^j(t_{k'}^j)) \end{equation} $$ (6)

    式中, $ \mu>0 $为常数, $ t_k^i $和$ t_{k'}^j $为智能体$ i $和$ j $的事件触发时刻, 并且有$ t_0^i = t_0^j = 0 $.

    在触发时刻$ t_{k'}^j $, 智能体$ j $将其采样信息$ \hat{\theta}_0^j(t_{k'}^j) $, $ x_j(t_{k'}^j) $和$ y_j(t_{k'}^j) $发送给邻居智能体$ i $. 智能体$ i $利用采样信息$ \hat{\theta}_0^j(t_{k'}^j) $, $ x_j(t_{k'}^j) $和$ y_j(t_{k'}^j) $估计智能体$ j $在下一次采样时刻$ t_{(k+1)'}^j $前的位置和速度. 用$ \hat{x}_j^i(t) $和$ \hat{y}_j^i(t) $表示时间段$ [t_{k'}^j,\;t_{(k+1)'}^j) $内智能体$ i $对智能体$ j $的状态信息估计, 状态估计方程为:

    $$ \begin{equation} \left\{\begin{aligned} \dot{\hat{x}}_j^i(t)& = \hat{y}_j^i(t)\\ \dot{\hat{y}}_j^i(t)& = (\hat{\theta}_0^{j}(t_{k'}^j))^{\mathrm{T}}\phi(t) \end{aligned}\right. \end{equation} $$ (7)

    式中, 初始状态分别为$ \hat{x}_j^i(t_{k'}^j) = x_j(t_{k'}^j) $, $ \hat{y}_j^i(t_{k'}^j) = y_j(t_{k'}^j) $.
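    式(7)的邻居状态估计可示意如下(Python 草图, 非原文内容): 从最近一次触发时刻的采样值出发, 用广播的参数估计对邻居状态做开环预测; 其中 $ \phi(t) $ 与参数取值均为假设的演示值:

```python
import numpy as np

def predict_neighbor(x_s, y_s, theta_hat, phi, t_s, t_end, dt=1e-3):
    """式 (7): 以触发时刻 t_s 的采样 (x_s, y_s) 为初值,
    积分 x' = y, y' = theta_hat^T phi(t) 到 t_end."""
    x, y, t = x_s, y_s, t_s
    while t < t_end:
        x += dt * y
        y += dt * (theta_hat @ phi(t))
        t += dt
    return x, y

phi = lambda t: np.array([np.sin(t), np.cos(2 * t)])
theta = np.array([1.0, -0.5])                 # 邻居真实参数 (假设值)
theta_hat = theta + np.array([0.1, 0.0])      # 带少量估计误差的广播参数

x_true, _ = predict_neighbor(0.0, 0.0, theta, phi, 0.0, 2.0)
x_pred, _ = predict_neighbor(0.0, 0.0, theta_hat, phi, 0.0, 2.0)
d = abs(x_pred - x_true)   # 参数误差经双重积分后的位置预测偏差
```

    当广播参数收敛到真值时, 所有接收者按同一离散化积分得到完全一致的估计, 这正是正文所述"相同状态估计值"的来源.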

    同时, 智能体$ j $也将利用其事件触发采样信息估计其自身的状态信息. 如果智能体$ i $和$ l $同时接收到智能体$ j $的事件触发采样信息, 则不难验证智能体$ i $, $l $和$ j $拥有相同状态估计值, 即:

    $$ \hat{x}_j^i(t) = \hat{x}_j^l(t) = \hat{x}_j^j(t),\;\hat{y}_j^i(t) = \hat{y}_j^l(t) = \hat{y}_j^j(t) $$

    记$ \hat{\xi}_{ix} = \sum_{j = 0}^N a_{ij}(\hat{x}_i^i - \hat{x}_j^i),\; \hat{\xi}_{iy} = \sum_{j = 0}^N a_{ij}(\hat{y}_i^i - \hat{y}_j^i) $, 为跟随智能体式(1)设计如下事件触发趋同控制器:

    $$ \begin{equation} u_i = -\hat{\theta}_i^{\mathrm{T}}\phi_i(x_i,\;y_i)+(\hat{\theta}_0^{i})^{\mathrm{T}}\phi(t)-ck_1\hat{\xi}_{ix}-ck_2\hat{\xi}_{iy} \end{equation} $$ (8)

    式中, $ k_1 $, $ k_2>0 $为耦合增益; $ c>0 $为反馈增益. $ k_1 $, $ k_2 $和$ c $可根据下文式(22)选取. 智能体$ i $的第$ k+1 $次事件触发时刻由如下条件给出:

    $$ \begin{equation} t_{k+1}^i = \min\{t>t_k^i|T_{i1}(t)\geq0\; \mathrm{or}\; T_{i2}(t)\geq0\} \end{equation} $$ (9)

    式中, $T_{i1}(t) = ||\epsilon_i(t)||_{\rm{F}}^2 - f_{i1}(t),\; T_{i2}(t) = ||e_i(t)||^2 \;- f_{i2}(t)$, $ \epsilon_i(t) = \hat{\theta}_0^i(t_k^i)-\hat{\theta}_0^i(t) $, $ e_i(t) = k_1e_{ix}(t)+ k_2e_{iy}(t) $,$ e_{ix}(t) =\hat{x}_i^i(t)-x_i(t) $, $ e_{iy}(t) = \hat{y}_i^i(t)-y_i(t) $, 正函数$ f_{i1}(t),\;f_{i2}(t)\in{\cal{A}}(t) $.
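    触发条件(9)的判断逻辑可写成如下草图(非原文内容; $ f_{i1},\;f_{i2} $ 取下文仿真采用的 $ 0.1/(1+0.5t) $ 形式):

```python
import numpy as np

def should_trigger(eps_i, e_i, t, f1=None, f2=None):
    """式 (9): 当 ||eps_i||_F^2 >= f_i1(t) 或 ||e_i||^2 >= f_i2(t) 时触发."""
    f1 = f1 or (lambda t: 0.1 / (1 + 0.5 * t))
    f2 = f2 or (lambda t: 0.1 / (1 + 0.5 * t))
    T1 = np.linalg.norm(eps_i, 'fro') ** 2 - f1(t)
    T2 = np.linalg.norm(e_i) ** 2 - f2(t)
    return T1 >= 0 or T2 >= 0

eps = np.zeros((2, 1))                                  # 参数采样误差为零
fired_small = should_trigger(eps, np.array([0.05]), t=0.0)  # 0.0025 < 0.1
fired_big = should_trigger(eps, np.array([0.4]), t=0.0)     # 0.16 >= 0.1
```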

    领导智能体$ 0 $的第$ k+1 $次事件触发时刻由如下条件确定:

    $$ \begin{equation} t_{k+1}^0 = \min\{t>t_k^0|T_{01}(t)\geq0\; \mathrm{or}\; T_{02}(t)\geq0\} \end{equation} $$ (10)

    式中, 各符号定义与式(9)中符号定义类似.

    注2. 跟随智能体的控制输入式(8)只依赖其自身状态、邻居智能体的估计状态和估计参数$ \hat{\theta} _0^i(t), \hat{\theta}_i(t) $, 仅需要邻居智能体提供离散的信息 $ \hat{\theta}_0^j(t_{k'}^j) $, $ x_j(t_{k'}^j) $和$ y_j(t_{k'}^j) $, 不依赖邻居智能体的任何连续信息传输. 同样, 事件触发条件(9)、(10)也不依赖邻居智能体的任何连续信息传输. 因此, 本文提出的事件触发趋同算法完全不依赖智能体间的连续信息传输.

    命题1. 如果假设2成立, 参数观测器式(4)、式(5) 中的$ \hat{\theta}_0^0(t) $和$ \hat{\theta}_i(t) $可渐近收敛到$ \theta_0 $和$ \theta_i $, 即$ \lim_{t\rightarrow\infty}\hat{\theta}_0^0(t) = \theta_0 $, $ \lim_{t\rightarrow\infty}\hat{\theta}_i(t) = \theta_i. $

    证明. 记$ \tilde{y}_i(t) = \hat{y}_i(t)-y_i(t) $, $ \tilde{\theta}_i(t) = \hat{\theta}_i(t)- \theta_i $. 对于观测器式(5), 可得:

    $$ \begin{equation} \left\{\begin{aligned} \dot{\tilde{y}}_i(t)& = \tilde{\theta}_i^{\mathrm{T}}(t)\phi_i(x_i(t),\;y_i(t))-s_i\tilde{y}_i(t)\\ \dot{\tilde{\theta}}_i(t)& = -\phi_i(x_i(t),\;y_i(t))\tilde{y}_i^{\mathrm{T}}(t) \end{aligned}\right. \end{equation} $$ (11)

    选取如下李雅普诺夫函数:

    $$ V_{i1} = \frac{1}{2}\tilde{y}_i^{\mathrm{T}}(t)\tilde{y}_i(t)+\frac{1}{2}\mathrm{tr}\{\tilde{\theta}_i^{\mathrm{T}}(t) \tilde{\theta}_i(t)\} $$

    沿式(11)的轨迹求$ V_{i1} $的导数, 可得:

    $$ \dot{V}_{i1} = -s_i\tilde{y}_i^{\mathrm{T}}(t)\tilde{y}_i(t) $$

    这表明$ \lim_{t\rightarrow\infty}\tilde{y}_i(t) = \bf 0 $. 由不变性原理和系统(11)可知, 在极限集上$ \tilde{y}_i(t) \equiv \bf 0 $, 进而$ \tilde{\theta}_i^{\mathrm{T}}(t)\phi_i(x_i (t),\;y_i(t)) \equiv \bf 0 $. 由假设2可知$ \phi_i (x_i(t),\;y_i(t)) $不恒等于$ \bf 0 $且有界, 从而可得$ \lim_{t\rightarrow\infty} \hat{\theta}_i(t) = \theta_i $. 同理可证$ \lim_{t\rightarrow\infty} \hat{\theta}_0^0(t) = \theta_0 $.

    命题2. 如果假设1和假设2成立, 在事件触发条件(9)、(10) 作用下, 估计参数$ \hat{\theta}_0^i(t) $渐近收敛至$ \theta_0 $.

    证明. 记$ \zeta_i(t) = \sum_{j = 0}^Na_{ij}(\tilde{\theta}_0^i(t)-\tilde{\theta}_0^j(t)) $, $ \sigma_i(t) = \sum_{j = 0}^Na_{ij}(\epsilon_i(t)-\epsilon_j(t)) $, $ \tilde{\theta}_0^i(t) = \hat{\theta}_0^i(t)-\theta_0 $. 由式(6)可得:

    $$ \begin{equation} \dot{\tilde{\theta}}_0^i(t) = -\mu\zeta_i(t)-\mu\sigma_i(t) \end{equation} $$ (12)

    选取如下李雅普诺夫函数:

    $$ \begin{equation} V_2 = \sum\limits_{i = 1}^N\frac{1}{2q_i}\mathrm{tr}\{\zeta_i^{\mathrm{T}}(t)\zeta_i(t)\} \end{equation} $$ (13)

    由式(12)可得$ V_2 $的导数:

    $$ \begin{equation*} \begin{aligned} \dot{V}_2 = \;&-\mu\mathrm{tr}\{\zeta^{\mathrm{T}}((QL)\otimes I_{n_0})\zeta\}\;-\\ &\mu\mathrm{tr}\{\zeta^{\mathrm{T}}((QL)\otimes I_{n_0})\sigma\}\;+\\ &\sum_{i = 1}^N\frac{a_{i0}}{q_i}\mathrm{tr}\{\zeta_i^{\mathrm{T}}\phi(t)\tilde{y}_0^{\mathrm{T}}\} \end{aligned} \end{equation*} $$

    式中, $ \zeta = [\zeta_1^{\mathrm{T}},\;\cdots,\;\zeta_N^{\mathrm{T}}]^{\mathrm{T}} $, $ \sigma = [\sigma_1^{\mathrm{T}},\;\cdots,\;\sigma_N^{\mathrm{T}}]^{\mathrm{T}} $.

    对于$ \dot{V}_2 $的第1项, 由附录的引理2可得:

    $$ \begin{equation} \begin{split} \mathrm{tr}\{\zeta^{\mathrm{T}}((QL)\otimes I_{n_0})\zeta\} = \;&\mathrm{tr}\{\zeta^{\mathrm{T}}(H\otimes I_{n_0})\zeta\}\;\geq\\ & \lambda_{1H}\sum_{i = 1}^N\mathrm{tr}\{\zeta_i^{\mathrm{T}}\zeta_i\} \end{split} \end{equation} $$ (14)

    记$ L_e $为$ L $的增广矩阵, 即$ L_e = [-a_0|L] $, $ a_0 = [a_{10},\;\cdots,\;a_{N0}]^{\mathrm{T}} $. 令$ \epsilon(t) = [\epsilon_0^{\mathrm{T}}(t),\;\epsilon_1^{\mathrm{T}}(t),\;\cdots,\; \epsilon_N^{\mathrm{T}}(t)]^{\mathrm{T}} $, $ \Xi = QLL^{\mathrm{T}}Q $, $ \Delta = L_e^{\mathrm{T}}L_e $. 易证$ \sigma(t) = (L_e\otimes I_{n_0})\epsilon(t) $. 对于$ \dot{V}_2 $的后2项, 由附录A的引理2和引理3可得:

    $$ \begin{split} & -\mathrm{tr}\{\zeta^{\mathrm{T}}((QL)\otimes I_{n_0})\sigma\}\leq\frac{\eta_1}{2}\mathrm{tr}\{\zeta^{\mathrm{T}}(\Xi\otimes I_{n_0})\zeta\}\; +\\ &\frac{1}{2\eta_1}\mathrm{tr}\{\epsilon^{\mathrm{T}}(\Delta\otimes I_{n_0})\epsilon\}\leq \frac{\eta_1\lambda_{N\Xi}}{2}\sum_{i = 1}^N\mathrm{tr}\{\zeta_i^{\mathrm{T}}\zeta_i\} \;+\\ &\frac{\lambda_{N\Delta}}{2\eta_1}\sum_{i = 0}^N\mathrm{tr}\{\epsilon_i^{\mathrm{T}}\epsilon_i\}\\[-1pt] \end{split} $$ (15)
    $$ \begin{equation} \begin{split} & \sum_{i = 1}^N\frac{a_{i0}}{q_i}\mathrm{tr}\{\zeta_i^{\mathrm{T}}\phi(t)\tilde{y}_0^{\mathrm{T}}\}\leq \frac{\eta_2}{2}\sum_{i = 1}^N\mathrm{tr}\{\zeta_i^{\mathrm{T}}\zeta_i\}\;+\\ &\;\;\;\sum_{i = 1}^N\frac{a_{i0}^2}{2\eta_2q_i^2}\mathrm{tr}\{\tilde{y}_0\phi^{\mathrm{T}}(t)\phi(t)\tilde{y}_0^{\mathrm{T}}\} \end{split} \end{equation} $$ (16)

    式中, $ \eta_1\in(0,\;\lambda_{1H}/ \lambda_{N\Xi}) $, $ \eta_2\in(0,\;\mu\lambda_{1H}) $.

    将式(14) ~ 式(16)代入$ \dot{V}_2 $, 有:

    $$ \begin{equation*} \begin{aligned} \dot{V}_2\leq&-(\mu(\lambda_{1H}-\frac{\eta_1\lambda_{N\Xi}}{2})-\frac{\eta_2}{2}) \sum_{i = 1}^N\mathrm{tr}\{\zeta_i^{\mathrm{T}}\zeta_i\}\;+\\ &\frac{\mu\lambda_{N\Delta}}{2\eta_1}\sum_{i = 0}^N\mathrm{tr}\{\epsilon_i^{\mathrm{T}}\epsilon_i\} +\sum_{i = 1}^N\frac{a_{i0}^2}{2\eta_2q_i^2}\mathrm{tr}\{\tilde{y}_0\phi^{\mathrm{T}}\phi\tilde{y}_0^{\mathrm{T}}\} \end{aligned} \end{equation*} $$

    令$ \kappa = \min\{q_i(\mu(2\lambda_{1H}-\eta_1\lambda_{N\Xi})-\eta_2)\} $. 由事件触发条件(9)、(10)和命题1易知存在一个函数$ b(t)\in{\cal{A}}(t) $, 使得:

    $$ \frac{\mu\lambda_{N\Delta}}{2\eta_1}\sum\limits_{i = 0}^N\mathrm{tr}\{\epsilon_i^{\mathrm{T}}\epsilon_i\} +\sum\limits_{i = 1}^N\frac{a_{i0}^2}{2\eta_2q_i^2}\mathrm{tr}\{\tilde{y}_0\phi^{\mathrm{T}}\phi\tilde{y}_0^{\mathrm{T}}\}\leq b(t) $$

    $$ \begin{equation*} \dot{V}_2\leq -\kappa V_2+b(t) \end{equation*} $$

    由引理1可得$ V_2(t)\in{\cal{A}}(t) $, 即$ \lim_{t\rightarrow\infty}\zeta(t) = \bf 0 $. 记$ \tilde{\Theta}_0(t) = [(\tilde{\theta}_0^{1})^{\mathrm{T}},\;\cdots,\;(\tilde{\theta}_0^{N})^{\mathrm{T}}]^{\mathrm{T}} $, 易得:

    $$ \zeta(t) = (L\otimes I_{n_0})\tilde{\Theta}_0(t)-a_0\otimes \tilde{\theta}_0^0(t) $$

    由命题1可知$ \lim_{t\rightarrow\infty}\tilde{\theta}_0^0(t) = \bf 0 $, 又因$L $为非奇异矩阵, 可得$ \lim_{t\rightarrow\infty}\tilde{\Theta}_0(t) = \bf 0 $, 即$ \hat{\theta}_0^i(t) $渐近收敛至$ \theta_0 $.

    注3. 由命题1和命题2可知, 观测器式(4)和式(5)可实现对参数$ \theta_0 $和$ \theta_i $的渐近估计, 分布式观测器式(6)在观测器式(4)基础上, 可渐近收敛到$ \theta_0 $. 只有观测器渐近收敛时, 所设计的事件触发趋同算法才可达到渐近趋同, 否则只能达到一致渐近有界趋同. 此外, 不确定参数$ \theta_0 $和$ \theta_i $均为矩阵而非向量, 命题1和命题2直接采用矩阵迹的不等式进行收敛性证明. 相比转化为扩维向量, 本文算法更简单明了.

    定理1. 如果假设1 ~ 3成立, 则事件触发算法式(8)、式(9)可使领导−跟随者型多智能体系统达到状态趋同.

    证明. 记$ \xi_{ix} = \sum_{j = 0}^Na_{ij}(x_i - x_j),\; \xi_{iy} = \sum_{j = 0}^N a_{ij}(y_i-y_j) $为第$ i $个跟随智能体的相对状态信息, 易证:

    $$ \begin{equation} \left\{\begin{aligned} \dot{\xi}_{ix} = \;&\xi_{iy}\\ \dot{\xi}_{iy} = \;&\sum_{j = 1}^Na_{ij}(\tilde{\theta}_j^{\mathrm{T}}\phi_j-\tilde{\theta}_i^{\mathrm{T}}\phi_i) -a_{i0}\tilde{\theta}_i^{\mathrm{T}}\phi_i\;+\\ &\sum_{j = 1}^Na_{ij}((\tilde{\theta}_0^{i})^{\mathrm{T}}-(\tilde{\theta}_0^{j})^{\mathrm{T}})\phi(t)+a_{i0}(\tilde{\theta}_0^{i})^{\mathrm{T}}\phi(t)\;-\\ &ck_1\sum_{j = 1}^Na_{ij}(\hat{\xi}_{ix}-\hat{\xi}_{jx})-ck_1a_{i0}\hat{\xi}_{ix}\;-\\ &ck_2\sum_{j = 1}^Na_{ij}(\hat{\xi}_{iy}-\hat{\xi}_{jy})-ck_2a_{i0}\hat{\xi}_{iy} \end{aligned}\right. \end{equation} $$ (17)

    记$ \xi_i = k_1\xi_{ix}+k_2\xi_{iy} $. 选取如下李雅普诺夫函数:

    $$ \begin{equation} V_3 = \sum\limits_{i = 1}^N\frac{\rho_i}{2}\xi_{ix}^{\mathrm{T}}\xi_{ix}+\sum\limits_{i = 1}^N\frac{1}{2q_i}\xi_i^{\mathrm{T}}\xi_i \end{equation} $$ (18)

    式中, $ \rho_i = k_1^2/q_i $.

    沿式(17)的轨迹可得$ V_3 $的导数:

    $$ \begin{equation*} \begin{aligned} \dot{V}_3 =\; &\sum_{i = 1}^N\left(-\frac{\rho_ik_1}{k_2}\xi_{ix}^{\mathrm{T}}\xi_{ix}+\frac{k_1}{q_ik_2}\xi_i^{\mathrm{T}}\xi_i\right)\;+\\ &k_2\xi^{\mathrm{T}}((QL)\otimes I_n)\tilde{\theta}_{\phi}+k_2\xi^{\mathrm{T}}((QL)\otimes I_n)\tilde{\theta}_{\phi}^{0}\;-\\ &ck_2\xi^{\mathrm{T}}((QL)\otimes I_n)\hat{\xi} \end{aligned} \end{equation*} $$

    式中, $ \xi = [\xi_1^{\mathrm{T}},\;\cdots,\;\xi_N^{\mathrm{T}}]^{\mathrm{T}} $, $ \hat{\xi} = [\hat{\xi}_1^{\mathrm{T}},\;\cdots,\;\hat{\xi}_N^{\mathrm{T}}]^{\mathrm{T}} $, $ \tilde{\theta}_{\phi} = [\phi_1^{\mathrm{T}}\tilde{\theta}_1,\;\cdots,\;\phi_N^{\mathrm{T}}\tilde{\theta}_N]^{\mathrm{T}} $, $ \tilde{\theta}_{\phi}^0 = [\phi^{\mathrm{T}}(t)\tilde{\theta}_0^{1},\;\cdots,\;\phi^{\mathrm{T}}(t)\tilde{\theta}_0^{N}]^{\mathrm{T}} $, $ \hat{\xi}_i = k_1\hat{\xi}_{ix}+k_2\hat{\xi}_{iy} $.

    根据Young不等式, 存在$ \gamma_1,\;\gamma_2\in(0,\;1) $, 使得$ \dot{V}_3 $的第2项和第3项满足如下不等式:

    $$ \begin{equation} \begin{split} & \xi^{\mathrm{T}}((QL)\otimes I_n)\tilde{\theta}_{\phi} \leq \frac{\gamma_1}{2}\xi^{\mathrm{T}}(\Xi\otimes I_n)\xi+\frac{1}{2\gamma_1} \tilde{\theta}_{\phi}^{\mathrm{T}}\tilde{\theta}_{\phi}\;\leq\\ &\;\;\;\frac{\gamma_1\lambda_{N\Xi}}{2}\sum_{i = 1}^N\xi_i^{\mathrm{T}}\xi_i+\frac{1}{2\gamma_1}\sum_{i = 1}^N \phi_i^{\mathrm{T}}\tilde{\theta}_i\tilde{\theta}_i^{\mathrm{T}}\phi_i\\[-1pt] \end{split} \end{equation} $$ (19)
    $$ \begin{equation} \begin{split} \xi^{\mathrm{T}}&((QL)\otimes I_n)\tilde{\theta}_{\phi}^0 \leq \frac{\gamma_2}{2}\xi^{\mathrm{T}}(\Xi\otimes I_n)\xi+\frac{1}{2\gamma_2} (\tilde{\theta}_{\phi}^{0})^{\mathrm{T}}\tilde{\theta}_{\phi}^0\;\leq\\ &\frac{\gamma_2\lambda_{N\Xi}}{2}\sum_{i = 1}^N\xi_i^{\mathrm{T}}\xi_i+\frac{1}{2\gamma_2}\sum_{i = 1}^N \phi^{\mathrm{T}}\tilde{\theta}_0^{i}(\tilde{\theta}_0^{i})^{\mathrm{T}}\phi \\[-1pt]\end{split} \end{equation} $$ (20)

    对于$ \dot{V}_3 $的最后1项, 有如下不等式:

    $$ \begin{equation} \begin{split} -\xi^{\mathrm{T}}&((QL)\otimes I_n)\hat{\xi} = -\xi^{\mathrm{T}}((QL)\otimes I_n)\xi\;-\\ &\xi^{\mathrm{T}}((QL^2)\otimes I_n)e\leq-(\lambda_{1H}\;-\\ &\frac{\gamma_3\lambda_{N\Pi}}{2})\sum_{i = 1}^N\xi_i^{\mathrm{T}}\xi_i +\frac{1}{2\gamma_3}\sum_{i = 1}^Ne_i^{\mathrm{T}}e_i \end{split} \end{equation} $$ (21)

    式中, $e=[e_1^{\mathrm{T}},\;\cdots,\;e_N^{\mathrm{T}}]^{\mathrm{T}} $, $ \Pi = QL^2(L^{2})^{\mathrm{T}}Q $, $ \gamma_3\in (0, \;2\lambda_{1H}/\lambda_{N\Pi}) $.

    将式(19) ~ 式(21)代入$ \dot{V}_3 $, 可得:

    $$ \begin{equation*} \begin{aligned} \dot{V}_3\leq&-\sum_{i = 1}^N\frac{\rho_ik_1}{k_2}\xi_{ix}^{\mathrm{T}}\xi_{ix}-\sum_{i = 1}^N\left( \frac{ck_2(2\lambda_{1H}-\gamma_3\lambda_{N\Pi})}{2}\;-\right.\\ &\left.\frac{k_1}{q_ik_2}-\frac{(\gamma_1+\gamma_2)k_2\lambda_{N\Xi}}{2}\right)\xi_i^{\mathrm{T}}\xi_i +\frac{ck_2}{2\gamma_3}\sum_{i = 1}^Ne_i^{\mathrm{T}}e_i\;+\\ &\frac{k_2}{2\gamma_1}\sum_{i = 1}^N\phi_i^{\mathrm{T}}\tilde{\theta}_i\tilde{\theta}_i^{\mathrm{T}}\phi_i +\frac{k_2}{2\gamma_2}\sum_{i = 1}^N\phi^{\mathrm{T}}\tilde{\theta}_0^{i}(\tilde{\theta}_0^{i})^{\mathrm{T}}\phi \end{aligned} \end{equation*} $$

    记$ q_{\min} = \min_{i\in\{1,\;\cdots,\;N\}}q_i $. 选取合适的参数$ k_1,\; k_2,\;\gamma_1,\;\gamma_2>0 $, $ \gamma_3\in(0,\;2\lambda_{1H}/\lambda_{N\Pi}) $, $ c>\bar{c} $, 其中:

    $$ \begin{equation} \bar{c} = \frac{(\gamma_1+\gamma_2)k_2^2\lambda_{N\Xi}+\displaystyle\frac{2k_1}{q_{\min}}} {(2\lambda_{1H}-\gamma_3\lambda_{N\Pi})k_2^2} \end{equation} $$ (22)

    记$ \alpha = k_2(2\lambda_{1H}-\gamma_3\lambda_{N\Pi})(c-\bar{c})/2 $, 可得:

    $$ \begin{equation*} \begin{aligned} \dot{V}_3\leq&-\sum_{i = 1}^N\frac{\rho_ik_1}{k_2}\xi_{ix}^{\mathrm{T}}\xi_{ix}-\sum_{i = 1}^N\alpha\xi_i^{\mathrm{T}}\xi_i +\frac{ck_2}{2\gamma_3}\sum_{i = 1}^Ne_i^{\mathrm{T}}e_i\;+\\ &\frac{k_2}{2\gamma_1}\sum_{i = 1}^N\phi_i^{\mathrm{T}}\tilde{\theta}_i\tilde{\theta}_i^{\mathrm{T}}\phi_i +\frac{k_2}{2\gamma_2}\sum_{i = 1}^N\phi^{\mathrm{T}}\tilde{\theta}_0^{i}(\tilde{\theta}_0^{i})^{\mathrm{T}}\phi \end{aligned} \end{equation*} $$

    由于$ \lim_{t\rightarrow\infty}\tilde{\theta}_i(t) = \bf 0,\;\lim_{t\rightarrow\infty}\tilde{\theta}_0^i (t) = \bf 0 $, $\phi_i(x_i (t), \; y_i (t)) $, $ \phi(t) $有界, 结合触发函数可知存在函数$ \beta(t) \in {\cal{A}}(t) $, 使得:

    $$ \begin{aligned} \beta(t)\geq\;&\frac{ck_2}{2\gamma_3}\sum_{i = 1}^Ne_i^{\mathrm{T}}e_i +\frac{k_2}{2\gamma_1}\sum_{i = 1}^N\phi_i^{\mathrm{T}}\tilde{\theta}_i\tilde{\theta}_i^{\mathrm{T}}\phi_i\;+\\ &\frac{k_2}{2\gamma_2}\sum_{i = 1}^N\phi^{\mathrm{T}}\tilde{\theta}_0^{i}(\tilde{\theta}_0^{i})^{\mathrm{T}}\phi \end{aligned} $$

    $$ \begin{equation} \dot{V}_3\leq-h V_3+\beta(t) \end{equation} $$ (23)

    式中, $ h = \min\{2k_1/k_2,\;2\alpha q_{\min}\} $. 由引理1可知, $ V_3(t) $渐近趋向$ \bf 0 $, 即对任意$ i\in\{1,\;\cdots,\;N\} $, 都有

    $$ \lim_{t\rightarrow\infty} \xi_{ix}(t) = \lim_{t\rightarrow\infty} \xi_{iy}(t) = \bf 0 $$

    记:

    $$ \begin{equation*} \begin{aligned} &\xi_x = [\xi_{1x}^{\mathrm{T}},\;\cdots,\;\xi_{Nx}^{\mathrm{T}}]^{\mathrm{T}},\;\delta_x = [\delta_{1x}^{\mathrm{T}},\;\cdots,\;\delta_{Nx}^{\mathrm{T}}]^{\mathrm{T}}\\ &\xi_y = [\xi_{1y}^{\mathrm{T}},\;\cdots,\;\xi_{Ny}^{\mathrm{T}}]^{\mathrm{T}},\;\delta_y = [\delta_{1y}^{\mathrm{T}},\;\cdots,\;\delta_{Ny}^{\mathrm{T}}]^{\mathrm{T}}\\ &\delta_{ix} = x_i-x_0,\;\delta_{iy} = y_i-y_0,\;i = 1,\;\cdots,\;N \end{aligned} \end{equation*} $$

    由$ \xi_{ix} $和$ \xi_{iy} $的定义易证$ \xi_x = (L\otimes I_n)\delta_x,\;\xi_y = (L\otimes I_n)\delta_y $. 当假设1成立时, $ L $非奇异. 由式(23)可得, 对任意$ i $有$ \lim_{t\rightarrow\infty}(x_i(t) - x_0(t)) = \boldsymbol 0,\;\lim_{t\rightarrow\infty}(y_i(t) - y_0(t)) = \boldsymbol 0 $.

    定理2. 分布式事件触发趋同算法式(8)和式(9)不存在芝诺现象.

    证明. 当$ t\in[t_k^i,\;t_{k+1}^i) $时, $ \epsilon_i(t) $的Frobenius范数和$ e_i(t) $的2范数的Dini导数满足如下不等式:

    $$ \begin{equation*} {\rm D}^+||\epsilon_i(t)||_{\rm{F}}\leq||\dot{\epsilon}_i(t)||_{\rm{F}},\; {\rm D}^+||e_i(t)||\leq||\dot{e}_i(t)|| \end{equation*} $$

    由式(6)和式(7)可得:

    $$ \begin{aligned} \dot{\epsilon}_i(t) =\; &\mu\sum_{j = 0}^Na_{ij}(\hat{\theta}_0^i(t_k^i)-\hat{\theta}_0^j(t_{k'}^j))\\ \dot{e}_i(t) =\; &k_2((\hat{\theta}_0^{i})^{\mathrm{T}}\phi(t)-\theta_i^{\mathrm{T}}\phi_i-u_i)\;+\\ &k_1(\hat{y}_i^i(t)-y_i) \end{aligned} $$

    由假设2、命题1、命题2和定理1可知, 存在有界实数$ \psi_k^i>0,\;\chi_k^i $和$ c>0 $, 使得:

    $$ \begin{aligned} &{\mathrm{D}}^+||\epsilon_i(t)||_{\rm{F}}\leq\psi_k^i\\ &{\mathrm{D}}^+||e_i(t)||\leq c||e_i(t)||+\chi_k^i \end{aligned} $$

    在事件触发时刻$ t_k^i $, $ \epsilon_i(t) $和$ e_i(t) $被重置为$ \bf 0 $. 对于$ t\in[t_k^i,\;t_{k+1}^i) $, 由比较原理可得:

    $$ \begin{equation} \left\{\begin{aligned} &||\epsilon_i(t)||_{\rm{F}}\leq\psi_k^i(t-t_k^i)\\ &||e_i(t)||\leq\frac{\chi_k^i}{c}(\mathrm{e}^{c(t-t_k^i)}-1) \end{aligned}\right. \end{equation} $$ (24)

    $ \forall t \in [t_k^i,\;t_{k+1}^i) $, 有$ ||\epsilon_i(t)||_{\rm{F}} < \sqrt{f_{i1}(t)},\;||e_i (t)|| < \sqrt{f_{i2}(t)} $.

    当$ {t \rightarrow t_{k+1}^i} $时, 则有$ \lim_{t\rightarrow t_{k+1}^i}||e_i(t)||\geq\sqrt{f_{i2}(t)} $, 或$\lim_{t\rightarrow t_{k+1}^i}||\epsilon_i(t)||_{\rm{F}}\geq\sqrt{f_{i1}(t)}$. 结合式(24), 可得$ t_{k+1}^i - t_k^i \geq \ln\left({c}\sqrt{f_{i2}(t)}/{\chi_k^i} + 1\right)/{c} $ 或 $ t_{k+1}^i - t_k^i \; \geq {\sqrt{f_{i1}(t)}}/ {\psi_k^i} $. 对任意有限时间$ t $, $ f_{i1}(t)>0, \;f_{i2}(t)> 0 $, 即连续2次触发时刻的时间差$ t_{k+1}^i-t_k^i $是严格大于$ 0 $的, 从而证明, 对任意有限时间$ t $, 事件触发趋同算法式(8)和式(9)不存在芝诺现象.
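    定理2给出的触发间隔下界可直接数值验证为正(Python 草图, 非原文内容, 各常数取假设的示例值):

```python
import numpy as np

def min_inter_event_time(f1, f2, psi, chi, c):
    """定理 2 中两类触发条件给出的事件间隔下界:
    sqrt(f1)/psi 与 ln(c*sqrt(f2)/chi + 1)/c."""
    tau1 = np.sqrt(f1) / psi
    tau2 = np.log(c * np.sqrt(f2) / chi + 1.0) / c
    return min(tau1, tau2)

# 常数均为假设示例值: f_i1(t) = f_i2(t) = 0.05, psi = 3, chi = 2, c = 2
tau = min_inter_event_time(f1=0.05, f2=0.05, psi=3.0, chi=2.0, c=2.0)
```

    对任意有限时刻 $ f_{i1}(t),\;f_{i2}(t)>0 $, 因此下界严格为正, 与"不存在芝诺现象"的结论一致.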

    推论1. 事件触发条件(10)所给出的领导智能体的事件触发算法不存在芝诺现象, 证明过程与定理2的证明类似.

    本节通过仿真模型验证事件触发控制器式(8)和式(9)的有效性. 考虑包含$ 5 $个智能体的异构不确定二阶非线性多智能体系统, 其中跟随智能体1 ~ 4为无阻尼单摆系统, 其动力学方程为:

    $$ \begin{equation} \left\{\begin{aligned} \dot{x}_i& = y_i\\ \dot{y}_i& = -\frac{g}{l_i}\sin(x_i)+u_i \end{aligned}\right. \end{equation} $$ (25)

    式中, $ x_i $为单摆的角位移, $ y_i $为角速度, $ g $为重力加速度, $ l_i $为摆长, $ u_i $为控制输入. 由于测量误差, 重力加速度$ g $和摆长$ l_i $的精确值不确定. 领导智能体的动力学方程为:

    $$ \begin{equation} \left\{\begin{aligned} \dot{x}_0& = y_0\\ \dot{y}_0& = \theta_0^{\mathrm{T}}\phi(t) \end{aligned}\right. \end{equation} $$ (26)

    式中, $ \phi(t) = [\sin(t),\;\cos(2t)]^{\mathrm{T}} $为已知时间向量函数, $ \theta_0\in {\bf{R}}^2 $为未知常向量. 多智能体系统式(25)和式(26)的网络拓扑由如下拉普拉斯矩阵描述:

    $$ \begin{equation*} {\cal{L}} =\left[ \begin{array}{rrrrr} 0&0&0&0&0\\ 0&0.60&-0.55&0&-0.05\\ -0.50&0&0.55&-0.05&0\\ -0.50&-0.05&0&0.55&0\\ 0&0&0&-0.55&0.55 \end{array}\right] \end{equation*} $$

    根据参数观测器式(4)和式(6), 为每个智能体设计未知向量$ \theta_0 $的观测值$ \hat{\theta}_0^i $; 根据参数观测器式(5), 为跟随智能体设计不确定系数$ -g/l_i $的观测值$ \hat{\theta}_i $, 其中参数$ \mu = 2,\;s_i = 1 $; 根据状态估计器式(7), 为每个智能体设计邻居状态估计器. 通过计算, 可求得参数$ q_{\min} = 2.015\ 3,\;\lambda_{1H} = 0.099\ 3,\; \lambda_{N\Xi} = 0.103\ 5,\;\lambda_{N\Pi} = 0.056\ 7 $. 通过选取参数$ \gamma_1 = \gamma_2 = \gamma_3 = 0.5 $, $ k_1 = 0.5,\;k_2 = 2 $, 可求得$ \bar{c} = 1.336\ 6 $. 因此, 选取事件触发控制器的参数为$ k_1 = 0.5,\; k_2 = 2, \;c = 2 $. 对于触发函数式(9)和式(10), 选取函数$ f_{i1}(t) = f_{i2}(t) = 0.1/(1+0.5t) $.
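    上述参数可由注1的定义和式(22)直接复算(Python 草图, 非原文内容; 结果应与文中给出的 $ q_{\min} $ 和反馈增益下界相近):

```python
import numpy as np

# 仿真拓扑的拉普拉斯矩阵 (文中数值), 第 0 行/列对应领导智能体
Lap = np.array([
    [ 0.00,  0.00,  0.00,  0.00,  0.00],
    [ 0.00,  0.60, -0.55,  0.00, -0.05],
    [-0.50,  0.00,  0.55, -0.05,  0.00],
    [-0.50, -0.05,  0.00,  0.55,  0.00],
    [ 0.00,  0.00,  0.00, -0.55,  0.55]])
L = Lap[1:, 1:]                          # 跟随智能体子矩阵

q = np.linalg.solve(L, np.ones(4))       # [q_1,...,q_N]^T = L^{-1} 1_N
Q = np.diag(1.0 / q)
H = (Q @ L + L.T @ Q) / 2                # 注 1 中的正定矩阵
Xi = Q @ L @ L.T @ Q                     # 命题 2 中的 Xi
Pi = Q @ (L @ L) @ (L @ L).T @ Q         # 定理 1 中的 Pi

lam1H = np.linalg.eigvalsh(H).min()
lamNXi = np.linalg.eigvalsh(Xi).max()
lamNPi = np.linalg.eigvalsh(Pi).max()

g1 = g2 = g3 = 0.5
k1, k2 = 0.5, 2.0
cbar = ((g1 + g2) * k2**2 * lamNXi + 2 * k1 / q.min()) \
       / ((2 * lam1H - g3 * lamNPi) * k2**2)   # 式 (22)
```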

    仿真结果如图1 ~ 图3所示. 由图1可知, 跟随智能体的角度和角速度渐近跟踪上领导智能体的状态; 由图2可知, $ \hat{\theta}_0^i $和$ \hat{\theta}_i $分别可以渐近收敛到$ \theta_0 $和$ -g/l_i $; 图3给出了各智能体的事件触发时刻. 表1为在时间段$ [0,\;40] $ s内, 本文算法的事件触发次数. 作为对比, 利用文献[8−10, 14]所给出的组合测量事件触发算法对系统式(25)和式(26)进行仿真, 表2为在时间段$ [0,\;40] $ s内, 组合测量事件触发算法的各智能体事件触发次数. 可以看出, 本文基于参数和状态观测器的事件触发控制算法可有效减少事件触发次数.

    图 1  各智能体的状态轨迹
    Fig. 1  State trajectories of each agent
    图 2  $||\tilde{\theta}_0^i||$和$\tilde{\theta}_i$的轨迹
    Fig. 2  Trajectories of $||\tilde{\theta}_0^i||$ and $\tilde{\theta}_i$
    图 3  各智能体的事件触发时刻
    Fig. 3  Event-triggered instants of each agent
    表 1  本文算法的事件触发次数
    Table 1  Event-triggered number of the proposed algorithm
    智能体      0     1     2     3     4
    触发次数    49    84    75    73    72
    表 2  组合测量算法的事件触发次数
    Table 2  Event-triggered number of the combined measurement algorithm
    智能体      0     1     2     3     4
    触发次数    139   258   266   255   249

    本文基于参数估计与事件触发机制, 研究了异构不确定二阶非线性多智能体系统的状态趋同问题, 给出完全不依赖智能体间连续信息传输的事件触发趋同算法. 因为每个智能体均存在不确定参数, 在设计控制器前, 先设计观测器, 估计其不确定参数. 为使跟随智能体跟踪上领导智能体, 设计分布式参数观测器, 使每个跟随智能体可以渐近估计领导智能体不确定参数. 为使算法达到完全不依赖智能体间连续信息传输的目的, 每个智能体利用其邻居智能体发送的事件触发时刻采样信息, 对邻居智能体状态进行重构, 利用重构的状态信息设计控制器和事件触发函数. 进一步证明了所提事件触发趋同算法不存在芝诺现象. 最后, 通过一个多单摆系统验证了所提事件触发趋同算法的有效性, 同时对比组合测量事件触发算法, 本文所提算法可有效减少事件触发次数. 为简化反馈增益参数对拓扑网络全局信息的依赖, 未来可将现有工作推广到完全分布式事件触发状态趋同控制.

    引理 2. 对于空间$ {\bf{R}}^{m\times n} $中的矩阵$ X $, 以及空间$ {\bf{R}}^{m\times m} $中的正定矩阵$ A $, 有:

    $$ \lambda_{1A}\mathrm{tr}\{X^{\mathrm{T}}X\}\leq\mathrm{tr}\{X^{\mathrm{T}}AX\}\leq\lambda_{mA}\mathrm{tr}\{X^{\mathrm{T}}X\} $$

    证明. 矩阵$ X $可用 $ n $个列向量 $ x_i\in {\bf{R}}^m, \;i = 1,\;\cdots,\; n $表示, 即$ X = [x_1,\;\cdots,\;x_n] $. 因此, 可得:

    $$ \begin{equation*} X^{\mathrm{T}}X = \begin{bmatrix} x_1^{\mathrm{T}}x_1 & x_1^{\mathrm{T}}x_2 & \cdots & x_1^{\mathrm{T}}x_n \\ x_2^{\mathrm{T}}x_1 & x_2^{\mathrm{T}}x_2 & \cdots & x_2^{\mathrm{T}}x_n \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{\mathrm{T}}x_1 & x_n^{\mathrm{T}}x_2 & \cdots & x_n^{\mathrm{T}}x_n \\ \end{bmatrix} \end{equation*} $$

    即, $ \mathrm{tr}\{X^{\mathrm{T}}X\} = \sum_{i = 1}^nx_i^{\mathrm{T}}x_i $.

    记$ \Lambda = \mathrm{diag}\{\lambda_{1A},\;\cdots,\;\lambda_{mA}\} $. 由于$ A $为正定矩阵, 所以存在单位正交矩阵$ P\in {\bf{R}}^{m\times m} $使$ P^{\mathrm{T}}AP = \Lambda $. 矩阵$ P $可用$ m $个列向量$ p_i\in {\bf{R}}^m, \;i = 1,\;\cdots,\;m $表示, 即$ P = [p_1,\;\cdots,\;p_m] $. 对于$ X^{\mathrm{T}}AX $, 有:

    $$ X^{\mathrm{T}}AX = X^{\mathrm{T}}PP^{\mathrm{T}}APP^{\mathrm{T}}X = X^{\mathrm{T}}P\Lambda P^{\mathrm{T}}X $$

    通过计算, 可得:

    $$ X^{\mathrm{T}}P\Lambda = \begin{bmatrix} \lambda_{1A}x_1^{\mathrm{T}}p_1 & \lambda_{2A}x_1^{\mathrm{T}}p_2 & \cdots & \lambda_{mA}x_1^{\mathrm{T}}p_m \\ \lambda_{1A}x_2^{\mathrm{T}}p_1 & \lambda_{2A}x_2^{\mathrm{T}}p_2 & \cdots & \lambda_{mA}x_2^{\mathrm{T}}p_m \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{1A}x_n^{\mathrm{T}}p_1 & \lambda_{2A}x_n^{\mathrm{T}}p_2 & \cdots & \lambda_{mA}x_n^{\mathrm{T}}p_m \\ \end{bmatrix} $$
    $$ P^{\mathrm{T}}X = \begin{bmatrix} p_1^{\mathrm{T}}x_1 & p_1^{\mathrm{T}}x_2 & \cdots & p_1^{\mathrm{T}}x_n \\ p_2^{\mathrm{T}}x_1 & p_2^{\mathrm{T}}x_2 & \cdots & p_2^{\mathrm{T}}x_n \\ \vdots & \vdots & \ddots & \vdots \\ p_m^{\mathrm{T}}x_1 & p_m^{\mathrm{T}}x_2 & \cdots & p_m^{\mathrm{T}}x_n \\ \end{bmatrix} $$

    通过计算, 可得:

    $$ \mathrm{tr}\{X^{\mathrm{T}}AX\} = \sum\limits_{i = 1}^n\sum\limits_{j = 1}^m\lambda_{jA}x_i^{\mathrm{T}}p_jp_j^{\mathrm{T}}x_i $$

    由于向量组 $ \{p_1,\;\cdots,\;p_m\} $ 为空间 $ {\bf{R}}^m $ 中的一组标准正交基, 所以对数量积$ x_i^{\mathrm{T}}p_j $有 $ x_i^{\mathrm{T}}p_j = ||x_i||\cos\theta_{ij} $, 其中$ \theta_{ij} $为向量 $ x_i $与基向量$ p_j $的夹角. 因此有:

    $$ \sum\limits_{j = 1}^m\lambda_{jA}x_i^{\mathrm{T}}p_jp_j^{\mathrm{T}}x_i = \sum\limits_{j = 1}^m\lambda_{jA}(\cos^2\theta_{ij})x_i^{\mathrm{T}}x_i $$

    又由于$ \lambda_{1A}\leq\cdots\leq\lambda_{mA} $和$ \sum_{j = 1}^m\cos^2\theta_{ij} = 1 $, 可得$ \lambda_{1A}\mathrm{tr}\{X^{\mathrm{T}}X\}\leq\mathrm{tr}\{X^{\mathrm{T}}AX\}\leq\lambda_{mA}\mathrm{tr}\{X^{\mathrm{T}}X\}. $
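    引理2的迹不等式可用随机矩阵做数值验证(Python 草图, 非原文内容, 矩阵维数与取值均为假设的示例):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.standard_normal((m, n))
B = rng.standard_normal((m, m))
A = B @ B.T + 0.1 * np.eye(m)     # 构造正定矩阵 A

lams = np.linalg.eigvalsh(A)      # 升序排列的特征值
trXX = np.trace(X.T @ X)
trXAX = np.trace(X.T @ A @ X)
# 引理 2: lam_min * tr(X^T X) <= tr(X^T A X) <= lam_max * tr(X^T X)
```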

    引理 3. 对矩阵$ X\in {\bf{R}}^{m\times n} $, $ Y\in {\bf{R}}^{s\times n} $, $ A\in {\bf{R}}^{m\times s} $和正实数$ \eta $, 有:

    $$ \mathrm{tr}\{X^{\mathrm{T}}AY\}\leq\frac{\eta}{2}\mathrm{tr}\{X^{\mathrm{T}}AA^{\mathrm{T}}X\}+\frac{1}{2\eta} \mathrm{tr}\{Y^{\mathrm{T}}Y\} $$

    证明. $ X $, $ Y $, $ A $可表示为:

    $$ \begin{aligned} X& = [x_1,\;\cdots,\;x_n],\;x_i\in {\bf{R}}^m,\;i\in\{1,\;\cdots,\;n\}\\ Y& = [y_1,\;\cdots,\;y_n],\;y_i\in {\bf{R}}^s,\;i\in\{1,\;\cdots,\;n\}\\ A& = [a_1,\;\cdots,\;a_s],\;a_i\in {\bf{R}}^m,\;i\in\{1,\;\cdots,\;s\}\\ \end{aligned} $$

    记$ y_i = [y_{i1},\;\cdots,\;y_{is}]^{\mathrm{T}} $, 通过计算可得:

    $$ \mathrm{tr}\{X^{\mathrm{T}}AY\} = \sum\limits_{i = 1}^n\sum\limits_{j = 1}^sx_i^{\mathrm{T}}a_jy_{ij} $$

    By Young's inequality, $ x_i^{\mathrm{T}}a_jy_{ij}\leq {\eta}(x_i^{\mathrm{T}}a_j)^2/{2}+ y_{ij}^2/ {2\eta} $, and hence:

    $$ \mathrm{tr}\{X^{\mathrm{T}}AY\}\leq\frac{\eta}{2}\sum\limits_{i = 1}^n\sum\limits_{j = 1}^s(x_i^{\mathrm{T}}a_j)^2+\frac{1}{2\eta} \sum\limits_{i = 1}^n\sum\limits_{j = 1}^sy_{ij}^2 $$

    It is straightforward to verify that $ \sum_{i = 1}^n\sum_{j = 1}^s(x_i^{\mathrm{T}}a_j)^2 \,=\, \mathrm{tr}\{X^{\mathrm{T}}AA^{\mathrm{T}}X\} $ and $ \sum_{i = 1}^n \sum_{j = 1}^sy_{ij}^2 = \mathrm{tr}\{Y^{\mathrm{T}}Y\} $, which completes the proof.
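    Lemma 3 can likewise be verified numerically. This sketch (not from the paper; the dimensions, seed, and $\eta$ are arbitrary choices) checks the Young-type trace inequality for random $X$, $Y$, $A$:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    m, n, s = 5, 4, 3   # arbitrary dimensions
    eta = 0.7           # any positive real number

    X = rng.standard_normal((m, n))
    Y = rng.standard_normal((s, n))
    A = rng.standard_normal((m, s))

    lhs = np.trace(X.T @ A @ Y)
    rhs = (eta / 2) * np.trace(X.T @ A @ A.T @ X) \
        + (1 / (2 * eta)) * np.trace(Y.T @ Y)

    # tr{X^T A Y} <= (eta/2) tr{X^T A A^T X} + (1/(2 eta)) tr{Y^T Y}
    assert lhs <= rhs + 1e-9
    ```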


  • Red team (Proponents, Pros); blue team (Antagonists, Ants)
  • 图  1  EGMF的方法架构(EGMF通过联合极小极大Q函数分解框架进行策略评估, 分解的个体独立Q函数与熵评估函数结合用于策略改进)

    Fig.  1  The architecture of EGMF (EGMF evaluates policies through the joint minimax Q decomposition framework, and combines the factorized individual Q function with the entropy evaluation function for policy improvement)

    图  2  实验验证平台(包括Wimblepong 2v2、MPE 3v3、RoboMaster 2v2和现实世界的RoboMaster 2v2)

    Fig.  2  Experimental verification platform (including Wimblepong 2v2, MPE 3v3, RoboMaster 2v2 and real-world RoboMaster 2v2)

    图  3  训练过程中与基于脚本的智能体进行对抗的结果

    Fig.  3  The result of playing against the rule-based bots during the training process

    图  4  训练过程中多种算法交叉对抗的循环赛回报

    Fig.  4  The cross-play results of RR returns throughout training

    图  5  训练期间EGMF和基线方法的近似NashConv结果

    Fig.  5  Illustration of the approximate NashConv of EGMF and baselines during training

    图  6  EGMF方法在六种场景中的收益矩阵

    Fig.  6  Illustration of the payoff matrices of the EGMF method in the six scenarios

    图  7  最大熵消融实验过程中与基于脚本的智能体对抗的结果

    Fig.  7  Results of the maximum-entropy ablation study by playing against the rule-based bots

    图  8  最大熵优化提升策略的多样性

    Fig.  8  Maximum entropy optimization enhances policy diversity

    图  9  EGMF算法模型部署在实体机器人任务中的演示

    Fig.  9  Demonstration of the EGMF model deployed on a real-world robot task

    表  1  实验中所有方法的重要超参数

    Table  1  Important hyperparameters of all methods in experiments

    Algorithm Hyperparameter Description Wimblepong 2v2 MPE 3v3 RoboMaster 2v2
    Shared hyperparameters n_episodes Number of episodes 13000 13000 80000
    n_seeds Number of random seeds 8 8 8
    $\gamma$ Discount factor 0.99 0.98 0.99
    hidden_layers Hidden layer sizes [64, 64] [64, 64] [128, 128]
    mix_hidden_dim Mixing network hidden size 32 32 32
    learning_rate Learning rate 0.0005 0.0005 0.0005
    EGMF (ours) buffer_size Replay buffer size 400000 40000 400000
    RADAR[15]/Team-PSRO[16]/NXDO[46] n_genes Number of iterations 13 13 10
    ep_per_gene Episodes per iteration 1000 1000 80000
    batch_size Batch size 1000 1000 2000
    buffer_size Replay buffer size 200000 20000 200000

    表  2  训练结束后各个算法与基于脚本的智能体对抗的结果和循环赛交叉对抗的结果

    Table  2  Performance of all methods at the end of training by playing against the script-based bots, and the cross-play results of round-robin returns

    Metric Algorithm Scenario
    Pong-D MPE-D RM-D Pong-C MPE-C RM-C
    Vs. scripted bots EGMF (ours) 0.95±0.01 32.3±1.0 0.63±0.03 0.95±0.02 23.0±0.5 0.62±0.03
    RADAR[15] 0.52±0.11 16.3±5.2 0.35±0.02 0.58±0.03 12.5±5.1 0.52±0.02
    Team-PSRO[16] 0.71±0.04 21.2±3.4 0.33±0.01 0.71±0.06 22.1±2.9 0.54±0.03
    NXDO[46] 0.71±0.10 24.1±1.6 0.45±0.02 0.80±0.05 23.0±0.4 0.61±0.01
    Round-robin results EGMF (ours) 0.92±0.01 12.1±0.3 0.90±0.02 0.91±0.02 7.8±2.2 0.72±0.01
    RADAR[15] 0.45±0.02 −2.4±2.5 0.45±0.04 0.43±0.02 −1.8±1.9 0.50±0.01
    Team-PSRO[16] 0.53±0.02 1.9±1.9 0.49±0.01 0.56±0.04 −3.7±2.8 0.55±0.01
    NXDO[46] 0.51±0.02 2.5±1.2 0.51±0.02 0.63±0.02 2.9±1.9 0.58±0.02
    Note: Bold indicates the best result among the algorithms in each scenario.

    表  3  EGMF和FM3Q与基于脚本的智能体对抗的结果

    Table  3  Performance of EGMF and FM3Q by playing against the script-based bots

    Algorithm Pong-D MPE-D RM-D
    Episodes (to 0.8) Performance Episodes (to 25) Performance Episodes (to 0.6) Performance
    EGMF (ours) 3.0 k 0.95±0.01 2.8 k 32.3±1.0 35 k 0.63±0.03
    FM3Q[17] 3.1 k 0.96±0.03 3.6 k 29.9±1.2 19 k 0.68±0.03
    Note: Bold indicates the best result among the methods in each scenario.
  • [1] Silver D, Huang A, Maddison C J, Guez A, Sifre L, van den Driessche G, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484−489 doi: 10.1038/nature16961
    [2] 唐振韬, 邵坤, 赵冬斌, 朱圆恒. 深度强化学习进展: 从AlphaGo到AlphaGo Zero. 控制理论与应用, 2017, 34(12): 1529−1546 doi: 10.7641/CTA.2017.70808

    Tang Zhen-Tao, Shao Kun, Zhao Dong-Bin, Zhu Yuan-Heng. Recent progress of deep reinforcement learning: From AlphaGo to AlphaGo Zero. Control Theory and Applications, 2017, 34(12): 1529−1546 doi: 10.7641/CTA.2017.70808
    [3] Sandholm T. Solving imperfect-information games. Science, 2015, 347(6218): 122−123 doi: 10.1126/science.aaa4614
    [4] Tang Z T, Zhu Y H, Zhao D B, Lucas S M. Enhanced rolling horizon evolution algorithm with opponent model learning: Results for the fighting game AI competition. IEEE Transactions on Games, 2023, 15(1): 5−15 doi: 10.1109/TG.2020.3022698
    [5] Guan Y, Afshari M, Tsiotras P. Zero-sum games between mean-field teams: Reachability-based analysis under mean-field sharing. In: Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI, 2024. 9731−9739
    [6] Mathieu M, Ozair S, Srinivasan S, Gulcehre C, Zhang S T, Jiang R, et al. StarCraft II Unplugged: Large scale offline reinforcement learning. In: Proceedings of the 35th Conference on Neural Information Processing Systems. Sydney, Australia: NeurIPS, 2021.
    [7] Ye D H, Liu Z, Sun M F, Shi B, Zhao P L, Wu H, et al. Mastering complex control in MOBA games with deep reinforcement learning. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI, 2020. 6672−6679
    [8] Littman M L. Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the 11th International Conference on Machine Learning. San Francisco, USA: ACM, 1994. 157−163
    [9] Hu J L, Wellman M P. Nash Q-learning for general-sum stochastic games. The Journal of Machine Learning Research, 2003, 4: 1039−1069
    [10] Zhu Y H, Zhao D B. Online minimax Q network learning for two-player zero-sum Markov games. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(3): 1228−1241 doi: 10.1109/TNNLS.2020.3041469
    [11] Lanctot M, Zambaldi V, Gruslys A, Lazaridou A, Tuyls K, Pérolat J, et al. A unified game-theoretic approach to multiagent reinforcement learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: ACM, 2017. 4193−4206
    [12] Chai J J, Chen W Z, Zhu Y H, Yao Z X, Zhao D B. A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(9): 5417−5429 doi: 10.1109/TSMC.2023.3270444
    [13] Li W F, Zhu Y H, Zhao D B. Missile guidance with assisted deep reinforcement learning for head-on interception of maneuvering target. Complex and Intelligent Systems, 2022, 8(2): 1205−1216
    [14] Haarnoja T, Moran B, Lever G, Huang S H, Tirumala D, Humplik J, et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. Science Robotics, 2024, 9(89): Article No. eadi8022 doi: 10.1126/scirobotics.adi8022
    [15] Phan T, Belzner L, Gabor T, Sedlmeier A, Ritz F, Linnhoff-Popien C. Resilient multi-agent reinforcement learning with adversarial value decomposition. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. AAAI, 2021. 11308−11316
    [16] McAleer S, Farina G, Zhou G, Wang M Z, Yang Y D, Sandholm T. Team-PSRO for learning approximate TMECor in large team games via cooperative reinforcement learning. In: Proceedings of the 37th Conference on Neural Information Processing Systems. NeurIPS, 2023.
    [17] Hu G Z, Zhu Y H, Li H R, Zhao D B. FM3Q: Factorized multi-agent MiniMax Q-learning for two-team zero-sum Markov game. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024, 8(6): 4033−4045 doi: 10.1109/TETCI.2024.3383454
    [18] Bai Y, Jin C. Provable self-play algorithms for competitive reinforcement learning. In: Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020. Article No. 52
    [19] Perez-Nieves N, Yang Y D, Slumbers O, Mguni D H, Wen Y, Wang J. Modelling behavioural diversity for learning in open-ended games. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 8514−8524
    [20] Balduzzi D, Garnelo M, Bachrach Y, Czarnecki W, Pérolat J, Jaderberg M, et al. Open-ended learning in symmetric zero-sum games. In: Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: ICML, 2019. 434−443
    [21] McAleer S, Lanier J B, Fox R, Baldi P. Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: ACM, 2020. Article No. 1699
    [22] Muller P, Omidshafiei S, Rowland M, Tuyls K, Pérolat J, Liu S Q, et al. A generalized training approach for multiagent learning. In: Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: ICLR, 2020.
    [23] Marris L, Muller P, Lanctot M, Tuyls K, Graepel T. Multi-agent training beyond zero-sum with correlated equilibrium meta-solvers. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 7480−7491
    [24] Feng X D, Slumbers O, Wan Z Y, Liu B, McAleer S, Wen Y, et al. Neural auto-curricula in two-player zero-sum games. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. NeurIPS, 2021. Article No. 268
    [25] Anagnostides I, Kalogiannis F, Panageas I, Vlatakis-Gkaragkounis E V, Mcaleer S. Algorithms and complexity for computing Nash equilibria in adversarial team games. In: Proceedings of the 24th ACM Conference on Economics and Computation. London, UK: ACM, 2023. Article No. 89
    [26] Zhu Y H, Li W F, Zhao M C, Hao J Y, Zhao D B. Empirical policy optimization for n-player Markov games. IEEE Transactions on Cybernetics, 2023, 53(10): 6443−6455 doi: 10.1109/TCYB.2022.3179775
    [27] Luo G Y, Zhang H, He H B, Li J L, Wang F-Y. Multiagent adversarial collaborative learning via mean-field theory. IEEE Transactions on Cybernetics, 2021, 51(10): 4994−5007 doi: 10.1109/TCYB.2020.3025491
    [28] Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I. Multi-agent actor-critic for mixed cooperative-competitive environments. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: ACM, 2017. 6382−6393
    [29] Sunehag P, Lever G, Gruslys A, Czarnecki W M, Zambaldi V, Jaderberg M, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. Stockholm, Sweden: ACM, 2018. 2085−2087
    [30] Rashid T, Samvelyan M, De Witt C S, Farquhar G, Foerster J, Whiteson S. Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research, 2020, 21(1): Article No. 178
    [31] Chai J J, Li W F, Zhu Y H, Zhao D B, Ma Z, Sun K W, et al. UNMAS: Multiagent reinforcement learning for unshaped cooperative scenarios. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(4): 2093−2104 doi: 10.1109/TNNLS.2021.3105869
    [32] Peng B, Rashid T, De Witt C A S, Kamienny P A, Torr P H S, Böhmer W, et al. FACMAC: Factored multi-agent centralised policy gradients. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. NeurIPS, 2021. Article No. 934
    [33] Zhang T H, Li Y H, Wang C, Xie G M, Lu Z Q. FOP: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021. 12491−12500
    [34] Haarnoja T, Tang H R, Abbeel P, Levine S. Reinforcement learning with deep energy-based policies. In: Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017. 1352−1361
    [35] Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018. 1856−1865
    [36] Duan J L, Guan Y, Li S E, Ren Y G, Sun Q, Cheng B. Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(11): 6584−6598 doi: 10.1109/TNNLS.2021.3082568
    [37] Kalogiannis F, Panageas I, Vlatakis-Gkaragkounis E V. Towards convergence to Nash equilibria in two-team zero-sum games. In: Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR, 2023.
    [38] Wang J H, Ren Z Z, Liu T, Yu Y, Zhang C J. QPLEX: Duplex dueling multi-agent Q-learning. In: Proceedings of the 9th International Conference on Learning Representations. ICLR, 2021.
    [39] Condon A. On algorithms for simple stochastic games. Advances in Computational Complexity Theory, 1990, 13: 51−72
    [40] Zhou M, Liu Z Y, Sui P W, Li Y X, Chung Y Y. Learning implicit credit assignment for cooperative multi-agent reinforcement learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: NeurIPS, 2020. Article No. 994
    [41] Ziebart B D, Maas A, Bagnell J A, Dey A K. Maximum entropy inverse reinforcement learning. In: Proceedings of the 23rd AAAI Conference on Artificial Intelligence. Chicago, USA: AAAI, 2008. 1433−1438
    [42] Bellman R. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 1952, 38(8): 716−719
    [43] Mnih V, Kavukcuoglu K, Silver D, Rusu A A, Veness J, Bellemare M G, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529−533 doi: 10.1038/nature14236
    [44] Terry J K, Black B, Grammel N, Jayakumar M, Hari A, Sullivan R, et al. PettingZoo: A standard API for multi-agent reinforcement learning. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. NeurIPS, 2021. Article No. 1152
    [45] Hu G Z, Li H R, Liu S S, Zhu Y H, Zhao D B. NeuronsMAE: A novel multi-agent reinforcement learning environment for cooperative and competitive multi-robot tasks. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN). Gold Coast, Australia: IEEE, 2023. 1−8
    [46] McAleer S, Lanier J, Wang K A, Baldi P, Fox R. XDO: A double oracle algorithm for extensive-form games. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. NeurIPS, 2021. Article No. 1771
    [47] Samvelyan M, Khan A, Dennis M, Jiang M Q, Parker-Holder J, Foerster J N, et al. MAESTRO: Open-ended environment design for multi-agent reinforcement learning. In: Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR, 2023.
    [48] Timbers F, Bard N, Lockhart E, Lanctot M, Schmid M, Burch N, et al. Approximate exploitability: Learning a best response. In: Proceedings of the 31st International Joint Conference on Artificial Intelligence. Vienna, Austria: IJCAI, 2022. 3487−3493
    [49] Cohen A, Yu L, Wright R. Diverse exploration for fast and safe policy improvement. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI, 2018. Article No. 351
    [50] Tsai Y Y, Xu H, Ding Z H, Zhang C, Johns E, Huang B D. DROID: Minimizing the reality gap using single-shot human demonstration. IEEE Robotics and Automation Letters, 2021, 6(2): 3168−3175 doi: 10.1109/LRA.2021.3062311
    Publication history
    • Received:  2024-05-10
    • Accepted:  2024-10-05
    • Published online:  2025-02-26
