摘要: 近年来, 深度强化学习(Deep reinforcement learning, DRL)在诸多复杂序贯决策问题中取得巨大突破.由于融合了深度学习强大的表征能力和强化学习有效的策略搜索能力, 深度强化学习已经成为实现人工智能颇有前景的学习范式.然而, 深度强化学习在多Agent系统的研究与应用中, 仍存在诸多困难和挑战, 以StarCraft Ⅱ为代表的部分观测环境下的多Agent学习仍然很难达到理想效果.本文简要介绍了以深度Q网络、深度策略梯度算法等为代表的深度强化学习算法和相关技术.同时, 从多Agent深度强化学习中通信过程的角度对现有的多Agent深度强化学习算法进行归纳, 将其归纳为全通信集中决策、全通信自主决策、欠通信自主决策3种主流形式.从训练架构、样本增强、鲁棒性以及对手建模等方面探讨了多Agent深度强化学习中的一些关键问题, 并分析了多Agent深度强化学习的研究热点和发展前景.

Abstract: Recent years have witnessed the great success of deep reinforcement learning (DRL) in addressing complicated problems, and it is widely used to capture plausible policies in sequential decision-making tasks. Recognized as a promising learning paradigm, deep reinforcement learning takes advantage of the powerful representations of deep learning and the superior policy-improvement capability of reinforcement learning, driving the development of artificial intelligence into a new era. Though DRL has shown its great power in typical applications, effective multi-agent DRL still needs further exploration, and a challenging task is to guide multiple agents to play StarCraft Ⅱ, where the environment is partially observable and dynamic. To enable DRL to better accommodate multi-agent environments and overcome these challenges, we briefly introduced the foundations of reinforcement learning and then reviewed some representative and state-of-the-art DRL algorithms, including the deep Q-learning algorithm, the deep policy gradient algorithm and related extensions. Meanwhile, some dominant approaches to multi-agent decision making were elaborated and categorized, from the perspective of the communication process in multi-agent DRL, into three mainstream classes: full-communication centralized decision making, full-communication decentralized decision making, and limited-communication decentralized decision making. Finally, we discussed some key problems in multi-agent DRL, such as training architectures, sample enhancement, robustness improvement, and opponent modeling, and highlighted future directions on this topic.
Key words: Multi-agent system, deep learning, deep reinforcement learning (DRL), artificial general intelligence
在信息处理领域, 通常将信号自相关矩阵最大特征值对应的特征向量称之为信号的主成分, 而由信号的多个主成分张成的空间称为信号的主子空间.在很多信号处理问题中, 需要对信号的主子空间进行在线跟踪, 如视觉跟踪[1]、波达方向估计[2]、图像处理[3]、谱分析[4]等领域.因此, 发展主子空间跟踪算法就成为了一件非常有意义的工作.
以往解决主子空间跟踪问题主要依靠矩阵特征值分解(Eigenvalue decomposition, EVD)和奇异值分解(Singular value decomposition, SVD)等, 然而该方法计算复杂度高, 而且难以满足实时信号处理的要求.为了克服这些缺点, 学者们提出了基于Hebbian神经网络的主子空间跟踪方法.相比传统的EVD和SVD方法, 神经网络方法具有以下3个方面的优点: 1)可以对输入信号的自相关矩阵进行在线估计; 2)算法的计算复杂度较低; 3)能够处理非平稳的随机信号[4].基于上述优势, 神经网络方法已经成为近些年来国际上的一个研究热点.
基于单层线性神经网络, Oja提出了著名的Oja算法[5], 然而Oja算法是一个单维主成分提取算法.为了能够实现对信号主子空间的跟踪, 学者们对Oja算法进行改进, 提出了很多算法, 如FDPM (Fast data projection method)算法[6]、SOOJA (Stable and orthonormal OJA algorithm)算法[7]、SDPM (Stable DPM algorithm)算法[8]等, 然而上述算法大多是基于启发式推理提出来的, 并没有建立相对应的信息准则.由于信息准则在算法发展中具有很重要的意义[9], 因此研究主子空间信息准则是一件非常有意义的工作.在文献[10]中, 基于最小均方误差(Least mean squared error, LMSE)准则, Yang提出了投影近似子空间跟踪算法(Projection approximation subspace tracking, PAST); 此后, Miao等提出了NIC (Novel information criterion)准则[11], 并分别导出了梯度和递归型主子空间跟踪算法; 基于Rayleigh商函数, Kong等提出了UIC (Unified information criterion)准则[12], 仿真实验表明基于UIC准则导出的算法具有很快的收敛速度.目前, 发展主子空间跟踪准则仍然具有很强的研究价值.
本文将提出一种新型的主子空间跟踪信息准则, 并通过梯度法导出快速的主子空间跟踪算法.论文的结构安排如下: 第1节提出一种新的主子空间信息准则; 第2节对所提信息准则进行全局最优值分析; 第3节采用梯度上升法导出主子空间跟踪算法; 第4节通过两组仿真实验对所提算法的性能进行验证; 第5节给出本文结论.
1. 新型主子空间跟踪信息准则
考虑一个具有如下形式的多输入多输出线性神经网络模型:
$$ \begin{equation} {\pmb{y}} = {{W}^{\rm T}}{\pmb{x}} \end{equation} $$ (1) 其中, $ {W} = [{\pmb {w}_1}, {\pmb {w}_2}, \cdots , {\pmb {w}_r}] \in {{\bf {R}}^{n \times r}} $是神经网络的权矩阵, $ {\pmb {w}_i} $是权矩阵$ {W} $的第$ i $列; $ {\pmb{x}} \in {{\bf {R}}^{n \times 1}} $是采样信号, 这里作为神经网络的输入; $ {\pmb{y}} \in {{\bf {R}}^{r \times 1}} $为采样信号的低维表示, $ r $是子空间的维数.本文的目的就是构造合适的神经网络权矩阵迭代更新方程, 使神经网络的权矩阵最终能够收敛到采样信号的主子空间.
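为直观起见, 下面给出模型(1)的一个简要数值示意(基于NumPy的假设性示例, 其中维数 $ n $、$ r $ 及样本数等取值均为示意), 并按传统EVD方法给出主子空间的参考基, 以便与后文的神经网络算法相对照:

```python
import numpy as np

# 示意: 模型(1)的线性投影 y = W^T x(维数与样本数均为示意性取值)
n, r = 8, 3
rng = np.random.default_rng(0)

W = rng.standard_normal((n, r))      # 神经网络权矩阵 W ∈ R^{n×r}
x = rng.standard_normal(n)           # 采样信号 x ∈ R^{n×1}
y = W.T @ x                          # 低维表示 y ∈ R^{r×1}

# 参考: 经典做法是对自相关矩阵 R 做特征值分解, 取前 r 个特征向量张成主子空间
X = rng.standard_normal((n, 1000))   # 1000 个样本(示意)
R = X @ X.T / X.shape[1]             # 自相关矩阵的样本估计 R ≈ E[x x^T]
eigval, eigvec = np.linalg.eigh(R)   # 特征值按升序返回
U_r = eigvec[:, ::-1][:, :r]         # 前 r 个主特征向量, 即主子空间的参考基
```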
基于上述神经网络模型, 给定域$ \Omega = \{ {W}|0 < {{W}^{\rm T}}{RW} < \infty , {{W}^{\rm T}}{W} \ne 0\} $, 提出如下信息准则:
$$ \begin{align} {{W}^*} = \, & {\rm{arg}}{\kern 1pt} \mathop {\max }\limits_{{W} \in \Omega } J({W})\\& J({W}) = \frac{1}{2}{\rm tr}\left[ {({{W}^{\rm T}}{RW}){{({{W}^{\rm T}}{W})}^{ - 1}}} \right]+\\& \frac{1}{2}{\rm tr}\left[ {\ln ({{W}^{\rm T}}{W}) - {{W}^{\rm T}}{W}} \right] \end{align} $$ (2) 其中, $ {{R}} = {\rm E}[{\pmb x}{{\pmb x}^{\rm T}}] $为采样信号的自相关矩阵.根据矩阵理论可得:矩阵$ {{R}} $是一个对称的正定矩阵, 且特征值均为正.对矩阵$ {{R}} $进行特征值分解得:
$$ \begin{equation} {{R}} = {{U\Lambda }}{{{U}}^{\rm T}} \end{equation} $$ (3) 其中, $ {{U}} = [{{\pmb {u}}_1}, {{\pmb {u}}_2}, \cdots , {{\pmb {u}}_n}] $是由矩阵$ {{R}} $的特征向量构成的矩阵, $ {{\Lambda }} = {\rm diag}\{{\lambda _1}, {\lambda _2}, \cdots , {\lambda _n}\} $是由矩阵$ {{R}} $的特征值组成的对角矩阵.为了后续使用方便, 这里将特征值按降序进行排列, 即特征值满足如下不等式:
$$ \begin{equation} {\lambda _1} > {\lambda _2} > \cdots > {\lambda _r} > \cdots > {\lambda _n} > 0 \end{equation} $$ (4) 根据主子空间的定义可知特征值$ {\lambda _1}, {\lambda _2}, \cdots , {\lambda _r} $所对应的特征向量张成的子空间称为输入信号的主子空间.从式(2)可得$ J({W}) $是无下界的, 且当$ {W} $趋于无穷大时, $ J({W}) $将趋于负无穷大, 因此研究$ J({W}) $的极小值是没有任何意义的.实际上, 我们关心的是$ J({W}) $的极大点.具体说来, 我们关心以下几个问题:
1) $ J({W}) $有没有全局极大点?
2) $ J({W}) $的全局极大点与信号的主子空间之间的关系是什么?
3) $ J({W}) $有没有其他局部极值?
上述3个问题将在下一节中予以回答.
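在进入下一节的严格分析之前, 这里先给出式(2)的一个数值示意(基于NumPy的假设性示例, 其中以$ \ln\det({W^{\rm T}}{W}) $计算$ {\rm tr}[\ln({W^{\rm T}}{W})] $, 函数名与取值均为示意), 直观展示$ J({W}) $在主子空间的一组正交基处取得最大值:

```python
import numpy as np

def J(W, R):
    """按式(2)计算信息准则 J(W), 其中 tr[ln(W^T W)] 按 ln det(W^T W) 计算(示意实现)."""
    WtW = W.T @ W
    term1 = 0.5 * np.trace(W.T @ R @ W @ np.linalg.inv(WtW))
    term2 = 0.5 * (np.linalg.slogdet(WtW)[1] - np.trace(WtW))
    return term1 + term2

rng = np.random.default_rng(0)
n, r = 6, 2
A = rng.standard_normal((n, n))
R = A @ A.T + np.eye(n)                        # 构造一个对称正定的自相关矩阵(示意)
lam, U = np.linalg.eigh(R)
lam, U = lam[::-1], U[:, ::-1]                 # 特征值降序排列
W_star = U[:, :r]                              # 主子空间的一组正交基
print(J(W_star, R), 0.5 * lam[:r].sum() - r / 2)          # 两者应相等
print(J(rng.standard_normal((n, r)), R) <= J(W_star, R))  # 任取 W, J(W) 不超过该最大值
```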
2. 信息准则的全局最优值分析
信息准则(2)的全局最优值分析通过定理1和定理2来完成.
定理1. 在域$ \Omega = \{ {W}|0 < {{W}^{\rm T}}{RW} < \infty , {{W}^{\rm T}}{W} \ne 0\} $内, 当且仅当$ {W} = {{{U'}}_r}{{Q}} $时, 权矩阵$ {W} $是信息准则$ J({W}) $的一个平稳点, 其中$ {{{U'}}_r} = [{{\pmb {u}}_{j1}}, {{\pmb {u}}_{j2}}, \cdots , {{\pmb {u}}_{jr}}] $是由自相关矩阵$ {{R}} $的任意$ r $个特征向量构成的矩阵, $ {{Q}} $是任意一个$ r \times r $维正交矩阵.
将$ {{R}} $的特征向量作为空间$ {{\bf{R}}^{n}} $的一组正交基, 则权矩阵$ {W} $可以表示为: $ {W} = {{U}}{{\tilde W}} $, 即$ {{\tilde W}} = {{{U}}^{\rm T}}{W} $, 其中$ {{\tilde W}} \in {{\bf {R}}^{n \times r}} $称为系数矩阵.将这一结果代入式(2)可得:
$$ \begin{align} {{{{\tilde W}}}^*} = \, & {\rm{arg}}{\kern 1pt} \mathop {\max }\limits_{{{\tilde W}} \in \tilde \Omega } E({{\tilde W}})\\ E({{\tilde W}}) = \, & \frac{1}{2}{\rm tr}[({{{{\tilde W}}}^{\rm T}}{{\Lambda \tilde W}}){({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}}]+\\& \frac{1}{2}{\rm tr}[\ln ({{{{\tilde W}}}^{\rm T}}{{\tilde W}}) - {{{{\tilde W}}}^{\rm T}}{{\tilde W}}] \end{align} $$ (5) 显然式(2)和式(5)是等价的, 即定理1的证明可以通过对下述推论的证明来完成, 其中域$ \tilde \Omega $的定义见推论1.
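在给出推论之前, 先用一个简短的数值示例核验式(2)与式(5)的等价性(基于NumPy的假设性示例, 函数$ J $与上文示意相同, $ E({{\tilde W}}) $即以$ \Lambda $代替$ {{R}} $的同一函数):

```python
import numpy as np

def J(W, R):
    """式(2)/式(5)共用的准则函数(示意实现)."""
    WtW = W.T @ W
    return (0.5 * np.trace(W.T @ R @ W @ np.linalg.inv(WtW))
            + 0.5 * (np.linalg.slogdet(WtW)[1] - np.trace(WtW)))

rng = np.random.default_rng(1)
n, r = 6, 2
A = rng.standard_normal((n, n))
R = A @ A.T + np.eye(n)
lam, U = np.linalg.eigh(R)
W_tilde = rng.standard_normal((n, r))        # 任取一个系数矩阵
# 当 W = U * W_tilde 时, J(W, R) 与 E(W_tilde) = J(W_tilde, Λ) 取值相同
print(np.isclose(J(U @ W_tilde, R), J(W_tilde, np.diag(lam))))
```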
推论1. 在域$ \tilde \Omega = \left\{ {{{\tilde W}}|0 < {{{{\tilde W}}}^{\rm T}}{{\Lambda \tilde W}} < \infty} \right. $, $ \left. { {{{{\tilde W}}}^{\rm T}}{{\tilde W}} \ne 0} \right\} $中, 当且仅当$ {{\tilde W}} = {{{P}}_r}{{Q}} $时, $ {{\tilde W}} $是$ E({{\tilde W}}) $的一个平稳点, 其中$ {{{P}}_r} \in {{ {R}}^{n \times r}} $是任意一个$ n \times r $维置换矩阵.
证明参见附录A.
定理2. 在域$ \Omega $内, 当且仅当$ {W} = {{{\tilde U}}_r}{{Q}} $时, 其中$ {{{\tilde U}}_r} = [{{\pmb {u}}_1}, {{\pmb {u}}_2}, \cdots , {{\pmb {u}}_r}] $是由自相关矩阵$ {{R}} $的前$ r $个特征向量构成的矩阵且$ {{Q}} $是任意一个$ r \times r $维正交矩阵, 信息准则$ J({W}) $达到全局极大. $ J({W}) $没有其他局部极值, 在全局最大点处有:
$$ \begin{equation} J({W}) = \frac{1}{2}\sum\limits_{i = 1}^r {{\lambda _i}} - \frac{r}{2} \end{equation} $$ (6) 同定理1的证明过程一样, 这里将通过对推论2的证明来完成定理2的证明.
推论2. 在域$ \tilde \Omega $中, 当且仅当$ {{\tilde W}} = {{\bar PQ}} $时, 信息准则$ E({{\tilde W}}) $达到全局最大点, 其中$ {{\bar P}} = {( {{{\tilde P}}}\; \; {{0}})^{\rm T}} \in {{ {\bf R}}^{n \times r}} $是一个$ n \times r $维矩阵, $ {{\tilde P}} $是一个$ r \times r $维置换矩阵, 而$ E({{\tilde W}}) $所有其他的平稳点都是鞍点.
证明参见附录B.
通过定理1和定理2可知, 当神经网络权矩阵$ {W} $刚好收敛到自相关矩阵$ {{R}} $的主子空间的一组正交基时, $ J({W}) $取得全局极大值, 从而建立起神经网络与信号主子空间之间的关系.由于信息准则$ J({W}) $只有全局最大点, 而没有其他局部极值, 因此可以采用非线性规划算法(如梯度法、共轭梯度法、牛顿法等)来求解该优化问题.梯度法直接以$ J({W}) $的梯度作为迭代方向; 共轭梯度法需要不断修正算法的共轭方向, 所导出的算法通常具有较高的计算复杂度; 牛顿法则需要用到Hessian矩阵, 而当$ {W} $是一个矩阵时, 该Hessian矩阵是非常难以获得的.相比共轭梯度法和牛顿法, 梯度法结构更为简单、计算复杂度更低, 因此下一节将采用梯度法导出新型的主子空间跟踪算法.
3. 主子空间跟踪算法
假定$ {\pmb{x}}(k), k = 0, 1, 2, \cdots $是一个平稳的随机过程, 这里将其作为神经网络模型的输入.根据随机学习理论, 权矩阵$ {W} $的变化规律与输入向量$ {\pmb{x}}(k) $并不相关.取式(2)作为最优化函数, 则可以得出所提信息准则的梯度流为:
$$ \begin{align} \frac{{{\rm d}{W}(t)}}{{\rm d}{t}} = \, & \left[ {\frac{{{W}(t)}}{{{{W}^{\rm T}}(t){W}(t)}} - {W}(t)} \right]+\\& \left[ {{RW}(t) - \frac{{{W}(t){{W}^{\rm T}}(t){RW}(t)}} {{{{W}^{\rm T}}(t){W}(t)}}} \right]\times\\& {\left\{ {{{W}^{\rm T}}(t){W}(t)} \right\}^{ - 1}} \end{align} $$ (7) 应用随机近似理论可得:
$$ \begin{align} &\frac{{{\rm d}{W}(t)}}{{\rm d}{t}} = \\&\quad \left[ {{W}(t){{\left\{ {{{W}^{\rm T}}(t){W}(t)} \right\}}^{ - 1}} - {W}(t)} \right]- \\&\quad \left[ {{W}(t){{W}^{\rm T}}(t){\pmb{x}}(t){{\pmb{x}}^{\rm T}}(t){W}(t){{\left\{ {{{W}^{\rm T}}(t){W}(t)} \right\}}^{ - 1}}} \right.- \\&\quad \left. {{\pmb{x}}(t){{\pmb{x}}^{\rm T}}(t){W}(t)} \right]{\left\{ {{{W}^{\rm T}}(t){W}(t)} \right\}^{ - 1}} \end{align} $$ (8) 对式(8)进行离散化操作后获得如下方程:
$$ \begin{align} &{W}\left( {k + 1} \right) = \\&\quad {{W}}(k) - \eta \left[ {{{W}}(k){{\left( {{{{W}}^{\rm T}}(k){{W}}(k)} \right)}^{ - 1}}{\pmb{y}}(k){{\pmb{y}}^{\rm T}}(k)} \right.- \\&\quad \left. {{\pmb{x}}(k){{\pmb{y}}^{\rm T}}(k)} \right] {\left( {{{{W}}^{\rm T}}(k){{W}}(k)} \right)^{ - 1}} - \eta {{W}}(k)+ \\&\quad \eta {{W}}(k){\left( {{{{W}}^{\rm T}}(k){{W}}(k)} \right)^{ - 1}} \end{align} $$ (9) 其中, $ \eta $是神经网络的学习因子, 且满足$ 0 < \eta < 1 $.式(9)所描述的算法在每一步迭代过程中计算复杂度为: $ n{{r}^{2}}+4{{r}^{3}}\text{ /}3 $, 这点与UIC算法[12]是相同的, 要少于NIC算法[11]中的$ 2{n^2}r + {\rm O}(n{r^2}) $的计算量和共轭梯度算法[13] $ 12{n^2}r + {\rm O}(n{r^2}) $的计算量.
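作为参考, 下面给出式(9)的一个直接实现示意(基于NumPy的假设性示例, 函数名`psa_update`为说明而设, 未针对上述$ n{{r}^{2}}+4{{r}^{3}}/3 $的计算复杂度做专门优化):

```python
import numpy as np

def psa_update(W, x, eta=0.1):
    """按式(9)对权矩阵 W 做一步迭代更新(示意实现)."""
    y = W.T @ x                                  # 式(1): y = W^T x
    WtW_inv = np.linalg.inv(W.T @ W)             # (W^T W)^{-1}
    W_next = (W
              - eta * (W @ WtW_inv @ np.outer(y, y) - np.outer(x, y)) @ WtW_inv
              - eta * W
              + eta * W @ WtW_inv)
    return W_next

# 用法示意: 对一列随机样本逐步更新权矩阵
rng = np.random.default_rng(2)
n, r, eta = 31, 12, 0.1
W = rng.standard_normal((n, r))
for _ in range(1000):
    W = psa_update(W, rng.standard_normal(n), eta)
```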
4. 仿真实验
本节通过两个仿真实例来对所提算法的性能进行验证.第一个实验考察所提算法提取多维主子空间的能力并将仿真结果与其他同类型算法进行对比; 第二个是应用所提算法解决图像重构问题.
4.1 主子空间跟踪实验
在本实验中, 所提算法将与UIC算法和SDPM算法进行对比.为了衡量算法的收敛性能, 这里采用如下两个评价函数, 第一个是第$ k $次迭代时的权矩阵模值
$$ \begin{equation} p({{{W}}_k}) = {\left\| {{{W}}_k^{\rm T}{{{W}}_k}} \right\|_F} \end{equation} $$ (10) 第二个是指标参数
$$ \begin{equation} {\rm{dist}}({{{W}}_k}) = {\left\| {{{W}}_k^{\rm T}{{{W}}_k}{\rm diag}{\left\{{{{W}}_k^{\rm T}{{{W}}_k}} \right\}^{ - 1}} - {{{I}}_r}} \right\|_F} \end{equation} $$ (11) 指标参数代表着权矩阵正交化的偏离程度.显然, 如果$ {\rm{dist}}({{{W}}_k}) $收敛到0, 则意味着权矩阵收敛到了主子空间的一个正交基.
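为便于复现, 下面给出式(10)和式(11)两个评价函数的实现示意(基于NumPy的假设性示例, 函数名为说明而设):

```python
import numpy as np

def p_norm(W):
    """式(10): 第 k 次迭代时的权矩阵模值 p(W_k) = ||W_k^T W_k||_F(示意实现)."""
    return np.linalg.norm(W.T @ W, 'fro')

def dist_index(W):
    """式(11): 指标参数, 度量 W_k 偏离正交基的程度, 收敛到 0 表示列向量趋于正交(示意实现)."""
    WtW = W.T @ W
    D_inv = np.diag(1.0 / np.diag(WtW))          # diag{W_k^T W_k}^{-1}
    return np.linalg.norm(WtW @ D_inv - np.eye(W.shape[1]), 'fro')
```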
本实验中信号产生方法与文献[12]相同, 即输入信号$ {{{X}}_k} = {{B}}{{\pmb {z}}_k} $, 其中$ B=\text{randn}(31,31)/31$是一个随机产生的矩阵, $ {{\pmb {z}}_k} \in {{ {\bf R}}^{31 \times 1}} $是高斯的、瞬时白的、随机产生的向量.在本实验中, 分别采用UIC算法、SDPM算法和所提算法对信号的12维主子空间进行提取跟踪.三种算法采用相同的学习因子$ \eta = 0.1 $, 初始权矩阵是随机产生的(即矩阵的每一个元素均服从均值为零、方差为1的高斯分布).通常情况下, 取不同的初始化权矩阵时, 算法具有不同的收敛速度.为了更全面地衡量算法性能, 通常取多次实验的平均值来描述算法收敛过程.图 1和图 2中分别给出了三种算法在迭代过程中的权矩阵模值曲线和指标参数曲线, 该结果曲线是100次独立实验结果平均得到的.图中实线代表所提算法, 虚线代表UIC算法, 点划线代表SDPM算法.
从图 1和图 2中可以发现, 所提算法的权矩阵模值曲线收敛到了一个常数, 而且指标参数收敛到了零.这就表明所提算法具备跟踪信号主子空间的能力.从图 1中我们还可以发现, 所提算法的权矩阵模值在200步时就已经收敛, 而UIC算法则需要300步, SDPM算法需要800步, 即所提算法的权矩阵模值曲线具有最快的收敛速度.同理, 从图 2中可以发现所提算法指标参数的收敛速度要优于其他两个算法.综合两图可以得出结论:三种算法中, 所提算法具有最快的收敛速度.
4.2 图像压缩重构实验
数据压缩是主子空间算法的一个很重要的应用.本实验将利用所提算法对著名的Lena图像进行压缩重构.如图 3所示, 原始Lena图像的像素为$ 512 \times 512 $.在本实验中, 将Lena图像分解成若干个$ 8 \times 8 $不重叠的小块.将每一个小块中的数据按照从左至右从上到下的顺序排列起来, 构成一个64维向量.在去掉均值和标准化后, 将这些图像数据构成一个输入序列.然后采用SDPM算法、UIC算法以及所提算法对该图像进行压缩重构, 这里重构维数为5.与实验1相同, 本实验同样采用权矩阵模值和指标参数两个评价函数.
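下面给出该分块、压缩与重构流程的一个实现示意(基于NumPy的假设性示例: 其中以随机数组代替Lena图像, 按每一维做去均值与标准化为示意性处理, 权矩阵$ {W} $假设已由所提算法训练得到, 此处用随机正交阵代替):

```python
import numpy as np

def image_to_blocks(img, b=8):
    """将图像分成 b×b 的不重叠小块, 每块按从左至右、从上到下的顺序展开为 b*b 维向量."""
    h, w = img.shape
    return img.reshape(h // b, b, w // b, b).swapaxes(1, 2).reshape(-1, b * b).astype(float)

def blocks_to_image(blocks, h, w, b=8):
    """image_to_blocks 的逆操作, 由各小块拼回完整图像."""
    return blocks.reshape(h // b, w // b, b, b).swapaxes(1, 2).reshape(h, w)

rng = np.random.default_rng(3)
img = rng.random((512, 512))                       # 假设: 以随机图像代替 512×512 的 Lena 图像
X = image_to_blocks(img)                           # 每行为一个 64 维样本
mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
Xs = (X - mu) / sigma                              # 去均值并标准化
W = np.linalg.qr(rng.standard_normal((64, 5)))[0]  # 假设 W 已由所提算法训练得到(此处用正交随机阵示意)
Y = Xs @ W                                         # 压缩: 64 维 -> 5 维
X_rec = (Y @ W.T) * sigma + mu                     # 由 5 维表示重构并恢复均值与尺度
img_rec = blocks_to_image(X_rec, 512, 512)
```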
本实验中三种算法的初始化参数设置方法与实验1相类似, 具体参数如下:学习因子$ \eta = 0.2 $, 初始化权矩阵是随机产生的.图 4是经过所提算法压缩后重构出来的Lena图像, 图 5是三种算法的权矩阵模值曲线, 图 6是三种算法的指标参数曲线, 该结果曲线都是100次独立实验的平均值.对比图 3和图 4可以发现, 重构的Lena图像是很清晰的, 即所提算法能够有效解决图像压缩重构问题.通过图 5和图 6可以发现:不论是权矩阵模值曲线还是指标参数曲线, 所提算法的收敛速度均要快于UIC算法和SDPM算法.这进一步证实了所提算法在收敛速度方面的优势.
5. 结论
主子空间跟踪算法在现代信息科学各个领域均有着很重要的应用.基于Hebbian神经网络的主子空间跟踪是近些年来国际上的一个研究热点.然而目前大多数主子空间跟踪神经网络算法是基于启发式推理而提出来的, 能够提供信息准则的算法并不多见.针对这一问题, 本文提出了一种新型的信息准则, 并对所提信息准则的全局最优性进行了严格的证明.通过梯度上升法导出了一个主子空间跟踪算法.仿真实验表明:相比一些现有主子空间跟踪算法, 所提算法具有更快的收敛速度.
进一步的研究方向是寻找新型的主子空间信息准则, 创新信息准则的平稳点分析方法, 将主子空间算法应用于更广泛的领域.
附录A.推论1证明
证明 在域$ \tilde \Omega $内, $ E({{\tilde W}}) $对于矩阵$ {{\tilde W}} $的一阶微分存在, 且有
$$ \begin{align} \nabla E({{\tilde W}}) = \, & {{\Lambda \tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}} - {{\tilde W}} - \\& {{\tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 2}}{{{{\tilde W}}}^{\rm T}}{{\Lambda \tilde W}} + {{\tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}} \end{align} $$ (A1) 定义一个矩阵集$ \left\{ {{{\tilde W}}|{{\tilde W}} = {{{P}}_r}{{Q}}} \right\} $, 则在该集合内的任意一点均有:
$$ \begin{align} & \nabla E({{\tilde W}}){|_{{{\tilde W}} = {{{P}}_r}{{Q}}}} = \\&\qquad {{\Lambda }}{{{P}}_r}{{Q}}{({{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{{P}}_r} {{Q}})^{ - 1}} + {{{P}}_r}{{Q}}{({{{Q}}^{\rm T}} {{P}}_r^{\rm T}{{{P}}_r}{{Q}})^{ - 1}}- \\&\qquad {{{P}}_r}{{Q}}{({{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{{P}}_r}{{Q}})^{ - 2}}{{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r}{{Q}} - {{{P}}_r}{{Q}} = \\&\qquad {{\Lambda }}{{{P}}_r}{{Q}} + {{{P}}_r}{{Q}}{{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r}{{Q}} = 0 \end{align} $$ (A2) 反之, 根据定义可得, 在$ E({{\tilde W}}) $平稳点处有$ \nabla E({{\tilde W}}) = {{0}} $成立, 即
$$ \begin{align} &{{\Lambda \tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}} + {{\tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}}- \\&\qquad {{\tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 2}}{{{{\tilde W}}}^{\rm T}}{{\Lambda \tilde W}} - {{\tilde W}} = {{0}} \end{align} $$ (A3) 将上式左右两边各乘以$ {{{\tilde W}}^{\rm T}} $可得:
$$ \begin{align} & {{{{\tilde W}}}^{\rm T}}{{\Lambda \tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}} + {{{{\tilde W}}}^{\rm T}}{{\tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}}- \\&\qquad {{{{\tilde W}}}^{\rm T}}{{\tilde W}}{({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 2}}{{{{\tilde W}}}^{\rm T}}{{\Lambda \tilde W}} - {{{{\tilde W}}}^{\rm T}}{{\tilde W}} = \\&\qquad {{{{\tilde W}}}^{\rm T}}{{\Lambda \tilde W}} {({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}} - {{{{\tilde W}}}^{\rm T}}{{\tilde W}}- \\&\qquad {({{{{\tilde W}}}^{\rm T}}{{\tilde W}})^{ - 1}}{{{{\tilde W}}}^{\rm T}}{{\Lambda \tilde W}} + {{I}} = \\&\qquad {{I}} - {{{{\tilde W}}}^{\rm T}}{{\tilde W}} = {{0}} \end{align} $$ (A4) 根据上式可得, 在$ E({{\tilde W}}) $平稳点处有
$$ \begin{equation} {{{\tilde W}}^{\rm T}}{{\tilde W}} = {{I}} \end{equation} $$ (A5) 上式表明在$ E({{\tilde W}}) $平稳点处矩阵$ {{\tilde W}} $的各个列向量之间是相互正交的.将式(A5)代入式(A3)可得:
$$ \begin{equation} {{\Lambda \tilde W}} = {{\tilde W}}{{{\tilde W}}^{\rm T}}{{\Lambda \tilde W}} \end{equation} $$ (A6) 令矩阵$ {\tilde W}$的行向量为${{{\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\smile}$}}{u}}}_{i}}\ (i=1,2,\cdots ,n) $, 即有$ \tilde W{\rm{ = }}{\left[ {\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\smile$}} \over u} _1^{\rm{T}},\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\smile$}} \over u} _2^{\rm{T}}, \cdots ,\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\smile$}} \over u} _n^{\rm{T}}} \right]^{\rm{T}}} $, 同时定义矩阵$ {{B}} = {{{\tilde W}}^{\rm T}}{{\Lambda \tilde W}} $, 则根据式(11)可得:
$$ \begin{equation} {\sigma _i}{\mathord{\buildrel{\lower3pt\hbox{$ \smile$}} \over u} _i} = {\mathord{\buildrel{\lower3pt\hbox{$ \smile$}} \over u} _i}{{B}}, \quad i = 1, 2, \cdots , n \end{equation} $$ (A7) 显然, 上式可以看作是矩阵$ {{B}} $的特征值分解.由于$ {{B}} $是一个$ r \times r $维对称正定矩阵, 只有$ r $个相互正交的左行特征向量, 即矩阵$ {{\tilde W}} $只有$ r $个相互正交的行向量.更进一步, 矩阵$ {{\tilde W}} $的这$ r $个非零行向量正好构成了一个正交矩阵, 也就是说此时矩阵$ {{\tilde W}} $可以通过$ {{\tilde W}} = {{{P}}_r}{{Q}} $来表示.
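作为推论1的一个数值旁证, 下面的示意代码(基于NumPy的假设性示例, 其中选择矩阵$ {{{P}}_r} $的列取法与正交矩阵$ {{Q}} $均为示意)按式(A1)计算梯度, 并验证在$ {{\tilde W}} = {{{P}}_r}{{Q}} $处梯度为零:

```python
import numpy as np

def grad_E(W_tilde, Lam):
    """按式(A1)计算 E 关于系数矩阵的梯度(示意实现)."""
    G = np.linalg.inv(W_tilde.T @ W_tilde)       # (W~^T W~)^{-1}
    return (Lam @ W_tilde @ G - W_tilde
            - W_tilde @ G @ G @ W_tilde.T @ Lam @ W_tilde
            + W_tilde @ G)

rng = np.random.default_rng(4)
n, r = 6, 2
Lam = np.diag(np.sort(rng.uniform(0.5, 5.0, size=n))[::-1])   # 降序排列的特征值对角阵(示意)
P_r = np.eye(n)[:, [1, 4]]                                    # 任取两个特征方向构成的 n×r 选择矩阵(示意)
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))              # 任意 r×r 正交矩阵
print(np.allclose(grad_E(P_r @ Q, Lam), 0))                   # 平稳点处梯度应为零
```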
附录B.推论2证明
证明 定义一个置换矩阵$ {{{P}}_r} \ne {{\bar P}} $, 则矩阵$ {{{P}}_r} $中的第$ r + 1 $到$ n $个行向量之中必有一个非零的行向量.由于$ {{{P}}_r} $和$ {{\bar P}} $同为两个置换矩阵, 则必定存在两个对角矩阵$ {{\bar \Lambda }} $和$ {{\hat \Lambda }} $使得下式成立:
$$ \begin{equation} {{{\bar P}}^{\rm T}}{{\Lambda \bar P}} = {{\bar \Lambda }}, {{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r} = {{\hat \Lambda }} \end{equation} $$ (B1) 根据上式可得:
$$ \begin{equation} \begin{cases} {\rm tr}\left( {{{{{\bar P}}}^{\rm T}}{{\Lambda \bar P}}} \right) = \sum\limits_{i = 1}^r {{\lambda _i}} \\ {\rm tr}\left( {{{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r}} \right) = \sum\limits_{i = 1}^r {{\lambda _{{{\hat j}_i}}}} \end{cases} \end{equation} $$ (B2) 将特征值$ {\lambda _{\hat ji}} $ $ (i = 1, 2, \cdots , r) $按照降序顺序排列, 即有$ {\hat \lambda _{\hat j1}} > {\hat \lambda _{\hat j2}} > \cdots > {\hat \lambda _{\hat jr}} $, 则对于$ {{{P}}_r} \ne {{\bar P}} $, 则必定存在有$ {\lambda _i} > {\hat \lambda _{\hat ji}}\; (i = 1, 2, \cdots , r) $成立, 即有:
$$ \begin{equation} {\rm tr}\left( {{{{{\bar P}}}^{\rm T}}{{\Lambda \bar P}}} \right) > {\rm tr}\left( {{{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r}} \right) \end{equation} $$ (B3) 由于
$$ \begin{align} E({{{P}}_r}{{Q}}) = \, & \frac{1}{2}{\rm tr}[({{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r}{{Q}}){({{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{{P}}_r}{{Q}})^{ - 1}}]+\\& \frac{1}{2}{\rm tr}[\ln ({{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{{P}}_r}{{Q}}) - {{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{{P}}_r}{{Q}}] = \\& \frac{1}{2}{\rm tr}[({{{Q}}^{\rm T}}{{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r}{{Q}})] - \frac{r}{2} = \\& \frac{1}{2}{\rm tr}[({{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r})] - \frac{r}{2} \end{align} $$ (B4) $$ \begin{align} E({{{{\bar P}}}^{\rm T}}{{Q}}) = \, & \frac{1}{2}{\rm tr}[({{{Q}}^{\rm T}}{{{{\bar P}}}^{\rm T}}{{\Lambda \bar PQ}}){({{{Q}}^{\rm T}}{{{{\bar P}}}^{\rm T}}{{\bar PQ}})^{ - 1}}]+ \\& \frac{1}{2}{\rm tr}[\ln ({{{Q}}^{\rm T}}{{{{\bar P}}}^{\rm T}}{{\bar PQ}}) - {{{Q}}^{\rm T}}{{{{\bar P}}}^{\rm T}}{{\bar PQ}}] = \\& \frac{1}{2}{\rm tr}[({{{Q}}^{\rm T}}{{{{\bar P}}}^{\rm T}}{{\Lambda \bar PQ}})] - \frac{r}{2} = \\& \frac{1}{2}{\rm tr}[({{{{\bar P}}}^{\rm T}}{{\Lambda \bar P}})] - \frac{r}{2} \end{align} $$ (B5) 根据式(B3)有:
$$ \begin{equation} E({{{P}}_r}{{Q}}) < E({{{\bar P}}^{\rm T}}{{Q}}) \end{equation} $$ (B6) 即集合$ \left\{ {{P_r}Q|{Q^{\rm{T}}}\Lambda Q > 0\& {{\rm{P}}_{\rm{r}}} \ne {\rm{\bar P}}} \right\} $的点并不是全局极大点.
由于$ {{{P}}_r} \ne {{\bar P}} $, 则矩阵$ {{\bar P}} $中必定存在一列向量$ {{{\bar p}}_i} $ $ (1 \le i \le r) $, 使得:
$$ \begin{equation} {{\bar p}}_i^{\rm T}{{{P}}_r} = {{0}} \end{equation} $$ (B7) 同理, 矩阵$ {{{P}}_r} $中也存在一列向量$ {{{p}}_{r, j}}\; (1 \le j \le r) $使得
$$ \begin{equation} {{p}}_{r, j}^{\rm T}{{\bar P}} = {{0}} \end{equation} $$ (B8) 令$ {{{\bar p}}_i} $的非零元素位于第$ {\bar j_i} $行, $ {{{p}}_{r, j}} $的非零元素位于第$ {\hat j_j} $行, 则有$ {\bar j_i} > {\hat j_j} $和$ {\lambda _{{{\hat j}_j}}} > {\lambda _{{{\bar j}_i}}} $.定义矩阵:
$$ \begin{equation} {{B}} = \left[ {{{{p}}_{r, 1}}, \cdots , \frac{{{{{p}}_{r, i}} + \varepsilon {{{{\bar p}}}_i}}}{{\sqrt {1 + {\varepsilon ^2}} }}, \cdots , {{{p}}_{r, r}}} \right] \end{equation} $$ (B9) 其中, $ \varepsilon $是任意小的正数.由$ {{{\bar p}}_i} $和$ {{{p}}_{r, j}} $均有且仅有一个非零元素可得
$$ \begin{align} {{\Lambda B}} = \, & {\rm diag}\left\{ {{\lambda _{\hat j1}}{{{p}}_{r, 1}}, \cdots , } {\frac{{{\lambda _{\hat jj}}{{{p}}_{r, i}} + \varepsilon {\lambda _{\bar ji}}{{{{\bar p}}}_i}}}{{\sqrt {1 + {\varepsilon ^2}} }}, \cdots , {\lambda _{\hat jr}}{{{p}}_{r, r}}} \right\} \end{align} $$ (B10) 更进一步有:
$$ \begin{equation} {{{B}}^{\rm T}}{{\Lambda B}} = {\rm diag}\left\{ {{\lambda _{\hat j1}}, \cdots , \frac{{{\lambda _{\hat jj}} + {\varepsilon ^2} {\lambda _{\bar ji}}}}{{1 + {\varepsilon ^2}}}, \cdots , {\lambda _{\hat jr}}} \right\} \end{equation} $$ (B11) 结合式(B1)可得:
$$ \begin{align} & {{{B}}^{\rm T}}{{\Lambda B}} - {{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r} = \\& \qquad {\rm diag}\left\{ {{\lambda _{\hat j1}}, \cdots , \frac{{{\lambda _{\hat jj}} + \varepsilon {\lambda _{\bar ji}}}}{{1 + {\varepsilon ^2}}}, \cdots , {\lambda _{\hat jr}}} \right\}- \\& \qquad {\rm diag}\left\{ {{\lambda _{\hat j1}}, \cdots , {\lambda _{\hat jj}}, \cdots , {\lambda _{\hat jr}}} \right\} = \\& \qquad {\rm diag}\left\{ {0, \cdots , \frac{{\left( { - {\lambda _{\hat jj}} + {\lambda _{\bar ji}}} \right){\varepsilon ^2}}}{{1 + {\varepsilon ^2}}}, \cdots , 0} \right\} \end{align} $$ (B12) 由于$ {\lambda _{\hat jj}} > {\lambda _{\bar ji}} $, 所以$ {{{B}}^{\rm T}}{{\Lambda B}} - {{P}}_r^{\rm T}{{\Lambda }}{{{P}}_r} $是一个负定矩阵, 因此有:
$$ \begin{equation} E({{BQ}}) = \frac{1}{2}{\rm tr}[{{{B}}^{\rm T}}{{\Lambda B}}] - \frac{r}{2} < E({{{P}}_r}{{Q}}) \end{equation} $$ (B13) 即集合$ \left\{ {{P_r}Q|{Q^{\rm{T}}}\hat \Lambda Q > \;0{\rm{\& }}{P_r} \ne \bar P} \right\} $中所有平稳点都是不稳定的鞍点.
接下来将证明: $ J({{W}}) $没有其他局部极值.令$ {{{\dot U}}_r} = {{{\tilde U'}}_r} + \varepsilon {{{M}}_1} $, 其中$ {{{M}}_1} = [{{0}}, \cdots , {{\pmb {u}}_k}, \cdots , {{0}}] $, $ 1 \le k \le r $.即$ {{{\dot U}}_r} $是$ {{{\tilde U'}}_r} $沿$ {{\pmb {u}}_k} $的方向增长而成.由于$ {{{\tilde U'}}_r} \ne {{{\tilde U}}_r} $, 则必定有$ {\lambda _k} > {\lambda _{jk}} $.
当$ {{W}} = {{{\dot U}}_r}{{Q}} $时, 有:
$$ \begin{align} & {\left. {J({{W}})} \right|_{{{W}} = {{{{\dot U}}}_r}{{Q}}}} - {\left. {J({{W}})} \right|_{{{W}} = {{{{\tilde U'}}}_r}{{Q}}}} = \\& \qquad \frac{1}{2}\left( {{\lambda _k} - {\lambda _{jk}}} \right){\varepsilon ^2} + o({\varepsilon ^2}) \end{align} $$ (B14) 令$ {{{\ddot U}}_r} = {{{\tilde U'}}_r} + \varepsilon {{{M}}_2} $, 其中$ {{{M}}_2} = [{{0}}, \cdots , {{\pmb {u}}_{jk}}, \cdots , {{0}}] $, $ 1 \le k \le r $.即$ {{{\dot U}}_r} $是$ {{{\tilde U'}}_r} $沿$ {{\pmb {u}}_{jk}} $的方向增长而成.当$ {{W}} = {{{\dot U}}_r}{{Q}} $时, 有:
$$ \begin{align} & {\left. {J({{W}})} \right|_{{{W}} = {{{{\ddot U}}}_r}{{Q}}}} - {\left. {J({{W}})} \right|_{{{W}} = {{{{\tilde U'}}}_r}{{Q}}}} = - 2{\varepsilon ^2} + o({\varepsilon ^2}) \end{align} $$ (B15) 由式(B14)和式(B15)可得, 当$ {{W}} = {{{\tilde U'}}_r}{{Q}} $且$ {{W}} \ne {{{\tilde U}}_r}{{Q}} $时, $ J({{W}}) $沿$ {{\pmb {u}}_k} $方向是增的, 而沿$ {{\pmb {u}}_{jk}} $方向是减的, 所以$ J({{W}}) $在该平稳点处不可能取得局部极值.
表 1 与已发表相关论文的研究异同
Table 1 Similarities and differences between the related surveys
异同点 | 深度强化学习综述: 兼论计算机围棋的发展 | 多Agent深度强化学习综述
出发点 | 深度强化学习的发展, 以及深度强化学习在围棋发展中的应用 | 深度强化学习方法在多Agent系统中的研究现状
综述角度 | 从强化学习以及深度学习的研究出发, 对发展而来的深度强化学习进行论述, 并指出其在围棋发展中的应用 | 论述在多Agent系统中如何应用深度强化学习, 并从神经网络的搭建结构出发, 对当前的多Agent深度强化学习方法进行分类与研究
内容安排 | 讨论了强化学习与深度学习的研究成果及其展望, 论述了深度强化学习的主要神经网络结构; 在此基础上对AlphaGo进行了分析与研究, 展开了对计算机围棋发展的研究, 详细论述了AlphaGo的对决过程, 刻画了结合MCTS的深度强化学习方法在围棋研究中的巨大成功; 之后讨论了深度强化学习的展望, 分析了其在博弈、连续状态动作、与其他智能方法结合以及理论分析等方面的发展前景; 最后给出了深度强化学习的应用 | 根据深度强化学习策略的输出形式, 从深度Q学习和深度策略梯度两个方面对深度强化学习方法进行介绍; 之后讨论了在多Agent系统中如何使用深度强化学习方法解决多Agent系统所面临的问题, 从多Agent深度强化学习中通信过程的角度对现有算法进行归纳, 将其归纳为全通信集中决策、全通信自主决策、欠通信自主决策3种主流形式; 最后针对深度强化学习引入多Agent系统后面临的训练架构、样本增强、鲁棒性以及对手建模等新的挑战进行了讨论与分析