
融合自适应评判的随机系统数据驱动策略优化

王鼎 王将宇 乔俊飞

王鼎, 王将宇, 乔俊飞. 融合自适应评判的随机系统数据驱动策略优化. 自动化学报, 2024, 50(5): 1−11 doi: 10.16383/j.aas.c230678
Wang Ding, Wang Jiang-Yu, Qiao Jun-Fei. Data-driven policy optimization for stochastic systems involving adaptive critic. Acta Automatica Sinica, 2024, 50(5): 1−11 doi: 10.16383/j.aas.c230678

doi: 10.16383/j.aas.c230678
基金项目: 国家自然科学基金 (62222301, 61890930-5, 62021003), 科技创新2030 —— “新一代人工智能”重大项目 (2021ZD0112302, 2021ZD0112301) 资助
详细信息
    作者简介:

    王鼎:北京工业大学信息学部教授. 2009 年获得东北大学硕士学位, 2012 年获得中国科学院自动化研究所博士学位. 主要研究方向为强化学习与智能控制. 本文通信作者. E-mail: dingwang@bjut.edu.cn

    王将宇:北京工业大学信息学部博士研究生. 主要研究方向为强化学习和智能控制. E-mail: wangjiangyu@emails.bjut.edu.cn

    乔俊飞:北京工业大学信息学部教授. 主要研究方向为污水处理过程智能控制和神经网络结构设计与优化. E-mail: adqiao@bjut.edu.cn

Data-driven Policy Optimization for Stochastic Systems Involving Adaptive Critic

Funds: Supported by National Natural Science Foundation of China (62222301, 61890930-5, 62021003), National Key Research and Development Program of China (2021ZD0112302, 2021ZD0112301)
More Information
    Author Bio:

    WANG Ding Professor at the Faculty of Information Technology, Beijing University of Technology. He received his master's degree from Northeastern University in 2009 and his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2012. His research interest covers reinforcement learning and intelligent control. Corresponding author of this paper.

    WANG Jiang-Yu Ph.D. candidate at the Faculty of Information Technology, Beijing University of Technology. His research interest covers reinforcement learning and intelligent control

    QIAO Jun-Fei Professor at the Faculty of Information Technology, Beijing University of Technology. His research interest covers intelligent control of wastewater treatment processes, structure design and optimization of neural networks

  • Abstract: Adaptive critic techniques have been widely applied to solving optimal control problems of complex nonlinear systems, but limitations remain when they are used for infinite-horizon optimal control of discrete-time nonlinear stochastic systems. This paper incorporates the adaptive critic technique to establish a data-driven discounted optimal regulation method for discrete-time stochastic systems. First, the infinite-horizon optimal control problem with a discount factor is studied for nonlinear stochastic systems under relaxed assumptions. The proposed stochastic Q-learning algorithm optimizes an initial admissible policy toward the optimal policy with monotonically non-increasing cost. Following the data-driven idea, the stochastic Q-learning algorithm improves the policy directly from data, without building a model. Then, the stochastic Q-learning algorithm is implemented through an actor-critic neural network scheme. Finally, the effectiveness of the proposed stochastic Q-learning algorithm is verified on two benchmark systems.
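As a hedged illustration of the data-driven idea only (not the paper's algorithm, which uses actor-critic neural networks on continuous state spaces), model-free discounted Q-learning can be sketched on a toy stochastic MDP: the policy is improved purely from sampled transitions, and the transition probabilities are never used analytically. All states, costs, and probabilities below are invented for illustration:

```python
import random

random.seed(0)

# Toy stochastic MDP (invented for illustration): 2 states, 2 actions.
# cost[s][a] is the stage cost; p_to_0[s][a] is the probability of
# landing in state 0 after taking action a in state s.
cost = [[0.0, 1.0], [1.0, 2.0]]
p_to_0 = [[1.0, 0.1], [0.8, 0.0]]

def sample_next(s, a):
    # Data-driven: we only draw samples, never use p_to_0 analytically.
    return 0 if random.random() < p_to_0[s][a] else 1

lam, alpha = 0.97, 0.1          # discount factor and learning rate
Q = [[0.0, 0.0], [0.0, 0.0]]    # Q-function estimate (cost to be minimized)

s = 0
for _ in range(20000):
    a = random.randrange(2)     # explore uniformly
    s2 = sample_next(s, a)
    target = cost[s][a] + lam * min(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])
    s = s2

# Greedy (cost-minimizing) policy extracted from the learned Q-function.
policy = [min(range(2), key=lambda a: Q[s][a]) for s in range(2)]
print(policy)
```

Because costs are minimized, the greedy policy picks the action with the smallest Q-value in each state; here it learns to stay in the zero-cost state and to return to it from the other one.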
  • 图  1  Q 网络权值曲线 (基准系统 I)

    Fig.  1  Curves of Q network weights (Benchmark system I)

    图  2  执行网络权值曲线 (基准系统 I)

    Fig.  2  Curves of action network weights (Benchmark system I)

    图  3  控制策略测试曲线 (基准系统 I)

    Fig.  3  Curves of control policies for performance test (Benchmark system I)

    图  4  球台平衡系统示意图 (基准系统II)

    Fig.  4  Schematic diagram of the ball-and-beam system (Benchmark system II)

    图  5  Q 网络权值曲线 (基准系统II)

    Fig.  5  Curves of Q network weights (Benchmark system II)

    图  6  执行网络权值曲线 (基准系统 II)

    Fig.  6  Curves of action network weights (Benchmark system II)

    图  7  系统状态曲线 (基准系统 II)

    Fig.  7  Curves of system states (Benchmark system II)

    图  8  系统控制输入曲线 (基准系统 II)

    Fig.  8  Curves of system control inputs (Benchmark system II)

    图  9  代价函数曲线 (基准系统 II)

    Fig.  9  Curves of system cost-to-go (Benchmark system II)

    表  1  随机 Q-learning 算法的主要参数

    Table  1  Main parameters of the stochastic Q-learning algorithm

    Algorithm parameter    $\mathcal{Q}$    $\mathcal{R}$    $\rho_{\max}$    $\lambda$    $\epsilon$
    Benchmark system I     $2I_2$           $2$              $300$            $0.97$       $0.01$
    Benchmark system II    $0.1I_4$         $0.1$            $500$            $0.99$       $0.01$
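Read as a training loop, the last three columns of Table 1 are a maximum iteration count $\rho_{\max}$, a discount factor $\lambda$, and a convergence threshold $\epsilon$. A minimal sketch of how such a triple typically drives the outer loop (a hypothetical scalar fixed point stands in for one Q-function update sweep; this is our illustration, not the paper's networks):

```python
# Scalar fixed-point iteration q <- c + lam*q as a stand-in for one
# Q-update sweep; with lam < 1 it is a contraction toward c/(1 - lam).
c = 1.0
rho_max, lam, eps = 300, 0.97, 0.01   # Benchmark-system-I values from Table 1

q, iters = 0.0, 0
for i in range(1, rho_max + 1):
    q_new = c + lam * q
    iters = i
    if abs(q_new - q) <= eps:   # stop once the update magnitude falls below eps
        q = q_new
        break
    q = q_new

print(iters, q)   # terminates well before rho_max = 300
```

The per-step change shrinks geometrically like $\lambda^i$, so the $\epsilon$-stopping rule fires after roughly $\ln(\epsilon)/\ln(\lambda) \approx 152$ iterations here, comfortably inside $\rho_{\max}$.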

    表  2  球台平衡系统的主要参数

    Table  2  Main parameters of the ball-and-beam system

    Symbol and value                                Physical meaning
    $S_t = 0.001\;{\rm N/m}$                        Stiffness of the drive mechanism
    $L_\omega = 0.5\;{\rm m}$                       Radius of the beam
    $L = 0.48\;{\rm m}$                             Acting radius of the motor
    $f_c = 1\;{\rm N\cdot s/m}$                     Mechanical friction coefficient of the drive motor
    $I_\omega = 0.14025\;{\rm kg}\cdot {\rm m}^2$   Moment of inertia of the beam
    $g = 9.8\;{\rm m/s}^2$                          Gravitational acceleration
    $\varpi = 0.0162\;{\rm kg}$                     Mass of the ball
    $\tau = 0.02\;{\rm m}$                          Rolling radius of the ball
    $I_b = 4.32\times10^{-5}\;{\rm kg}\cdot {\rm m}^2$   Moment of inertia of the ball
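Under the standard rolling-without-slipping ball-and-beam model (our assumption; the paper's exact dynamics are not reproduced here), the ball parameters in Table 2 already fix the ball's translational acceleration along a beam inclined at angle $\theta$: $\ddot{x} = g\sin\theta / (1 + I_b/(\varpi\tau^2))$. A quick check with Table 2's values:

```python
import math

# Ball parameters copied from Table 2.
g = 9.8          # m/s^2, gravitational acceleration
m = 0.0162       # kg, ball mass (varpi in the table)
tau = 0.02       # m, rolling radius of the ball
I_b = 4.32e-5    # kg*m^2, moment of inertia of the ball

def ball_accel(theta):
    """Acceleration of the ball along the beam at incline theta (rad),
    assuming pure rolling without slipping (standard model, our assumption)."""
    return g * math.sin(theta) / (1.0 + I_b / (m * tau**2))

# At a small incline of 0.1 rad the ball accelerates at about 0.128 m/s^2.
print(ball_accel(0.1))
```

With these values $I_b/(\varpi\tau^2) \approx 6.67$, so the rotational inertia scales the gravity term down by a factor of about $7.67$.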
Publication history
  • Received: 2023-11-02
  • Accepted: 2024-01-08
  • Published online: 2024-02-19
