Semi-supervised Concept Drift Detection Method by Combining Sample Output Space and Feature Space With Its Application
Abstract: The municipal solid waste incineration (MSWI) process exhibits concept drift caused by factors such as fluctuations in waste composition, equipment wear and maintenance, and seasonal changes, so the data used to model pollutant emission concentrations are time-varying. New samples that represent the concept drift must therefore be identified to update the pollutant measurement model. However, existing drift detection methods are difficult to apply effectively to industrial processes in which the true values of modeling samples are hard to obtain. To address this problem, a semi-supervised concept drift detection method that combines the sample output space and the feature space is proposed. First, an unsupervised mechanism based on principal component analysis (PCA) identifies concept drift samples in the feature space. Then, in the sample output space, a semi-supervised mechanism based on temporal-difference (TD) learning labels these samples with pseudo-true values, and the Page-Hinkley detection method confirms which samples represent the concept drift. Finally, the new samples obtained in these steps are combined with historical samples to update the measurement model. Simulation results on synthetic and real industrial process data sets show that the proposed method outperforms existing methods.
Moreover, the proposed method effectively reduces the cost of sample annotation while enhancing the drift adaptability of the measurement model.

1) Manuscript received November 27, 2020; accepted March 2, 2021. Supported by National Natural Science Foundation of China (62073006, 62021003, 61890930-5), Natural Science Foundation of Beijing (4212032, 4192009), National Key Research and Development Program of China (2018YFC1900800-5), and the National (Beijing) Key Laboratory of Automatic Control Technology for Mining and Metallurgical Process (BGRIMM-KZSKL-2020-02)
2) Recommended by Associate Editor WEI Qing-Lai
1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124
2. Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing 100124
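The PCA-based unsupervised step named in the abstract can be illustrated with a minimal sketch. It assumes the common $T^2$ and SPE monitoring statistics (the parameter table refers to ConfT2 and ConfSPE), and it makes two simplifying assumptions that are not taken from the paper: control limits are empirical quantiles of the reference data rather than distribution-based limits, and a sample is flagged when either statistic exceeds its limit. The function names are hypothetical.

```python
import numpy as np

def monitor_stats(model, X):
    """Return Hotelling T^2 and SPE (squared residual) for each row of X."""
    Z = (X - model["mean"]) / model["std"]
    T = Z @ model["P"]                         # scores in the retained subspace
    t2 = np.sum(T ** 2 / model["eigvals"], axis=1)
    resid = Z - T @ model["P"].T               # part not explained by the PCs
    spe = np.sum(resid ** 2, axis=1)
    return t2, spe

def fit_pca_monitor(X_ref, n_components, conf=0.9):
    """Fit PCA on reference samples and set empirical control limits.

    Assumption: limits are the `conf` quantiles of the reference statistics,
    a simplification of the usual F / chi-square based limits.
    """
    mean = X_ref.mean(axis=0)
    std = X_ref.std(axis=0, ddof=1)
    std[std == 0] = 1.0                        # guard against constant features
    Z = (X_ref - mean) / std
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    model = {
        "mean": mean,
        "std": std,
        "P": Vt[:n_components].T,              # loadings, shape (d, k)
        "eigvals": S[:n_components] ** 2 / (len(Z) - 1),
    }
    t2, spe = monitor_stats(model, X_ref)
    model["t2_limit"] = np.quantile(t2, conf)
    model["spe_limit"] = np.quantile(spe, conf)
    return model

def flag_drift(model, X):
    """Flag samples whose T^2 or SPE exceeds its limit (assumed OR rule)."""
    t2, spe = monitor_stats(model, X)
    return (t2 > model["t2_limit"]) | (spe > model["spe_limit"])

# Illustrative data only: a reference block and a block with a shifted feature.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 5))
shifted = rng.normal(size=(100, 5))
shifted[:, 0] += 10.0
model = fit_pca_monitor(reference, n_components=2, conf=0.9)
print("flagged fraction (reference):", flag_drift(model, reference).mean())
print("flagged fraction (shifted):  ", flag_drift(model, shifted).mean())
```

Lowering `conf` tightens both limits, so more samples are sent on to the labeling step; this is the trade-off the ConfSPE and ConfT2 settings control.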
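Table 1 Parameters of each data set

| Data set | Total samples | Modeling samples | Validation samples | Drift samples | Feature space dimension |
|---|---|---|---|---|---|
| Synthetic | 1500 | 500 | 500 | 500 | 5 |
| Process | 1500 | 500 | 500 | 500 | 18 |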
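Table 2 Simulation parameter settings

| Parameter | Synthetic data set | Process data set |
|---|---|---|
| GPR kernel function | Radial basis function | Radial basis function |
| Kernel width | 0.5967 | 1.5116 |
| Kernel characteristic length | 0.7939 | 1.4734 |
| Capacity of the window of samples to be labeled ($w$) | 8 | 50 |
| PCA control limit confidence (ConfSPE, ConfT2) | 0.8, 0.8 | 0.9, 0.9 |
| Number of nearest neighbors in TD learning ($\varepsilon$) | 6 | 5 |
| Page-Hinkley baseline cumulative mean measurement error (${\phi _0}$) | 2.2919 | 16.8846 |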
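For readers unfamiliar with the Page-Hinkley test listed among the parameters, the following is a minimal sketch of its textbook form applied to a stream of measurement errors. It is not the paper's variant: how the baseline ${\phi _0}$ enters the authors' test is not described in this section, so the sketch uses the running mean instead, and the `delta` and `threshold` values are illustrative assumptions.

```python
class PageHinkley:
    """Textbook Page-Hinkley test for an increase in the mean of a stream."""

    def __init__(self, delta=0.1, threshold=10.0):
        self.delta = delta          # tolerated deviation (assumed value)
        self.threshold = threshold  # alarm threshold (assumed value)
        self.n = 0
        self.mean = 0.0             # running mean of the observations
        self.cum = 0.0              # cumulative deviation m_t
        self.cum_min = 0.0          # running minimum of m_t

    def update(self, x):
        """Consume one observation; return True if drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold


def first_alarm(stream, delta=0.1, threshold=10.0):
    """Return the 0-based index of the first alarm, or None if there is none."""
    detector = PageHinkley(delta=delta, threshold=threshold)
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None


# Illustrative error stream: 100 stable errors followed by a jump.
errors = [1.0] * 100 + [5.0] * 20
print("first alarm at index:", first_alarm(errors))
```

A larger `threshold` delays the alarm but lowers the false-alarm rate; a larger `delta` makes the test ignore small, slow changes.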
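Table 3 Detection information of the proposed algorithm

| Item | Synthetic data set | Process data set |
|---|---|---|
| Number of times the buffer window was filled | 50 | 9 |
| Number of model updates | 44 | 8 |
| Number of pseudo-true values labeled for drift samples | 350 | 441 |
| RMSE of the original model | 7.6478 | 53.0210 |
| RMSE of the model after applying the proposed algorithm | 2.5840 | 28.8785 |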
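Table 4 Comparison of detection performance of different algorithms

| Data set | Detection algorithm | Model updates | True values needed for updating | Model measurement RMSE | Remarks |
|---|---|---|---|---|---|
| Synthetic | Unsupervised | 101 | 101 | 2.5846 | Requires true values for updating |
| Synthetic | Supervised | 99 | 990 | 2.2943 | Requires true values for detection and updating |
| Synthetic | Proposed | 44 | 50 | 2.5840 | Updates with pseudo-true values |
| Process | Unsupervised | 463 | 463 | 35.8261 | Requires true values for updating |
| Process | Supervised | 19 | 450 | 28.4729 | Requires true values for detection and updating |
| Process | Proposed | 8 | 9 | 28.8785 | Updates with pseudo-true values |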
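The trade-off in Table 4 between labeling cost and accuracy can be made concrete with a little arithmetic. The sketch below copies the table's figures into a dictionary and computes, for the proposed algorithm against each baseline, the relative reduction in true values needed and the relative change in measurement RMSE. The helper names are hypothetical.

```python
# (model updates, true values needed, measurement RMSE), copied from Table 4.
TABLE4 = {
    "synthetic": {
        "unsupervised": (101, 101, 2.5846),
        "supervised": (99, 990, 2.2943),
        "proposed": (44, 50, 2.5840),
    },
    "process": {
        "unsupervised": (463, 463, 35.8261),
        "supervised": (19, 450, 28.4729),
        "proposed": (8, 9, 28.8785),
    },
}


def label_reduction(dataset, baseline):
    """Fraction of true values saved by the proposed algorithm vs. a baseline."""
    proposed = TABLE4[dataset]["proposed"][1]
    reference = TABLE4[dataset][baseline][1]
    return 1.0 - proposed / reference


def rmse_change(dataset, baseline):
    """Relative RMSE change of the proposed algorithm vs. a baseline.

    Negative values mean the proposed algorithm has the lower RMSE.
    """
    proposed = TABLE4[dataset]["proposed"][2]
    reference = TABLE4[dataset][baseline][2]
    return (proposed - reference) / reference


for dataset in TABLE4:
    for baseline in ("unsupervised", "supervised"):
        print(
            f"{dataset:9s} vs {baseline:12s}: "
            f"true values saved {label_reduction(dataset, baseline):6.1%}, "
            f"RMSE change {rmse_change(dataset, baseline):+6.1%}"
        )
```

On these figures the proposed algorithm needs far fewer true values than either baseline, while its RMSE is slightly above the supervised detector's and at or below the unsupervised detector's.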
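Table 5 Comparison of measurement performance of different models

| Data set | Measurement model | Kernel function (kernel width) | Minimum leaf size | Training RMSE | Training $R^2$ | Measurement RMSE |
|---|---|---|---|---|---|---|
| Synthetic | SVR | Radial basis (0.5600) | - | 0.2479 | 0.94 | 3.7900 |
| Synthetic | RT | - | 4 | 0.3034 | 0.91 | 3.1241 |
| Synthetic | GPR | Radial basis (0.5967) | - | 0.1899 | 0.96 | 2.5840 |
| Process | SVR | Radial basis (1.1000) | - | 0.1369 | 0.98 | 30.3916 |
| Process | RT | - | 4 | 0.1630 | 0.97 | 29.9548 |
| Process | GPR | Radial basis (1.5116) | - | 0.1348 | 0.98 | 28.8785 |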
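Table 6 Influence of different distance functions on model updating performance

| Data set | Distance function | Mean pseudo-true-value labeling error | Model measurement RMSE |
|---|---|---|---|
| Synthetic | Manhattan distance | 3.3434 | 3.1939 |
| Synthetic | Chebyshev distance | 3.2382 | 3.2484 |
| Synthetic | Euclidean distance | 3.2760 | 2.5840 |
| Process | Manhattan distance | 38.0043 | 28.9954 |
| Process | Chebyshev distance | 37.7392 | 28.9947 |
| Process | Euclidean distance | 35.9429 | 28.8785 |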
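The three distance functions compared in Table 6 can select different nearest neighbors for the same query, which is one reason the labeling error differs between rows. The sketch below shows only the standard definitions and a nearest-neighbor lookup; how the paper's TD-learning step turns the selected neighbors into a pseudo-true value is not described in this section and is not reproduced here. The helper `nearest_neighbors` is hypothetical.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference (L-infinity distance)."""
    return max(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared differences (L2 distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, samples, k, dist):
    """Return the indices of the k samples closest to query under dist."""
    order = sorted(range(len(samples)), key=lambda i: dist(query, samples[i]))
    return order[:k]


# Illustrative points: the closest sample depends on the metric.
query = (0.0, 0.0)
samples = [(3.0, 0.0), (2.0, 2.0)]
for name, dist in (("Manhattan", manhattan),
                   ("Chebyshev", chebyshev),
                   ("Euclidean", euclidean)):
    print(name, "-> nearest index", nearest_neighbors(query, samples, 1, dist))
```

In this toy case the Manhattan distance picks the first sample, while the Chebyshev and Euclidean distances pick the second.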
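Table 7 Algorithm performance under different variable parameter settings

| Sample window capacity $w$ | Number of nearest neighbors $\varepsilon$ | PCA control limits (ConfSPE, ConfT2) | Buffer window fills | Pseudo-true values labeled | Model updates | Mean pseudo-true-value labeling error | Model measurement RMSE |
|---|---|---|---|---|---|---|---|
| 30 | 3 | 0.85, 0.85 | 16 | 464 | 13 | 38.9005 | 31.0823 |
| 30 | 3 | 0.90, 0.90 | 16 | 464 | 15 | 48.2016 | 35.2513 |
| 30 | 3 | 0.95, 0.95 | 16 | 464 | 12 | 37.7528 | 28.9876 |
| 30 | 5 | 0.85, 0.85 | 16 | 464 | 15 | 40.0004 | 30.4071 |
| 30 | 5 | 0.90, 0.90 | 16 | 464 | 15 | 47.6636 | 34.2694 |
| 30 | 5 | 0.95, 0.95 | 15 | 435 | 13 | 39.0258 | 31.0078 |
| 30 | 8 | 0.85, 0.85 | 16 | 464 | 12 | 40.1782 | 28.8912 |
| 30 | 8 | 0.90, 0.90 | 16 | 464 | 15 | 46.5567 | 32.8323 |
| 30 | 8 | 0.95, 0.95 | 15 | 435 | 14 | 38.4400 | 30.5321 |
| 50 | 3 | 0.85, 0.85 | 9 | 441 | 8 | 42.9923 | 30.1536 |
| 50 | 3 | 0.90, 0.90 | 9 | 441 | 8 | 36.8999 | 29.7216 |
| 50 | 3 | 0.95, 0.95 | 9 | 441 | 7 | 31.2822 | 29.3330 |
| 50 | 5 | 0.85, 0.85 | 9 | 441 | 8 | 43.4483 | 29.8960 |
| 50 | 5 | 0.90, 0.90 | 9 | 441 | 9 | 35.9429 | 28.8785 |
| 50 | 5 | 0.95, 0.95 | 9 | 441 | 7 | 31.9674 | 29.9178 |
| 50 | 8 | 0.85, 0.85 | 9 | 441 | 8 | 42.9759 | 29.4615 |
| 50 | 8 | 0.90, 0.90 | 9 | 441 | 8 | 37.0338 | 29.2796 |
| 50 | 8 | 0.95, 0.95 | 9 | 441 | 6 | 31.4267 | 29.3356 |
| 70 | 3 | 0.85, 0.85 | 6 | 414 | 5 | 44.7315 | 33.6308 |
| 70 | 3 | 0.90, 0.90 | 6 | 414 | 5 | 46.9859 | 36.2573 |
| 70 | 3 | 0.95, 0.95 | 6 | 414 | 5 | 33.4711 | 33.1686 |
| 70 | 5 | 0.85, 0.85 | 6 | 414 | 5 | 41.9744 | 32.4663 |
| 70 | 5 | 0.90, 0.90 | 6 | 414 | 5 | 44.4580 | 34.3495 |
| 70 | 5 | 0.95, 0.95 | 6 | 414 | 5 | 33.6287 | 34.2660 |
| 70 | 8 | 0.85, 0.85 | 6 | 414 | 5 | 42.3929 | 31.0446 |
| 70 | 8 | 0.90, 0.90 | 6 | 414 | 5 | 45.8771 | 34.5003 |
| 70 | 8 | 0.95, 0.95 | 6 | 414 | 5 | 33.2206 | 33.5950 |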
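The counts reported in the tables follow a simple pattern that is worth checking: in every reported setting, the number of pseudo-true values equals the number of buffer window fills times $w - 1$. One reading, which is an inference from the numbers and not something stated in this section, is that each filled window consumes one true value and the remaining $w - 1$ samples are pseudo-labeled; this would also match the 50 and 9 true values listed for the proposed algorithm in the detection performance comparison. A short check, with the (window capacity, fills, pseudo-labels) triples copied from the tables:

```python
# (window capacity w, buffer window fills, pseudo-true values labeled)
REPORTED = [
    (8, 50, 350),    # synthetic data set (simulation settings and detection info)
    (30, 16, 464),   # variable-parameter study
    (30, 15, 435),
    (50, 9, 441),    # also the process data set's default setting
    (70, 6, 414),
]


def expected_pseudo_labels(w, fills):
    """Pseudo-labels if each fill labels all but one sample (assumed reading)."""
    return fills * (w - 1)


mismatches = [
    row for row in REPORTED
    if expected_pseudo_labels(row[0], row[1]) != row[2]
]
print("rows that break the pattern:", mismatches)
```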