Active Learning Method for Imbalanced Concept Drift Data Stream

Li Yan-Hong, Wang Tian-Tian, Wang Su-Ge, Li De-Yu

Citation: Li Yan-Hong, Wang Tian-Tian, Wang Su-Ge, Li De-Yu. Active learning method for imbalanced concept drift data stream. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c230233

doi: 10.16383/j.aas.c230233
Funds: Supported by the National Key Research and Development Program of China (2022QY0300-01), the National Natural Science Foundation of China (62076158), and the Fundamental Research Program of Shanxi Province (202203021221001)
Author Bio:

    Li Yan-Hong  Associate professor at the School of Computer and Information Technology, Shanxi University. Her research interests include data mining and machine learning. Corresponding author of this paper. E-mail: liyh@sxu.edu.cn

    Wang Tian-Tian  Master student at the School of Computer and Information Technology, Shanxi University. Her research interests include data mining and machine learning. E-mail: wttstu@163.com

    Wang Su-Ge  Professor at the School of Computer and Information Technology, Shanxi University. Her research interests include natural language processing and machine learning. E-mail: wsg@sxu.edu.cn

    Li De-Yu  Professor at the School of Computer and Information Technology, Shanxi University. His research interests include data mining and artificial intelligence. E-mail: lidy@sxu.edu.cn

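  • Abstract: Data stream classification studies how to provide more reliable data-driven prediction models in open, dynamic environments. The key is to detect and adapt to concept drift in data streams that arrive in real time and change continuously. To detect concept drift and update the classification model, current data stream classification methods usually assume that the labels of all samples are known, an assumption that is unrealistic in real-world scenarios. In addition, real data streams may exhibit high and changing class imbalance ratios, which further complicates the classification task. To address these issues, an active learning method for imbalanced concept drift data streams is proposed. First, a measure of sample prediction certainty based on multiple predicted probabilities is defined, and an adaptive adjustment method for the margin threshold matrix is proposed, so that the label query strategy suits imbalanced data streams with many classes. Second, a sample replacement strategy based on memory strength is proposed; it keeps hard-to-distinguish samples, minority-class samples, and samples representing the current data distribution in the memory window, which improves the classification performance of new base classifiers. Third, a method for evaluating and updating the importance of base classifiers based on classification accuracy is defined, which updates the ensemble classifier after a drift. Comparative experiments on 7 synthetic data streams and 3 real data streams show that the proposed method achieves better classification performance than 6 concept drift data stream learning methods.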
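The abstract describes a label query strategy driven by prediction certainty and a margin threshold matrix, but this page does not give the formulas. The following is only a minimal sketch of a generic margin-based query rule, not the authors' method: the function name `should_query`, the use of just the top two probabilities, and the constant threshold value 0.2 are all assumptions made for illustration, and the adaptive threshold adjustment is not modeled.

```python
import numpy as np

def should_query(probs, thresholds):
    """Decide whether to request the true label of one sample.

    probs: 1-D sequence of predicted class probabilities.
    thresholds: K x K array (hypothetical layout); thresholds[i, j] is the
        margin threshold used when class i is ranked first and class j second.
    Returns (query, top_class, runner_up_class).
    """
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # class indices, most probable first
    top, second = int(order[0]), int(order[1])
    margin = probs[top] - probs[second]      # small margin means low certainty
    return bool(margin < thresholds[top, second]), top, second

K = 3
thresholds = np.full((K, K), 0.2)            # assumed constant starting value
print(should_query([0.45, 0.40, 0.15], thresholds))  # uncertain prediction
print(should_query([0.90, 0.06, 0.04], thresholds))  # confident prediction
```

Indexing the threshold by the pair of competing classes lets a rule like this be stricter or looser for particular class pairs, which is one plausible way a query strategy could favor minority classes.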
  • Fig. 1  Algorithm framework

    Fig. 2  ROC curves of the seven algorithms

    Fig. 3  Precision (P) curves of the seven algorithms

    Fig. 4  Results of the ablation experiment on DS6

    Fig. 5  Effect of the parameter β on the algorithm

    Fig. 9  Effect of the parameter $ n_d $ on the algorithm

    Fig. 6  Effect of the parameter $ \theta_{0} $ on the algorithm

    Fig. 7  Effect of the parameter n on the algorithm

    Fig. 8  Effect of the parameter α on the algorithm

    Fig. 10  Precision (P) curves on different types of concept drift data streams

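    Table 1  Data stream features

    | No | Data stream | Samples | Features | Classes | Class distribution | Outliers | Drifts |
    |---|---|---|---|---|---|---|---|
    | 1 | DS1 | 400 000 | 25 | 15 | Balanced | 0 | 0 |
    | 2 | DS2 | 400 000 | 25 | 15 | Balanced | 5% | 3 |
    | 3 | DS3 | 400 000 | 25 | 15 | (1/1/1/1/1/1/1/1/1/1/2/2/3/3/5) | 0 | 0 |
    | 4 | DS4 | 400 000 | 25 | 15 | (1/1/1/1/1/1/1/1/1/1/2/2/3/3/5) | 5% | 3 |
    | 5 | DS5 | 400 000 | 25 | 15 | (1/1/1/1/1/1/1/1/1/1/2/2/3/3/5), (2/2/3/3/5/1/1/1/1/1/1/1/1/1/1) | 0 | 0 |
    | 6 | DS6 | 400 000 | 25 | 15 | (1/1/1/1/1/1/1/1/1/1/2/2/3/3/5), (2/2/3/3/5/1/1/1/1/1/1/1/1/1/1) | 5% | 3 |
    | 7 | DS7 | 400 000 | 25 | 50 | Balanced | 5% | 3 |
    | 8 | Kddcup99_10% | 494 000 | 42 | 23 | | | |
    | 9 | Shuttle | 570 000 | 10 | 7 | | | |
    | 10 | PokerHand | 830 000 | 10 | 10 | | | |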

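    Table 2  Concept drift data stream features

    | No | Data stream | Concept drift type | Samples | Features | Classes | Drift width |
    |---|---|---|---|---|---|---|
    | 1 | DS8 | Abrupt | 400 000 | 25 | 15 | 1 |
    | 2 | DS9 | Recurring | 400 000 | 25 | 15 | 1 |
    | 3 | DS10 | Incremental | 400 000 | 25 | 15 | 10000 |
    | 4 | DS11 | Gradual | 400 000 | 25 | 15 | 10000 |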

    Table 3  Precision (P) of the seven algorithms

    | Data stream | LB | BOLE | ARFRE | CALMID | OALM-IDS | ALM-ICDDS-E | ALM-ICDDS |
    |---|---|---|---|---|---|---|---|
    | DS1 | 96.89±0.31 | 96.36±0.11 | 98.07±0.43 | 98.01±0.41 | 98.03±0.25 | 97.18±0.48 | 99.07±0.34 |
    | DS2 | 90.61±0.21 | 88.63±0.54 | 92.77±0.42 | 93.31±0.14 | 93.27±0.49 | 91.97±0.26 | 94.64±0.15 |
    | DS3 | 94.41±0.11 | 96.07±0.23 | 96.74±0.45 | 96.64±0.34 | 96.75±0.56 | 96.46±0.61 | 97.84±0.24 |
    | DS4 | 86.91±0.45 | 85.23±0.52 | 88.30±0.29 | 89.90±0.28 | 90.27±0.42 | 89.70±0.72 | 92.06±0.28 |
    | DS5 | 93.60±0.48 | 94.04±0.52 | 96.30±0.18 | 94.65±0.49 | 95.47±0.32 | 94.24±0.35 | 96.17±0.19 |
    | DS6 | 86.59±0.19 | 84.69±0.48 | 88.02±0.47 | 88.44±0.19 | 88.65±0.25 | 87.41±0.40 | 90.86±0.37 |
    | DS7 | 88.25±0.86 | 87.21±0.79 | 90.16±0.92 | 90.49±0.47 | 90.51±0.53 | 89.32±0.38 | 93.67±0.40 |
    | Kddcup99_10% | 83.85±0.59 | 81.10±0.15 | 85.56±0.54 | 92.12±0.45 | 92.13±0.31 | 91.24±0.51 | 95.80±0.17 |
    | Shuttle | 64.63±0.42 | 63.85±0.27 | 79.07±0.31 | 85.35±0.14 | 85.70±0.32 | 83.48±0.25 | 85.99±0.13 |
    | PokerHand | 51.63±0.39 | 50.36±0.35 | 52.51±0.56 | 53.93±0.28 | 54.57±0.50 | 52.90±0.18 | 55.89±0.51 |

    Table 6  $ Kappa $ values of the seven algorithms

    | Data stream | LB | BOLE | ARFRE | CALMID | OALM-IDS | ALM-ICDDS-E | ALM-ICDDS |
    |---|---|---|---|---|---|---|---|
    | DS1 | 95.09±0.43 | 95.47±0.26 | 97.11±0.33 | 97.84±0.18 | 97.52±0.50 | 96.31±0.53 | 98.72±0.18 |
    | DS2 | 89.66±0.50 | 88.28±0.45 | 91.80±0.17 | 92.55±0.25 | 92.65±0.28 | 91.27±0.29 | 93.56±0.46 |
    | DS3 | 93.08±0.13 | 95.68±0.22 | 95.62±0.53 | 96.50±0.46 | 96.46±0.60 | 96.05±0.36 | 97.69±0.21 |
    | DS4 | 86.97±0.46 | 85.86±0.13 | 88.18±0.25 | 89.94±0.24 | 89.99±0.36 | 88.61±0.46 | 90.19±0.57 |
    | DS5 | 92.32±0.37 | 94.18±0.45 | 95.86±0.28 | 94.40±0.50 | 95.52±0.14 | 94.29±0.20 | 95.81±0.35 |
    | DS6 | 86.59±0.32 | 85.25±0.29 | 87.81±0.54 | 88.90±0.51 | 89.00±0.13 | 87.68±0.47 | 89.80±0.25 |
    | DS7 | 88.28±0.46 | 87.51±0.97 | 89.93±0.71 | 90.01±0.92 | 90.19±0.40 | 89.51±0.59 | 93.67±0.54 |
    | Kddcup99_10% | 80.94±0.22 | 75.68±0.25 | 79.36±0.35 | 83.32±0.24 | 85.83±0.50 | 84.87±0.16 | 86.81±0.33 |
    | Shuttle | 58.73±0.39 | 61.54±0.22 | 73.78±0.20 | 79.39±0.43 | 80.11±0.53 | 80.97±0.24 | 83.56±0.54 |
    | PokerHand | 50.34±0.58 | 49.86±0.40 | 50.36±0.16 | 51.24±0.21 | 51.39±0.16 | 50.55±0.41 | 52.25±0.35 |
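This page does not state how the Kappa values in Table 6 are computed. Assuming they are the standard Cohen's kappa, reported multiplied by 100, the statistic can be sketched as below. The function name `cohen_kappa` and the toy confusion matrix are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    p_observed = np.trace(c) / n                               # observed agreement
    p_expected = (c.sum(axis=0) * c.sum(axis=1)).sum() / n**2  # chance agreement
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy two-class example, invented for illustration only.
toy = [[45, 5],
       [10, 40]]
print(round(100 * cohen_kappa(toy), 2))
```

Unlike plain accuracy, kappa discounts the agreement a classifier would reach by chance, which is why it is commonly reported alongside precision and recall on imbalanced streams.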

    Table 4  Recall (R) of the seven algorithms

    | Data stream | LB | BOLE | ARFRE | CALMID | OALM-IDS | ALM-ICDDS-E | ALM-ICDDS |
    |---|---|---|---|---|---|---|---|
    | DS1 | 94.78±0.13 | 96.04±0.24 | 96.81±0.59 | 97.87±0.24 | 97.92±0.25 | 96.15±0.31 | 98.63±0.17 |
    | DS2 | 88.65±0.25 | 87.86±0.53 | 90.35±0.30 | 91.54±0.54 | 91.84±0.58 | 90.78±0.70 | 92.30±0.24 |
    | DS3 | 92.55±0.45 | 95.92±0.32 | 94.80±0.43 | 96.12±0.14 | 97.92±0.54 | 95.99±0.52 | 98.55±0.29 |
    | DS4 | 87.03±0.49 | 87.08±0.39 | 88.23±0.31 | 90.50±0.30 | 91.07±0.52 | 90.13±0.43 | 91.15±0.11 |
    | DS5 | 91.54±0.11 | 92.33±0.51 | 96.04±0.20 | 93.82±0.55 | 94.94±0.27 | 92.91±0.42 | 96.53±0.42 |
    | DS6 | 86.56±0.50 | 85.48±0.24 | 87.83±0.49 | 89.43±0.18 | 88.85±0.36 | 88.39±0.34 | 90.63±0.21 |
    | DS7 | 87.19±0.42 | 86.12±0.11 | 87.29±0.36 | 88.41±0.50 | 88.77±0.43 | 87.87±0.20 | 91.61±0.78 |
    | Kddcup99_10% | 60.89±0.50 | 63.05±0.50 | 58.26±0.38 | 61.88±0.38 | 63.71±0.54 | 63.42±0.67 | 69.34±0.57 |
    | Shuttle | 61.40±0.21 | 50.84±0.31 | 54.36±0.35 | 59.52±0.41 | 63.12±0.59 | 61.79±0.16 | 64.59±0.29 |
    | PokerHand | 43.57±0.30 | 44.78±0.46 | 55.21±0.60 | 56.84±0.11 | 52.77±0.54 | 55.36±0.25 | 59.57±0.43 |

    Table 5  $ F1 $ values of the seven algorithms

    | Data stream | LB | BOLE | ARFRE | CALMID | OALM-IDS | ALM-ICDDS-E | ALM-ICDDS |
    |---|---|---|---|---|---|---|---|
    | DS1 | 95.82±0.18 | 96.20±0.16 | 97.44±0.50 | 97.94±0.30 | 97.97±0.25 | 96.66±0.37 | 98.85±0.23 |
    | DS2 | 89.62±0.23 | 88.24±0.53 | 91.54±0.35 | 92.42±0.22 | 92.55±0.53 | 91.37±0.43 | 93.46±0.18 |
    | DS3 | 93.47±0.18 | 95.99±0.27 | 95.76±0.44 | 96.38±0.20 | 97.33±0.55 | 96.22±0.57 | 98.19±0.26 |
    | DS4 | 86.97±0.47 | 86.15±0.45 | 88.26±0.30 | 90.20±0.29 | 90.67±0.46 | 89.91±0.59 | 91.60±0.16 |
    | DS5 | 92.55±0.17 | 93.18±0.30 | 96.17±0.19 | 94.23±0.52 | 95.20±0.29 | 93.57±0.38 | 96.35±0.26 |
    | DS6 | 86.57±0.27 | 85.08±0.32 | 87.92±0.48 | 88.93±0.18 | 88.75±0.30 | 87.90±0.35 | 90.74±0.27 |
    | DS7 | 87.72±0.56 | 86.66±0.19 | 88.70±0.52 | 89.44±0.48 | 89.61±0.47 | 88.59±0.29 | 92.63±0.40 |
    | Kddcup99_10% | 70.55±0.54 | 70.94±0.23 | 69.32±0.45 | 74.03±0.22 | 75.33±0.39 | 74.82±0.54 | 80.45±0.49 |
    | Shuttle | 62.97±0.28 | 56.61±0.29 | 64.43±0.33 | 70.13±0.21 | 72.70±0.41 | 71.01±0.20 | 73.77±0.18 |
    | PokerHand | 47.26±0.34 | 47.41±0.40 | 53.83±0.57 | 55.35±0.16 | 56.12±0.52 | 54.10±0.23 | 57.67±0.72 |
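The page does not define F1. Assuming it is the usual harmonic mean of precision and recall, the short check below recomputes F1 from the mean P and R reported for ALM-ICDDS in Tables 3 and 4 and prints it next to the value reported in Table 5. The helper name `f1_from_pr` is hypothetical, and only the three rows listed are checked.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P from Table 3, R from Table 4, F1 from Table 5) for the ALM-ICDDS column.
rows = {
    "DS1": (99.07, 98.63, 98.85),
    "Kddcup99_10%": (95.80, 69.34, 80.45),
    "PokerHand": (55.89, 59.57, 57.67),
}

for name, (p, r, reported) in rows.items():
    # Print the recomputed value beside the reported one for comparison.
    print(f"{name}: recomputed {f1_from_pr(p, r):.2f}, reported {reported:.2f}")
```

On Kddcup99_10% the gap between P and R is large, so the harmonic mean sits much closer to the lower recall value than an arithmetic mean would.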
Publication history
  • Received:  2023-04-24
  • Accepted:  2023-10-12
  • Published online:  2024-01-25