A Label Noise Filtering Method Based on Relative Outlier Factor

Hou Sen-Yu, Jiang Gao-Xia, Wang Wen-Jian

Citation: Hou Sen-Yu, Jiang Gao-Xia, Wang Wen-Jian. A label noise filtering method based on relative outlier factor. Acta Automatica Sinica, 2024, 50(1): 154−168 doi: 10.16383/j.aas.c230117


doi: 10.16383/j.aas.c230117
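Funds: Supported by the National Natural Science Foundation of China (U21A20513, 62276161, 62076154), the Central Government Guides Local Science and Technology Innovation Projects (YDZX20201400001224), and the Fund for the Shanxi Province 1331 Project of Key Subjects Construction
    Author Bio:

    HOU Sen-Yu  Master student at the School of Computer and Information Technology, Shanxi University. His research interests cover machine learning and label noise. E-mail: housenyu@163.com

    JIANG Gao-Xia  Associate professor at the School of Computer and Information Technology, Shanxi University. His research interests cover machine learning and data mining. E-mail: jianggaoxia@sxu.edu.cn

    WANG Wen-Jian  Professor at the School of Computer and Information Technology, Shanxi University. Her research interests cover machine learning and data mining. Corresponding author of this paper. E-mail: wjwang@sxu.edu.cn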


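  • Abstract: Categorical label noise is a common problem in classification tasks in traditional data mining, yet methods designed specifically to detect it are still lacking. Outlier detection techniques can be used to identify and filter noise, but because outliers and categorical label noise do not coincide, outlier detection algorithms cannot accurately detect label noise in classification data sets. To address these problems, this paper proposes a method that builds on outlier detection and is suited to filtering categorical label noise: the label noise ensemble filtering method based on relative outlier factor (EROF), where ROF denotes the relative outlier factor. First, the noise probability of each sample is estimated through its relative outlier factor; then, multiple outlier detection algorithms are combined iteratively to achieve ensemble filtering. Experimental results show that, on most data sets containing label noise, the method maintains excellent noise recognition ability and significantly improves the generalization ability of various classification models.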
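The first stage described in the abstract, scoring each sample by a relative outlier factor, can be illustrated with a minimal sketch. The abstract does not give the ROF formula, so everything below is an assumption made for illustration only: the absolute outlier factor is taken to be the mean distance to the k nearest neighbors in a reference set, and the relative factor is the ratio of the score against same-label (homogeneous) samples to the score against different-label (heterogeneous) samples. The names `knn_outlier_score` and `relative_outlier_factor` are hypothetical and are not the authors' implementation.

```python
import numpy as np

def knn_outlier_score(x, reference, k):
    """Mean Euclidean distance from x to its k nearest points in `reference`.

    Assumed stand-in for an absolute outlier factor; larger means more outlying.
    """
    if len(reference) == 0:
        return np.inf  # no reference points: treat x as maximally outlying
    dists = np.sort(np.linalg.norm(reference - x, axis=1))
    return float(dists[:min(k, len(dists))].mean())

def relative_outlier_factor(X, y, k=3):
    """Illustrative relative factor: homogeneous score / heterogeneous score.

    A value above 1 means the sample lies closer to samples of other classes
    than to samples carrying its own label, which hints at a wrong label.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rof = np.empty(len(X))
    for i in range(len(X)):
        same = (y == y[i])
        same[i] = False            # exclude the sample itself
        other = (y != y[i])
        homo = knn_outlier_score(X[i], X[same], k)
        hetero = knn_outlier_score(X[i], X[other], k)
        rof[i] = homo / (hetero + 1e-12)
    return rof

# Toy data (made up): two well-separated clusters with one flipped label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[5] = 1                           # sample 5 now carries the wrong label
rof = relative_outlier_factor(X, y, k=3)
suspect = int(np.argmax(rof))      # index of the most suspicious sample
```

In this toy setting the flipped sample receives the largest score because its same-label neighbors are all in the far cluster. The ensemble stage of EROF, which combines several outlier detectors iteratively, is not reproduced here.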
  • Fig. 1  Decision boundaries of the classification model under different noise conditions
    Fig. 2  Homogeneous and heterogeneous samples of sample A
    Fig. 3  Homogeneous and heterogeneous absolute outlier factors of noise sample A and real sample G
    Fig. 4  Relative outlier factors of the data
    Fig. 5  Flowchart of the EROF algorithm
    Fig. 6  Comparison of noise filtering by base detectors on Wine
    Fig. 7  Influence of the filtering ratio r on the filtering effect
    Fig. 8  Comparison of noise recognition performance indicators of each algorithm
    Fig. 9  Proportion of times each algorithm achieves the best noise recognition performance indicators
    Fig. 10  Critical difference diagram of the classification accuracy of each classification model
    Fig. 11  Running time comparison
    Fig. 12  Accuracy after filtering by different algorithms
    Fig. 13  Average test accuracy over the last 10 epochs on MNIST

    Table 1  Information of data sets

    No.  Dataset          Samples  Features  Classes
    1    Wine                 178        13        3
    2    Sonar                208        60        2
    3    Seeds                210         7        3
    4    Heart                270        13        2
    5    Votes                435        16        2
    6    Musk                 476       166        2
    7    Wdbc                 569        30        2
    8    Australian           690        14        2
    9    Credit Approval      690        15        2
    10   Vehicle              846        18        4
    11   Fourclass            862         2        2
    12   German              1000        24        2
    13   Svmguide3           1243        21        6
    14   Isolet              1559       617       26
    15   Segment             2310        18        7
    16   Splice              3175        60        2
    17   Satimage            4435        36        6
    18   Banana              5300         2        2
    19   Mushrooms           8124        22        2
    20   Letter             20000        16       26

    Table 2  Classification accuracy with different noise ratios on UCI

    NR is the noise ratio; No. is the data set number from Table 1.

    NR   No.  NoF    MNN    MVF    RD     CRF    vRD    Adp_mCRF  EROF     NR   No.  NoF    MNN    MVF    RD     CRF    vRD    Adp_mCRF  EROF
    10%  1    0.926  0.981  0.963  0.921  0.943  0.963  0.963     0.981    20%  1    0.926  0.981  0.981  0.895  0.888  0.926  0.907     1.000
    10%  2    0.762  0.714  0.730  0.713  0.615  0.746  0.635     0.746    20%  2    0.761  0.762  0.746  0.745  0.669  0.762  0.683     0.762
    10%  3    0.905  0.921  0.937  0.852  0.883  0.889  0.905     0.905    20%  3    0.778  0.937  0.921  0.753  0.852  0.778  0.873     0.937
    10%  4    0.691  0.753  0.728  0.665  0.675  0.691  0.704     0.778    20%  4    0.704  0.741  0.704  0.688  0.695  0.704  0.716     0.753
    10%  5    0.901  0.885  0.901  0.852  0.878  0.885  0.901     0.908    20%  5    0.878  0.794  0.802  0.851  0.836  0.878  0.855     0.924
    10%  6    0.841  0.841  0.848  0.793  0.807  0.833  0.826     0.902    20%  6    0.818  0.765  0.765  0.795  0.792  0.833  0.826     0.841
    10%  7    0.727  0.671  0.713  0.687  0.667  0.720  0.699     0.727    20%  7    0.678  0.622  0.650  0.659  0.693  0.685  0.713     0.715
    10%  8    0.918  0.901  0.912  0.891  0.897  0.936  0.936     0.959    20%  8    0.825  0.830  0.865  0.825  0.780  0.842  0.819     0.930
    10%  9    0.826  0.855  0.831  0.777  0.785  0.816  0.807     0.860    20%  9    0.758  0.787  0.768  0.746  0.720  0.763  0.749     0.855
    10%  10   0.841  0.850  0.870  0.804  0.802  0.841  0.831     0.889    20%  10   0.773  0.792  0.739  0.750  0.751  0.773  0.768     0.845
    10%  11   0.654  0.610  0.646  0.630  0.578  0.650  0.591     0.606    20%  11   0.598  0.563  0.583  0.588  0.536  0.610  0.547     0.587
    10%  12   0.969  1.000  1.000  0.931  0.933  0.965  0.965     1.000    20%  12   0.942  1.000  0.977  0.911  0.916  0.954  0.938     1.000
    10%  13   0.697  0.683  0.687  0.681  0.661  0.697  0.683     0.753    20%  13   0.663  0.650  0.663  0.630  0.570  0.663  0.600     0.720
    10%  14   0.786  0.775  0.786  0.748  0.695  0.772  0.713     0.812    20%  14   0.716  0.681  0.724  0.679  0.658  0.708  0.676     0.794
    10%  15   0.720  0.726  0.733  0.689  0.676  0.716  0.705     0.733    20%  15   0.662  0.736  0.733  0.641  0.627  0.660  0.652     0.737
    10%  16   0.931  0.945  0.945  0.889  0.915  0.931  0.938     0.957    20%  16   0.905  0.928  0.935  0.865  0.867  0.908  0.895     0.958
    10%  17   0.873  0.870  0.879  0.839  0.849  0.872  0.867     0.893    20%  17   0.817  0.823  0.840  0.801  0.781  0.818  0.818     0.897
    10%  18   0.926  0.936  0.928  0.889  0.884  0.927  0.926     0.965    20%  18   0.854  0.881  0.864  0.830  0.820  0.858  0.862     0.958
    10%  19   0.972  0.984  0.978  0.942  0.931  0.970  0.969     1.000    20%  19   0.887  0.920  0.930  0.860  0.839  0.882  0.880     0.998
    10%  20   0.929  0.920  0.941  0.896  0.905  0.921  0.925     0.949    20%  20   0.868  0.855  0.923  0.833  0.829  0.852  0.855     0.940
    30%  1    0.833  0.944  0.907  0.743  0.686  0.759  0.704     0.963    40%  1    0.630  0.926  0.815  0.588  0.686  0.611  0.704     0.981
    30%  2    0.603  0.571  0.587  0.580  0.603  0.603  0.619     0.635    40%  2    0.540  0.508  0.508  0.507  0.585  0.524  0.603     0.508
    30%  3    0.841  0.952  0.921  0.772  0.725  0.794  0.762     0.957    40%  3    0.603  0.794  0.825  0.609  0.590  0.635  0.603     0.921
    30%  4    0.407  0.420  0.395  0.394  0.580  0.407  0.605     0.618    40%  4    0.457  0.481  0.457  0.458  0.423  0.469  0.444     0.580
    30%  5    0.710  0.649  0.672  0.652  0.639  0.672  0.664     0.870    40%  5    0.580  0.550  0.511  0.580  0.649  0.603  0.664     0.718
    30%  6    0.750  0.765  0.795  0.720  0.692  0.750  0.720     0.818    40%  6    0.629  0.674  0.667  0.617  0.591  0.644  0.606     0.727
    30%  7    0.580  0.594  0.594  0.576  0.597  0.601  0.622     0.601    40%  7    0.573  0.497  0.503  0.561  0.550  0.573  0.573     0.524
    30%  8    0.684  0.737  0.731  0.666  0.621  0.684  0.649     0.860    40%  8    0.632  0.655  0.661  0.621  0.639  0.643  0.667     0.766
    30%  9    0.647  0.696  0.696  0.617  0.585  0.643  0.604     0.807    40%  9    0.507  0.531  0.512  0.491  0.490  0.512  0.507     0.565
    30%  10   0.676  0.647  0.657  0.656  0.649  0.676  0.667     0.773    40%  10   0.551  0.556  0.546  0.547  0.565  0.575  0.589     0.700
    30%  11   0.531  0.543  0.524  0.502  0.487  0.520  0.512     0.587    40%  11   0.453  0.476  0.472  0.415  0.459  0.429  0.476     0.555
    30%  12   0.811  0.961  0.927  0.774  0.794  0.815  0.826     0.973    40%  12   0.683  0.776  0.757  0.669  0.631  0.699  0.656     0.764
    30%  13   0.637  0.633  0.647  0.629  0.586  0.643  0.603     0.693    40%  13   0.580  0.620  0.627  0.539  0.551  0.567  0.573     0.577
    30%  14   0.651  0.601  0.627  0.627  0.632  0.641  0.649     0.713    40%  14   0.633  0.576  0.582  0.608  0.598  0.622  0.622     0.641
    30%  15   0.605  0.712  0.722  0.541  0.546  0.562  0.562     0.705    40%  15   0.545  0.739  0.720  0.480  0.465  0.494  0.483     0.703
    30%  16   0.834  0.922  0.896  0.782  0.777  0.815  0.805     0.948    40%  16   0.688  0.885  0.874  0.677  0.646  0.694  0.680     0.929
    30%  17   0.740  0.780  0.778  0.717  0.706  0.744  0.739     0.891    40%  17   0.646  0.713  0.707  0.638  0.634  0.653  0.651     0.884
    30%  18   0.736  0.764  0.747  0.727  0.721  0.744  0.749     0.924    40%  18   0.603  0.618  0.615  0.571  0.593  0.600  0.613     0.701
    30%  19   0.778  0.870  0.857  0.747  0.752  0.774  0.776     0.950    40%  19   0.652  0.766  0.740  0.614  0.626  0.640  0.645     0.781
    30%  20   0.791  0.855  0.896  0.747  0.764  0.782  0.784     0.930    40%  20   0.675  0.852  0.857  0.640  0.642  0.666  0.671     0.922
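The accuracy table above is indexed by noise ratio (NR) from 10% to 40%. This page does not say how the noisy labels were generated. A common choice, assumed here purely for illustration, is symmetric noise: a randomly chosen fraction NR of the samples has its label flipped to a different class picked uniformly at random. The helper `inject_label_noise` is hypothetical, and the labels in the example are synthetic stand-ins with the same size (178 samples, 3 classes) as the Wine data set.

```python
import numpy as np

def inject_label_noise(y, noise_ratio, seed=0):
    """Flip a fraction `noise_ratio` of labels to a different class.

    Assumes symmetric (uniform) noise and at least two classes in `y`.
    Returns the noisy labels and a boolean mask marking the flipped samples.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes = np.unique(y)
    n_flip = int(round(noise_ratio * len(y)))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy = y.copy()
    for i in flip_idx:
        candidates = classes[classes != y[i]]   # every class except the true one
        y_noisy[i] = rng.choice(candidates)
    is_noise = np.zeros(len(y), dtype=bool)
    is_noise[flip_idx] = True
    return y_noisy, is_noise

# Synthetic labels: 178 samples spread over 3 classes.
y_clean = np.arange(178) % 3
y_noisy, is_noise = inject_label_noise(y_clean, noise_ratio=0.20, seed=0)
```

The returned mask gives the ground truth needed to score a noise filter, since it records exactly which labels were corrupted.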

    Table 3  Noise recognition performance on MNIST

    Metric  MNN    MVF    RD     CRF    vRD    Adp_mCRF  EROF
    Acc     0.664  0.732  0.791  0.780  0.801  0.844     0.872
    NfAcc   0.372  0.436  0.489  0.545  0.563  0.651     0.713
    Re      0.582  0.669  0.739  0.761  0.763  0.816     0.851
    Spec    0.993  0.985  0.998  0.858  0.951  0.956     0.957
    Pre     0.997  0.995  0.999  0.964  0.987  0.988     0.991
    F1      0.735  0.797  0.850  0.837  0.851  0.885     0.916
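Five of the indicators in the table above (Acc, Re, Spec, Pre, F1) have standard confusion-matrix definitions, sketched below. Which class counts as positive is not stated on this page; the sketch assumes that noisy samples are the positive class. NfAcc is left out because its definition is not given here. The function name `noise_detection_metrics` and the example counts are made up for illustration.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def noise_detection_metrics(is_noise, flagged):
    """Standard binary metrics, assuming 'noisy' is the positive class.

    is_noise: ground truth, truthy where the label is actually wrong.
    flagged:  filter output, truthy where the filter reports noise.
    """
    pairs = list(zip(is_noise, flagged))
    tp = sum(1 for t, p in pairs if t and p)          # noise correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)      # clean sample flagged
    fn = sum(1 for t, p in pairs if t and not p)      # noise missed
    tn = sum(1 for t, p in pairs if not t and not p)  # clean sample kept
    pre = safe_div(tp, tp + fp)
    re = safe_div(tp, tp + fn)
    return {
        "Acc": safe_div(tp + tn, len(pairs)),
        "Re": re,
        "Spec": safe_div(tn, tn + fp),
        "Pre": pre,
        "F1": safe_div(2 * pre * re, pre + re),
    }

# Ten samples, four of them truly noisy; the filter flags four samples.
is_noise = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
flagged = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = noise_detection_metrics(is_noise, flagged)
```

Here three noisy samples are caught, one is missed, and one clean sample is wrongly flagged, so precision and recall are both 3/4.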
Publication history
  • Received:  2023-03-08
  • Published online:  2023-07-10
  • Issue date:  2024-01-29
