A Sampling Method of Imbalanced Data Based on Sample Space

Zhang Yong-Qing, Lu Rong-Zhao, Qiao Shao-Jie, Han Nan, Gutierrez Louis Alberto, Zhou Ji-Liu

Citation: Zhang Yong-Qing, Lu Rong-Zhao, Qiao Shao-Jie, Han Nan, Gutierrez Louis Alberto, Zhou Ji-Liu. A sampling method of imbalanced data based on sample space. Acta Automatica Sinica, 2022, 48(10): 2549−2563 doi: 10.16383/j.aas.c200034

Funds: Supported by the National Natural Science Foundation of China (61702058, 61772091, 61802035, 61962006), Sichuan Science and Technology Program (2021JDJQ0021, 22ZDYF2680, 2021YZD0009, 2021ZYD0033), Chengdu Technology Innovation and Research and Development Project (2021-YF05-00491-SN), Chengdu Major Science and Technology Innovation Project (2021-YF08-00156-GX), Chengdu “Take the lead” Science and Technology Project (2021-JB00-00025-GX), Key Laboratory of Digital Media Art of Sichuan Province, Sichuan Conservatory of Music (21DMAKL02), and Guangdong Basic and Applied Basic Research Foundation (2020B1515120028)
    Author Bio:

    ZHANG Yong-Qing  Associate professor at the School of Computer Science, Chengdu University of Information Technology. He received his Ph.D. degree from the College of Computer Science, Sichuan University in 2016. His research interest covers artificial intelligence and bioinformatics. E-mail: zhangyq@cuit.edu.cn

    LU Rong-Zhao  Master's student at the School of Computer Science, Chengdu University of Information Technology. His main research interest is machine learning. E-mail: 15928652663@163.com

    QIAO Shao-Jie  Professor at the School of Software Engineering, Chengdu University of Information Technology. He received his Ph.D. degree from Sichuan University in 2009. His research interest covers trajectory prediction, moving objects databases, and machine learning. Corresponding author of this paper. E-mail: sjqiao@cuit.edu.cn

    HAN Nan  Associate professor at the School of Management, Chengdu University of Information Technology. She received her Ph.D. degree from Chengdu University of Traditional Chinese Medicine in 2012. Her research interest covers data mining and artificial intelligence. E-mail: hannan@cuit.edu.cn

    GUTIERREZ Louis Alberto  Professor in the Department of Computer Science, Rensselaer Polytechnic Institute. His main research interest is data mining. E-mail: louisgutierrez2002@gmail.com

    ZHOU Ji-Liu  Professor at the School of Computer Science, Chengdu University of Information Technology. His research interest covers intelligent computing and image processing. E-mail: zhoujl@cuit.edu.cn

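  • Abstract: Imbalanced data, in which the number of minority-class samples is far smaller than the number of majority-class samples, is a pervasive and widely studied problem in machine learning. Traditional methods based on minimizing the error rate have a shortcoming: classification results are biased toward the majority class, which lowers accuracy on the minority class, and such methods often also have high time complexity. To address these problems, a data sampling method based on the spatial distribution of the samples, called pseudo-negative sampling, is proposed. A pseudo-negative sample is a sample that is labeled negative (majority class) but is strongly correlated with the positive (minority class) samples. The algorithm has three key steps: 1) compute the center of the spatial distribution of the positive samples and the average distance from the positive samples to that center; 2) using the same distance measure, compute the distance from each negative sample to the center, compare it with the average distance, and mark every negative sample whose distance is smaller than the average distance as a pseudo-negative sample; 3) remove the pseudo-negative samples from the negative set and add them to the positive set. The advantage of the algorithm is that it does not change the size of the original dataset, so it neither introduces noisy samples nor loses potentially useful information; it improves minority-class precision without lowering overall classification accuracy. Its time complexity is also low. Experiments from several perspectives on 13 datasets show that pseudo-negative sampling achieves high prediction accuracy.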
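The three steps in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pseudo_negative_sampling`, the 1/0 label encoding, and the use of plain Euclidean distance on unscaled features are all choices made here for the example.

```python
import numpy as np


def pseudo_negative_sampling(X, y, pos_label=1, neg_label=0):
    """Relabel negatives that lie closer to the positive center than the
    average positive-to-center distance.

    Illustrative sketch: the name, label encoding, and raw Euclidean
    distance are assumptions, not details taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).copy()
    pos_mask = y == pos_label
    neg_mask = y == neg_label

    # Step 1: center of the positive samples and mean distance to it.
    center = X[pos_mask].mean(axis=0)
    mean_dist = np.linalg.norm(X[pos_mask] - center, axis=1).mean()

    # Step 2: negatives strictly closer than mean_dist are pseudo-negative.
    dist_to_center = np.linalg.norm(X - center, axis=1)
    pseudo_mask = neg_mask & (dist_to_center < mean_dist)

    # Step 3: move pseudo-negatives to the positive set by relabeling them.
    y[pseudo_mask] = pos_label
    return y, pseudo_mask


# Toy data: three positives near the origin, one nearby negative, two far ones.
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0],
                   [0.5, 0.5], [5.0, 5.0], [6.0, 6.0]])
y_demo = np.array([1, 1, 1, 0, 0, 0])
y_new, moved = pseudo_negative_sampling(X_demo, y_demo)
```

Because samples are only relabeled, the total number of samples is unchanged, which matches the property the abstract emphasizes.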
  • Fig.  1  Pseudo-negative sampling method

    Fig.  2  ROC curves of four UCI datasets with the SVM classifier

    Fig.  3  ROC curves of two KEEL datasets with the SVM classifier

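    Table  1  Symbols and their explanations

    | Symbol | Explanation |
    | --- | --- |
    | $D^+, m$ | Positive sample set and the number of positive samples. The set is written as $D^{+}=\left\{\left(x^{+}_{1}, y^{+}_{1}\right),\left(x^{+}_{2}, y^{+}_{2}\right), \cdots,\left(x^{+}_{m}, y_{m}^{+}\right)\right\}$ |
    | $D^-, n$ | Negative sample set and the number of negative samples. The set is written as $D^{-}=\left\{\left(x^{-}_{1}, y^{-}_{1}\right),\left(x^{-}_{2}, y^{-}_{2}\right), \cdots,\left(x^{-}_{n}, y^{-}_{n}\right)\right\}$ |
    | $D^*$ | Pseudo-negative sample set. The set is written as $D^{*}=\left\{\left(x^{*}_{1}, y^{*}_{1}\right),\left(x^{*}_{2}, y^{*}_{2}\right), \cdots,\left(x^{*}_{i}, y^{*}_{i}\right)\right\}$ |
    | $Q(x_{i})$ | Similarity of sample $x_{i}$ |
    | ${dist}(x_1,x_2)$ | Euclidean distance between samples $x_1$ and $x_2$ |
    | $C$ | Center of the positive sample space, i.e., the mean of all positive samples |
    | $meanDist$ | Threshold for judging a negative sample to be pseudo-negative; its value is the average distance from all positive samples to the center $C$ |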
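Reading the definitions of $C$ and $meanDist$ in Table 1 together with the three steps of the abstract, the quantities can be written out as below. This is a reconstruction from those verbal definitions, not a formula quoted from the paper; the strict inequality follows the abstract's wording that the distance must be smaller than the average distance.

```latex
C = \frac{1}{m}\sum_{i=1}^{m} x_i^{+}, \qquad
meanDist = \frac{1}{m}\sum_{i=1}^{m} dist\left(x_i^{+}, C\right), \qquad
D^{*} = \left\{ \left(x_j^{-}, y_j^{-}\right) \in D^{-} : dist\left(x_j^{-}, C\right) < meanDist \right\}
```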

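    Table  2  Information on the imbalanced datasets

    | Source | Dataset | Samples | Features | Imbalance ratio | Feature types (continuous/discrete) |
    | --- | --- | --- | --- | --- | --- |
    | Real data | SPECT | 267 | 44 | 4 | 44/0 |
    | | SNP | 3074 | 25 | 16 | 25/0 |
    | UCI data | Ecoli | 336 | 7 | 8.6 | 7/0 |
    | | SatImage | 6435 | 36 | 9.3 | 0/36 |
    | | Abalone | 4177 | 8 | 9.7 | 6/2 |
    | | Balance | 625 | 4 | 11.7 | 0/4 |
    | | SolarFlare | 1389 | 10 | 19 | 0/10 |
    | | Yeast_ME2 | 1484 | 8 | 28 | 8/0 |
    | | Abalone_19 | 4177 | 8 | 130 | 6/2 |
    | KEEL data | Yeast1289vs7 | 947 | 8 | 30.6 | 8/0 |
    | | Yeast1458vs7 | 693 | 8 | 22.1 | 8/0 |
    | | Yeast4 | 1484 | 8 | 28.1 | 8/0 |
    | | Yeast5 | 1484 | 8 | 32.7 | 8/0 |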

    Table  3  Confusion matrix for classification

    | Confusion matrix | Predicted positive | Predicted negative |
    | --- | --- | --- |
    | Actual positive | $TP$ | $FN$ |
    | Actual negative | $FP$ | $TN$ |
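The metric names used in the result tables ($Sen$, $Spe$, $Acc$, $MCC$, F-score) can all be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions and assumes the F-score is the balanced F1 variant; the page itself does not print the formulas, so treat this as an illustration rather than the paper's exact evaluation code. The function name and the demo counts are hypothetical. $AUC$ is omitted because it needs ranked scores, not a single confusion matrix.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Standard confusion-matrix metrics (textbook definitions).

    Assumes both classes are present and at least one sample is
    predicted positive, so the ratio denominators are non-zero.
    """
    sen = tp / (tp + fn)                    # sensitivity: recall on positives
    spe = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + fp + tn)   # overall accuracy
    precision = tp / (tp + fp)
    f_score = 2 * precision * sen / (precision + sen)  # F1 (assumed variant)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"Sen": sen, "Spe": spe, "Acc": acc, "MCC": mcc, "F-score": f_score}


# Hypothetical counts, chosen only to exercise the function.
demo = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

Reporting $Sen$ and $MCC$ alongside $Acc$ matters on imbalanced data, since a classifier that always predicts the majority class can still reach a high $Acc$.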

    Table  4  Results of pseudo-negative sampling on classifiers including SVM, LR, DT and RF

    | Dataset | Classifier | $Sen$ | $Spe$ | $Acc$ | $MCC$ | F-score | $AUC$ |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Balance | SVM | 0.810 | 0.967 | 0.911 | 0.804 | 0.860 | 0.967 |
    | | LR | 0.638 | 0.872 | 0.789 | 0.525 | 0.670 | 0.868 |
    | | DT | 0.885 | 0.950 | 0.928 | 0.836 | 0.889 | 0.920 |
    | | RF | 0.887 | 0.956 | 0.932 | 0.849 | 0.899 | 0.972 |
    | Ecoli | SVM | 0.826 | 0.975 | 0.952 | 0.806 | 0.828 | 0.982 |
    | | LR | 0.746 | 0.975 | 0.941 | 0.755 | 0.781 | 0.962 |
    | | DT | 0.741 | 0.961 | 0.932 | 0.704 | 0.734 | 0.865 |
    | | RF | 0.733 | 0.975 | 0.938 | 0.734 | 0.756 | 0.963 |
    | SatImage | SVM | 0.924 | 0.917 | 0.919 | 0.830 | 0.892 | 0.980 |
    | | LR | 0.823 | 0.827 | 0.825 | 0.636 | 0.772 | 0.913 |
    | | DT | 0.847 | 0.908 | 0.886 | 0.754 | 0.842 | 0.877 |
    | | RF | 0.901 | 0.950 | 0.933 | 0.854 | 0.906 | 0.984 |
    | Abalone | SVM | 0.906 | 0.994 | 0.965 | 0.922 | 0.945 | 0.966 |
    | | LR | 0.903 | 0.978 | 0.954 | 0.895 | 0.928 | 0.973 |
    | | DT | 0.914 | 0.949 | 0.937 | 0.860 | 0.906 | 0.932 |
    | | RF | 0.904 | 0.991 | 0.962 | 0.916 | 0.941 | 0.981 |
    | SolarFlare | SVM | 0.917 | 0.976 | 0.954 | 0.901 | 0.936 | 0.984 |
    | | LR | 0.934 | 0.962 | 0.951 | 0.896 | 0.934 | 0.973 |
    | | DT | 0.922 | 0.956 | 0.943 | 0.880 | 0.924 | 0.940 |
    | | RF | 0.942 | 0.957 | 0.951 | 0.897 | 0.935 | 0.987 |
    | Yeast_ME2 | SVM | 0.757 | 0.982 | 0.946 | 0.791 | 0.818 | 0.976 |
    | | LR | 0.573 | 0.966 | 0.902 | 0.608 | 0.653 | 0.947 |
    | | DT | 0.735 | 0.946 | 0.911 | 0.675 | 0.724 | 0.843 |
    | | RF | 0.723 | 0.976 | 0.935 | 0.749 | 0.782 | 0.968 |
    | Abalone_19 | SVM | 0.969 | 0.989 | 0.982 | 0.962 | 0.975 | 0.996 |
    | | LR | 0.971 | 0.984 | 0.979 | 0.956 | 0.971 | 0.997 |
    | | DT | 0.976 | 0.982 | 0.980 | 0.957 | 0.972 | 0.979 |
    | | RF | 0.977 | 0.992 | 0.987 | 0.972 | 0.982 | 0.997 |
    | SPECT | SVM | 0.767 | 0.907 | 0.862 | 0.682 | 0.774 | 0.941 |
    | | LR | 0.732 | 0.862 | 0.816 | 0.586 | 0.707 | 0.909 |
    | | DT | 0.627 | 0.817 | 0.753 | 0.440 | 0.608 | 0.732 |
    | | RF | 0.674 | 0.931 | 0.846 | 0.637 | 0.725 | 0.929 |
    | SNP | SVM | 0.677 | 0.980 | 0.850 | 0.709 | 0.795 | 0.966 |
    | | LR | 0.692 | 0.961 | 0.845 | 0.693 | 0.793 | 0.902 |
    | | DT | 0.892 | 0.911 | 0.903 | 0.803 | 0.888 | 0.902 |
    | | RF | 0.900 | 0.958 | 0.933 | 0.864 | 0.920 | 0.971 |

    Table  5  Comparison of pseudo-negative sampling (PNS) with ROS, RUS, SMOTE, and ADASYN

    | Dataset | Metric | SVM PNS | SVM ROS | SVM RUS | SVM SMOTE | SVM ADASYN | LR PNS | LR ROS | LR RUS | LR SMOTE | LR ADASYN |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | SPECT | Sen | 0.767 | 0.746 | 0.594 | 0.381 | 0.438 | 0.732 | 0.685 | 0.605 | 0.643 | 0.604 |
    | | Spe | 0.907 | 0.856 | 0.860 | 0.985 | 0.970 | 0.862 | 0.846 | 0.828 | 0.838 | 0.843 |
    | | Acc | 0.862 | 0.817 | 0.760 | 0.794 | 0.789 | 0.816 | 0.793 | 0.748 | 0.768 | 0.751 |
    | | MCC | 0.682 | 0.590 | 0.461 | 0.509 | 0.531 | 0.586 | 0.527 | 0.432 | 0.507 | 0.485 |
    | | F-score | 0.774 | 0.715 | 0.585 | 0.535 | 0.575 | 0.707 | 0.667 | 0.594 | 0.622 | 0.611 |
    | | AUC | 0.941 | 0.912 | 0.861 | 0.857 | 0.867 | 0.909 | 0.889 | 0.848 | 0.849 | 0.824 |
    | SNP | Sen | 0.677 | 0.842 | 0.489 | 0.879 | 0.879 | 0.692 | 0.614 | 0.605 | 0.653 | 0.637 |
    | | Spe | 0.980 | 0.908 | 0.869 | 0.904 | 0.897 | 0.961 | 0.847 | 0.801 | 0.852 | 0.852 |
    | | Acc | 0.850 | 0.880 | 0.705 | 0.893 | 0.889 | 0.845 | 0.747 | 0.713 | 0.766 | 0.760 |
    | | MCC | 0.709 | 0.754 | 0.394 | 0.782 | 0.775 | 0.693 | 0.479 | 0.416 | 0.520 | 0.505 |
    | | F-score | 0.795 | 0.857 | 0.585 | 0.876 | 0.871 | 0.793 | 0.676 | 0.643 | 0.706 | 0.693 |
    | | AUC | 0.966 | 0.935 | 0.761 | 0.949 | 0.947 | 0.902 | 0.809 | 0.765 | 0.839 | 0.832 |
    | Ecoli | Sen | 0.826 | 0.715 | 0.644 | 0.720 | 0.661 | 0.746 | 0.644 | 0.616 | 0.610 | 0.573 |
    | | Spe | 0.975 | 0.962 | 0.964 | 0.963 | 0.956 | 0.975 | 0.958 | 0.954 | 0.962 | 0.956 |
    | | Acc | 0.952 | 0.925 | 0.916 | 0.925 | 0.908 | 0.941 | 0.908 | 0.902 | 0.908 | 0.900 |
    | | MCC | 0.806 | 0.693 | 0.633 | 0.692 | 0.623 | 0.755 | 0.618 | 0.598 | 0.612 | 0.570 |
    | | F-score | 0.828 | 0.728 | 0.665 | 0.727 | 0.664 | 0.781 | 0.655 | 0.634 | 0.647 | 0.616 |
    | | AUC | 0.982 | 0.958 | 0.949 | 0.957 | 0.951 | 0.962 | 0.936 | 0.923 | 0.935 | 0.930 |
    | SatImage | Sen | 0.924 | 0.892 | 0.847 | 0.915 | 0.933 | 0.823 | 0.580 | 0.540 | 0.595 | 0.553 |
    | | Spe | 0.917 | 0.904 | 0.898 | 0.907 | 0.871 | 0.827 | 0.763 | 0.747 | 0.766 | 0.757 |
    | | Acc | 0.919 | 0.899 | 0.879 | 0.910 | 0.893 | 0.825 | 0.697 | 0.671 | 0.704 | 0.683 |
    | | MCC | 0.830 | 0.786 | 0.741 | 0.810 | 0.784 | 0.636 | 0.344 | 0.288 | 0.361 | 0.312 |
    | | F-score | 0.892 | 0.865 | 0.835 | 0.880 | 0.864 | 0.772 | 0.580 | 0.539 | 0.591 | 0.557 |
    | | AUC | 0.980 | 0.960 | 0.946 | 0.966 | 0.953 | 0.913 | 0.778 | 0.756 | 0.786 | 0.768 |
    | Abalone | Sen | 0.906 | 0.721 | 0.651 | 0.740 | 0.703 | 0.903 | 0.726 | 0.710 | 0.735 | 0.697 |
    | | Spe | 0.994 | 0.835 | 0.839 | 0.830 | 0.822 | 0.978 | 0.805 | 0.802 | 0.804 | 0.804 |
    | | Acc | 0.965 | 0.797 | 0.776 | 0.800 | 0.783 | 0.954 | 0.779 | 0.769 | 0.781 | 0.769 |
    | | MCC | 0.922 | 0.549 | 0.493 | 0.559 | 0.515 | 0.895 | 0.518 | 0.499 | 0.525 | 0.489 |
    | | F-score | 0.945 | 0.701 | 0.655 | 0.709 | 0.676 | 0.928 | 0.684 | 0.669 | 0.689 | 0.660 |
    | | AUC | 0.966 | 0.868 | 0.840 | 0.876 | 0.861 | 0.973 | 0.850 | 0.842 | 0.850 | 0.836 |
    | Balance | Sen | 0.810 | 0.937 | 0.619 | 0.517 | 0.510 | 0.638 | 0.605 | 0.597 | 0.693 | 0.518 |
    | | Spe | 0.967 | 0.775 | 0.776 | 0.943 | 0.940 | 0.872 | 0.812 | 0.778 | 0.851 | 0.962 |
    | | Acc | 0.911 | 0.827 | 0.705 | 0.798 | 0.791 | 0.789 | 0.740 | 0.704 | 0.795 | 0.811 |
    | | MCC | 0.804 | 0.674 | 0.385 | 0.558 | 0.554 | 0.525 | 0.418 | 0.364 | 0.549 | 0.584 |
    | | F-score | 0.860 | 0.783 | 0.564 | 0.624 | 0.627 | 0.670 | 0.608 | 0.565 | 0.694 | 0.646 |
    | | AUC | 0.967 | 0.902 | 0.834 | 0.884 | 0.826 | 0.868 | 0.831 | 0.833 | 0.902 | 0.872 |
    | SolarFlare | Sen | 0.917 | 0.821 | 0.528 | 0.882 | 0.883 | 0.934 | 0.599 | 0.602 | 0.866 | 0.860 |
    | | Spe | 0.976 | 0.888 | 0.866 | 0.979 | 0.973 | 0.962 | 0.853 | 0.824 | 0.988 | 0.985 |
    | | Acc | 0.954 | 0.862 | 0.734 | 0.943 | 0.940 | 0.951 | 0.758 | 0.734 | 0.942 | 0.939 |
    | | MCC | 0.901 | 0.707 | 0.418 | 0.878 | 0.871 | 0.896 | 0.470 | 0.433 | 0.878 | 0.870 |
    | | F-score | 0.936 | 0.815 | 0.583 | 0.919 | 0.915 | 0.934 | 0.647 | 0.620 | 0.917 | 0.912 |
    | | AUC | 0.984 | 0.912 | 0.802 | 0.969 | 0.968 | 0.973 | 0.837 | 0.790 | 0.970 | 0.968 |
    | Yeast_ME2 | Sen | 0.757 | 0.708 | 0.482 | 0.721 | 0.688 | 0.573 | 0.548 | 0.538 | 0.633 | 0.575 |
    | | Spe | 0.982 | 0.965 | 0.970 | 0.967 | 0.966 | 0.967 | 0.958 | 0.959 | 0.960 | 0.960 |
    | | Acc | 0.946 | 0.923 | 0.889 | 0.927 | 0.920 | 0.902 | 0.892 | 0.884 | 0.906 | 0.896 |
    | | MCC | 0.791 | 0.706 | 0.545 | 0.720 | 0.695 | 0.608 | 0.566 | 0.545 | 0.634 | 0.593 |
    | | F-score | 0.818 | 0.747 | 0.575 | 0.759 | 0.738 | 0.653 | 0.618 | 0.584 | 0.683 | 0.643 |
    | | AUC | 0.976 | 0.955 | 0.882 | 0.961 | 0.955 | 0.947 | 0.901 | 0.891 | 0.910 | 0.901 |
    | Abalone_19 | Sen | 0.969 | 0.885 | 0.315 | 0.947 | 0.948 | 0.971 | 0.636 | 0.538 | 0.725 | 0.725 |
    | | Spe | 0.989 | 0.872 | 0.830 | 0.877 | 0.875 | 0.984 | 0.863 | 0.829 | 0.865 | 0.867 |
    | | Acc | 0.982 | 0.877 | 0.613 | 0.902 | 0.902 | 0.979 | 0.780 | 0.698 | 0.814 | 0.815 |
    | | MCC | 0.962 | 0.743 | 0.138 | 0.803 | 0.802 | 0.956 | 0.516 | 0.380 | 0.595 | 0.598 |
    | | F-score | 0.975 | 0.839 | 0.299 | 0.876 | 0.875 | 0.971 | 0.677 | 0.539 | 0.739 | 0.740 |
    | | AUC | 0.996 | 0.947 | 0.715 | 0.956 | 0.956 | 0.997 | 0.877 | 0.815 | 0.891 | 0.893 |

    Table  6  Comparison of sampling methods on highly imbalanced data

    | Dataset | Metric | SVM PNS | SVM ROS | SVM RUS | SVM SMOTE | SVM ADASYN | LR PNS | LR ROS | LR RUS | LR SMOTE | LR ADASYN |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Yeast1289vs7 | Sen | 0.892 | 0.752 | 0.533 | 0.845 | 0.843 | 0.775 | 0.691 | 0.558 | 0.726 | 0.719 |
    | | Spe | 0.952 | 0.919 | 0.833 | 0.860 | 0.844 | 0.850 | 0.824 | 0.786 | 0.815 | 0.809 |
    | | Acc | 0.925 | 0.849 | 0.695 | 0.853 | 0.843 | 0.817 | 0.768 | 0.668 | 0.777 | 0.771 |
    | | MCC | 0.848 | 0.690 | 0.392 | 0.701 | 0.682 | 0.627 | 0.521 | 0.355 | 0.542 | 0.529 |
    | | F-score | 0.909 | 0.806 | 0.582 | 0.827 | 0.817 | 0.780 | 0.712 | 0.570 | 0.731 | 0.723 |
    | | AUC | 0.980 | 0.935 | 0.793 | 0.930 | 0.926 | 0.902 | 0.837 | 0.793 | 0.848 | 0.844 |
    | Yeast1458vs7 | Sen | 0.855 | 0.681 | 0.356 | 0.713 | 0.737 | 0.590 | 0.503 | 0.415 | 0.570 | 0.592 |
    | | Spe | 0.934 | 0.899 | 0.879 | 0.877 | 0.870 | 0.835 | 0.843 | 0.829 | 0.823 | 0.820 |
    | | Acc | 0.904 | 0.820 | 0.684 | 0.817 | 0.821 | 0.745 | 0.719 | 0.660 | 0.731 | 0.735 |
    | | MCC | 0.794 | 0.602 | 0.283 | 0.599 | 0.612 | 0.437 | 0.369 | 0.265 | 0.406 | 0.421 |
    | | F-score | 0.866 | 0.730 | 0.431 | 0.736 | 0.748 | 0.623 | 0.562 | 0.445 | 0.602 | 0.617 |
    | | AUC | 0.965 | 0.904 | 0.720 | 0.897 | 0.899 | 0.822 | 0.769 | 0.744 | 0.792 | 0.794 |
    | Yeast4 | Sen | 0.770 | 0.687 | 0.543 | 0.733 | 0.703 | 0.574 | 0.572 | 0.558 | 0.603 | 0.566 |
    | | Spe | 0.982 | 0.969 | 0.965 | 0.970 | 0.966 | 0.968 | 0.958 | 0.955 | 0.959 | 0.960 |
    | | Acc | 0.947 | 0.923 | 0.892 | 0.930 | 0.923 | 0.904 | 0.895 | 0.886 | 0.902 | 0.895 |
    | | MCC | 0.798 | 0.701 | 0.571 | 0.734 | 0.706 | 0.613 | 0.582 | 0.559 | 0.611 | 0.584 |
    | | F-score | 0.824 | 0.741 | 0.609 | 0.770 | 0.747 | 0.662 | 0.634 | 0.605 | 0.656 | 0.635 |
    | | AUC | 0.976 | 0.954 | 0.908 | 0.961 | 0.957 | 0.946 | 0.902 | 0.881 | 0.906 | 0.903 |
    | Yeast5 | Sen | 0.704 | 0.706 | 0.596 | 0.745 | 0.721 | 0.622 | 0.576 | 0.559 | 0.590 | 0.546 |
    | | Spe | 0.995 | 0.989 | 0.990 | 0.991 | 0.990 | 0.987 | 0.987 | 0.988 | 0.987 | 0.988 |
    | | Acc | 0.980 | 0.975 | 0.970 | 0.979 | 0.976 | 0.969 | 0.966 | 0.966 | 0.967 | 0.967 |
    | | MCC | 0.770 | 0.714 | 0.644 | 0.759 | 0.728 | 0.642 | 0.605 | 0.590 | 0.614 | 0.588 |
    | | F-score | 0.772 | 0.720 | 0.641 | 0.765 | 0.734 | 0.647 | 0.609 | 0.587 | 0.620 | 0.593 |
    | | AUC | 0.994 | 0.990 | 0.986 | 0.991 | 0.992 | 0.988 | 0.988 | 0.988 | 0.988 | 0.988 |

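    Table  7  Runtime comparison of different sampling methods

    | Dataset | Classifier | RUS | PNS | SMOTE | ROS | ADASYN |
    | --- | --- | --- | --- | --- | --- | --- |
    | SPECT | SVM | 0.39 | 0.53 | 0.67 | 0.66 | 0.71 |
    | | LR | 0.56 | 0.69 | 0.80 | 0.75 | 0.81 |
    | | DT | 0.26 | 0.31 | 0.35 | 0.32 | 0.34 |
    | | RF | 1.70 | 1.77 | 1.91 | 1.84 | 1.98 |
    | SNP | SVM | 1.30 | 27.92 | 80.22 | 92.04 | 80.74 |
    | | LR | 0.70 | 1.41 | 2.16 | 2.09 | 2.26 |
    | | DT | 0.55 | 1.29 | 2.51 | 1.55 | 2.61 |
    | | RF | 2.32 | 7.32 | 13.76 | 9.45 | 13.91 |
    | Ecoli | SVM | 0.31 | 0.31 | 0.36 | 0.34 | 0.39 |
    | | LR | 0.39 | 0.43 | 0.44 | 0.44 | 0.44 |
    | | DT | 0.23 | 0.23 | 0.23 | 0.23 | 0.24 |
    | | RF | 1.54 | 1.58 | 1.56 | 1.56 | 1.58 |
    | SatImage | SVM | 7.59 | 75.68 | 189.22 | 201.02 | 238.91 |
    | | LR | 3.00 | 6.60 | 5.94 | 5.05 | 6.64 |
    | | DT | 1.02 | 2.75 | 4.03 | 3.47 | 4.86 |
    | | RF | 4.43 | 13.48 | 18.02 | 16.36 | 19.92 |
    | Abalone | SVM | 3.08 | 14.78 | 62.42 | 64.35 | 65.56 |
    | | LR | 1.02 | 3.58 | 4.74 | 4.67 | 4.81 |
    | | DT | 0.52 | 0.74 | 1.31 | 1.03 | 1.37 |
    | | RF | 2.86 | 4.75 | 9.61 | 7.73 | 9.48 |
    | Balance | SVM | 0.28 | 0.73 | 1.32 | 1.58 | 1.29 |
    | | LR | 0.25 | 0.35 | 0.68 | 0.38 | 0.68 |
    | | DT | 0.22 | 0.24 | 0.27 | 0.24 | 0.27 |
    | | RF | 1.49 | 1.67 | 1.74 | 1.73 | 1.76 |
    | SolarFlare | SVM | 0.44 | 3.46 | 9.25 | 12.31 | 9.30 |
    | | LR | 0.40 | 2.00 | 3.17 | 2.96 | 3.17 |
    | | DT | 0.29 | 0.36 | 0.46 | 0.43 | 0.50 |
    | | RF | 1.61 | 2.14 | 2.59 | 2.57 | 2.66 |
    | Yeast_ME2 | SVM | 0.44 | 1.84 | 2.95 | 3.189 | 3.161 |
    | | LR | 0.44 | 0.74 | 0.86 | 0.871 | 0.933 |
    | | DT | 0.29 | 0.36 | 0.38 | 0.361 | 0.436 |
    | | RF | 1.65 | 2.24 | 2.45 | 2.269 | 2.452 |
    | Abalone_19 | SVM | 0.44 | 6.81 | 66.16 | 75.09 | 66.20 |
    | | LR | 0.46 | 3.54 | 7.06 | 4.71 | 4.86 |
    | | DT | 0.39 | 0.71 | 1.49 | 0.86 | 1.47 |
    | | RF | 1.65 | 4.45 | 10.48 | 5.64 | 10.18 |
    | Total | | 44.69 | 197.95 | 511.77 | 530.30 | 567.05 |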
Publication history
  • Received:  2020-01-16
  • Accepted:  2020-05-03
  • Published online:  2022-09-19
  • Issue date:  2022-10-14
