A Self-training-based Label Noise Correction Algorithm for Crowdsourcing

Yang Yi, Jiang Liang-Xiao, Li Chao-Qun

Citation: Yang Yi, Jiang Liang-Xiao, Li Chao-Qun. A self-training-based label noise correction algorithm for crowdsourcing. Acta Automatica Sinica, 2023, 49(4): 830−844 doi: 10.16383/j.aas.c210051

doi: 10.16383/j.aas.c210051 cstr: 32138.14.j.aas.c210051
Funds: Supported by the Joint Funds of the National Natural Science Foundation of China (U1711267) and the Fundamental Research Funds for the Central Universities (CUGGC03)
More Information
    Author Bio:

    YANG Yi  Master's student at the School of Computer Science, China University of Geosciences (Wuhan). He received his bachelor's degree from the School of Computer Science, China University of Geosciences (Wuhan) in 2018. His research interests cover machine learning and data mining. E-mail: yangyi@cug.edu.cn

    JIANG Liang-Xiao  Professor at the School of Computer Science, China University of Geosciences (Wuhan). He received his Ph.D. degree in earth prospecting and information technology from China University of Geosciences (Wuhan) in 2009. His research interests cover machine learning and data mining. Corresponding author of this paper. E-mail: ljiang@cug.edu.cn

    LI Chao-Qun  Associate professor at the School of Mathematics and Physics, China University of Geosciences (Wuhan). She received her Ph.D. degree in earth prospecting and information technology from China University of Geosciences (Wuhan) in 2012. Her research interests cover machine learning and data mining. E-mail: chqli@cug.edu.cn

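  • Abstract: Crowdsourced labels often remain noisy even after label integration. To address this problem, a self-training-based label noise correction algorithm (STLNC) is proposed. STLNC has three stages. In the first stage, a filter splits the crowdsourced dataset with integrated labels into a noise set and a clean set. In the second stage, a weighted density peak clustering algorithm builds a spatial structure over the dataset in which each low-density instance points to a high-density instance. In the third stage, a noisy-instance selection strategy is first designed from this spatial structure; an ensemble classifier trained on the clean set then corrects the selected noisy instances according to the designed correction strategy, the corrected instances are added to the clean set, and the ensemble classifier is retrained. Selection and correction are repeated until every instance in the noise set has been corrected, after which the ensemble classifier from the final round corrects all instances. Experiments on simulated benchmark datasets and real-world crowdsourced datasets show that STLNC outperforms five other state-of-the-art noise correction algorithms on two metrics, noise ratio and model quality.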
    1)  http://www.mturk.com
    2)  http://www.crowdflower.com
    3)  http://www.clickworker.com
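
    The second and third stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes plain (unweighted) density peaks instead of the weighted variant, a toy nearest-centroid classifier in place of the ensemble classifier, and a hypothetical selection rule: a noisy instance is corrected once the higher-density instance it points to is already in the clean set. All function names and the `cutoff` parameter are assumptions made for illustration.

```python
import math


def density_peak_parents(points, cutoff):
    """For each point, return the index of its nearest higher-density point.

    Plain, unweighted density peaks (an assumption; the paper uses a
    weighted variant). The densest point has no parent and gets None.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    # Break density ties by index so that exactly one point has no parent.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    parents = [None] * n
    for pos, i in enumerate(order):
        higher = order[:pos]
        if higher:
            parents[i] = min(higher, key=lambda j: dist[i][j])
    return parents


def fit_centroids(points, labels, idx):
    """Toy stand-in for the ensemble classifier: one centroid per class."""
    groups = {}
    for i in idx:
        groups.setdefault(labels[i], []).append(points[i])
    return {c: [sum(col) / len(col) for col in zip(*pts)]
            for c, pts in groups.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: math.dist(centroids[c], x))


def self_training_correct(points, labels, noisy, parents):
    """Iteratively correct noisy labels, then relabel every instance."""
    labels = list(labels)
    remaining = set(noisy)
    clean = set(range(len(points))) - remaining
    if not clean:
        raise ValueError("the clean set must not be empty")
    while remaining:
        model = fit_centroids(points, labels, clean)
        # Hypothetical selection rule: the parent is already clean (or absent).
        batch = {i for i in remaining
                 if parents[i] is None or parents[i] in clean}
        for i in batch:
            labels[i] = predict(model, points[i])
        clean |= batch
        remaining -= batch
    # Final pass: the last-round classifier relabels all instances.
    model = fit_centroids(points, labels, clean)
    return [predict(model, p) for p in points]
```

    Because every parent has strictly higher density rank than its child, at least one remaining noisy instance always has a clean parent, so the loop in this sketch terminates. A caller would pass the indices flagged by the first-stage filter as `noisy`; how that filter works is outside this sketch.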
  • Fig.  1  Framework of STLNC

    Fig.  2  Noise ratio of STLNC with different T values on the ionosphere dataset

    Fig.  3  Noise ratio comparisons on the Leaves datasets

    Fig.  4  Model quality comparisons on the Leaves datasets

    Fig.  5  Noise ratio comparisons on the LabelMe datasets

    Fig.  6  Model quality comparisons on the LabelMe datasets

    Fig.  7  Experimental results of STLNC with different filters on the LabelMe4 dataset

    Fig.  8  Results of the STLNC ablation experiment on the LabelMe4 dataset

    Table  1  Description of 22 simulated benchmark datasets

    Dataset                 #Ins   #Att   #Pos   #Neg
    biodeg                  1055     41    356    699
    breast-cancer            286      9     85    201
    breast-w                 699     10    241    458
    credit-a                 690     16    383    307
    credit-g                1000     21    300    700
    diabetes                 768      8    268    500
    heart-statlog            270     14    120    150
    hepatitis                155     20    123     32
    horse-colic              368     22    232    136
    ionosphere               351     35    225    126
    kr-vs-kp                3196     37   1527   1669
    labor                     57     16     37     20
    mushroom                8124     23   3916   4208
    sick                    3772     30    231   3541
    sonar                    208     61    111     97
    spambase                4601     57   1813   2788
    tic-tac-toe              958     10    332    626
    vote                     435     17    168    267
    climate                  540     20    494     46
    colic                    368     22    136    232
    monks                    432      6    228    204
    steel-plates-faults     1941     33    673   1268

    Table  2  Noise ratio comparisons with worker quality p_j = 0.6 (%)

    Dataset                    MV      PL     STC      CC    AVNC    CENC   STLNC
    biodeg                  28.25   29.95   28.34   19.53   18.48   21.90   15.83
    breast-cancer           27.62   26.92   25.87   31.12   26.57   29.37   24.84
    breast-w                28.76    9.01   19.31   10.30    9.16    8.44    7.30
    credit-a                26.67   20.00   15.94   18.84   13.04   13.33   12.90
    credit-g                26.60   27.40   28.40   26.60   25.30   27.50   26.40
    diabetes                26.69   32.29   26.56   26.95   23.70   23.96   22.79
    heart-statlog           25.19   19.26   23.70   22.96   24.07   25.93   18.52
    hepatitis               30.32   19.35   26.45   20.65   27.74   25.16   30.97
    horse-colic             27.72   32.34   17.39   21.20   17.66   14.13   14.13
    ionosphere              27.92   16.24   21.65    9.12   10.83   13.39   11.68
    kr-vs-kp                27.38   21.96   10.45   19.34    2.19    2.85    2.28
    labor                   31.58   24.56   24.56   15.79   12.28   31.58    7.02
    mushroom                26.71   12.65    6.43    4.30    0.04    0.10       0
    sick                    27.60    2.60    8.83   10.31    1.78    2.28    3.37
    sonar                   26.92   31.73   29.33   24.04   25.00   24.52   18.75
    spambase                27.02   27.47   19.50   14.78    9.11   10.56    8.06
    tic-tac-toe             26.20   34.13   23.07   24.43   22.44   22.23   14.61
    vote                    25.98    4.60   10.34   11.26    3.91    4.14    4.14
    climate                 27.41    8.52   27.41   14.07    8.52    8.52    8.52
    colic                   27.45   22.28   20.92   23.10   14.13   14.95   13.59
    monks                   26.39   25.00   11.34   21.76    5.32    6.71    2.78
    steel-plates-faults     27.51   34.98    9.22   18.70       0    0.10    0.15
    Average                 27.45   21.97   19.77   18.60   13.69   15.08   12.21
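
    The MV column and the noise ratio metric in Table 2 can be made concrete with a small simulation. The sketch below assumes binary labels, workers who each give the true label independently with probability equal to the worker quality, and noise ratio defined as the percentage of instances whose integrated label differs from the true label. The function names and the number of workers per instance are assumptions; the page does not state how many labels each instance receives.

```python
import random


def simulate_crowd_labels(true_labels, n_workers, quality, rng):
    """Each simulated worker returns the true binary label with probability `quality`."""
    return [[y if rng.random() < quality else 1 - y for _ in range(n_workers)]
            for y in true_labels]


def majority_vote(label_sets):
    """Integrate each instance's labels by majority voting (ties go to class 1)."""
    return [1 if 2 * sum(ls) >= len(ls) else 0 for ls in label_sets]


def noise_ratio(integrated, true_labels):
    """Percentage of instances whose integrated label differs from the true label."""
    wrong = sum(1 for a, b in zip(integrated, true_labels) if a != b)
    return 100.0 * wrong / len(true_labels)


# Example run: 1000 instances, 9 hypothetical workers of quality 0.6.
rng = random.Random(0)
truth = [i % 2 for i in range(1000)]
crowd = simulate_crowd_labels(truth, n_workers=9, quality=0.6, rng=rng)
mv_noise = noise_ratio(majority_vote(crowd), truth)
```

    With an odd number of workers no ties occur, so the tie rule only matters for even `n_workers`. A noise correction algorithm would take the output of `majority_vote` as its starting point and be scored with `noise_ratio` after correction.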

    Table  3  Model quality comparisons with worker quality p_j = 0.6 (%)

    Dataset                    MV      PL     STC      CC    AVNC    CENC   STLNC
    biodeg                  71.37   75.91   72.13   78.77   74.34   74.21   78.29
    breast-cancer           67.00   71.28   69.98   69.08   70.92   68.13   69.73
    breast-w                92.85   92.47   90.68   93.38   94.00   92.85   95.54
    credit-a                82.03   84.78   84.49   84.78   83.91   83.04   83.48
    credit-g                62.20   68.30   67.40   69.70   67.30   63.90   70.40
    diabetes                71.74   70.56   71.50   70.80   71.72   72.12   74.00
    heart-statlog           65.56   74.81   67.78   70.00   74.07   69.26   78.15
    hepatitis               69.00   77.67   71.33   77.50   72.00   70.17   74.17
    horse-colic             77.42   78.11   80.00   80.15   81.62   81.36   82.35
    ionosphere              83.48   85.45   86.90   85.19   85.77   84.04   85.76
    kr-vs-kp                95.18   95.49   96.62   90.29   96.81   96.59   97.03
    labor                   70.67   61.83   73.50   68.33   68.33   72.33   76.33
    mushroom                99.85   98.56   99.88   99.90   99.90   99.86   99.83
    sick                    96.74   94.62   96.98   94.48   97.77   97.75   97.11
    sonar                   58.93   55.57   50.07   58.29   55.00   59.14   58.29
    spambase                85.92   88.87   87.44   84.20   89.68   88.98   90.39
    tic-tac-toe             77.17   69.74   74.54   71.63   74.63   74.63   78.36
    vote                    89.89   95.37   94.21   90.59   94.21   93.98   94.21
    climate                 91.48   91.48   91.48   91.48   91.48   91.48   91.48
    colic                   79.97   81.09   81.09   82.47   81.09   82.47   81.09
    monks                   90.75   90.73   93.51   83.35   93.51   93.51   93.28
    steel-plates-faults    100.00   89.64  100.00   92.01  100.00  100.00  100.00
    Average                 80.87   81.47   81.89   81.20   82.64   82.26   84.06

    Table  4  Noise ratio summary of Wilcoxon tests with worker quality p_j = 0.6

                 MV      PL     STC      CC    AVNC    CENC   STLNC
    MV
    PL
    STC
    CC
    AVNC
    CENC
    STLNC
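
    The Wilcoxon tests summarized in Tables 4, 5, 8 and 9 compare two algorithms over paired per-dataset results. The sketch below computes the two signed-rank sums that such a test is built on. It is a generic illustration rather than the authors' procedure: it assumes the common convention of dropping zero differences and giving tied absolute differences their average rank, and the function name is hypothetical. Turning the rank sums into a significance decision requires a critical-value table or a normal approximation, which is not shown.

```python
def wilcoxon_rank_sums(a, b):
    """Return (R_plus, R_minus) for paired samples a and b.

    Zero differences are dropped; tied absolute differences share the
    average of the ranks they would otherwise occupy.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of equal absolute differences starting at i.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    r_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return r_plus, r_minus


# First five rows of Table 2: noise ratio of MV versus STLNC.
mv = [28.25, 27.62, 28.76, 26.67, 26.60]
stlnc = [15.83, 24.84, 7.30, 12.90, 26.40]
r_plus, r_minus = wilcoxon_rank_sums(mv, stlnc)
```

    For noise ratio, lower is better, so a large `r_plus` relative to `r_minus` in this call favors STLNC over MV; for model quality the direction is reversed.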

    Table  5  Model quality summary of Wilcoxon tests with worker quality p_j = 0.6

                 MV      PL     STC      CC    AVNC    CENC   STLNC
    MV
    PL
    STC
    CC
    AVNC
    CENC
    STLNC

    Table  6  Noise ratio comparisons with worker quality p_j ∈ [0.55, 0.75] (%)

    Dataset                    MV      PL     STC      CC    AVNC    CENC   STLNC
    biodeg                  14.22   21.14   16.02   13.84   13.46   13.08   12.89
    breast-cancer           16.43   26.22   20.98   19.93   23.43   24.83   24.48
    breast-w                20.46    3.72   10.01    4.15    4.15    4.43    3.72
    credit-a                18.41   20.58   14.93   13.62   13.77   13.04   12.17
    credit-g                17.70   29.60   22.70   22.90   21.60   22.30   24.60
    diabetes                20.18   22.66   24.09   22.27   23.44   22.66   23.44
    heart-statlog           16.30   20.37   15.19   20.00   16.67   16.67   18.52
    hepatitis               12.26   20.65   14.19   14.84   16.77   12.90   12.26
    horse-colic             17.66   15.49   13.86   18.75   14.67   14.13   15.22
    ionosphere              17.38   18.80   13.68    9.69   11.11   10.83   13.96
    kr-vs-kp                17.43   25.19    5.60   11.55    1.31    1.88    2.44
    labor                   17.54   29.82   17.54   12.28   21.05   21.05   14.04
    mushroom                18.07    4.94    4.84    1.67    0.10    0.11       0
    sick                    13.94    1.78    4.98    3.76    1.46    1.54    2.04
    sonar                   15.38   37.50   21.63   25.96   19.23   22.60   20.67
    spambase                19.32   37.54   15.11    9.04    7.00    7.04    6.67
    tic-tac-toe             20.67   27.45   19.31   17.54   15.76   14.41    6.47
    vote                    22.07    6.90   10.57    8.97    4.37    4.83    4.60
    climate                 22.96    8.52   22.96   10.74    8.52    8.52    8.52
    colic                   16.58   19.57   15.49   17.93   15.22   14.40   15.49
    monks                   17.13   12.73    7.18   23.38    2.78    4.86    2.78
    steel-plates-faults     22.46   34.83    7.32   15.92    0.26    0.26    0.21
    Average                 17.93   20.27   14.46   14.49   11.64   11.65   11.15

    Table  7  Model quality comparisons with worker quality p_j ∈ [0.55, 0.75] (%)

    Dataset                    MV      PL     STC      CC    AVNC    CENC   STLNC
    biodeg                  74.58   81.87   76.36   79.99   81.59   80.25   81.78
    breast-cancer           69.43   71.81   69.50   70.27   71.28   71.64   71.64
    breast-w                90.76   94.54   91.10   94.34   93.85   92.40   94.69
    credit-a                82.17   85.36   84.78   85.65   85.65   84.64   84.93
    credit-g                69.50   69.80   70.50   69.60   71.10   69.00   72.00
    diabetes                71.67   74.44   71.15   74.21   73.13   72.87   74.56
    heart-statlog           70.00   78.89   76.30   75.93   79.63   78.52   80.37
    hepatitis               76.17   79.17   77.33   80.50   76.83   77.83   79.00
    horse-colic             82.50   80.35   81.57   82.63   83.51   83.01   82.68
    ionosphere              80.62   83.48   82.33   88.03   87.73   86.90   84.07
    kr-vs-kp                97.94   95.34   97.66   92.86   98.28   98.00   98.06
    labor                   78.33   68.17   78.33   64.33   77.17   77.17   84.33
    mushroom                99.99   98.52   99.95   99.96  100.00  100.00   99.95
    sick                    97.72   96.95   97.48   95.47   97.64   97.83   97.14
    sonar                   63.79   68.93   67.14   70.36   69.29   69.50   70.86
    spambase                86.65   88.87   88.07   84.83   90.37   88.96   90.76
    tic-tac-toe             77.83   73.00   77.58   76.30   77.89   76.85   80.08
    vote                    93.35   93.79   94.98   92.68   95.24   95.00   94.77
    climate                 91.48   91.48   91.48   91.48   91.48   91.48   91.48
    colic                   83.74   77.90   84.19   79.17   82.53   82.63   81.84
    monks                   98.37   85.21   98.60   88.16  100.00  100.00  100.00
    steel-plates-faults     99.90  100.00  100.00   99.69  100.00  100.00  100.00
    Average                 83.48   83.54   84.38   83.47   85.65   85.20   86.14

    Table  8  Noise ratio summary of Wilcoxon tests with worker quality p_j ∈ [0.55, 0.75]

                 MV      PL     STC      CC    AVNC    CENC   STLNC
    MV
    PL
    STC
    CC
    AVNC
    CENC
    STLNC

    Table  9  Model quality summary of Wilcoxon tests with worker quality p_j ∈ [0.55, 0.75]

                 MV      PL     STC      CC    AVNC    CENC   STLNC
    MV
    PL
    STC
    CC
    AVNC
    CENC
    STLNC

    Table  10  Description of eight real-world crowdsourced datasets

    Dataset    Classification task    #Instances  #Positives  #Negatives  #Labelers  #Labels
    Leaves1    maple/alder                   142          96          46         70     1093
    Leaves2    maple/tilia                   140          96          44         74     1044
    Leaves3    alder/eucalyptus               93          46          47         58      407
    Leaves4    alder/poplar                   89          46          43         54      400
    LabelMe1   highway/street                199          89         110         50      395
    LabelMe2   highway/forest                227          89         138         54      476
    LabelMe3   highway/opencountry           240          89         151         54      375
    LabelMe4   highway/insidecity            205          89         116         49      339
Publication history
  • Received:  2021-01-18
  • Published online:  2021-06-20
  • Issue date:  2023-04-20
