Adapted Expectation Maximization Algorithm for Gaussian Mixture Clustering With Censored Data

Yu Hai-Yan, Chen Jing-Jing, Qiu Hang, Wang Yong, Wang Ruo-Fan

Citation: Yu Hai-Yan, Chen Jing-Jing, Qiu Hang, Wang Yong, Wang Ruo-Fan. Adapted expectation maximization algorithm for Gaussian mixture clustering with censored data. Acta Automatica Sinica, 2021, 47(6): 1302−1314 doi: 10.16383/j.aas.c190081

doi: 10.16383/j.aas.c190081
Funds: Supported by National Natural Science Foundation of China (71601026, 61601331, 71571105), Chongqing Science and Technology Commission (cstc2017zdcy-zdzxX0013), Key Research and Development Program of Sichuan Province (2018SZ0114, 2019YFS0271), and Tianjin Natural Science Foundation Youth Project (18JCQNJC04700)
More Information
    Author Bio:

    YU Hai-Yan  Associate professor at Chongqing University of Posts and Telecommunications (CQUPT). Postdoctoral visiting scholar at The Pennsylvania State University. He received his Ph.D. degree from Tianjin University in 2015. His research interests cover statistical machine learning and causal inference. Corresponding author of this paper. E-mail: yuhy@cqupt.edu.cn

    CHEN Jing-Jing  Master's student at the School of Economics and Management, Chongqing University of Posts and Telecommunications. Her main research interests cover clustering algorithms and missing-data mechanisms. E-mail: chenjingjing_361@163.com

    QIU Hang  Associate professor at the School of Computer Science and Engineering, University of Electronic Science and Technology of China (UESTC). He received his Ph.D. degree in computer application technology from UESTC in 2011. From 2013 to 2014, he was a visiting researcher at the University of Nottingham, UK. His research interests cover machine learning and computer graphics. E-mail: qiuhang@uestc.edu.cn

    WANG Yong  Professor in the Department of Management Engineering, Chongqing University of Posts and Telecommunications. He received his Ph.D. degree in computer science and technology from Chongqing University in 2007. His research interests cover data analysis and information security. E-mail: wangyong_cqupt@163.com

    WANG Ruo-Fan  Lecturer at the School of Information Technology Engineering, Tianjin University of Technology and Education. She received her Ph.D. degree from Tianjin University in 2015. From 2018 to 2019, she was a visiting scholar in the Department of Biomedical Engineering, The Pennsylvania State University. Her research interests cover analysis of neuroimaging data and machine learning. E-mail: wangrf@tju.edu.cn

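  • Abstract: To handle data that are missing not at random in clustering problems, this paper analyzes the efficiency of an expectation maximization (EM) algorithm for censored data on the basis of the Gaussian mixture clustering model, and reveals how the likelihood function of the censored data acts on the model and the algorithm. The proposed method is compared with the standard EM algorithm in terms of the Akaike information criterion (AIC), information divergence, and other indices. By partitioning the censored data and introducing indicator variables, the posterior probabilities and the likelihood function of the clustering model parameters are derived, and the first- and second-order moment estimators of the truncated normal function are adjusted. Following the theory of efficient estimation, the optimal parameters are obtained from an equation on the expectation of the score vector. For the same censored data set, the proposed clustering algorithm estimates the cluster centers more accurately. Experimental results confirm that the proposed algorithm outperforms the standard EM algorithm for data missing at random in Gaussian mixture clustering.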
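The abstract mentions first- and second-order moments of a truncated normal, which are the quantities an E-step needs in place of a value hidden by censoring. The following is a minimal sketch of the textbook univariate case only, assuming a single right-censoring point `c`; it is not the paper's multivariate derivation, and the function name `right_censored_moments` is hypothetical.

```python
from statistics import NormalDist


def right_censored_moments(mu, sigma, c):
    """Return E[X | X > c] and E[X^2 | X > c] for X ~ N(mu, sigma^2).

    Textbook univariate formulas; an illustrative assumption, not the
    multivariate estimator derived in the paper.
    """
    std = NormalDist()                 # standard normal N(0, 1)
    a = (c - mu) / sigma               # standardized censoring point
    tail = 1.0 - std.cdf(a)            # P(X > c)
    if tail <= 0.0:
        raise ValueError("censoring point lies too far in the upper tail")
    hazard = std.pdf(a) / tail         # inverse Mills ratio
    mean = mu + sigma * hazard         # first moment of the truncated normal
    var = sigma ** 2 * (1.0 + a * hazard - hazard ** 2)
    second = var + mean ** 2           # second (raw) moment
    return mean, second


# Standard normal censored at 0: the hidden value is only known to exceed 0.
m1, m2 = right_censored_moments(0.0, 1.0, 0.0)
print(round(m1, 4), round(m2, 4))
```

For the standard normal censored at 0, the conditional mean equals sqrt(2/pi), about 0.798, and the conditional second moment equals 1, which gives a quick sanity check. In an EM iteration these two numbers would replace the censored observation and its square when the component mean and variance are updated.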
  • Fig. 1  Comparison of the two algorithms on the dataset DS-a with right censoring

    Fig. 2  Comparison of the two algorithms on the dataset DS-a with double-sided censoring

    Fig. 3  Comparison of the two algorithms on the dataset DS-b with left censoring

    Fig. 4  Comparison of the two algorithms on the dataset DS-b with double-sided censoring

    Fig. 5  Comparison of the two algorithms on the blood sugar test dataset with right censoring

    Table 1  Kullback-Leibler divergence (KLD) between the true densities and the estimated densities of the synthetic data sets

    Dataset                         Observed values (censored)   EMGM              cenEMGM
    DS-a, right censoring           0.072 ± 0.011                0.261 ± 0.016     0.051 ± 0.003
    DS-a, double-sided censoring    0.226 ± 0.017                10.602 ± 1.966    0.028 ± 0.009
    DS-b, left censoring            4.362 ± 0.393                32.263 ± 4.193    22.583 ± 3.392
    DS-b, double-sided censoring    4.219 ± 0.381                30.321 ± 4.128    29.655 ± 3.938

    Table 2  AIC comparison of the two estimation algorithms on the synthetic data sets

    Dataset                         EMGM           cenEMGM
    DS-a, right censoring           12852 ± 594    12349 ± 481
    DS-a, double-sided censoring    12782 ± 436    12323 ± 417
    DS-b, left censoring            9435 ± 317     8815 ± 305
    DS-b, double-sided censoring    8759 ± 293     7152 ± 264

    Table 3  Comparison of the two algorithms with the real data and its extended data

    Censoring setting                        Metric             EMGM algorithm    cenEMGM algorithm
    Right censoring rate 8.51 %              Cluster centers    (4.50, 7.22)      (4.53, 7.54)
                                                                (4.94, 9.55)      (6.01, 10.51)
                                             KLD                12.7              9.1
                                             AIC                4366              4263
    Right censoring rate 11.67 %             Cluster centers    (4.50, 7.20)      (4.53, 7.54)
                                                                (4.81, 9.70)      (6.08, 9.85)
                                             KLD                11.35             9.08
                                             AIC                4290              4209
    Double-sided censoring rate 15.05 %      Cluster centers    (5.10, 7.43)      (5.10, 7.48)
    (right 8.51 %, left 6.54 %)                                 (5.48, 8.56)      (5.48, 8.94)
                                             KLD                173.7             158.6
                                             AIC                2226              −24327
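The tables above rank the two algorithms by AIC and KLD, where lower values are better for both. As a reading aid, here is a minimal sketch of the standard definitions, assuming a full-covariance Gaussian mixture for the parameter count and using the closed-form KLD between two univariate normals as a stand-in. The page does not state how the paper computed KLD between mixture densities, so the helper names and this simplification are assumptions.

```python
import math


def gmm_free_params(n_components, dim):
    """Free parameters of a full-covariance Gaussian mixture model."""
    weights = n_components - 1                      # weights sum to 1
    means = n_components * dim
    covs = n_components * dim * (dim + 1) // 2      # symmetric matrices
    return weights + means + covs


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for univariate normals P and Q (closed form)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)


# Hypothetical numbers for illustration only, not taken from the paper.
k = gmm_free_params(n_components=2, dim=2)
print(k, aic(log_likelihood=-100.0, n_params=k))
print(kl_normal(0.0, 1.0, 1.0, 1.0))
```

Because both algorithms fit the same model family to the same data, the parameter count is identical and an AIC gap between them reflects only the difference in maximized log-likelihood.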
Publication history
  • Received:  2019-02-11
  • Accepted:  2019-07-30
  • Published online:  2021-06-10
  • Issue date:  2021-06-10
