Annotation Scheme and Corpus Construction for Cardiovascular Diseases Risk Factors From Chinese Electronic Medical Records

SU Jia, HE Bin, WU Hao, YANG Jin-Feng, GUAN Yi, JIANG Jing-Chi, WANG Huan-Zheng, YU Qiu-Bin

Citation: SU Jia, HE Bin, WU Hao, YANG Jin-Feng, GUAN Yi, JIANG Jing-Chi, WANG Huan-Zheng, YU Qiu-Bin. Annotation Scheme and Corpus Construction for Cardiovascular Diseases Risk Factors From Chinese Electronic Medical Records. ACTA AUTOMATICA SINICA, 2019, 45(2): 420-426. doi: 10.16383/j.aas.2018.c170206

doi: 10.16383/j.aas.2018.c170206
Funds: National Natural Science Foundation of China 71531007

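More Information
    Author Bio:

    SU Jia  Ph. D. candidate at Harbin Institute of Technology. His research interest covers information extraction and natural language processing. E-mail: sjd163mail@163.com

    HE Bin  Ph. D. candidate at Harbin Institute of Technology. His research interest covers named entity recognition and entity relation extraction. E-mail: hebin_hit@foxmail.com

    WU Hao  Master student at the Second Affiliated Hospital of Harbin Medical University. Her research interest covers hemangioma and the mechanism of circRNA in fibrosis. E-mail: rosiewuyanxi@gmail.com

    YANG Jin-Feng  Lecturer and Ph. D. at Harbin University of Science and Technology. His research interest covers health informatics and natural language processing. E-mail: fondofbeyond@163.com

    JIANG Jing-Chi  Ph. D. candidate at Harbin Institute of Technology. His research interest covers medical knowledge networks and knowledge graphs. E-mail: jiangjingchi0118@163.com

    WANG Huan-Zheng  Master student at Harbin Institute of Technology. His research interest covers knowledge mining and natural language processing. E-mail: whz123_hit@163.com

    YU Qiu-Bin  Deputy chief physician at the Second Affiliated Hospital of Harbin Medical University. Her research interest covers data mining on electronic medical records. E-mail: yuqiubin6695@163.com

    Corresponding author:

    GUAN Yi  Professor and Ph. D. at Harbin Institute of Technology. His research interest covers health informatics and natural language processing. Corresponding author of this paper. E-mail: guanyi@hit.edu.cn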


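  • Abstract: This paper addresses the problem of annotating cardiovascular disease (CVD) risk factors and their related information in Chinese electronic medical records (CEMRs). It proposes a CVD risk factor annotation scheme adapted to the content characteristics of CEMRs, and constructs the first annotated corpus of CVD risk factors in the field of Chinese health information processing.
    1)  Associate editor responsible for this paper: ZHANG Min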
  • Fig.  1  The thumbnail of risk factor annotation scheme

    Fig.  2  Annotation flow chart of risk factor corpus

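    Table  1  Annotation guidelines for CVDs risk factors in CEMR

    | Category | Risk factor | Indicator | Annotation guideline |
    | --- | --- | --- | --- |
    | Disease | Overweight or obesity | Mention | A description of overweight or obesity, e.g., "身材肥胖" (obese build) |
    | | | Waist circumference | The patient's waist or abdominal circumference value |
    | | Hypertension | Mention | Hypertension or a history of hypertension, e.g., "既往高血压病1年" (hypertension for 1 year; the duration is annotated together with the mention) |
    | | | High blood pressure | The patient's blood pressure value or any statement reflecting high blood pressure, e.g., "查体: $\cdots$ BP 130/80 mmHg$\cdots$" (physical examination: ... BP 130/80 mmHg ...) |
    | | | Blood pressure control | A description that the patient needs blood pressure regulation or that existing regulation is unsatisfactory, e.g., "血压控制不理想" (blood pressure poorly controlled) |
    | | | Medication | Medication explicitly intended to regulate blood pressure, e.g., "平素口服珍菊降压" (routinely takes Zhenju Jiangya orally) |
    | | Diabetes | Mention | Diabetes or a history of diabetes, e.g., "无糖尿病病史" (no history of diabetes) |
    | | | High blood glucose | High blood glucose, blood glucose test values, or any other description indicating high blood glucose, e.g., "随机血糖: 14.5 mmol/L" (random blood glucose: 14.5 mmol/L) |
    | | | Blood glucose control | A description that the patient needs blood glucose regulation or that existing regulation is unsatisfactory, e.g., "长期以来血压、血糖控制不佳" (blood pressure and blood glucose have long been poorly controlled) |
    | | | Medication | Medication or diet explicitly intended to regulate blood glucose, e.g., "口服降糖药控制尚可" (reasonably controlled with oral hypoglycemic drugs) |
    | | Dyslipidemia | Mention | Dyslipidemia, hyperlipidemia, or a history of hyperlipidemia, e.g., "高血脂10余年" (hyperlipidemia for more than 10 years) |
    | | | High blood lipids | Blood lipid test values or any description indicating high blood lipids, e.g., "总胆固醇(GPO酶法): 5.39 mmol/L" (total cholesterol (GPO enzymatic method): 5.39 mmol/L) |
    | | | Blood lipid control | A description that the patient needs lipid regulation or that existing regulation is unsatisfactory, e.g., "诊疗计划: 控制血脂" (treatment plan: control blood lipids) |
    | | | Medication | Medication explicitly intended to regulate blood lipids, e.g., "调节血脂, 稳定冠脉粥样斑块: 立普妥 20 mg Qn po" (regulate blood lipids and stabilize coronary atheromatous plaque: Lipitor 20 mg Qn po) |
    | | Chronic kidney disease | Mention | A description of chronic kidney disease, e.g., "病历特点: 肾炎病史20余年" (case characteristics: history of nephritis for more than 20 years) |
    | | Atherosclerosis | Mention | A description of atherosclerosis, atheromatous plaque, or coronary stenosis, e.g., "临床确定诊断: 冠状动脉粥样硬化" (confirmed clinical diagnosis: coronary atherosclerosis) |
    | | Obstructive sleep apnea syndrome | Mention | A description of obstructive sleep apnea syndrome, e.g., "临床确定诊断:肾囊肿阻塞性睡眠呼吸暂停综合征" (confirmed clinical diagnosis: renal cyst, obstructive sleep apnea syndrome) |
    | Lifestyle | Smoking | Mention | A description of the patient smoking or having a smoking history, e.g., "吸烟40余年" (smoking for more than 40 years) |
    | | | Smoking cessation | A description of the patient having quit or not quit smoking, e.g., "戒烟1年" (quit smoking 1 year ago; the 1 year is the time elapsed since quitting, not the duration of smoking, so it does not reflect smoking severity) |
    | | | Smoking amount | A description of how much the patient smokes, e.g., "$\cdots$每天20支" (... 20 cigarettes per day) |
    | | Excessive drinking | Mention | A description of excessive drinking or of drinking severity, e.g., "嗜酒40余年" (heavy drinking for more than 40 years) |
    | | | Drinking amount | A description of how much the patient drinks, e.g., "饮酒史20余年, 1斤/日" (drinking history of more than 20 years, 1 jin per day) |
    | Unchangeable | Family history of CVD | Mention | The patient has a family history of CVD, or a first-degree relative (parent, sibling, child) has a history of CVD, e.g., "病例特点: $\cdots$母亲患有冠心病$\cdots$" (case characteristics: ... mother has coronary heart disease ...) |
    | | Age | Mention | The patient's age, e.g., "年龄: 55岁" (age: 55 years) |
    | | | Age group | The patient's age group, e.g., "青年男患" (young male patient) |
    | | Gender | Mention | The patient's gender, e.g., "中年" (middle-aged) |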
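The three-level structure of Table 1 (category, risk factor, indicator) can be pictured as a nested mapping. The sketch below is only one possible encoding: the English keys, the `Annotation` class, and the `is_valid` helper are hypothetical names introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical encoding of Table 1: category -> risk factor -> indicators.
# All keys are illustrative English labels, not identifiers from the paper.
SCHEME = {
    "disease": {
        "overweight_or_obesity": ["mention", "waist_circumference"],
        "hypertension": ["mention", "high_blood_pressure",
                         "blood_pressure_control", "medication"],
        "diabetes": ["mention", "high_blood_glucose",
                     "blood_glucose_control", "medication"],
        "dyslipidemia": ["mention", "high_blood_lipids",
                         "blood_lipid_control", "medication"],
        "chronic_kidney_disease": ["mention"],
        "atherosclerosis": ["mention"],
        "obstructive_sleep_apnea_syndrome": ["mention"],
    },
    "lifestyle": {
        "smoking": ["mention", "smoking_cessation", "smoking_amount"],
        "excessive_drinking": ["mention", "drinking_amount"],
    },
    "unchangeable": {
        "family_history_of_cvd": ["mention"],
        "age": ["mention", "age_group"],
        "gender": ["mention"],
    },
}


@dataclass(frozen=True)
class Annotation:
    """A labeled character span [start, end) in a medical record."""
    start: int
    end: int
    risk_factor: str
    indicator: str


def is_valid(ann: Annotation) -> bool:
    """Check that the span is non-empty and the label pair exists in SCHEME."""
    if not (0 <= ann.start < ann.end):
        return False
    for factors in SCHEME.values():
        if ann.risk_factor in factors:
            return ann.indicator in factors[ann.risk_factor]
    return False


# Example from Table 1: the duration is annotated together with the mention.
text = "既往高血压病1年"
ann = Annotation(start=2, end=8, risk_factor="hypertension", indicator="mention")
print(text[ann.start:ann.end], is_valid(ann))
```

Keeping the scheme in a single mapping makes it straightforward to reject label pairs that Table 1 does not define, such as a "medication" indicator under "gender".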

    Table  2  The IAA in the training

    | | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 |
    | --- | --- | --- | --- | --- | --- |
    | $P$ | 0.810 | 0.977 | 0.967 | 0.986 | 0.988 |
    | $R$ | 0.815 | 0.977 | 0.962 | 0.986 | 0.988 |
    | $F$ | 0.812 | 0.977 | 0.964 | 0.986 | 0.988 |
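Table 2 reports inter-annotator agreement (IAA) as $P$, $R$, and $F$. A minimal sketch of how such figures are commonly computed follows, assuming one annotator is treated as the reference and spans must match exactly; this page does not state the matching criterion, so both choices, along with the function names `agreement` and `f_score`, are assumptions. The last part checks that each reported $F$ equals the harmonic mean $2PR/(P+R)$ of the reported $P$ and $R$.

```python
def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def agreement(reference: set, other: set) -> tuple:
    """Exact-match P/R/F of `other` against `reference`.

    Each annotation is a (start, end, label) tuple. Treating one annotator
    as the reference and requiring exact matches are assumptions.
    """
    matched = len(reference & other)
    p = matched / len(other) if other else 0.0
    r = matched / len(reference) if reference else 0.0
    return p, r, f_score(p, r)


# Toy example: two annotators agree on two of three spans each.
annotator_a = {(2, 8, "hypertension"), (10, 14, "diabetes"), (20, 22, "smoking")}
annotator_b = {(2, 8, "hypertension"), (10, 14, "diabetes"), (30, 32, "age")}
p, r, f = agreement(annotator_a, annotator_b)

# Consistency check on the values reported in Table 2: (P, R, F) per round.
table2 = {
    1: (0.810, 0.815, 0.812),
    2: (0.977, 0.977, 0.977),
    3: (0.967, 0.962, 0.964),
    4: (0.986, 0.986, 0.986),
    5: (0.988, 0.988, 0.988),
}
consistent = all(
    abs(round(f_score(tp, tr), 3) - tf) < 1e-9 for tp, tr, tf in table2.values()
)
print(consistent)
```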

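    Table  3  The statistics of risk factor annotated corpus

    | Category | Risk factor | Count |
    | --- | --- | --- |
    | Disease | Overweight or obesity | 18 |
    | | Hypertension | 3729 |
    | | Diabetes | 1007 |
    | | Dyslipidemia | 372 |
    | | Chronic kidney disease | 26 |
    | | Atherosclerosis | 144 |
    | | Obstructive sleep apnea syndrome | 1 |
    | Behavior and lifestyle | Smoking | 508 |
    | | Excessive drinking | 95 |
    | Unchangeable | Family history of CVD | 10 |
    | | Age | 1859 |
    | | Gender | 1909 |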

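    Table  4  Diagnosis results of CVDs

    | Features \ Method | LR | RF | GBDT | XGBoost |
    | --- | --- | --- | --- | --- |
    | Self-reported symptoms + examination results | 0.662 | 0.672 | 0.756 | 0.720 |
    | Self-reported symptoms + examination results + risk factors | 0.675 | 0.688 | 0.798 | 0.811 |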
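The effect of adding risk factor features in Table 4 is easier to read as a per-method difference. The short sketch below only redoes the arithmetic on the reported numbers; the evaluation metric is not named on this page, so the values are referred to simply as scores, and the variable names are illustrative.

```python
# Scores from Table 4: (without risk factors, with risk factors).
# The metric is not stated on this page, so these are treated as plain scores.
scores = {
    "LR": (0.662, 0.675),
    "RF": (0.672, 0.688),
    "GBDT": (0.756, 0.798),
    "XGBoost": (0.720, 0.811),
}

# Absolute gain from adding risk factor features, rounded to three decimals
# to avoid floating-point noise in the subtraction.
gains = {
    method: round(with_rf - base, 3)
    for method, (base, with_rf) in scores.items()
}

# Method that benefits most from the additional features.
largest_gain = max(gains, key=gains.get)

for method, gain in gains.items():
    print(f"{method}: +{gain:.3f}")
print("largest gain:", largest_gain)
```

Every method scores higher with the risk factor features, and the gain is largest for XGBoost (0.091) and smallest for LR (0.013).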
Publication history
  • Received:  2017-04-17
  • Accepted:  2017-10-29
  • Published:  2019-02-20
