English Lexical Simplification Based on Pretrained Language Representation Modeling

QIANG Ji-Peng, QIAN Zhen-Yu, LI Yun, YUAN Yun-Hao, ZHU Yi

Citation: Qiang Ji-Peng, Qian Zhen-Yu, Li Yun, Yuan Yun-Hao, Zhu Yi. English lexical simplification based on pretrained language representation modeling. Acta Automatica Sinica, 2021, 48(x): 1001−1013 doi: 10.16383/j.aas.c200723

doi: 10.16383/j.aas.c200723
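Funds: Supported by the National Natural Science Foundation of China (62076217, 61906060, 61703362) and the Natural Science Foundation of Jiangsu Province (BK20170513)
More Information
    Author Bio:

    QIANG Ji-Peng  Associate professor at the School of Information Engineering, Yangzhou University, China. He received his Ph.D. degree in computer science from Hefei University of Technology in 2016. His research interests include data mining and natural language processing. E-mail: jpqiang@yzu.edu.cn

    QIAN Zhen-Yu  Currently pursuing the M.S. degree in software engineering at the School of Information Engineering, Yangzhou University, China. His research interests include topic modeling and data mining. E-mail: qzyjnwss@126.com

    LI Yun  Professor at the School of Information Engineering, Yangzhou University, China. His research interests include data mining and cloud computing. Corresponding author of this paper. E-mail: liyun@yzu.edu.cn

    YUAN Yun-Hao  Associate professor at the School of Information Engineering, Yangzhou University, China. He received his Ph.D. degree in pattern recognition and intelligent systems from Nanjing University of Science and Technology in 2013. His research interests include pattern recognition, data mining, and image processing. E-mail: yhyuan@yzu.edu.cn

    ZHU Yi  Lecturer at the School of Information Engineering, Yangzhou University, China. He received his Ph.D. degree in software engineering from Hefei University of Technology in 2018. His research interests include data mining and knowledge graphs. E-mail: zhuyi@yzu.edu.cn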


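  • Abstract: Lexical simplification (LS) replaces complex words in a given sentence with simpler alternatives of equivalent meaning, thereby simplifying the sentence. Existing LS methods generate candidate substitutes from the complex word alone, without considering its context, which inevitably produces a large number of spurious candidates. To address this, we propose BERT-LS, an LS method based on the pretrained representation model BERT, which uses BERT both to generate and to rank candidate substitutes. During candidate generation, BERT-LS needs no semantic dictionary or parallel corpus, and it takes into account both the complex word itself and its context. During candidate ranking, BERT-LS uses five effective features: in addition to the commonly used word frequency and word similarity, it introduces three new features, namely the BERT prediction rank, a BERT-based context generation probability, and the paraphrase database PPDB. Evaluated on three benchmark datasets, BERT-LS achieves a clear improvement, with accuracy on average 29.8% higher than that of the state-of-the-art methods.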
  • Fig. 1  Comparison of the candidate substitutes produced by three lexical simplification methods. Given the sentence “John composed these verses.” and the complex words “composed” and “verses”, the top three simplification candidates for each complex word are generated by BERT-LS, PaetzoldNE[16], and Rec-LS[17].

    Fig. 2  BERT-LS uses the BERT model to generate candidate words. The input is “the cat perched on the mat.” [CLS] and [SEP] are two special symbols in BERT: the two input sentences start with [CLS] and are separated by [SEP].
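The input layout described in the Fig. 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the second copy of the sentence has the complex word replaced by `[MASK]` (the caption itself only mentions `[CLS]` and `[SEP]`), it follows the usual BERT convention of a closing `[SEP]`, and it uses whitespace tokens instead of BERT's WordPiece tokenizer. The function name `build_bert_ls_input` is hypothetical.

```python
def build_bert_ls_input(sentence_tokens, complex_index, mask_token="[MASK]"):
    """Return the token sequence for a sentence pair: original + masked copy.

    Assumption: the complex word is masked only in the second copy, so the
    model still sees the word itself in the first copy.
    """
    if not 0 <= complex_index < len(sentence_tokens):
        raise IndexError("complex_index is outside the sentence")
    masked = list(sentence_tokens)
    masked[complex_index] = mask_token
    # [CLS] opens the pair; [SEP] separates and closes the two sentences.
    return ["[CLS]", *sentence_tokens, "[SEP]", *masked, "[SEP]"]


tokens = "the cat perched on the mat .".split()
pair = build_bert_ls_input(tokens, tokens.index("perched"))
print(" ".join(pair))
```

A masked language model would then be asked for the most probable words at the `[MASK]` position, and those words would serve as the candidate substitutes.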

    Fig. 3a  The influence of different mask proportions on the system

    Fig. 3b  The influence of different mask proportions on the system

    Fig. 4  Evaluation results for different numbers of generated candidate words

    Table 1  Evaluation results of the candidate generation step

                  LexMTurk                    BenchLS                     NNSeval
    Method        Precision  Recall  F        Precision  Recall  F        Precision  Recall  F
    Yamamoto      0.056      0.079   0.065    0.032      0.087   0.047    0.026      0.061   0.037
    Biran         0.153      0.098   0.119    0.130      0.144   0.136    0.084      0.079   0.081
    Devlin        0.164      0.092   0.118    0.133      0.153   0.143    0.092      0.093   0.092
    Horn          0.153      0.134   0.143    0.235      0.131   0.168    0.134      0.088   0.106
    Glavaš        0.151      0.122   0.135    0.142      0.191   0.163    0.105      0.141   0.121
    PaetzoldCA    0.177      0.140   0.156    0.180      0.252   0.210    0.118      0.161   0.136
    PaetzoldNE    0.310      0.142   0.195    0.270      0.209   0.236    0.186      0.136   0.157
    Rec-LS        0.151      0.154   0.152    0.129      0.246   0.170    0.103      0.155   0.124
    BERT-Single   0.253      0.197   0.221    0.176      0.239   0.203    0.138      0.185   0.158
    BERT-LS       0.306      0.238   0.268    0.244      0.331   0.281    0.194      0.260   0.222
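As a quick consistency check on Table 1, the sketch below recomputes the F column from precision and recall. It assumes the F value is the standard harmonic mean of precision and recall (F1), which this page does not state explicitly; the BERT-LS rows agree with that assumption to within rounding.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) for BERT-LS, copied from Table 1.
bert_ls_rows = {
    "LexMTurk": (0.306, 0.238, 0.268),
    "BenchLS": (0.244, 0.331, 0.281),
    "NNSeval": (0.194, 0.260, 0.222),
}

for dataset, (p, r, reported) in bert_ls_rows.items():
    computed = f_measure(p, r)
    # Reported values have three decimals, so allow a rounding tolerance.
    agrees = abs(computed - reported) < 0.001
    print(f"{dataset}: computed {computed:.4f}, reported {reported}, agrees: {agrees}")
```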

    Table 2  Evaluation results of the full simplification system

                  LexMTurk               BenchLS                NNSeval
    Method        Precision  Accuracy    Precision  Accuracy    Precision  Accuracy
    Yamamoto      0.066      0.066       0.044      0.041       0.444      0.025
    Biran         0.714      0.034       0.124      0.123       0.121      0.121
    Devlin        0.368      0.366       0.309      0.307       0.335      0.117
    PaetzoldCA    0.578      0.396       0.423      0.423       0.297      0.297
    Horn          0.761      0.663       0.546      0.341       0.364      0.172
    Glavaš        0.710      0.682       0.480      0.252       0.456      0.197
    PaetzoldNE    0.676      0.676       0.642      0.434       0.544      0.335
    Rec-LS        0.784      0.256       0.734      0.335       0.665      0.218
    BERT-Single   0.694      0.652       0.495      0.461       0.314      0.285
    BERT-LS       0.864      0.792       0.697      0.616       0.526      0.436

    Table 3  The influence of different features on candidate ranking

                                            LexMTurk               BenchLS                NNSeval                Average
    Setting                                 Precision  Accuracy    Precision  Accuracy    Precision  Accuracy    Precision  Accuracy
    BERT-LS                                 0.864      0.792       0.697      0.616       0.526      0.436       0.696      0.615
    BERT prediction rank only               0.772      0.608       0.695      0.502       0.531      0.343       0.666      0.484
    Without BERT prediction rank            0.834      0.778       0.678      0.623       0.473      0.423       0.662      0.608
    Without context generation probability  0.838      0.760       0.706      0.614       0.515      0.406       0.686      0.593
    Without similarity                      0.818      0.766       0.651      0.604       0.473      0.418       0.647      0.596
    Without word frequency                  0.806      0.670       0.709      0.550       0.556      0.397       0.691      0.539
    Without PPDB                            0.840      0.774       0.682      0.612       0.515      0.431       0.679      0.606
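Table 3 ablates the ranking features one at a time. This page does not say how the feature scores are combined into a final ranking, so the sketch below assumes one simple possibility, averaging each candidate's per-feature rank. The function name, feature names, and all scores are hypothetical and serve only to show how dropping a feature changes the inputs to the ranking.

```python
def average_rank(candidates, feature_scores):
    """Order candidates by mean rank across features (rank 1 = best).

    Assumption: a higher score is better for every feature.
    """
    if not feature_scores:
        return list(candidates)
    totals = {cand: 0 for cand in candidates}
    for scores in feature_scores.values():
        ordered = sorted(candidates, key=lambda c: scores[c], reverse=True)
        for rank, cand in enumerate(ordered, start=1):
            totals[cand] += rank
    return sorted(candidates, key=lambda c: totals[c] / len(feature_scores))


# Hypothetical scores for three candidate substitutes of one complex word.
candidates = ["wrote", "created", "formed"]
feature_scores = {
    "bert_prediction": {"wrote": 0.6, "created": 0.3, "formed": 0.1},
    "frequency": {"wrote": 5.0, "created": 4.0, "formed": 4.5},
    "similarity": {"wrote": 0.7, "created": 0.8, "formed": 0.5},
}
ranked = average_rank(candidates, feature_scores)
print(ranked)

# An ablation in the style of Table 3: drop one feature and rank again.
without_frequency = {k: v for k, v in feature_scores.items() if k != "frequency"}
print(average_rank(candidates, without_frequency))
```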

    Table 4  Evaluation results using different BERT models

                      Candidate generation        Full system
    Dataset   Model   Precision  Recall  F        Precision  Accuracy
    LexMTurk  Base    0.317      0.246   0.277    0.746      0.700
    LexMTurk  Large   0.334      0.259   0.292    0.786      0.742
    LexMTurk  WWM     0.306      0.238   0.268    0.864      0.792
    BenchLS   Base    0.233      0.317   0.269    0.586      0.537
    BenchLS   Large   0.252      0.342   0.290    0.636      0.589
    BenchLS   WWM     0.244      0.331   0.281    0.697      0.616
    NNSeval   Base    0.172      0.230   0.197    0.393      0.347
    NNSeval   Large   0.185      0.247   0.211    0.402      0.360
    NNSeval   WWM     0.194      0.260   0.222    0.526      0.436

    Table 5  Simplified sentences in LexMTurk. Complex words are marked with bold and underline. Labels are manually annotated.

    Sentence 1: Much of the water carried by these streams is diverted.
    Labels: Changed, turned, moved, rerouted, separated, split, altered, veered, …
    Generated candidates: transferred, directed, discarded, converted, derived
    Final substitute: transferred

    Sentence 2: Following the death of Schidlof from a heart attack in 1987, the Amadeus Quartet disbanded.
    Labels: dissolved, scattered, quit, separated, died, ended, stopped, split
    Generated candidates: formed, retired, ceased, folded, reformed, resigned, collapsed, closed, terminated
    Final substitute: formed

    Sentence 3: …, apart from the efficacious or prevenient grace of God, is utterly unable to…
    Labels: ever, present, showy, useful, effective, capable, strong, valuable, powerful, active, efficient…
    Generated candidates: irresistible, inspired, inspiring, extraordinary, energetic, inspirational
    Final substitute: irresistible

    Sentence 4: …, resembles the mid-19th century Crystal Palace in London.
    Labels: mimics, represents, matches, shows, mirrors, echos, favors, match
    Generated candidates: suggests, appears, follows, echoes, references, features, reflects, approaches
    Final substitute: suggests

    Sentence 5: …who first demonstrated the practical application of electromagnetic waves,…
    Labels: showed, shown, performed, displayed
    Generated candidates: suggested, realized, discovered, observed, proved, witnessed, sustained
    Final substitute: suggested

    Sentence 6: …a well-defined low and strong wind gusts in squalls as the system tracked into…
    Labels: followed, traveled, looked, moved, entered, steered, went, directed, trailed, traced…
    Generated candidates: rolled, ran, continued, fed, raced, stalked, slid, approached, slowed
    Final substitute: rolled

    Sentence 7: …is one in which part of the kinetic energy is changed to some other form of energy…
    Labels: active, moving, movement, motion, static, motive, innate, kinetic, real, strong, driving…
    Generated candidates: mechanical, total, dynamic, physical, the, momentum, velocity, ballistic
    Final substitute: mechanical

    Sentence 8: None of your watched items were edited in the time period displayed.
    Labels: changed, refined, revise, finished, fixed, revised, revised, scanned, shortened
    Generated candidates: altered, modified, organized, incorporated, appropriate
    Final substitute: altered

    Table 6  Simplified sentences in LexMTurk. Complex words are marked with bold and underline. Candidate words that appear in the labels are marked in bold.

    Sentence 1: Triangles can also be classified according to their internal angles, measured here in degrees.
    Labels: grouped, categorized, arranged, labeled, divided, organized, separated, defined, described …
    Generated candidates: divided, described, separated, designated
    Final substitute: classified

    Sentence 2: …; he retained the conductorship of the Vienna Philharmonic until 1927.
    Labels: kept, held, had, got
    Generated candidates: maintained, held, kept, remained, continued, shared
    Final substitute: maintained

    Sentence 3: …, and a Venetian in Paris in 1528 also reported that she was said to be beautiful
    Labels: said, told, stated, wrote, declared, indicated, noted, claimed, announced, mentioned
    Generated candidates: noted, confirmed, described, claimed, recorded, said
    Final substitute: reported

    Sentence 4: …, the king will rarely play an active role in the development of an offensive or ….
    Labels: infrequently, hardly, uncommonly, barely, seldom, unlikely, sometimes, not, seldomly…
    Generated candidates: never, usually, seldom, not, barely, hardly
    Final substitute: never
  • [1] Hirsh D, Nation P. What vocabulary size is needed to read unsimplified texts for pleasure? Reading in a Foreign Language, 1992, 8: 689−696
    [2] Nation I S P. Learning Vocabulary in Another Language. Ernst Klett Sprachen, 2001
    [3] De Belder J, Moens M. Text simplification for children. In: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. Geneva, Switzerland, 2010: 19−26
    [4] Paetzold G, Specia L. Unsupervised lexical simplification for non-native speakers. In: Proceedings of the National Conference on Artificial Intelligence. Phoenix, USA, 2016: 3761−3767
    [5] Feng L. Automatic readability assessment for people with intellectual disabilities. ACM SIGACCESS Accessibility and Computing, 2009, 93: 84−91
    [6] Saggion H. Automatic text simplification. Synthesis Lectures on Human Language Technologies. USA: Morgan & Claypool, 2017, 10(1): 1−137
    [7] Devlin S, Tait J. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, 1998
    [8] Lesk M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of the International Conference on Systems Documentation. New York, USA, 1986: 24−26
    [9] Sinha R. UNT-SimpRank: Systems for lexical simplification ranking. In: Proceedings of the Joint Conference on Lexical and Computational Semantics. Montreal, Canada, 2012: 493−496
    [10] Leroy G, Endicott J E, Kauchak D, et al. User evaluation of the effects of a text simplification algorithm using term familiarity on perception, understanding, learning, and information retention. Journal of Medical Internet Research, 2013, 15(7): e144 doi: 10.2196/jmir.2569
    [11] Biran O, Brody S, Elhadad N, et al. Putting it simply: A context-aware approach to lexical simplification. In: Proceedings of the Meeting of the Association for Computational Linguistics. Portland, USA, 2011: 496−501
    [12] Yatskar M, Pang B, Danescu-Niculescu-Mizil C, et al. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In: Proceedings of the North American Chapter of the Association for Computational Linguistics. Los Angeles, USA, 2010: 365−368
    [13] Horn C, Manduca C, Kauchak D, et al. Learning a lexical simplifier using Wikipedia. In: Proceedings of the Meeting of the Association for Computational Linguistics. Baltimore, USA, 2014: 458−463
    [14] Glavaš G, Štajner S. Simplifying lexical simplification: Do we need simplified corpora? In: Proceedings of the International Joint Conference on Natural Language Processing. Beijing, China, 2015: 63−68
    [15] Paetzold G, Specia L. Unsupervised lexical simplification for non-native speakers. In: Proceedings of the National Conference on Artificial Intelligence. Phoenix, USA, 2016: 3761−3767
    [16] Paetzold G, Specia L. Lexical simplification with neural ranking. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. Valencia, Spain, 2017: 34−40
    [17] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. [Online], available: https://arxiv.org/abs/1810.04805?context=cs, May 24, 2019
    [18] Gooding S, Kochmar E. Recursive context-aware lexical simplification. In: Proceedings of the International Joint Conference on Natural Language Processing. Hong Kong, China, 2019: 4852−4862
    [19] Coster W, Kauchak D. Simple English Wikipedia: A new text simplification task. In: Proceedings of the Meeting of the Association for Computational Linguistics: Human Language Technologies. DBLP, 2011
    [20] Xu W, Napoles C, Pavlick E, et al. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 2016, 4: 401−415
    [21] Nisioi S, Štajner S, Ponzetto S P, et al. Exploring neural text simplification models. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2017
    [22] Dong Y, Li Z, Rezagholizadeh M, et al. EditNTS: A neural programmer-interpreter model for sentence simplification through explicit editing. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019
    [23] Xu W, Callison-Burch C, Napoles C. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 2015, 3(1): 283−297
    [24] Shardlow M. A survey of automated text simplification. International Journal of Advanced Computer Science and Applications, 2014, 4(1)
    [25] Paetzold G, Specia L. A survey on lexical simplification. Journal of Artificial Intelligence Research, 2017: 549−593
    [26] Pavlick E, Callison-Burch C. Simple PPDB: A paraphrase database for simplification. In: Proceedings of the Meeting of the Association for Computational Linguistics. Berlin, Germany, 2016: 143−148
    [27] Maddela M, Xu W. A word-complexity lexicon and a neural readability ranking model for lexical simplification. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium, 2018: 3749−3760
    [28] Leon, Hervas, Gervas, et al. Empirical identification of text simplification strategies for reading-impaired people. In: Proceedings of the Conference of the Association for the Advance of Assistive Technologies in Europe. Maastricht, Netherlands, 2011: 567−574
    [29] Lee J, Yoon W, Kim S, et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019, 36(4): 1234−1240
    [30] Lample G, Conneau A. Cross-lingual language model pretraining. [Online], available: https://arxiv.org/abs/1901.07291v1, Jan 22, 2019
    [31] Mikolov T, Grave E, Bojanowski P, et al. Advances in pre-training distributed word representations. [Online], available: https://arxiv.org/abs/1712.09405, Dec 26, 2017
    [32] Brysbaert M, New B. Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, Instruments & Computers, 2009, 41(4): 977−990
    [33] Ganitkevitch J, Van Durme B, Callison-Burch C. PPDB: The paraphrase database. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. Atlanta, USA, 2013
    [34] Little D. Common European Framework of Reference for Languages. The TESOL Encyclopedia of English Language Teaching. American Cancer Society, 2018
    [35] Gooding S, Kochmar E. Complex word identification as a sequence labelling task. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 1148−1153
    [36] Kajiwara T, Matsumoto H, Yamamoto K, et al. Selecting proper lexical paraphrase for children. In: Proceedings of the International Conference on Computational Linguistics. Copenhagen, Denmark, 2013: 59−7
Publication history
  • Received:  2020-09-05
  • Accepted:  2020-12-23
  • Published online:  2022-01-08
