Qiang Ji-Peng, Qian Zhen-Yu, Li Yun, Yuan Yun-Hao, Zhu Yi

Qiang Ji-Peng, Qian Zhen-Yu, Li Yun, Yuan Yun-Hao, Zhu Yi. English lexical simplification based on pretrained language representation modeling. Acta Automatica Sinica, 2022, 48(8): 2075-2087. doi: 10.16383/j.aas.c200723
English Lexical Simplification Based on Pretrained Language Representation Modeling

Funds: Supported by National Natural Science Foundation of China (62076217, 61906060, 61703362) and Natural Science Foundation of Jiangsu Province (BK20170513)

    QIANG Ji-Peng  Associate professor at the School of Information Engineering, Yangzhou University. He received his Ph.D. degree in computer science and technology from Hefei University of Technology in 2016. His research interests cover data mining and natural language processing. E-mail: jpqiang@yzu.edu.cn

    QIAN Zhen-Yu  Master student at the School of Information Engineering, Yangzhou University. His research interests cover topic modeling and data mining. E-mail: qzyjnwss@126.com

    LI Yun  Professor at the School of Information Engineering, Yangzhou University. His research interests cover data mining and cloud computing. Corresponding author of this paper. E-mail: liyun@yzu.edu.cn

    YUAN Yun-Hao  Associate professor at the School of Information Engineering, Yangzhou University. He received his Ph.D. degree in pattern recognition and intelligence system from Nanjing University of Science and Technology in 2013. His research interests cover pattern recognition, data mining, and image processing. E-mail: yhyuan@yzu.edu.cn

    ZHU Yi  Lecturer at the School of Information Engineering, Yangzhou University. He received his Ph.D. degree in software engineering from Hefei University of Technology in 2018. His research interests cover data mining and knowledge graphs. E-mail: zhuyi@yzu.edu.cn

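  • Abstract: Lexical simplification replaces the complex words in a given sentence with simpler alternatives of equivalent meaning, thereby simplifying the sentence. Existing lexical simplification methods generate substitution candidates from the complex word alone, without considering its context, which inevitably produces a large number of spurious candidates. To address this, a lexical simplification method based on a pretrained language representation model is proposed, which uses the model both to generate and to rank substitution candidates. During candidate generation, the method requires neither a semantic dictionary nor a parallel corpus, and it takes both the complex word and its context into account. During candidate ranking, the method uses five effective features: in addition to the commonly used word frequency and word similarity features, it introduces three new ones, namely the prediction ranking of the pretrained language representation model, the generation probability of the preceding and following context under that model, and the paraphrase database PPDB. Evaluated on three benchmark datasets, the method shows a clear improvement, with accuracy on average 29.8% higher than that of the state-of-the-art method.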
  • Fig. 1  Comparison of the substitution candidates generated by three lexical simplification methods[16, 18]

    Fig. 2  BERT-LS uses the BERT model to generate candidate words; the input is “the cat perched on the mat”

    Fig. 3  The influence of different mask proportions on the system

    Fig. 5  Evaluation results for different numbers of generated candidate words

    Table 1  Evaluation results of the candidate word generation process

    Method      LexMTurk               BenchLS                NNSeval
    Yamamoto    0.056  0.079  0.065    0.032  0.087  0.047    0.026  0.061  0.037

    Table 2  Evaluation results of the whole simplification system

    Method      LexMTurk        BenchLS         NNSeval
    Yamamoto    0.066  0.066    0.044  0.041    0.444  0.025

    Table 3  The influence of different features on the ranking of candidates

    Method                             LexMTurk        BenchLS         NNSeval         Average
    BERT-LS                            0.864  0.792    0.697  0.616    0.526  0.436    0.696  0.615
    BERT prediction ranking only       0.772  0.608    0.695  0.502    0.531  0.343    0.666  0.484
    Without BERT prediction ranking    0.834  0.778    0.678  0.623    0.473  0.423    0.662  0.608
    Without PPDB                       0.840  0.774    0.682  0.612    0.515  0.431    0.679  0.606

    Table 4  Evaluation results using different BERT models

    Dataset     Model    Candidate generation evaluation    Full system evaluation
    LexMTurk    Base     0.317  0.246  0.277                0.746  0.700
    BenchLS     Base     0.233  0.317  0.269                0.586  0.537
    NNSeval     Base     0.172  0.230  0.197                0.393  0.347

    Table 5  Simplified sentence examples in LexMTurk

    Sentence 1
        Original sentence: Much of the water carried by these streams is diverted
        Labels: Changed, turned, moved, rerouted, separated, split, altered, veered, …
        Generated words: transferred, directed, discarded, converted, derived
        Final: transferred
    Sentence 2
        Original sentence: Following the death of Schidlof from a heart attack in 1987, the Amadeus Quartet disbanded
        Labels: dissolved, scattered, quit, separated, died, ended, stopped, split
        Generated words: formed, retired, ceased, folded, reformed, resigned, collapsed, closed, terminated
        Final: formed
    Sentence 3
        Original sentence: …, apart from the efficacious or prevenient grace of God, is utterly unable to…
        Labels: ever, present, showy, useful, effective, capable, strong, valuable, powerful, active, efficient, …
        Generated words: irresistible, inspired, inspiring, extraordinary, energetic, inspirational
        Final: irresistible
    Sentence 4
        Original sentence: …, resembles the mid-19th century Crystal Palace in London
        Labels: mimics, represents, matches, shows, mirrors, echos, favors, match
        Generated words: suggests, appears, follows, echoes, references, features, reflects, approaches
        Final: suggests
    Sentence 5
        Original sentence: …who first demonstrated the practical application of electromagnetic waves,…
        Labels: showed, shown, performed, displayed
        Generated words: suggested, realized, discovered, observed, proved, witnessed, sustained
        Final: suggested
    Sentence 6
        Original sentence: …a well-defined low and strong wind gusts in squalls as the system tracked into…
        Labels: followed, traveled, looked, moved, entered, steered, went, directed, trailed, traced…
        Generated words: rolled, ran, continued, fed, raced, stalked, slid, approached, slowed
        Final: rolled
    Sentence 7
        Original sentence: …is one in which part of the kinetic energy is changed to some other form of energy…
        Labels: active, moving, movement, motion, static, motive, innate, kinetic, real, strong, driving…
        Generated words: mechanical, total, dynamic, physical, the, momentum, velocity, ballistic
        Final: mechanical
    Sentence 8
        Original sentence: None of your watched items were edited in the time period displayed
        Labels: changed, refined, revise, finished, fixed, revised, revised, scanned, shortened
        Generated words: altered, modified, organized, incorporated, appropriate
        Final: altered

    Table 6  Simplified sentence examples in LexMTurk

    Sentence 1
        Original sentence: Triangles can also be classified according to their internal angles, measured here in degrees
        Labels: grouped, categorized, arranged, labeled, divided, organized, separated, defined, described …
        Generated words: divided, described, separated, designated
        Final: classified
    Sentence 2
        Original sentence: …; he retained the conductorship of the Vienna Philharmonic until 1927
        Labels: kept, held, had, got
        Generated words: maintained, held, kept, remained, continued, shared
        Final: maintained
    Sentence 3
        Original sentence: …, and a Venetian in Paris in 1528 also reported that she was said to be beautiful
        Labels: said, told, stated, wrote, declared, indicated, noted, claimed, announced, mentioned
        Generated words: noted, confirmed, described, claimed, recorded, said
        Final: reported
    Sentence 4
        Original sentence: …, the king will rarely play an active role in the development of an offensive or ….
        Labels: infrequently, hardly, uncommonly, barely, seldom, unlikely, sometimes, not, seldomly…
        Generated words: never, usually, seldom, not, barely, hardly
        Final: never
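
The candidate ranking step summarized in the abstract combines five features, but the abstract does not say how they are combined. The sketch below assumes, purely for illustration, that each feature ranks the candidates independently and that the candidate with the lowest average rank is chosen. The candidate words, the feature scores, and the function names (`rank_positions`, `average_rank`, `best_candidate`) are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical feature scores (higher is better) for three made-up candidates
# replacing "perched" in "the cat perched on the mat". None of these numbers
# come from the paper.
FEATURES: List[Dict[str, float]] = [
    {"sat": 0.50, "rested": 0.20, "landed": 0.10},  # model prediction score
    {"sat": 0.40, "rested": 0.35, "landed": 0.05},  # context generation probability
    {"sat": 0.60, "rested": 0.70, "landed": 0.55},  # similarity to the complex word
    {"sat": 900.0, "rested": 300.0, "landed": 400.0},  # word frequency
    {"sat": 1.0, "rested": 1.0, "landed": 0.0},  # 1.0 if listed in PPDB
]


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank candidates under one feature; tied candidates share the better rank."""
    return {
        word: 1 + sum(1 for other in scores.values() if other > score)
        for word, score in scores.items()
    }


def average_rank(features: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each candidate's rank over all features."""
    per_feature = [rank_positions(scores) for scores in features]
    return {
        word: sum(ranks[word] for ranks in per_feature) / len(per_feature)
        for word in features[0]
    }


def best_candidate(features: List[Dict[str, float]]) -> str:
    """Return the candidate with the lowest average rank (ties: alphabetical)."""
    averages = average_rank(features)
    return min(averages, key=lambda word: (averages[word], word))


if __name__ == "__main__":
    print(average_rank(FEATURES))
    print(best_candidate(FEATURES))
```

Averaging ranks rather than raw scores keeps features with very different scales, such as word frequency counts and probabilities, from dominating one another. A weighted combination would be an equally plausible alternative under the same assumptions.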
