Multi-scale Visual Semantic Enhancement for Multimodal Named Entity Recognition Method

Wang Hai-Rong, Xu Xi, Wang Tong, Chen Fang-Ping

Citation: Wang Hai-Rong, Xu Xi, Wang Tong, Chen Fang-Ping. Multi-scale visual semantic enhancement for multimodal named entity recognition method. Acta Automatica Sinica, 2024, 50(6): 1234−1245 doi: 10.16383/j.aas.c230573

doi: 10.16383/j.aas.c230573
Funds: Supported by Natural Science Foundation of Ningxia (2023AAC03316) and Key Research Project of Education Department of Ningxia Hui Autonomous Region (NYG2022051)
More Information
    Author Bio:

    WANG Hai-Rong Professor at North Minzu University. She received her Ph.D. degree from Northeastern University in 2015. Her research interest covers big data knowledge engineering and intelligent information processing. Corresponding author of this paper. E-mail: wanghr@nun.edu.cn

    XU Xi Master student at the School of Computer Science and Engineering, North Minzu University. His main research interest is multimodal information extraction. E-mail: 20217403@stu.nmu.edu.cn

    WANG Tong Master student at the School of Computer Science and Engineering, North Minzu University. Her main research interest is multimodal information extraction. E-mail: is_wangtong@163.com

    CHEN Fang-Ping Master student at the School of Computer Science and Engineering, North Minzu University. Her main research interest is multimodal information extraction. E-mail: 17393213357@163.com

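  • Abstract: To address the missing image-feature semantics and the weak semantic constraints on multimodal representations found in research on multimodal named entity recognition (MNER), this paper proposes a multi-scale visual semantic enhancement method for multimodal named entity recognition (MSVSE). The method extracts several kinds of visual features to complete the image semantics, mines the semantic interactions between the text features and these visual features, and generates and fuses multi-scale visual semantic features to obtain a multimodal text representation enhanced by multi-scale visual semantics. A visual entity classifier decodes the multi-scale visual semantic features, which imposes a semantic consistency constraint on the visual features. A multi-task label decoder mines the fine-grained semantics of the multimodal text representation and of the text features, and joint decoding resolves semantic bias, further improving named entity recognition accuracy. To validate the method, experiments were run on the Twitter-2015 and Twitter-2017 datasets and compared with 10 other methods; the method's average F1 score improved.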
  • Fig. 1  The framework of the MSVSE model

    Fig. 2  The multimodal feature fusion module

    Fig. 3  The multi-task label decoder

    Fig. 4  Performance comparison of visual entity classification on Twitter-2015

    Fig. 5  Performance comparison of visual entity classification on Twitter-2017

    Table 1  Performance comparison of methods on the datasets (%)

    Method            Twitter-2015                          Twitter-2017
                      PER     LOC     ORG     MISC    F1      PER     LOC     ORG     MISC    F1
    MSB               86.44   77.16   52.91   36.05   73.47   –       –       –       –       84.32
    MAF               84.67   81.18   63.35   41.82   73.42   91.51   85.80   85.10   68.79   86.25
    UMGF              84.26   83.17   62.45   42.42   74.85   91.92   85.22   83.13   69.83   85.51
    M3S               86.05   81.32   62.97   41.36   75.03   92.73   84.81   82.49   69.53   86.06
    UMT               85.24   81.58   63.03   39.45   73.41   91.56   84.73   82.24   70.10   85.31
    UAMNer            84.95   81.28   61.41   38.34   73.10   90.49   81.52   82.09   64.32   84.90
    VAE               85.82   81.56   63.20   43.67   75.07   91.96   81.89   84.13   74.07   86.37
    MNER-QG           85.68   81.42   63.62   41.53   74.94   93.17   86.02   84.64   71.83   87.25
    RGCN              86.36   82.08   60.78   41.56   75.00   92.86   86.10   84.05   72.38   87.11
    HvpNet            85.74   81.78   61.92   40.81   74.33   92.28   84.81   84.37   65.20   85.80
    MSVSE             86.72   81.63   64.08   38.91   75.11   93.24   85.96   85.22   70.00   87.34
    MSVSE − HvpNet    0.98    −0.15   2.16    −1.90   0.78    0.96    1.15    0.85    4.80    1.54
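
The last row of Table 1 appears to be the column-wise difference between the MSVSE row and the HvpNet row. The following minimal Python sketch reproduces that arithmetic under this reading of the row label. The names `score_delta`, `msvse`, `hvpnet`, and `labels` are illustrative and do not come from the paper; the scores are copied from Table 1.

```python
from decimal import Decimal

COLUMNS = ["PER", "LOC", "ORG", "MISC", "F1"]

# Scores copied from Table 1: five Twitter-2015 columns, then five Twitter-2017 columns.
msvse = ["86.72", "81.63", "64.08", "38.91", "75.11",
         "93.24", "85.96", "85.22", "70.00", "87.34"]
hvpnet = ["85.74", "81.78", "61.92", "40.81", "74.33",
          "92.28", "84.81", "84.37", "65.20", "85.80"]


def score_delta(ours, baseline):
    """Return per-column differences (ours - baseline) as exact decimals."""
    return [Decimal(a) - Decimal(b) for a, b in zip(ours, baseline)]


delta = score_delta(msvse, hvpnet)
labels = [f"{dataset}/{column}"
          for dataset in ("Twitter-2015", "Twitter-2017")
          for column in COLUMNS]

for label, value in zip(labels, delta):
    # The "+" format flag prints an explicit sign for gains and losses.
    print(f"{label}: {value:+}")
```

`Decimal` is used instead of `float` so that two-decimal scores subtract exactly and the printed differences match the table digits.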

    Table 2  Structural ablation experiments for the model (%)

    Method                                       Twitter-2015                            Twitter-2017
                                                 PER     LOC     ORG     MISC    F1      PER     LOC     ORG     MISC    F1
    MSVSE                                        86.72   81.63   64.08   38.91   75.11   93.24   85.96   85.22   70.00   87.34
    w/o self-attention mechanism                 86.49   81.20   63.21   41.56   74.83   93.05   86.52   84.37   67.34   86.79
    w/o similarity                               86.33   81.59   63.15   40.84   74.91   92.94   86.59   84.07   68.24   86.75
    w/o self-attention mechanism and similarity  86.80   81.38   63.32   39.62   74.67   92.97   85.87   84.41   67.96   86.67
    w/o multi-task label decoder                 86.49   81.78   62.68   37.60   74.69   92.98   84.83   85.02   71.66   87.14
    w/o visual entity classifier                 86.52   81.64   63.06   39.89   74.79   93.37   84.83   85.82   66.24   86.92

    Table 3  Visual feature ablation experiments in the joint encoder (%)

    Text / Visual labels / Image captions    Twitter-2015                            Twitter-2017
                                             PER     LOC     ORG     MISC    F1      PER     LOC     ORG     MISC    F1
    $\checkmark$ $\checkmark$                86.72   81.63   64.08   38.91   75.11   93.24   85.96   85.22   70.00   87.34
    $\checkmark$                             86.76   81.68   61.21   39.46   74.73   92.95   86.20   84.60   70.82   87.11
    $\checkmark$ $\checkmark$                86.87   81.74   63.72   37.80   74.87   93.03   85.71   84.43   71.71   87.16
    $\checkmark$ $\checkmark$ $\checkmark$   86.51   81.85   62.20   38.36   74.72   93.73   85.96   84.62   70.97   87.38

    Table 4  Visual feature ablation experiments in multi-scale visual semantic prefixes (%)

    Regional visual features / Visual labels / Image captions    Twitter-2015                            Twitter-2017
                                                                 PER     LOC     ORG     MISC    F1      PER     LOC     ORG     MISC    F1
    $\checkmark$ $\checkmark$ $\checkmark$                       86.72   81.63   64.08   38.91   75.11   93.24   85.96   85.22   70.00   87.34
    $\checkmark$                                                 86.25   81.93   63.99   38.23   74.76   93.16   84.83   85.47   69.10   87.13
    $\checkmark$ $\checkmark$                                    86.56   81.60   64.01   38.59   74.93   93.02   85.79   85.97   68.67   87.28
    $\checkmark$ $\checkmark$                                    86.87   81.79   63.36   38.68   74.98   92.94   86.52   85.14   68.94   87.14

    Table 5  Performance comparison of methods under single-scale visual features (%)

    Method               Single-scale visual feature      Twitter-2015 F1   Twitter-2017 F1
    MAF                  Regional visual features         73.42             86.25
    MSB                  Image labels                     73.47             84.32
    ITA                  Visual labels                    75.18             85.67
    ITA                  5 visual descriptions            75.17             85.75
    ITA                  Optical character recognition    75.01             85.64
    MSVSE                only regional visual features    74.84             86.75
    MSVSE                only visual labels               74.66             87.17
    MSVSE                only visual descriptions         74.56             87.23
    MSVSE                w/o visual prefix                74.89             87.08
    MSVSE (this paper)                                    75.11             87.34

    Table 6  Performance comparison of methods under different learning rates (%)

    Dataset        Learning rate ($\times 10^{-5}$)
                   1      2      3      4      5      6
    Twitter-2015   73.4   75.0   75.1   74.8   74.6   74.5
    Twitter-2017   87.1   86.8   87.3   87.5   87.2   87.3
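
As a reading aid for Table 6, the sketch below picks the learning rate with the highest score for each dataset. It assumes the table entries are F1 scores, which the page does not state explicitly. The names `f1_by_lr` and `best_learning_rate` are illustrative and are not taken from the paper.

```python
# Scores (%) copied from Table 6, keyed by learning rate in units of 1e-5.
f1_by_lr = {
    "Twitter-2015": {1: 73.4, 2: 75.0, 3: 75.1, 4: 74.8, 5: 74.6, 6: 74.5},
    "Twitter-2017": {1: 87.1, 2: 86.8, 3: 87.3, 4: 87.5, 5: 87.2, 6: 87.3},
}


def best_learning_rate(scores):
    """Return (rate_multiplier, score) for the highest score.

    Rates are visited in ascending order, so a tie goes to the smaller rate.
    """
    best = max(sorted(scores), key=scores.get)
    return best, scores[best]


for dataset, scores in f1_by_lr.items():
    multiplier, score = best_learning_rate(scores)
    print(f"{dataset}: best learning rate {multiplier}e-5 (score {score})")
```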

    Table 7  Comparison of parameter count and time efficiency

    Method               Parameters (MB)   Training time (s)   Validation time (s)
    MSB                  122.97            45.80               3.31
    UMGF                 191.32            314.42              18.73
    MAF                  136.09            103.39              6.37
    ITA                  122.97            65.40               4.69
    UMT                  148.10            156.73              8.59
    HvpNet               143.34            70.36               9.34
    MSVSE (this paper)   119.27            75.81               7.03

    Table 8  Performance comparison of MNER methods based on pre-trained language models (%)

    Method              Twitter-2015   Twitter-2017
    Glove-BiLSTM-CRF    69.15          79.37
    BERT-CRF            71.81          83.44
    BERT-large-CRF      73.53          86.81
    XLMR-CRF            77.37          89.39
    Prompting ChatGPT   79.33          91.43
    MSVSE               75.11          87.34
Publication history
  • Received:  2023-09-13
  • Accepted:  2024-02-22
  • Published online:  2024-04-01
  • Issue date:  2024-06-27
