一种基于信息熵迁移的文本检测模型自蒸馏方法

陈建炜; 杨帆; 赖永炫

doi:10.16383/j.aas.c210598

一种基于信息熵迁移的文本检测模型自蒸馏方法

doi: 10.16383/j.aas.c210598 cstr: 32138.14.j.aas.c210598

陈建炜^1,,
杨帆^{1, 2,},
赖永炫^3,

1.
厦门大学航空航天学院厦门 361005
2.
厦门大学深圳研究院深圳 518057
3.
厦门大学信息学院厦门 361005

基金项目: 科技创新2030——“新一代人工智能”重大项目(2021ZD0112600), 国家自然科学基金委员会面上项目(62173282, 61872154), 广东省自然科学基金(2021A1515011578), 深圳市基础研究专项面上项目(JCYJ20190809161603551)资助

详细信息

作者简介:
陈建炜：厦门大学航空航天学院硕士研究生. 主要研究方向为计算机视觉, 图像处理. E-mail: jianweichen@ stu.xmu.edu.cn

杨帆：厦门大学航空航天学院副教授. 主要研究方向为机器学习, 数据挖掘和生物信息学. 本文通信作者. E-mail: yang@xmu.edu.cn

赖永炫：厦门大学信息学院教授. 主要研究方向为大数据分析和管理, 智能交通系统, 深度学习和车载网络. E-mail: laiyx@xmu.edu.cn

计量
- 文章访问数: 751
- HTML全文浏览量: 385
- PDF下载量: 175
- 被引次数: 0
出版历程
- 收稿日期: 2021-06-29
- 录用日期: 2022-02-10
- 网络出版日期: 2023-10-12
- 刊出日期: 2024-11-26

Self-distillation via Entropy Transfer for Scene Text Detection

CHEN Jian-Wei^1
,,
YANG Fan^{1, 2
,},
LAI Yong-Xuan^3
,

1.
School of Aerospace Engineering, Xiamen University, Xiamen 361005
2.
Shenzhen Research Institute, Xiamen University, Shenzhen 518057
3.
School of Informatics, Xiamen University, Xiamen 361005

Funds: Supported by National Key Research and Development Program of China (2021ZD0112600), National Natural Science Foundation of China (62173282, 61872154), Natural Science Foundation of Guangdong Province (2021A1515011578), and Shenzhen Fundamental Research Program (JCYJ20190809161603551)

More Information

Author Bio:
CHEN Jian-Wei　Master student at the School of Aerospace Engineering, Xiamen University. His research interest covers computer vision and image processing

YANG Fan　Associate professor at the School of Aerospace Engineering, Xiamen University. His research interest covers machine learning, data mining, and bio-informatics. Corresponding author of this paper

LAI Yong-Xuan　Professor at the School of Informatics, Xiamen University. His research interest covers big data analysis and management, intelligent transportation systems, deep learning, and vehicular networks

摘要

摘要: 前沿的自然场景文本检测方法大多基于全卷积语义分割网络, 利用像素级分类结果有效检测任意形状的文本, 其主要缺点是模型大、推理时间长、内存占用高, 这在实际应用中限制了其部署. 提出一种基于信息熵迁移的自蒸馏训练方法(Self-distillation via entropy transfer, SDET), 利用文本检测网络深层网络输出的分割图(Segmentation map, SM)信息熵作为待迁移知识, 通过辅助网络将信息熵反馈给浅层网络. 与依赖教师网络的知识蒸馏 (Knowledge distillation, KD)不同, SDET仅在训练阶段增加一个辅助网络, 以微小的额外训练代价实现无需教师网络的自蒸馏(Self-distillation, SD). 在多个自然场景文本检测的标准数据集上的实验结果表明, SDET在基线文本检测网络的召回率和F1得分上, 能显著优于其他蒸馏方法.
- 自然场景 /
- 文本检测 /
- 知识蒸馏 /
- 自蒸馏 /
- 信息熵
Abstract: Most of the state-of-the-art text detection methods in natural scenes are based on full convolutional network, which can effectively detect arbitrary shape text by using the pixel level classification results from the segmentation network. The main defects of these methods, i.e. large size of the networks, time-consuming forward reasoning and large memory occupation, hinder their deployment in practical applications. In this paper, we propose self-distillation via entropy transfer (SDET), which takes the information entropy of the segmentation map (SM) output by the deep layers of the text detection network as the knowledge to be transferred, and feeds it directly back into the shallow layers through an auxiliary network. Different from traditional knowledge distillation (KD) which relies on teacher network, SDET utilizes an auxiliary network in the training stage and realizes self-distillation (SD) at a small extra training cost. Experiments conducted on multiple standard datasets for natural scene text detection demonstrate that SDET significantly improves the recall rate and F1 score of the baseline text detection networks, and outperforms other distillation methods.
- Natural scene /
- text detection /
- knowledge distillation (KD) /
- self-distillation (SD) /
- information entropy

HTML全文

图 1 可微二值化文本检测网络的分割图和信息熵图可视化

Fig. 1 Segmentation map and entropy map visualization of differentiable binarization text detection network

下载: 全尺寸图片幻灯片

图 2 不同知识蒸馏方法对比

Fig. 2 Comparison of different knowledge distillation methods

下载: 全尺寸图片幻灯片

图 3 SDET训练框架

Fig. 3 SDET training framework

下载: 全尺寸图片幻灯片

图 4 辅助网络的3种结构形式

Fig. 4 The three types of auxiliary networks

下载: 全尺寸图片幻灯片

图 5 SDET与基线模型的检测结果对比((a)真实标签; (b)基线模型检测结果; (c) SDET训练后的模型检测结果)

Fig. 5 Comparison of detection results between SDET and baseline models ((a) Ground-truth; (b) Detection results of baseline models; (c) Detection results of models trained with SDET)

下载: 全尺寸图片幻灯片

表 1 不同辅助分类器对SDET的影响 (%)

Table 1 The impact of different auxiliary classifiers on SDET (%)

模型	方法	ICDAR2013			ICDAR2015
模型	方法	P	R	F	P	R	F
MV3-EAST	基线	81.7	64.4	72.0	80.9	75.4	78.0
	A型	78.8	65.9	71.8	78.8	76.3	77.5
	B型	84.4	66.5	74.4	81.3	77.0	79.1
	C型	81.4	67.4	73.7	78.9	77.7	78.3
MV3-DB	基线	83.7	66.0	73.8	87.1	71.8	78.7
	A型	84.1	68.8	75.7	86.5	73.9	79.7
	B型	81.1	67.3	73.6	87.8	71.7	78.9
	C型	84.9	67.9	75.4	87.8	73.0	79.7

下载: 导出CSV

表 2 不同特征金字塔位置对B型的影响 (%)

Table 2 The impact of different feature pyramid positions on type B (%)

方法	特征图尺寸(像素)	P	R	F
基线	—	80.9	75.4	78.0
P0	${\text{16}} \times {\text{16}}$	79.1	75.8	77.4
P1	${\text{32}} \times {\text{32}}$	79.5	76.5	78.0
P2	${\text{64}} \times {\text{64}}$	80.7	77.4	79.0
P3	${\text{128}} \times {\text{128}}$	81.3	77.0	79.1

下载: 导出CSV

表 3 MV3-DB在不同数据集上的知识蒸馏实验结果(%)

Table 3 Experimental results of knowledge distillation of MV3-DB on different datasets (%)

方法	ICDAR2013			TD500			TD-TR			ICDAR2015			Total-text			CASIA-10K
方法	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
基线	83.7	66.0	73.8	78.7	71.4	74.9	83.6	74.4	78.7	87.1	71.8	78.7	87.2	66.9	75.7	88.1	51.9	65.3
ST	82.5	65.8	73.2	77.0	73.0	74.9	84.6	73.5	78.7	85.4	72.2	78.2	87.4	65.3	74.8	88.8	49.4	63.5
KA	82.5	66.8	73.8	79.5	71.3	75.2	86.3	72.5	78.8	85.0	73.3	78.7	85.9	66.8	75.2	87.8	51.4	64.8
FitNets	84.7	65.4	73.8	78.6	73.3	75.8	85.3	74.0	79.2	85.3	73.3	78.8	87.4	67.5	76.2	88.0	52.3	65.6
SKD	82.4	68.8	75.0	81.2	70.6	75.5	84.8	74.5	79.3	87.4	71.6	78.7	87.4	67.0	75.9	88.6	51.6	65.2
SD	83.5	67.8	74.8	79.4	72.2	75.6	85.0	74.0	79.1	85.1	73.0	78.6	87.0	67.6	76.1	87.1	52.0	65.1
SAD	82.8	66.7	73.9	78.7	72.3	75.4	87.3	72.0	78.9	86.7	72.7	79.1	86.5	67.1	75.6	88.4	50.7	64.4
本文方法	84.1	68.8	75.7	80.6	72.2	76.2	85.6	74.6	79.7	86.5	73.9	79.7	87.5	68.4	76.8	87.4	53.4	66.3

下载: 导出CSV

表 4 MV3-EAST在不同数据集上的知识蒸馏实验结果(%)

Table 4 Experimental results of knowledge distillation of MV3-EAST on different datasets (%)

方法	ICDAR2013			ICDAR2015			CASIA-10K
方法	P	R	F	P	R	F	P	R	F
基线	81.7	64.4	72.0	80.9	75.4	78.0	66.1	64.9	65.5
ST	77.8	64.9	70.8	80.9	75.1	77.9	64.7	65.1	64.9
KA	78.6	64.0	70.5	78.2	76.4	77.3	67.7	63.0	65.3
FitNets	82.4	65.8	73.2	78.0	77.8	77.9	65.4	64.2	64.8
SKD	79.5	66.3	72.3	81.9	75.6	78.6	66.6	64.7	65.6
SD	80.2	63.8	71.1	79.6	74.7	77.1	66.2	63.5	64.8
SAD	81.4	65.6	72.6	80.2	76.5	78.3	65.7	64.1	64.9
本文方法	84.4	66.5	74.4	81.3	77.0	79.1	70.8	63.0	66.7

下载: 导出CSV

表 5 SDET与DSN在不同数据集上的对比(%)

Table 5 Comparison of SDET and DSN on different datasets (%)

方法	ICDAR2013			TD500			TD-TR			ICDAR2015			Total-text			CASIA-10K
方法	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
基线	83.7	66.0	73.8	78.7	71.4	74.9	83.6	74.4	78.7	87.1	71.8	78.7	87.2	66.9	75.7	88.1	51.9	65.3
DSN	84.4	68.0	75.3	79.7	71.5	75.4	86.4	72.2	78.7	85.8	73.4	79.1	86.1	67.9	75.9	87.9	52.3	65.6
本文方法	84.1	68.8	75.7	80.6	72.2	76.2	85.6	74.6	79.7	86.5	73.9	79.7	87.5	68.4	76.8	87.4	53.4	66.3

下载: 导出CSV

表 6 SDET在不同数据集上提升ResNet50-DB的效果(%)

Table 6 The effect of SDET on improving ResNet50-DB on different datasets (%)

方法	ICDAR2013			TD500			TD-TR			ICDAR2015			Total-text			CASIA-10K
方法	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
基线	86.3	72.9	79.0	84.1	75.9	79.8	87.3	80.4	83.7	90.3	80.1	84.9	87.7	79.4	83.3	90.1	64.7	75.3
本文方法	82.7	77.2	79.9	79.9	81.5	80.7	87.2	83.0	85.0	90.3	82.1	86.0	87.4	81.8	84.5	86.0	68.7	76.4

下载: 导出CSV

参考文献(39)

[1]	Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3431−3440
[2]	Yuan Y H, Chen X L, Wang J D. Object-contextual representations for semantic segmentation. arXiv preprint arXiv: 1909.11065, 2019.
[3]	Lv P Y, Liao M H, Yao C, Wu W H, Bai X. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In: Proceedings of the European Conference on Computer Vision. Munich, Germany: Springer, 2018. 67−83
[4]	He K M, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 2961−2969
[5]	Ye J, Chen Z, Liu J H, Du B. TextFuseNet: Scene text detection with richer fused features. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence. Yokohama, Japan: 2020. 516−522
[6]	He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 770−778
[7]	Hinton G E, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv preprint arXiv: 1503.02531, 2015.
[8]	赖轩, 曲延云, 谢源, 裴玉龙. 基于拓扑一致性对抗互学习的知识蒸馏. 自动化学报, 2023, 49(1): 102−110 doi: 10.16383/j.aas.200665 Lai Xuan, Qu Yan-Yun, Xie Yuan, Pei Yu-Long. Topology-guided adversarial deep mutual learning for knowledge distillation. Acta Automatica Sinica, 2023, 49(1): 102−110 doi: 10.16383/j.aas.200665
[9]	Romero A, Ballas N, Kahou S E, Chassang A, Gatta C, Bengio Y. FitNets: Hints for thin deep nets. arXiv preprint arXiv: 1412.6550, 2014.
[10]	Zagoruyko S, Komodakis N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv: 1612.03928, 2016.
[11]	Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, et al. ICDAR2015 competition on robust reading. In: Proceedings of the 13th International Conference on Document Analysis and Recognition. Nancy, France: IEEE, 2015. 1156−1160
[12]	Chng C K, Chan C S. Total-text: A comprehensive data-set for scene text detection and recognition. In: Proceedings of the 14th International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE, 2017. 935−942
[13]	Cho J H, Hariharan B. On the efficacy of knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019. 4794−4802
[14]	Yang P, Yang G W, Gong X, Wu P P, Han X, Wu J S, et al. Instance segmentation network with self-distillation for scene text detection. IEEE Access, 2020, 8: 45825−45836 doi: 10.1109/ACCESS.2020.2978225
[15]	Vu T H, Jain H, Bucher M, Cord M, Pérez P. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE, 2019. 2517−2526
[16]	Lee C Y, Xie S N, Gallagher P, Zhang Z Y, Tu Z W. Deeply-supervised nets. In: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics. San Diego, USA: PMLR, 2015. 562−570
[17]	Hou Y N, Ma Z, Liu C X, Loy C C. Learning lightweight lane detection CNNs by self attention distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019. 1013−1021
[18]	王润民, 桑农, 丁丁, 陈杰, 叶齐祥, 高常鑫, 等. 自然场景图像中的文本检测综述. 自动化学报, 2018, 44(12): 2113−2141 Wang Run-Min, Sang Nong, Ding Ding, Chen Jie, Ye Qi-Xiang, Gao Chang-Xin, et al. Text detection in natural scene image: A survey. Acta Automatica Sinica, 2018, 44(12): 2113−2141
[19]	Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv: 1506.01497, 2015.
[20]	Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y, et al. SSD: Single shot multi-box detector. In: Proceedings of the European Conference on Computer Vision. Amsterdam, Netherlands: 2016. 21−37
[21]	Liao M H, Shi B G, Bai X, Wang X G, Liu W Y. Textboxes: A fast text detector with a single deep neural network. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI, 2017. 4161−4167
[22]	Tian Z, Huang W L, He T, He P, Qiao Y. Detecting text in natural image with connectionist text proposal network. In: Proce-edings of the European Conference on Computer Vision. Amsterdam, Netherlands: Springer, 2016. 56−72
[23]	Zhou X Y, Yao C, Wen H, Wang Y Z, Zhou S C, He W R, et al. East: An efficient and accurate scene text detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 5551−5560
[24]	Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Proceedings of the Medical Image Computing and Computer Assisted Intervention. Munich, Germany: Springer, 2015. 234−241
[25]	Liao M H, Wan Z Y, Yao C, Chen K, Bai X. Real-time scene text detection with differentiable binarization. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI, 2020. 11474−11481
[26]	Wang W H, Xie E Z, Li X, Hou W B, Lu T, Yu G, et al. Shape robust text detection with progressive scale expansion network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE, 2019. 9336−9345
[27]	Wang W H, Xie E Z, Song X G, Zang Y H, Wang W J, Lu T, et al. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019. 8440−8449
[28]	Xu Y C, Wang Y K, Zhou W, Wang Y P, Yang Z B, Bai X. Textfield: Learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing, 2019, 28(11): 5566−5579 doi: 10.1109/TIP.2019.2900589
[29]	He T, Shen C H, Tian Z, Gong D, Sun C M, Yan Y L. Knowledge adaptation for efficient semantic segmentation. In: Proce-edings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE, 2019. 578−587
[30]	Liu Y F, Chen K, Liu C, Qin Z C, Luo Z B, Wang J D. Structured knowledge distillation for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE, 2019. 2604−2613
[31]	Wang Y K, Zhou W, Jiang T, Bai X, Xu Y C. Intra-class feature variation distillation for semantic segmentation. In: Proce-edings of the European Conference on Computer Vision. Glasg-ow, UK: Springer, 2020. 346−362
[32]	Zhang L F, Song J B, Gao A, Chen J W, Bao C L, Ma K S. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Seo-ul, South Korea: IEEE, 2019. 3713−3722
[33]	Howard A, Sandler M, Chu G, Chen L C, Chen B, Tan M X, et al. Searching for MobileNetV3. In: Proceedings of the IEEE/ CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019. 1314−1324
[34]	Lin T Y, Dollár P, Girshick R, He K M, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017. 2117−2125
[35]	Chen Z Y, Xu Q Q, Cong R M, Huang Q M. Global context-aware progressive aggregation network for salient object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA: AAAI, 2020. 10599−10606
[36]	Karatzas D, Shafait F, Uchida S, Iwamura M I, Bigorda L G, Mestre S R, et al. ICDAR2013 robust reading competition. In: Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington DC, USA: IEEE, 2013. 1484−1493
[37]	Yao C, Bai X, Liu W Y, Ma Y, Tu Z W. Detecting texts of arbitrary orientations in natural images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE, 2012. 1083−1090
[38]	Xue C H, Lu S J, Zhan F N. Accurate scene text detection through border semantics awareness and bootstrapping. In: Proceedings of the European Conference on Computer Vision. Munich, Germany: IEEE, 2018. 355−372
[39]	He W H, Zhang X Y, Yin F, Liu C L. Multi-oriented and multi-lingual scene text detection with direct regression. IEEE Transactions on Image Processing, 2018, 27(11): 5406−5419 doi: 10.1109/TIP.2018.2855399