Learning Motion Guidance for Efficient Unsupervised Video Object Segmentation

ZHAO Zi-Cheng; ZHANG Kai-Hua; FAN Jia-Qing; LIU Qing-Shan

doi:10.16383/j.aas.c210626

cstr: 32138.14.j.aas.c210626

1. School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044

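Funds: Supported by the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0100400), the National Natural Science Foundation of China (61876088, U20B2065, 61532009), and the 333 High-level Talents Cultivation Project of Jiangsu Province (BRA2020291)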

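Author Bio:
ZHAO Zi-Cheng: Master student at the School of Automation, Nanjing University of Information Science and Technology. His research interests cover video object segmentation and deep learning. E-mail: 20191222013@nuist.edu.cn

ZHANG Kai-Hua: Professor at the School of Automation, Nanjing University of Information Science and Technology. His research interests cover video object segmentation and visual tracking. Corresponding author of this paper. E-mail: zhkhua@gmail.com

FAN Jia-Qing: Master student at the School of Automation, Nanjing University of Information Science and Technology. His main research interest is video object segmentation. E-mail: jqfan@nuaa.edu.cn

LIU Qing-Shan: Professor at the School of Automation, Nanjing University of Information Science and Technology. His research interests cover video content analysis and understanding. E-mail: qsliu@nuist.edu.cn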

Publication history
- Received: 2021-07-06
- Published online: 2021-11-20
- Issue date: 2023-04-20

Abstract: Many deep-learning-based unsupervised video object segmentation (UVOS) algorithms have a large number of model parameters and a high computational cost, which significantly limits their use in practice. To address this, this paper proposes a motion-guided video object segmentation network that substantially reduces the number of parameters and the amount of computation while improving segmentation performance. The model consists of three parts: a two-stream network, a motion guidance module, and a multi-scale progressive fusion module. First, the RGB image and the estimated optical flow are fed into the two-stream network to extract the appearance features and motion features of the object. Then, the motion guidance module uses local attention to extract semantic information from the motion features, which guides the appearance features toward learning rich semantic information. Finally, the multi-scale progressive fusion module takes the features output by each stage of the two-stream network and progressively merges deep features into shallow features, which improves segmentation along object edges. Extensive evaluations on three standard datasets demonstrate the superior performance of the proposed method.

Keywords:
- Unsupervised video object segmentation (UVOS)
- motion guidance
- local attention
- co-attention
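
The last stage described in the abstract, merging deep features into shallow ones step by step, can be illustrated with a minimal sketch. This is not the authors' implementation: the single-channel maps, nearest-neighbour upsampling, fusion by plain addition, and the names `upsample2x` and `progressive_fusion` are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]  # single-channel feature map (rows x cols), an assumption


def upsample2x(feat: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling of a single-channel map."""
    out: Grid = []
    for row in feat:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def progressive_fusion(stages: List[Grid]) -> Grid:
    """Fuse stage outputs from the deepest stage to the shallowest.

    `stages` is ordered shallow -> deep, and each stage is assumed to have
    half the resolution of the previous one. Fusion is plain addition here,
    which is an illustrative choice rather than the paper's operator.
    """
    fused = stages[-1]
    for shallow in reversed(stages[:-1]):
        up = upsample2x(fused)
        fused = [[s + u for s, u in zip(srow, urow)]
                 for srow, urow in zip(shallow, up)]
    return fused


if __name__ == "__main__":
    shallow = [[1.0] * 4 for _ in range(4)]   # 4x4 map
    middle = [[10.0] * 2 for _ in range(2)]   # 2x2 map
    deep = [[100.0]]                          # 1x1 map
    fused = progressive_fusion([shallow, middle, deep])
    print(len(fused), len(fused[0]), fused[0][0])
```

Each shallow position ends up carrying the contribution of every deeper stage that covers it, which is the sense in which deep semantics are "progressively" pushed toward full resolution.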

Fig. 1 Network architecture

Fig. 2 Attention structure

Fig. 3 UNet-style upsampling and the multi-scale progressive fusion module

Fig. 4 Comparison of segmentation results

Fig. 5 Segmentation results

Table 1 Comparison of floating-point operations (FLOPs) of different modules

Input size (pixels)	Co-attention module (MB)	Motion guidance module (MB)
$64 \times 64 \times 16$	10.0	2.3
$64 \times 32 \times 32$	153.1	9.0
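
The gap in Table 1 can be related to how the two kinds of attention scale. The sketch below is a rough cost model, not the authors' measurement: it assumes co-attention builds a dense affinity matrix between all pairs of positions, while local attention only compares each position with a k x k window. The function names and the multiply-accumulate counts are illustrative assumptions.

```python
def coattention_cost(num_positions: int, channels: int) -> int:
    """Multiply-accumulates for a dense N x N affinity matrix (assumed model)."""
    return num_positions * num_positions * channels


def local_attention_cost(num_positions: int, channels: int, k: int = 3) -> int:
    """Multiply-accumulates when each position attends to a k x k window."""
    return num_positions * k * k * channels


if __name__ == "__main__":
    small, large, channels = 256, 1024, 64  # hypothetical sizes: 4x more positions
    print(coattention_cost(large, channels) / coattention_cost(small, channels))
    print(local_attention_cost(large, channels) / local_attention_cost(small, channels))
    # Ratios between the two rows of Table 1, as reported:
    print(153.1 / 10.0, 9.0 / 2.3)
```

Under this model, quadrupling the number of positions multiplies the dense cost by 16 and the local cost by 4. The ratios between the two rows of Table 1 are about 15.3 for the co-attention module and about 3.9 for the motion guidance module, which is in line with that trend, although the table does not state the tensor layout, so the correspondence is only suggestive.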

Table 2 Evaluation results of different methods on the DAVIS-16 and FBMS datasets (%)

Method	DAVIS-16 $J\&F$	DAVIS-16 $J$	DAVIS-16 $F$	FBMS $J$
LMP^[25]	68.0	70.0	65.9	—
LVO^[16]	74.0	75.9	72.1	—
PDB^[14]	75.9	77.0	74.5	74.0
MBNM^[26]	79.5	80.4	78.5	73.9
AGS^[27]	78.6	79.7	77.4	—
COSNet^[10]	80.0	80.5	79.4	75.6
AGNN^[8]	79.9	80.7	79.1	—
AnDiff^[28]	81.1	81.7	80.5	—
MATNet^[17]	81.6	82.4	80.7	76.1
Ours	83.6	83.7	83.4	75.9
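
For readers unfamiliar with the columns of Table 2, the sketch below shows how the region similarity $J$ and the combined score $J\&F$ are usually computed on DAVIS: $J$ is the intersection-over-union of the predicted and ground-truth masks, and $J\&F$ is the mean of $J$ and the boundary measure $F$. The list-of-lists mask type, the function names, and the convention of returning 1.0 when both masks are empty are assumptions of this sketch; $F$ itself needs boundary matching and is not reproduced here.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = object pixel


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection-over-union between predicted and ground-truth masks."""
    inter = 0
    union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """Combined score: arithmetic mean of J and F."""
    return (j + f) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [1, 0]]
    print(region_similarity(pred, gt))  # 1 shared pixel out of 3 covered pixels
    print(j_and_f(83.7, 83.4))          # "Ours" row of Table 2
```

The mean of the reported $J$ and $F$ for the proposed method is 83.55, which matches the 83.6 in the $J\&F$ column after rounding.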

Table 3 Evaluation results of different methods on the DAVIS-16, FBMS and ViSal datasets (%)

Method	DAVIS-16 MAE	DAVIS-16 ${F_\beta}$	FBMS MAE	FBMS ${F_\beta}$	ViSal MAE	ViSal ${F_\beta}$
FCNS^[29]	5.3	72.9	10.0	73.5	4.1	87.7
FGRNE^[30]	4.3	78.6	8.3	77.9	4.0	85.0
TENet^[31]	1.9	90.4	2.6	89.7	1.4	94.9
MBNM^[26]	3.1	86.2	4.7	81.6	4.7	—
PDB^[14]	3.0	84.9	6.9	81.5	2.2	91.7
AnDiff^[28]	4.4	80.8	6.4	81.2	3.0	90.4
Ours	1.4	92.4	5.9	84.2	1.9	92.1
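
The two saliency metrics in Table 3 can be sketched as follows. MAE is the mean absolute difference between the predicted saliency map and the ground truth (lower is better), and ${F_\beta}$ is a weighted harmonic mean of precision and recall (higher is better). The weight $\beta^2 = 0.3$ is the value commonly used in salient object detection and is an assumption here, since this page does not state it; the function names are likewise illustrative.

```python
from typing import List

Map = List[List[float]]  # saliency map with values in [0, 1]


def mean_absolute_error(pred: Map, gt: Map) -> float:
    """MAE: average per-pixel absolute difference."""
    total = 0.0
    count = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            total += abs(p - g)
            count += 1
    return total / count if count else 0.0


def f_beta(precision: float, recall: float, beta_sq: float = 0.3) -> float:
    """F-measure; beta_sq = 0.3 is an assumed, commonly used setting."""
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0
    return (1.0 + beta_sq) * precision * recall / denom


if __name__ == "__main__":
    print(mean_absolute_error([[0.5, 0.5]], [[0.0, 1.0]]))
    print(f_beta(0.5, 1.0))
```

With $\beta^2 < 1$ the measure leans toward precision, so a method that over-segments is penalised more than one that misses part of the object.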

Table 4 Number of model parameters, computational cost, and inference latency of different methods

Method	COSNet^[10]	MATNet^[17]	Ours
Input size (pixels)	$473 \times 473$	$473 \times 473$	$384 \times 672$
Parameters (MB)	81.2	142.7	6.4
Computation (GB)	585.5	193.7	5.4
Latency (ms)	65	78	15
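
The size of the savings in Table 4 is easier to read as ratios. The snippet below only does arithmetic on the reported numbers; the dictionary layout and the helper name `reduction_factors` are illustrative choices.

```python
# Values copied from Table 4 (units as printed in the table).
TABLE4 = {
    "COSNet": {"params": 81.2, "compute": 585.5, "latency_ms": 65.0},
    "MATNet": {"params": 142.7, "compute": 193.7, "latency_ms": 78.0},
    "Ours": {"params": 6.4, "compute": 5.4, "latency_ms": 15.0},
}


def reduction_factors(baseline: str, ours: str = "Ours") -> dict:
    """How many times smaller or faster `ours` is than `baseline`."""
    base, new = TABLE4[baseline], TABLE4[ours]
    return {key: base[key] / new[key] for key in base}


if __name__ == "__main__":
    for name in ("COSNet", "MATNet"):
        factors = reduction_factors(name)
        print(name, {k: round(v, 1) for k, v in factors.items()})
    # Input areas differ, so the comparison is not size-normalised.
    print(473 * 473, 384 * 672)
```

Relative to MATNet this gives roughly 22 times fewer parameters, about 36 times less computation, and about 5 times lower latency. The proposed method is evaluated on a slightly larger input (258,048 pixels versus 223,729), so the ratios are not biased in its favour by input size.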

Table 5 Performance of different methods on an RTX 2080 Ti

Method	Concurrency	Frames per second	Latency (ms)
MATNet^[17]	18	16	62.40
Ours	130	161	6.21
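
The frame rates and latencies in Table 5 are consistent with the usual reciprocal relationship between the two, which a short check makes explicit. The helper name is an assumption, and the page does not say how the figures were measured.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    for method, latency_ms in (("MATNet", 62.40), ("Ours", 6.21)):
        print(method, round(fps_from_latency(latency_ms), 2))
```

A latency of 62.40 ms corresponds to about 16 frames per second and 6.21 ms to about 161, matching the reported columns.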

Table 6 Ablation experiment on the motion guidance module and the multi-scale progressive fusion module (%)

Metric	Ours	Without FG	FG
$J$	83.7	75.8	76.1
$F$	83.4	73.5	75.6

Table 7 Comparison of different kernel sizes K and numbers of stacked layers

K	Stacked layers	$J$ (%)	$F$ (%)
3	1	82.8	82.4
3	2	83.4	82.7
3	3	83.7	83.4
3	4	83.5	83.2
5	1	83.2	82.6
7	1	83.4	82.7
9	1	83.1	82.4
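
One way to read Table 7 is through the receptive field of stacked local operators. The sketch below assumes stride-1, undilated K x K windows, for which n stacked layers cover $1 + n(K - 1)$ positions per side; whether the motion guidance module matches these assumptions exactly is not stated on this page, and the function names are illustrative.

```python
def receptive_field(kernel_size: int, num_layers: int) -> int:
    """Side length covered by `num_layers` stacked stride-1 K x K windows."""
    return 1 + num_layers * (kernel_size - 1)


def window_entries(kernel_size: int, num_layers: int) -> int:
    """Window positions visited per output location, summed over layers."""
    return num_layers * kernel_size * kernel_size


if __name__ == "__main__":
    # (K, stacked layers) pairs taken from Table 7
    for k, n in [(3, 1), (3, 2), (3, 3), (3, 4), (5, 1), (7, 1), (9, 1)]:
        print(k, n, receptive_field(k, n), window_entries(k, n))
```

Under these assumptions, three stacked K = 3 layers see the same 7 x 7 neighbourhood as a single K = 7 layer while visiting 27 window entries instead of 49. In Table 7 the stacked configuration also scores higher (83.7 / 83.4 versus 83.4 / 82.7).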

References (31)

[1]	Papazoglou A, Ferrari V. Fast object segmentation in unconstrained video. In: Proceedings of the IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013. 1777−1784
[2]	HUANG Hong-Tu, BI Du-Yan, HOU Zhi-Qiang, HU Chang-Cheng, GAO Shan, ZHA Yu-Fei, KU Tao. Research of sparse representation-based visual object tracking: A survey. Acta Automatica Sinica, 2018, 44(10): 1747−1763
[3]	Wang W, Shen J, Porikli F. Saliency-aware geodesic video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015. 3395−3402
[4]	QIAN Sheng, CHEN Zong-Hai, LIN Ming-Qiang, ZHANG Chen-Bin. Saliency detection based on conditional random field and image segmentation. Acta Automatica Sinica, 2015, 41(4): 711−724
[5]	Ochs P, Brox T. Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions. In: Proceedings of the IEEE International Conference on Computer Vision. Barcelona, Spain: IEEE, 2011. 1583−1590
[6]	SU Liang-Liang, TANG Jun, LIANG Dong, WANG Nian. A video co-segmentation algorithm by means of maximizing submodular function and RRWM. Acta Automatica Sinica, 2016, 42(10): 1532−1541
[7]	Ventura C, Bellver M, Girbau A, Salvador A, Marques F, Giro-i-Nieto X. RVOS: End-to-end recurrent network for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE, 2019. 5277−5286
[8]	Wang W, Lu X, Shen J, Crandall D J, Shao L. Zero-shot video object segmentation via attentive graph neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019. 9236−9245
[9]	Chen L C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv: 1706.05587, 2017.
[10]	Lu X, Wang W, Ma C, Shen J, Shao L, Porikli F. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE, 2019. 3623−3632
[11]	Faktor A, Irani M. Video segmentation by non-local consensus voting. In: Proceedings of the British Machine Vision Conference. Nottingham, UK: 2014.
[12]	Perazzi F, Pont-Tuset J, McWilliams B, Van-Gool L, Gross M, Sorkine-Hornung A. A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016. 724−732
[13]	Xu N, Yang L J, Fan Y C, Yang J C, Yue D C, Liang Y C, et al. YouTube-VOS: Sequence-to-sequence video object segmentation. In: Proceedings of the European Conference on Computer Vision. Munich, Germany: 2018. 585−601
[14]	Song H, Wang W, Zhao S, Shen J, Lam K M. Pyramid dilated deeper ConvLSTM for video salient object detection. In: Proceedings of the European Conference on Computer Vision. Munich, Germany: 2018. 715−731
[15]	Jampani V, Gadde R, Gehler P V. Video propagation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 451−461
[16]	Tokmakov P, Alahari K, Schmid C. Learning video object segmentation with visual memory. In: Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017. 4481−4490
[17]	Zhou T, Li J, Wang S, Tao R, Shen J. MATNet: Motion-attentive transition network for zero-shot video object segmentation. IEEE Transactions on Image Processing, 2020, 29: 8326−8338
[18]	Chu X, Yang W, Ouyang W, Ma C, Yuille A L, Wang X. Multi-context attention for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 1831−1840
[19]	Chen L, Zhang H, Xiao J, Nie L, Shao J, Liu W, et al. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 5659−5667
[20]	Lu J, Yang J, Batra D, Parikh D. Hierarchical question-image co-attention for visual question answering. arXiv preprint arXiv: 1606.00061, 2016.
[21]	Wu Q, Wang P, Shen C, Reid I, Van-Den-Hengel A. Are you talking to me? Reasoned visual dialog generation through adversarial learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 6106−6115
[22]	Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L C. MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 4510−4520
[23]	Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention. Munich, Germany: 2015. 234−241
[24]	Wang W, Shen J, Shao L. Consistent video saliency using local gradient flow optimization and global refinement. IEEE Transactions on Image Processing, 2015, 24(11): 4185−4196
[25]	Tokmakov P, Alahari K, Schmid C. Learning motion patterns in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA: IEEE, 2017. 3386−3394
[26]	Li S, Seybold B, Vorobyov A, Lei X, Kuo C C J. Unsupervised video object segmentation with motion-based bilateral networks. In: Proceedings of the European Conference on Computer Vision. Munich, Germany: 2018. 207−223
[27]	Wang W, Song H, Zhao S, Shen J, Zhao S, Hoi S C, et al. Learning unsupervised video object segmentation through visual attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE, 2019. 3064−3074
[28]	Yang Z, Wang Q, Bertinetto L, Hu W, Bai S, Torr P H. Anchor diffusion for unsupervised video object segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019. 931−940
[29]	Wang W, Shen J, Shao L. Video salient object detection via fully convolutional networks. IEEE Transactions on Image Processing, 2017, 27(1): 38−49
[30]	Li G, Xie Y, Wei T, Wang K, Lin L. Flow guided recurrent neural encoder for video salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018. 3243−3252
[31]	Ren S, Han C, Yang X, Han G, He S. TENet: Triple excitation network for video salient object detection. In: Proceedings of the European Conference on Computer Vision. Glasgow, UK: 2020. 212−228