Scene Restoration and Semantic Classification Network Using Depth Map and Discrete Pooling Technology
Abstract: In machine vision perception systems, robustly reconstructing a 3D scene and its semantic information from incomplete, occluded target objects is essential. Commonly used methods generally handle these two tasks separately. This paper combines them and proposes a scene restoration and semantic classification network based on depth maps and discrete pooling, which reconstructs and classifies the 3D target scene from the RGB-D information in the depth map. First, a deep convolutional neural network model running from the CPU end to the GPU end is constructed. It takes depth images sampled from the sensor as input, learns the contextual target-scene information within the camera projection area, and outputs voxel-level semantic annotations encoded with an improved truncated signed distance function (TSDF). Second, discrete pooling is used to refine the granularity of the network's pooling layers, and a semantic classification loss function with fine-grained pooling is designed to feed semantic relocalization back to the network. Finally, to strengthen the network's deep-learning capability, a 3D target-scene dataset with semantic annotations is constructed, improving the robustness of the proposed network. Experimental results show that, compared with current state-of-the-art network models, the reconstruction scale of the proposed network expands by 2.1%; the network restores missing scenes well while maintaining semantic classification accuracy.
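The paper's improved TSDF variant is not specified in this excerpt; as a reference point, the following is a minimal sketch of the standard truncated signed distance encoding that voxel-level labels are built on. The function name and truncation value are illustrative, not taken from the paper.

```python
import numpy as np

def truncated_sdf(distance, trunc=0.3):
    """Encode signed surface distances as TSDF values in [-1, 1].

    distance: signed distance from each voxel centre to the nearest
              surface (negative behind the surface, positive in front).
    trunc:    truncation band in the same units; distances beyond it
              are clamped, so only voxels near the surface carry
              graded values.
    """
    return np.clip(np.asarray(distance, dtype=float) / trunc, -1.0, 1.0)

# Voxels at various signed distances from the surface:
d = np.array([-0.6, -0.15, 0.0, 0.15, 0.6])
print(truncated_sdf(d))  # -1.0, -0.5, 0.0, 0.5, 1.0
```

Voxels far from any surface saturate at ±1, which is what makes TSDF grids a compact target for a network predicting scene completion.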
Fig. 6 Dot-product distributions of network layers with binary weights and quantized activations. (a), (b), (c) and (d) are the dot-product distributions of pooling layer 1, convolution layer 3, pooling layer 6 and convolution layer 7, respectively, each with a different mean and standard deviation; (e), (f), (g) and (h) are the corresponding dot-product error distribution curves for the same four layers.
Table 1 Comparison of our network with the L and GW networks on restoration and semantic classification performance (%)

                              L     GW    Ours-NYU  Ours-LS_3DDS  Ours-NYU+LS_3DDS
Restoration
  Recall                     59.6  66.8   57.0      55.6          69.3
  IoU                        37.8  46.4   59.1      58.2          58.6
Semantic scene restoration
  Ceiling                     0    14.2   17.1       8.8          19.1
  Floor                      15.7  65.5   92.7      85.8          94.6
  Wall                       16.7  17.1   28.4      15.6          29.7
  Window                     15.6   8.7    0         7.4          18.8
  Chair                       9.4   4.5   15.6      18.9          19.3
  Bed                        27.3  46.6   37.1      37.4          53.6
  Sofa                       22.9  25.7   38.0      28.0          47.9
  Table                       7.2   9.3   18.0      18.7          19.9
  Monitor                     7.6   7.0    9.8       7.1          12.9
  Furniture                  15.6  27.7   28.1      10.4          30.1
  Objects                     2.1   8.3   15.1       6.4          11.6
  Average                    18.3  26.8   32.0      27.6          37.3
Table 2 Comparison of our network reconstruction performance with F and Z networks (%)
                            Training set  Restoration precision  Recall  IoU
F restoration method        NYU           66.5                   69.7    50.8
Z restoration method        NYU           60.1                   46.7    34.6
Our restoration             NYU           66.3                   96.9    64.8
Our semantic restoration    NYU           75.0                   92.3    70.3
Our semantic restoration    LS_3DDS       75.0                   96.0    73.0
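Tables 1 and 2 report precision, recall and IoU over reconstructed voxels. Assuming the standard voxel-level definitions (the exact evaluation protocol is not given in this excerpt), the three scores can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def voxel_scores(pred, gt):
    """Precision, recall and IoU for binary occupancy grids.

    pred, gt: arrays of 0/1 voxel occupancy (predicted vs. ground truth).
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()          # correctly filled voxels
    precision = tp / pred.sum()                  # filled voxels that are correct
    recall = tp / gt.sum()                       # true voxels that were found
    iou = tp / np.logical_or(pred, gt).sum()     # intersection over union
    return precision, recall, iou

# Toy 6-voxel example:
pred = np.array([1, 1, 1, 0, 0, 1])
gt   = np.array([1, 1, 0, 1, 0, 1])
p, r, i = voxel_scores(pred, gt)
print(round(p, 2), round(r, 2), round(i, 2))  # 0.75 0.75 0.6
```

IoU is the strictest of the three, since it penalizes both missed and spurious voxels in a single ratio, which is why it is the headline completion metric in both tables.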