人体行为识别数据集研究进展

朱红蕾 朱昶胜 徐志刚

引用本文: 朱红蕾, 朱昶胜, 徐志刚. 人体行为识别数据集研究进展. 自动化学报, 2018, 44(6): 978-1004. doi: 10.16383/j.aas.2018.c170043
Citation: ZHU Hong-Lei, ZHU Chang-Sheng, XU Zhi-Gang. Research Advances on Human Activity Recognition Datasets. ACTA AUTOMATICA SINICA, 2018, 44(6): 978-1004. doi: 10.16383/j.aas.2018.c170043


doi: 10.16383/j.aas.2018.c170043
基金项目: 

国家自然科学基金 61563030

甘肃省自然科学基金 1610RJZA027

详细信息
    作者简介:

    朱昶胜 兰州理工大学计算机与通信学院教授.2006年获得兰州理工大学博士学位.主要研究方向为高性能计算, 数据分析与理解.E-mail:zhucs2008@163.com

    徐志刚 兰州理工大学计算机与通信学院副教授.2012年获得中国科学院研究生院博士学位.主要研究方向为计算机视觉与图像处理.E-mail:xzgcn@163.com

    通讯作者:

    朱红蕾 兰州理工大学计算机与通信学院博士研究生.2004年获得兰州理工大学硕士学位.主要研究方向为计算机视觉与模式识别.本文通信作者.E-mail:zhuhllut@139.com

Research Advances on Human Activity Recognition Datasets

Funds: 

National Natural Science Foundation of China 61563030

Natural Science Foundation of Gansu Province 1610RJZA027

More Information
    Author Bio:

    ZHU Chang-Sheng Professor at the School of Computer and Communication, Lanzhou University of Technology. He received his Ph.D. degree from Lanzhou University of Technology in 2006. His research interest covers high performance computing, data analysis and understanding.

    XU Zhi-Gang Associate professor at the School of Computer and Communication, Lanzhou University of Technology. He received his Ph.D. degree from the Graduate University of Chinese Academy of Sciences in 2012. His research interest covers computer vision and image processing.

    Corresponding author: ZHU Hong-Lei Ph.D. candidate at the School of Computer and Communication, Lanzhou University of Technology. She received her master's degree from Lanzhou University of Technology in 2004. Her research interest covers computer vision and pattern recognition. Corresponding author of this paper.
  • 摘要: 人体行为识别是计算机视觉领域的一个研究热点,具有重要理论价值和现实意义.近年来,为了评价人体行为识别方法的性能,大量的公开数据集被创建.本文系统综述了人体行为识别公开数据集的发展与前瞻:首先,对公开数据集的层次与内容进行归纳.根据数据集的数据特点和获取方式的不同,将人体行为识别的公开数据集分成4类.其次,对4类数据集分别描述,并对相应数据集的最新识别率及其研究方法进行对比与分析.然后,通过比较各数据集的信息和特征,引导研究者选取合适的基准数据集来验证其算法的性能,促进人体行为识别技术的发展.最后,给出公开数据集未来发展的趋势与人体行为识别技术的展望.
  • 视频人体姿态估计是指获取给定视频中人体各部位在每帧图像中的位置及方向等信息的过程[1], 是目前计算机视觉领域的研究热点, 在行为识别[2]、人机交互[3]、视频理解[4-5]等领域均有广泛的应用.

    近些年, 基于部件模型[6], 针对单帧图像的人体姿态估计展开了大量的研究并取得了卓有成效的进展[7-10], 然而这些方法对人体四肢, 尤其末端(手腕、肘、脚踝、膝盖)部位的估计结果还很不理想, 直接运用到视频的人体姿态估计问题上并不能得到满意的结果.针对视频中的人体姿态估计, 借助运动信息, 在人体部件模型基础上添加时序一致性约束, 将会得到更准确的估计结果.现有基于部件的视频人体姿态估计方法通常的做法是, 为每帧图像生成各个人体部件的状态候选; 然后, 构建时空概率图模型, 推理视频中每一时刻的人体姿态.根据实体在时域上的覆盖度, 目前这类方法采用的模型可以分为细粒度模型和粗粒度模型两类.

    第一类是细粒度模型.以人体部件(构成姿态序列的最小单位)为实体, 在单帧人体空域部件模型(图 1(a))基础上, 添加部件的帧间时域联系, 形成一个时空部件模型(图 1 (b)), 实体在时域上只覆盖单帧图像, 模型推理目的是为每帧图像中的人体各部件挑选符合时空约束的最优状态[11-16].由于人体姿态变化的多样性, 人的体型、穿着、视角等变化, 部件模型很难捕捉到所有的表观变化, 而时域上只引入了相邻帧间的一致性约束, 没有长时一致性的约束, 易出现部件状态估计的误差累积.另外由于模型存在环路, 无法获取精确解, 近似推理也在一定程度上进一步降低估计的精度.

    图 1  现有视频人体姿态估计方法采用的模型
    Fig. 1  The models used in video pose estimation

    第二类是粗粒度模型.以人体部件的轨迹为实体, 时空部件模型在时域上的依赖关系不复存在, 实体在时域上覆盖整个视频, 模型塌陷成为与单帧人体姿态估计相同的模型(图 1 (c)), 模型中结点表示部件的轨迹, 边表示部件轨迹间的约束关系, 此时模型推理的目的是为每个人体部件挑选一个最优的轨迹来组装成最终的姿态序列[17-18].粗粒度模型在时域上可以添加长时一致性的约束, 避免了误差累积的情况, 而且模型简化, 推理简单.然而, 为人体部件生成合理优质的轨迹本身属于跟踪问题, 对于人体四肢部位, 尤其末端部位(比如腕部、踝部), 极易出现表观的剧烈变化、遮挡、快速运动等情况, 而这些都是跟踪的典型难题.

    本文综合粗、细粒度模型的优点, 从中粒度出发, 以人体部件的轨迹片段为实体, 构建时空模型, 推理为每一人体部件选择最优的轨迹片段, 通过拼接各部件的轨迹片段形成最终的人体姿态序列估计.模型中实体覆盖若干帧, 方便添加长时的一致性约束, 降低对部件模型的敏感度.为解决对称部件易混淆的问题, 模型中添加对称部件间约束(如图 2(a)), 并从概念上将对称部件合并为一个结点(如图 2 (b)), 通过该处理消除空域模型中的环路, 同时保留对称部件间约束, 最终模型如图 2 (c)所示.

    图 2  中粒度时空模型
    Fig. 2  The medium granularity model

    环路的存在使得时空概率图模型的确切解不可得, 通常只能通过近似推理, 如循环置信度传播[11, 19]、采样[20]、变分[12]等手段来获取近似解.另外一类思路对原始环状图模型进行拆解, 用一组树状子图来近似原始图模型[13-14, 21].还有部分研究者采用分步优化的策略[15-16], 首先不考虑空间约束, 对检测最为稳定的部件(如头部)进行序列估计, 再基于该序列估计, 对其邻接部件进行优化, 该过程持续进行, 直到所有部件处理完成.本文将整个时空模型(图 6(a))拆解为一组马尔科夫随机场(图 6 (b))和隐马尔科夫模型(图 6 (c)), 分别负责空域和时域的解析, 通过迭代的时域和空域交替解析, 完成时空模型的近似推理.

    图 4  不同方法的长时运动估计对比
    Fig. 4  Long-term performances of different motion estimation approaches

    除推理算法外, 部件候选集的质量直接影响最终姿态估计的结果.直接将单帧图像的前$K$个最优姿态检测作为候选[22-23], 很难保证能够覆盖真实的姿态.为了生成更多可靠的姿态候选, 常用的一个策略是引入局部运动信息对姿态检测结果进行传播[15-16, 24-26].借助准确的运动信息, 对优质的姿态检测结果进行传播, 可以为相邻帧生成合理的姿态候选.然而当视频中存在快速运动或连续出现非常规人体姿态时, 这种策略将会失效. 1)快速运动易导致运动估计出现误差.图 3给出了一个快速运动的例子, 可以看出传统的运动估计算法(LDOF[27]、FarneBackOF[28])无法成功捕捉脚的快速运动.这使得即使在$t$帧有准确的检测, 也无法通过传播为$t+1$帧生成合理的候选. 2)当非常规姿态连续出现时, 姿态检测器会在相邻的多帧图像中连续失败, 没有好的姿态检测结果, 即使有准确的帧间运动信息, 也无法通过传播为这些帧生成好的候选.这时可借助长时运动信息将优质的检测结果传播到更远范围[29].然而, 从图 4给出的例子可以看出, 传统的运动估计几乎无法避免误差累计与漂移.针对以上问题, 本文引入全局运动信息[30-31]对姿态检测结果进行传播.全局运动信息可以给出前景长时一致的对应关系, 较好地解决了快速运动造成的障碍, 将优质的姿态检测结果稳定地传播, 为更多的帧提供有效候选.

    图 3  不同方法的短时运动估计对比
    Fig. 3  Short-term performances of different motion estimation approaches

    本文的主要贡献可以归纳如下: 1)引入全局运动信息进行姿态检测的传播, 克服局部运动信息的弊端, 为后期推理提供更合理、优质的状态候选. 2)构建中粒度模型, 有效避免细粒度模型对部件模型敏感的缺点, 同时便于添加长时的一致性约束.

    给定含有$N$帧的视频, 本文通过三个主要步骤得到最终的姿态估计结果(图 5).首先, 用姿态检测器对每帧图像进行姿态检测; 然后, 借助全局运动信息, 将每帧中的最优检测结果传播到整个视频, 从而为每个人体部件生成$N$条轨迹, 随即这些轨迹被切割成重叠的固定长度的轨迹片段, 构成每个部件的轨迹片段候选集; 最后, 通过求解中粒度时空概率图模型的优化问题, 获得符合时空一致性约束的最优轨迹片段, 拼接融合各部件的最优轨迹片段形成最终的姿态估计序列.
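    为便于把握整体流程, 下面给出上述三个步骤的一个Python骨架示意(各子步骤以可调用对象的形式传入, 函数名与接口均为示意性假设, 并非原文实现):

```python
# 示意性骨架: 基于中粒度模型的视频姿态估计整体流程
# 注: detect / propagate / split / parse / stitch 均以可调用对象传入, 为示意接口, 非原文代码

def estimate_pose_sequence(frames, detect, propagate, split, parse, stitch):
    # 1) 逐帧姿态检测, 得到每帧的姿态检测结果
    detections = [detect(f) for f in frames]

    # 2) 借助全局运动信息, 把每帧检测结果传播到整个视频,
    #    为每个部件生成 N 条轨迹, 再切割成重叠的固定长度轨迹片段候选
    tracks = [propagate(d, frames) for d in detections]
    candidates = split(tracks)

    # 3) 在中粒度时空概率图模型上推理, 为每个部件挑选最优轨迹片段, 拼接融合成姿态序列
    best_tracklets = parse(candidates, frames)
    return stitch(best_tracklets)
```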

    图 5  基于中粒度模型的视频人体姿态估计方法示意图
    Fig. 5  Overview of the video pose estimation method based on medium granularity model

    第1.1节简要介绍单帧图像中进行姿态检测的混合部件模型[7], 第1.2节描述轨迹片段候选的生成过程, 第1.3节定义中粒度时空概率图模型.

    用于单帧图像的人体姿态检测器将人体建模成一个包含$M$个结点的树状图模型$\mathcal{G}=(\mathcal{V}, \mathcal{E})$, 如图 1(a)所示, 其中结点对应人体的$M$个部件, 边表示部件间的空间约束关系.人体的姿态可表示为所有部件的状态集合: $X=\{{x_1}, {x_2}, \cdots, {x_M}\}$, 其中部件$i$的状态$x_i$由图像中的坐标位置构成.给定图像$I$, 对某一特定人体部件状态配置$X$可用式(1)进行评分:

    $ \begin{equation} \label{equ_fmp} S(I, X)=\sum\limits_{i\in \mathcal{V}}\phi (x_i, I)+\sum\limits_{(i, j)\in \mathcal{E}}\psi({x_i, x_j}) \end{equation} $

    (1)

    这里$\phi ({x_i}, I)$为部件的观测项, 用于计算部件$i$取状态${x_i}$时的图像区域特征与部件模板的匹配程度; $\psi({x_i, x_j})$为部件间的空间约束项, 用于评估两个相连人体部件$i$与$j$间的几何连接状况与人体模型对应结点间几何约束的匹配程度.所有部件的模板和部件间的几何约束模型均利用结构化SVM进行联合训练得到.

    姿态检测问题则形式化为最大化$S(I, X)$问题.本文采用文献[7]的算法进行单帧图像的姿态检测, 并采用文中方法对部件以及相连部件间空间约束进行建模, 为描述简洁, 公式中省略了部件类型相关的描述, 具体细节可参考文献[7].
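    为说明式(1)的打分方式, 下面给出一个最小化的numpy数值示例(部件数、树结构、模板响应峰值与评分函数均为随意设定的假设值, 仅用于演示一元项与二元项的累加过程):

```python
import numpy as np

# 式(1)的最小数值示例: S(I, X) = sum_i phi(x_i, I) + sum_(i,j) psi(x_i, x_j)
M = 8                                                               # 部件数(假设)
edges = [(0, 1), (1, 2), (1, 3), (1, 4), (4, 5), (4, 6), (4, 7)]    # 树状结构(假设)
peaks = {i: np.array([30.0 * i, 40.0]) for i in range(M)}           # 假设的模板响应峰值位置

def phi(i, x_i):
    """观测项: 状态越接近部件 i 的模板响应峰值, 得分越高(示意定义)."""
    return -float(np.linalg.norm(np.asarray(x_i) - peaks[i]))

def psi(x_i, x_j):
    """空间约束项: 相连部件距离越近, 得分越高(示意定义)."""
    return -float(np.sum((np.asarray(x_i) - np.asarray(x_j)) ** 2))

def score(X):
    """按式(1)对部件状态配置 X = {x_1, ..., x_M} 打分."""
    return sum(phi(i, X[i]) for i in range(M)) + sum(psi(X[i], X[j]) for i, j in edges)

X = [np.array([30.0 * i, 42.0]) for i in range(M)]                  # 一组假设的部件坐标
print(score(X))
```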

    本文采用分层弹性运动跟踪方法[31]对视频中的人体进行弹性运动跟踪[30], 获取全局运动信息, 并基于此信息, 对每帧获得的人体姿态检测结果进行传播.全局运动信息给出的是目标在视频各帧的一个全局对应关系, 给定某帧图像中一个点, 通过该对应关系可以获取该点在其他帧的对应位置.因此, 将第$t$帧的姿态检测结果$X=\{{x_1}, {x_2}, \cdots, {x_M}\}$作为参考点集, 通过全局运动信息, 可获取该点集在视频各帧中的对应位置, 由此得到各部件的一条轨迹.对所有$N$帧中的姿态检测结果实施该传播操作, 为各部件生成$N$条轨迹候选.

    在假设全局运动信息可信的前提下, $t$帧的姿态检测结果中$x_i$越准确, 传播$x_i$所生成的轨迹质量越高, 越是邻近$x_i$的轨迹片段越可靠.拼接各部件的优质轨迹片段, 将得到高精确度的姿态估计结果.基于此直观想法, 本文将所有轨迹切割成重叠的固定长度的轨迹片段, 构成各部件的轨迹片段候选, 构建以部件轨迹片段为实体的中粒度时空模型, 推理挑选出符合时空约束的最优轨迹片段.
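    检测结果沿全局对应关系传播并切割为重叠轨迹片段的过程, 可用如下示意代码说明(全局对应关系此处用恒等映射代替, 片段长度与重叠帧数取后文实验中的7帧与3帧, 均属示意, 并非原文实现):

```python
import numpy as np

def propagate_detection(det_xy, correspondences):
    """沿全局对应关系把某帧检测 det_xy (M x 2) 传播到各帧, 得到一条轨迹 (N x M x 2).
    correspondences[f]: 把参考帧坐标映射到第 f 帧坐标的函数(示意接口)."""
    return np.stack([corr(det_xy) for corr in correspondences])

def split_into_tracklets(track, length=7, overlap=3):
    """把长度为 N 的轨迹切割成重叠的固定长度轨迹片段."""
    step = length - overlap
    return [track[s:s + length] for s in range(0, len(track) - length + 1, step)]

# 最小示例: 用恒等映射代替真实的全局运动对应关系(假设)
N, M = 20, 8
correspondences = [lambda xy: xy for _ in range(N)]
track = propagate_detection(np.zeros((M, 2)), correspondences)
print(len(split_into_tracklets(track)))   # 该条轨迹被切出的重叠片段数
```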

    本文将视频的姿态估计问题建模成一个如图 6(a)所示的时空概率图模型, 其中结点对应人体部件轨迹片段, 边表示轨迹片段间的空间几何约束以及时间上相邻片段的一致性约束, 目的是为每个人体部件挑选出最优的轨迹片段.该问题可形式化为图模型的优化问题, 由于该图模型存在环, 受文献[14]启发, 本文将时空模型分解为两个树状结构子图模型:马尔科夫随机场与隐马尔科夫模型, 分别负责空域解析(如图 6 (b))和时域解析(如图 6 (c)).为保留对称部件间的约束关系, 同时剔除空域模型中的环路, 对称部件已合并, 即原始的14个关节点的人体模型简化为8结点模型, 为描述清晰, 我们用单部件和组合部件对部件进行区分, 其中单部件指头和颈两部件, 组合部件指合并的对称部件即肩、肘、腕、胯、膝以及踝等6个部件.

    图 6  时空模型分解为空域子模型和时域子模型
    Fig. 6  Sub-models of the full graphical model
    1.3.1   马尔科夫随机场

    子图模型马尔科夫随机场(图 6 (b))用于在每个视频分段内进行空域解析, 我们用$\mathcal{G} = (\mathcal{V}_T, \mathcal{ E}_T)$来表示. $T^t=\{T_i^t|_{i=1}^M\}$表示在第$t$个视频分段$V^t$中的$M$个人体部件的轨迹片段配置, 其中$T^t\in {\mathcal{T}}$, ${\mathcal{T}}$是各部件轨迹片段候选的任意组合.对特定轨迹片段配置$T^t$的评分可由下式完成:

    $ \begin{equation} \label{equ_mn} {S}_T(T^t, V^t)=\sum\limits_{i\in \mathcal{V}_T} \Phi(T_i^t, V^t)+\sum\limits_{(i, j)\in \mathcal{E}_T} \Psi({T_i^t, T_j^t}) \end{equation} $

    (2)

    其中, 一元项$\Phi(T_i^t, V^t)$计算部件轨迹片段$T_i^t$与视频片段$V^t$的兼容性, 以及在片段内时域上部件表观的一致性.当部件为组合部件时, 在该一元项中还将添加对称部件间的评测.二元项$\Psi({T_i^t, T_j^t})$评估两部件轨迹片段间的空域兼容性.

    为了能使公式描述更清晰, 用$Q$替换$V^t$, 当部件$i$为单部件, 用$S_i$替换$T_i^t$, 当部件$i$为组合部件, 用$C_i$替换$T_i^t$. $Q=\{q^f|_{f=1}^F\}$, $S_i=\{s_i^f|_{f=1}^F\}$, $C_i=\{c_i^f|_{f=1}^F\}$, $q^f$表示长度为$F$帧的视频片段$Q$中第$f$帧图像, $s_i^f$和$c_i^f$表示部件$i$的轨迹片段在第$f$帧的状态.

    对单部件, 一元项定义为

    $ \begin{equation} \label{equ_Phis} \Phi(T_i^t, V^t)=\Phi_\mathit{s}(S_i, Q) = \sum\limits_{f=1}^F \phi_d(s_i^f, q^f) + \lambda_1\phi_g(S_i) \end{equation} $

    (3)

    其中, $\phi_d(s_i^f, q^f)$综合部件$i$的表观评分$\phi_p(s_i^f, q^f)$ (式(1)中部件表观评分项)与前景覆盖度$\phi_{fg}(s_i^f, q^f)$[12], $\lambda_1$为权重因子; $\phi_g(S_i)$计算片段内部件$i$的表观时序一致性, 用部件表观特征的方差与片段内最大位移的比值来衡量, 定义为

    $ \begin{equation} \label{equ_phig} \phi_g(S_i) = -\frac{var(\Lambda(s_i^1), \Lambda(s_i^2), \cdots, \Lambda(s_i^F))}{\max\limits_{f_1, f_2}\|s_i^{f_1}-s_i^{f_2}\|_2^2} \end{equation} $

    (4)

    其中, $\Lambda(s_i^f)$为部件$i$在图像$q^f$中取状态$s_i^f$时, 所在的局部图像块归一化后的颜色直方图.
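    式(4)的一个直接实现如下(输入为片段内各帧的部件坐标与归一化颜色直方图, 其中对直方图方差的汇总方式为假设):

```python
import numpy as np

def phi_g(positions, histograms):
    """式(4): 部件表观时序一致性 = -(表观特征方差) / (片段内最大位移的平方).
    positions: (F, 2) 部件坐标序列; histograms: (F, B) 对应的归一化颜色直方图."""
    positions = np.asarray(positions, dtype=float)
    histograms = np.asarray(histograms, dtype=float)
    var_app = np.var(histograms, axis=0).sum()          # 各 bin 方差之和(汇总方式为假设)
    diff = positions[:, None, :] - positions[None, :, :]
    max_disp_sq = (diff ** 2).sum(axis=-1).max()        # 任意两帧间最大位移的平方
    return -var_app / max(max_disp_sq, 1e-6)            # 防止除零
```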

    对组合部件, 一元项定义为

    $ \begin{equation} \label{equ_Phic} \begin{split} \Phi(T_i^t, V^t)=\Phi_\mathit{c}(C_i, Q) =\, & \Phi_\mathit{s}(C_i.l, Q) + \Phi_\mathit{s}(C_i.r, Q)\, + \\ & \lambda_2\sum\limits_{f=1}^F(-\psi_{\text{color}}(c_i^f.l, c_i^f.r))\, + \\ & \lambda_3\sum\limits_{f=1}^F\psi_{\text{dist}}(c_i^f.l, c_i^f.r) \end{split} \end{equation} $

    (5)

    其中, 前两项分别为左右部件轨迹片段的表观评分, $\Phi_\mathit{s}(\cdot)$定义同式(3), $C_i.l$与$C_i.r$分别表示组合部件$i$的左右两个部分; 第3项度量对称部件间的表观一致性, 为对称部件间颜色直方图的Chi-square距离; 第4项$\psi_{\text{dist}}(\cdot)$度量对称部件间的距离; $\lambda_2$与$\lambda_3$为权重因子.评估原则为:轨迹片段的表观与部件模型越兼容, 对称部件间颜色越一致、距离越远, 得分越高.
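    式(5)中对称部件间的两项约束可按如下方式计算(颜色一致性采用直方图的Chi-square距离; 距离项的具体形式原文未给出, 此处以欧氏距离示意, 属假设):

```python
import numpy as np

def chi_square(h1, h2, eps=1e-10):
    """对称部件颜色直方图间的 Chi-square 距离(式(5)第3项)."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def psi_dist(left_xy, right_xy):
    """对称部件间的距离项(式(5)第4项): 此处以欧氏距离示意, 原文未给出具体定义."""
    return float(np.linalg.norm(np.asarray(left_xy) - np.asarray(right_xy)))
```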

    二元项$\Psi({T_i^t, T_j^t})$评估两部件轨迹片段间的空域兼容性.当邻接的两结点均为单部件时, 二元项定义为

    $ \begin{equation} \label{equ_PsiSS} \Psi({T_i^t, T_j^t})=\Psi(S_i, S_j) = \sum\limits_{f=1}^F \psi_p(s_i^f, s_j^f) \end{equation} $

    (6)

    当邻接的两结点分别对应单部件与组合部件时, 二元项定义为

    $ \begin{align} \label{equ_PsiSC} \Psi({T_i^t, T_j^t})= &\Psi(S_i, C_j)= \\ &\sum\limits_{f=1}^F( \psi_p(s_i^f, c_j^f\!.l)+\psi_p(s_i^f, c_j^f\!.r)) \end{align} $

    (7)

    当邻接的两结点均为组合部件时, 二元项定义为

    $ \begin{align} \label{equ_PsiCC} \Psi({T_i^t, T_j^t})= &\Psi(C_i, C_j)= \\ &\sum\limits_{f=1}^F( \psi_p(c_i^f\!.l, c_j^f\!.l)+\psi_p(c_i^f\!.r, c_j^f\!.r)) \end{align} $

    (8)

    其中, $\psi_p(\cdot)$评估两邻接部件的空域兼容性, 定义同式(1)中的二元项.

    1.3.2   隐马尔科夫模型

    隐马尔科夫模型负责在候选集中挑选出符合时域一致性约束的轨迹片段.我们将整个视频分割为重叠的$N$个片段, 用$V =\{V^t|_{t=1}^N\}$表示.任一部件$i$在整个视频上的轨迹片段配置用$T_i=\{T_i^t|_{t=1}^N\}$表示, 并建模成为一个马尔科夫链.时域上的轨迹片段配置$T_i$的评分函数可以定义为

    $ \begin{equation} \label{equ_hmm} {S}'_T(T_i, V)=\sum\limits_{t=1}^N \Phi'(T_i^t, V^t)+\sum\limits_{t=1}^{N\!-\!1}\Psi'(T_i^t, T_i^{t+1}) \end{equation} $

    (9)

    其中, 一元项$\Phi'(T_i^t, V^t)$评估轨迹片段$T_i^t$的表观评分以及与$i$结点的双亲结点$pa(i)$的空域兼容性, 具体定义为

    $ \begin{equation} \label{equ_PHI_HMM} \Phi'(T_i^t, V^t) = \Phi(T_i^t, V^t)+ \Psi({T_i^t, T_{pa(i)}^t}) \end{equation} $

    (10)

    其中, $\Phi(\cdot)$, $\Psi(\cdot)$定义同式(2).二元项$\Psi'(T_i^t, T_i^{t+1})$评估两邻接轨迹片段的时序一致性, 本文利用轨迹片段重叠部分的距离来计算, 假设两邻接轨迹片段分别为$A$和$B$, 重叠$m$帧, 则我们用重叠部分对应帧状态间的距离来计算$A$与$B$之间的距离.对单部件结点, 二元项定义为

    $ \begin{equation} \label{equ_PsiHMM} \Psi'(A, B) = -\lambda_4\|A - B\|_2^2 \end{equation} $

    (11)

    对组合部件结点, 二元项定义为

    $ \Psi '(A,B) = - {\lambda _5}{\left( {\frac{{\parallel A.l - B.l{\parallel _2} + \parallel A.r - B.r{\parallel _2}}}{2}} \right)^2} $

    (12)

    其中, $\lambda_4$与$\lambda_5$为权重因子.
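    式(11)与式(12)的计算可示意如下(输入为两邻接轨迹片段在重叠$m$帧上的坐标, 权重因子取默认值仅作演示):

```python
import numpy as np

def temporal_consistency_single(A_overlap, B_overlap, lam4=1.0):
    """式(11): 单部件邻接轨迹片段的时序一致性, 输入为重叠 m 帧的坐标 (m x 2)."""
    d = np.linalg.norm(np.asarray(A_overlap, float) - np.asarray(B_overlap, float))
    return -lam4 * d ** 2

def temporal_consistency_pair(A_l, A_r, B_l, B_r, lam5=1.0):
    """式(12): 组合部件取左右两部分重叠段距离的平均后再取平方."""
    d_l = np.linalg.norm(np.asarray(A_l, float) - np.asarray(B_l, float))
    d_r = np.linalg.norm(np.asarray(A_r, float) - np.asarray(B_r, float))
    return -lam5 * ((d_l + d_r) / 2.0) ** 2
```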

    给定所有人体部件在每一视频片段的轨迹片段候选, 模型推理的目标是挑选符合时空约束的最优轨迹片段, 即获取轨迹片段的最优配置.我们通过迭代的时空解析来实现.通过空域子模型上的解析, 计算出所有轨迹片段的得分, 筛选高分轨迹片段候选, 构成时域解析的输入状态空间.依据时域解析结果再次对候选进行筛选, 作为空域解析的输入进入下一次迭代.迭代解析过程从空域开始, 原始输入状态空间由切割轨迹获取的轨迹片段构成.随着交替解析的进行, 逐步缩减状态候选数量, 一直到最后挑选出最优结果.最终的姿态序列由最优轨迹片段拼接融合得到.

    在空域解析阶段, 在第$t$个视频片段, 为部件$i$选择轨迹片段候选$a$的评分定义为

    $ \begin{equation} \mathcal{M}_{\mathcal{T}}(T_i^t, a) = \max\limits_{T^t\in\mathcal{T}:T_i^t=a}{S}_T(T^t, V^t) \end{equation} $

    (13)

    由于空域子模型是树状结构, 所有部件轨迹片段候选的评分可以通过消息传递算法求得.从部件$i$到其邻接部件$j$的消息定义为

    $ \begin{equation} \label{equ_msg_space} m_{i\rightarrow j}( T_j^t) \propto \max\limits_ {T_i^t}(m_i(T_i^t)+ \Psi({T_i^t, T_j^t})) \end{equation} $

    (14)

    $ \begin{equation} \label{equ_belief_space} m_i(T_i^t) \propto \Phi(T_i^t, V^t) +\sum\limits_{k \in N\!b\!d(i)\backslash j} m_{k\rightarrow i}( T_i^t) \end{equation} $

    (15)

    由此, 部件$i$的轨迹片段$T_i^t$的评分可依据以下定义计算:

    $ \begin{equation} b(T_i^t) = \Phi(T_i^t, V^t) + \sum\limits_{k \in N\!b\!d(i)} m_{k\rightarrow i}( T_i^t) \end{equation} $

    (16)

    消息从叶子传递到根, 再由根传递回叶子, 一个循环即可求得所有轨迹片段的评分.
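    式(14) $\sim$ 式(16)所述的树上消息传递可用如下通用实现表达(以各部件候选的一元评分向量与边上的二元评分矩阵为输入, 树结构与候选数均为示意):

```python
import numpy as np

def tree_max_product(unary, pairwise, edges, root=0):
    """树状空域模型上的 max-product 消息传递(式(14)~(16)).
    unary: {i: (K_i,) 评分向量}; pairwise: {(i, j): (K_i, K_j) 评分矩阵}; edges: 无向树边."""
    nbrs = {i: [] for i in unary}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    order, parent, stack = [], {root: None}, [root]
    while stack:                      # 从根出发确定遍历次序(父结点先于子结点)
        u = stack.pop()
        order.append(u)
        for v in nbrs[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)

    def pw(i, j):
        return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T

    msgs = {}
    for u in reversed(order):         # 叶 -> 根: 式(14)、(15)
        if parent[u] is None:
            continue
        m = unary[u] + sum(msgs[(c, u)] for c in nbrs[u] if c != parent[u])
        msgs[(u, parent[u])] = (m[:, None] + pw(u, parent[u])).max(axis=0)
    for u in order:                   # 根 -> 叶: 再传一遍消息
        for v in nbrs[u]:
            if v == parent[u]:
                continue
            m = unary[u] + sum(msgs[(c, u)] for c in nbrs[u] if c != v)
            msgs[(u, v)] = (m[:, None] + pw(u, v)).max(axis=0)
    # 式(16): b(T_i) = 一元项 + 所有邻居传入的消息
    return {i: unary[i] + sum(msgs[(c, i)] for c in nbrs[i]) for i in unary}

# 一个两结点的小例子(候选数为3, 评分为假设值)
u = {0: np.array([0.1, 0.5, 0.2]), 1: np.array([0.3, 0.0, 0.4])}
p = {(0, 1): np.zeros((3, 3))}
print(tree_max_product(u, p, edges=[(0, 1)]))
```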

    在时域解析阶段, 由于子模型为链状结构, 所有轨迹片段的评分同样可通过消息在马尔科夫链上的一个循环传递完成.时域模型上从$t$片段向$t+1$片段传递的消息定义为

    $ \begin{equation} \label{equ_msg_time} m_{t\rightarrow {t\!+\!1}}( T_i^{t+1}) \propto \max\limits_ {T_i^t}(m_i(T_i^t)+ \Psi'(T_i^t, T_i^{t+1})) \end{equation} $

    (17)

    $ \begin{equation} m_i(T_i^t) \propto \Phi'(T_i^t, V^t) + m_{{t\!-\!1}\rightarrow t}( T_i^t) \end{equation} $

    (18)

    时序反向传递的消息定义类似, 由此, 部件$i$的轨迹片段$T_i^t$的评分可定义如下:

    $ \begin{equation} \label{equ_belief_time} b(T_i^t) = \Phi'(T_i^t, V^t\!)\!+m_{{t\!-\!1}\rightarrow t}( T_i^t)+ m_{{t\!+\!1}\rightarrow t}( T_i^t) \end{equation} $

    (19)

    其中, $\Phi'(T_i^t, V^t)$(式(10))涉及结点$i$与其双亲结点间的空域兼容性评估, 因此我们采用分步处理的策略来进行各部件的时域解析.时域解析过程从空域模型的根结点(头部)开始:由于头部是检测最为稳定的部件, 首先对其进行独立的时域解析; 然后基于头部的解析结果, 计算其子结点的空域兼容项得分并对其进行时域解析; 这个过程按空域模型结构逐层进行, 直到所有叶子结点推理完成.
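    时域解析对应马尔科夫链上的前向与后向消息传递(式(17) $\sim$ 式(19)), 其计算可示意如下(输入为各视频分段的候选评分与邻接分段间的一致性评分矩阵, 数据组织方式为假设):

```python
import numpy as np

def chain_max_product(unary, pairwise):
    """马尔科夫链(时域子模型)上的消息传递(式(17)~(19)).
    unary: (N, K) 各视频分段候选的评分 Phi'; pairwise: (N-1, K, K) 邻接分段间的一致性评分 Psi'."""
    unary = np.asarray(unary, dtype=float)
    pairwise = np.asarray(pairwise, dtype=float)
    N, K = unary.shape
    fwd = np.zeros((N, K))                       # m_{t-1 -> t}
    bwd = np.zeros((N, K))                       # m_{t+1 -> t}
    for t in range(1, N):                        # 时序正向传递(式(17)、(18))
        m = unary[t - 1] + fwd[t - 1]
        fwd[t] = (m[:, None] + pairwise[t - 1]).max(axis=0)
    for t in range(N - 2, -1, -1):               # 时序反向传递(定义类似)
        m = unary[t + 1] + bwd[t + 1]
        bwd[t] = (m[:, None] + pairwise[t].T).max(axis=0)
    return unary + fwd + bwd                     # 式(19)的评分 b(T_i^t)

# 每个分段的最优轨迹片段可按返回评分取 argmax, 例如:
# best = chain_max_product(u, p).argmax(axis=1)
```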

    迭代时空解析的算法如算法1所示.

    算法1.   迭代时空解析算法

    输入: $\{T_i^t|_{t=1}^N, _{i=1}^M\}$, $\{V^t|_{t=1}^N\}$

    输出: $\{\hat{T}_i^t|_{t=1}^N, _{i=1}^M\}$

    WHILE迭代次数$ <$最大迭代次数

    $//$空域解析

    FOR $t = 1$ TO $N$ DO

       FOR $i$ =叶子 TO根 DO

        依据式(14)计算消息;

      END FOR

      FOR $i$ =根 TO叶子 DO

        依据式(14)计算消息$m_{i\rightarrow j}( T_j^t)$;

      END FOR

      FOR $i$ =根 TO叶子 DO

        依据式(16)计算轨迹片段的评分$b(T_i^t)$;

        依据$b(T_i^t)$从大到小排序, 按比例$P$筛选轨迹片段候选;

        END FOR

      END FOR

      $//$时域解析

      FOR $i$ =根 TO叶子 DO

        FOR $t = 1$ TO $N-1$ DO

        依据式(17)计算消息;

      END FOR

      FOR $t = N$ TO 2 DO

        依据式(17)计算消息;

      END FOR

      FOR $t = 1$ TO $N$ DO

        依据式(19)计算轨迹片段的评分$b(T_i^t)$;

        依据$b(T_i^t)$从大到小排序, 按比例$P$筛选轨迹片段候选;

        END FOR

      END FOR

    END WHILE

    $\hat{T}_i^t = \arg\max\limits_{{T}_i^t}(b(T_i^t))$.

    本文在三个视频数据集上进行了实验.

    UnusualPose视频数据集[12]:该视频集包含4段视频, 存在大量的非常规人体姿态以及快速运动.

    FYDP视频数据集[29]:由20个舞蹈视频构成, 除个别视频外, 大部分运动比较平滑.

    Sub_Nbest视频数据集[22]:为方便与其他方法对比, 本文按照对比算法中的挑选方法, 只选用了文献[22]中给出的Walkstraight和Baseball两个视频.

    本文采用目前常用的两个评价机制对实验结果进行分析.

    PCK (Percentage of correct keypoints)[7]: PCK给出正确估计关键点(关节点部件的坐标位置)的百分比, 这里的关键点通常指人体的关节点(如头、颈、肩、肘、腕、胯、膝、踝).当一个关键点的估计位置与真值位置的距离不超过$\alpha\cdot\max(h, w)$像素时, 其估计被认为是准确的, 这里的$h$, $w$分别是人体目标边界框的高和宽, $\alpha$用于控制正确性判断的阈值.边界框由人体关节点真值的最紧外包矩形框界定, 根据姿态估计对象为整个人体或上半身人体, $\alpha$值设为0.1或0.2.
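    PCK的计算可用如下代码示意(阈值取$\alpha\cdot\max(h, w)$, 与上文描述一致; 输入数据的组织方式为假设):

```python
import numpy as np

def pck(pred, gt, alpha=0.2):
    """PCK: 正确估计关键点的百分比. pred、gt 为 (N, J, 2) 的关节点坐标序列,
    阈值取 alpha * max(h, w), h、w 为该帧真值关节点最紧外包框的高与宽."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    correct, total = 0, 0
    for p, g in zip(pred, gt):
        h = g[:, 1].max() - g[:, 1].min()
        w = g[:, 0].max() - g[:, 0].min()
        thr = alpha * max(h, w)
        correct += int((np.linalg.norm(p - g, axis=1) <= thr).sum())
        total += g.shape[0]
    return 100.0 * correct / total
```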

    PCP (Percentage of correct limb parts)[11]: PCP是目前应用非常广泛的姿态估计评价机制, 它计算的是人体部件的正确估计百分比.与关节点不同, 这里的人体部件是指两相邻关节点连接所对应的人体部位(比如上臂、前臂、大腿、小腿、躯干、头部).当一个人体部件两端估计的关节点与对应真值位置的距离均不超过该部件真值长度的50%时, 该部件的估计被认为是正确的.

    实验中, 视频分段的长度为7帧, 邻接片段重叠3帧, 模型推理通过一次迭代完成.通过表观评分挑选前20个轨迹片段, 构成空域解析的初始状态空间; 经空域推理为每个部件挑选最优的3个假设, 作为时域解析的输入; 再通过时域推理得到最优估计, 最终的姿态序列由轨迹片段拼接融合得到.

    本文提出的人体姿态估计方法, 主要包括三个关键处理策略: 1)采用全局运动信息对姿态检测结果进行传播; 2)构建中粒度模型, 以部件轨迹片段为推理实体; 3)对称部件合并, 以简化空域模型结构同时保留对称部件间约束.为验证这三个关键处理策略的有效性, 本文设置了4组对比实验, 每组实验改变其中一个处理策略, 实验的设置如下.

    实验1.   用局部运动信息对姿态检测结果进行长时传播, 构建中粒度模型, 模型中添加对称部件间约束.

    实验2.   用全局运动信息对姿态检测结果进行长时传播, 构建小粒度模型, 推理每帧中每一部件的状态, 模型中添加对称部件间约束.

    实验3.   用全局运动信息对姿态检测结果进行长时传播, 构建大粒度模型, 推理每一部件的轨迹状态, 模型中添加对称部件间约束.

    实验4.   用全局运动信息对姿态检测结果进行长时传播, 构建中粒度模型, 模型中只保留连接部件间空间约束关系, 不添加对称部件间约束.

    所有算法在UnusualPose视频数据集上进行了对比, 结果如图 7所示, 其中"局部运动信息"、"细粒度模型"、"粗粒度模型"和"无对称"分别对应实验1 $\sim$ 4.可以看出, 对本文方法的三个关键处理策略的替换, 都导致了估计精度不同程度的下降.综合来看, 本文方法的三个处理策略有效提高了视频中姿态估计的准确率.

    图 7  算法关键策略有效性测试结果
    Fig. 7  Examination of key modules

    本文与Nbest[22]、UVA[29]、SYM[15]、HPEV[18]以及PE_GM[12]共5个视频人体姿态估计方法进行了实验对比.由于SYM与HPEV方法的代码无法获取, 在UnusualPose视频数据集上, 本文只对比了Nbest、UVA和PE_GM三种方法.在FYDP视频集和Sub_Nbest视频集上, 我们直接引用文献中提供的数据结果进行对比.

    人体四肢在表达姿态中起着至关重要的作用, 也是在评估姿态估计算法性能时最为关注的地方.由表 1可以看出, 在UnusualPose视频集上, 对比其他视频姿态估计方法, 本文方法在四肢关节点上的PCK精度最高, 体现了本文方法在应对非常规人体姿态和快速运动等难题时的优势.从表 2可以看出, 在FYDP数据集上, 本文方法得到了最高的平均PCK得分.表 3显示本文方法在Sub_Nbest视频集上的PCP值与PE_GM方法综合性能相当, 均优于其他视频姿态估计方法, 需要注意的是PE_GM方法采用与本文相同的全局运动信息对检测结果进行传播, 候选质量与本文相同, 进一步证明采用全局运动信息对姿态检测结果进行传播的有效性.不同在于PE_GM方法采用细粒度模型, 通过选取姿态检测结果最优的关键帧启动其推理过程, 其最终的检测结果高度依赖其选取的启动帧, 而本文方法无需选取启动帧, 不受初始选取的限制.综合来看, 本文提出的算法具有一定的优越性.

    表 1  UnusualPose视频集上的PCK评分对比
    Table 1  PCK on UnusualPose dataset
    Method  Head  Shld.  Elbow  Wrist  Hip  Knee  Ankle  Avg
    Nbest  99.8  99.4  76.2  65.0  87.8  70.8  71.5  81.5
    UVA  99.4  93.8  72.7  56.2  89.3  66.3  62.4  77.2
    PE_GM  98.7  98.3  89.9  73.8  91.0  76.4  88.9  88.1
    Ours  98.7  98.1  90.1  75.1  95.9  88.4  89.5  90.8
    表 2  FYDP视频集上的PCK评分对比
    Table 2  PCK on FYDP dataset
    Method  Head  Shld.  Elbow  Wrist  Hip  Knee  Ankle  Avg
    Nbest  95.7  89.7  75.2  59.1  83.3  81.4  79.5  80.6
    UVA  96.2  91.7  78.4  60.3  85.4  83.8  79.2  82.1
    PE_GM  98.4  89.2  80.9  60.5  84.4  89.3  83.7  83.8
    Ours  97.9  93.4  84  63.1  88.4  88.9  84.4  85.7
    表 3  Sub_Nbest视频集上的PCP评分对比
    Table 3  PCP on Sub_Nbest dataset
    Method  Head  Torso  U.A.  L.A.  U.L.  L.L.
    Nbest  100  61.0  66.0  41.0  86.0  84.0
    SYM  100  69.0  85.0  42.0  91.0  89.0
    PE_GM  100  97.9  97.9  67.0  94.7  86.2
    HPEV  100  100  93.0  65.0  92.0  94.0
    Ours  100  98.1  96.6  58.6  95.1  94.8

    除了以上定量实验结果外, 我们还在图 8中展示了不同方法在UnusualPose视频集上的姿态估计结果.我们为每段视频选取一帧, 并用骨架结构展示姿态估计的结果, 相比较可以看出, 本文给出的姿态估计结果更符合真实的人体姿态.图 9和图 10分别展示了本文方法在FYDP视频集和Sub_Nbest视频集上的部分姿态估计结果.

    图 8  UnusualPose数据集上的实验结果对比
    Fig. 8  Qualitative comparison on UnusualPose dataset
    图 9  FYDP数据集上的实验结果
    Fig. 9  Sample results on FYDP dataset
    图 10  Sub_Nbest数据集上的实验结果
    Fig. 10  Sample results on Sub_Nbest dataset

    本文提出了一种用于视频人体姿态估计的中粒度模型, 该模型以人体部件的轨迹片段为实体构建时空模型, 采用迭代的时域和空域解析进行模型推理, 目标是为各人体部件挑选最优的轨迹片段, 以拼接组成最终的人体姿态序列.为生成高质量的轨迹片段候选, 本文借助全局运动信息对姿态检测结果进行时域传播, 克服了局部运动信息的不足.为解决对称部件易混淆的问题, 模型中添加对称部件间约束, 提高了对称部件的检测准确率.算法有效性分析实验表明, 采用中粒度模型、通过全局运动信息进行姿态传播以及在对称部件间添加约束这三个策略, 均对姿态估计准确率的提高有不同程度的贡献.与其他主流视频姿态估计方法在三个数据集上的对比实验结果显示了本文方法的优势.


  • 本文责任编委 桑农
  • 图  1  KTH数据集示例图[19]

    Fig.  1  Sample images of KTH dataset[19]

    图  2  Weizmann数据集示例图及其剪影[24]

    Fig.  2  Sample images and silhouettes of Weizmann dataset[24]

    图  3  Hollywood 2数据集示例图[48]

    Fig.  3  Sample images of Hollywood 2 Dataset[48]

    图  4  UCF Sports数据集示例图[50]

    Fig.  4  Sample images of UCF Sports Dataset[50]

    图  5  UCF YouTube数据集示例图[30]

    Fig.  5  Sample images of UCF YouTube Dataset[30]

    图  6  Olympic Sports数据集示例图

    Fig.  6  Sample images of Olympic Sports Dataset

    图  7  HMDB51数据集示例图

    Fig.  7  Sample images of HMDB51 dataset

    图  8  UCF50数据集示例图[33]

    Fig.  8  Sample images of UCF50 dataset[33]

    图  9  UCF101数据集示例图

    Fig.  9  Sample images of UCF101 dataset

    图  10  IXMAS数据集同一动作的5个视角及其剪影示例图

    Fig.  10  Sample images and the corresponding silhouettes for the same action of IXMAS dataset (5 cameras)

    图  11  8个摄像机配置的顶视图[69]

    Fig.  11  The top view of the configuration of 8 cameras[69]

    图  12  MuHAVi数据集的8个视角示例图[69]

    Fig.  12  Sample images of MuHAVi dataset (8 cameras)[69]

    图  13  MuHAVi-Mas数据集的2个视角剪影示例图[69]

    Fig.  13  Sample silhouette images of MuHAVi-MAS dataset (2 cameras)[69]

    图  14  8个摄像机位置和方向的平面图[70]

    Fig.  14  Plan view showing the location and direction of the 8 cameras[70]

    图  15  PETS 2009基准数据集示例图[70]

    Fig.  15  Sample images of PETS 2009 benchmark dataset[70]

    图  16  卡车车载摄像头位置及覆盖范围[71]

    Fig.  16  The on-board camera configuration and coverage[71]

    图  17  停放车辆周围的三种不同行为[91]

    Fig.  17  Three different kinds of behavior recorded around a parked vehicle[91]

    图  18  WARD数据库示例图[93]

    Fig.  18  Sample images of WARD database[93]

    图  19  CMU Mocap数据集示例图

    Fig.  19  Sample images of CMU Mocap dataset

    图  20  Microsoft Kinect相机示例图

    Fig.  20  Sample images of Microsoft Kinect camera

    图  21  MSR Action 3D数据集的深度序列图[95]

    Fig.  21  The sequences of depth maps of MSR Action 3D dataset[95]

    图  22  MSR Daily Activity 3D数据集示例图

    Fig.  22  Sample images of MSR Daily Activity 3D dataset

    图  23  UCF Kinect数据集的骨架示例图[97]

    Fig.  23  Sample skeleton images of UCF Kinect dataset[97]

    图  24  N-UCLA Multiview Action3D数据集示例图

    Fig.  24  Sample images of N-UCLA Multiview Action3D dataset

    图  25  Multiview Action3D的视角分布[104]

    Fig.  25  The view distribution of Multiview Action3D dataset[104]

    图  26  可穿戴惯性传感器及其位置示例图[105]

    Fig.  26  Sample images of the wearable inertial sensor and its placements[105]

    图  27  左臂向右滑行为的多模态数据示例图

    Fig.  27  Sample images of the multimodality data corresponding to the action left arm swipe to the right

    图  28  NTU RGB+D数据集的红外示例图

    Fig.  28  Sample infrared images of NTU RGB+D dataset

    图  29  25个骨架点示意图[106]

    Fig.  29  Configuration of 25 body joints[106]

    表  1  通用数据集的最新研究成果概览表

    Table  1  Summary of state-of-the-art research results on general datasets

    数据集名称  最新识别率  年份  研究方法  评价方案
    KTH  98.83 %[23]  2016  MLDF  CS: Tr: 16; Te: 9
    KTH  98.67 %[22]  2016  Semantic context feature-tree (MKL)  CS: Tr: 16; Te: 9
    KTH  98.5 %[43]  2015  Local region tracking (HBRT/VOC)  CS: Tr: 16; Te: 9
    Weizmann  100 %[44]  2017  3D-TCCHOGAC+3D-HOOFGAC  LOOCV
    Weizmann  100 %[45]  2016  $\Re$ transform + LLE (SVM)  LOOCV
    Weizmann  100 %[46]  2016  SDEG + $\Re$ transform  LOOCV
    Weizmann  100 %[47]  2014  3D cuboids + mid-level feature (RF)  LOSOCV
    Weizmann  100 %[25]  2008  Metric learning  LOSOCV
    Weizmann  100 %[26]  2008  Mid-level motion features  LOOCV
    *Tr: training set; Te: test set; CS: cross-subject; LOOCV: leave-one-out cross validation; LOSOCV: leave-one-subject-out cross validation

    表  2  真实场景数据集的最新研究成果概览表

    Table  2  Summary of state-of-the-art research results on real scene datasets

    数据集名称  最新识别率  年份  研究方法  评价方案
    Hollywood  62 %[37]  2012  Asymmetric motions (BoW)  Tr: 219 videos; Te: 211 videos
    Hollywood  59.9 %[36]  2015  DFW (BoW)  Tr: 219 videos; Te: 211 videos
    Hollywood  56.51 %[76]  2016  STG-MIL  Tr: 219 videos; Te: 211 videos
    Hollywood 2  78.6 %[41]  2017  EPT + DT + VideoDarwin (TCNN)  Tr: 823 videos; Te: 884 videos
    Hollywood 2  78.5 %[40]  2017  HC-MTL + L/S Reg  Tr: 823 videos; Te: 884 videos
    Hollywood 2  76.7 %[38]  2016  HRP + iDT (VGG-16)  Tr: 823 videos; Te: 884 videos
    UCF Sports  96.2 %[43]  2015  Local region tracking (HBRT/VOC)  all classes
    UCF Sports  96 %[44]  2017  3D-TCCHOGAC + 3D-HOOFGAC  LOOCV
    UCF Sports  95.50 %[47]  2014  3D cuboids + mid-level feature (RF)  LOOCV
    UCF YouTube  94.50 %[53]  2016  HboW  LOOCV
    UCF YouTube  94.4 %[52]  2016  CNRF (CNN)  LOVOCV
    UCF YouTube  93.77 %[51]  2014  FV + SFV  LOGOCV
    Olympic Sports  96.60 %[55]  2016  VLAD$^3$ + iDT (CNN)  each class video: Tr: 40; Te: 10
    Olympic Sports  96.5 %[54]  2015  iDT + HD (multi-layer FV)  not mentioned
    Olympic Sports  93.6 %[77]  2017  Bag-of-Sequencelets  Tr: 649 videos; Te: 134 videos
    HMDB51  73.6 %[58]  2016  scene + motion (DCNN)  three train/test splits
    HMDB51  69.40 %[57]  2016  TSN (TCNN)  three train/test splits
    HMDB51  69.2 %[56]  2016  spatiotemporal fusion (TCNN)  three train/test splits
    UCF50  99.98 %[61]  2016  GA (CNN)  5-fold cross-validation
    UCF50  94.4 %[60]  2015  MIFS  LOGOCV
    UCF50  94.1 %[78]  2013  weighted SVM  5-fold LOGOCV
    UCF101  94.20 %[57]  2016  TSN (TCNN)  three train/test splits
    UCF101  94.08 %[62]  2016  RNN-FV (C3D + VGG-CCA) + iDT  three train/test splits
    UCF101  93.5 %[56]  2016  spatiotemporal fusion (TCNN)  three train/test splits
    THUMOS'15  80.8 %[55]  2016  VLAD$^3$ + iDT (CNN)  5-fold cross-validation
    THUMOS'15  76.8 %[55]  2016  VLAD$^3$ (CNN)  5-fold cross-validation
    THUMOS'15  74.6 %[66]  2015  VLAD + LCD (VGG-16)  5-fold cross-validation
    THUMOS'15  70.0 %[79]  2015  Stream Fusion + Linear SVM (VGG-19)  Tr: UCF101 dataset; Te: val15
    THUMOS'15  65.5 %[80]  2015  iDT + LCD + VLAD (VGG-16)  Tr: UCF101 dataset; Vs: val15; Te: UCF101 dataset + val15
    Sports-1M (Hit$@$1)  75.9 %[67]  2016  RLSTM-g3 (GoogLeNet)  not mentioned
    Sports-1M (Hit$@$1)  73.4 %[67]  2016  RLSTM-g1 (GoogLeNet)  not mentioned
    Sports-1M (Hit$@$1)  73.10 %[81]  2015  LSTM on Raw Frames, LSTM on Optical Flow (GoogLeNet)  1.1 million videos
    *LOVOCV: leave-one-video-out cross validation; LOGOCV: leave-one-group-out cross validation; Vs: validation set

    表  3  多视角数据集的最新研究成果概览表

    Table  3  Summary of state-of-the-art research results on multi-view datasets

    数据集名称最新识别率年份研究方法评价方案备注
    IXMAS91.6 %[72]2015epipolar geometrynot mentioned5种行为
    (单视角)92.7 %[73]2016multi-view transition HMMLOSOCV11种行为
    IXMAS95.54 %[75]2014MMM-SVMTr: one camera's data11种行为; 5个视角
    (多视角)95.3 %[74]2016Cuboid + supervised dictionary learningLOAOCV; CV11种行为; 5个视角
    95.1 %[74]2016STIP + supervised dictionary learningLOAOCV; CV11种行为; 5个视角
    95.54 %[75]2014MMM-SVMTr: one camera's data11种行为; 4个视角
    Ts: LOSOCV
    94.7 %[40]2017HC-MTL + L/S RegLOSOCV11种行为; 4个视角
    93.7 %[92]2017eLR ConvNet(TCNN)LOSOCV12种行为; 5个视角
    85.8 %[46]2016SDEG + $\Re$ transformLOOCV13种行为; 5个视角
    MuHAVi97.48 %[83]2012Visual + Correlation (LKSSVM)LOOCV4个视角
    92.1 %[82]2014sectorial extreme points (HMM)LOSOCV4个视角
    91.6 %[84]2016CMS + multilayer descriptor (Multiclass K-NN)LOOCV8个视角
    MuHAVi-1498.53 %[86]2014Pose dictionary learning + maxpoolingLOOCV
    98.5 %[85]2013radial summary feature + Feature Subsetleave-one-sequence-out
    Selection
    95.6 %[84]2016CMS + multilayer descriptor(Multiclass K-NN)LOOCV
    94.12 %[88]2014CMS (K-NN)multi-training
    MuHAVi-8100 %[84]2016CMS + multilayer descriptor (Multiclass K-NN)LOOCV
    100 %[88]2014CMS (K-NN)multi-training
    100 %[87]2014radial silhouette-based feature (multiview learing)leave-one-sequence-out
    100 %[85]2013radial summary feature + Feature Subsetleave-one-sequence-out
    SelectionLOSOCV
    *CV: cross-view

    表  4  MSR Action 3D数据集的子集

    Table  4  The subsets of MSR Action 3D dataset

    数据子集  包含行为类别
    AS$_{1}$  a02、a03、a05、a06、a10、a13、a18、a20
    AS$_{2}$  a01、a04、a07、a08、a09、a11、a14、a12
    AS$_{3}$  a06、a14、a15、a16、a17、a18、a19、a20

    表  5  特殊数据集的最新研究成果概览表

    Table  5  Summary of state-of-the-art research results on special datasets

    数据集名称最新识别率年份研究方法评价方案备注
    WARD99.02 %[100]2015PCA+RLDA (SVM)CS: Tr: 15; Te: 5
    98.78 %[99]2012GDA+RVM+WLOGP3-fold cross-validation
    97.5 %[122]2017FDA (SVM)20-fold cross-validation10种行为
    近100 %[101]2016SCN (1-NN)CS5种行为
    200个样本
    CMU Mocap98.27 %[123]2010HGPLVM3-fold cross-validation5种行为
    98.13 %[124]20143D joint position features+Actionletnot mentioned5种行为
    Ensemble
    98.6 %[102]2015DisCoSet (SVM)All12种行为
    164个样本
    99.6 %[103]2014TSVQ (Pose-Histogram SVM)5-fold cross-validation30种行为
    278个样本
    MSR Action 3D100 %[108]2015DMM-LBP-FF/DMM-LBP-DFTr: 2/3; Te: 1/3
    (AS$_1$、AS$_2$和AS$_3$)98.9 %[107]2013DL-GSGCTr: 2/3; Te: 1/3
    98.9 %[107]2013DL-GSGCTr: 1/3; Te: 2/3
    98.7 %[108]2015DMM-LBP-FFTr: 1/3; Te: 2/3
    96.7 %[107]2013DL-GSGCCS
    96.1 %[125]20163D skeleton+two-level hierarchicalCS
    framework
    96.0 %[111]2017Coarse DS+Sparse coding (RDF)CS
    MSR Action 3D100 %[110]2015HDMM+3ConvNetsTr:奇数; Te:偶数
    (cross-subject)98.2 %[109]2015TriViews+ PFATr:奇数; Te:偶数
    98.2 %[126]2015Decision-Level Fusion (SUM Rule)Tr: 2/3/5/7/9;
    Te: 1/4/6/8/10
    96.7 %[107]2013DL-GSGC+TPMTr:奇数; Te:偶数
    MSR Daily Activity 3D97.5 %[111]2017Coarse DS+Sparse coding (RDF)not mentioned
    97.5 %[112]2016DSSCA+SSLMCS
    95.0 %[107]2013DL-GSGC+TPMCS
    UCF Kinect98.9 %[114]2014MvMF-HMM+$L_2$-normalization4-fold cross-validation
    98.8 %[113]2017SGS(p$_{\rm mean}$/p$_{\max}$, skeleton-view-dep.)4-fold cross-validation
    98.7 %[127]2013motion-based grouping+adaptive2-fold cross-validation
    N-UCLA92.61 %[115]2017Synthesized+Pre-trained (CNN)CV
    Multiview Action 3D90.8 %[113]2017SGS(p$_{\max}$, skel.-view-inv.+keypoint)CV
    89.57 %[115]2017Synthesized Samples (CNN)CV
    81.6 %[104]2014MST-AOGCS; LOOCV
    79.3 %[104]2014MST-AOGcross-environment
    UTD-MHAD88.4 %[117]2015DMMs+CT-HOG+LBP+EOHCS
    88.1 %[116]2017JDM+MSF (CNN)CS
    87.9 %[118]2016JTM+MSF (CNN)CS
    NTU RGB+D76.32 %[118]2016JTM+MSF (CNN)CS
    76.2 %[116]2017JDM+MSF (CNN)CS
    62.93 %[106]20162layer P-LSTMCS
    82.3 %[116]2017JDM+MSF (CNN)CV
    81.08 %[118]2016JTM+MSF (CNN)CV
    70.27 %[106]20162 layer P-LSTMCV

    表  6  通用、真实场景及多视角数据集信息表

    Table  6  The information of general datasets, real scene datasets and multi-view datasets

    类型 数据集名称 年份 行为类别 行为人数 视频数/类 视频总数/样本数 场景 视角 分辨率(最高) fps
    通用 KTH[19] 2004 6 25 99 $\sim$ 100 599/2 391 4 1 160$\times$120 25
    Weizmann[2] 2005 10 9 9 $\sim$ 10 93 1 1 180$\times$144 25
    真实场景 Hollywood[27] 2008 8 N/A 30 $\sim$ 129 475 N/A N/A 544$\times$240 25
    UCF Sports[28] 2008 10 N/A 6 $\sim$ 22 150 N/A N/A 720$\times$480 9
    UT-Tower[128] 2009 9 6 12 108 2 1 360$\times$240 10
    Hollywood 2[29] (Actions) 2009 12 N/A 61 $\sim$ 278 2 517 N/A N/A 720$\times$528 25
    ADL[129] 2009 10 5 15 150 1 1 1 280$\times$720 30
    UCF YouTube[30] 2009 11 N/A 116 $\sim$ 198 1 600 N/A N/A 320$\times$240 30
    Olympic Sports[31] 2010 16 N/A 21 $\sim$ 67 783 N/A N/A - -
    UT-Interaction[130] 2010 6 N/A 20 120 2 1 720$\times$480 30
    HMDB51[32] 2011 51 N/A 102 $\sim$ 548 6 766 N/A N/A 424$\times$240 30
    CCV[131] 2011 20 N/A 224 $\sim$ 806 9 317 N/A N/A - -
    UCF50[33] 2012 50 N/A 100 $\sim$ 197 6 681 N/A N/A 320$\times$240 25
    UCF101[34] 2012 101 N/A 100 $\sim$ 167 13 320 N/A N/A 320$\times$240 25
    MPII Cooking[132] 2012 65 12 - 44/5 609 1 1 1 624$\times$1 224 29.4
    MPII Composites[133] 2012 60 22 - 212 1 1 1 624$\times$1 224 29.4
    Sports-1M[35] 2014 487 N/A 1 000 $\sim$ 3 000 1 133 158 N/A N/A 1 280$\times$720 30
    Hollywood Extended[134] 2014 16 N/A 2 $\sim$ 11 937 N/A N/A 720$\times$528 25
    MPII Cooking 2[135] 2015 67 30 - 273/14 105 1 1 1 624$\times$1 224 29.4
    ActivityNet[136] 2015 203 N/A 137(a) 27 801 N/A N/A 1 280$\times$720 30
    多视角 IXMAS[68] 2006 13 12 180 180/2 340 1 5 390$\times$291 23
    i3DPost[137] 2009 12 8 64 768 1 8 1 920$\times$1 080 25
    MuHAVi[69] 2010 17 7 56 952 1 8 720$\times$576 25
    MuHAVi-MAS[69] 2010 14 2 4 $\sim$ 16 136 1 2 720$\times$576 25
    *a: average; N/A: not applicable

    表  7  特殊数据集信息表

    Table  7  The information of special human activity recognition datasets

    数据集名称 年份 行为类别 行为人数 视频数/类 视频总数/样本数 场景 视角 分辨率 fps 数据格式 骨架关节点
    CMU Mocap[94] 2007 23个亚类 N/A 1 $\sim$ 96 2 605 N/A N/A 320 $\times$ 240 30 MS 41
    WARD[93] 2009 13 20 64 $\sim$ 66 1 298 1 1 - - M N/A
    CMU-MMAC[138] 2009 5大类 45 234 $\sim$ 252 1 218 1 6 1 024$\times$768 30 RDMA N/A
    640$\times$480 60
    MSR Action 3D[95] 2010 20 10 20 $\sim$ 30 567 1 1 640$\times$480 (R)
    320$\times$240 (D)
    15 DS 20
    RGBD-HuDaAct[139] 2011 12 30 - 1 189 1 1 640$\times$480 (RD) 30 RD N/A
    UT Kinect[140] 2012 10 10 - 200 1 1 640$\times$480 (R)
    320$\times$240 (D)
    30 RDS 20
    ACT4$^2$[141] 2012 14 24 - 6 844 1 4 640 $\times$ 480 30 RD N/A
    MSR Daily Activity 3D[96] 2012 16 10 20 320 1 1 640$\times$480 30 RDS 20
    UCF Kinect[97] 2013 16 16 80 1 280 1 1 - - S 15
    Berkeley MHAD[142] 2013 11 12 54 $\sim$ 55 659 1 4 640$\times$480 30 RDMAIe N/A
    3D Action Pairs[143] 2013 12 10 30 360 1 1 640$\times$480 30 RDS 20
    Multiview RGB-D event[144] 2013 8 8 477 (a) 3 815 1 3 640$\times$480 30 RDS 20
    Online RGBD Action[145] 2014 7 24 48 336 1 1 - - RDS 20
    URFD[119] 2014 5 5 6 $\sim$ 60 100 4 2 640$\times$240 30 RD N/A
    N-UCLA[104] 2014 10 10 140 $\sim$ 173 1 475 1 3 640$\times$480 12 RDS 20
    TST Fall detection v1[120] 2014 2 4 10 20 1 1 320$\times$240 (D) 30 D N/A
    UTD-MHAD[105] 2015 27 8 31 $\sim$ 32 861 1 1 640$\times$480 30 RDSIe 25
    TST Fall detection v2[121] 2016 8 11 33 264 1 1 512$\times$424 (D) 25 DSIe 25
    NTU RGB+D[106] 2016 60 40 948 56 880 1 80 1 920$\times$720 (R)
    512$\times$424 (D)
    512$\times$424 (If)
    30 RDSIf 25
    *R: RGB; D: Depth; S: Skeleton; M: Motion; A: Audio; If: Infrared; Ie: Inertial

    表  8  人体行为数据集分类信息表

    Table  8  Human activity dataset classification according to different features

    分类特征 子类 数据集
    场景 室内 ADL、MPII Cooking、MPII Composites、MPII Cooking 2、IXMAS、i3DPost、MuHAVi、MuHAVi-MAS、CMU Mocap、WARD、CMU-MMAC、MSR Action 3D、RGBD-HuDaAct、UT Kinect、ACT4$^2$、MSR Daily Activity 3D、UCF Kinect、MHAD、3D Action Pairs、Multiview RGB-D event、Online RGBD Action、URFD、N- UCLA Multiview Action 3D、TST Fall detection dataset v1、UTD-MHAD、TST Fall detection dataset v2、NTU RGB+D
    室外 Weizmann、UT-Tower、UT-Interaction、PETS
    内容 室内/室外 KTH、Hollywood、UCF Sports、Hollywood 2、UCF YouTube、Olympic Sports、HMDB51、CCV、UCF50、UCF101、Sports-1M、Hollywood Extended、ActivityNet、THUMOS
    日常活动 KTH、Weizmann、ADL、HMDB51、CCV、ActivityNet、IXMAS、i3DPost、MuHAVi、MuHAVi-MAS、CMU Mocap、WARD、MSR Action 3D、RGBD-HuDaAct、UT Kinect、ACT4$^2$、MSR Daily Activity 3D、RGBD- HuDaAct、UCF Kinect、MHAD、3D Action Pairs、Multiview RGB-D event、Online RGBD Action、URFD、N-UCLA Multiview Action 3D、TST Fall detection dataset v1、UTD-MHAD、TST Fall detection dataset v2、NTU RGB+D
    体育运动 UCF Sports、UCF YouTube、Olympic Sports、UCF50、UCF101、Sports-1M、THUMOS
    厨房活动 MPII Cooking、MPII Composites、MPII Cooking 2、CMU-MMAC
    电影 Hollywood、Hollywood 2、Hollywood Extended
    监控 UT-Tower、UT-Interaction、PETS
    视角 单视角 KTH、Weizmann、ADL、MPII Cooking、MPII Composites、MPII Cooking 2、MSR Action 3D、UT Kinect、MSR Daily Activity 3D、RGBD-HuDaAct、UCF Kinect、3D Action Pairs、Online RGBD Action、TST Fall detection dataset v1、UTD-MHAD、TST Fall detection dataset v2
    多视角 IXMAS、i3DPost、MuHAVi、MuHAVi-MAS、ACT4$^2$、MHAD、Multiview RGB-D event、URFD、N-UCLA Multiview Action 3D、NTU RGB+D、PETS
    俯瞰 UT-Tower、UT-Interaction、PETS
    其他 Hollywood、UCF Sports、Hollywood 2、UCF YouTube、Olympic Sports、HMDB51、CCV、UCF50、UCF101、Sports-1M、Hollywood Extended、ActivityNet、CMU Mocap、WARD、CMU-MMAC、THUMOS
    相机 静止 KTH、Weizmann、UT-Tower、ADL、UT-Interaction、MPII Cooking、MPII Composites、MPII Cooking 2、IXMAS、i3DPost、MuHAVi、MuHAVi-MAS、CMU-MMAC、MSR Action 3D、RGBD-HuDaAct、UT Kinect、ACT4$^2$、MSR Daily Activity 3D、UCF Kinect、MHAD、3D Action Pairs、Multiview RGB-D event、Online RGBD Action、URFD、N-UCLA Multiview Action 3D、TST Fall detection dataset v1、UTD-MHAD、TST Fall detection dataset v2、NTU RGB+D、PETS
    移动 Hollywood、UCF Sports、Hollywood 2、UCF YouTube、Olympic Sports、HMDB51、CCV、UCF50、UCF101、Sports-1M、Hollywood Extended、ActivityNet、CMU Mocap、THUMOS
    应用 行为识别 KTH、Weizmann、Hollywood、UCF Sports、UT-Tower、Hollywood 2、ADL、UCF YouTube、Olympic Sports、UT-Interaction、HMDB51、CCV、UCF50、UCF101、MPII Cooking、MPII Composites、Sports-1M、Hollywood Extended、ActivityNet、MPII Cooking 2、IXMAS、i3DPost、MuHAVi、MuHAVi-MAS、CMU Mocap、WARD、CMU-MMAC、MSR Action 3D、RGBD-HuDaAct、UT Kinect、ACT4$^2$、MSR Daily Activity 3D、UCF Kinect、MHAD、3D Action Pairs、Multiview RGB-D event、Online RGBD Action、N-UCLA Multiview Action 3D、UTD-MHAD、TST Fall detection dataset v2、NTU RGB+D、PETS、THUMOS
    领域 检测/跟踪 KTH、Weizmann、UCF Sports、Olympic Sports、UT-Interaction、ADL、UCF YouTube、ACT4$^2$、URFD、TST Fall detection dataset v1、TST Fall detection dataset v2、PETS、UCF50、UCF101、MPII Cooking、MPII Composites、MPII Cooking 2
    其他 KTH、Weizmann、UCF YouTube、UT-Tower、UCF50、ActivityNet、MPII Cooking、MPII Composites、MPII Cooking 2、Multiview RGB-D event
  • [1] Hu W M, Tan T N, Wang L, Maybank S. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2004, 34(3):334-352 doi: 10.1109/TSMCC.2004.829274
    [2] Kim I S, Choi H S, Yi K M, Choi J Y, Kong S G. Intelligent visual surveillance-a survey. International Journal of Control, Automation and Systems, 2010, 8(5):926-939 doi: 10.1007/s12555-010-0501-4
    [3] 黄凯奇, 陈晓棠, 康运锋, 谭铁牛.智能视频监控技术综述.计算机学报, 2015, 38(6):1093-1118 doi: 10.11897/SP.J.1016.2015.01093

    Huang Kai-Qi, Chen Xiao-Tang, Kang Yun-Feng, Tan Tie-Niu. Intelligent visual surveillance:a review. Chinese Journal of Computers, 2015, 38(6):1093-1118 doi: 10.11897/SP.J.1016.2015.01093
    [4] Dix A. Human-Computer Interaction. Berlin: Springer-Verlag, 2009. 1327-1331
    [5] Myers B A. A brief history of human-computer interaction technology. Interactions, 1998, 5(2):44-54 doi: 10.1145/274430.274436
    [6] Rautaray S S, Agrawal A. Vision based hand gesture recognition for human computer interaction:a survey. Artificial Intelligence Review, 2015, 43(1):1-54 doi: 10.1007/s10462-012-9356-9
    [7] Park S H, Won S H, Lee J B, Kim S W. Smart home-digitally engineered domestic life. Personal and Ubiquitous Computing, 2003, 7(3-4):189-196 doi: 10.1007/s00779-003-0228-9
    [8] Jeong K-A, Salvendy G, Proctor R W. Smart home design and operation preferences of Americans and Koreans. Ergonomics, 2010, 53(5):636-660 doi: 10.1080/00140130903581623
    [9] Komninos N, Philippou E, Pitsillides A. Survey in smart grid and smart home security:Issues, challenges and countermeasures. IEEE Communications Surveys & Tutorials, 2014, 16(4):1933-1954 http://cn.bing.com/academic/profile?id=ba89261b5387cd451572bd2fd6012175&encoded=0&v=paper_preview&mkt=zh-cn
    [10] Suma E A, Krum D M, Lange B, Koenig S, Rizzo A, Bolas M. Adapting user interfaces for gestural interaction with the flexible action and articulated skeleton toolkit. Computers & Graphics, 2013, 37(3):193-201
    [11] Zelnik-Manor L, Irani M. Event-based analysis of video. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). Kauai, Hawaii, USA: IEEE, 2001, 2: Ⅱ-123-Ⅱ-130 doi: 10.1109/CVPR.2001.990935
    [12] Ahad M A R, Tan J, Kim H, Ishikawa S. Action dataset-a survey. In: Proceedings of the 2011 SICE Annual Conference (SICE). Tokyo, Japan: IEEE, 2011. 1650-1655 http://www.mendeley.com/catalog/action-dataset-survey/
    [13] Chaquet J M, Carmona E J, Fernández-Caballero A. A survey of video datasets for human action and activity recognition. Computer Vision and Image Understanding, 2013, 117(6):633-659 doi: 10.1016/j.cviu.2013.01.013
    [14] Zhang J, Li W Q, Ogunbona P O, Wang P C, Tang C. RGB-D-based action recognition datasets:a survey. Pattern Recognition, 2016, 60:86-105 doi: 10.1016/j.patcog.2016.05.019
    [15] Aggarwal J K, Ryoo M S. Human activity analysis:a review. ACM Computing Surveys, 2011, 43(3):Article No. 16 http://cn.bing.com/academic/profile?id=a25e9bf81e9f05da7e7a0358aaeb8ae3&encoded=0&v=paper_preview&mkt=zh-cn
    [16] Vishwakarma S, Agrawal A. A survey on activity recognition and behavior understanding in video surveillance. The Visual Computer, 2013, 29(10):983-1009 doi: 10.1007/s00371-012-0752-6
    [17] Chen C, Jafari R, Kehtarnavaz N. A survey of depth and inertial sensor fusion for human action recognition. Multimedia Tools and Applications, 2017, 76(3):4405-4425 doi: 10.1007/s11042-015-3177-1
    [18] 单言虎, 张彰, 黄凯奇.人的视觉行为识别研究回顾、现状及展望.计算机研究与发展, 2016, 53(1):93-112 doi: 10.7544/issn1000-1239.2016.20150403

    Shan Yan-Hu, Zhang Zhang, Huang Kai-Qi. Visual human action recognition:history, status and prospects. Journal of Computer Research and Development, 2016, 53(1):93-112 doi: 10.7544/issn1000-1239.2016.20150403
    [19] Schuldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR). Cambridge, UK: IEEE, 2004, 3: 32-36 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=1334462
    [20] Blank M, Gorelick L, Shechtman E, Irani M, Basri R. Actions as space-time shapes. In: Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV'05). Beijing, China: IEEE, 2005, 2: 1395-1402 http://europepmc.org/abstract/MED/17934233
    [21] Gorelick L, Blank M, Shechtman E, Irani M, Basri R. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(12):2247-2253 doi: 10.1109/TPAMI.2007.70711
    [22] Zhou T C, Li N J, Cheng X, Xu Q J, Zhou L, Wu Z Y. Learning semantic context feature-tree for action recognition via nearest neighbor fusion. Neurocomputing, 2016, 201:1-11 doi: 10.1016/j.neucom.2016.04.007
    [23] Xu W R, Miao Z J, Tian Y. A novel mid-level distinctive feature learning for action recognition via diffusion map. Neurocomputing, 2016, 218:185-196 doi: 10.1016/j.neucom.2016.08.057
    [24] Gorelick L, Blank M, Shechtman E, Irani M, Basri R. Actions as space-time shapes[Online], available: http://www.wisdom.weizmann.ac.il/~vision/SpaceTime-Actions.html, January 26, 2016.
    [25] Tran D, Sorokin A. Human activity recognition with metric learning. In: Proceedings of the 10th European Conference on Computer Vision (ECCV). Marseille, France: Springer, 2008. 548-561 http://www.springerlink.com/content/p2183333585g8845
    [26] Fathi A, Mori G. Action recognition by learning mid-level motion features. In: Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Anchorage, AK, USA: IEEE, 2008. 1-8 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4587735
    [27] Laptev I, Marszalek M, Schmid C, Rozenfeld B. Learning realistic human actions from movies. In: Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Anchorage, AK, USA: IEEE, 2008. 1-8 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=4587756
    [28] Rodriguez M D, Ahmed J, Shah M. Action MACH a spatio-temporal maximum average correlation height filter for action recognition. In: Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Anchorage, AK, USA: IEEE, 2008. 1-8 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=4587727
    [29] Marszalek M, Laptev I, Schmid C. Actions in context. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Miami, FL, USA: IEEE, 2009. 2929-2936
    [30] Liu J G, Luo J B, Shah M. Recognizing realistic actions from videos "in the wild". In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Miami, FL, USA: IEEE, 2009. 1996-2003 doi: 10.1109/CVPRW.2009.5206744
    [31] Niebles J C, Chen C W, Li F F. Modeling temporal structure of decomposable motion segments for activity classification. In: Proceedings of the 11th European Conference on Computer Vision (ECCV): Part Ⅱ. Heraklion, Crete, Greece: Springer, 2010. 392-405
    [32] Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T. HMDB: a large video database for human motion recognition. In: Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV). Barcelona, Spain: IEEE, 2011. 2556-2563 doi: 10.1109/ICCV.2011.6126543
    [33] Reddy K K, Shah M. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 2013, 24(5):971-981 doi: 10.1007/s00138-012-0450-4
    [34] Soomro K, Zamir A R, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv: 1212. 0402, 2012. 1-7
    [35] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Li F F. Large-scale video classification with convolutional neural networks. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, OH, USA: IEEE, 2014. 1725-1732 http://ieeexplore.ieee.org/document/6909619/
    [36] Kulkarni K, Evangelidis G, Cech J, Horaud R. Continuous action recognition based on sequence alignment. International Journal of Computer Vision, 2015, 112(1):90-114 doi: 10.1007/s11263-014-0758-9
    [37] Shabani A H, Clausi D A, Zelek J S. Evaluation of local spatio-temporal salient feature detectors for human action recognition. In: Proceedings of the 2012 Ninth Conference on Computer and Robot Vision (CRV). Toronto, ON, Canada: IEEE, 2012. 468-475 http://dl.acm.org/citation.cfm?id=2354394
    [38] Fernando B, Anderson P, Hutter M, Gould S. Discriminative hierarchical rank pooling for activity recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 1924-1932 doi: 10.1109/CVPR.2016.212
    [39] Wang H, Schmid C. Action recognition with improved trajectories. In: Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). Sydney, Australia: IEEE, 2013. 3551-3558 doi: 10.1109/ICCV.2013.441
    [40] Liu A A, Su Y T, Nie W Z, Kankanhalli M. Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(1):102-114 doi: 10.1109/TPAMI.2016.2537337
    [41] Wang Y, Tran V, Hoai M. Evolution-preserving dense trajectory descriptors. arXiv: 1702. 04037, 2017.
    [42] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. Advance in Neural Information Processing Systems. 2014, 1(4):568-576 https://www.researchgate.net/publication/262974436_Two-Stream_Convolutional_Networks_for_Action_Recognition_in_Videos
    [43] Al Harbi N, Gotoh Y. A unified spatio-temporal human body region tracking approach to action recognition. Neurocomputing, 2015, 161:56-64 doi: 10.1016/j.neucom.2014.11.072
    [44] Tong M, Wang H Y, Tian W J, Yang S L. Action recognition new framework with robust 3D-TCCHOGAC and 3D-HOOFGAC. Multimedia Tools and Applications, 2017, 76(2):3011-3030 doi: 10.1007/s11042-016-3279-4
    [45] Vishwakarma D K, Kapoor R, Dhiman A. Unified framework for human activity recognition:an approach using spatial edge distribution and R-transform. AEU-International Journal of Electronics and Communications, 2016, 70(3):341-353 doi: 10.1016/j.aeue.2015.12.016
    [46] Vishwakarma D K, Kapoor R, Dhiman A. A proposed unified framework for the recognition of human activity by exploiting the characteristics of action dynamics. Robotics and Autonomous Systems, 2016, 77:25-38 doi: 10.1016/j.robot.2015.11.013
    [47] Liu C W, Pei M T, Wu X X, Kong Y, Jia Y D. Learning a discriminative mid-level feature for action recognition. Science China Information Sciences, 2014, 57(5):1-13 http://cn.bing.com/academic/profile?id=cb77c2bcda90b6c26f8a2e19405b6342&encoded=0&v=paper_preview&mkt=zh-cn
    [48] Laptev I, Marszalek M, Schmid C, Rozenfeld B. Hollywood2: Human actions and scenes dataset[Online], available: http://www.di.ens.fr/~laptev/actions/hollywood2/, March 12, 2016.
    [49] Wang H, Kläser A, Schmid C, Liu C L. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 2013, 103(1):60-79 doi: 10.1007/s11263-012-0594-8
    [50] Soomro K, Zamir A R. Action recognition in realistic sports videos. Computer vision in sports. Cham, Switzerland: Springer, 2014. 181-208
    [51] Peng X J, Zou C Q, Qiao Y, Peng Q. Action recognition with stacked fisher vectors. In: Proceedings of the 13th European Conference on Computer Vision (ECCV). Zurich, Switzerland: Springer, 2014. 581-595 doi: 10.1007/978-3-319-10602-1_38
    [52] Liu C H, Liu J, He Z C, Zhai Y J, Hu Q H, Huang Y L. Convolutional neural random fields for action recognition. Pattern Recognition, 2016, 59:213-224 doi: 10.1016/j.patcog.2016.03.019
    [53] Sun Q R, Liu H, Ma L Q, Zhang T W. A novel hierarchical bag-of-words model for compact action representation. Neurocomputing, 2016, 174(Part B):722-732 https://www.researchgate.net/publication/283989611_A_novel_hierarchical_Bag-of-Words_model_for_compact_action_representation
    [54] Sekma M, Mejdoub M, Amar C B. Human action recognition based on multi-layer fisher vector encoding method. Pattern Recognition Letters, 2015, 65(C):37-43 https://www.researchgate.net/publication/305284646_Structured_Fisher_vector_encoding_method_for_human_action_recognition
    [55] Li Y W, Li W X, Mahadevan V, Vasconcelos N. VLAD3: encoding dynamics of deep features for action recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 1951-1960 doi: 10.1109/CVPR.2016.215
    [56] Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 1933-1941 http://arxiv.org/abs/1604.06573
    [57] Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O, Van Gool L. Temporal segment networks: Towards good practices for deep action recognition. In: Proceedings of the 14th European Conference on Computer Vision (ECCV). Amsterdam, the Netherlands: Springer, 2016. 20-36 doi: 10.1007/978-3-319-46484-8_2
    [58] Wang H S, Wang W, Wang L. How scenes imply actions in realistic videos? In: Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP). Phoenix, AZ, USA: IEEE, 2016. 1619-1623 http://ieeexplore.ieee.org/document/7532632/
    [59] Wang L M, Guo S, Huang W L, Qiao Y. Places205-VGGNet models for scene recognition. arXiv: 1508. 01667, 2015.
    [60] Lan Z Z, Lin M, Li X C, Hauptmann A G, Raj B. Beyond Gaussian pyramid: multi-skip feature stacking for action recognition. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015. 204-212 doi: 10.1109/CVPR.2015.7298616
    [61] Ijjina E P, Chalavadi K M. Human action recognition using genetic algorithms and convolutional neural networks. Pattern Recognition, 2016, 59:199-212 doi: 10.1016/j.patcog.2016.01.012
    [62] Lev G, Sadeh G, Klein B, Wolf L. RNN Fisher vectors for action recognition and image annotation. In: Proceedings of the 14th European Conference on Computer Vision (ECCV): Part Ⅷ . Amsterdam, the Netherlands: Springer, 2016. 833-850
    [63] Jiang Y G, Liu J G, Zamir A R, Laptev I, Piccardi M, Shah M, Sukthankar R. THUMOS challenge: Action recognition with a large number of classes[Online], available: http://crcv.ucf.edu/ICCV13-Action-Workshop/index.html, November 20, 2016.
    [64] Jiang Y G, Liu J G, Zamir A R, Toderici G, Laptev I, Shah M, Sukthankar R. THUMOS challenge: action recognition with a large number of classes[Online], available: http://crcv.ucf.edu/THUMOS14/home.html, November 20, 2016.
    [65] Gorban A, Idrees H, Jiang Y G, Zamir A R, Laptev I, Shah M, Sukthankar R. THUMOS challenge: action recognition with a large number of classes[Online], available: http://www.thumos.info/home.html, November 20, 2016.
    [66] Xu Z, Zhu L, Yang Y, Hauptmann A G. UTS-CMU at THUMOS 2015. In: Proceedings of the 2015 THUMOS Challenge. Boston, MA, USA: CVPR, 2015. 1-3
    [67] Mahasseni B, Todorovic S. Regularizing long short term memory with 3D human-skeleton sequences for action recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 3054-3062 doi: 10.1109/CVPR.2016.333
    [68] Weinland D, Ronfard R, Boyer E. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 2006, 104(2-3):249-257 doi: 10.1016/j.cviu.2006.07.013
    [69] Singh S, Velastin S A, Ragheb H. MuHAVi: a multicamera human action video dataset for the evaluation of action recognition methods. In: Proceedings of the 7th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Boston, MA, USA: IEEE, 2010. 48-55 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=5597316
    [70] Ferryman J, Shahrokni A. PETS2009: dataset and challenge. In: Proceedings of the 22th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS-Winter). Snowbird, UT, USA: IEEE, 2009. 1-6 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=5399556
    [71] Patino L, Ferryman J. PETS 2014: dataset and challenge. In: Proceedings of the 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Seoul, South Korea: IEEE, 2014. 355-360 doi: 10.1109/AVSS.2014.6918694
    [72] Ashraf N, Foroosh H. Motion retrieval using consistency of epipolar geometry. In: Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP). Quebec City, QC, Canada: IEEE, 2015. 4219-4223 http://ieeexplore.ieee.org/document/7351601/
    [73] Ji X F, Ju Z J, Wang C, Wang C H. Multi-view transition HMMs based view-invariant human action recognition method. Multimedia Tools and Applications, 2016, 75(19):11847-11864 doi: 10.1007/s11042-015-2661-y
    [74] Gao Z, Nie W Z, Liu A N, Zhang H. Evaluation of local spatial-temporal features for cross-view action recognition. Neurocomputing, 2016, 173(Part 1):110-117 http://cn.bing.com/academic/profile?id=1da561fc4b0fcb38d7c20fb3f7e53e43&encoded=0&v=paper_preview&mkt=zh-cn
    [75] Wu D, Shao L. Multi-max-margin support vector machine for multi-source human action recognition. Neurocomputing, 2014, 127(3):98-103 http://cn.bing.com/academic/profile?id=1985a105fc3d9604d66066b167adf376&encoded=0&v=paper_preview&mkt=zh-cn
    [76] Yi Y, Lin M Q. Human action recognition with graph-based multiple-instance learning. Pattern Recognition, 2016, 53(C):148-162 http://cn.bing.com/academic/profile?id=d6d8420d7e0ac3354d4a04a9cb76c2dd&encoded=0&v=paper_preview&mkt=zh-cn
    [77] Jung H J, Hong K S. Modeling temporal structure of complex actions using bag-of-sequencelets. Pattern Recognition Letters, 2017, 85:21-28 doi: 10.1016/j.patrec.2016.11.012
    [78] Ballas N, Yang Y, Lan Z Z, Delezoide B, Preteux F, Hauptmann A. Space-time robust representation for action recognition. In: Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). Sydney, NSW, Australia: IEEE, 2013. 2704-2711 doi: 10.1109/ICCV.2013.336
    [79] Qiu Z F, Li Q, Yao T, Mei T, Rui Y. MSR Asia MSM at THUMOS challenge 2015. In: Proceedings of the 2015 THUMOS Challenge. Boston, MA, USA: CVPR, 2015. 1-3 http://storage.googleapis.com/www.thumos.info/thumos15_notebooks/TH15_MSRAsia.pdf
    [80] Ning K, Wu F. ZJUDCD submission at THUMOS challenge 2015. In: Proceedings of the 2015 THUMOS Challenge. Boston, MA, USA: CVPR, 2015. 1-2
    [81] Ng J Y H, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G. Beyond short snippets: deep networks for video classification. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015. 4694-4702 doi: 10.1109/CVPR.2015.7299101
    [82] Moghaddam Z, Piccardi M. Training initialization of Hidden Markov Models in human action recognition. IEEE Transactions on Automation Science and Engineering, 2014, 11(2):394-408 doi: 10.1109/TASE.2013.2262940
    [83] Wu X X, Jia Y D. View-invariant action recognition using latent kernelized structural SVM. In: Proceedings of the 12th European Conference on Computer Vision (ECCV). Florence, Italy: Springer, 2012. 411-424 http://dl.acm.org/citation.cfm?id=2403170
    [84] Alcantara M F, Moreira T P, Pedrini H. Real-time action recognition using a multilayer descriptor with variable size. Journal of Electronic Imaging, 2016, 25(1):Article No., 013020 https://www.researchgate.net/profile/Marlon_Alcantara3/publication/293042223_Real-time_action_recognition_using_a_multilayer_descriptor_with_variable_size/links/5760567508ae2b8d20eb5f9e.pdf?origin=publication_list
    [85] Chaaraoui A A, Flórez-Revuelta F. Human action recognition optimization based on evolutionary feature subset selection. In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. Amsterdam, the Netherlands: ACM, 2013. 1229-1236
    [86] Cai J X, Tang X, Feng G C. Learning pose dictionary for human action recognition. In: Proceedings of the 22nd International Conference on Pattern Recognition (ICPR). Stockholm, Sweden: IEEE, 2014. 381-386 http://dl.acm.org/citation.cfm?id=2704008
    [87] Chaaraoui A A, Flórez-Revuelta F. A low-dimensional radial silhouette-based feature for fast human action recognition fusing multiple views. International Scholarly Research Notices, 2014, 2014:Article No., 547069 https://www.hindawi.com/journals/isrn/2014/547069/tab1/
    [88] Alcantara M F, Moreira T P, Pedrini H. Real-time action recognition based on cumulative motion shapes. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy: IEEE, 2014. 2917-2921 http://ieeexplore.ieee.org/document/6854134/
    [89] Li L Z, Nawaz T, Ferryman J. PETS 2015: datasets and challenge. In: Proceedings of the 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Karlsruhe, Germany: IEEE, 2015. 1-6 doi: 10.1109/AVSS.2015.7301741
    [90] Patino L, Cane T, Vallee A, Ferryman J. PETS 2016: dataset and challenge. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Las Vegas, NV, USA: IEEE, 2016. 1240-1247 http://ieeexplore.ieee.org/document/7789647/
    [91] PETS 2014[Online], available: http://www.cvg.reading.ac.uk/PETS2014/, April 16, 2016
    [92] Chen J W, Wu J, Konrad J, Ishwar P. Semi-coupled two-stream fusion ConvNets for action recognition at extremely low resolutions. In: Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). Santa Rosa, California, USA: IEEE, 2017. 139-147 http://ieeexplore.ieee.org/document/7926606/
    [93] Yang A Y, Jafari R, Sastry S S, Bajcsy R. Distributed recognition of human actions using wearable motion sensor networks. Journal of Ambient Intelligence and Smart Environments, 2009, 1(2):103-115 http://dl.acm.org/citation.cfm?id=2350317
    [94] CMU graphics lab motion capture database[Online], available: http://mocap.cs.cmu.edu, September 27, 2016.
    [95] Li W Q, Zhang Z Y, Liu Z C. Action recognition based on a bag of 3D points. In: Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). San Francisco, CA, USA: IEEE, 2010. 9-14
    [96] Wang J, Liu Z C, Wu Y, Yuan J S. Mining actionlet ensemble for action recognition with depth cameras. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Providence, RI, USA: IEEE, 2012. 1290-1297 http://dl.acm.org/citation.cfm?id=2354966
    [97] Ellis C, Masood S Z, Tappen M F, LaViola Jr J J, Sukthankar R. Exploring the trade-off between accuracy and observational latency in action recognition. International Journal of Computer Vision, 2013, 101(3):420-436 doi: 10.1007/s11263-012-0550-7
    [98] Yang A Y, Iyengar S, Kuryloski P, Jafari R. Distributed segmentation and classification of human actions using a wearable motion sensor network. In: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'08). Anchorage, AK, USA: IEEE, 2008. 1-8 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=4563176
    [99] Guo Y C, He W H, Gao C. Human activity recognition by fusing multiple sensor nodes in the wearable sensor systems. Journal of Mechanics in Medicine and Biology, 2012, 12(5): Article No. 1250084 doi: 10.1142/S0219519412500844
    [100] Guo M, Wang Z L. A feature extraction method for human action recognition using body-worn inertial sensors. In: Proceedings of the 19th International Conference on Computer Supported Cooperative Work in Design (CSCWD). Calabria, Italy: IEEE, 2015. 576-581 http://ieeexplore.ieee.org/document/7231022/
    [101] Jia Q, Fan X, Luo Z X, Li H J, Huyan K, Li Z Z. Cross-view action matching using a novel projective invariant on non-coplanar space-time points. Multimedia Tools and Applications, 2016, 75(19):11661-11682 doi: 10.1007/s11042-015-2704-4
    [102] Al Aghbari Z, Junejo I N. DisCoSet: discovery of contrast sets to reduce dimensionality and improve classification. International Journal of Computational Intelligence Systems, 2015, 8(6):1178-1191
    [103] Kadu H, Kuo C C J. Automatic human Mocap data classification. IEEE Transactions on Multimedia, 2014, 16(8):2191-2202 doi: 10.1109/TMM.2014.2360793
    [104] Wang J, Nie X H, Xia Y, Wu Y, Zhu S C. Cross-view action modeling, learning, and recognition. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, OH, USA: IEEE, 2014. 2649-2656 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6909735
    [105] Chen C, Jafari R, Kehtarnavaz N. UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In: Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP). Quebec City, QC, Canada: IEEE, 2015. 168-172 http://ieeexplore.ieee.org/document/7350781
    [106] Shahroudy A, Liu J, Ng T T, Wang G. NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016. 1010-1019 http://arxiv.org/abs/1604.02808
    [107] Luo J J, Wang W, Qi H R. Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In: Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). Sydney, NSW, Australia: IEEE, 2013. 1809-1816 doi: 10.1109/ICCV.2013.227
    [108] Chen C, Jafari R, Kehtarnavaz N. Action recognition from depth sequences using depth motion maps-based local binary patterns. In: Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision (WACV). Waikoloa, HI, USA: IEEE, 2015. 1092-1099 http://dl.acm.org/citation.cfm?id=2764065.2764211
    [109] Chen W B, Guo G D. TriViews: a general framework to use 3D depth data effectively for action recognition. Journal of Visual Communication and Image Representation, 2015, 26:182-191 doi: 10.1016/j.jvcir.2014.11.008
    [110] Wang P C, Li W Q, Gao Z M, Zhang J, Tang C, Ogunbona P. Deep convolutional neural networks for action recognition using depth map sequences. arXiv:1501.04686, 2015. 1-8
    [111] Zhang H L, Zhong P, He J L, Xia C X. Combining depth-skeleton feature with sparse coding for action recognition. Neurocomputing, 2017, 230:417-426 doi: 10.1016/j.neucom.2016.12.041
    [112] Shahroudy A, Ng T T, Gong Y H, Wang G. Deep multimodal feature analysis for action recognition in RGB+D videos. arXiv:1603.07120, 2016.
    [113] Kerola T, Inoue N, Shinoda K. Cross-view human action recognition from depth maps using spectral graph sequences. Computer Vision and Image Understanding, 2017, 154:108-126 doi: 10.1016/j.cviu.2016.10.004
    [114] Beh J, Han D K, Duraiswami R, Ko H. Hidden Markov model on a unit hypersphere space for gesture trajectory recognition. Pattern Recognition Letters, 2014, 36:144-153 doi: 10.1016/j.patrec.2013.10.007
    [115] Liu M Y, Liu H, Chen C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 2017, 68:346-362 doi: 10.1016/j.patcog.2017.02.030
    [116] Li C K, Hou Y H, Wang P C, Li W Q. Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Processing Letters, 2017, 24(5):624-628 doi: 10.1109/LSP.2017.2678539
    [117] Bulbul M F, Jiang Y S, Ma J W. DMMs-based multiple features fusion for human action recognition. International Journal of Multimedia Data Engineering and Management, 2015, 6(4):23-39
    [118] Wang P C, Li W Q, Li C K, Hou Y H. Action recognition based on joint trajectory maps with convolutional neural networks. arXiv:1612.09401v1, 2016. 1-11
    [119] Kwolek B, Kepski M. Human fall detection on embedded platform using depth maps and wireless accelerometer. Computer Methods and Programs in Biomedicine, 2014, 117(3):489-501 doi: 10.1016/j.cmpb.2014.09.005
    [120] Gasparrini S, Cippitelli E, Spinsante S, Gambi E. A depth-based fall detection system using a Kinect® sensor. Sensors, 2014, 14(2):2756-2775 doi: 10.3390/s140202756
    [121] Gasparrini S, Cippitelli E, Gambi E, Spinsante S, Wåhslén J, Orhan I, Lindh T. Proposal and experimental evaluation of fall detection solution based on wearable and depth data fusion. In: ICT Innovations 2015. Cham, Switzerland: Springer, 2016. 99-108 doi: 10.1007/978-3-319-25733-4_11
    [122] Su Ben-Yue, Jiang Jing, Tang Qing-Feng, Sheng Min. Human dynamic action recognition based on functional data analysis. Acta Automatica Sinica, 2017, 43(5):866-876 (in Chinese) http://www.aas.net.cn/CN/abstract/abstract19064.shtml
    [123] Han L, Wu X X, Liang W, Hou G M, Jia Y D. Discriminative human action recognition in the learned hierarchical manifold space. Image and Vision Computing, 2010, 28(5):836-849 doi: 10.1016/j.imavis.2009.08.003
    [124] Wang J, Liu Z C, Wu Y, Yuan J S. Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(5):914-927 doi: 10.1109/TPAMI.2013.198
    [125] Chen H Z, Wang G J, Xue J H, He L. A novel hierarchical framework for human action recognition. Pattern Recognition, 2016, 55:148-159 doi: 10.1016/j.patcog.2016.01.020
    [126] Zhu Y, Chen W B, Guo G D. Fusing multiple features for depth-based action recognition. ACM Transactions on Intelligent Systems and Technology, 2015, 6(2): Article No. 18
    [127] Jiang X B, Zhong F, Peng Q S, Qin X Y. Robust action recognition based on a hierarchical model. In: Proceedings of the 2013 International Conference on Cyberworlds (CW). Yokohama, Japan: IEEE, 2013. 191-198
    [128] Chen C C, Aggarwal J K. Recognizing human action from a far field of view. In: Proceedings of the 2009 Workshop on Motion and Video Computing (WMVC'09). Snowbird, UT, USA: IEEE, 2009. 1-7 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5399231
    [129] Messing R, Pal C, Kautz H. Activity recognition using the velocity histories of tracked keypoints. In: Proceedings of the 12th International Conference on Computer Vision (ICCV). Kyoto, Japan: IEEE, 2009. 104-111 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=5459154
    [130] Ryoo M S, Aggarwal J K. UT-interaction dataset, ICPR contest on semantic description of human activities (SDHA) [Online], available: http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html, December 10, 2016.
    [131] Jiang Y G, Ye G N, Chang S F, Ellis D, Loui A C. Consumer video understanding: a benchmark database and an evaluation of human and machine performance. In: Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR'11). Trento, Italy: ACM, 2011. Article No. 29 http://dl.acm.org/citation.cfm?id=1992025
    [132] Rohrbach M, Amin S, Andriluka M, Schiele B. A database for fine grained activity detection of cooking activities. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Providence, RI, USA: IEEE, 2012. 1194-1201 http://dl.acm.org/citation.cfm?id=2354909
    [133] Rohrbach M, Regneri M, Andriluka M, Amin S, Pinkal M, Schiele B. Script data for attribute-based recognition of composite activities. In: Proceedings of the 12th European Conference on Computer Vision (ECCV). Florence, Italy: Springer, 2012. 144-157 http://dl.acm.org/citation.cfm?id=2402952
    [134] Bojanowski P, Lajugie R, Bach F, Laptev I, Ponce J, Schmid C, Sivic J. Weakly supervised action labeling in videos under ordering constraints. In: Proceedings of the 13th European Conference on Computer Vision (ECCV). Zurich, Switzerland: Springer, 2014. 628-643
    [135] Rohrbach M, Rohrbach A, Regneri M, Amin S, Andriluka M, Pinkal M, Schiele B. Recognizing fine-grained and composite activities using hand-centric features and script data. International Journal of Computer Vision, 2016, 119(3):346-373 doi: 10.1007/s11263-015-0851-8
    [136] Heilbron F C, Escorcia V, Ghanem B, Niebles J C. ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015. 961-970 doi: 10.1109/CVPR.2015.7298698
    [137] Gkalelis N, Kim H, Hilton A, Nikolaidis N, Pitas I. The i3DPost multi-view and 3D human action/interaction database. In: Proceedings of the 2009 Conference for Visual Media Production (CVMP). London, UK: IEEE, 2009. 159-168
    [138] De la Torre F, Hodgins J K, Montano J, Valcarcel S. Detailed human data acquisition of kitchen activities: the CMU-multimodal activity database (CMU-MMAC). In: Proceedings of the 2009 Workshop on Developing Shared Home Behavior Datasets to Advance HCI and Ubiquitous Computing Research, in Conjunction with CHI. Boston, MA, USA: ACM, 2009. 1-5
    [139] Ni B B, Wang G, Moulin P. RGBD-HuDaAct: a color-depth video database for human daily activity recognition. In: Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). Barcelona, Spain: IEEE, 2011. 1147-1153 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6130379
    [140] Xia L, Chen C C, Aggarwal J K. View invariant human action recognition using histograms of 3D joints. In: Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Providence, RI, USA: IEEE, 2012. 20-27 http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6239233
    [141] Cheng Z W, Qin L, Ye Y T, Huang Q M, Tian Q. Human daily action analysis with multi-view and color-depth data. In: Computer Vision - ECCV 2012. Workshops and Demonstrations. Florence, Italy: Springer, 2012. 52-61
    [142] Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R. Berkeley MHAD: a comprehensive multimodal human action database. In: Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV). Tampa, FL, USA: IEEE, 2013. 53-60 doi: 10.1109/WACV.2013.6474999
    [143] Oreifej O, Liu Z C. HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Portland, OR, USA: IEEE, 2013. 716-723 http://dl.acm.org/citation.cfm?id=2516099
    [144] Wei P, Zhao Y B, Zheng N N, Zhu S C. Modeling 4D human-object interactions for joint event segmentation, recognition, and object localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6):1165-1179 doi: 10.1109/TPAMI.2016.2574712
    [145] Yu G, Liu Z C, Yuan J S. Discriminative orderlet mining for real-time recognition of human-object interaction. In: Proceedings of the 12th Asian Conference on Computer Vision (ACCV). Singapore: Springer, 2014. 50-65
    [146] Abu-El-Haija S, Kothari N, Lee J, Natsev P, Toderici G, Varadarajan B, Vijayanarasimhan S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675, 2016. 1-10
    Publication history
    • Received: 2017-01-16
    • Accepted: 2017-07-18
    • Published: 2018-06-20
