
ST-HOI: A Spatial-Temporal Baseline for
Human-Object Interaction Detection in Videos

Meng-Jiun Chiou, National University of Singapore, Singapore
mengjiun.chiou@u.nus.edu
Chun-Yu Liao, ASUS Intelligent Cloud Services, Taipei, Taiwan
mist_liao@asus.com
Li-Wei Wang, ASUS Intelligent Cloud Services, Taipei, Taiwan
popo55668@gmail.com
Roger Zimmermann, National University of Singapore, Singapore
rogerz@comp.nus.edu.sg
Jiashi Feng, National University of Singapore, Singapore
elefjia@nus.edu.sg
(2021)
Abstract.

Detecting human-object interactions (HOI) is an important step toward a comprehensive visual understanding of machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporal-related HOIs (e.g., opening/closing a door) from a single video frame, where the neighboring frames play an essential role. However, conventional HOI methods operating on only static images have been used to predict temporal-related interactions, which is essentially guessing without temporal contexts and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI) utilizing temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI (the dataset and source code are available at https://github.com/coldmanck/VidHOI), where our proposed approach serves as a solid baseline.

Human-Object Interaction, Action Detection, Video Understanding
copyright: rightsretained
journalyear: 2021
conference: Proceedings of the 2021 Workshop on Intelligent Cross-Data Analysis and Retrieval; August 21–24, 2021; Taipei, Taiwan
booktitle: Proceedings of the 2021 Workshop on Intelligent Cross-Data Analysis and Retrieval (ICDAR ’21), August 21–24, 2021, Taipei, Taiwan
doi: 10.1145/3463944.3469097
isbn: 978-1-4503-8529-9/21/08
ccs: Human-centered computing
ccs: Computing methodologies Activity recognition and understanding

1. Introduction

Figure 1. An illustrative comparison between conventional HOI methods and our ST-HOI when inferencing on videos. (a) Traditional HOI approaches (e.g., the baseline in (Wan et al., 2019)) take in only the target frame and predict HOIs based on ROI-pooled visual features. These models are unable to differentiate between push, pull or lean on in this example due to the lack of temporal context. (b) ST-HOI takes in not only the target frame but neighboring frames and exploits temporal context based on trajectories. ST-HOI can thus differentiate temporal-related interactions and prefers push to other interactions in this example.
Figure 2. (a) Relative performance change (in percentage) on different video tasks when replacing 2D-CNN backbones with 3D ones (blue bars) (Tran et al., 2015; Carreira and Zisserman, 2017; Feichtenhofer et al., 2019), and on VideoHOI when adding trajectory features (tangerine bar). VideoHOI (in triplet mAP) is to detect HOI in videos and was performed by us on our VidHOI benchmark. STAD (Gu et al., 2018) (in triplet mAP) means Spatial-Temporal Action Detection and was performed on the AVA dataset (Gu et al., 2018). STSGG (Ji et al., 2020) (PredCls mode; in Recall@20) stands for Spatial-Temporal Scene Graph Generation and was performed on Action Genome (Ji et al., 2020). (b) An illustration of temporal-RoI pooling in 3D baselines (e.g., (Feichtenhofer et al., 2019)). Temporal pooling is usually applied to the output of the penultimate layer of a 3D-CNN (shape $d \times T \times H \times W$), which is average-pooled along the time axis into shape $d \times 1 \times H \times W$, followed by RoI pooling to obtain feature maps of shape $d \times 1 \times h \times w$. This temporal-RoI pooling, however, is equivalent to pooling the instance-of-interest feature at the same location as in the keyframe throughout the video segment, which is erroneous for moving humans and objects.

Thanks to the rapid development of deep learning (Goodfellow et al., 2016; He et al., 2016), machines are already surpassing or approaching human level performance in language tasks (Wu et al., 2016), acoustic tasks (Yin et al., 2019), and vision tasks (e.g., image classification (He et al., 2015) and visual place recognition (Chiou et al., 2020)). Researchers thus started to focus on how to replicate these successes to other semantically higher-level vision tasks (e.g., visual relationship detection (Lu et al., 2016; Chiou et al., 2021)) and vision-language tasks (e.g., image captioning (Vinyals et al., 2015) and visual question answering (Antol et al., 2015)) so that machines learn not just to recognize the objects but to understand their relationships and the contexts. Especially, human-object interaction (HOI) (Smith et al., 2013; Chao et al., 2018; Gao et al., 2018; Qi et al., 2018; Li et al., 2019; Wang et al., 2019; Wan et al., 2019; Gupta et al., 2019; Li et al., 2020; Wang et al., 2020; Liao et al., 2020; Ulutan et al., 2020; Zhong et al., 2020) aiming to detect actions and spatial relations among humans and salient objects in images/videos has attracted increasing attention, as we sometimes task machines to understand human behaviors, e.g., pedestrian detection (Zhang et al., 2017) and unmanned store systems (Lo and Wang, 2019).

Although there are abundant studies that have achieved success in detecting HOI in static images, the fact that few of them (Jain et al., 2016; Qi et al., 2018; Sunkesula et al., 2020) consider temporal information (i.e., neighboring frames before/after the target frame) when applied to video data means they are essentially “guessing” temporal-related HOIs with only naive co-occurrence statistics. While conventional image-based HOI methods (e.g., the baseline model in (Wan et al., 2019)) can be used for inference on videos, they treat input frames as independent and identically distributed (i.i.d.) data and make independent predictions for neighboring frames. However, video data are sequential and structured by nature and thus are not i.i.d. What is worse, without temporal context these methods are unable to differentiate (especially opposite) temporal interactions, such as push versus pull a human and open versus close a door. As shown in Figure 1(a), given a video segment, traditional HOI models operate on a single frame at a time and make predictions based on 2D-CNN (e.g., (He et al., 2016)) visual features. These models by nature cannot distinguish interactions between two people such as push, pull, lean on and chase, which are visually similar in static images. A possible reason why video-based HOI remains underexplored is the lack of a suitable video-based benchmark and a feasible setting. To bridge this gap, we first construct a video HOI benchmark from VidOR (Shang et al., 2019), dubbed VidHOI, where we follow the common protocol in video and HOI tasks and use a keyframe-centered strategy. With VidHOI, we advocate the use of video data and define VideoHOI as performing HOI detection with videos in both training and inference.

Spatial-Temporal Action Detection (STAD) is another task bearing a resemblance to VideoHOI, as it requires localizing the human and detecting the actions being performed in videos. Note that STAD does not consider the objects that a human is interacting with. STAD is usually tackled by first using a 3D-CNN (Tran et al., 2015; Carreira and Zisserman, 2017) as the backbone to encode temporal information into feature maps. This is followed by RoI pooling with object proposals to obtain actor features, which are then classified by linear layers. Essentially, this approach is similar to a common HOI baseline illustrated in Figure 1(a) and differs only in the use of 3D backbones and the absence of interacting objects. Based on conventional HOI and STAD methods, a naive yet intuitive idea arises: can we enjoy the best of both worlds by replacing 2D backbones with 3D ones and exploiting visual features of interacting objects? This idea, however, did not work straightforwardly in our preliminary experiment, where we replaced the backbone in the 2D baseline (Wan et al., 2019) with a 3D one (e.g., SlowFast (Feichtenhofer et al., 2019)) to perform VideoHOI. The relative change of performance after replacing the backbone is presented in the leftmost entry in Figure 2(a) with a blue bar. In the VideoHOI experiment, the 3D baseline provides only a limited relative improvement (∼2%), which is far from satisfactory considering the additional temporal context. In fact, this phenomenon has also been observed in two existing works under similar settings (Gu et al., 2018; Ji et al., 2020), where both the experiments in STAD and in another video task, Spatial-Temporal Scene Graph Generation (STSGG), present an even worse, counter-intuitive result: replacing the backbone is actually harmful (also presented as blue bars in Figure 2(a)). We probed the underlying reason by analyzing the architecture of these 3D baselines and found that, surprisingly, temporal pooling together with RoI pooling does not work reasonably. As illustrated in Figure 2(b), temporal pooling followed by RoI pooling, which is a common practice in conventional STAD methods, is equivalent to cropping features of the same region across the whole video segment without considering the way objects move. It is not unusual for moving humans and objects in neighboring frames to be absent from their locations in the target keyframe. Pooling features at the same location across time can therefore capture erroneous content such as other humans/objects or meaningless background. Dealing with this inconsistency, we propose to recover the missing spatial-temporal information in VideoHOI by considering human and object trajectories. The performance change of this temporal-augmented 3D baseline on VideoHOI is represented by the tangerine bar in Figure 2(a), where it achieves a ∼23% improvement, in sharp contrast to the ∼2% of the original 3D baseline. This experiment reveals the importance of incorporating the “correctly-localized” temporal information.
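To make the mismatch concrete, below is a minimal sketch (PyTorch, with hypothetical tensor shapes and a made-up drifting box) contrasting the two pooling orders: temporal pooling followed by a single keyframe-box RoI crop, versus per-frame RoI crops that follow the trajectory. It only illustrates the issue discussed above and is not the released implementation.

```python
import torch
from torchvision.ops import roi_align

d, T, H, W = 256, 32, 14, 14                      # hypothetical backbone output shape
feat = torch.randn(1, d, T, H, W)                 # 3D-CNN feature map of one video segment
# One instance whose box drifts to the right over time (x1, y1, x2, y2 in feature-map coords).
traj = torch.stack([torch.tensor([2.0 + 0.15 * t, 3.0, 8.0 + 0.15 * t, 11.0]) for t in range(T)])
key_box = traj[T // 2]                            # box on the keyframe only

# (a) Conventional order: average over time first, then one RoI crop at the keyframe box.
#     For a moving instance this mixes in background and other instances from other frames.
pooled_time_first = roi_align(feat.mean(dim=2), [key_box[None]], output_size=(7, 7))

# (b) Trajectory-aware order: crop each frame at that frame's box, then average over time,
#     so the pooled feature always follows the instance (the order adopted in Section 3.2).
per_frame = [roi_align(feat[:, :, t], [traj[t][None]], output_size=(7, 7)) for t in range(T)]
pooled_roi_first = torch.stack(per_frame).mean(dim=0)   # (1, d, 7, 7)
```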

Keeping the aforementioned ideas in mind, in this paper we propose a Spatial-Temporal baseline for Human-Object Interaction detection in videos, or ST-HOI, which makes accurate HOI predictions with instance-wise spatial-temporal features based on trajectories. As illustrated in Figure 1(b), three kinds of such features are exploited in ST-HOI: (a) trajectory features (moving bounding boxes; shown as the red arrow), (b) correctly-localized visual features (shown as the yellow arrow), and (c) spatial-temporal actor poses (shown as the green arrow).

The contribution of our work is three-fold. First, we are among the first to identify the feature-inconsistency issue existing in naive 3D models, which we address with simple yet “correct” spatial-temporal feature pooling. Second, we propose a spatial-temporal model which utilizes correctly-localized visual features, per-frame box coordinates and a novel, temporal-aware masking pose module to effectively detect video-based HOIs. Third, we establish the keyframe-based VidHOI benchmark to motivate research in detecting spatial-temporal-aware interactions and hopefully inspire VideoHOI approaches utilizing multi-modal data, i.e., video frames, texts (semantic object/relation labels) and audio.

2. Related Work

2.1. Human-Object Interaction (HOI)

HOI detection aims to reason over interactions between humans (actors) and target objects. HOI is closely related to visual relationship detection (Lu et al., 2016; Chiou et al., 2021) and scene graph generation (Xu et al., 2017), in which the subject in (subject-predicate-object) is not restricted to a human. HOI in static images has been intensively studied recently (Smith et al., 2013; Chao et al., 2018; Gao et al., 2018; Qi et al., 2018; Li et al., 2019; Wang et al., 2019; Wan et al., 2019; Gupta et al., 2019; Li et al., 2020; Wang et al., 2020; Liao et al., 2020; Ulutan et al., 2020; Zhong et al., 2020). Most of the existing methods can be divided into two categories by the order of human-object pair proposal and interaction classification. The first group (Gao et al., 2018; Chao et al., 2018; Li et al., 2019; Wang et al., 2019; Wan et al., 2019; Gupta et al., 2019; Ulutan et al., 2020; Li et al., 2020) performs human-object pair generation followed by interaction classification, while the second group (Gkioxari et al., 2018; Qi et al., 2018; Wang et al., 2020; Liao et al., 2020) first predicts the most probable interactions performed by a person and then associates them with the most-likely objects. Our ST-HOI belongs to the first group as we establish a temporal model based on trajectories (continuous object proposals).

In contrast to the popularity of image-based HOI, there are only a few studies on VideoHOI (Koppula and Saxena, 2015; Jain et al., 2016; Qi et al., 2018; Sunkesula et al., 2020) and, to the best of our knowledge, all of them conducted experiments on the CAD-120 (Koppula et al., 2013) dataset. In CAD-120, the interactions are defined by merely 10 high-level activities (e.g., making cereal or microwaving food) in 120 RGB-D videos. This setting does not reflect real-life scenarios where machines may be asked to understand more fine-grained actions. Moreover, previous methods (Koppula and Saxena, 2015; Jain et al., 2016; Qi et al., 2018) adopted pre-computed hand-crafted features such as SIFT (Lowe, 2004), which have been outperformed by deep neural networks, as well as ground truth features including 3D poses and depth information from RGB-D videos, which are unlikely to be available in real-life scenarios. While (Sunkesula et al., 2020) adopted a ResNet (He et al., 2016) as their backbone, their method is inefficient by requiring $M \times N$ computations for extracting $M$ humans’ and $N$ objects’ features. Different from these existing methods, we evaluate on a larger and more diversified video HOI benchmark dubbed VidHOI, which includes annotations of 50 predicates on thousands of videos. We then propose a spatial-temporal HOI baseline that operates on RGB videos and does not utilize any additional information.

2.2. Spatial-Temporal Action Detection (STAD)

STAD aims to localize actors and detect the associated actions (without considering interacting objects). One of the most popular benchmarks for STAD is AVA (Gu et al., 2018), where the annotation is done at a sampling frequency of 1 Hz and the performance is measured by frame-wise mean AP. We followed this annotation and evaluation style when constructing VidHOI, where we converted the original labels into the same format.

As explained in section 1, a standard approach to STAD (Tran et al., 2015; Carreira and Zisserman, 2017) is extracting spatial-temporal feature maps with a 3D-CNN followed by RoI pooling to crop human features, which are then classified by linear layers. As shown in Figure 2(a), a naive modification that incorporates RoI-pooled human/object features does not work for VideoHOI. In contrast, our ST-HOI tackles VideoHOI by incorporating multiple temporal features including trajectories, correctly-localized visual features and spatial-temporal masking pose features.
第 1 节所述,STAD 的标准方法(Tran 等人 ,2015 年;Carreira 和 Zisserman,2017 年)正在使用 3D-CNN 提取时空特征图,然后进行 RoI 池化以裁剪人类特征,然后按线性层进行分类。如图 2(a) 所示,包含 RoI 池化人/物体特征的朴素修改不适用于 VideoHOI。相比之下,我们的 ST-HOI 通过结合多种时间特征(包括轨迹、正确定位的视觉特征和时空掩蔽姿势特征)来解决 VideoHOI。

Figure 3. An illustration of the two proposed spatial-temporal features. (a) In contrast to performing RoI pooling followed by temporal pooling as in (Wu et al., 2019; Feichtenhofer et al., 2019), we adopt the reverse approach: instance feature maps are first RoI-pooled frame-wise using trajectories and then average-pooled along the time axis to obtain correctly-localized visual features. (b) Given $N$ object trajectories (including $M$ humans), for each frame we utilize a trained human pose prediction model (e.g., (Fang et al., 2017)) to generate a 2D actor pose feature and extract a dual spatial mask for each of the $M \times (N-1)$ valid pairs. The pose feature and the mask are concatenated and down-sampled, followed by two 3D convolution layers and spatial-temporal pooling to generate the masking pose features.

2.3. Spatial-Temporal Scene Graph Generation

Spatial-Temporal Scene Graph Generation (STSGG) (Ji et al., 2020) aims to generate symbolic graphs representing pairwise visual relationships in video frames. A new benchmark, Action Genome, is also proposed in (Ji et al., 2020) to facilitate research in STSGG. Ji et al. (Ji et al., 2020) dealt with STSGG by combining off-the-shelf scene graph generation models with a long-term feature bank (Wu et al., 2019) on top of a 2D- or 3D-CNN, where they found that the 3D-CNN actually undermines the performance. While observing similar results on VidHOI (Figure 2(a)), we go one step further and find that the underlying reason is that RoI features across frames are erroneously pooled. We correct this by utilizing object trajectories and applying Tube-of-Interest (ToI) pooling on the generated trajectories to obtain correctly-localized position information and feature maps throughout video segments.

3. Methodology

3.1. Overview

We follow STAD approaches (Tran et al., 2015; Carreira and Zisserman, 2017; Feichtenhofer et al., 2019) to detect VideoHOI in a keyframe-centric strategy. Denote $V$ as a video which has $T$ keyframes sampled at a frequency of 1 Hz, $\{I_t\}, t = \{1, \ldots, T\}$, and denote $C$ as the number of pre-defined interaction classes. Given $N$ instance trajectories including $M$ human trajectories ($M \leq N$) in a video segment centered at the target frame, for human $m \in \{1, \ldots, M\}$ and object $n \in \{1, \ldots, N\}$ in keyframe $I_t$, we aim to detect pairwise human-object interactions $r_t \in \{0, 1\}^C$, where each entry $r_{t,c}, c \in \{1, \ldots, C\}$ indicates whether interaction $c$ exists or not.

Refer to Figure 1(b) for an illustration of our ST-HOI. Our model takes in a video segment (a sequence of $T$ frames) centered at $I_t$ and utilizes a 3D-CNN as the backbone to extract spatial-temporal feature maps of the whole segment. To rectify the mismatch caused by temporal-RoI pooling, based on the $N$ object (including human) trajectories $\{j_i\}, i = \{1, \ldots, N\}, j_i \in \mathbb{R}^{T \times 4}$, we generate temporal-aware features including correctly-localized visual features and spatial-temporal masking pose features. These features, together with the trajectories, are concatenated and classified by linear layers. Note that we aim at a simple but effective temporal-aware baseline for VideoHOI, so we do not utilize tricks from STAD such as the non-local block (Wang et al., 2018) or the long-term feature bank (Wu et al., 2019), nor from image-based HOI such as interactiveness (Li et al., 2019), though we note that these may be used to boost the performance.

3.2. Correctly-localized Visual Features

We have discussed in previous sections the issue of inappropriately pooled RoI features. We propose to tackle this issue by reversing the order of temporal pooling and RoI pooling. This approach has recently been proposed in (Hou et al., 2017) and is named tube-of-interest pooling (ToIPool). Refer to Figure 3(a) for an illustration. Denote $v \in \mathbb{R}^{d \times T \times H \times W}$ as the output of the penultimate layer of our 3D-CNN backbone, and denote $v_t \in \mathbb{R}^{d \times H \times W}$ as the $t$-th feature map along the time axis. Recall that we have $N$ trajectories centered at a keyframe. Following the conventional way, we also exploit visual context when predicting an interaction, which is done by utilizing the union bounding box feature of a human and an object. For example, the sky between a human and a kite could help to infer the correct interaction fly. Recall that $j_i$ represents the trajectory of object $i$, where we further denote $j_{i,t}$ as its 2D bounding box at time $t$. The spatial-temporal instance features $\{\bar{v}_i\}$ are then obtained using ToIPool with RoIAlign (He et al., 2017) by

(1)  $\bar{v}_i = \frac{1}{T} \sum_{t=1}^{T} \text{RoIAlign}(v_t, j_{i,t}),$

where $\bar{v}_i \in \mathbb{R}^{d \times h \times w}$, and $h$ and $w$ denote the height and width of the pooled feature maps, respectively. $\bar{v}_i$ is flattened before being concatenated with other features.
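A minimal sketch of Eq. (1) is given below, assuming PyTorch/torchvision, an $N \times T \times 4$ trajectory tensor already expressed in feature-map coordinates, and illustrative tensor sizes; the function name toi_pool and the toy trajectories are ours, not the released code.

```python
import torch
from torchvision.ops import roi_align

def toi_pool(v, trajectories, out_size=(7, 7)):
    """v: (d, T, H, W) backbone feature map; trajectories: (N, T, 4) per-frame boxes
    (x1, y1, x2, y2). RoI-aligns each instance frame by frame along its trajectory and
    averages over time, i.e., Eq. (1). Returns (N, d, h, w)."""
    d, T, H, W = v.shape
    N = trajectories.shape[0]
    pooled = v.new_zeros(N, T, d, out_size[0], out_size[1])
    for t in range(T):
        v_t = v[:, t].unsqueeze(0)                   # (1, d, H, W): the t-th feature map
        boxes = [trajectories[:, t]]                 # one (N, 4) box tensor for batch index 0
        pooled[:, t] = roi_align(v_t, boxes, output_size=out_size)
    return pooled.mean(dim=1)                        # temporal average -> (N, d, h, w)

# Example: pool features for a human, an object and their union-box trajectory.
v = torch.randn(256, 32, 14, 14)
traj = torch.rand(3, 32, 4).sort(dim=-1).values * 14     # crude but valid boxes for the sketch
instance_feats = toi_pool(v, traj)                       # (3, 256, 7, 7), flattened before fusion
```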

3.3. Spatial-Temporal Masking Pose Features

Human poses have been widely utilized in image-based HOI methods (Li et al., 2019; Gupta et al., 2019; Wan et al., 2019) to exploit characteristic actor poses for inferring certain actions. In addition, some existing works (Wang et al., 2019; Wan et al., 2019) found that spatial information can be used to identify interactions. For instance, for human-ride-horse one can imagine the actor’s skeleton with the legs spread apart (on the horse's sides), and the bounding box center of the human is usually above that of the horse. However, none of the existing works consider this mechanism in the temporal domain: when riding a horse, the human should be moving together with the horse. We argue that this temporality is an important property and should be utilized as well.

The spatial-temporal masking pose module is presented in Figure 3(b). Given $M$ human trajectories, we first generate $M$ spatial-temporal pose features with a trained human pose prediction model. On frame $t$, the predicted human pose $h_{i,t} \in \mathbb{R}^{17 \times 2}, i = \{1, \ldots, M\}, t = \{1, \ldots, T\}$ is defined as 17 joint points mapped to the original image. We transform $h_{i,t}$ into a skeleton on a binary mask with $f_h: \{h_{i,t}\} \in \mathbb{R}^{17 \times 2} \to \{\bar{h}_{i,t}\} \in \mathbb{R}^{1 \times H \times W}$ by connecting the joints using lines, where each line has a distinct value $x \in [0, 1]$. This helps the model to recognize and differentiate different poses.

For each of the $M \times (N-1)$ valid human-object pairs on frame $t$, we also generate two spatial masks $s_{i,t} \in \mathbb{R}^{2 \times H \times W}, i = \{1, \ldots, M \times (N-1)\}$, corresponding to the human and the object respectively, where the values inside each bounding box are ones and those outside are zeroed out. These masks enable our model to predict HOIs with reference to important spatial information.

For each pair, we concatenate the skeleton mask $\bar{h}_{i,t}$ and the spatial masks $s_{i,t}$ along the first dimension to get the initial spatial masking pose feature $p_{i,t} \in \mathbb{R}^{3 \times H \times W}$:

(2)  $p_{i,t} = [s_{i,t}; \bar{h}_{i,t}].$

We then down-sample $\{p_{i,t}\}$, feed it into two 3D convolutional layers with spatial and temporal pooling, and flatten the output to obtain the final spatial-temporal masking pose feature $\{\bar{p}_{i,t}\}$.
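Below is a minimal sketch of this module, assuming PyTorch, COCO-style 17-keypoint poses and illustrative limb/box definitions; the rasterization, channel sizes and layer hyper-parameters are placeholders rather than the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, H, W = 32, 224, 224

def skeleton_mask(joints, limbs, H, W):
    """joints: (17, 2) pixel coordinates. Draws each limb as a line of a distinct value
    in [0, 1] on a (1, H, W) mask (nearest-pixel rasterization, kept simple for brevity)."""
    mask = torch.zeros(1, H, W)
    for k, (a, b) in enumerate(limbs):
        p, q = joints[a], joints[b]
        for s in torch.linspace(0, 1, steps=64):          # sample points along the limb
            x = int((1 - s) * p[0] + s * q[0])
            y = int((1 - s) * p[1] + s * q[1])
            if 0 <= x < W and 0 <= y < H:
                mask[0, y, x] = (k + 1) / len(limbs)
    return mask

def box_mask(box, H, W):
    """Binary mask with ones inside the (x1, y1, x2, y2) box and zeros outside."""
    x1, y1, x2, y2 = [int(c) for c in box]
    m = torch.zeros(1, H, W)
    m[0, y1:y2, x1:x2] = 1.0
    return m

limbs = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16)]  # arms/legs
human_box, object_box = [30, 40, 120, 200], [100, 90, 180, 210]                    # toy boxes

# Eq. (2) per frame: [human/object spatial masks; skeleton mask] -> (3, H, W), stacked over T.
p = torch.stack(
    [torch.cat([box_mask(human_box, H, W),
                box_mask(object_box, H, W),
                skeleton_mask(torch.rand(17, 2) * 200, limbs, H, W)]) for _ in range(T)],
    dim=1)                                                          # (3, T, H, W)

down = F.interpolate(p.unsqueeze(0), size=(T, 56, 56))              # spatial down-sampling
conv3d = nn.Sequential(nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
                       nn.Conv3d(64, 64, kernel_size=3, padding=1), nn.ReLU())
pose_feat = conv3d(down).mean(dim=(2, 3, 4))                        # spatial-temporal pooling -> (1, 64)
```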

Table 1. A comparison of our benchmark VidHOI with existing STAD (AVA (Gu et al., 2018)), image-based (HICO-DET (Chao et al., 2018) and V-COCO (Gupta and Malik, 2015)) and video-based (CAD-120 (Koppula et al., 2013) and Action Genome (Ji et al., 2020)) HOI datasets. VidHOI is the only dataset that provides temporal information from video clips and complete multi-person and interacting-object annotations. VidHOI also provides the most annotated keyframes and defines the most HOI categories among the existing video datasets. †Two fewer categories as we combine adult, child and baby into a single category, person.

| Dataset | Video dataset? | Localized object? | Video hours | # Videos | # Annotated images/frames | # Object categories | # Predicate categories | # HOI categories | # HOI instances |
|---|---|---|---|---|---|---|---|---|---|
| HICO-DET (Chao et al., 2018) | ✗ | ✓ | - | - | 47K | 80 | 117 | 600 | 150K |
| V-COCO (Gupta and Malik, 2015) | ✗ | ✓ | - | - | 10K | 80 | 25 | 259 | 16K |
| AVA (Gu et al., 2018) | ✓ | ✗ | 108 | 437 | 3.7M | - | 49 | 80 | 1.6M |
| CAD-120 (Koppula et al., 2013) | ✓ | ✓ | 0.57 | 0.5K | 61K | 13 | 6 | 10 | 32K |
| Action Genome (Ji et al., 2020) | ✓ | △ | 82 | 10K | 234K | 35 | 25 | 157 | 1.7M |
| VidHOI | ✓ | ✓ | 70 | 7122 | 7.3M | 78† | 50 | 557 | 755K |

3.4. Prediction

We fuse the aforementioned features, including the correctly-localized visual features $\bar{v}$, the spatial-temporal masking pose features $p$, and the instance trajectories $j$, by concatenating them along the last axis:

(3)  $v_{\text{so}} = [\bar{v}_s; \bar{v}_u; \bar{v}_o; j_s; j_o; \bar{p}_{so}],$

where we slightly abuse the notation to denote the subscripts $s$ as the subject, $o$ as the object and $u$ as their union region. $v_{\text{so}}$ is then fed into two linear layers, with the final output size being the number of interaction classes in the dataset. Since VideoHOI is essentially a multi-label learning task, we train the model with a per-class binary cross-entropy loss.
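A minimal sketch of the fusion and classification head with the per-class binary cross-entropy loss is shown below; the feature dimensions and layer widths are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

C = 50                                        # number of interaction classes
d_vis, d_traj, d_pose = 256 * 7 * 7, 32 * 4, 64
in_dim = 3 * d_vis + 2 * d_traj + d_pose      # [v_s; v_u; v_o; j_s; j_o; p_so], Eq. (3)

head = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, C))
criterion = nn.BCEWithLogitsLoss()            # one independent binary decision per class

# One human-object pair: subject/union/object visual features, two trajectories, pose feature.
v_s, v_u, v_o = (torch.randn(1, d_vis) for _ in range(3))
j_s, j_o = torch.randn(1, d_traj), torch.randn(1, d_traj)
p_so = torch.randn(1, d_pose)

v_so = torch.cat([v_s, v_u, v_o, j_s, j_o, p_so], dim=-1)     # Eq. (3)
logits = head(v_so)                                           # (1, C) interaction scores
loss = criterion(logits, torch.zeros(1, C))                   # multi-hot ground-truth labels
```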

During inference, we follow the heuristics in image-based HOI (Chao et al., 2018) to sort all possible pairs by their softmax scores and evaluate only the top 100 predictions.

4. Experiments

4.1. Dataset and Performance Metric

While we have discussed in Section 2.1 the problem of lacking a suitable VideoHOI dataset by analyzing CAD-120 (Koppula et al., 2013), we further explain why Action Genome (Ji et al., 2020) is also not a feasible choice here. First, the authors acknowledged that the dataset is still incomplete and contains incorrect labels (Ji, 2020). Second, Action Genome is produced by annotating Charades (Sigurdsson et al., 2016), which is originally designed for activity classification where each clip contains only one “actor” performing predefined tasks; should any other people show up, there is neither a bounding box nor an interaction label for them. Finally, the videos are purposely generated by volunteers and are thus rather unnatural. In contrast, VidHOI is based on VidOR (Shang et al., 2019), which is densely annotated with all humans and predefined objects showing up in each frame. VidOR is also more challenging as the videos are user-generated (rather than produced by volunteers) and thus jittery at times. A comparison of VidHOI and the existing STAD and HOI datasets is presented in Table 1.

VidOR was originally collected for video visual relationship detection, where the evaluation is trajectory-based. The volumetric Intersection over Union (vIoU) between a trajectory and a ground truth needs to be over 0.5 before its relationship prediction is considered; however, how to obtain accurate trajectories with correct start and end timestamps remains challenging (Sun et al., 2019; Shang et al., 2017). We notice that some image-based HOI datasets (e.g., HICO-DET (Chao et al., 2018) and V-COCO (Gupta and Malik, 2015)) as well as STAD datasets (e.g., AVA (Gu et al., 2018)) use a keyframe-centered evaluation strategy, which bypasses the aforementioned issue. We thus adopt the same strategy and follow AVA to sample keyframes at a 1 FPS frequency, where the annotations on the keyframe at timestamp $t$ are assumed to be fixed for $t \pm 0.5$ s. In detail, we first filter out keyframes that do not contain at least one valid human-object pair, followed by transforming the labels from video-clip-based to keyframe-based to align with common HOI metrics (i.e., frame mAP). We follow the original VidOR split in (Shang et al., 2019) to divide VidHOI into a training set comprising 193,911 keyframes in 6,366 videos and a validation set (the VidOR testing set is not publicly available) with 22,808 keyframes in 756 videos. As shown in Figure 4, there are 50 relation classes including actions (e.g., push, pull, lift, etc.) and spatial relations (e.g., next to, behind, etc.). While half (25) of the predicate classes are temporal-related, they account for merely ∼5% of the dataset.
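As a rough sketch of this keyframe-centered conversion (the annotation layout below is hypothetical and simplified; VidOR stores per-frame trajectories and clip-level relation triplets), the 1 FPS sampling and filtering could look like:

```python
def to_keyframe_annotations(fps, num_frames, relations, boxes):
    """relations: iterable of (subject_id, predicate, object_id, begin_frame, end_frame);
    boxes: dict mapping (instance_id, frame_idx) -> (x1, y1, x2, y2).
    Returns {keyframe_idx: [(subject_box, predicate, object_box), ...]} sampled at 1 Hz,
    dropping keyframes without at least one valid human-object pair."""
    annotations = {}
    for kf in range(0, num_frames, int(fps)):             # one keyframe per second
        pairs = [(boxes[(s, kf)], predicate, boxes[(o, kf)])
                 for s, predicate, o, begin, end in relations
                 if begin <= kf < end and (s, kf) in boxes and (o, kf) in boxes]
        if pairs:                                          # keep only keyframes with a valid pair
            annotations[kf] = pairs
    return annotations
```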

Figure 4. Predicate distribution of the VidHOI benchmark shows that most of the predicates are non-temporal-related.
Figure 5. Performance comparison in predicate-wise AP (pAP). The performance boost after adding trajectory features is observed for most of the predicates. Interestingly, both spatial (e.g., next to, behind) and temporal (e.g., towards, away) predicates benefit from the temporal-aware features. Predicates are sorted by the number of occurrences. Models are in Oracle mode.

Following the evaluation metric in HICO-DET, we adopt mean Average Precision (mAP), where a true positive HOI needs to meet the three criteria below: (a) both the predicted human and object bounding boxes overlap with the ground truth boxes with an IoU over 0.5, (b) the predicted target category matches and (c) the predicted interaction is correct. Over the 50 predicates, we follow HICO-DET to define HOI categories as 557 triplets on which we compute the mean AP. By defining HOI categories with triplets we can bypass the polysemy problem (Zhong et al., 2020), i.e., the same predicate word can represent very different meanings when paired with distinct objects, e.g., person-fly-kite and person-fly-airplane. We report the mean AP over three category sets: (a) Full: all 557 categories are evaluated, (b) Rare: 315 categories with fewer than 25 instances in the dataset, and (c) Non-rare: 242 categories with at least 25 instances in the dataset. We also examine the models in two evaluation modes: Oracle models are trained and tested with ground truth trajectories, while models in Detection mode are tested with predicted trajectories.
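A minimal sketch of the triplet true-positive test implied by criteria (a)-(c) is given below (plain Python; the dictionary keys are illustrative, and score ranking plus greedy matching against ground truths is omitted for brevity).

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def is_true_positive(pred, gt):
    """pred/gt: dicts with 'human_box', 'object_box', 'object_class' and 'interaction'."""
    return (iou(pred["human_box"], gt["human_box"]) > 0.5 and      # (a) human box overlaps
            iou(pred["object_box"], gt["object_box"]) > 0.5 and    # (a) object box overlaps
            pred["object_class"] == gt["object_class"] and         # (b) target category matches
            pred["interaction"] == gt["interaction"])              # (c) interaction is correct
```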

4.2. Implementation Details

We adopt ResNet-50 (He et al., 2016) as our 2D backbone for the preliminary experiments, and utilize a ResNet-50-based SlowFast (Feichtenhofer et al., 2019) as our 3D backbone for all the other experiments. SlowFast contains the Slow and Fast pathways, which capture texture details and temporal information, respectively, by sampling video frames at different frequencies. For a 64-frame segment centered at the keyframe, $T = 32$ frames are alternately sampled to feed into the Slow pathway; only $T/\alpha$ frames are fed into the Fast pathway, where $\alpha = 8$ in our experiments. We use FastPose (Fang et al., 2017) to predict human poses and adopt the predicted trajectories generated by a cascaded model of video object detection, temporal NMS and a tracking algorithm (Sun et al., 2019). As with object detection for 2D HOI detection, trajectory generation is an essential module but not a main focus of this work. If a bounding box is not available in neighboring frames (i.e., the trajectory is shorter than $T$ or not continuous throughout the segment), we fill it with a whole-image box. We train all models from scratch for 20 epochs with an initial learning rate of $1 \times 10^{-2}$, using a step decay schedule that reduces the learning rate by $10\times$ at the 10th and 15th epochs. We optimize our models using synchronized SGD with a momentum of 0.9 and a weight decay of $10^{-7}$. We train each 3D video model with eight NVIDIA Tesla V100 GPUs with a batch size of 128 (i.e., 16 examples per GPU), except for the full model where we set the batch size to 112 due to memory restrictions. We train the 2D model with a single V100 with a batch size of 128.
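The whole-image fallback for short or discontinuous trajectories could be implemented as in the following sketch (hypothetical data layout: one optional box per frame).

```python
def pad_trajectory(boxes_per_frame, T, width, height):
    """boxes_per_frame: list of length T with an (x1, y1, x2, y2) tuple or None per frame.
    Frames without a detected box fall back to the whole image so that RoIAlign along the
    trajectory stays well defined."""
    whole_image = (0.0, 0.0, float(width), float(height))
    return [box if box is not None else whole_image for box in boxes_per_frame[:T]]
```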

During training, following the strategy in SlowFast, we randomly scale the shorter side of the video to a value in $[256, 320]$ pixels, followed by random horizontal flipping and random cropping to $224 \times 224$ pixels. During inference, we only resize the shorter side of the video segment to 224 pixels.
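For completeness, a sketch of this augmentation on a (T, C, H, W) frame tensor is shown below; in practice the trajectory boxes would be rescaled, flipped and cropped consistently, which is omitted here.

```python
import random
import torch
import torch.nn.functional as F

def augment(frames):
    """frames: (T, C, H, W) float tensor of one video segment."""
    T, C, H, W = frames.shape
    short = random.randint(256, 320)                     # random shorter-side scale
    scale = short / min(H, W)
    frames = F.interpolate(frames, scale_factor=scale, mode="bilinear", align_corners=False)
    if random.random() < 0.5:                            # random horizontal flip
        frames = torch.flip(frames, dims=[-1])
    _, _, H, W = frames.shape
    top, left = random.randint(0, H - 224), random.randint(0, W - 224)
    return frames[..., top:top + 224, left:left + 224]   # random 224 x 224 crop
```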

4.3. Quantitative Results

Table 2. Results of the baselines and our ST-HOI on the VidHOI validation set (numbers in mAP). There are two evaluation modes: Detection and Oracle, which differ only in the use of predicted or ground truth trajectories during inference. T: Trajectory features. V: Correctly-localized visual features. P: Spatial-temporal masking pose features. “%” means the Full mAP change compared to the 2D model.

| Mode | Model | Full | Non-rare | Rare | % |
|---|---|---|---|---|---|
| Oracle | 2D model (Wan et al., 2019) | 14.1 | 22.9 | 11.3 | - |
| Oracle | 3D model | 14.4 | 23.0 | 12.6 | 2.1 |
| Oracle | Ours-T | 17.3 | 26.9 | 16.8 | 22.7 |
| Oracle | Ours-T+V | 17.3 | 26.9 | 16.3 | 22.7 |
| Oracle | Ours-T+P | 17.4 | 27.1 | 16.4 | 23.4 |
| Oracle | Ours-T+V+P | 17.6 | 27.2 | 17.3 | 24.8 |
| Detection | 2D model (Wan et al., 2019) | 2.6 | 4.7 | 1.7 | - |
| Detection | 3D model | 2.6 | 4.9 | 1.9 | 0.0 |
| Detection | Ours-T | 3.0 | 5.5 | 2.0 | 15.4 |
| Detection | Ours-T+V | 3.1 | 5.8 | 2.0 | 19.2 |
| Detection | Ours-T+P | 3.2 | 6.1 | 2.0 | 23.1 |
| Detection | Ours-T+V+P | 3.1 | 5.9 | 2.1 | 19.2 |

Since we aim to deal with a) the lack of temporal-aware features in 2D HOI methods, b) the feature inconsistency issue in common 3D HOI methods and c) the lack of a VideoHOI benchmark, we mainly compare with the 2D model (Wan et al., 2019) and its naive 3D variant on VidHOI to understand if our ST-HOI addresses these issues effectively.

The performance comparison between our full ST-HOI model (Ours-T+V+P) and the baselines (2D model, 3D model) is presented in Table 2, in which we also present ablation studies on our different features (modules), including trajectory features (T), correctly-localized visual features (V) and spatial-temporal masking pose features (P). Table 2 shows that the 3D model only has a marginal improvement over the 2D model (overall ∼2%) under all settings in both evaluation modes. In contrast, adding trajectory features (Ours-T) leads to a much larger 23% improvement in Oracle mode or 15% in Detection mode, showing the importance of correct spatial-temporal information. We also find that adding the additional temporal-aware features (i.e., V and P) attains increasingly higher mAPs, and our full model (Ours-T+V+P) reports the best mAPs in Oracle mode, achieving the highest ∼25% relative improvement. We notice that the performance of Ours-T+V is close to that of Ours-T under the Oracle setting, which is possibly because the ground truth trajectories (T) already provide enough “correctly-localized” information so that the correctly-localized visual features do not help much. We also note that the performance of Ours-T+P is slightly higher than that of Ours-T+V+P under Detection mode, which is presumably due to the same reason and the inferior quality of the predicted trajectories. The overall performance gap between Detection and Oracle models is significant, indicating the room for improvement in trajectory generation. Another interesting observation is that Full mAPs are very close to Rare mAPs, especially under Oracle mode, showing that the long-tail effect over HOIs is strong (but common and natural).

To understand the effect of temporal features on individual predicates, we compare predicate-wise AP (pAP) as shown in Figure 5. We observe that, again, under most circumstances naively replacing 2D backbones with 3D ones does not help video HOI detection. Both temporal predicates (e.g., towards, away, pull) and spatial predicates (e.g., next_to, behind, beneath) benefit from the additional temporal-aware features in ST-HOI. These findings verify our main idea about the essential use of trajectories and trajectory-based features. In addition, the additional features do not seem to contribute equally for different predicates. For instance, we see that while Ours-T+V+P performs the best on some predicates (e.g., behind and beneath), our sub-models achieve the highest pAP on other predicates (e.g., watch and ride). This is presumably because predicate-wise performance is heavily subject to the number of examples, where major predicates have 10-10000 times more examples than minor ones (as shown in Figure 4).

Figure 6. Results (in predicate-wise AP) of the baselines and our full model w.r.t. top frequent temporal predicates.
Table 3. Results of temporal-related and spatial-related (non-temporal) triplet mAP. T%/S% means the relative temporal/spatial mAP change compared to the 2D model (Wan et al., 2019).

| Mode | Model | Temporal | T% | Spatial | S% |
|---|---|---|---|---|---|
| Oracle | 2D model (Wan et al., 2019) | 8.3 | - | 18.6 | - |
| Oracle | 3D model | 7.7 | -7.2 | 20.9 | 12.3 |
| Oracle | Ours-T | 14.4 | 73.5 | 24.7 | 32.8 |
| Oracle | Ours-T+V | 13.6 | 63.9 | 24.6 | 32.3 |
| Oracle | Ours-T+P | 12.9 | 55.4 | 25.0 | 34.4 |
| Oracle | Ours-T+V+P | 14.4 | 73.5 | 25.0 | 34.4 |
| Detection | 2D model (Wan et al., 2019) | 1.5 | - | 2.7 | - |
| Detection | 3D model | 1.6 | 6.7 | 2.9 | 7.4 |
| Detection | Ours-T | 1.8 | 20.0 | 3.3 | 23.6 |
| Detection | Ours-T+V | 1.8 | 20.0 | 3.3 | 23.6 |
| Detection | Ours-T+P | 1.8 | 20.0 | 3.3 | 23.6 |
| Detection | Ours-T+V+P | 1.9 | 26.7 | 3.3 | 23.6 |

Since the majority of HOI examples are spatial-related (∼95%, as shown in Figure 4), the results above might not be suitable for demonstrating the temporal modeling ability of our proposed model. We thus focus on the performance on only temporal-related predicates in Figure 6, which demonstrates that Ours-T+V+P greatly outperforms the baselines on the most frequent temporal predicates. Table 3 presents the triplet mAPs of spatial-only and temporal-only predicates, showing that Ours-T significantly improves over the 2D model on temporal-only mAP by a relative +73.5%, in sharp contrast to the -7.2% of the 3D model in Oracle mode. Similar to our observation with Table 2, Ours-T performs on par with Ours-T+V+P on temporal-only predicates; however, it falls short on spatial-only predicates, showing that spatial/pose information is still essential for detecting spatial predicates. Overall, these results demonstrate the outstanding spatial-temporal modeling ability of our approach.

Figure 7. Performance comparison (in AP) of some temporal-related HOIs in the VidHOI validation set. Compared to the 2D model, the 3D model only shows limited improvement for the presented examples, while our ST-HOI variants provide a huge performance boost. Models are in Oracle mode.
Figure 8. Examples of video HOIs predicted by the 2D model (Wan et al., 2019) and our ST-HOI, both in Oracle mode. Each example consists of five consecutive keyframes sampled at 1 Hz, where an entry in the tables denotes whether a predicate between the subject (human; a green box) and the object (also a human in both cases; a red box) is detected correctly (True Positive) or not (False Positive or False Negative). Compared to the 2D baseline, our model predicts more accurate temporal HOIs (e.g., hold_hand_of in $T_4$ and $T_5$ of the upper example and lift in $T_1$ of the lower example). ST-HOI also produces fewer false positives in both examples.

We also compare the performance with respect to some HOI triplets in Figure 7. Similar to the results on predicate-wise AP, we observe a large gap between the naive 2D/3D models and our models with the temporal features. The ST-HOI variants are more accurate especially in predicting temporal-aware HOIs (hug/lean_on-person and push/pull-baby_walker). We also see in some examples that Ours-T+V+P does not perform the best among all the variants (e.g., lean_on-person), which is similar to the phenomenon we observed in Figure 5.

4.4. Qualitative Results

To understand the effectiveness of our proposed method, we visualize two video HOI examples from VidHOI predicted by the 2D model (Wan et al., 2019) and Ours-T+V+P (both in Oracle mode) in Figure 8. Each (upper and lower) example is a 5-second video segment (i.e., five keyframes) with an HOI prediction table where each entry is either a True Positive (TP), False Positive (FP), False Negative (FN) or True Negative (TN) for both models. The upper example shows that, compared to the 2D model, Ours-T+V+P makes more accurate HOI detections by successfully predicting hold_hand_of at $T_4$ and $T_5$. Moreover, Ours-T+V+P is able to predict interactions that require temporal information, such as lift at $T_1$ in the lower example. However, we can see that there is still room for improvement for Ours-T+V+P in the same example, where lift is not detected in the following $T_2$ to $T_4$ frames. Overall, our model produces fewer false positives throughout the dataset, which in turn contributes to its higher mAP and pAP.

5. Conclusion

In this paper, we addressed the inability of conventional HOI approaches to recognize temporal-aware interactions by re-focusing on neighboring video frames. We discussed the lack of a suitable setting and dataset for studying video-based HOI detection. We also identified a feature-inconsistency problem in a common video action detection baseline which arises from its improper order of RoI feature pooling and temporal pooling. To deal with the first issue, we established a new video HOI benchmark dubbed VidHOI and introduced a keyframe-centered detection strategy. We then proposed a spatial-temporal baseline, ST-HOI, which exploits trajectory-based temporal features including correctly-localized visual features, spatial-temporal masking pose features and trajectory features, solving the second problem. With quantitative and qualitative experiments on VidHOI, we showed that our model provides a huge performance boost compared to both the 2D and 3D baselines and is effective in differentiating temporal-related interactions. We expect the proposed baseline and dataset to serve as a solid starting point for the relatively underexplored VideoHOI task. Based on our baseline, we also hope to motivate further VideoHOI works to design advanced models with multi-modal data including video frames, semantic object/relation labels and audio.

Acknowledgment

This research is supported by Singapore Ministry of Education Academic Research Fund Tier 1 under MOE’s official grant number T1 251RES2029.

References

  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision. 2425–2433.
  • Carreira and Zisserman (2017) Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
  • Chao et al. (2018) Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. 2018. Learning to detect human-object interactions. In 2018 ieee winter conference on applications of computer vision (wacv). IEEE, 381–389.
  • Chiou et al. (2020) Meng-Jiun Chiou, Zhenguang Liu, Yifang Yin, An-An Liu, and Roger Zimmermann. 2020. Zero-Shot Multi-View Indoor Localization via Graph Location Networks. In Proceedings of the 28th ACM International Conference on Multimedia. 3431–3440.
  • Chiou et al. (2021) Meng-Jiun Chiou, Roger Zimmermann, and Jiashi Feng. 2021. Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations. IEEE Access 9 (2021), 50441–50451.
  • Fang et al. (2017) Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. 2017. RMPE: Regional Multi-person Pose Estimation. In ICCV.
  • Feichtenhofer et al. (2019) Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE international conference on computer vision. 6202–6211.
  • Gao et al. (2018) Chen Gao, Yuliang Zou, and Jia-Bin Huang. 2018. iCAN: Instance-centric attention network for human-object interaction detection. arXiv preprint arXiv:1808.10437 (2018).
  • Gkioxari et al. (2018) Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. 2018. Detecting and recognizing human-object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8359–8367.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
  • Gu et al. (2018) Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6047–6056.
  • Gupta and Malik (2015) Saurabh Gupta and Jitendra Malik. 2015. Visual Semantic Role Labeling. arXiv preprint arXiv:1505.04474 (2015).
  • Gupta et al. (2019) Tanmay Gupta, Alexander Schwing, and Derek Hoiem. 2019. No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In Proceedings of the IEEE International Conference on Computer Vision. 9677–9685.
  • He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961–2969.
  • He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In 2015 IEEE International Conference on Computer Vision (ICCV). 1026–1034.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hou et al. (2017) Rui Hou, Chen Chen, and Mubarak Shah. 2017. Tube convolutional neural network (T-CNN) for action detection in videos. In Proceedings of the IEEE international conference on computer vision. 5822–5831.
  • Jain et al. (2016) Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the ieee conference on computer vision and pattern recognition. 5308–5317.
  • Ji (2020) Jingwei Ji. 2020. Question about the annotations · Issue #2 · JingweiJ/ActionGenome. https://github.com/JingweiJ/ActionGenome/issues/2
  • Ji et al. (2020) Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. 2020. Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10236–10247.
  • Koppula et al. (2013) Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. 2013. Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research 32, 8 (2013), 951–970.
  • Koppula and Saxena (2015) Hema S Koppula and Ashutosh Saxena. 2015. Anticipating human activities using object affordances for reactive robotic response. IEEE transactions on pattern analysis and machine intelligence 38, 1 (2015), 14–29.
  • Li et al. (2020) Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, and Cewu Lu. 2020. PaStaNet: Toward Human Activity Knowledge Engine. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 382–391.
  • Li et al. (2019) Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yanfeng Wang, and Cewu Lu. 2019. Transferable interactiveness knowledge for human-object interaction detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3585–3594.
  • Liao et al. (2020) Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, and Jiashi Feng. 2020. Ppdm: Parallel point detection and matching for real-time human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 482–490.
  • Lo and Wang (2019) Chi-Hung Lo and Yi-Wen Wang. 2019. Constructing an Evaluation Model for User Experience in an Unmanned Store. Sustainability 11, 18 (2019).
  • Lowe (2004) David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision 60, 2 (2004), 91–110.
  • Lu et al. (2016) Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2016. Visual relationship detection with language priors. In European conference on computer vision. Springer, 852–869.
  • Qi et al. (2018) Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision (ECCV). 401–417.
  • Shang et al. (2019) Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval. 279–287.
  • Shang et al. (2017) Xindi Shang, Tongwei Ren, Jingfan Guo, Hanwang Zhang, and Tat-Seng Chua. 2017. Video visual relation detection. In Proceedings of the 25th ACM international conference on Multimedia. 1300–1308.
  • Sigurdsson et al. (2016) Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision. Springer, 510–526.
  • Smith et al. (2013) Brian A Smith, Qi Yin, Steven K Feiner, and Shree K Nayar. 2013. Gaze locking: passive eye contact detection for human-object interaction. In Proceedings of the 26th annual ACM symposium on User interface software and technology. 271–280.
  • Sun et al. (2019) Xu Sun, Tongwei Ren, Yuan Zi, and Gangshan Wu. 2019. Video visual relation detection via multi-modal feature fusion. In Proceedings of the 27th ACM International Conference on Multimedia. 2657–2661.
  • Sunkesula et al. (2020) Sai Praneeth Reddy Sunkesula, Rishabh Dabral, and Ganesh Ramakrishnan. 2020. LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos. In Proceedings of the 28th ACM International Conference on Multimedia. 691–699.
  • Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.
  • Ulutan et al. (2020) Oytun Ulutan, ASM Iftekhar, and Bangalore S Manjunath. 2020. VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13617–13626.
  • Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3156–3164.
  • Wan et al. (2019) Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. 2019. Pose-aware multi-level feature network for human object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision. 9469–9478.
  • Wang et al. (2019) Tiancai Wang, Rao Muhammad Anwer, Muhammad Haris Khan, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao, and Jorma Laaksonen. 2019. Deep contextual attention for human-object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision. 5694–5702.
  • Wang et al. (2020) Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, and Jian Sun. 2020. Learning Human-Object Interaction Detection using Interaction Points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4116–4125.
  • Wang et al. (2018) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7794–7803.
  • Wu et al. (2019) Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. 2019. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 284–293.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
  • Xu et al. (2017) Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5410–5419.
  • Yin et al. (2019) Yifang Yin, Meng-Jiun Chiou, Zhenguang Liu, Harsh Shrivastava, Rajiv Ratn Shah, and Roger Zimmermann. 2019. Multi-level fusion based class-aware attention model for weakly labeled audio tagging. In Proceedings of the 27th ACM International Conference on Multimedia. 1304–1312.
  • Zhang et al. (2017) Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. 2017. Towards reaching human performance in pedestrian detection. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2017), 973–986.
  • Zhong et al. (2020) Xubin Zhong, Changxing Ding, Xian Qu, and Dacheng Tao. 2020. Polysemy deciphering network for human-object interaction detection. In Proc. Eur. Conf. Comput. Vis.