
ST-HOI: A Spatial-Temporal Baseline for
Human-Object Interaction Detection in Videos

Meng-Jiun Chiou, National University of Singapore, Singapore
mengjiun.chiou@u.nus.edu
Chun-Yu Liao, ASUS Intelligent Cloud Services, Taipei, Taiwan
mist_liao@asus.com
Li-Wei Wang, ASUS Intelligent Cloud Services, Taipei, Taiwan
popo55668@gmail.com
Roger Zimmermann, National University of Singapore, Singapore
rogerz@comp.nus.edu.sg
Jiashi Feng, National University of Singapore, Singapore
elefjia@nus.edu.sg
(2021)
Abstract.

Detecting human-object interactions (HOI) is an important step toward a comprehensive visual understanding of machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporal-related HOIs (e.g., opening/closing a door) from a single video frame, where the neighboring frames play an essential role. However, conventional HOI methods operating on only static images have been used to predict temporal-related interactions, which is essentially guessing without temporal contexts and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI) utilizing temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI (the dataset and source code are available at https://github.com/coldmanck/VidHOI), where our proposed approach serves as a solid baseline.

Human-Object Interaction, Action Detection, Video Understanding
copyright: rightsretained
journalyear: 2021
conference: Proceedings of the 2021 Workshop on Intelligent Cross-Data Analysis and Retrieval; August 21–24, 2021; Taipei, Taiwan
booktitle: Proceedings of the 2021 Workshop on Intelligent Cross-Data Analysis and Retrieval (ICDAR ’21), August 21–24, 2021, Taipei, Taiwan
doi: 10.1145/3463944.3469097
isbn: 978-1-4503-8529-9/21/08
ccs: Human-centered computing
ccs: Computing methodologies Activity recognition and understanding

1. Introduction

Figure 1. An illustrative comparison between conventional HOI methods and our ST-HOI when inferencing on videos. (a) Traditional HOI approaches (e.g., the baseline in (Wan et al., 2019)) take in only the target frame and predict HOIs based on ROI-pooled visual features. These models are unable to differentiate between push, pull or lean on in this example due to the lack of temporal context. (b) ST-HOI takes in not only the target frame but neighboring frames and exploits temporal context based on trajectories. ST-HOI can thus differentiate temporal-related interactions and prefers push to other interactions in this example.
Figure 2. (a) Relative performance change (in percentage) on different video tasks when replacing 2D-CNN backbones with 3D ones (blue bars) (Tran et al., 2015; Carreira and Zisserman, 2017; Feichtenhofer et al., 2019), and on VideoHOI when adding trajectory features (tangerine bar). VideoHOI (in triplet mAP) is to detect HOI in videos and was performed by us on our VidHOI benchmark. STAD (Gu et al., 2018) (in triplet mAP) means Spatial-Temporal Action Detection and was performed on the AVA dataset (Gu et al., 2018). STSGG (Ji et al., 2020) (PredCls mode; in Recall@20) stands for Spatial-Temporal Scene Graph Generation and was performed on Action Genome (Ji et al., 2020). (b) An illustration of temporal-RoI pooling in 3D baselines (e.g., (Feichtenhofer et al., 2019)). Temporal pooling is usually applied to the output of the penultimate layer of a 3D-CNN (shape $d \times T \times H \times W$), which is average-pooled along the time axis into shape $d \times 1 \times H \times W$, followed by RoI pooling to obtain feature maps of shape $d \times 1 \times h \times w$. This temporal-RoI pooling, however, is equivalent to pooling the instance-of-interest feature at the same location as in the keyframe throughout the video segment, which is erroneous for moving humans and objects.

Thanks to the rapid development of deep learning (Goodfellow et al., 2016; He et al., 2016), machines are already surpassing or approaching human level performance in language tasks (Wu et al., 2016), acoustic tasks (Yin et al., 2019), and vision tasks (e.g., image classification (He et al., 2015) and visual place recognition (Chiou et al., 2020)). Researchers thus started to focus on how to replicate these successes to other semantically higher-level vision tasks (e.g., visual relationship detection (Lu et al., 2016; Chiou et al., 2021)) and vision-language tasks (e.g., image captioning (Vinyals et al., 2015) and visual question answering (Antol et al., 2015)) so that machines learn not just to recognize the objects but to understand their relationships and the contexts. Especially, human-object interaction (HOI) (Smith et al., 2013; Chao et al., 2018; Gao et al., 2018; Qi et al., 2018; Li et al., 2019; Wang et al., 2019; Wan et al., 2019; Gupta et al., 2019; Li et al., 2020; Wang et al., 2020; Liao et al., 2020; Ulutan et al., 2020; Zhong et al., 2020) aiming to detect actions and spatial relations among humans and salient objects in images/videos has attracted increasing attention, as we sometimes task machines to understand human behaviors, e.g., pedestrian detection (Zhang et al., 2017) and unmanned store systems (Lo and Wang, 2019).

Although there are abundant studies that have achieved success in detecting HOI in static images, the fact that few of them (Jain et al., 2016; Qi et al., 2018; Sunkesula et al., 2020) consider temporal information (i.e., neighboring frames before/after the target frame) when applied to video data means they are essentially “guessing” temporal-related HOIs with only naive co-occurrence statistics. While conventional image-based HOI methods (e.g., the baseline model in (Wan et al., 2019)) can be used for inference on videos, they treat input frames as independent and identically distributed (i.i.d.) data and make independent predictions for neighboring frames. However, video data are sequential and structured by nature and thus are not i.i.d. What is worse, without temporal context these methods are unable to differentiate (especially opposite) temporal interactions, such as push versus pull a human and open versus close a door. As shown in Figure 1(a), given a video segment, traditional HOI models operate on a single frame at a time and make predictions based on 2D-CNN (e.g., (He et al., 2016)) visual features. These models by nature cannot distinguish interactions between two people such as push, pull, lean on and chase, which are visually similar in static images. A possible reason why video-based HOI remains underexplored is the lack of a suitable video-based benchmark and a feasible setting. To bridge this gap, we first construct a video HOI benchmark from VidOR (Shang et al., 2019), dubbed VidHOI, where we follow the common protocol in video and HOI tasks and use a keyframe-centered strategy. With VidHOI, we advocate the use of video data and define VideoHOI as performing HOI detection with videos in both training and inference.

Spatial-Temporal Action Detection (STAD) is another task bearing a resemblance to VideoHOI, as it requires localizing the human and detecting the actions being performed in videos. Note that STAD does not consider the objects that a human is interacting with. STAD is usually tackled by first using a 3D-CNN (Tran et al., 2015; Carreira and Zisserman, 2017) as the backbone to encode temporal information into feature maps. This is followed by RoI pooling with object proposals to obtain actor features, which are then classified by linear layers. Essentially, this approach is similar to a common HOI baseline illustrated in Figure 1(a) and differs only in the use of 3D backbones and the absence of interacting objects. Based on conventional HOI and STAD methods, a naive yet intuitive idea arises: can we enjoy the best of both worlds by replacing 2D backbones with 3D ones and exploiting visual features of interacting objects? This idea, however, did not work straightforwardly in our preliminary experiment, where we replaced the backbone in the 2D baseline (Wan et al., 2019) with a 3D one (e.g., SlowFast (Feichtenhofer et al., 2019)) to perform VideoHOI. The relative change of performance after replacing the backbone is presented in the leftmost entry in Figure 2(a) with a blue bar. In the VideoHOI experiment, the 3D baseline provides only a limited relative improvement (∼2%), which is far from satisfactory considering the additional temporal context. In fact, this phenomenon has also been observed in two existing works under similar settings (Gu et al., 2018; Ji et al., 2020), where both the experiments in STAD and in another video task, Spatial-Temporal Scene Graph Generation (STSGG), present an even worse, counter-intuitive result: replacing the backbone is actually harmful (also presented as blue bars in Figure 2(a)). We probed the underlying reason by analyzing the architecture of these 3D baselines and found that, surprisingly, temporal pooling together with RoI pooling does not work reasonably. As illustrated in Figure 2(b), temporal pooling followed by RoI pooling, which is a common practice in conventional STAD methods, is equivalent to cropping features of the same region across the whole video segment without considering the way objects move. It is not unusual for moving humans and objects in neighboring frames to be absent from their locations in the target keyframe. Pooling features at the same location across time can therefore capture erroneous content such as other humans/objects or meaningless background. Dealing with this inconsistency, we propose to recover the missing spatial-temporal information in VideoHOI by considering human and object trajectories. The performance change of this temporal-augmented 3D baseline on VideoHOI is represented by the tangerine bar in Figure 2(a), where it achieves a ∼23% improvement, in sharp contrast to the ∼2% of the original 3D baseline. This experiment reveals the importance of incorporating the “correctly-localized” temporal information.
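To make the mismatch concrete, below is a minimal sketch (PyTorch, with hypothetical tensor shapes and a made-up drifting box) contrasting the two pooling orders: temporal pooling followed by a single keyframe-box RoI crop, versus per-frame RoI crops that follow the trajectory. It only illustrates the issue discussed above and is not the released implementation.

```python
import torch
from torchvision.ops import roi_align

d, T, H, W = 256, 32, 14, 14                      # hypothetical backbone output shape
feat = torch.randn(1, d, T, H, W)                 # 3D-CNN feature map of one video segment
# One instance whose box drifts to the right over time (x1, y1, x2, y2 in feature-map coords).
traj = torch.stack([torch.tensor([2.0 + 0.15 * t, 3.0, 8.0 + 0.15 * t, 11.0]) for t in range(T)])
key_box = traj[T // 2]                            # box on the keyframe only

# (a) Conventional order: average over time first, then one RoI crop at the keyframe box.
#     For a moving instance this mixes in background and other instances from other frames.
pooled_time_first = roi_align(feat.mean(dim=2), [key_box[None]], output_size=(7, 7))

# (b) Trajectory-aware order: crop each frame at that frame's box, then average over time,
#     so the pooled feature always follows the instance (the order adopted in Section 3.2).
per_frame = [roi_align(feat[:, :, t], [traj[t][None]], output_size=(7, 7)) for t in range(T)]
pooled_roi_first = torch.stack(per_frame).mean(dim=0)   # (1, d, 7, 7)
```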

Keeping the aforementioned ideas in mind, in this paper we propose a Spatial-Temporal baseline for Human-Object Interaction detection in videos, or ST-HOI, which makes accurate HOI predictions with instance-wise spatial-temporal features based on trajectories. As illustrated in Figure 1(b), three kinds of such features are exploited in ST-HOI: (a) trajectory features (moving bounding boxes; shown as the red arrow), (b) correctly-localized visual features (shown as the yellow arrow), and (c) spatial-temporal actor poses (shown as the green arrow).

The contribution of our work is three-fold. First, we are among the first to identify the feature-inconsistency issue existing in naive 3D models, which we address with simple yet “correct” spatial-temporal feature pooling. Second, we propose a spatial-temporal model which utilizes correctly-localized visual features, per-frame box coordinates and a novel, temporal-aware masking pose module to effectively detect video-based HOIs. Third, we establish the keyframe-based VidHOI benchmark to motivate research in detecting spatial-temporal-aware interactions and hopefully inspire VideoHOI approaches utilizing multi-modal data, i.e., video frames, texts (semantic object/relation labels) and audio.

2. Related Work

2.1. Human-Object Interaction (HOI)

HOI detection aims to reason over interactions between humans (actors) and target objects. HOI is closely related to visual relationship detection (Lu et al., 2016; Chiou et al., 2021) and scene graph generation (Xu et al., 2017), in which the subject in (subject-predicate-object) is not restricted to a human. HOI in static images has been intensively studied recently (Smith et al., 2013; Chao et al., 2018; Gao et al., 2018; Qi et al., 2018; Li et al., 2019; Wang et al., 2019; Wan et al., 2019; Gupta et al., 2019; Li et al., 2020; Wang et al., 2020; Liao et al., 2020; Ulutan et al., 2020; Zhong et al., 2020). Most of the existing methods can be divided into two categories by the order of human-object pair proposal and interaction classification. The first group (Gao et al., 2018; Chao et al., 2018; Li et al., 2019; Wang et al., 2019; Wan et al., 2019; Gupta et al., 2019; Ulutan et al., 2020; Li et al., 2020) performs human-object pair generation followed by interaction classification, while the second group (Gkioxari et al., 2018; Qi et al., 2018; Wang et al., 2020; Liao et al., 2020) first predicts the most probable interactions performed by a person and then associates them with the most-likely objects. Our ST-HOI belongs to the first group as we establish a temporal model based on trajectories (continuous object proposals).

In contrast to the popularity of image-based HOI, there are only a few studies on VideoHOI (Koppula and Saxena, 2015; Jain et al., 2016; Qi et al., 2018; Sunkesula et al., 2020) and, to the best of our knowledge, all of them conducted experiments on the CAD-120 (Koppula et al., 2013) dataset. In CAD-120, the interactions are defined by merely 10 high-level activities (e.g., making cereal or microwaving food) in 120 RGB-D videos. This setting does not reflect real-life scenarios where machines may be asked to understand more fine-grained actions. Moreover, previous methods (Koppula and Saxena, 2015; Jain et al., 2016; Qi et al., 2018) adopted pre-computed hand-crafted features such as SIFT (Lowe, 2004), which have been outperformed by deep neural networks, as well as ground truth features including 3D poses and depth information from RGB-D videos, which are unlikely to be available in real-life scenarios. While (Sunkesula et al., 2020) adopted a ResNet (He et al., 2016) as their backbone, their method is inefficient by requiring $M \times N$ computations for extracting $M$ humans’ and $N$ objects’ features. Different from these existing methods, we evaluate on a larger and more diversified video HOI benchmark dubbed VidHOI, which includes annotations of 50 predicates on thousands of videos. We then propose a spatial-temporal HOI baseline that operates on RGB videos and does not utilize any additional information.

2.2. Spatial-Temporal Action Detection (STAD)

STAD aims to localize actors and detect the associated actions (without considering interacting objects). One of the most popular benchmarks for STAD is AVA (Gu et al., 2018), where the annotation is done at a sampling frequency of 1 Hz and the performance is measured by frame-wise mean AP. We followed this annotation and evaluation style when constructing VidHOI, where we converted the original labels into the same format.

As explained in section 1, a standard approach to STAD (Tran et al., 2015; Carreira and Zisserman, 2017) is extracting spatial-temporal feature maps with a 3D-CNN followed by RoI pooling to crop human features, which are then classified by linear layers. As shown in Figure 2(a), a naive modification that incorporates RoI-pooled human/object features does not work for VideoHOI. In contrast, our ST-HOI tackles VideoHOI by incorporating multiple temporal features including trajectories, correctly-localized visual features and spatial-temporal masking pose features.
第 1 节所述,STAD 的标准方法(Tran 等人 ,2015 年;Carreira 和 Zisserman,2017 年)正在使用 3D-CNN 提取时空特征图,然后进行 RoI 池化以裁剪人类特征,然后按线性层进行分类。如图 2(a) 所示,包含 RoI 池化人/物体特征的朴素修改不适用于 VideoHOI。相比之下,我们的 ST-HOI 通过结合多种时间特征(包括轨迹、正确定位的视觉特征和时空掩蔽姿势特征)来解决 VideoHOI。

Figure 3. An illustration of the two proposed spatial-temporal features. (a) In contrast to performing RoI pooling followed by temporal pooling as in (Wu et al., 2019; Feichtenhofer et al., 2019), we adopt the reverse approach: instance feature maps are first RoI-pooled frame-wise using trajectories and then average-pooled along the time axis to obtain correctly-localized visual features. (b) Given $N$ object trajectories (including $M$ humans), for each frame we utilize a trained human pose prediction model (e.g., (Fang et al., 2017)) to generate a 2D actor pose feature and extract a dual spatial mask for each of the $M \times (N-1)$ valid pairs. The pose feature and the mask are concatenated and down-sampled, followed by two 3D convolution layers and spatial-temporal pooling to generate the masking pose features.

2.3. Spatial-Temporal Scene Graph Generation

Spatial-Temporal Scene Graph Generation (STSGG) (Ji et al., 2020) aims to generate symbolic graphs representing pairwise visual relationships in video frames. A new benchmark, Action Genome, is also proposed in (Ji et al., 2020) to facilitate research in STSGG. Ji et al. (Ji et al., 2020) dealt with STSGG by combining off-the-shelf scene graph generation models with a long-term feature bank (Wu et al., 2019) on top of a 2D- or 3D-CNN, where they found that the 3D-CNN actually undermines the performance. While observing similar results on VidHOI (Figure 2(a)), we go one step further and find that the underlying reason is that RoI features across frames are erroneously pooled. We correct this by utilizing object trajectories and applying Tube-of-Interest (ToI) pooling on the generated trajectories to obtain correctly-localized position information and feature maps throughout video segments.

3. Methodology

3.1. Overview

We follow STAD approaches (Tran et al., 2015; Carreira and Zisserman, 2017; Feichtenhofer et al., 2019) to detect VideoHOI in a keyframe-centric strategy. Denote $V$ as a video which has $T$ keyframes sampled at a frequency of 1 Hz, $\{I_t\}, t = \{1, \ldots, T\}$, and denote $C$ as the number of pre-defined interaction classes. Given $N$ instance trajectories including $M$ human trajectories ($M \leq N$) in a video segment centered at the target frame, for human $m \in \{1, \ldots, M\}$ and object $n \in \{1, \ldots, N\}$ in keyframe $I_t$, we aim to detect pairwise human-object interactions $r_t \in \{0, 1\}^C$, where each entry $r_{t,c}, c \in \{1, \ldots, C\}$ indicates whether interaction $c$ exists or not.

Refer to Figure 1(b) for an illustration of our ST-HOI. Our model takes in a video segment (a sequence of $T$ frames) centered at $I_t$ and utilizes a 3D-CNN as the backbone to extract spatial-temporal feature maps of the whole segment. To rectify the mismatch caused by temporal-RoI pooling, based on the $N$ object (including human) trajectories $\{j_i\}, i = \{1, \ldots, N\}, j_i \in \mathbb{R}^{T \times 4}$, we generate temporal-aware features including correctly-localized visual features and spatial-temporal masking pose features. These features, together with the trajectories, are concatenated and classified by linear layers. Note that we aim at a simple but effective temporal-aware baseline for VideoHOI, so we do not utilize tricks from STAD such as the non-local block (Wang et al., 2018) or the long-term feature bank (Wu et al., 2019), nor from image-based HOI such as interactiveness (Li et al., 2019), though we note that these may be used to boost the performance.

3.2. Correctly-localized Visual Features

We have discussed in previous sections the issue of inappropriately pooled RoI features. We propose to tackle this issue by reversing the order of temporal pooling and RoI pooling. This approach has recently been proposed in (Hou et al., 2017) and is named tube-of-interest pooling (ToIPool). Refer to Figure 3(a) for an illustration. Denote $v \in \mathbb{R}^{d \times T \times H \times W}$ as the output of the penultimate layer of our 3D-CNN backbone, and denote $v_t \in \mathbb{R}^{d \times H \times W}$ as the $t$-th feature map along the time axis. Recall that we have $N$ trajectories centered at a keyframe. Following the conventional way, we also exploit visual context when predicting an interaction, which is done by utilizing the union bounding box feature of a human and an object. For example, the sky between a human and a kite could help to infer the correct interaction fly. Recall that $j_i$ represents the trajectory of object $i$, where we further denote $j_{i,t}$ as its 2D bounding box at time $t$. The spatial-temporal instance features $\{\bar{v}_i\}$ are then obtained using ToIPool with RoIAlign (He et al., 2017) by

(1)  $\bar{v}_i = \frac{1}{T} \sum_{t=1}^{T} \text{RoIAlign}(v_t, j_{i,t}),$

where $\bar{v}_i \in \mathbb{R}^{d \times h \times w}$, and $h$ and $w$ denote the height and width of the pooled feature maps, respectively. $\bar{v}_i$ is flattened before being concatenated with other features.
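A minimal sketch of Eq. (1) is given below, assuming PyTorch/torchvision, an $N \times T \times 4$ trajectory tensor already expressed in feature-map coordinates, and illustrative tensor sizes; the function name toi_pool and the toy trajectories are ours, not the released code.

```python
import torch
from torchvision.ops import roi_align

def toi_pool(v, trajectories, out_size=(7, 7)):
    """v: (d, T, H, W) backbone feature map; trajectories: (N, T, 4) per-frame boxes
    (x1, y1, x2, y2). RoI-aligns each instance frame by frame along its trajectory and
    averages over time, i.e., Eq. (1). Returns (N, d, h, w)."""
    d, T, H, W = v.shape
    N = trajectories.shape[0]
    pooled = v.new_zeros(N, T, d, out_size[0], out_size[1])
    for t in range(T):
        v_t = v[:, t].unsqueeze(0)                   # (1, d, H, W): the t-th feature map
        boxes = [trajectories[:, t]]                 # one (N, 4) box tensor for batch index 0
        pooled[:, t] = roi_align(v_t, boxes, output_size=out_size)
    return pooled.mean(dim=1)                        # temporal average -> (N, d, h, w)

# Example: pool features for a human, an object and their union-box trajectory.
v = torch.randn(256, 32, 14, 14)
traj = torch.rand(3, 32, 4).sort(dim=-1).values * 14     # crude but valid boxes for the sketch
instance_feats = toi_pool(v, traj)                       # (3, 256, 7, 7), flattened before fusion
```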

3.3. Spatial-Temporal Masking Pose Features

Human poses have been widely utilized in image-based HOI methods (Li et al., 2019; Gupta et al., 2019; Wan et al., 2019) to exploit characteristic actor poses for inferring certain actions. In addition, some existing works (Wang et al., 2019; Wan et al., 2019) found that spatial information can be used to identify interactions. For instance, for human-ride-horse one can imagine the actor’s skeleton with the legs spread apart (on the horse's sides), and the bounding box center of the human is usually above that of the horse. However, none of the existing works consider this mechanism in the temporal domain: when riding a horse, the human should be moving together with the horse. We argue that this temporality is an important property and should be utilized as well.

The spatial-temporal masking pose module is presented in Figure 3(b). Given $M$ human trajectories, we first generate $M$ spatial-temporal pose features with a trained human pose prediction model. On frame $t$, the predicted human pose $h_{i,t} \in \mathbb{R}^{17 \times 2}, i = \{1, \ldots, M\}, t = \{1, \ldots, T\}$ is defined as 17 joint points mapped to the original image. We transform $h_{i,t}$ into a skeleton on a binary mask with $f_h: \{h_{i,t}\} \in \mathbb{R}^{17 \times 2} \to \{\bar{h}_{i,t}\} \in \mathbb{R}^{1 \times H \times W}$ by connecting the joints using lines, where each line has a distinct value $x \in [0, 1]$. This helps the model to recognize and differentiate different poses.

For each of the $M \times (N-1)$ valid human-object pairs on frame $t$, we also generate two spatial masks $s_{i,t} \in \mathbb{R}^{2 \times H \times W}, i = \{1, \ldots, M \times (N-1)\}$, corresponding to the human and the object respectively, where the values inside each bounding box are ones and those outside are zeroed out. These masks enable our model to predict HOIs with reference to important spatial information.

For each pair, we concatenate the skeleton mask $\bar{h}_{i,t}$ and the spatial masks $s_{i,t}$ along the first dimension to get the initial spatial masking pose feature $p_{i,t} \in \mathbb{R}^{3 \times H \times W}$:

(2)  $p_{i,t} = [s_{i,t}; \bar{h}_{i,t}].$

We then down-sample $\{p_{i,t}\}$, feed it into two 3D convolutional layers with spatial and temporal pooling, and flatten the output to obtain the final spatial-temporal masking pose feature $\{\bar{p}_{i,t}\}$.
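Below is a minimal sketch of this module, assuming PyTorch, COCO-style 17-keypoint poses and illustrative limb/box definitions; the rasterization, channel sizes and layer hyper-parameters are placeholders rather than the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, H, W = 32, 224, 224

def skeleton_mask(joints, limbs, H, W):
    """joints: (17, 2) pixel coordinates. Draws each limb as a line of a distinct value
    in [0, 1] on a (1, H, W) mask (nearest-pixel rasterization, kept simple for brevity)."""
    mask = torch.zeros(1, H, W)
    for k, (a, b) in enumerate(limbs):
        p, q = joints[a], joints[b]
        for s in torch.linspace(0, 1, steps=64):          # sample points along the limb
            x = int((1 - s) * p[0] + s * q[0])
            y = int((1 - s) * p[1] + s * q[1])
            if 0 <= x < W and 0 <= y < H:
                mask[0, y, x] = (k + 1) / len(limbs)
    return mask

def box_mask(box, H, W):
    """Binary mask with ones inside the (x1, y1, x2, y2) box and zeros outside."""
    x1, y1, x2, y2 = [int(c) for c in box]
    m = torch.zeros(1, H, W)
    m[0, y1:y2, x1:x2] = 1.0
    return m

limbs = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16)]  # arms/legs
human_box, object_box = [30, 40, 120, 200], [100, 90, 180, 210]                    # toy boxes

# Eq. (2) per frame: [human/object spatial masks; skeleton mask] -> (3, H, W), stacked over T.
p = torch.stack(
    [torch.cat([box_mask(human_box, H, W),
                box_mask(object_box, H, W),
                skeleton_mask(torch.rand(17, 2) * 200, limbs, H, W)]) for _ in range(T)],
    dim=1)                                                          # (3, T, H, W)

down = F.interpolate(p.unsqueeze(0), size=(T, 56, 56))              # spatial down-sampling
conv3d = nn.Sequential(nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
                       nn.Conv3d(64, 64, kernel_size=3, padding=1), nn.ReLU())
pose_feat = conv3d(down).mean(dim=(2, 3, 4))                        # spatial-temporal pooling -> (1, 64)
```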

Table 1. A comparison of our benchmark VidHOI with existing STAD (AVA (Gu et al., 2018)), image-based (HICO-DET (Chao et al., 2018) and V-COCO (Gupta and Malik, 2015)) and video-based (CAD-120 (Koppula et al., 2013) and Action Genome (Ji et al., 2020)) HOI datasets. VidHOI is the only dataset that provides temporal information from video clips and complete multi-person and interacting-object annotations. VidHOI also provides the most annotated keyframes and defines the most HOI categories among the existing video datasets. †Two fewer categories as we combine adult, child and baby into a single category, person.

| Dataset | Video dataset? | Localized object? | Video hours | # Videos | # Annotated images/frames | # Object categories | # Predicate categories | # HOI categories | # HOI instances |
|---|---|---|---|---|---|---|---|---|---|
| HICO-DET (Chao et al., 2018) | ✗ | ✓ | - | - | 47K | 80 | 117 | 600 | 150K |
| V-COCO (Gupta and Malik, 2015) | ✗ | ✓ | - | - | 10K | 80 | 25 | 259 | 16K |
| AVA (Gu et al., 2018) | ✓ | ✗ | 108 | 437 | 3.7M | - | 49 | 80 | 1.6M |
| CAD-120 (Koppula et al., 2013) | ✓ | ✓ | 0.57 | 0.5K | 61K | 13 | 6 | 10 | 32K |
| Action Genome (Ji et al., 2020) | ✓ | △ | 82 | 10K | 234K | 35 | 25 | 157 | 1.7M |
| VidHOI | ✓ | ✓ | 70 | 7122 | 7.3M | 78† | 50 | 557 | 755K |

3.4. Prediction

We fuse the aforementioned features, including the correctly-localized visual features $\bar{v}$, the spatial-temporal masking pose features $p$, and the instance trajectories $j$, by concatenating them along the last axis:

(3)  $v_{\text{so}} = [\bar{v}_s; \bar{v}_u; \bar{v}_o; j_s; j_o; \bar{p}_{so}],$

where we slightly abuse the notation to denote the subscripts $s$ as the subject, $o$ as the object and $u$ as their union region. $v_{\text{so}}$ is then fed into two linear layers, with the final output size being the number of interaction classes in the dataset. Since VideoHOI is essentially a multi-label learning task, we train the model with a per-class binary cross-entropy loss.
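A minimal sketch of the fusion and classification head with the per-class binary cross-entropy loss is shown below; the feature dimensions and layer widths are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

C = 50                                        # number of interaction classes
d_vis, d_traj, d_pose = 256 * 7 * 7, 32 * 4, 64
in_dim = 3 * d_vis + 2 * d_traj + d_pose      # [v_s; v_u; v_o; j_s; j_o; p_so], Eq. (3)

head = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, C))
criterion = nn.BCEWithLogitsLoss()            # one independent binary decision per class

# One human-object pair: subject/union/object visual features, two trajectories, pose feature.
v_s, v_u, v_o = (torch.randn(1, d_vis) for _ in range(3))
j_s, j_o = torch.randn(1, d_traj), torch.randn(1, d_traj)
p_so = torch.randn(1, d_pose)

v_so = torch.cat([v_s, v_u, v_o, j_s, j_o, p_so], dim=-1)     # Eq. (3)
logits = head(v_so)                                           # (1, C) interaction scores
loss = criterion(logits, torch.zeros(1, C))                   # multi-hot ground-truth labels
```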

During inference, we follow the heuristics in image-based HOI (Chao et al., 2018) to sort all possible pairs by their softmax scores and evaluate only the top 100 predictions.

4. Experiments

4.1. Dataset and Performance Metric

While we have discussed in Section 2.1 the problem of lacking a suitable VideoHOI dataset by analyzing CAD-120 (Koppula et al., 2013), we further explain why Action Genome (Ji et al., 2020) is also not a feasible choice here. First, the authors acknowledged that the dataset is still incomplete and contains incorrect labels (Ji, 2020). Second, Action Genome is produced by annotating Charades (Sigurdsson et al., 2016), which is originally designed for activity classification where each clip contains only one “actor” performing predefined tasks; should any other people show up, there is neither a bounding box nor an interaction label for them. Finally, the videos are purposely generated by volunteers and are thus rather unnatural. In contrast, VidHOI is based on VidOR (Shang et al., 2019), which is densely annotated with all humans and predefined objects showing up in each frame. VidOR is also more challenging as the videos are user-generated (rather than produced by volunteers) and thus jittery at times. A comparison of VidHOI and the existing STAD and HOI datasets is presented in Table 1.

VidOR was originally collected for video visual relationship detection, where the evaluation is trajectory-based. The volumetric Intersection over Union (vIoU) between a trajectory and a ground truth needs to be over 0.5 before its relationship prediction is considered; however, how to obtain accurate trajectories with correct start and end timestamps remains challenging (Sun et al., 2019; Shang et al., 2017). We notice that some image-based HOI datasets (e.g., HICO-DET (Chao et al., 2018) and V-COCO (Gupta and Malik, 2015)) as well as STAD datasets (e.g., AVA (Gu et al., 2018)) use a keyframe-centered evaluation strategy, which bypasses the aforementioned issue. We thus adopt the same strategy and follow AVA to sample keyframes at a 1 FPS frequency, where the annotations on the keyframe at timestamp $t$ are assumed to be fixed for $t \pm 0.5$ s. In detail, we first filter out keyframes that do not contain at least one valid human-object pair, followed by transforming the labels from video-clip-based to keyframe-based to align with common HOI metrics (i.e., frame mAP). We follow the original VidOR split in (Shang et al., 2019) to divide VidHOI into a training set comprising 193,911 keyframes in 6,366 videos and a validation set (the VidOR testing set is not publicly available) with 22,808 keyframes in 756 videos. As shown in Figure 4, there are 50 relation classes including actions (e.g., push, pull, lift, etc.) and spatial relations (e.g., next to, behind, etc.). While half (25) of the predicate classes are temporal-related, they account for merely ∼5% of the dataset.
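As a rough sketch of this keyframe-centered conversion (the annotation layout below is hypothetical and simplified; VidOR stores per-frame trajectories and clip-level relation triplets), the 1 FPS sampling and filtering could look like:

```python
def to_keyframe_annotations(fps, num_frames, relations, boxes):
    """relations: iterable of (subject_id, predicate, object_id, begin_frame, end_frame);
    boxes: dict mapping (instance_id, frame_idx) -> (x1, y1, x2, y2).
    Returns {keyframe_idx: [(subject_box, predicate, object_box), ...]} sampled at 1 Hz,
    dropping keyframes without at least one valid human-object pair."""
    annotations = {}
    for kf in range(0, num_frames, int(fps)):             # one keyframe per second
        pairs = [(boxes[(s, kf)], predicate, boxes[(o, kf)])
                 for s, predicate, o, begin, end in relations
                 if begin <= kf < end and (s, kf) in boxes and (o, kf) in boxes]
        if pairs:                                          # keep only keyframes with a valid pair
            annotations[kf] = pairs
    return annotations
```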

Figure 4. Predicate distribution of the VidHOI benchmark shows that most of the predicates are non-temporal-related.
Figure 5. Performance comparison in predicate-wise AP (pAP). The performance boost after adding trajectory features is observed for most of the predicates. Interestingly, both spatial (e.g., next to, behind) and temporal (e.g., towards, away) predicates benefit from the temporal-aware features. Predicates are sorted by the number of occurrences. Models are in Oracle mode.

Following the evaluation metric in HICO-DET, we adopt mean Average Precision (mAP), where a true positive HOI needs to meet the three criteria below: (a) both the predicted human and object bounding boxes overlap with the ground truth boxes with an IoU over 0.5, (b) the predicted target category matches and (c) the predicted interaction is correct. Over the 50 predicates, we follow HICO-DET to define HOI categories as 557 triplets on which we compute the mean AP. By defining HOI categories with triplets we can bypass the polysemy problem (Zhong et al., 2020), i.e., the same predicate word can represent very different meanings when paired with distinct objects, e.g., person-fly-kite and person-fly-airplane. We report the mean AP over three category sets: (a) Full: all 557 categories are evaluated, (b) Rare: 315 categories with fewer than 25 instances in the dataset, and (c) Non-rare: 242 categories with at least 25 instances in the dataset. We also examine the models in two evaluation modes: Oracle models are trained and tested with ground truth trajectories, while models in Detection mode are tested with predicted trajectories.
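A minimal sketch of the triplet true-positive test implied by criteria (a)-(c) is given below (plain Python; the dictionary keys are illustrative, and score ranking plus greedy matching against ground truths is omitted for brevity).

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def is_true_positive(pred, gt):
    """pred/gt: dicts with 'human_box', 'object_box', 'object_class' and 'interaction'."""
    return (iou(pred["human_box"], gt["human_box"]) > 0.5 and      # (a) human box overlaps
            iou(pred["object_box"], gt["object_box"]) > 0.5 and    # (a) object box overlaps
            pred["object_class"] == gt["object_class"] and         # (b) target category matches
            pred["interaction"] == gt["interaction"])              # (c) interaction is correct
```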

4.2. Implementation Details

We adopt ResNet-50 (He et al., 2016) as our 2D backbone for the preliminary experiments, and utilize a ResNet-50-based SlowFast (Feichtenhofer et al., 2019) as our 3D backbone for all the other experiments. SlowFast contains the Slow and Fast pathways, which capture texture details and temporal information, respectively, by sampling video frames at different frequencies. For a 64-frame segment centered at the keyframe, $T = 32$ frames are alternately sampled to feed into the Slow pathway; only $T/\alpha$ frames are fed into the Fast pathway, where $\alpha = 8$ in our experiments. We use FastPose (Fang et al., 2017) to predict human poses and adopt the predicted trajectories generated by a cascaded model of video object detection, temporal NMS and a tracking algorithm (Sun et al., 2019). As with object detection for 2D HOI detection, trajectory generation is an essential module but not a main focus of this work. If a bounding box is not available in neighboring frames (i.e., the trajectory is shorter than $T$ or not continuous throughout the segment), we fill it with a whole-image box. We train all models from scratch for 20 epochs with an initial learning rate of $1 \times 10^{-2}$, using a step decay schedule that reduces the learning rate by $10\times$ at the 10th and 15th epochs. We optimize our models using synchronized SGD with a momentum of 0.9 and a weight decay of $10^{-7}$. We train each 3D video model with eight NVIDIA Tesla V100 GPUs with a batch size of 128 (i.e., 16 examples per GPU), except for the full model where we set the batch size to 112 due to memory restrictions. We train the 2D model with a single V100 with a batch size of 128.
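The whole-image fallback for short or discontinuous trajectories could be implemented as in the following sketch (hypothetical data layout: one optional box per frame).

```python
def pad_trajectory(boxes_per_frame, T, width, height):
    """boxes_per_frame: list of length T with an (x1, y1, x2, y2) tuple or None per frame.
    Frames without a detected box fall back to the whole image so that RoIAlign along the
    trajectory stays well defined."""
    whole_image = (0.0, 0.0, float(width), float(height))
    return [box if box is not None else whole_image for box in boxes_per_frame[:T]]
```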

During training, following the strategy in SlowFast, we randomly scale the shorter side of the video to a value in $[256, 320]$ pixels, followed by random horizontal flipping and random cropping to $224 \times 224$ pixels. During inference, we only resize the shorter side of the video segment to 224 pixels.
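For completeness, a sketch of this augmentation on a (T, C, H, W) frame tensor is shown below; in practice the trajectory boxes would be rescaled, flipped and cropped consistently, which is omitted here.

```python
import random
import torch
import torch.nn.functional as F

def augment(frames):
    """frames: (T, C, H, W) float tensor of one video segment."""
    T, C, H, W = frames.shape
    short = random.randint(256, 320)                     # random shorter-side scale
    scale = short / min(H, W)
    frames = F.interpolate(frames, scale_factor=scale, mode="bilinear", align_corners=False)
    if random.random() < 0.5:                            # random horizontal flip
        frames = torch.flip(frames, dims=[-1])
    _, _, H, W = frames.shape
    top, left = random.randint(0, H - 224), random.randint(0, W - 224)
    return frames[..., top:top + 224, left:left + 224]   # random 224 x 224 crop
```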

4.3. Quantitative Results

Table 2. Results of the baselines and our ST-HOI on the VidHOI validation set (numbers in mAP). There are two evaluation modes: Detection and Oracle, which differ only in the use of predicted or ground truth trajectories during inference. T: Trajectory features. V: Correctly-localized visual features. P: Spatial-temporal masking pose features. “%” means the Full mAP change compared to the 2D model.

| Mode | Model | Full | Non-rare | Rare | % |
|---|---|---|---|---|---|
| Oracle | 2D model (Wan et al., 2019) | 14.1 | 22.9 | 11.3 | - |
| Oracle | 3D model | 14.4 | 23.0 | 12.6 | 2.1 |
| Oracle | Ours-T | 17.3 | 26.9 | 16.8 | 22.7 |
| Oracle | Ours-T+V | 17.3 | 26.9 | 16.3 | 22.7 |
| Oracle | Ours-T+P | 17.4 | 27.1 | 16.4 | 23.4 |
| Oracle | Ours-T+V+P | 17.6 | 27.2 | 17.3 | 24.8 |
| Detection | 2D model (Wan et al., 2019) | 2.6 | 4.7 | 1.7 | - |
| Detection | 3D model | 2.6 | 4.9 | 1.9 | 0.0 |
| Detection | Ours-T | 3.0 | 5.5 | 2.0 | 15.4 |
| Detection | Ours-T+V | 3.1 | 5.8 | 2.0 | 19.2 |
| Detection | Ours-T+P | 3.2 | 6.1 | 2.0 | 23.1 |
| Detection | Ours-T+V+P | 3.1 | 5.9 | 2.1 | 19.2 |

Since we aim to deal with a) the lack of temporal-aware features in 2D HOI methods, b) the feature inconsistency issue in common 3D HOI methods and c) the lack of a VideoHOI benchmark, we mainly compare with the 2D model (Wan et al., 2019) and its naive 3D variant on VidHOI to understand if our ST-HOI addresses these issues effectively.

The performance comparison between our full ST-HOI model (Ours-T+V+P) and the baselines (2D model, 3D model) is presented in Table 2, in which we also present ablation studies on our different features (modules), including trajectory features (T), correctly-localized visual features (V) and spatial-temporal masking pose features (P). Table 2 shows that the 3D model only has a marginal improvement over the 2D model (overall ∼2%) under all settings in both evaluation modes. In contrast, adding trajectory features (Ours-T) leads to a much larger 23% improvement in Oracle mode or 15% in Detection mode, showing the importance of correct spatial-temporal information. We also find that adding the additional temporal-aware features (i.e., V and P) attains increasingly higher mAPs, and our full model (Ours-T+V+P) reports the best mAPs in Oracle mode, achieving the highest ∼25% relative improvement. We notice that the performance of Ours-T+V is close to that of Ours-T under the Oracle setting, which is possibly because the ground truth trajectories (T) already provide enough “correctly-localized” information so that the correctly-localized visual features do not help much. We also note that the performance of Ours-T+P is slightly higher than that of Ours-T+V+P under Detection mode, which is presumably due to the same reason and the inferior quality of the predicted trajectories. The overall performance gap between Detection and Oracle models is significant, indicating the room for improvement in trajectory generation. Another interesting observation is that Full mAPs are very close to Rare mAPs, especially under Oracle mode, showing that the long-tail effect over HOIs is strong (but common and natural).

To understand the effect of temporal features on individual predicates, we compare predicate-wise AP (pAP) as shown in Figure 5. We observe that, again, under most circumstances naively replacing 2D backbones with 3D ones does not help video HOI detection. Both temporal predicates (e.g., towards, away, pull) and spatial predicates (e.g., next_to, behind, beneath) benefit from the additional temporal-aware features in ST-HOI. These findings verify our main idea about the essential use of trajectories and trajectory-based features. In addition, the additional features do not seem to contribute equally for different predicates. For instance, we see that while Ours-T+V+P performs the best on some predicates (e.g., behind and beneath), our sub-models achieve the highest pAP on other predicates (e.g., watch and ride). This is presumably because predicate-wise performance is heavily subject to the number of examples, where major predicates have 10-10000 times more examples than minor ones (as shown in Figure 4).

Figure 6. Results (in predicate-wise AP) of the baselines and our full model w.r.t. top frequent temporal predicates.
Table 3. Results of temporal-related and spatial-related (non-temporal) triplet mAP. T%/S% means the relative temporal/spatial mAP change compared to the 2D model (Wan et al., 2019).

| Mode | Model | Temporal | T% | Spatial | S% |
|---|---|---|---|---|---|
| Oracle | 2D model (Wan et al., 2019) | 8.3 | - | 18.6 | - |
| Oracle | 3D model | 7.7 | -7.2 | 20.9 | 12.3 |
| Oracle | Ours-T | 14.4 | 73.5 | 24.7 | 32.8 |
| Oracle | Ours-T+V | 13.6 | 63.9 | 24.6 | 32.3 |
| Oracle | Ours-T+P | 12.9 | 55.4 | 25.0 | 34.4 |
| Oracle | Ours-T+V+P | 14.4 | 73.5 | 25.0 | 34.4 |
| Detection | 2D model (Wan et al., 2019) | 1.5 | - | 2.7 | - |
| Detection | 3D model | 1.6 | 6.7 | 2.9 | 7.4 |
| Detection | Ours-T | 1.8 | 20.0 | 3.3 | 23.6 |
| Detection | Ours-T+V | 1.8 | 20.0 | 3.3 | 23.6 |
| Detection | Ours-T+P | 1.8 | 20.0 | 3.3 | 23.6 |
| Detection | Ours-T+V+P | 1.9 | 26.7 | 3.3 | 23.6 |

Since the majority of HOI examples are spatial-related (∼95%, as shown in Figure 4), the results above might not be suitable for demonstrating the temporal modeling ability of our proposed model. We thus focus on the performance on only temporal-related predicates in Figure 6, which demonstrates that Ours-T+V+P greatly outperforms the baselines on the most frequent temporal predicates. Table 3 presents the triplet mAPs of spatial-only and temporal-only predicates, showing that Ours-T significantly improves over the 2D model on temporal-only mAP by a relative +73.5%, in sharp contrast to the -7.2% of the 3D model in Oracle mode. Similar to our observation with Table 2, Ours-T performs on par with Ours-T+V+P on temporal-only predicates; however, it falls short on spatial-only predicates, showing that spatial/pose information is still essential for detecting spatial predicates. Overall, these results demonstrate the outstanding spatial-temporal modeling ability of our approach.

Figure 7. Performance comparison (in AP) of some temporal-related HOIs in the VidHOI validation set. Compared to the 2D model, the 3D model only shows limited improvement for the presented examples, while our ST-HOI variants provide a huge performance boost. Models are in Oracle mode.
Figure 8. Examples of video HOIs predicted by the 2D model (Wan et al., 2019) and our ST-HOI, both in Oracle mode. Each example consists of five consecutive keyframes sampled at 1 Hz, where an entry in the tables denotes whether a predicate between the subject (human; a green box) and the object (also a human in both cases; a red box) is detected correctly (True Positive) or not (False Positive or False Negative). Compared to the 2D baseline, our model predicts more accurate temporal HOIs (e.g., hold_hand_of in $T_4$ and $T_5$ of the upper example and lift in $T_1$ of the lower example). ST-HOI also produces fewer false positives in both examples.

We also compare the performance with respect to some HOI triplets in Figure 7. Similar to the results on predicate-wise AP, we observe a large gap between the naive 2D/3D models and our models with the temporal features. The ST-HOI variants are more accurate especially in predicting temporal-aware HOIs (hug/lean_on-person and push/pull-baby_walker). We also see in some examples that Ours-T+V+P does not perform the best among all the variants (e.g., lean_on-person), which is similar to the phenomenon we observed in Figure 5.

4.4. Qualitative Results

To understand the effectiveness of our proposed method, we visualize two video HOI examples from VidHOI predicted by the 2D model (Wan et al., 2019) and Ours-T+V+P (both in Oracle mode) in Figure 8. Each (upper and lower) example is a 5-second video segment (i.e., five keyframes) with an HOI prediction table where each entry is either a True Positive (TP), False Positive (FP), False Negative (FN) or True Negative (TN) for both models. The upper example shows that, compared to the 2D model, Ours-T+V+P makes more accurate HOI detections by successfully predicting hold_hand_of at $T_4$ and $T_5$. Moreover, Ours-T+V+P is able to predict interactions that require temporal information, such as lift at $T_1$ in the lower example. However, we can see that there is still room for improvement for Ours-T+V+P in the same example, where lift is not detected in the following $T_2$ to $T_4$ frames. Overall, our model produces fewer false positives throughout the dataset, which in turn contributes to its higher mAP and pAP.

5. Conclusion

In this paper, we addressed the inability of conventional HOI approaches to recognize temporal-aware interactions by re-focusing on neighboring video frames. We discussed the lack of a suitable setting and dataset for studying video-based HOI detection. We also identified a feature-inconsistency problem in a common video action detection baseline which arises from its improper order of RoI feature pooling and temporal pooling. To deal with the first issue, we established a new video HOI benchmark dubbed VidHOI and introduced a keyframe-centered detection strategy. We then proposed a spatial-temporal baseline, ST-HOI, which exploits trajectory-based temporal features including correctly-localized visual features, spatial-temporal masking pose features and trajectory features, solving the second problem. With quantitative and qualitative experiments on VidHOI, we showed that our model provides a huge performance boost compared to both the 2D and 3D baselines and is effective in differentiating temporal-related interactions. We expect the proposed baseline and dataset to serve as a solid starting point for the relatively underexplored VideoHOI task. Based on our baseline, we also hope to motivate further VideoHOI works to design advanced models with multi-modal data including video frames, semantic object/relation labels and audio.

Acknowledgment

This research is supported by Singapore Ministry of Education Academic Research Fund Tier 1 under MOE’s official grant number T1 251RES2029.

References

  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision. 2425–2433.
  • Carreira and Zisserman (2017) Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
  • Chao et al. (2018) Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. 2018. Learning to detect human-object interactions. In 2018 ieee winter conference on applications of computer vision (wacv). IEEE, 381–389.
  • Chiou et al. (2020) Meng-Jiun Chiou, Zhenguang Liu, Yifang Yin, An-An Liu, and Roger Zimmermann. 2020. Zero-Shot Multi-View Indoor Localization via Graph Location Networks. In Proceedings of the 28th ACM International Conference on Multimedia. 3431–3440.
  • Chiou et al. (2021) Meng-Jiun Chiou, Roger Zimmermann, and Jiashi Feng. 2021. Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations. IEEE Access 9 (2021), 50441–50451.
  • Fang et al. (2017) Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. 2017. RMPE: Regional Multi-person Pose Estimation. In ICCV.
  • Feichtenhofer et al. (2019) Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE international conference on computer vision. 6202–6211.
  • Gao et al. (2018) Chen Gao, Yuliang Zou, and Jia-Bin Huang. 2018. iCAN: Instance-centric attention network for human-object interaction detection. arXiv preprint arXiv:1808.10437 (2018).
  • Gkioxari et al. (2018) Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. 2018. Detecting and recognizing human-object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8359–8367.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
  • Gu et al. (2018) Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6047–6056.
  • Gupta and Malik (2015) Saurabh Gupta and Jitendra Malik. 2015. Visual Semantic Role Labeling. arXiv preprint arXiv:1505.04474 (2015).
  • Gupta et al. (2019) Tanmay Gupta, Alexander Schwing, and Derek Hoiem. 2019. No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In Proceedings of the IEEE International Conference on Computer Vision. 9677–9685.
  • He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961–2969.
  • He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In 2015 IEEE International Conference on Computer Vision (ICCV). 1026–1034.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hou et al. (2017) Rui Hou, Chen Chen, and Mubarak Shah. 2017. Tube convolutional neural network (T-CNN) for action detection in videos. In Proceedings of the IEEE international conference on computer vision. 5822–5831.
  • Jain et al. (2016) Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the ieee conference on computer vision and pattern recognition. 5308–5317.
  • Ji (2020) Jingwei Ji. 2020. Question about the annotations · Issue #2 · JingweiJ/ActionGenome. https://github.com/JingweiJ/ActionGenome/issues/2
  • Ji et al. (2020) Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. 2020. Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10236–10247.
  • Koppula et al. (2013) Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. 2013. Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research 32, 8 (2013), 951–970.
  • Koppula and Saxena (2015) Hema S Koppula and Ashutosh Saxena. 2015. Anticipating human activities using object affordances for reactive robotic response. IEEE transactions on pattern analysis and machine intelligence 38, 1 (2015), 14–29.
  • Li et al. (2020) Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, and Cewu Lu. 2020. PaStaNet: Toward Human Activity Knowledge Engine. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 382–391.
  • Li et al. (2019) Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yanfeng Wang, and Cewu Lu. 2019. Transferable interactiveness knowledge for human-object interaction detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3585–3594.
  • Liao et al. (2020) Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, and Jiashi Feng. 2020. Ppdm: Parallel point detection and matching for real-time human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 482–490.
  • Lo and Wang (2019) Chi-Hung Lo and Yi-Wen Wang. 2019. Constructing an Evaluation Model for User Experience in an Unmanned Store. Sustainability 11, 18 (2019).
  • Lowe (2004) David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision 60, 2 (2004), 91–110.
  • Lu et al. (2016) Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2016. Visual relationship detection with language priors. In European conference on computer vision. Springer, 852–869.
  • Qi et al. (2018) Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision (ECCV). 401–417.
  • Shang et al. (2019) Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. 2019. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval. 279–287.
  • Shang et al. (2017) Xindi Shang, Tongwei Ren, Jingfan Guo, Hanwang Zhang, and Tat-Seng Chua. 2017. Video visual relation detection. In Proceedings of the 25th ACM international conference on Multimedia. 1300–1308.
  • Sigurdsson et al. (2016) Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision. Springer, 510–526.
  • Smith et al. (2013) Brian A Smith, Qi Yin, Steven K Feiner, and Shree K Nayar. 2013. Gaze locking: passive eye contact detection for human-object interaction. In Proceedings of the 26th annual ACM symposium on User interface software and technology. 271–280.
  • Sun et al. (2019) Xu Sun, Tongwei Ren, Yuan Zi, and Gangshan Wu. 2019. Video visual relation detection via multi-modal feature fusion. In Proceedings of the 27th ACM International Conference on Multimedia. 2657–2661.
  • Sunkesula et al. (2020) Sai Praneeth Reddy Sunkesula, Rishabh Dabral, and Ganesh Ramakrishnan. 2020. LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos. In Proceedings of the 28th ACM International Conference on Multimedia. 691–699.
  • Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.
  • Ulutan et al. (2020) Oytun Ulutan, ASM Iftekhar, and Bangalore S Manjunath. 2020. VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13617–13626.
  • Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3156–3164.
  • Wan et al. (2019) Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. 2019. Pose-aware multi-level feature network for human object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision. 9469–9478.
  • Wang et al. (2019) Tiancai Wang, Rao Muhammad Anwer, Muhammad Haris Khan, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao, and Jorma Laaksonen. 2019. Deep contextual attention for human-object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision. 5694–5702.
  • Wang et al. (2020) Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, and Jian Sun. 2020. Learning Human-Object Interaction Detection using Interaction Points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4116–4125.
  • Wang et al. (2018) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7794–7803.
  • Wu et al. (2019) Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. 2019. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 284–293.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
  • Xu et al. (2017) Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5410–5419.
  • Yin et al. (2019) Yifang Yin, Meng-Jiun Chiou, Zhenguang Liu, Harsh Shrivastava, Rajiv Ratn Shah, and Roger Zimmermann. 2019. Multi-level fusion based class-aware attention model for weakly labeled audio tagging. In Proceedings of the 27th ACM International Conference on Multimedia. 1304–1312.
  • Zhang et al. (2017) Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. 2017. Towards reaching human performance in pedestrian detection. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2017), 973–986.
  • Zhong et al. (2020) Xubin Zhong, Changxing Ding, Xian Qu, and Dacheng Tao. 2020. Polysemy deciphering network for human-object interaction detection. In Proc. Eur. Conf. Comput. Vis.