
Autonomous flying blimp interaction with human in an indoor space*

Ning-shi YAO, Qiu-yang TAO, Wei-yu LIU, Zhen LIU, Ye TIAN, Pei-yu WANG, Timothy LI, Fumin ZHANG
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
†E-mail: nyao6@gatech.edu; qtao7@gatech.edu; wliu88@gatech.edu; fumin@gatech.edu
Received Sept. 20, 2018; Revision accepted Nov. 26, 2018; Crosschecked Jan. 8, 2019

Abstract

We present the Georgia Tech Miniature Autonomous Blimp (GT-MAB), which is designed to support human-robot interaction experiments in an indoor space for up to two hours. GT-MAB is safe while flying in close proximity to humans. It is able to detect the face of a human subject, follow the human, and recognize hand gestures. GT-MAB employs a deep neural network based on the single shot multibox detector to jointly detect a human user's face and hands in a real-time video stream collected by the onboard camera. A human-robot interaction procedure is designed and tested with various human users. The learning algorithms recognize two hand waving gestures. The human user does not need to wear any additional tracking device when interacting with the flying blimp. Vision-based feedback controllers are designed to control the blimp to follow the human and fly in one of two distinguishable patterns in response to each of the two hand gestures. The blimp communicates its intentions to the human user by displaying visual symbols. The collected experimental data show that the visual feedback from the blimp in reaction to the human user significantly improves the interactive experience between blimp and human. The demonstrated success of this procedure indicates that GT-MAB could serve as a flying robot that is able to collect human data safely in an indoor environment.

Key words: Robotic blimp; Human-robot interaction; Deep learning; Face detection; Gesture recognition

https://doi.org/10.1631/FITEE.1800587
CLC number: TP24

1 Introduction

Recent advances in robotics have enabled the rapid development of unmanned aerial vehicles (UAVs). With the increasing penetration of UAVs into industry and everyday life, cooperation between humans and UAVs is quickly becoming unavoidable. It is extremely important that UAVs interact with humans safely and naturally (Duffy, 2003; Goodrich and Schultz, 2007), and to this end, the study of human-robot interaction (HRI) has enjoyed recent research interest (Draper et al., 2003; de Crescenzio et al., 2009; Duncan and Murphy, 2013; Acharya et al., 2017; Peshkova et al., 2017). Quad-rotors are one of the most popular robotic platforms for three-dimensional (3D) HRI studies (Graether and Mueller, 2012; Arroyo et al., 2014; Szafir et al., 2015; Cauchard et al., 2016; Monajjemi et al., 2016). Humans can use speech/verbal cues (Pourmehr et al., 2014), eye gaze (Monajjemi et al., 2013; Hansen et al., 2014), and hand gestures (Naseer et al., 2013; Costante et al., 2014) to command quad-rotors to accomplish certain tasks. In addition to quad-rotors, other types of UAVs, such as fixed-wing aircraft (He et al., 2011) and flying displays (Schneegass et al., 2014), have been developed to interact with humans.
In contrast to fast-flying UAVs, autonomous blimps are the preferred platform for HRI (Liew and Yairi, 2013) in certain applications where human comfort is a major concern. In Burri et al. (2013), a spherical robotic blimp was proposed to monitor activities in human crowds. St-Onge et al. (2017) demonstrated that a cubic-shaped blimp can fly close to human artists on stage. In another case, a robotic blimp (Srisamosorn et al., 2016) was used for monitoring elderly people inside a nursing home. These applications demonstrate the necessity of studying human-blimp interaction. However, there is a lack of dedicated design for autonomous blimps to support experiments in human interaction with flying robots in an indoor lab space.
We developed the Georgia Tech Miniature Autonomous Blimp (GT-MAB), which is designed to collect experimental data for indoor HRI (Cho et al., 2017; Yao et al., 2017; Tao et al., 2018). Being a flying robot, GT-MAB does not pose the safety threats and anxiety that a typical quad-rotor can cause to humans, and it can fly close to humans in indoor environments. In addition, GT-MAB has a relatively long flight time of up to two hours per battery charge, which supports uninterrupted HRI experiments. In this study, we introduce the hardware designs, perception algorithms, and feedback controllers on GT-MAB that identify human intentions through hand gesture recognition, communicate the robot's intentions to its human subjects, and execute the blimp behavior in reaction to the hand gestures. These features are the basic building blocks for more sophisticated experiments to collect human data and study human behaviors.
Achieving natural HRI can be more easily accomplished when the human subject does not need to wear tracking devices or use other instrumentation to interact with the robot (Duffy, 2003). With only one onboard monocular camera installed on GT-MAB, its perception algorithms can identify human intentions. We implemented a deep learning algorithm (specifically, the single-shot multibox detector (SSD) in Liu et al. (2016)), so GT-MAB could detect human faces and hands. Then we applied principal component analysis (PCA) (Wold et al., 1987) to robustly distinguish hand waving gestures in the horizontal and vertical directions. These two hand gestures trigger different reactions in the blimp. A person uses horizontal hand gestures to trigger spinning in GT-MAB and uses vertical hand gestures to trigger the blimp to fly backward (Fig. 1). We use monocular vision to measure the position of the human relative to the blimp. Vision-based feedback controllers then enable the blimp to autonomously follow a person while identifying the hand gestures. GT-MAB communicates its intentions by displaying immediate visual feedback on an onboard light-emitting diode (LED) display. The visual feedback is proven to be a key feature that improves the interactive experience.
Fig. 1 An uninstrumented human user interacts with the Georgia Tech Miniature Autonomous Blimp (GT-MAB) in close proximity and commands the GT-MAB via gestures
We conducted HRI experiments with multiple human participants and presented a user study to evaluate the effectiveness of the proposed HRI procedure. In our experiments, GT-MAB reliably demonstrated its ability to follow humans and it consistently collected the human data while interacting with humans. In the user study, most of the participants could successfully control the robotic blimp using the two hand gestures and reported positive feedback about the interactive experience. These results clearly demonstrated the effectiveness of the basic features of GT-MAB.

2 Literature review and novelty

2.1 Data collection in the human intimate zone

Hall (1966) defined space in terms of distance to humans. The intimate (<0.46 m), personal (0.46-1.2 m), social (1.2-3.6 m), and public (>3.6 m) spatial zones have been widely used in both human-human and HRI literature. However, because of their relatively high speed and powerful propellers, quad-rotors normally need to keep a relatively far distance from humans to ensure safe and comfortable interaction. In previous HRI works (Monajjemi et al., 2013; Naseer et al., 2013; Costante et al., 2014; Nagi et al., 2014), researchers proposed similar HRI designs whereby a human user could control a single quad-rotor or a team of quad-rotors through face and hand gesture recognition. However, quad-rotors need to stay more than 2 m away from humans to protect the users and avoid making the user feel threatened. Duncan and Murphy (2013) suggested that the minimum comfortable distance for humans interacting with small quad-rotors could not be less than 0.65 m. It is difficult for UAVs with strong propellers or for existing blimps (due to their size and functionality) to enter the human user's intimate space and collect human data without prompting anxiety in the user. GT-MAB can interact with humans within 0.4 m and collect videos of the human and the human's trajectories, which can be used to fit the social force model of Helbing and Molnár (1995) in the intimate zone. To the best of our knowledge, GT-MAB is perhaps the first aerial robotic platform that is able to collect HRI data naturally within the human intimate spatial zone.

2.2 Visual feedback from blimp to human

Visual feedback in the HRI procedure can significantly improve the interactive experience. Previous research has explored the implicit expressions of robot intentions by manipulating the flying motions (Sharma et al., 2013; Szafir et al., 2014; Cauchard et al., 2016). However, such implicit expressions are limited when aerial robots interact closely with humans. Explicit expressions are preferred for proximal interactions. Szafir et al. (2015) devised a ring of LED lights under the quad-rotor and designed four signals to indicate the next flight motion of the quad-rotor. A user study was conducted, where human participants were asked to predict the robot's intentions. The user study verified that the LED signals significantly improved the viewer response time and accuracy compared to a robot without the signals. However, in that work, the human participants were separated from the robot's environment by a floor-to-ceiling glass panel, so it was not an interactive environment. In our work, we discovered that immediate visual feedback is crucial for reducing human's confusion caused by the time delays between the time when a robot perceives a human command and the time when the robot initiates an action. We conducted a user study for our proposed HRI process and verified that the LED feedback significantly improves the interactive experience and efficiency.

2.3 Monocular vision based human localization

To localize a human, quad-rotors normally require a depth camera (Lichtenstern et al., 2012; Naseer et al., 2013). Recent works (Costante et al., 2014; Lim and Sinha, 2015; Perera et al., 2018) have also used a monocular camera on UAVs to localize humans and estimate human trajectories. Since these works used quad-rotors as the HRI platforms, one unavoidable step for monocular vision is to estimate the camera pose due to the flying mechanism of the quad-rotors, which is a challenging problem. Compared to quad-rotors, GT-MAB is self-stabilized and can fly in a horizontal plane with almost no vibration, so the pitch and roll angles of GT-MAB can be approximately viewed as staying at zero. The pose of the onboard camera is fixed. Due to this unique feature, we developed a vision-based technique to localize a human in real time from the onboard monocular camera of GT-MAB.

2.4 Joint face and hand detection

In previous gesture-based HRIs (Monajjemi et al., 2013; Costante et al., 2014), human face detection is necessary for distinguishing a human from other objects. Once a human face is detected, a hand detector is triggered to recognize human gestures. For each frame, two feature detectors are needed to detect different human features. The computation to run two feature detectors takes a relatively long time and is hard to implement for real-time video. To speed up video processing, feature tracking algorithms are used to track the feature detected in the previous frame (Birchfield, 1996). However, the tracking algorithms cannot consistently provide an accurate and tight bounding box around the human feature. To overcome the above-mentioned problems, we use one of the state-of-the-art object detection deep learning algorithms, SSD (Liu et al., 2016), in the context of human blimp interaction.
We also build a dataset that is efficient and adequate for training the SSD for real-time face and hand detection.
In our previous work (Yao et al., 2017), we achieved human following behavior on GT-MAB. GT-MAB was able to follow a human who was not wearing a tracking device and keep the human in sight of its onboard camera based on face detection, but GT-MAB could not react to the human. In this study, we propose a novel advancement to achieve natural interaction between a human and GT-MAB by enabling GT-MAB to recognize human intentions through hand gestures and react to human intentions through visual feedback and flying motions.

3 GT-MAB platform

GT-MAB consists of an envelope and a customized gondola. The envelope has a unique saucerlike shape, as shown in Fig. 1, which solves the conflict between maneuverability and stability and enhances its capability to interact with humans at a close distance. The gondola is a 3D-printed mechanical structure accommodating all onboard devices underneath the envelope. Fig. 2 depicts the structure of the gondola and indicates the main components installed on it. We use five motors for the HRI application. The vertically mounted motors are used to change the altitude, while the horizontal ones enable the blimp to fly horizontally and change the heading angle. One side-way motion motor is used to keep the blimp in the front of a human. This design enables the blimp to move in the 3D space without changing its roll and pitch angles.
Fig. 2 Georgia Tech Miniature Autonomous Blimp gondola with the installed electronic components
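To make the motor layout above concrete, the sketch below shows one way the five motors could be driven from high-level motion commands. This is a minimal illustration only: the motor ordering, signs, and scaling are assumptions for this sketch, not the actual GT-MAB control allocation.

```python
# A minimal sketch of driving the five motors described above from normalized
# motion commands. Motor indices, signs, and scaling are illustrative
# assumptions, not the GT-MAB firmware mapping.

def mix_motors(forward, lateral, vertical, yaw):
    """Map normalized commands in [-1, 1] to five motor thrust values.

    forward/yaw -> two horizontally mounted motors (differential thrust)
    vertical    -> two vertically mounted motors (altitude)
    lateral     -> single side-way motor (keeps the blimp in front of the user)
    """
    left_horizontal = forward + yaw       # assumed differential drive for heading
    right_horizontal = forward - yaw
    vertical_1 = vertical                 # both vertical motors share the altitude command
    vertical_2 = vertical
    side = lateral
    clamp = lambda u: max(-1.0, min(1.0, u))
    return [clamp(u) for u in (left_horizontal, right_horizontal,
                               vertical_1, vertical_2, side)]

# Example: fly backward while holding altitude and heading
print(mix_motors(forward=-0.3, lateral=0.0, vertical=0.0, yaw=0.0))
```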

The appealing characteristics of GT-MAB, especially its small size, impose challenges on the blimp's hardware design. The blimp has only 60 g of total load capacity, including the onboard camera, microprocessors, and wireless communication devices. One difficulty in vision-based HRI using a blimp is finding a wireless camera that is light enough. The camera we selected for GT-MAB is an analog camera, which is the best option we could find that supports low-latency wireless transmission. This compact device weighs 4.5 g and has a diagonal field of view of 115 degrees. The camera is directly attached to the gondola. However, since the camera is analog, the video produced from it includes some glitch noise, which makes image processing more difficult than with digital cameras. We also installed an LED matrix display on the blimp to provide visual feedback for human users. The LED display shows the recognition results, while the controller outputs achieve spinning and backward motions for the control of the blimp. Fig. 3 shows the block diagram of the hardware setup for the system. The video stream coming from the onboard camera is obtained by the receiver connected to the ground station PC. Outputs of the perception and control algorithms running on the ground PC are packed into commands and sent to GT-MAB via an XBee wireless module.
Fig. 3 Hardware overview
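The command path from the ground station to the blimp can be pictured with the short sketch below. It assumes an XBee radio driven through pyserial in transparent mode; the serial port, baud rate, start byte, and packet layout are all illustrative assumptions rather than the actual GT-MAB protocol.

```python
# Sketch of the ground-station side of the command link: perception/control
# outputs are packed into a small binary packet and written to the XBee serial
# radio. Port name, baud rate, and packet layout are assumptions.
import struct
import serial  # pyserial

XBEE_PORT = "/dev/ttyUSB0"   # assumed serial port of the XBee module
XBEE_BAUD = 57600            # assumed baud rate

def pack_command(motor_cmds, seq):
    """Pack five motor commands (floats in [-1, 1]) into a framed packet."""
    payload = struct.pack("<B5f", seq & 0xFF, *motor_cmds)
    checksum = sum(payload) & 0xFF
    return b"\xAA" + payload + bytes([checksum])   # 0xAA = assumed start byte

def send_command(link, motor_cmds, seq):
    link.write(pack_command(motor_cmds, seq))

if __name__ == "__main__":
    with serial.Serial(XBEE_PORT, XBEE_BAUD, timeout=0.1) as link:
        send_command(link, [0.0, 0.0, 0.2, 0.2, 0.0], seq=1)  # gentle climb
```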

4 System overview

We achieve a natural and smooth HRI by enabling GT-MAB to perceive human intention. Humans are required to communicate their intentions to the blimp through predefined hand gestures so that human intentions are regulated and predictable. The human uses only one hand, starts the hand gesture near the face, and moves his/her hand horizontally or vertically. Then the blimp spins or flies backward according to the two hand gestures.

The overall HRI design involves five steps: (1) detecting a human face and hands jointly in a real-time video stream, (2) recognizing hand gestures, (3) communicating the blimp's intentions to the human through the onboard LED display, (4) estimating the human's location relative to the blimp, and (5) controlling the blimp to follow the human and initiate movement according to hand gestures. Fig. 4 shows a block diagram of the proposed HRI design. We first run a joint face and hand detector to detect human features in each video frame. If no face or hand is detected, the onboard LED displays the negative feedback for the human and the detector goes to the next frame. If a face is detected, a human localization algorithm and a human following controller are triggered to maintain the blimp's position relative to the human. If both face and hand are detected, a gesture recognition algorithm is triggered and the LED display shows a positive feedback to the human indicating that the gesture recognition has started. Once a valid gesture is recognized, the LED display shows another positive feedback indicating that GT-MAB has received the human's command. Meanwhile, the blimp controller switches from human following control to blimp motion control, which controls the blimp to initiate spinning or backward motion. Once a motion is accomplished, the joint face/hand detector is activated again and the whole interactive process repeats. The details of each step will be introduced in the following sections.

Fig. 4 System block diagram. The pink blocks represent the five steps. The green blocks represent the visual feedback shown on the LED display. The blue blocks represent logic questions. The blue solid arrows represent the logic flow for the process and the dashed arrows represent the data flow between each block. References to color refer to the online version of this figure
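As a compact illustration of the logic in Fig. 4, the sketch below walks one video frame through the five steps and returns the LED pattern and blimp action that would result. The function, pattern names, and action names are placeholders chosen for this sketch, not the project's actual interfaces.

```python
# A runnable sketch of one pass through the interaction logic in Fig. 4.
# Pattern names ("letter", "check", "cross", "ready") and action names are
# placeholders for this illustration.
def hri_step(face, hand, hand_in_initial_region, gesture):
    """Return (led_pattern, blimp_action) for one processed video frame."""
    if face is None:                                  # step 1 failed: no face detected
        return "cross", "hover"
    if hand is None or not hand_in_initial_region:    # steps 4-5: follow and wait for a gesture
        return "letter", "follow_human"
    if gesture is None:                               # step 2 in progress: keep watching the hand
        return "check", "follow_human"
    if gesture == "horizontal":                       # steps 3 and 5: feedback, then spin
        return "ready", "spin"
    if gesture == "vertical":
        return "ready", "fly_backward"
    return "cross", "follow_human"                    # invalid gesture: ask the user to redo it

# Example frames: nothing detected, then a recognized horizontal wave.
print(hri_step(face=None, hand=None, hand_in_initial_region=False, gesture=None))
print(hri_step(face=(320, 180), hand=(280, 190), hand_in_initial_region=True, gesture="horizontal"))
```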

5 Perception algorithms

In this section, we present the first three steps of GT-MAB's perception capabilities.

5.1 Joint detection of face and hand

We leverage the SSD (Liu et al., 2016), which is fast and can detect multiple categories of objects at the same time, to jointly detect a human face and hands in real-time videos. The idea behind SSD is simple. It reframes object detection as a single regression problem where object bounding boxes are assigned confidence scores representing how likely a bounding box is to contain a specific object. To train the SSD, the learning algorithm discretizes a training image into grid cells. Each cell has default bounding boxes of different locations and sizes. During the training process, these default boxes are compared with the ground-truth bounding boxes in training images, and a confidence score is computed for each object category. The neural network is trained to determine which default box has the highest corresponding confidence score. During detection, the trained neural network can directly generate the bounding box with the highest confidence score and determine to which category the bounded object belongs.
Particularly, to train the SSD for joint detection of a human face and hand, we create a new training set leveraging an image dataset from the Oxford Vision Group (Mittal et al., 2011). The images in this dataset have already been labeled with the human's hands. However, the human faces in this dataset are not labeled. To modify the dataset for both hand and face detection, we first assign the originally labeled hand regions as category 1. Then we use the Haar face detector (Viola and Jones, 2004) to detect a human face and label the face bounding box as category 2. We divide the modified dataset into a training set, which contains 4069 images, and a test set, which contains 821 images. The joint face and hand detector is then trained offline using the new training set. We fine-tune the neural network using stochastic gradient descent with 0.9 momentum, 0.0005 weight decay, and a 128 batch size. As for the learning rate, we use for the first iterations, and then continue training for iterations with a learning rate and another iterations with a learning rate.
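The relabeling step above can be sketched as follows: existing hand annotations are kept as category 1, and faces found by OpenCV's Haar cascade are appended as category 2. The file path and box format are placeholders; the actual annotation format of the Oxford hand dataset is not reproduced here.

```python
# Sketch of the dataset relabeling step: hands keep their original labels
# (category 1) and faces found by a Haar cascade are added as category 2.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def add_face_labels(image_path, hand_boxes):
    """Return (category, x, y, w, h) tuples for one training image."""
    labels = [(1, *box) for box in hand_boxes]          # category 1: hands
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    labels += [(2, int(x), int(y), int(w), int(h)) for (x, y, w, h) in faces]
    return labels

# Example (path and hand box are hypothetical):
# print(add_face_labels("oxford_hands/img_0001.jpg", [(34, 50, 60, 60)]))
```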
The trained joint face and hand detector is evaluated on the test set using the mean average precision (mAP), a common metric used in feature and object detection. Specifically, for each bounding box generated by the trained detector, we discard the box if it has less than percent intersection over union with the ground-truth bounding box. Given a specific threshold , we compute the average precision (AP) for each test image. Then we compute the mAP by taking the mean of all APs over all the test images. The test results are that with , the detector achieves 0.862 mAP; with , the detector achieves 0.844 mAP; and with , the detector achieves 0.684 mAP. The performance is almost the same as that in Liu et al. (2016).
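For reference, the intersection-over-union test underlying this evaluation can be written in a few lines. The boxes are (x1, y1, x2, y2) corner coordinates; the 0.5 threshold in the example is a common default and merely stands in for the thresholds used in the evaluation above.

```python
# Minimal sketch of the IoU test used in the mAP evaluation: a predicted box
# is kept only if its intersection-over-union with a ground-truth box reaches
# the chosen threshold.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def keep_detection(pred_box, gt_boxes, threshold=0.5):  # 0.5 is a common default, assumed here
    return any(iou(pred_box, gt) >= threshold for gt in gt_boxes)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```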
After testing the joint face and hand detector, the detector is applied to detect a human face and hand in the real-time video stream from the blimp camera. The results are shown in Fig. 5. The detected face is bounded by the yellow box with a label "Face" and the detected hand is bounded by the box labeled as "Hand." Fig. 5a shows the case where only a face is detected. Fig. 5b shows the case where both a face and a hand are detected but the hand is outside the initial gesture region, i.e., the two yellow boxes near the face bounding box. We define the initial gesture regions to filter out incorrect human gestures or random hand movements, and to ensure that the gesture recognition is more robust. Fig. 5c shows the case where both a face and a hand are detected with the hand in the initial gesture region. Only this case initializes the gesture recognition step.
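A minimal way to run such a detector frame by frame is sketched below using OpenCV's dnn module with a Caffe-format SSD, the architecture of Liu et al. (2016). The model file names, input size, and mean values are assumptions for illustration; the helper at the end produces the face center and face length used in the next paragraph.

```python
# Sketch of running a trained SSD on a video frame and extracting the face
# quantities used below. Model file names are placeholders.
import cv2

net = cv2.dnn.readNetFromCaffe("ssd_face_hand.prototxt", "ssd_face_hand.caffemodel")
HAND, FACE = 1, 2   # category ids used when relabeling the training set

def detect(frame, conf_threshold=0.5):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0, (300, 300),
                                 (104.0, 117.0, 123.0))   # standard SSD-300 input, assumed
    net.setInput(blob)
    detections = net.forward()          # shape (1, 1, N, 7): [_, class, conf, x1, y1, x2, y2]
    boxes = {}
    for i in range(detections.shape[2]):
        _, cls, conf, x1, y1, x2, y2 = detections[0, 0, i]
        if conf >= conf_threshold:
            boxes.setdefault(int(cls), []).append((x1 * w, y1 * h, x2 * w, y2 * h))
    return boxes

def face_center_and_length(face_box):
    x1, y1, x2, y2 = face_box
    # face length taken here as the box height in pixels (an assumption)
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0), (y2 - y1)
```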
Based on the bounding boxes, we define the position of the human face to be the center of the face bounding box, and the face length to be the length of the face bounding box in the image frame, both measured in pixels. The hand position is the center of the hand bounding box. We use the face position and the length of the human face to estimate the human position relative to the blimp, which will be introduced in Section 6.1.

5.2 Hand gesture recognition

Once the gesture recognition algorithm is initialized, the algorithm identifies two types of hand movements: horizontal linear hand movements and vertical linear hand movements.
The detection algorithm tracks the human hand from frame to frame. Once gesture recognition is triggered, the hand position is not restricted by the initial gesture region. The human hand can move out of the initial region and still be recognized. We collect the hand position data in 50 successive video frames once gesture recognition is triggered. The hand trajectory is modeled as a set of two-dimensional (2D) points in the image coordinates, where each point is a 2D vector of the hand position. If the human performs a defined gesture for the blimp, the distribution of the hand trajectory data should be close to a line. We use PCA (Wold et al., 1987) to analyze the linearity of the data points and determine whether a hand trajectory is valid as defined.

Fig. 5 Face and hand detection: (a) only face is detected; (b) face and hand are detected with the hand outside the initial region; (c) face and hand are detected with the hand in the initial region. The images are from the onboard camera of Georgia Tech Miniature Autonomous Blimp. References to color refer to the online version of this figure

PCA is an orthogonal linear transformation which transforms the dataset into a new coordinate system such that the greatest variance in the data lies on the first coordinate and the second greatest variance lies on the second coordinate (Fig. 6). In our setup, the direction of the first coordinate from PCA is exactly the hand movement direction.
Fig. 6 PCA illustration. The red crosses represent the data points; the two axes represent the first and second coordinates. References to color refer to the online version of this figure
To apply PCA, we first need to compute the mean-subtracted dataset, since the hand positions are in pixels, which are positive integers and do not have a zero mean. Each element of the mean-subtracted dataset equals the original hand position minus the mean of all hand positions. Then the principal components can be obtained using singular value decomposition (SVD):
$$X' = U \Sigma V^{\mathrm{T}},$$
where $U$ and $V$ are orthonormal matrices and $\Sigma$ is a rectangular diagonal matrix of singular values $\sigma_1 \ge \sigma_2$. After applying SVD, we obtain the two bases of the new PCA coordinates, which are the two column vectors of matrix $V$.
The ratio $\sigma_1/\sigma_2$ is computed to determine whether a hand trajectory is linear. A large ratio represents high linearity. However, since humans cannot move their hands in a perfectly straight line, we need to allow some tolerance. To achieve high accuracy and robustness in gesture recognition, we run multiple trials using the blimp camera to collect both valid and invalid hand trajectories and finally select the threshold as five. Additionally, to avoid false detection of human hand gestures, we require the maximum first principal component among all the hand position data to be greater than or equal to 250 (in pixels) so that the hand movement is noticeable enough that a human can recognize it. That is to say, if $\sigma_1/\sigma_2 \ge 5$ and the maximum first principal component is at least 250 pixels, the hand trajectory is detected as a valid linear hand gesture.
For a valid linear hand gesture, the slope of the first principal direction, defined as the ratio of its second element to its first element, is used to determine the direction of the hand gesture. If the magnitude of the slope is small enough, the gesture is a horizontal gesture. If it is large enough, the gesture is a vertical gesture. Otherwise, the hand gesture is invalid.
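The full gesture test described in this subsection can be sketched as a short function: mean-subtract the 50 collected hand positions, take the SVD, require the linearity ratio to be at least five and the maximum first principal component to reach 250 pixels, and then classify the direction from the first principal axis. The angle tolerances separating horizontal from vertical below are assumptions, since the exact slope cut-offs are not restated here.

```python
# Runnable sketch of the PCA-based gesture test: mean-subtraction, SVD,
# linearity ratio (>= 5), 250-pixel extent check, and direction classification.
import numpy as np

def classify_gesture(hand_positions, ratio_min=5.0, extent_min=250.0):
    X = np.asarray(hand_positions, dtype=float)          # n x 2 hand trajectory (pixels)
    Xc = X - X.mean(axis=0)                               # mean-subtracted dataset
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)     # Xc = U diag(S) Vt
    if S[1] > 0 and S[0] / S[1] < ratio_min:               # linearity ratio sigma1/sigma2
        return "invalid"
    proj = Xc @ Vt[0]                                      # first principal components
    if proj.max() < extent_min:                            # movement too small to count
        return "invalid"
    vx, vy = Vt[0]                                         # first principal direction
    angle = abs(np.degrees(np.arctan2(vy, vx))) % 180.0
    if angle < 30.0 or angle > 150.0:                      # assumed tolerance around horizontal
        return "horizontal"
    if 60.0 < angle < 120.0:                               # assumed tolerance around vertical
        return "vertical"
    return "invalid"

# Example: a roughly horizontal sweep of the hand across the image
xs = np.linspace(50, 600, 50)
ys = 240 + 5 * np.random.randn(50)
print(classify_gesture(np.column_stack([xs, ys])))         # typically "horizontal"
```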

5.3 Visual display

However, using hand gesture recognition to activate the blimp reactive behaviors may not always work for human users. This is because there is a time delay between the time instant when the blimp detects a human and the time instant when the blimp initiates the corresponding movement. Although the time delay is only a few seconds, a human user may find the delay confusing because the person perceives no immediate reaction from the blimp. The human user may redo the hand gesture, approach the blimp to see whether the blimp is broken, or feel disappointed and walk away, even if the blimp actually recognizes the hand gesture and executes the correct action later.
Through these unsuccessful interactions, we discover that it is important for the blimp to communicate its intentions to humans. To achieve bidirectional communication between the human user and the blimp, we install an LED matrix screen on GT-MAB and it displays what the blimp is "thinking." The LED screen gives the human instantaneous feedback during the interactive process and shows the human the status of the blimp: whether it detects the user and understands his/her hand gesture. The spatially close interaction with the blimp enables the human to see the visual feedback from the LED display, and the visual feedback helps the human user take the correct action for the next step and increases the efficiency and satisfaction of the interaction.
We design four visual patterns on the LED display to represent the four intentions of the blimp (Fig. 7). The first pattern, which is the letter shown in Fig. 7a, indicates that the user's face has been detected, and GT-MAB is ready to detect the human's hand. This is a positive feedback. When the human user sees this pattern, the human should place his/her hand near the face and start a vertical or horizontal hand movement. The second pattern, which is the "check" mark in Fig. 7b, represents that the blimp has successfully detected a human face and a hand in the initial gesture region, and it is in the process of recognizing the human's gesture. This is also a positive feedback. When the human user sees this pattern, the human should continue moving his/her hand. The third pattern, which is the "cross" mark in Fig. 7c, means that no hand has been detected in the initial gesture region or that the blimp cannot recognize a valid hand gesture. This is a negative feedback from the blimp that tells the human there was a mistake during the interaction. When seeing this pattern, the human user should place his/her hand in the initial gesture region and redo the gesture. The last pattern, shown in Fig. 7d, indicates that GT-MAB recognizes a valid hand gesture and is going to make the corresponding motion. When seeing this pattern, the human user can see whether the blimp successfully recognizes the gesture by checking whether the blimp is making the correct motion. Once the blimp completes the motion and returns to the initial position, the joint face and hand detector is triggered to detect the human face. If a face is detected, the first pattern is displayed again and the human can perform the next hand gesture. The whole interaction procedure repeats.

Fig. 7 LED feedback display: (a) face is detected; (b) hand is detected; (c) a hand is not detected or a valid gesture is not recognized; (d) a valid gesture is detected, ready to fly

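A trivial sketch of this feedback logic is a lookup from interaction state to the pattern shown in Fig. 7. The single-word stand-ins below are placeholders for the actual LED matrix bitmaps.

```python
# Tiny sketch of the LED feedback logic in Fig. 7: each interaction state maps
# to one of the four patterns. Values are placeholders for the LED bitmaps.
LED_PATTERNS = {
    "face_detected":    "letter",  # Fig. 7a: ready for a hand gesture
    "recognizing":      "check",   # Fig. 7b: face and hand found, watching the gesture
    "error":            "cross",   # Fig. 7c: no hand / gesture not recognized, please redo
    "gesture_accepted": "ready",   # Fig. 7d: valid gesture, the blimp will move next
}

def led_feedback(state):
    return LED_PATTERNS.get(state, "cross")   # fall back to the negative pattern

print(led_feedback("recognizing"))
```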

6 Localization and control algorithms

In this section, we present the last two steps in the HRI design for GT-MAB: vision-based human localization and blimp motion control.

6.1 Relative position estimation

GT-MAB localizes a human using its onboard monocular camera only. This is different from most other blimps, which use an external system to localize humans, such as indoor localization or fixed external cameras (Srisamosorn et al., 2016).
We assume that the camera satisfies the pinhole camera model (Corke, 2011), which defines the relationship between a 3D point $(X, Y, Z)$ in the camera coordinates and a 2D point $(u, v)$ in the camera image frame:
$$u = f_x \frac{X}{Z} + c_x, \quad v = f_y \frac{Y}{Z} + c_y,$$
where $f_x$ and $f_y$ are the focal lengths in the $u$ and $v$ directions respectively, and $(c_x, c_y)$ is the optical center of the image. Here, we assume that $f_x$ and $f_y$ are both equal to the focal length $f$ and that $(c_x, c_y)$ is the center of the image.
The illustration of human position estimation is shown in Fig. 8. Because the pitch and roll angles of the blimp are very small, we can assume that the camera projection plane is always perpendicular to the ground. This assumption does not hold for quad-rotors because they need to change the pitch angle to fly forward or backward. GT-MAB provides a certain convenience for supporting vision-based HRI algorithms because the pitch and roll angles of the onboard camera can be controlled to be zero. The center line of the human face is assumed to be parallel to the image plane; i.e., the plane of the human face is also perpendicular to the ground.
Fig. 8 Illustration of relative distance estimation

The center point of the face center line and the corresponding projection points are marked in Fig. 8. We distinguish the actual length of the human face from the length of the human face in the camera projection plane.
In the calibration phase, we use the detection algorithm introduced in Section 5.1 to compute a human user's face length, denoted as