
Autonomous flying blimp interaction with human in an indoor space*

Ning-shi YAO, Qiu-yang TAO, Wei-yu LIU, Zhen LIU, Ye TIAN, Pei-yu WANG, Timothy LI, Fumin ZHANG
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
†E-mail: nyao6@gatech.edu; qtao7@gatech.edu; wliu88@gatech.edu; fumin@gatech.edu
Received Sept. 20, 2018; Revision accepted Nov. 26, 2018; Crosschecked Jan. 8, 2019

Abstract

We present the Georgia Tech Miniature Autonomous Blimp (GT-MAB), which is designed to support human-robot interaction experiments in an indoor space for up to two hours. GT-MAB is safe while flying in close proximity to humans. It is able to detect the face of a human subject, follow the human, and recognize hand gestures. GT-MAB employs a deep neural network based on the single shot multibox detector to jointly detect a human user's face and hands in a real-time video stream collected by the onboard camera. A human-robot interaction procedure is designed and tested with various human users. The learning algorithms recognize two hand waving gestures. The human user does not need to wear any additional tracking device when interacting with the flying blimp. Vision-based feedback controllers are designed to control the blimp to follow the human and fly in one of two distinguishable patterns in response to each of the two hand gestures. The blimp communicates its intentions to the human user by displaying visual symbols. The collected experimental data show that the visual feedback from the blimp in reaction to the human user significantly improves the interactive experience between blimp and human. The demonstrated success of this procedure indicates that GT-MAB could serve as a flying robot that is able to collect human data safely in an indoor environment.

Key words: Robotic blimp; Human-robot interaction; Deep learning; Face detection; Gesture recognition

https://doi.org/10.1631/FITEE.1800587
CLC number: TP24

1 Introduction

Recent advances in robotics have enabled the rapid development of unmanned aerial vehicles (UAVs). With the increasing penetration of UAVs into industry and everyday life, cooperation between humans and UAVs is quickly becoming unavoidable. It is extremely important that UAVs interact with humans safely and naturally (Duffy, 2003; Goodrich and Schultz, 2007), and to this end, the study of human-robot interaction (HRI) has enjoyed recent research interest (Draper et al., 2003; de Crescenzio et al., 2009; Duncan and Murphy, 2013; Acharya et al., 2017; Peshkova et al., 2017). Quad-rotors are one of the most popular robotic platforms for three-dimensional (3D) HRI studies (Graether and Mueller, 2012; Arroyo et al., 2014; Szafir et al., 2015; Cauchard et al., 2016; Monajjemi et al., 2016). Humans can use speech/verbal cues (Pourmehr et al., 2014), eye gaze (Monajjemi et al., 2013; Hansen et al., 2014), and hand gestures (Naseer et al., 2013; Costante et al., 2014) to command quad-rotors to accomplish certain tasks. In addition to quad-rotors, other types of UAVs, such as fixed-wing aircraft (He et al., 2011) and flying displays (Schneegass et al., 2014), have been developed to interact with humans.
In contrast to fast-flying UAVs, autonomous blimps are the preferred platform for HRI (Liew and Yairi, 2013) in certain applications where human comfort is a major concern. In Burri et al. (2013), a spherical robotic blimp was proposed to monitor activities in human crowds. St-Onge et al. (2017) demonstrated that a cubic-shaped blimp can fly close to human artists on stage. In another case, a robotic blimp (Srisamosorn et al., 2016) was used for monitoring elderly people inside a nursing home. These applications demonstrate the necessity of studying human-blimp interaction. However, there is a lack of dedicated design for autonomous blimps to support experiments in human interaction with flying robots in an indoor lab space.
We developed the Georgia Tech Miniature Autonomous Blimp (GT-MAB), which is designed to collect experimental data for indoor HRI (Cho et al., 2017; Yao et al., 2017; Tao et al., 2018). Being a flying robot, GT-MAB does not pose the safety threats and anxiety that a typical quad-rotor can cause to humans, and it can fly close to humans in indoor environments. In addition, GT-MAB has a relatively long flight time of up to two hours per battery charge, which supports uninterrupted HRI experiments. In this study, we introduce the hardware designs, perception algorithms, and feedback controllers on GT-MAB that identify human intentions through hand gesture recognition, communicate the robot's intentions to its human subjects, and execute the blimp behavior in reaction to the hand gestures. These features are the basic building blocks for more sophisticated experiments to collect human data and study human behaviors.
Achieving natural HRI can be more easily accomplished when the human subject does not need to wear tracking devices or use other instrumentation to interact with the robot (Duffy, 2003). With only one onboard monocular camera installed on GT-MAB, its perception algorithms can identify human intentions. We implemented a deep learning algorithm (specifically, the single-shot multibox detector (SSD) in Liu et al. (2016)), so GT-MAB could detect human faces and hands. Then we applied principal component analysis (PCA) (Wold et al., 1987) to robustly distinguish hand waving gestures in the horizontal and vertical directions. These two hand gestures trigger different reactions in the blimp. A person uses horizontal hand gestures to trigger spinning in GT-MAB and uses vertical hand gestures to trigger the blimp to fly backward (Fig. 1). We use monocular vision to measure the position of the human relative to the blimp. Vision-based feedback controllers then enable the blimp to autonomously follow a person while identifying the hand gestures. GT-MAB communicates its intentions by displaying immediate visual feedback on an onboard light-emitting diode (LED) display. The visual feedback is proven to be a key feature that improves the interactive experience.
Fig. 1 An uninstrumented human user interacts with the Georgia Tech Miniature Autonomous Blimp (GT-MAB) in close proximity and commands the GT-MAB via gestures
We conducted HRI experiments with multiple human participants and presented a user study to evaluate the effectiveness of the proposed HRI procedure. In our experiments, GT-MAB reliably demonstrated its ability to follow humans and it consistently collected the human data while interacting with humans. In the user study, most of the participants could successfully control the robotic blimp using the two hand gestures and reported positive feedback about the interactive experience. These results clearly demonstrated the effectiveness of the basic features of GT-MAB.

2 Literature review and novelty

2.1 Data collection in the human intimate zone

Hall (1966) defined space in terms of distance to humans. The intimate (<0.46 m), personal (0.46-1.2 m), social (1.2-3.6 m), and public (>3.6 m) spatial zones have been widely used in both human-human and HRI literature. However, because of their relatively high speed and powerful propellers, quad-rotors normally need to keep a relatively far distance from humans to ensure safe and comfortable interaction. In previous HRI works (Monajjemi et al., 2013; Naseer et al., 2013; Costante et al., 2014; Nagi et al., 2014), researchers proposed similar HRI designs whereby a human user could control a single quad-rotor or a team of quad-rotors through face and hand gesture recognition. However, quad-rotors need to stay more than 2 m away from humans to protect the users and avoid making the user feel threatened. Duncan and Murphy (2013) suggested that the minimum comfortable distance for humans interacting with small quad-rotors could not be less than 0.65 m. It is difficult for UAVs with strong propellers or for existing blimps (due to their size and functionality) to enter the human user's intimate space and collect human data without prompting anxiety in the user. GT-MAB can interact with humans within 0.4 m and collect videos of the human and the human's trajectories, which can be used to fit the social force model of Helbing and Molnár (1995) in the intimate zone. To the best of our knowledge, GT-MAB is perhaps the first aerial robotic platform that is able to collect HRI data naturally within the human intimate spatial zone.

2.2 Visual feedback from blimp to human

Visual feedback in the HRI procedure can significantly improve the interactive experience. Previous research has explored the implicit expressions of robot intentions by manipulating the flying motions (Sharma et al., 2013; Szafir et al., 2014; Cauchard et al., 2016). However, such implicit expressions are limited when aerial robots interact closely with humans. Explicit expressions are preferred for proximal interactions. Szafir et al. (2015) devised a ring of LED lights under the quad-rotor and designed four signals to indicate the next flight motion of the quad-rotor. A user study was conducted, where human participants were asked to predict the robot's intentions. The user study verified that the LED signals significantly improved the viewer response time and accuracy compared to a robot without the signals. However, in that work, the human participants were separated from the robot's environment by a floor-to-ceiling glass panel, so it was not an interactive environment. In our work, we discovered that immediate visual feedback is crucial for reducing human's confusion caused by the time delays between the time when a robot perceives a human command and the time when the robot initiates an action. We conducted a user study for our proposed HRI process and verified that the LED feedback significantly improves the interactive experience and efficiency.

2.3 Monocular vision based human localization

To localize a human, quad-rotors normally require a depth camera (Lichtenstern et al., 2012; Naseer et al., 2013). Recent works (Costante et al., 2014; Lim and Sinha, 2015; Perera et al., 2018) have also used a monocular camera on UAVs to localize humans and estimate human trajectories. Since these works used quad-rotors as the HRI platforms, one unavoidable step for monocular vision is to estimate the camera pose due to the flying mechanism of the quad-rotors, which is a challenging problem. Compared to quad-rotors, GT-MAB is self-stabilized and can fly in a horizontal plane with almost no vibration, so the pitch and roll angles of GT-MAB can be approximately viewed as staying at zero. The pose of the onboard camera is fixed. Due to this unique feature, we developed a vision-based technique to localize a human in real time from the onboard monocular camera of GT-MAB.

2.4 Joint face and hand detection

In previous gesture-based HRIs (Monajjemi et al., 2013; Costante et al., 2014), human face detection is necessary for distinguishing a human from other objects. Once a human face is detected, a hand detector is triggered to recognize human gestures. For each frame, two feature detectors are needed to detect different human features. The computation to run two feature detectors takes a relatively long time and is hard to implement for real-time video. To speed up video processing, feature tracking algorithms are used to track the feature detected in the previous frame (Birchfield, 1996). However, the tracking algorithms cannot consistently provide an accurate and tight bounding box around the human feature. To overcome the above-mentioned problems, we use one of the state-of-the-art object detection deep learning algorithms, SSD (Liu et al., 2016), in the context of human blimp interaction.
We also build a dataset that is efficient and adequate for training the SSD for real-time face and hand detection.
In our previous work (Yao et al., 2017), we achieved human following behavior on GT-MAB. GT-MAB was able to follow a human who was not wearing a tracking device and keep the human in sight of its onboard camera based on face detection, but GT-MAB could not react to the human. In this study, we propose a novel advancement to achieve natural interaction between a human and GT-MAB by enabling GT-MAB to recognize human intentions through hand gestures and react to human intentions through visual feedback and flying motions.

3 GT-MAB platform

GT-MAB consists of an envelope and a customized gondola. The envelope has a unique saucerlike shape, as shown in Fig. 1, which solves the conflict between maneuverability and stability and enhances its capability to interact with humans at a close distance. The gondola is a 3D-printed mechanical structure accommodating all onboard devices underneath the envelope. Fig. 2 depicts the structure of the gondola and indicates the main components installed on it. We use five motors for the HRI application. The vertically mounted motors are used to change the altitude, while the horizontal ones enable the blimp to fly horizontally and change the heading angle. One side-way motion motor is used to keep the blimp in the front of a human. This design enables the blimp to move in the 3D space without changing its roll and pitch angles.
Fig. 2 Georgia Tech Miniature Autonomous Blimp gondola with the installed electronic components
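To make the motor layout above concrete, the sketch below shows one way the five motors could be driven from high-level motion commands. This is a minimal illustration only: the motor ordering, signs, and scaling are assumptions for this sketch, not the actual GT-MAB control allocation.

```python
# A minimal sketch of driving the five motors described above from normalized
# motion commands. Motor indices, signs, and scaling are illustrative
# assumptions, not the GT-MAB firmware mapping.

def mix_motors(forward, lateral, vertical, yaw):
    """Map normalized commands in [-1, 1] to five motor thrust values.

    forward/yaw -> two horizontally mounted motors (differential thrust)
    vertical    -> two vertically mounted motors (altitude)
    lateral     -> single side-way motor (keeps the blimp in front of the user)
    """
    left_horizontal = forward + yaw       # assumed differential drive for heading
    right_horizontal = forward - yaw
    vertical_1 = vertical                 # both vertical motors share the altitude command
    vertical_2 = vertical
    side = lateral
    clamp = lambda u: max(-1.0, min(1.0, u))
    return [clamp(u) for u in (left_horizontal, right_horizontal,
                               vertical_1, vertical_2, side)]

# Example: fly backward while holding altitude and heading
print(mix_motors(forward=-0.3, lateral=0.0, vertical=0.0, yaw=0.0))
```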

The appealing characteristics of GT-MAB, especially its small size, impose challenges on the blimp's hardware design. The blimp has only 60 g of total load capacity, including the onboard camera, microprocessors, and wireless communication devices. One difficulty in vision-based HRI using a blimp is finding a wireless camera that is light enough. The camera we selected for GT-MAB is an analog camera, which is the best option we could find that supports low-latency wireless transmission. This compact device weighs 4.5 g and has a diagonal field of view of 115 degrees. The camera is directly attached to the gondola. However, since the camera is analog, the video produced from it includes some glitch noise, which makes image processing more difficult than with digital cameras. We also installed an LED matrix display on the blimp to provide visual feedback for human users. The LED display shows the recognition results, while the controller outputs achieve spinning and backward motions for the control of the blimp. Fig. 3 shows the block diagram of the hardware setup for the system. The video stream coming from the onboard camera is obtained by the receiver connected to the ground station PC. Outputs of the perception and control algorithms running on the ground PC are packed into commands and sent to GT-MAB via an XBee wireless module.
Fig. 3 Hardware overview
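The command path from the ground station to the blimp can be pictured with the short sketch below. It assumes an XBee radio driven through pyserial in transparent mode; the serial port, baud rate, start byte, and packet layout are all illustrative assumptions rather than the actual GT-MAB protocol.

```python
# Sketch of the ground-station side of the command link: perception/control
# outputs are packed into a small binary packet and written to the XBee serial
# radio. Port name, baud rate, and packet layout are assumptions.
import struct
import serial  # pyserial

XBEE_PORT = "/dev/ttyUSB0"   # assumed serial port of the XBee module
XBEE_BAUD = 57600            # assumed baud rate

def pack_command(motor_cmds, seq):
    """Pack five motor commands (floats in [-1, 1]) into a framed packet."""
    payload = struct.pack("<B5f", seq & 0xFF, *motor_cmds)
    checksum = sum(payload) & 0xFF
    return b"\xAA" + payload + bytes([checksum])   # 0xAA = assumed start byte

def send_command(link, motor_cmds, seq):
    link.write(pack_command(motor_cmds, seq))

if __name__ == "__main__":
    with serial.Serial(XBEE_PORT, XBEE_BAUD, timeout=0.1) as link:
        send_command(link, [0.0, 0.0, 0.2, 0.2, 0.0], seq=1)  # gentle climb
```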

4 System overview

We achieve a natural and smooth HRI by enabling GT-MAB to perceive human intention. Humans are required to communicate their intentions to the blimp through predefined hand gestures so that human intentions are regulated and predictable. The human uses only one hand, starts the hand gesture near the face, and moves his/her hand horizontally or vertically. Then the blimp spins or flies backward according to the two hand gestures.

The overall HRI design involves five steps: (1) detecting a human face and hands jointly in a real-time video stream, (2) recognizing hand gestures, (3) communicating the blimp's intentions to the human through the onboard LED display, (4) estimating the human's location relative to the blimp, and (5) controlling the blimp to follow the human and initiate movement according to hand gestures. Fig. 4 shows a block diagram of the proposed HRI design. We first run a joint face and hand detector to detect human features in each video frame. If no face or hand is detected, the onboard LED displays the negative feedback for the human and the detector goes to the next frame. If a face is detected, a human localization algorithm and a human following controller are triggered to maintain the blimp's position relative to the human. If both face and hand are detected, a gesture recognition algorithm is triggered and the LED display shows a positive feedback to the human indicating that the gesture recognition has started. Once a valid gesture is recognized, the LED display shows another positive feedback indicating that GT-MAB has received the human's command. Meanwhile, the blimp controller switches from human following control to blimp motion control, which controls the blimp to initiate spinning or backward motion. Once a motion is accomplished, the joint face/hand detector is activated again and the whole interactive process repeats. The details of each step will be introduced in the following sections.

Fig. 4 System block diagram. The pink blocks represent the five steps. The green blocks represent the visual feedback shown on the LED display. The blue blocks represent logic questions. The blue solid arrows represent the logic flow for the process and the dashed arrows represent the data flow between each block. References to color refer to the online version of this figure
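As a compact illustration of the logic in Fig. 4, the sketch below walks one video frame through the five steps and returns the LED pattern and blimp action that would result. The function, pattern names, and action names are placeholders chosen for this sketch, not the project's actual interfaces.

```python
# A runnable sketch of one pass through the interaction logic in Fig. 4.
# Pattern names ("letter", "check", "cross", "ready") and action names are
# placeholders for this illustration.
def hri_step(face, hand, hand_in_initial_region, gesture):
    """Return (led_pattern, blimp_action) for one processed video frame."""
    if face is None:                                  # step 1 failed: no face detected
        return "cross", "hover"
    if hand is None or not hand_in_initial_region:    # steps 4-5: follow and wait for a gesture
        return "letter", "follow_human"
    if gesture is None:                               # step 2 in progress: keep watching the hand
        return "check", "follow_human"
    if gesture == "horizontal":                       # steps 3 and 5: feedback, then spin
        return "ready", "spin"
    if gesture == "vertical":
        return "ready", "fly_backward"
    return "cross", "follow_human"                    # invalid gesture: ask the user to redo it

# Example frames: nothing detected, then a recognized horizontal wave.
print(hri_step(face=None, hand=None, hand_in_initial_region=False, gesture=None))
print(hri_step(face=(320, 180), hand=(280, 190), hand_in_initial_region=True, gesture="horizontal"))
```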

5 Perception algorithms

In this section, we present the first three steps of GT-MAB's perception capabilities.

5.1 Joint detection of face and hand

We leverage the SSD (Liu et al., 2016), which is fast and can detect multiple categories of objects at the same time, to jointly detect a human face and hands in real-time videos. The idea behind SSD is simple. It reframes object detection as a single regression problem where object bounding boxes are assigned confidence scores representing how likely a bounding box is to contain a specific object. To train the SSD, the learning algorithm discretizes a training image into grid cells. Each cell has default bounding boxes of different locations and sizes. During the training process, these default boxes are compared with the ground-truth bounding boxes in training images, and a confidence score is computed for each object category. The neural network is trained to determine which default box has the highest corresponding confidence score. During detection, the trained neural network can directly generate the bounding box with the highest confidence score and determine to which category the bounded object belongs.
Particularly, to train the SSD for joint detection of a human face and hand, we create a new training set leveraging an image dataset from the Oxford Vision Group (Mittal et al., 2011). The images in this dataset have already been labeled with the human's hands. However, the human faces in this dataset are not labeled. To modify the dataset for both hand and face detection, we first assign the originally labeled hand regions as category 1. Then we use the Haar face detector (Viola and Jones, 2004) to detect a human face and label the face bounding box as category 2. We divide the modified dataset into a training set, which contains 4069 images, and a test set, which contains 821 images. The joint face and hand detector is then trained offline using the new training set. We fine-tune the neural network using stochastic gradient descent with 0.9 momentum, 0.0005 weight decay, and a 128 batch size. As for the learning rate, we use for the first iterations, and then continue training for iterations with a learning rate and another iterations with a learning rate.
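The relabeling step above can be sketched as follows: existing hand annotations are kept as category 1, and faces found by OpenCV's Haar cascade are appended as category 2. The file path and box format are placeholders; the actual annotation format of the Oxford hand dataset is not reproduced here.

```python
# Sketch of the dataset relabeling step: hands keep their original labels
# (category 1) and faces found by a Haar cascade are added as category 2.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def add_face_labels(image_path, hand_boxes):
    """Return (category, x, y, w, h) tuples for one training image."""
    labels = [(1, *box) for box in hand_boxes]          # category 1: hands
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    labels += [(2, int(x), int(y), int(w), int(h)) for (x, y, w, h) in faces]
    return labels

# Example (path and hand box are hypothetical):
# print(add_face_labels("oxford_hands/img_0001.jpg", [(34, 50, 60, 60)]))
```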
The trained joint face and hand detector is evaluated on the test set using the mean average precision (mAP), a common metric used in feature and object detection. Specifically, for each bounding box generated by the trained detector, we discard the box if it has less than percent intersection over union with the ground-truth bounding box. Given a specific threshold , we compute the average precision (AP) for each test image. Then we compute the mAP by taking the mean of all APs over all the test images. The test results are that with , the detector achieves 0.862 mAP; with , the detector achieves 0.844 mAP; and with , the detector achieves 0.684 mAP. The performance is almost the same as that in Liu et al. (2016).
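For reference, the intersection-over-union test underlying this evaluation can be written in a few lines. The boxes are (x1, y1, x2, y2) corner coordinates; the 0.5 threshold in the example is a common default and merely stands in for the thresholds used in the evaluation above.

```python
# Minimal sketch of the IoU test used in the mAP evaluation: a predicted box
# is kept only if its intersection-over-union with a ground-truth box reaches
# the chosen threshold.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def keep_detection(pred_box, gt_boxes, threshold=0.5):  # 0.5 is a common default, assumed here
    return any(iou(pred_box, gt) >= threshold for gt in gt_boxes)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```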
After testing the joint face and hand detector, the detector is applied to detect a human face and hand in the real-time video stream from the blimp camera. The results are shown in Fig. 5. The detected face is bounded by the yellow box with a label "Face" and the detected hand is bounded by the box labeled as "Hand." Fig. 5a shows the case where only a face is detected. Fig. 5b shows the case where both a face and a hand are detected but the hand is outside the initial gesture region, i.e., the two yellow boxes near the face bounding box. We define the initial gesture regions to filter out incorrect human gestures or random hand movements, and to ensure that the gesture recognition is more robust. Fig. 5c shows the case where both a face and a hand are detected with the hand in the initial gesture region. Only this case initializes the gesture recognition step.
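A minimal way to run such a detector frame by frame is sketched below using OpenCV's dnn module with a Caffe-format SSD, the architecture of Liu et al. (2016). The model file names, input size, and mean values are assumptions for illustration; the helper at the end produces the face center and face length used in the next paragraph.

```python
# Sketch of running a trained SSD on a video frame and extracting the face
# quantities used below. Model file names are placeholders.
import cv2

net = cv2.dnn.readNetFromCaffe("ssd_face_hand.prototxt", "ssd_face_hand.caffemodel")
HAND, FACE = 1, 2   # category ids used when relabeling the training set

def detect(frame, conf_threshold=0.5):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0, (300, 300),
                                 (104.0, 117.0, 123.0))   # standard SSD-300 input, assumed
    net.setInput(blob)
    detections = net.forward()          # shape (1, 1, N, 7): [_, class, conf, x1, y1, x2, y2]
    boxes = {}
    for i in range(detections.shape[2]):
        _, cls, conf, x1, y1, x2, y2 = detections[0, 0, i]
        if conf >= conf_threshold:
            boxes.setdefault(int(cls), []).append((x1 * w, y1 * h, x2 * w, y2 * h))
    return boxes

def face_center_and_length(face_box):
    x1, y1, x2, y2 = face_box
    # face length taken here as the box height in pixels (an assumption)
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0), (y2 - y1)
```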
Based on the bounding boxes, we define the position of the human face to be the center of the face bounding box, and the face length to be the length of the face bounding box in the image frame, both measured in pixels. The hand position is the center of the hand bounding box. We use the face position and the length of the human face to estimate the human position relative to the blimp, which will be introduced in Section 6.1.

5.2 Hand gesture recognition

Once the gesture recognition algorithm is initialized, the algorithm identifies two types of hand movements: horizontal linear hand movements and vertical linear hand movements.
The detection algorithm tracks the human hand from frame to frame. Once gesture recognition is triggered, the hand position is not restricted by the initial gesture region. The human hand can move out of the initial region and still be recognized. We collect the hand position data in 50 successive video frames once gesture recognition is triggered. The hand trajectory is modeled as a set of two-dimensional (2D) points in the image coordinates, where each point is a 2D vector of the hand position. If the human performs a defined gesture for the blimp, the distribution of the hand trajectory data should be close to a line. We use PCA (Wold et al., 1987) to analyze the linearity of the data points and determine whether a hand trajectory is valid as defined.

Fig. 5 Face and hand detection: (a) only face is detected; (b) face and hand are detected with the hand outside the initial region; (c) face and hand are detected with the hand in the initial region. The images are from the onboard camera of Georgia Tech Miniature Autonomous Blimp. References to color refer to the online version of this figure

PCA is an orthogonal linear transformation which transforms the dataset into a new coordinate system such that the greatest variance in the data lies on the first coordinate and the second greatest variance lies on the second coordinate (Fig. 6). In our setup, the direction of the first coordinate from PCA is exactly the hand movement direction.
Fig. 6 PCA illustration. The red crosses represent the data points; the two axes represent the first and second coordinates. References to color refer to the online version of this figure
To apply PCA, we first need to compute the mean-subtracted dataset, since the hand positions are in pixels, which are positive integers and do not have a zero mean. Each element of the mean-subtracted dataset equals the original hand position minus the mean of all hand positions. Then the principal components can be obtained using singular value decomposition (SVD):
$$X' = U \Sigma V^{\mathrm{T}},$$
where $U$ and $V$ are orthonormal matrices and $\Sigma$ is a rectangular diagonal matrix of singular values $\sigma_1 \ge \sigma_2$. After applying SVD, we obtain the two bases of the new PCA coordinates, which are the two column vectors of matrix $V$.
The ratio $\sigma_1/\sigma_2$ is computed to determine whether a hand trajectory is linear. A large ratio represents high linearity. However, since humans cannot move their hands in a perfectly straight line, we need to allow some tolerance. To achieve high accuracy and robustness in gesture recognition, we run multiple trials using the blimp camera to collect both valid and invalid hand trajectories and finally select the threshold as five. Additionally, to avoid false detection of human hand gestures, we require the maximum first principal component among all the hand position data to be greater than or equal to 250 (in pixels) so that the hand movement is noticeable enough that a human can recognize it. That is to say, if $\sigma_1/\sigma_2 \ge 5$ and the maximum first principal component is at least 250 pixels, the hand trajectory is detected as a valid linear hand gesture.
For a valid linear hand gesture, the slope of the first principal direction, defined as the ratio of its second element to its first element, is used to determine the direction of the hand gesture. If the magnitude of the slope is small enough, the gesture is a horizontal gesture. If it is large enough, the gesture is a vertical gesture. Otherwise, the hand gesture is invalid.
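The full gesture test described in this subsection can be sketched as a short function: mean-subtract the 50 collected hand positions, take the SVD, require the linearity ratio to be at least five and the maximum first principal component to reach 250 pixels, and then classify the direction from the first principal axis. The angle tolerances separating horizontal from vertical below are assumptions, since the exact slope cut-offs are not restated here.

```python
# Runnable sketch of the PCA-based gesture test: mean-subtraction, SVD,
# linearity ratio (>= 5), 250-pixel extent check, and direction classification.
import numpy as np

def classify_gesture(hand_positions, ratio_min=5.0, extent_min=250.0):
    X = np.asarray(hand_positions, dtype=float)          # n x 2 hand trajectory (pixels)
    Xc = X - X.mean(axis=0)                               # mean-subtracted dataset
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)     # Xc = U diag(S) Vt
    if S[1] > 0 and S[0] / S[1] < ratio_min:               # linearity ratio sigma1/sigma2
        return "invalid"
    proj = Xc @ Vt[0]                                      # first principal components
    if proj.max() < extent_min:                            # movement too small to count
        return "invalid"
    vx, vy = Vt[0]                                         # first principal direction
    angle = abs(np.degrees(np.arctan2(vy, vx))) % 180.0
    if angle < 30.0 or angle > 150.0:                      # assumed tolerance around horizontal
        return "horizontal"
    if 60.0 < angle < 120.0:                               # assumed tolerance around vertical
        return "vertical"
    return "invalid"

# Example: a roughly horizontal sweep of the hand across the image
xs = np.linspace(50, 600, 50)
ys = 240 + 5 * np.random.randn(50)
print(classify_gesture(np.column_stack([xs, ys])))         # typically "horizontal"
```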

5.3 Visual display

However, using hand gesture recognition to activate the blimp reactive behaviors may not always work for human users. This is because there is a time delay between the time instant when the blimp detects a human and the time instant when the blimp initiates the corresponding movement. Although the time delay is only a few seconds, a human user may find the delay confusing because the person perceives no immediate reaction from the blimp. The human user may redo the hand gesture, approach the blimp to see whether the blimp is broken, or feel disappointed and walk away, even if the blimp actually recognizes the hand gesture and executes the correct action later.
Through these unsuccessful interactions, we discover that it is important for the blimp to communicate its intentions to humans. To achieve bidirectional communication between the human user and the blimp, we install an LED matrix screen on GT-MAB and it displays what the blimp is "thinking." The LED screen gives the human instantaneous feedback during the interactive process and shows the human the status of the blimp: whether it detects the user and understands his/her hand gesture. The spatially close interaction with the blimp enables the human to see the visual feedback from the LED display, and the visual feedback helps the human user take the correct action for the next step and increases the efficiency and satisfaction of the interaction.
We design four visual patterns on the LED display to represent the four intentions of the blimp (Fig. 7). The first pattern, which is the letter shown in Fig. 7a, indicates that the user's face has been detected, and GT-MAB is ready to detect the human's hand. This is a positive feedback. When the human user sees this pattern, the human should place his/her hand near the face and start a vertical or horizontal hand movement. The second pattern, which is the "check" mark in Fig. 7b, represents that the blimp has successfully detected a human face and a hand in the initial gesture region, and it is in the process of recognizing the human's gesture. This is also a positive feedback. When the human user sees this pattern, the human should continue moving his/her hand. The third pattern, which is the "cross" mark in Fig. 7c, means that no hand has been detected in the initial gesture region or that the blimp cannot recognize a valid hand gesture. This is a negative feedback from the blimp that tells the human there was a mistake during the interaction. When seeing this pattern, the human user should place his/her hand in the initial gesture region and redo the gesture. The last pattern, shown in Fig. 7d, indicates that GT-MAB recognizes a valid hand gesture and is going to make the corresponding motion. When seeing this pattern, the human user can see whether the blimp successfully recognizes the gesture by checking whether the blimp is making the correct motion. Once the blimp completes the motion and returns to the initial position, the joint face and hand detector is triggered to detect the human face. If a face is detected, the first pattern is displayed again and the human can perform the next hand gesture. The whole interaction procedure repeats.

Fig. 7 LED feedback display: (a) face is detected; (b) hand is detected; (c) a hand is not detected or a valid gesture is not recognized; (d) a valid gesture is detected, ready to fly

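A trivial sketch of this feedback logic is a lookup from interaction state to the pattern shown in Fig. 7. The single-word stand-ins below are placeholders for the actual LED matrix bitmaps.

```python
# Tiny sketch of the LED feedback logic in Fig. 7: each interaction state maps
# to one of the four patterns. Values are placeholders for the LED bitmaps.
LED_PATTERNS = {
    "face_detected":    "letter",  # Fig. 7a: ready for a hand gesture
    "recognizing":      "check",   # Fig. 7b: face and hand found, watching the gesture
    "error":            "cross",   # Fig. 7c: no hand / gesture not recognized, please redo
    "gesture_accepted": "ready",   # Fig. 7d: valid gesture, the blimp will move next
}

def led_feedback(state):
    return LED_PATTERNS.get(state, "cross")   # fall back to the negative pattern

print(led_feedback("recognizing"))
```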

6 Localization and control algorithms

In this section, we present the last two steps in the HRI design for GT-MAB: vision-based human localization and blimp motion control.

6.1 Relative position estimation

GT-MAB localizes a human using its onboard monocular camera only. This is different from most other blimps, which use an external system to localize humans, such as indoor localization or fixed external cameras (Srisamosorn et al., 2016).
We assume that the camera satisfies the pinhole camera model (Corke, 2011), which defines the relationship between a 3D point $(X, Y, Z)$ in the camera coordinates and a 2D point $(u, v)$ in the camera image frame:
$$u = f_x \frac{X}{Z} + c_x, \quad v = f_y \frac{Y}{Z} + c_y,$$
where $f_x$ and $f_y$ are the focal lengths in the $u$ and $v$ directions respectively, and $(c_x, c_y)$ is the optical center of the image. Here, we assume that $f_x$ and $f_y$ are both equal to the focal length $f$ and that $(c_x, c_y)$ is the center of the image.
The illustration of human position estimation is shown in Fig. 8. Because the pitch and roll angles of the blimp are very small, we can assume that the camera projection plane is always perpendicular to the ground. This assumption does not hold for quad-rotors because they need to change the pitch angle to fly forward or backward. GT-MAB provides a certain convenience for supporting vision-based HRI algorithms because the pitch and roll angles of the onboard camera can be controlled to be zero. The center line of the human face is assumed to be parallel to the image plane; i.e., the plane of the human face is also perpendicular to the ground.
Fig. 8 Illustration of relative distance estimation

The center point of the face center line and the corresponding projection points are marked in Fig. 8. We distinguish the actual length of the human face from the length of the human face in the camera projection plane.
In the calibration phase, we use the detection algorithm introduced in Section 5.1 to compute a human user's face length, denoted as