FoAR: Force-Aware Reactive Policy for
Contact-Rich Robotic Manipulation

Zihao He^∗, Hongjie Fang^∗, Jingjing Chen, Hao-Shu Fang^† and Cewu Lu^†
Shanghai Jiao Tong University
^∗ Equal contribution. ^† Hao-Shu Fang and Cewu Lu are the corresponding authors. Emails: {he0610, galaxies, jjchen20}@sjtu.edu.cn, fhaoshu@gmail.com, lucewu@sjtu.edu.cn

Abstract

Contact-rich tasks present significant challenges for robotic manipulation policies due to the complex dynamics of contact and the need for precise control. Vision-based policies often struggle with the skill required for such tasks, as they typically lack critical contact feedback modalities like force/torque information. To address this issue, we propose FoAR, a force-aware reactive policy that combines high-frequency force/torque sensing with visual inputs to enhance the performance in contact-rich manipulation. Built upon the RISE policy, FoAR incorporates a multimodal feature fusion mechanism guided by a future contact predictor, enabling dynamic adjustment of force/torque data usage between non-contact and contact phases. Its reactive control strategy also allows FoAR to accomplish contact-rich tasks accurately through simple position control. Experimental results demonstrate that FoAR significantly outperforms all baselines across various challenging contact-rich tasks while maintaining robust performance under unexpected dynamic disturbances. Project website: https://tonyfang.net/FoAR/.

I Introduction

Contact-rich manipulation is an essential field in robotics, involving tasks that require sustained, intricate contact with objects or environments [suomalainen2022survey]. Such tasks, including assembly [furniturebench, mimictouch], wiping [acp, maniwav], and peeling [chen2024vegetable, liu2024force], are inherently challenging due to the complex dynamics of force and precise control required. Unlike simple pick-and-place operations [transporter], contact-rich manipulation demands nuanced interaction and real-time adaptation to variations in object properties. As a result, developing effective algorithms and learning models for contact-rich manipulation is crucial for advancing robotic dexterity, enabling more versatile, autonomous, and interactive robot systems.

In recent years, significant progress has been made in vision-based robotic manipulation policies [rt1, diffusionpolicy, oxe, openvla, octo, rise, cage, act, rt2]. However, these policies often fall short of achieving the dexterity required for contact-rich manipulations, as they typically lack crucial contact feedback, such as force/torque and tactile information. This limitation hinders the robot’s ability to perceive contact states and understand physical interactions, thus constraining its manipulation capabilities, as illustrated in Fig. 1 (left).

Figure 1: Overview of the FoAR Policy for Contact-Rich Robotic Manipulations. Vision alone struggles to distinguish contact from non-contact states in contact-rich tasks, underscoring the need for integrating force/torque information. Our FoAR policy combines vision and force/torque inputs to predict robot actions along with a future contact probability $\phi$. Reactive control then refines actions dynamically based on current and predicted future contact states, enabling precise, force-aware manipulations for contact-rich tasks.

To address the limitations of pure vision-based policies, recent approaches have incorporated additional modalities such as audio [playtothescore, seehearfeel, maniwav, hearingtouch], tactile [playtothescore, huang20243d, seehearfeel, eyesight_hand], and force/torque [vtt, makingsensevisiontouch, tacdiffusion, zhou2024admittance] into the policy framework. These multi-modal policies offer promising avenues for advancing robotic manipulation by providing richer feedback about interactions, enabling robots to handle contact-rich tasks with greater precision and adaptability.

Nevertheless, audio and similar indirect sensing modalities are typically vulnerable to background noise [maniwav], complicating signal processing and reducing reliability in real-world applications. Additionally, they often fail to deliver detailed information about contact dynamics between robots and objects, constraining their effectiveness in tasks that require precise manipulation. Although tactile sensing provides direct contact information, it faces unique challenges due to the wide variety of available sensor types. For example, camera-based tactile sensors [digit, gelsight] are highly heterogeneous, making it difficult to standardize the tactile perception results [t3], while magnetic-based tactile sensors [reskin, tito2018a] often encounter inconsistency issues during replacements [anyskin], adding further complexity to their use.

In contact-rich manipulation, integrating force/torque sensing offers an intuitive and versatile approach by directly capturing the physical interactions between the robot and its environment. Since contact inherently produces forces and torques, leveraging this information allows policies to sense and adapt to contact interactions in real time, thereby enhancing the precision and control of manipulation tasks. While prior studies [acp, liu2024force, tacdiffusion] have improved contact-rich task performance by incorporating the force/torque modality, they often combine force/torque data with vision data throughout the whole manipulation process, ignoring the fact that force/torque signals are only sparsely activated. In practice, tasks like wiping involve multiple phases, such as picking up an eraser, performing the wiping, and placing the eraser down. Among these phases, only the wiping phase involves significant contact interactions. During the non-contact phases of the task, the inherent noise in force/torque data from real-world sensors might degrade policy performance.

This paper introduces FoAR, a force-aware reactive policy designed for contact-rich robotic manipulation tasks. Building on the state-of-the-art real-world robot imitation policy RISE [rise], FoAR effectively integrates high-frequency force/torque sensing with visual inputs by dynamically balancing the usage of force/torque data. This enables precise handling of complex contact dynamics while maintaining strong performance in non-contact phases. The co-design of the FoAR policy and its reactive control strategy further enhances its contact-rich task performance through simple position control. With only 50 demonstrations per task, FoAR significantly outperforms baselines across various challenging contact-rich manipulation tasks. Additionally, FoAR demonstrates exceptional robustness, maintaining stable performance in three evaluation scenarios with unexpected dynamic disturbances, highlighting its adaptability and resilience in real-world applications.

II Related Works

II-A Integrating Force/Torque Perception in Manipulation

Force/torque perception is critical for enabling robots to interact effectively with the environment, particularly in manipulations that demand precise control and accurate feedback. By measuring the forces and torques applied to the robot, sensors offer valuable data on contact states, helping the robot perform delicate, contact-rich manipulations [ftsensor_review].

Early research leveraged force/torque feedback for low-level control strategies [Beltran2020learning, hou2019robust, magrini2015control], enabling precise control in contact-rich tasks but often overlooking its potential for high-level decision-making. More recently, advancements have broadened the application of force/torque perception in robot learning. For example, methods such as [aburub2024learning, acp, kamijo2024learning] enhance vision-based policies [diffusionpolicy, act] by incorporating force/torque inputs and predefined stiffness outputs for compliance control. Others have extended the diffusion policy [diffusionpolicy] into the force domain to predict contact wrenches for hybrid force/position control [liu2024force], feedforward forces for impedance control [tacdiffusion], and desired forces for admittance control [zhou2024admittance].

However, these approaches often only emphasize contact phases by assuming the object is already grasped [aburub2024learning, kamijo2024learning, tacdiffusion, zhou2024admittance] or fixed to the robot [acp, liu2024force], bypassing the impact of noisy force/torque readings during non-contact phases. Other works [buamanee2024bi, kobayashi2024alpha] employ torque data for bilateral control but rely on leader-follower teleoperation frameworks [act], limiting their adaptability to different data collection setups.

II-B Contact-Rich Robotic Manipulation

Contact-rich manipulation has been extensively studied due to its relevance in both manufacturing and daily life. It involves enabling robots to perform complex tasks that require precise control during physical interactions with the environment [suomalainen2022survey]. In the past, researchers developed classical force control methods [hogan1985impedance, mason1981compliance, raibert1981hybrid, whitney1985] for assembly tasks, laying the foundation for contact-rich manipulation control techniques. However, these approaches are often limited by their reliance on precise models and predefined strategies.

Recent advances in learning-based methods have greatly expanded robots’ capabilities in contact-rich manipulation. Reinforcement learning-based approaches [kalakrishnan2011learning, makingsensevisiontouch, levine2015learning, noseworthy2024forge, mimictouch] enable robots to learn complex tasks through interaction, but they often struggle with sim-to-real transfer due to discrepancies in visual observations and force/torque feedback, limiting their performance in real-world tasks. Several imitation learning studies seek to improve robot abilities in contact-rich manipulation by incorporating auxiliary modalities like audio [playtothescore, seehearfeel, maniwav, hearingtouch], tactile [denseboxpacking, playtothescore, huang20243d, seehearfeel, eyesight_hand], and force/torque [buamanee2024bi, acp, kamijo2024learning, liu2024force, tacdiffusion, zhou2024admittance]. A key challenge in multimodal policies lies in effectively processing and integrating different modalities within the policy framework, ensuring that the information from each modality is applied to the relevant task phases.

By leveraging the force/torque modality, we propose learning a future contact probability to guide the multimodal feature fusion. This approach allows the force/torque information to enhance the contact phases of the task while preventing noisy data from interfering with other phases, leading to improved performance in various contact-rich manipulation tasks compared to previous fusion methods.

III Method

III-A Preliminary

Given an observation $p_{t}\in\mathbb{R}^{N_{t}\times 6}$ at current timestep $t$ , RISE [rise] $\pi(p_{t})=a_{t:t+T_{a}}$ learns a direct mapping from the current observation to future robot actions over a horizon of $T_{a}$ . Building upon RISE, our proposed force-aware policy, FoAR, incorporates high-frequency force/torque observations $f_{t-T_{o}:t}\in\mathbb{R}^{T_{o}\times 6}$ over a historical horizon of $T_{o}$ as additional inputs. Notice that $T_{o}$ represents the history horizon for high-frequency force/torque data, typically sampled at about 100Hz, while $T_{a}$ denotes the action horizon for future predictions, which operates at a lower frequency like 10Hz.

While vision-based policies [diffusionpolicy, octo, rise, cage] have demonstrated success in simple contact-rich tasks, we argue that force/torque information is vital for more complex scenarios. Taking the determination of contact states as an example, Fig. 1 (left) shows that visual differences in either RGB images or point clouds before and after contact are minimal, making it difficult to determine contact states. In contrast, force/torque data provides clear and reliable indicators of contact, highlighting its critical role in such tasks. As a result, incorporating high-frequency force/torque data complements point cloud observations, enabling more accurate and robust decision-making in contact-rich manipulations.

III-B Force-Aware Policy Design

Point Cloud Encoder. Following RISE [rise], we employ a sparse 3D encoder [minkowski] with a shallow ResNet architecture [resnet] to process the point cloud $p_{t}\in\mathbb{R}^{N_{t}\times 6}$ into sparse point tokens $P_{t}\in\mathbb{R}^{N_{p}\times 512}$. A Transformer [transformer] with sparse point encodings [rise] is then applied to these point tokens to generate a scene feature $h^{s}_{t}\in\mathbb{R}^{512}$.

Force/Torque Encoder. Each force/torque observation $f_{t}\in\mathbb{R}^{6}$ is first processed through a 3-layer MLP to generate the corresponding force token $F_{t}\in\mathbb{R}^{512}$. The tokens over the past horizon, $F_{t-T_{o}:t}\in\mathbb{R}^{T_{o}\times 512}$, are inherently time-series data, so they are then encoded using a Transformer [transformer] with sinusoidal positional encodings applied along the temporal axis, resulting in a force feature $h^{f}_{t}\in\mathbb{R}^{512}$.
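To make the temporal encoding concrete, the following is a minimal sketch of a sinusoidal positional-encoding table for the force tokens. It assumes the standard formulation from the original Transformer; the paper does not state the exact variant, and the function name `sinusoidal_encoding` is hypothetical.

```python
import math
from typing import List


def sinusoidal_encoding(num_steps: int, dim: int) -> List[List[float]]:
    """Build a (num_steps x dim) table of sinusoidal positional encodings.

    Assumption: the standard form, with sin on even channels and cos on odd
    channels, and wavelengths growing geometrically up to 10000 * 2 * pi.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_steps):
        row = []
        for i in range(0, dim, 2):  # i is the even channel index
            angle = pos / (10000.0 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


# One encoding row per force token in the history window (T_o = 200, width 512).
pe = sinusoidal_encoding(num_steps=200, dim=512)
```

Each row would be added to the force token at the same time index before the tokens enter the Transformer, so that the encoder can tell recent readings from older ones.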

Feature Fusion. Previous studies on multimodal feature fusion in robotic manipulation have explored approaches such as direct concatenation of features [makingsensevisiontouch, liu2024force] and processing multimodal tokens through Transformers [vtt, seehearfeel, maniwav, octo]. However, such simple fusion methods often lead to the noisy force modality interfering with the non-contact phases of the task. Instead, we introduce a future contact predictor $\phi(t)\in[0,1]$ to guide the feature fusion process. Specifically, the fused feature $h_{t}$ is calculated as follows:

$$h_{t}=\left[h^{s}_{t};\ \phi(t)\cdot h^{f}_{t}+(1-\phi(t))\cdot h^{*}\right],$$

where $h^{*}$ is a learnable embedding, and $[\cdot;\cdot]$ is the concatenation symbol. In other words, the future contact predictor dynamically adjusts the weight of the force feature $h_{t}^{f}$ in the fusion process, ensuring that the force data is strongly emphasized during contact phases while minimizing its impact during non-contact phases by blending it with a neutral embedding $h^{*}$ . This approach allows the policy to more effectively utilize multimodal information without introducing interference from irrelevant data.
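The gating in the fusion equation can be illustrated with a small sketch. It uses plain Python lists and toy 2-D features in place of the 512-D features, and the function name `fuse_features` is hypothetical; in the actual policy $h^{*}$ is learned, whereas here it is just a fixed vector.

```python
from typing import List


def fuse_features(h_scene: List[float], h_force: List[float],
                  h_neutral: List[float], phi: float) -> List[float]:
    """Return [h_scene ; phi * h_force + (1 - phi) * h_neutral]."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    if len(h_force) != len(h_neutral):
        raise ValueError("force feature and neutral embedding must match in size")
    # Blend the force feature with the neutral embedding, weighted by phi.
    gated = [phi * f + (1.0 - phi) * n for f, n in zip(h_force, h_neutral)]
    # Concatenate the untouched scene feature with the gated force branch.
    return list(h_scene) + gated


# Toy example: 2-D features stand in for the 512-D features.
h_s, h_f, h_star = [1.0, 2.0], [4.0, -4.0], [0.0, 0.0]
no_contact = fuse_features(h_s, h_f, h_star, phi=0.0)  # force branch replaced by h*
contact = fuse_features(h_s, h_f, h_star, phi=1.0)     # force branch fully used
```

With $\phi(t)=0$ the force branch collapses to $h^{*}$ regardless of sensor noise, while the scene feature passes through unchanged in both cases.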

Future Contact Predictor. The future contact predictor takes the current observations as inputs and outputs the probability that contact will occur in the upcoming steps. This probability is used to modulate the fusion of the visual and force modalities, allowing the model to emphasize force data when contact is likely to occur and reduce its influence during non-contact phases. As discussed in §III-A, we use the current RGB image $I_{t}$ and force/torque data $f_{t-T_{o}:t}$ as observation inputs to the predictor, since (1) using RGB images makes the predictor more lightweight, given that they perform similarly to point clouds in contact state determination; and (2) while force/torque data does not directly predict future contact, it helps correct the predictor when unexpected contact occurs.

Action Head. The fused feature $h_{t}$ is then used as the conditioning input for the action denoising process [diffusionpolicy, ddpm, ddim] to generate robot end-effector actions by progressively refining noisy action trajectories.

Supervision. The generated action is supervised by the ground-truth action in the demonstration data via an L2 loss $\mathcal{L}_{\text{action}}$ in the diffusion process. The ground-truth future contact state is automatically extracted from the demonstrations based on whether the force/torque data exceeds a threshold $\delta_{\text{demo}}$ within a time window around the current timestep, and it supervises the future contact predictor through a binary cross-entropy loss $\mathcal{L}_{\text{predictor}}$. The overall loss $\mathcal{L}$ is a linear combination of both terms:

$$\mathcal{L}=\mathcal{L}_{\text{action}}+\alpha\mathcal{L}_{\text{predictor}},$$

where $\alpha$ is the weighting factor.
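The labeling and loss combination described above could look roughly like the sketch below. Several details are assumptions rather than the authors' implementation: the threshold is applied to a scalar force magnitude per step, the window is symmetric with a hypothetical half-width `window`, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def contact_labels(force_norms: Sequence[float], delta_demo: float,
                   window: int) -> List[int]:
    """Label step t as 1 if any reading in [t - window, t + window] exceeds delta_demo.

    Assumption: a symmetric window over scalar force magnitudes.
    """
    n = len(force_norms)
    labels = []
    for t in range(n):
        lo, hi = max(0, t - window), min(n, t + window + 1)
        labels.append(int(any(f > delta_demo for f in force_norms[lo:hi])))
    return labels


def bce(p: float, y: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction, clamped for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def total_loss(action_loss: float, phi: float, label: int, alpha: float) -> float:
    """L = L_action + alpha * L_predictor."""
    return action_loss + alpha * bce(phi, label)


# A single force spike at step 3 marks its neighbours as contact as well.
labels = contact_labels([0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0], delta_demo=5.0, window=1)
```

Because the window extends past the spike in both directions, steps shortly before contact are also labeled positive, which is what lets the predictor anticipate contact instead of merely detecting it.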

III-C Reactive Control in Deployment

Prior literature has explored various control strategies for contact-rich manipulation, such as admittance control [zhou2024admittance], compliance control [acp, kamijo2024learning, mason1981compliance], and hybrid force/position control [liu2024force, raibert1981hybrid]. These approaches often require additional parameters, such as stiffness and contact force direction. In contrast, we demonstrate that our proposed future contact predictor enables the robot to perform accurate, force-feedback-driven manipulation in contact-rich tasks even using simple end-effector position control, eliminating the need for complex parameter tuning and prediction.

We introduce reactive control during deployment, as outlined in Alg. 1. Specifically, we threshold the predicted future contact probability $\phi$ from the contact predictor to determine whether the robot will make contact with the object and whether the predicted end-effector action needs to be adjusted using force/torque feedback. If $\phi$ exceeds the threshold $\delta_{\phi}$, indicating that the robot is in contact or will soon make contact with the object, the controller checks the current force/torque readings $f_{t}$ and corrects the predicted robot actions if insufficient force/torque is detected. For action correction (Lines 12-14 in Alg. 1), we estimate the future action direction based on the predicted action chunk and the current end-effector pose $q_{t}$, then adjust the predicted robot actions by a small step $\epsilon$ towards that direction. Different temporal ensemble buffers [act] are used for contact and non-contact phases to avoid mutual interference while ensuring smooth trajectory execution.
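The correction step can be sketched as follows. This is an illustrative reading of the paragraph above, not a transcription of Alg. 1. It makes three assumptions: actions are reduced to 3-D positions, the "future action direction" is taken as the unit vector from the current position to the last waypoint of the predicted chunk, and contact counts as sufficient when either the force or the torque magnitude reaches its threshold. The temporal ensemble buffers are omitted.

```python
import math
from typing import List, Sequence

Vec3 = List[float]


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def reactive_correction(actions: List[Vec3], q_t: Vec3, phi: float,
                        force: Vec3, torque: Vec3,
                        delta_phi: float, delta_f: float,
                        delta_t: float, eps: float) -> List[Vec3]:
    """Shift the predicted positions by eps along the predicted motion direction
    when contact is expected but the measured wrench is too small."""
    # Contact is not expected: execute the policy output unchanged.
    if phi <= delta_phi:
        return actions
    # Contact is already sufficient (assumed criterion): nothing to correct.
    if norm(force) >= delta_f or norm(torque) >= delta_t:
        return actions
    # Assumed direction estimate: current position -> last predicted waypoint.
    direction = [a - q for a, q in zip(actions[-1], q_t)]
    length = norm(direction)
    if length < 1e-9:
        return actions
    step = [eps * d / length for d in direction]
    return [[a + s for a, s in zip(act, step)] for act in actions]


# The policy predicts a small downward motion, but only 1 N is measured.
corrected = reactive_correction(
    actions=[[0.0, 0.0, -0.01], [0.0, 0.0, -0.02]], q_t=[0.0, 0.0, 0.0],
    phi=0.95, force=[0.0, 0.0, 1.0], torque=[0.0, 0.0, 0.0],
    delta_phi=0.9, delta_f=8.0, delta_t=5.0, eps=0.006)
```

Since the correction only nudges position targets, the robot can keep running a plain end-effector position controller; the force/torque reading decides whether a nudge is applied, not how the controller behaves.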

By incorporating reactive control during deployment, our FoAR policy can effectively handle uncertainties and dynamic changes in the environment, allowing the robot to adapt to real-world variations and achieve more reliable contact-rich manipulation performance.

IV Experiments

During the experiments, we intend to answer the following research questions:

• Q1: Does integrating force/torque information improve policy performance and manipulation accuracy in contact-rich tasks, particularly in real-world scenarios where such tasks involve multiple phases with varying demands on precision and contact interactions?
• Q2: Is the feature fusion module of the FoAR policy more effective than other variants in terms of integrating force/torque information?
• Q3: Does reactive control during deployment enhance the policy’s ability to perform contact-rich actions?
• Q4: Can FoAR maintain consistent task performance under unexpected environmental disturbances?

IV-A Setup

Platform. The experimental platform consists of a Flexiv Rizon robotic arm with a Dahuan AG-95 gripper, and an OptoForce force/torque sensor mounted between the flange and the gripper. The robot operates within a 45cm $\times$ 60cm $\times$ 40cm workspace. An Intel RealSense D435 RGBD camera located in front of the robot workspace is used for scene perception. All devices are linked to a workstation with an Intel Core i9-10900K CPU and an NVIDIA RTX 3090 GPU for both data collection and evaluation.

Tasks. As shown in Fig. 3, we design three challenging contact-rich tasks across two categories: surface force control (Wiping and Peeling) and instantaneous force impact (Chopping). These tasks require different capabilities in terms of the direction, intensity, and precision of applied contact forces. Moreover, these tasks are designed to have both non-contact phases and contact phases for thorough evaluations. For the Wiping task, we design two variants: one with a fixed orientation of the whiteboard, and another that allows arbitrary orientations, denoted as Wiping (General).

Baselines. We evaluate our proposed approach against five baseline methods, including the vision-based policy RISE [rise] and three ablation variants: (1) RISE (force-token): incorporates encoded force/torque information as additional tokens within the RISE transformer, akin to [vtt, seehearfeel, maniwav, octo]; (2) RISE (force-concat): directly concatenates the force feature with the vision feature for action generation; (3) FoAR (3D-cls): uses scene features $h_{t}^{s}$ directly in the future contact predictor, instead of a separate image encoder.

Metrics. For all tasks, we calculate the action success rate (referred to as ASR) to assess the policy’s ability to meet basic action requirements, regardless of action quality. For the Wiping task, the score is 1 for a fully wiped whiteboard, 0.5 for partial wiping, and 0 for no erasure. For the Peeling task, the score is the ratio of the peeled cucumber skin length to the total cucumber length, normalized by the average ratio in the demonstration data (0.778). For the Chopping task, the goal is for the robot to use the knife to divide the pepper into several uniform small segments. Therefore, we focus on the number of segments, as well as the mean and standard deviation of the normalized lengths (defined as the ratio of each segment’s length to the total length of the pepper), providing a comprehensive assessment of chopping precision and consistency, as shown in Fig. 5.
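The peeling and chopping metrics reduce to a few lines of arithmetic, sketched below. The function names are hypothetical, the score is left uncapped because the text does not mention clipping, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple

DEMO_PEEL_RATIO = 0.778  # average peeled ratio in the demonstration data


def peeling_score(peeled_len: float, total_len: float) -> float:
    """Peeled ratio normalized by the average ratio seen in demonstrations."""
    return (peeled_len / total_len) / DEMO_PEEL_RATIO


def chopping_stats(segment_lens: Sequence[float],
                   pepper_len: float) -> Tuple[int, float, float]:
    """Return (segment count, mean, std) of the normalized segment lengths.

    Assumption: population standard deviation over the cut segments.
    """
    normalized = [s / pepper_len for s in segment_lens]
    return len(normalized), mean(normalized), pstdev(normalized)


# Lengths in cm; three equal 2 cm segments cut from a 10 cm pepper.
count, avg, std = chopping_stats([2.0, 2.0, 2.0], pepper_len=10.0)
score = peeling_score(peeled_len=7.78, total_len=10.0)
```

A low standard deviation indicates uniform segments, while the mean reflects how finely the pepper was cut relative to its length.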

Protocols. For policy training, we collect 50 expert demonstrations for the Wiping and Peeling tasks, and 40 for the Chopping task through haptic teleoperation [rh20t]. During evaluation, we run 20 trials per method for the Wiping and Peeling tasks, and 10 trials each only for FoAR and RISE [rise] on the Chopping task to conserve resources. Objects are randomly placed in the workspace, while ensuring similar positions across methods for fair comparisons.

Implementation. FoAR uses $T_{o}=200$ to encode high-frequency (100Hz) force/torque data, corresponding to approximately 2 seconds of data. The dimensions of force tokens, scene feature $h_{t}^{s}$ , force feature $h_{t}^{f}$ , and learnable embedding $h^{*}$ are all set to 512. For the future contact predictor, we utilize a ResNet18 [resnet] vision encoder and an MLP-based force encoder, followed by feature concatenation and a linear layer to output the probability $\phi$ . We combine the action loss and the predictor loss using $\alpha=0.1$ during training. Other hyperparameters remain the same as RISE. For reactive control in deployment, we set the future contact probability threshold $\delta_{\phi}=0.9$ , force threshold $\delta_{f}=8\text{N}$ , torque threshold $\delta_{t}=5\text{N}\cdot\text{m}$ , and small step $\epsilon=0.006\text{m}$ .
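For reference, the reported hyperparameters can be collected in one place. The dataclass below is only an organizational sketch; the class and field names are hypothetical, and only the values come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoARConfig:
    force_history_len: int = 200         # T_o, number of force/torque samples
    force_rate_hz: float = 100.0         # force/torque sampling frequency
    feature_dim: int = 512               # force tokens, h^s, h^f, and h^*
    predictor_loss_weight: float = 0.1   # alpha
    contact_prob_threshold: float = 0.9  # delta_phi
    force_threshold_n: float = 8.0       # delta_f, in newtons
    torque_threshold_nm: float = 5.0     # delta_t, in newton-metres
    correction_step_m: float = 0.006     # epsilon, in metres

    @property
    def force_history_seconds(self) -> float:
        """Duration covered by the force history window."""
        return self.force_history_len / self.force_rate_hz


cfg = FoARConfig()
window_s = cfg.force_history_seconds  # 200 samples at 100 Hz
```

The derived property makes the relationship between $T_{o}$ and the sampling rate explicit: 200 samples at 100Hz cover 2 seconds of force/torque history.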

IV-B Surface Force Control Tasks: Wiping and Peeling

In surface force control tasks (Wiping and Peeling), the robot utilizes force/torque data to maintain consistent surface contact. A key challenge arises from the variability in tool grasp positions (e.g., top, bottom, center, or off-center), requiring adaptive control to adjust for changes in the grasp. As shown in Fig. 3, the Wiping task assesses the ability of the policy to maintain continuous and sustained contact, while the Peeling task emphasizes precision and sensitivity in manipulation.

FoAR significantly outperforms baselines in surface force control tasks by integrating force/torque information to enhance manipulation accuracy and contact consistency (Q1). We report the evaluation results for the Wiping, Wiping (General), and Peeling tasks in Table I. Our proposed method, FoAR, achieves the highest scores of 0.875, 0.850, and 0.756 for the Wiping, Wiping (General), and Peeling tasks, respectively, significantly outperforming all baseline and variant methods. FoAR attains 100% success rates in both grasping the tool and performing the contact-rich operations (wiping and peeling) in all tasks, demonstrating its ability to maintain continuous and precise contact regardless of the grasp position of the tool (eraser or peeler). In contrast, the pure vision-based policy RISE struggles with these contact-rich operations: without force/torque feedback, its position control is inaccurate, and it has difficulty maintaining consistent contact. The qualitative results of the Peeling task in Fig. 4 further support these findings, showing that FoAR achieves more consistent and effective performance than the baselines, which often produce partial peelings or failures.

The feature fusion module of FoAR enhances capabilities in contact-rich operations while maintaining strong performance during non-contact phases, surpassing several variants in integrating force/torque information (Q2). We report the evaluation results for these variants in Table I and Fig. 4. RISE (force-token) and RISE (force-concat) exhibit similar or slightly better performance compared to the pure vision-based policy RISE, suggesting that incorporating force/torque data as an additional input does provide some benefits. However, the key factor is how effectively the policy leverages these inputs. Simply integrating force/torque tokens into the policy transformer or concatenating force features with vision features not only fails to fully leverage force/torque information but also risks introducing noisy force/torque data during non-contact phases. This noise can interfere with the policy's decision-making process and thus negatively impact overall performance, e.g., leading to lower grasp action success rates in both tasks for RISE (force-token). In contrast, FoAR demonstrates strong performance in both contact and non-contact phases, highlighting the effectiveness of our feature fusion module over these variants in utilizing force/torque data.

Method	# Segments $\uparrow$	Norm. Length Avg. $\downarrow$	Norm. Length Std. $\downarrow$	Grasp ASR (%) $\uparrow$	Place ASR (%) $\uparrow$
RISE [rise]	1.8 $\pm$ 0.6	0.727	0.411	100	30
FoAR (ours)	3.9 $\pm$ 0.9	0.353	0.094	100	70
Oracle (demonstration)	5.0 $\pm$ 0.0	0.200	0.056	100	100

TABLE II: Evaluation Results of the Chopping Task. We also calculate the metrics of the demonstrations as an oracle for reference.

Separating the future contact predictor from the policy backbone is crucial to avoid disruption (Q2). As shown in Table I, the FoAR (3D-cls) variant even significantly underperforms the RISE baseline. This variant employs a shared sparse 3D encoder for both the future contact predictor and the policy backbone. We suspect that the visual features required by each component differ substantially. For example, in the Wiping task, the future contact predictor focuses on whether the eraser is positioned above the whiteboard, whereas the policy requires detailed information like the precise locations of objects and the end-effector position. Consequently, sharing a single vision encoder may cause conflicting attention and potential interference, disrupting both components and reducing their effectiveness.

IV-C Instantaneous Force Impact Task: Chopping

The Chopping task evaluates the robot’s ability to handle instantaneous force impacts, requiring precise force and torque control that vision data alone cannot provide [cage]. The main challenge lies in accurately assessing the chopping as the knife’s contact with the pepper and the chopping depth constantly change.

Vision alone is insufficient for ensuring smooth chops, highlighting the necessity of force/torque feedback for precise control and improved policy performance (Q1). The results in Tab. II demonstrate that FoAR outperforms the baseline policy RISE, providing more reliable and controlled performance in the Chopping task. It achieves over double the number of segments (3.9 vs. 1.8) and a lower average normalized segment length (0.353 vs. 0.727) with reduced segment variability (standard deviation of 0.094 vs. 0.411), indicating better performance in chopping the pepper into smaller, more uniform segments.
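
To make the segment metrics above concrete, the following is a minimal sketch of how they could be computed for a single trial. It assumes that each segment length is normalized by the total length of the chopped pepper and that a population standard deviation is used; the function name and both conventions are illustrative assumptions, not the evaluation code used in the paper.

```python
import statistics


def segment_metrics(segment_lengths):
    """Return (count, mean, std) of segment lengths for one trial.

    Assumption: lengths are normalized by their sum (the total pepper
    length), and the spread is the population standard deviation.
    """
    total = sum(segment_lengths)
    normalized = [length / total for length in segment_lengths]
    mean = statistics.fmean(normalized)
    std = statistics.pstdev(normalized)
    return len(normalized), mean, std


# Five equal segments give an average normalized length of 1/5.
count, mean, std = segment_metrics([3.0, 3.0, 3.0, 3.0, 3.0])
```

Under this normalization, more segments and more even cuts both lower the reported numbers, which matches the direction of the arrows in Tab. II.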

IV-D Ablations on Reactive Control

Reactive control in deployment enables the policy to perform more precise contact-rich actions (Q3). To illustrate the importance of reactive control during policy deployment, we use the Wiping (General) task as an example. The results in Table III demonstrate that reactive control is essential for achieving optimal policy performance. Notably, our proposed reactive control relies on the predicted future contact probability from the policy, highlighting the effective co-design of the FoAR policy and the reactive control mechanism. Consequently, in all experiments, reactive control is applied to FoAR-based methods to ensure high performance.

Method	Score $\uparrow$	Grasp ASR (%) $\uparrow$	Wipe ASR (%) $\uparrow$
FoAR w/o Reactive Control	0.650	100	80
FoAR w/ Reactive Control	0.850	100	100

TABLE III: Ablation Results of the Wiping (General) Task on Reactive Control.

IV-E Robustness to Dynamic Disturbances

TABLE IV: Robustness Evaluation Results of the Wiping (General) Task. The figure on the left illustrates the dynamic disturbances in the robustness evaluation. “Original” refers to vanilla evaluation with no disturbances.

Method	Original			Rewrite			Move			Rewrite + Move		
	Score $\uparrow$	Grasp ASR (%)	Wipe ASR (%)	Score $\uparrow$	Grasp ASR (%)	Wipe ASR (%)	Score $\uparrow$	Grasp ASR (%)	Wipe ASR (%)	Score $\uparrow$	Grasp ASR (%)	Wipe ASR (%)
RISE [rise]	0.500	90	80	0.500	80	70	0.600	100	100	0.500	100	70
RISE (force-token)	0.600	90	80	0.450	90	90	0.500	90	80	0.600	100	100
FoAR (ours)	0.850	100	100	0.800	100	100	0.850	100	100	0.800	100	100

To further assess the adaptability of our model FoAR under more challenging and varied conditions, we develop three robustness evaluations for the Wiping (General) task: (1) Rewrite: write new random figures on the wiped area after robot wiping; (2) Move: move the whiteboard to a different position after robot wiping; (3) Rewrite + Move: combine the previous two dynamic disturbances, i.e., move the whiteboard to a different position, and write new random figures on the wiped area after robot wiping. These evaluations introduce disturbances during task execution to assess how well the methods can adjust to new conditions.

FoAR maintains consistent task performance under unexpected and dynamic environmental disturbances, demonstrating superior robustness and adaptability (Q4). As shown in Tab. IV, FoAR achieves scores of 0.800, 0.850, and 0.800 for the Rewrite, Move, and Rewrite + Move robustness evaluations, respectively, while maintaining 100% action success rates. It can adapt to newly written random figures in erased areas, adjust its strategy to changes in whiteboard position, and handle both challenges simultaneously. As a strong vision-based baseline, RISE also exhibits decent generalization ability and handles these variations without performance degradation [rise], but its overall performance is constrained by the absence of force/torque feedback. Built upon RISE, FoAR inherits this strong generalization ability, consistently detecting and responding to dynamic environmental changes while maintaining high performance. Its effective integration of force/torque information lifts its scores above those of RISE in every condition. Conversely, RISE (force-token) struggles to handle such complex scenarios and experiences a performance drop. We hypothesize that the unexpected disturbances force the policy to return to non-contact phases, requiring the action sequence to be re-generated. The inherent noise in force/torque data during these phases further exacerbates errors in the re-generated sequences, hindering its effectiveness.

V Conclusion

In this paper, we propose FoAR, a force-aware reactive policy tailored for contact-rich robotic manipulation. By introducing a future contact predictor, the policy enables effective contact-guided feature fusion between force/torque and visual information, dynamically balancing the contribution of each modality based on future contact probability. This design not only enhances precision during contact phases but also maintains strong performance in non-contact phases. Additionally, the future contact probability further guides the reactive control strategy, improving policy performance even with simple position control. Extensive experiments demonstrate the superior performance of FoAR in contact-rich tasks that require sustained and precise contact, such as wiping, peeling, and chopping. In the future, we plan to integrate advanced control strategies, such as compliance control and hybrid force/position control, into the FoAR policy to further enhance its performance. We also aim to extend this approach to dual-arm robots or humanoid robots for more complex contact-rich manipulation tasks.

Acknowledgement

We would like to thank Chenxi Wang for helpful discussions, and Yiming Wang and Shangning Xia for their help during the data collection process.

\printbibliography

Appendix

V-A Implementation Details

Data Processing. Following RISE [rise], we create the point cloud from a single-view RGB-D image captured by a global camera. Both the input point clouds and the output actions are then expressed in the same camera coordinate frame. The point cloud is cropped based on the pre-defined robot workspace (note that the tabletop points remain after cropping). The coordinates are normalized to $[-1,1]$ based on the robot workspace. The gripper width is also normalized to $[-1,1]$ according to the gripper width range.
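
The cropping and normalization steps above can be sketched as follows. This is a minimal illustration, assuming an axis-aligned workspace box and a linear min-max mapping to $[-1,1]$; the function names and the bounds in the usage line are hypothetical and are not taken from the RISE or FoAR code.

```python
import numpy as np


def crop_and_normalize(points, ws_min, ws_max):
    """Keep points inside the workspace box, then map them to [-1, 1].

    points: (N, 3) array in the camera frame.
    ws_min, ws_max: hypothetical (3,) workspace bounds in the same frame.
    """
    points = np.asarray(points, dtype=np.float64)
    ws_min = np.asarray(ws_min, dtype=np.float64)
    ws_max = np.asarray(ws_max, dtype=np.float64)
    inside = np.all((points >= ws_min) & (points <= ws_max), axis=1)
    cropped = points[inside]
    return 2.0 * (cropped - ws_min) / (ws_max - ws_min) - 1.0


def normalize_width(width, width_min, width_max):
    """Map a gripper width from [width_min, width_max] to [-1, 1]."""
    return 2.0 * (width - width_min) / (width_max - width_min) - 1.0


# Illustrative bounds only; real workspace limits depend on the setup.
cloud = crop_and_normalize([[0.0, 0.0, 0.5], [2.0, 0.0, 0.5]],
                           ws_min=[-0.5, -0.5, 0.0], ws_max=[0.5, 0.5, 1.0])
```

Because the actions live in the same camera frame, the same mapping could be applied to the translational part of the action targets.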

Point Cloud Encoder. We implement the sparse 3D encoder using MinkowskiEngine [minkowski] with a voxel size of $5\text{mm}$. The sparse 3D encoder outputs a set of $512$-dimensional point feature vectors. The transformer [transformer] contains $4$ encoding blocks and $1$ decoding block, with $d_{\text{model}}=512$ and $d_{\text{ff}}=2048$. The readout token has a dimension of $512$.

Force/Torque Encoder. The high-frequency force/torque observations of the last $T_{o}=200$ steps (approximately $2$ seconds at a frequency of $100\text{Hz}$) are encoded via a 3-layer MLP of dimensions $(64,128,512)$. We use the same transformer architecture as the point cloud encoder to process these force/torque tokens, with sinusoidal positional encoding along the temporal axis. The readout token has a dimension of $512$.
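
As an illustration of the temporal encoding mentioned above, the sketch below builds the standard sinusoidal positional encoding from the original Transformer formulation for $T_{o}=200$ tokens of dimension $512$. Whether FoAR uses exactly this variant (base $10000$, interleaved sine and cosine) is an assumption.

```python
import numpy as np


def sinusoidal_positional_encoding(num_steps=200, d_model=512):
    """Return a (num_steps, d_model) table of sinusoidal encodings.

    Even channels hold sines and odd channels hold cosines, with
    wavelengths forming a geometric progression (assumed base 10000).
    """
    positions = np.arange(num_steps, dtype=np.float64)[:, None]
    freqs = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    table = np.zeros((num_steps, d_model))
    table[:, 0::2] = np.sin(positions * freqs)
    table[:, 1::2] = np.cos(positions * freqs)
    return table


pe = sinusoidal_positional_encoding()
```

Each row of the table would be added to the MLP embedding of the force/torque reading at the corresponding time step before the tokens enter the transformer.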

Future Contact Predictor. We utilize a ResNet18 [resnet] vision encoder and a two-layer MLP of dimensions $(128,512)$ to process image and force/torque inputs, respectively. The outputs are concatenated and passed through a linear layer to compute the future contact probability $\phi$. The ground-truth future contact state at time $t$ is generated from the force/torque data within the time window $[t-2\text{s},t+2\text{s}]$ by checking whether the force/torque values exceed the predefined force and torque thresholds. The thresholds may differ across tasks and can be easily determined manually from the collected demonstrations. The ground-truth future contact states are then used to supervise the future contact predictor.
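
The labeling rule described above can be sketched as follows. Several details are assumptions made for illustration: the force and torque magnitudes are taken as Euclidean norms, a time step is labeled as contact if any sample in its window exceeds either threshold, and the window is clipped at the ends of the demonstration. The function name and the threshold values in the usage lines are hypothetical.

```python
import numpy as np


def future_contact_labels(wrench, force_thresh, torque_thresh,
                          rate_hz=100, half_window_s=2.0):
    """Label each time step with a binary future contact state.

    wrench: (T, 6) array of [fx, fy, fz, tx, ty, tz] sampled at rate_hz.
    A step t is positive if any sample in [t - 2s, t + 2s] exceeds the
    force or the torque threshold (assumed OR rule).
    """
    wrench = np.asarray(wrench, dtype=np.float64)
    force_mag = np.linalg.norm(wrench[:, :3], axis=1)
    torque_mag = np.linalg.norm(wrench[:, 3:], axis=1)
    exceeds = (force_mag > force_thresh) | (torque_mag > torque_thresh)
    half = int(round(half_window_s * rate_hz))
    labels = np.zeros(len(wrench), dtype=bool)
    for t in range(len(wrench)):
        start, stop = max(0, t - half), min(len(wrench), t + half + 1)
        labels[t] = exceeds[start:stop].any()
    return labels


# A single force spike at step 500 marks steps 300..700 as contact.
demo = np.zeros((1000, 6))
demo[500, 2] = 10.0
labels = future_contact_labels(demo, force_thresh=5.0, torque_thresh=1.0)
```

Because the window also looks ahead, the label turns positive before contact actually occurs, which is what lets the predictor anticipate contact.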

Action Head. A CNN-based diffusion head [diffusionpolicy] is employed with $100$ denoising iterations for training and $20$ iterations for inference using the DDIM [ddim] scheduler. FoAR predicts future actions for $T_{a}=20$ steps.
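
To illustrate how $100$ training iterations relate to $20$ inference iterations, the sketch below selects an evenly strided subsequence of training timesteps, which is one common DDIM convention. The exact spacing used by the FoAR implementation is not stated, so this particular choice is an assumption.

```python
def ddim_timesteps(num_train_steps=100, num_inference_steps=20):
    """Return the training timesteps visited at inference, newest first.

    Assumption: timesteps are taken at a constant stride starting
    from zero, then traversed from the noisiest to the cleanest.
    """
    stride = num_train_steps // num_inference_steps
    return [i * stride for i in reversed(range(num_inference_steps))]


steps = ddim_timesteps()
```

With these defaults the sampler visits every fifth training timestep, so inference needs only a fifth of the denoising passes used to define the training schedule.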

Training. FoAR is trained on 2 NVIDIA A100 GPUs with a batch size of $240$, an initial learning rate of $3\times 10^{-4}$, and $2000$ warmup steps. The learning rate is decayed by a cosine scheduler. During training, we apply the same point cloud augmentations as RISE [rise], and we also leverage color jittering for improved robustness. The weighting factor $\alpha$ between the future contact predictor loss and the action loss is set to $0.1$.
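
The learning rate schedule described above can be sketched as a function of the training step. The linear warmup shape, the decay to zero, and the value of `total_steps` are assumptions, since the text specifies only the peak learning rate, the number of warmup steps, and the use of a cosine scheduler.

```python
import math


def learning_rate(step, base_lr=3e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to base_lr, then cosine decay toward zero.

    total_steps is a hypothetical training length, not a value
    reported in the paper.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


peak = learning_rate(2000)
```

The warmup keeps early updates small while the randomly initialized force/torque encoder and fusion layers settle, after which the cosine decay gradually reduces the step size.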

Deployment. We apply reactive control during deployment, combined with simple end-effector position control (i.e., Line 20 in Alg. 1 is sent to the end-effector position controller of the robot). No advanced control strategies such as compliance control, admittance control, or hybrid force/torque control are used in this paper. The future contact probability threshold $\delta_{\phi}$ is set to $0.9$, the force threshold $\delta_{f}$ is set to $8\text{N}$, and the torque threshold $\delta_{t}$ is set to $5\text{N}\cdot\text{m}$. The motion direction is calculated based on the predicted future $T_{f}=5$ action steps, and we set the small step $\epsilon=0.006\text{m}$.
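
The small-step computation mentioned above can be sketched as follows. This covers only the geometric part: it assumes the motion direction is the unit vector from the current end-effector position to the position predicted $T_{f}$ steps ahead, which is one plausible reading of the text. The conditions under which the step is triggered (comparing $\phi$, force, and torque against $\delta_{\phi}$, $\delta_{f}$, and $\delta_{t}$) follow Alg. 1 and are deliberately left out here. The function name is hypothetical.

```python
import numpy as np


def small_step_target(current_pos, predicted_positions, t_f=5, eps=0.006):
    """Return a position target one small step along the motion direction.

    predicted_positions: (T_a, 3) predicted end-effector positions.
    Assumption: the direction points from the current position to the
    t_f-th predicted position.
    """
    current_pos = np.asarray(current_pos, dtype=np.float64)
    horizon = np.asarray(predicted_positions, dtype=np.float64)[:t_f]
    displacement = horizon[-1] - current_pos
    distance = np.linalg.norm(displacement)
    if distance < 1e-9:
        # No meaningful motion predicted: hold the current position.
        return current_pos.copy()
    return current_pos + eps * displacement / distance


plan = [[0.01 * (i + 1), 0.0, 0.0] for i in range(20)]
target = small_step_target([0.0, 0.0, 0.0], plan)
```

The resulting target is an ordinary Cartesian position, so it can be passed to the same end-effector position controller that executes the regular policy actions.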

V-B Additional Ablation

We conduct an additional ablation experiment by replacing the transformer in the force/torque encoder with a simple MLP, as in the future contact predictor.

Method	Score $\uparrow$	Grasp ASR (%) $\uparrow$	Peel ASR (%) $\uparrow$
RISE [rise]	0.293	100	50
FoAR (MLP)	0.426	100	75
FoAR	0.588	100	100

TABLE V: Ablation Results of the Peeling Task on Different Force/Torque Encoders.

The results shown in Tab. V illustrate that the simple MLP encoder cannot effectively capture the temporal features of the force/torque information, resulting in inferior performance compared to the transformer-based encoder. However, it still outperforms RISE by incorporating force/torque information for contact-rich manipulations.

Method	Wiping			Wiping (General)			Peeling		
	Score $\uparrow$	Grasp ASR (%) $\uparrow$	Wipe ASR (%) $\uparrow$	Score $\uparrow$	Grasp ASR (%) $\uparrow$	Wipe ASR (%) $\uparrow$	Score $\uparrow$	Grasp ASR (%) $\uparrow$	Peel ASR (%) $\uparrow$
RISE [rise]	0.500	100	75	0.500	90	80	0.377	100	50
RISE (force-token)	0.575	85	80	0.600	90	80	0.487	95	75
RISE (force-concat)	0.475	100	65	-	-	-	0.524	100	80
FoAR (3D-cls)	0.175	40	35	-	-	-	0.270	95	40
FoAR (ours)	0.875	100	100	0.850	100	100	0.756	100	100
