
License: arXiv.org perpetual non-exclusive license
arXiv:2502.20390v1 [cs.CV] 27 Feb 2025

InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

Sirui Xu1  Hung Yu Ling2  Yu-Xiong Wang1†  Liang-Yan Gui1†
1 University of Illinois Urbana-Champaign  2 Electronic Arts
†Equal Advising
https://sirui-xu.github.io/InterMimic
Abstract

Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy – perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.

Figure 1: InterMimic enables physically simulated humans to perform interactions with dynamic and diverse objects. It supports highly-dynamic, multi-object interactions and scalable skill learning (Top), making it adaptable for versatile downstream applications (Bottom): it can translate whole-body loco-manipulation skills to a humanoid robot [130, 47], perfect interaction MoCap data, and bridge kinematic generation, e.g., predicting future interactions from past (InterDiff [160]) or generating interactions given text prompts (InterDreamer [162]).

1 Introduction

Animating human-object interactions is a challenging and time-consuming task even for skilled animators. It requires a deep understanding of physics and meticulous attention to detail to create natural and convincing interactions. While Motion Capture (MoCap) data provides references, animators often need to correct contact errors caused by sensor limitations and occlusions between humans and objects. However, this process remains unscalable, as refining a single motion demands a delicate balance between preserving the captured data and ensuring its physical plausibility.

Physics-based human motion imitation [67, 104] offers an alternative approach to improving motion fidelity, by training control policies to mimic reference MoCap data within a physics simulator. However, scaling up human-object interaction imitation presents significant challenges: (i) MoCap Imperfection: Contact artifacts are common, causing expected contacts to fluctuate instead of maintaining consistent zero distance, often due to MoCap limitations or missing hand capture [5, 70]. Accurately imitating MoCap kinematics can result in unrealistic dynamics in simulation. Moreover, HOI datasets often include diverse human shapes, requiring motion retargeting to adapt movements across different human models while preserving interaction dynamics. This retargeting process is imperfect and can introduce new contact artifacts or exacerbate existing ones. (ii) Scaling-up: Although large-scale motion imitation has been explored in previous works [173, 89, 125, 147], it remains largely underexplored for whole-body interactions involving dynamic and diverse objects.

In this paper, we aim to utilize rich yet imperfect motion capture interaction datasets to train a control policy capable of learning diverse motor skills while enhancing the plausibility of these actions by correcting errors, such as inaccurate hand motions and faulty contacts. Our approach is grounded on the key insight of tackling the challenges of skill perfection and skill integration progressively. We implement a curriculum-based teacher-student distillation framework, where multiple teacher policies focus on imitating and refining small subsets of interactions, and a student policy integrates these skills from the teachers.

Instead of relying on curated data that covers a limited range of actions [91, 6], we employ multiple teacher policies trained on a diverse set of imperfect interaction data and address two key challenges: retargeting and recovering. First, we unify all training policies to a canonical human model by embedding HOI retargeting directly into the imitation. This is achieved by reframing the policy learning to optimize both imitation and retargeting objectives. Second, our teacher policies refine interaction motion while learning from it, as accurate contact dynamics enforced by a physics simulator inherently correct inaccuracies in the reference kinematics. To support this, we introduce a tailored contact-guided reward and optimize trajectory collection, enabling effective skill imitation despite MoCap errors.

Introducing teacher policies offers several key benefits. By leveraging teacher rollouts, we effectively distill raw MoCap data into refined HOI references with a unified embodiment and enhanced physical fidelity. These refined references guide the subsequent student policy training, reducing the negative impact of errors in the original MoCap data. A major hurdle in scaling motion imitation is the sample inefficiency of Reinforcement Learning (RL), which can lead to prohibitively long training times. Our teacher-student approach mitigates this through a space-time trade-off: multiple teacher policies are trained in parallel on smaller, more manageable data subsets, and their expertise is then distilled into a single student policy. We begin with demonstration-based distillation to bootstrap PPO [114] updates, reducing reliance on pure trial and error and enabling more effective scaling. As training progresses, the student gradually shifts from heavy demonstration guidance to increased RL updates, ultimately surpassing simple demonstration memorization. This mirrors alignment strategies in Large Language Models (LLMs), where demonstration-based pretraining is refined through RL fine-tuning [98, 129].
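
Below is a minimal sketch, in Python, of the annealing idea described above: early in training the student is supervised mainly by behavior cloning against the frozen teachers' actions, and the PPO surrogate gradually takes over. The squared-error BC term, the linear schedule, and the blending form are our assumptions, not the paper's exact objective.

```python
# A minimal sketch, not the authors' implementation: blend a behavior-cloning term
# against online teacher actions with a PPO surrogate, with a demonstration weight
# (assumed linear schedule) that decays as training progresses.
import torch

def student_objective(ppo_surrogate: torch.Tensor,   # PPO clipped surrogate (to maximize)
                      student_action: torch.Tensor,  # actions produced by the student
                      teacher_action: torch.Tensor,  # actions queried from the frozen teacher
                      progress: float) -> torch.Tensor:
    """Return a loss to minimize; `progress` in [0, 1] is the fraction of training done."""
    bc_loss = ((student_action - teacher_action) ** 2).mean()  # imitate teacher rollouts
    bc_weight = max(0.0, 1.0 - progress)                       # heavy supervision early on
    return bc_weight * bc_loss - (1.0 - bc_weight) * ppo_surrogate
```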

To summarize, our contributions are as follows: (i) We introduce InterMimic, which, to the best of our knowledge, is the first framework designed to train physically simulated humans to develop a wide range of whole-body motor skills for interacting with diverse and dynamic objects, extending beyond traditional grasping tasks. (ii) We develop a teacher-student training strategy, where teacher policies provide a unified solution to address the challenges of retargeting and refining in HOI imitation. The student distillation introduces a scalable solution by leveraging a space-time trade-off. (iii) We demonstrate that our unified framework, InterMimic, as illustrated in Figure 1, effectively handles versatile physics-based interaction animation, recovering motions with realistic and physically plausible details. Notably, by combining kinematic generators with InterMimic, we enable a physics-based agent to achieve tasks such as interaction prediction and text-to-interaction generation.

2 Related Work

Significant progress has been made in human interaction animation and control, with advancements in areas such as human-human interactions [161, 31, 81, 144, 158, 84, 44, 55, 75], hand-object interactions [121, 145, 169, 178, 170, 195, 197, 7, 183, 94, 128, 14, 101, 128, 1, 168, 139, 159], single-frame interactions [153, 184, 141, 107, 43, 60, 163, 164, 20, 73, 166, 56, 167, 165, 181, 190, 192, 149], human interactions with static scenes [36, 136, 137, 9, 99, 152, 125, 65, 140, 69, 188, 85, 57, 16, 185, 8, 95, 193], and real-world humanoid control for object manipulation [33, 115, 38, 27, 28, 11, 175, 12, 135, 52, 86, 177, 39, 15, 4, 25, 72, 54, 48, 23, 76]. In this section, we focus on recent studies on whole-body interaction animation, particularly involving dynamic objects.

2.1 Kinematic Interaction Animation

Generating human interactions has been a long-standing topic in animation and computer graphics [66, 32]. Significant advances in character animation [10, 196, 21, 155, 68, 77, 194, 3] have emerged with the advent of deep learning, including phase-function-based methods [41] that enable object interactions like carrying a box [119] or playing basketball [120]. This has been extended to approaching more diverse but static objects [186, 150, 123, 64]. Subsequent efforts [30, 109, 71, 151, 74, 51, 50, 87] integrate object motion into interactions but remain constrained by the assumption that interactions occur primarily through the hands. To address this, recent developments [17, 134, 160, 103, 24, 148, 22, 118, 58, 162, 40] introduce interactions as whole-body loco-manipulation that engages multiple body parts in contact. However, these methods often suffer from physical inaccuracies, such as floating contacts and penetrations, and they generate only body motion without considering hand motion [88] or dexterity. In this work, we address physical inaccuracies by refining imperfect kinematic generation through physics simulation, with InterDiff [160] and HOI-Diff [103] serving as motion planners for loco-manipulation that bridge high-level decision-making (e.g., text instructions) with low-level execution.

2.2 Physics-based Interaction Animation

Physics-based methods generate motion through motor control policies within a physics simulator, e.g., achieved via deep reinforcement learning to track reference motions [104]. These policies are directly applicable for executing simple interactions, such as punching or striking an object [106, 124, 18, 127]. To achieve more complex interactions, early studies focus on specific scenarios, including notable sports-related [92] examples such as basketball [142, 79, 143], skating [78], soccer [156, 42, 80, 35, 18], tennis [179], table tennis [138], as well as tasks proposed in [2]. Research also demonstrates flexibility in more general but simpler box carrying tasks [189]. These advancements are achieved through the integration of multiple control policies [97], the use of adversarial motion priors [37, 105, 29], and imitating diverse kinematic generations [157, 151]. However, these methods train their policies in a non-scalable manner, with each policy handling only specific object types or actions. In pursuit of a single, scalable policy to enable multiple interaction skills, existing methods either rely on fixed interaction patterns, such as approaching and grasping objects by following predefined trajectories [6], or remain confined to single-hand grasping actions [91]. Additionally, they depend on highly curated data from the GRAB dataset [122], which, despite its high quality, primarily features low-dynamic full-body motion and only small-sized objects. More recently proposed datasets [5, 49, 46, 180, 26, 70, 191, 59, 93, 176, 154, 182, 187, 82, 83] offer richer full-body interactions with objects in diverse shapes but contain significant artifacts that challenge existing motion imitation approaches. We process data from OMOMO [70], BEHAVE [5], HODome [180], IMHD [191], and HIMO [93] into the simulator, demonstrating InterMimic’s scalability for diverse interactions and robustness to MoCap artifacts.

Figure 2: Our two-stage pipeline: (i) training each teacher policy (MLP) on a small data subset with initialization corrected via Physical State Initialization (PSI), and (ii) freezing the teacher policies to provide refined references for training a student policy (Transformer). The student leverages teacher supervision for effective scaling and is fine-tuned through RL.

3 Methodology

Task Formulation. The goal of human-object interaction (HOI) imitation is to learn a policy $\pi$ that produces simulated human-object motion $\{\boldsymbol{q}_t\}_{t=1}^{T}$ closely matching a ground-truth reference $\{\hat{\boldsymbol{q}}_t\}_{t=1}^{T}$ derived from large-scale MoCap data. Given the geometries of the human and objects, the policy should also compensate for missing or inaccurate details in the dataset. Each pose $\boldsymbol{q}_t$ has two components: the human pose $\boldsymbol{q}^h_t$ and the object pose $\boldsymbol{q}^o_t$. The human pose is defined as $\boldsymbol{q}^h_t = \{\boldsymbol{\theta}^h_t, \boldsymbol{p}^h_t\}$, where $\boldsymbol{\theta}^h_t \in \mathbb{R}^{51\times 3}$ represents the joint rotations and $\boldsymbol{p}^h_t \in \mathbb{R}^{51\times 3}$ specifies the joint positions. Specifically, our human model includes 30 hand joints and 21 joints for the rest of the body. The object pose $\boldsymbol{q}^o_t$ is represented as $\{\boldsymbol{\theta}^o_t, \boldsymbol{p}^o_t\}$, where $\boldsymbol{\theta}^o_t \in \mathbb{R}^{3}$ denotes the object's orientation and $\boldsymbol{p}^o_t \in \mathbb{R}^{3}$ the position. All simulation states have corresponding ground-truth values, denoted by the hat symbol. For instance, the reference object rotation is $\{\hat{\boldsymbol{\theta}}^o_t\}_{t=1}^{T}$. The environmental setup for the simulation is detailed in Sec. B of the supplementary.
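
As a concrete reading of the notation above, the following sketch spells out the pose containers; the array shapes follow the paper's definitions, while the class and field names are illustrative, not the authors' code.

```python
# A minimal sketch of the pose representation defined above (illustrative names).
from dataclasses import dataclass
import numpy as np

@dataclass
class HumanPose:
    theta: np.ndarray  # (51, 3) joint rotations theta^h_t
    p: np.ndarray      # (51, 3) joint positions p^h_t

@dataclass
class ObjectPose:
    theta: np.ndarray  # (3,) object orientation theta^o_t
    p: np.ndarray      # (3,) object position p^o_t

@dataclass
class HOIPose:
    human: HumanPose   # q^h_t
    obj: ObjectPose    # q^o_t
```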

Overview. We formulate interaction imitation as a Markov Decision Process (MDP), defined by states, actions, simulator-provided transition dynamics, and a reward function. Figure 2 illustrates our two-stage framework: (i) training teacher policies $\pi^{(T)}$ on small skill subsets, and (ii) distilling these teachers into a scalable student policy $\pi^{(S)}$ for large-scale skill learning. In Sec. 3.1, we define the states $\boldsymbol{s}_t$ and actions $\boldsymbol{a}_t$, applicable to both teacher $\pi^{(T)}$ and student $\pi^{(S)}$ policies. In Sec. 3.2, we describe how teacher policies are trained via RL, focusing on reward designs that facilitate retargeting, as well as techniques that mitigate the impact of imperfections in the reference data. Sec. 3.3 details the subsequent distillation of teachers into a scalable student policy, leveraging both RL and learning from demonstration.

3.1 Policy Representation

State. The state $\boldsymbol{s}_t$, which serves as input to the policy, comprises two components $\boldsymbol{s}_t = \{\boldsymbol{s}_t^s, \boldsymbol{s}_t^g\}$. The first part, $\boldsymbol{s}_t^s$, contains human proprioception and object observations, expressed as $\{\{\boldsymbol{\theta}_t^h, \boldsymbol{p}_t^h, \boldsymbol{\omega}_t^h, \boldsymbol{v}_t^h\}, \{\boldsymbol{\theta}_t^o, \boldsymbol{p}_t^o, \boldsymbol{\omega}_t^o, \boldsymbol{v}_t^o\}, \{\boldsymbol{d}_t, \boldsymbol{c}_t\}\}$, where $\{\boldsymbol{\theta}_t^h, \boldsymbol{p}_t^h, \boldsymbol{\omega}_t^h, \boldsymbol{v}_t^h\}$ represent the rotation, position, angular velocity, and velocity of all joints, respectively, while $\{\boldsymbol{\theta}_t^o, \boldsymbol{p}_t^o, \boldsymbol{\omega}_t^o, \boldsymbol{v}_t^o\}$ represent the orientation, location, angular velocity, and velocity of the object, respectively.
Motivated by [13], we include object geometry and whole-body haptic sensing from two elements: (i) $\boldsymbol{d}_t$, vectors from human joints to their nearest points on each object surface; and (ii) $\boldsymbol{c}_t$, contact markers indicating whether the human's rigid body parts experience applied forces; this serves as simplified tactile or force sensing – an important multi-modal input in robot manipulation tasks [19, 171, 116, 45]. The goal state $\boldsymbol{s}_t^g = \{\boldsymbol{s}_{t,t+k}^g\}_{k\in K}$ integrates reference poses from the ground-truth motion, where $\boldsymbol{s}_{t,t+k}^g$ is defined as,

$$
\begin{aligned}
\{ &\{\hat{\boldsymbol{\theta}}_{t+k}^{h}\ominus\boldsymbol{\theta}_{t}^{h},\;\hat{\boldsymbol{p}}_{t+k}^{h}-\boldsymbol{p}_{t}^{h}\},\;\{\hat{\boldsymbol{\theta}}_{t+k}^{o}\ominus\boldsymbol{\theta}_{t}^{o},\;\hat{\boldsymbol{p}}_{t+k}^{o}-\boldsymbol{p}_{t}^{o}\},\\
&\{\hat{\boldsymbol{d}}_{t+k}-\boldsymbol{d}_{t},\;\hat{\boldsymbol{c}}_{t+k}-\boldsymbol{c}_{t}\},\;\{\hat{\boldsymbol{\theta}}_{t+k}^{h},\;\hat{\boldsymbol{p}}_{t+k}^{h},\;\hat{\boldsymbol{\theta}}_{t+k}^{o},\;\hat{\boldsymbol{p}}_{t+k}^{o}\}\},
\end{aligned}
\tag{1}
$$

where $\hat{\boldsymbol{\theta}}_{t+k}^{h}, \hat{\boldsymbol{p}}_{t+k}^{h}, \hat{\boldsymbol{d}}_{t+k}, \hat{\boldsymbol{c}}_{t+k}$ represent the reference information at time step $t+k$, and $\ominus$ denotes the rotation difference. All continuous elements of $\boldsymbol{s}_t$ are normalized relative to the current direction of view of the human and the position of the root [104].
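
To make Eq. (1) concrete, here is a minimal sketch (our own shapes and names, not the authors' code) of assembling the goal state $\boldsymbol{s}_{t,t+k}^g$: relative differences between the reference at step $t+k$ and the current simulated state, followed by the absolute reference pose. The `rot_diff` helper stands in for the rotation-difference operator $\ominus$; a full implementation would compose rotations rather than subtract their parameterizations.

```python
# A minimal sketch of building s^g_{t,t+k} from Eq. (1); shapes and dict keys are assumed.
import numpy as np

def rot_diff(ref_rot: np.ndarray, sim_rot: np.ndarray) -> np.ndarray:
    return ref_rot - sim_rot  # placeholder for the true rotation difference

def goal_state(ref: dict, sim: dict, k: int) -> np.ndarray:
    """ref[...] hold reference trajectories indexed by future offset; sim[...] hold current values."""
    parts = [
        rot_diff(ref["theta_h"][k], sim["theta_h"]),   # human joint-rotation offsets
        ref["p_h"][k] - sim["p_h"],                    # human joint-position offsets
        rot_diff(ref["theta_o"][k], sim["theta_o"]),   # object rotation offset
        ref["p_o"][k] - sim["p_o"],                    # object position offset
        ref["d"][k] - sim["d"],                        # interaction-vector offsets
        ref["c"][k] - sim["c"],                        # contact-marker offsets
        ref["theta_h"][k], ref["p_h"][k],              # absolute reference human pose
        ref["theta_o"][k], ref["p_o"][k],              # absolute reference object pose
    ]
    return np.concatenate([np.ravel(x) for x in parts])
```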

We extract reference contact markers $\hat{\boldsymbol{c}}_{t+k}$ by inferring dynamic information in addition to the inaccurate contact distances, specifically by analyzing the object's acceleration to detect human-induced forces. To accommodate the variability in contact distances observed in the reference motion, we discretize reference contact markers using varying distance thresholds, as illustrated in Fig. 3(i). The neutral areas serve as buffer zones, avoiding the penalization or enforcement of strict contact. See Sec. C of the supplementary for details.
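
The following sketch illustrates the three-level marker idea: +1 promotes contact, 0 marks the neutral buffer zone, and -1 penalizes contact. The thresholds and the acceleration-based force heuristic are illustrative assumptions; the paper's exact rules are in its supplementary (Sec. C).

```python
# A minimal sketch of three-level reference contact markers (thresholds are assumptions).
import numpy as np

def reference_contact_markers(dist_to_object: np.ndarray,  # per-body-part distance to the object
                              object_accel: np.ndarray,     # (3,) object acceleration
                              d_contact: float = 0.02,
                              d_neutral: float = 0.10,
                              accel_eps: float = 0.5) -> np.ndarray:
    markers = np.where(dist_to_object < d_contact, 1.0,
               np.where(dist_to_object < d_neutral, 0.0, -1.0))
    # Crude proxy for a human-induced force: the object accelerates noticeably, so nearby
    # body parts are promoted to contact even if the captured distance is slightly off.
    if np.linalg.norm(object_accel) > accel_eps:
        markers = np.where(dist_to_object < d_neutral, 1.0, markers)
    return markers
```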

Action. Our human model has 51 actuated joints, defining an action space of $\boldsymbol{a}_t \in \mathbb{R}^{51\times 3}$. These actions are specified as joint PD targets using the exponential map and are converted into torques applied to each of the human joints.
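
A minimal PD-control sketch of this conversion is shown below; the gains are illustrative, and in practice the simulator's internal PD controller applies the resulting torques.

```python
# Per-DoF PD control from exponential-map joint targets (illustrative gains).
import numpy as np

def pd_torques(pd_target: np.ndarray,   # (51, 3) action a_t: exponential-map joint targets
               joint_rot: np.ndarray,   # (51, 3) current joint rotations (exponential map)
               joint_vel: np.ndarray,   # (51, 3) current joint angular velocities
               kp: float = 500.0,
               kd: float = 50.0) -> np.ndarray:
    # tau = kp * (target - q) - kd * qdot, computed per degree of freedom
    return kp * (pd_target - joint_rot) - kd * joint_vel
```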

3.2 Imitation as Perfecting

The teacher policy $\pi^{(\text{T})}$ is trained via RL to maximize the expected discounted reward by comparing simulated states against potentially erroneous reference states. The training involves: (i) trajectory collection, where we explain how trajectories are initialized and terminated; and (ii) policy updating, where collected trajectories and their associated rewards are used to refine the policy. In this section, we elaborate on our reward design and how we optimize trajectory collection to mitigate the impact of reference inaccuracies.

Imitation as Retargeting. We tailor teacher policies to each human subject, while all policies share the same base human model. This serves the retargeting purpose by converting HOIs from different human shapes into a unified base shape. Although motion imitation does not necessarily require a unified human model [146, 89], our approach offers two benefits: (i) It enhances integration with kinematic generation methods, which generally perform better on a single, unified shape [34, 126]. (ii) It demonstrates possible integration with real-world humanoid deployment, which requires retargeting to a consistent physical embodiment. In Figure 1, our method translates MoCap data into motor skills on a Unitree G1 [130] with two Inspire hands [47], all without external retargeting in complex contact-rich scenarios. See Sec. F of the supplementary for additional details.

Human [132] or HOI [61] retargeting can be formulated as an optimization problem. Inverse Kinematics (IK) methods, such as those based on quadratic programming [63], demonstrate effectiveness in simplified scenarios but remain underexplored for motions featuring intricate object interactions. RL, by contrast, solves the optimization by maximizing an expected cumulative reward, prompting us to investigate whether RL-driven HOI imitation can be used for HOI retargeting. This extends existing physics-based retargeting approaches, which either omit object interactions [110] or are non-scalable with a single reference [189].

While the kinematics should differ due to the embodiment gap, we argue that the dynamics between human and object should remain invariant. Thus, we define rewards to include an embodiment-aware component that loosely aligns the simulated kinematics with the reference interaction, and an embodiment-agnostic reward component that encourages dynamics to be close to the reference.

Embodiment-Aware Reward. When the human and object are far apart, retargeting should prioritize capturing rotational motion, whereas when they are close, accurate position tracking becomes crucial for achieving contact. To reflect this, we define weights $\boldsymbol{w}_d$ that are inversely proportional to the distances between joints and the object [189]. The reward thus includes cost functions for joint position $E_p^h = \langle \boldsymbol{\Delta}^h_p, \boldsymbol{w}_d \rangle$, rotation $E_\theta^h = \langle \boldsymbol{\Delta}^h_\theta, \boldsymbol{1} - \boldsymbol{w}_d \rangle$, and interaction tracking $E_d = \langle \boldsymbol{\Delta}_d, \boldsymbol{w}_d \rangle$, where $\langle \cdot, \cdot \rangle$ is the inner product, and $\boldsymbol{\Delta}^h_p[i] = \|\hat{\boldsymbol{p}}^h[i] - \boldsymbol{p}^h[i]\|$, $\boldsymbol{\Delta}^h_\theta[i] = \|\hat{\boldsymbol{\theta}}^h[i] \ominus \boldsymbol{\theta}^h[i]\|$, and $\boldsymbol{\Delta}_d[i] = \|\hat{\boldsymbol{d}}[i] - \boldsymbol{d}[i]\|$ represent the displacements for the variables defined in Sec. 3.1, with the timestep $t$ omitted. The formulation of $\boldsymbol{w}_d$ is provided in Sec. D of the supplementary. The reward to be maximized is formulated as $\exp(-\lambda E)$ for each cost function $E$, with a specific hyperparameter $\lambda$. Details can be found in Sec. D.
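
A minimal sketch of these terms follows. The normalized inverse-distance weighting and the multiplicative combination of the $\exp(-\lambda E)$ factors are assumptions; the exact form of $\boldsymbol{w}_d$ and the $\lambda$ values are given in the paper's supplementary (Sec. D).

```python
# A minimal sketch of the embodiment-aware costs and their exp(-lambda * E) rewards.
import numpy as np

def embodiment_aware_reward(p_sim, p_ref,            # (J, 3) simulated / reference joint positions
                            theta_err,               # (J,) per-joint rotation differences
                            d_sim, d_ref,            # (J, 3) joint-to-object vectors
                            joint_obj_dist,          # (J,) joint-to-object distances
                            lam_p=10.0, lam_th=1.0, lam_d=20.0):
    w_d = 1.0 / (joint_obj_dist + 1e-3)              # inversely proportional to distance
    w_d = w_d / w_d.sum()                            # normalize to a weight distribution

    E_p = np.dot(np.linalg.norm(p_ref - p_sim, axis=-1), w_d)       # position tracking
    E_th = np.dot(theta_err, 1.0 - w_d)                             # rotation tracking
    E_d = np.dot(np.linalg.norm(d_ref - d_sim, axis=-1), w_d)       # interaction tracking

    return np.exp(-lam_p * E_p) * np.exp(-lam_th * E_th) * np.exp(-lam_d * E_d)
```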

Embodiment-Agnostic Reward. The reward includes components for object tracking and contact tracking. The object tracking cost is defined for position $E^o_p = \|\hat{\boldsymbol{p}}^o - \boldsymbol{p}^o\|$ and rotation $E^o_\theta = \|\hat{\boldsymbol{\theta}}^o - \boldsymbol{\theta}^o\|$, with all values normalized to the human's current position and direction.

Figure 3: (i) Visualization of reference contact markers that accommodate varied contact distances: red to promote contact, green for neutral areas where contact is neither promoted nor penalized, and blue to penalize contact. (ii) Initializing the rollout with reference (RSI) or reference corrected via simulation (PSI).

The contact tracking reward comprises two cost functions: body contact promotion $E^c_b$ and penalty $E^c_p$, both aligning the simulated contact $\boldsymbol{c}$ with the reference markers $\hat{\boldsymbol{c}}$, as shown in Figure 3. We define three contact levels – promotion, penalty, and neutral – to accommodate potential inaccuracies in reference contact distances. The detailed formulation can be found in Sec. D of the supplementary. Since the physics engine does not differentiate between object, ground, and self-contact, we adopt two strategies: (i) we model foot-ground contact promotion and penalty, which ensures proper foot lifting during cyclic walking and mitigates foot hobbling; (ii) we allow self-collision, avoiding the promotion of self-contact while still promoting object interaction. This poses minimal risk because the policy is guided by the MoCap reference, which, although lacking perfect contact accuracy, rarely shows self-penetration. For humanoid robots whose embodiments differ from the MoCap reference and require real-world applicability, we disable self-collision, as discussed in Sec. F.
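
A minimal sketch of the promotion and penalty costs against the three-level markers is shown below; the summed-indicator form is an assumption, and the exact formulation is in the paper's supplementary (Sec. D).

```python
# A minimal sketch of contact promotion / penalty costs against markers in {-1, 0, +1}.
import numpy as np

def contact_costs(sim_contact: np.ndarray,  # 1.0 where a rigid body currently has contact force
                  ref_marker: np.ndarray):  # reference markers: +1 promote, 0 neutral, -1 penalize
    promote = ref_marker > 0.5
    penalize = ref_marker < -0.5
    E_b = np.sum(promote * (1.0 - sim_contact))   # expected contact that is missing
    E_p = np.sum(penalize * sim_contact)          # contact occurring where it is penalized
    return E_b, E_p
```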

We introduce energy consumption rewards [172] to penalize large human or object jitters, with a proposed contact energy term that penalizes abrupt contact to promote compliant interactions. See Sec. D of the supplementary for more details.
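
As a rough illustration of these penalties, the sketch below uses actuation power for the human term, object acceleration for jitter, and changes in contact force for the contact-energy term; all three forms are assumptions based on the description above, not the paper's exact definitions.

```python
# A minimal sketch of energy-style penalties (illustrative forms).
import numpy as np

def energy_costs(joint_torque, joint_vel, object_accel, contact_force, prev_contact_force):
    E_human = np.sum(np.abs(joint_torque * joint_vel))               # human actuation power
    E_object = np.sum(object_accel ** 2)                             # object jitter
    E_contact = np.sum((contact_force - prev_contact_force) ** 2)    # abrupt contact changes
    return E_human, E_object, E_contact
```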

Hand Interaction Discovery. We use data with average or flattened hand poses [5, 70], which makes accurate object manipulation imitation challenging. To address this, we activate a reference contact marker for any hand part when a fingertip or palm is near an object. Given tasks that do not demand high dexterity, employing a contact-promoting reward with this marker enables policies to develop effective hand interaction strategies, leveraging the exploratory nature of RL. Additionally, we constrain the range of motion (RoM) of the hands to ensure natural movement. See Sec. D and Sec. B of the supplementary for further details.

Policy Learning. Following [104], the control policy $\pi$ is trained using PPO [114] with the policy gradient objective $L(\psi)=\mathbb{E}_{t}\left[\min\left(r_{t}(\psi)A_{t},\,\operatorname{clip}(r_{t}(\psi),1-\epsilon,1+\epsilon)A_{t}\right)\right]$, where $\psi$ denotes the parameters of $\pi$ and $r_{t}(\psi)$ is the ratio of action likelihoods between the updated and old policies. $\epsilon$ is a small constant, and $A_{t}$ is the advantage estimate given by the generalized advantage estimator GAE($\lambda$) [113].
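For reference, a minimal PyTorch sketch of this clipped surrogate, negated so a standard optimizer can minimize it; the variable names are illustrative:

```python
import torch

def ppo_surrogate_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped PPO objective L(psi), returned as a loss to minimize.

    log_prob_new: log pi_psi(a_t | s_t) under the current policy parameters.
    log_prob_old: log-probabilities recorded when the actions were sampled.
    advantages:  GAE(lambda) advantage estimates A_t.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)               # r_t(psi)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                 # maximize L == minimize -L
```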

Physical State Initialization. Learning the later phases of a motion first can be essential for a policy to achieve high rewards in earlier phases, compared to incrementally learning from the start. Thus, Reference State Initialization (RSI) [104] sets the current pose $\boldsymbol{q}_{t}$ to a reference pose $\hat{\boldsymbol{q}}_{t}$ at a random timestep $t$ to initialize the rollout. However, initializing with the imperfect reference can introduce critical artifacts, such as floating contacts or incorrect hand motion, leading to unrecoverable failures, e.g., the object falling, as depicted in Figure 3(ii). These issues render many initializations ineffective and limit training on certain interaction phases, since successful rollouts may not reach them before the maximum episode length. The problem is exacerbated by prioritized sampling [90, 156, 146, 125], which favors high-failure-rate initializations.

To address the need for higher-quality reference initialization, we propose Physical State Initialization (PSI). As illustrated in Figure 2, PSI begins by creating an initialization buffer that stores reference states from MoCap and simulation states from prior rollouts. For each new rollout, an initial state is randomly selected from this buffer, which increases the likelihood of starting from advantageous positions. Once a rollout is completed, trajectories are evaluated based on their expected discounted rewards; those above a certain threshold are added to the buffer using a first-in-first-out (FIFO) strategy, while older or lower-quality trajectories are discarded. This selective reintroduction of high-value states for initialization helps maintain stable policy updates. We apply PSI in a sparse manner to ensure training efficiency. As shown in Figure 3(ii), PSI can collect trajectories for policy update that RSI does not effectively utilize. Further details are provided in Sec. E of the supplementary.
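A simplified sketch of the PSI buffer under these assumptions (a fixed capacity, a return threshold, and uniform sampling over the mixed MoCap/simulation states; the actual thresholds and the sparse application schedule are in Sec. E of the supplementary):

```python
import random
from collections import deque

class PSIBuffer:
    """FIFO buffer mixing MoCap reference states with high-value simulated states."""

    def __init__(self, mocap_states, capacity=4096, return_threshold=0.8):
        # Seed the buffer with reference states; maxlen gives first-in-first-out eviction.
        self.buffer = deque(mocap_states, maxlen=capacity)
        self.return_threshold = return_threshold   # assumed quality cutoff

    def sample_init_state(self):
        """Pick a start state for a new rollout uniformly from the buffer."""
        return random.choice(list(self.buffer))

    def add_rollout(self, states, discounted_return):
        """After a rollout, keep its states only if the trajectory scored well.

        Older or lower-quality entries fall out automatically via the FIFO maxlen.
        """
        if discounted_return >= self.return_threshold:
            self.buffer.extend(states)
```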

Interaction Early Termination. Early Termination (ET) [104] is commonly used in motion imitation, ending an episode when a body part makes unplanned ground contact or when the character deviates significantly from the reference [89], thus stopping the policy from overvaluing invalid transitions. However, additional conditions should be considered for human-object interactions. We propose Interaction Early Termination (IET), which supplements ET with three extra checks: (i) Object points deviate from their references by more than 0.5 m on average. (ii) Weighted average distances between the character’s joints and the object surface exceed 0.5 m from the reference. (iii) Any required body-object contact is lost for over 10 consecutive frames. Full conditions are detailed in Sec. E of the supplementary.
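The three extra checks can be sketched as follows; the tensor shapes and the frame-counter handling are assumptions, while the 0.5 m and 10-frame thresholds are the ones quoted above.

```python
import torch

def interaction_early_termination(obj_pts, ref_obj_pts,
                                  joint_obj_dist, ref_joint_obj_dist, joint_weights,
                                  contact_lost_frames, max_lost_frames=10):
    """Return True if any IET condition fires (on top of standard ET).

    obj_pts, ref_obj_pts: (P, 3) simulated and reference object points.
    joint_obj_dist, ref_joint_obj_dist: (J,) distances from joints to the object surface.
    joint_weights: (J,) weights for the weighted average.
    contact_lost_frames: consecutive frames a required body-object contact has been missing.
    """
    # (i) Object points drift from their references by more than 0.5 m on average.
    obj_err = (obj_pts - ref_obj_pts).norm(dim=-1).mean()
    # (ii) Weighted joint-to-object distances deviate from the reference by more than 0.5 m.
    joint_err = (joint_weights * (joint_obj_dist - ref_joint_obj_dist).abs()).sum() / joint_weights.sum()
    # (iii) A required contact has been lost for too many consecutive frames.
    contact_lost = contact_lost_frames > max_lost_frames
    return bool(obj_err > 0.5 or joint_err > 0.5 or contact_lost)
```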

3.3 Imitation with Distillation

As shown in Figure 2, after training the teacher policies on data from each subject (Sec. 3.2), we aggregate them to train a student policy $\pi^{(\text{S})}$ that masters all skills. As outlined in Algorithm 1, the combined teacher policies, denoted $\pi^{(\text{T})}$ for brevity, serve a dual role by providing state-action trajectories $(\boldsymbol{s}^{(\text{T})},\boldsymbol{a}^{(\text{T})})$: (i) the state $\boldsymbol{s}^{(\text{T})}$ for reference distillation, and (ii) the action $\boldsymbol{a}^{(\text{T})}$ for policy distillation.

Reference Distillation. Noisy MoCap data can hinder policy learning, especially at larger scales. In contrast, teacher policies trained on smaller-scale data effectively address these issues by correcting contact artifacts, refining hand placements, and recovering missing details (see Figures 1 and 5). To fully leverage teacher policies, we use their rollouts as references for defining the student policy’s goal state and reward functions, distinguishing our approach from distillation based on only action output.

Policy Distillation. We also apply distillation on action outputs, which we view as crucial for scaling policies to large datasets. In essence, we trade space for time: teacher policies are trained in parallel on smaller data subsets, allowing the student policy to scale through distillation. Following Algorithm 1, we begin with Behavior Cloning (BC) [53, 133] and then use RL fine-tuning to go beyond demonstration memorization, an approach common in LLM alignment [98, 129]. Inspired by [108, 117], we integrate BC into online policy updates and adopt a staged schedule: we start with DAgger [112] and gradually transition to PPO. Throughout, the critic is continuously trained with the reward from Sec. 3.2. This RL fine-tuning phase is crucial because teacher policies may behave differently when performing similar skills, and simple BC can lead to suboptimal “averaging” behavior, where RL fine-tuning helps the student policy converge on optimal solutions.
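A minimal sketch of the blended update in Algorithm 1 (lines 15-18), assuming the student policy returns a Gaussian torch distribution, reusing the ppo_surrogate_loss sketch above, and using an L2 behavior-cloning term; the optimizer choice and batch layout are illustrative.

```python
def distillation_update(policy, optimizer, batch, step, beta):
    """One student update that anneals from behavior cloning (DAgger) to PPO."""
    # Re-evaluate the student on the stored states.
    dist = policy(batch["states"])                       # assumed to return a torch distribution
    log_prob_new = dist.log_prob(batch["actions"]).sum(-1)
    # PPO objective L(psi): clipped surrogate from Sec. 3.2 (see the earlier sketch).
    ppo_loss = ppo_surrogate_loss(log_prob_new, batch["log_prob_old"], batch["advantages"])
    # BC objective J(psi): match the teacher's actions with the student's mean action.
    bc_loss = (dist.mean - batch["teacher_actions"]).norm(dim=-1).mean()
    # Schedule weight w = min(t / beta, 1): BC-dominated early, PPO-dominated later.
    w = min(step / beta, 1.0)
    loss = w * ppo_loss + (1.0 - w) * bc_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```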

3.4 Architecture

We set the keyframe indices $K$ (Sec. 3.1, Eq. 1) to $\{1,16\}$ for the teacher policies and $\{1,2,4,16\}$ for the student policy. The broader observation window of the student policy helps it better distinguish different skills when training on larger-scale data. Teacher policies employ MLPs, common in physics-based animation [104], while the student policy handles higher-dimensional observations, for which MLPs are less effective. Thus, we use a transformer [131] architecture for sequential modeling [125], as shown in Figure 2.

Algorithm 1 Distillation with RL Fine-tuning
1: Input: a composite policy $\pi^{(\text{T})}$ integrated from the individual teacher policies, student policy parameters $\boldsymbol{\psi}$, student value function parameters $\boldsymbol{\phi}$, schedule hyperparameter $\beta$ for DAgger, horizon length $H$ for PPO
2: for $t = 0, 1, 2, \ldots$ do
3:     for $h = 1$ to $H$ do
4:         Sample $u \sim \text{Uniform}(0, 1)$
5:         Collect $\boldsymbol{s}^{(\text{T})}, \boldsymbol{a}^{(\text{T})}$ from the teacher $\pi^{(\text{T})}$
6:         Obtain the refined reference from $\boldsymbol{s}^{(\text{T})}$ to define $\boldsymbol{s}^{(\text{S})}$ and $r(\cdot)$; obtain $\boldsymbol{a}^{(\text{S})}$ from $\pi^{(\text{S})}_{\boldsymbol{\psi}}(\boldsymbol{a}^{(\text{S})} \mid \boldsymbol{s}^{(\text{S})})$
7:         if $u \leq \frac{t}{\beta}$ then   ▷ Use the student's action
8:             Given $\boldsymbol{s}^{(\text{S})}$, execute $\boldsymbol{a}^{(\text{S})}$, observe $\boldsymbol{s}^{\prime(\text{S})}, r$
9:         else   ▷ Use the teacher's action
10:             Given $\boldsymbol{s}^{(\text{S})}$, execute $\boldsymbol{a}^{(\text{T})}$, observe $\boldsymbol{s}^{\prime(\text{S})}, r$
11:         end if
12:         Store the transition $(\boldsymbol{s}^{(\text{S})}, \boldsymbol{s}^{\prime(\text{S})}, \boldsymbol{a}^{(\text{S})}, \boldsymbol{a}^{(\text{T})}, r)$
13:     end for
14:     Update $\boldsymbol{\phi}$ with TD($\lambda$)
15:     Compute the PPO objective $L(\psi)$
16:     Compute the BC objective $J(\psi) = \|\boldsymbol{a}^{(\text{S})} - \boldsymbol{a}^{(\text{T})}\|$
17:     Compute the weight $w = \min(\frac{t}{\beta}, 1)$
18:     Update $\psi$ by the gradient $\nabla_{\psi}\left(w\,L(\psi) + (1 - w)\,J(\psi)\right)$
19: end for

4 Experiments

We evaluate teacher policies on their ability to imitate imperfect HOI references, and assess the entire teacher-student framework for scalability to large-scale data and zero-shot generalization across various scenarios. Additional experiments are provided in Sec. G of supplementary.

Datasets. We use the following datasets: OMOMO [70], BEHAVE [5], HODome [180], IMHD [191], and HIMO [93]. Owing to its scale, OMOMO, which contains 15 objects and approximately 10 hours of data, is our primary dataset for evaluating the full teacher-student distillation framework. We train 17 teacher policies, one per subject, with subject 14 reserved as the test set and the remaining data used for training the student policy. A small portion of data is discarded after teacher imitation due to severe MoCap errors that could not be corrected (see Sec. F and Sec. H of the supplementary). Additional datasets are used to evaluate teacher policies in various MoCap scenarios with different error levels and interaction types. We focus on highly dynamic motions (Figure 1) and interactions involving multiple body parts (Figure 4), while excluding scenarios such as carrying a bag with a strap, since the simulator [96] lacks full support for soft bodies.

Metrics. We use the following metrics: (i) Success Rate is defined as the proportion of references that the policy successfully imitates, averaged over all references, while (ii) Duration is the time (in seconds) that the imitation is maintained without triggering the interaction early termination conditions introduced in Sec. 3.2. (iii) Human Tracking Error ($E_h$) measures the per-joint position error (cm) between the simulated and reference human (excluding hand joints for BEHAVE [5] and OMOMO [70] due to inaccuracy), and (iv) Object Tracking Error ($E_o$) measures the per-point position error (cm) between the simulated and reference object. Both errors are averaged over the duration of the imitation.
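A small sketch of how $E_h$ and $E_o$ can be computed from a completed rollout (positions in meters converted to centimeters; the array shapes are assumptions):

```python
import numpy as np

def tracking_errors(sim_joints, ref_joints, sim_obj_pts, ref_obj_pts):
    """Mean per-joint (E_h) and per-point (E_o) position errors in centimeters.

    sim_joints, ref_joints: (T, J, 3) simulated and reference joint positions.
    sim_obj_pts, ref_obj_pts: (T, P, 3) simulated and reference object points.
    Errors are averaged over the duration T of the imitation.
    """
    e_h = np.linalg.norm(sim_joints - ref_joints, axis=-1).mean() * 100.0
    e_o = np.linalg.norm(sim_obj_pts - ref_obj_pts, axis=-1).mean() * 100.0
    return e_h, e_o
```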

Baselines. To facilitate fair comparisons, we downgrade our method for teacher policy evaluation to imitate either a single MoCap clip (Figure 4) or multiple clips with a single object (Table 1), enabling direct comparison with PhysHOI [142] and SkillMimic [143] (Sec. 4.1 and 4.2). Due to the lack of established baselines for large-scale HOI imitation, we adapt the following variants for comparison with our student policy (Sec. 4.3): (i) PPO [114] trains an imitation policy from scratch, following [104]. We experiment with both versions, with and without reference distillation; (ii) DAgger [112] distills the student without RL fine-tuning, a process we refer to as policy distillation.

Figure 4: Qualitative comparison between PhysHOI [142] (top), the reference motion (middle) from the BEHAVE [5] dataset, and the interaction refined by our teacher trained on it (bottom). InterMimic faithfully imitates the interactions involving multiple body parts while correcting errors in the original reference.

Implementation Details. The control policies operate at 30 Hz and are trained using the Isaac Gym simulator [96]. Teacher policies are implemented as MLPs with hidden layers of sizes 1024, 1024, and 512. The student policy is implemented as a three-layer Transformer encoder with 4 heads, a hidden size of 256, and a feed-forward layer of 512. The critics are also modeled as MLPs with the same architecture as the teacher policies. To integrate the student policy with kinematic generators, including text-to-HOI [103] and future interaction prediction [160], we train these models using reference data distilled by the teacher policies from the OMOMO [70] dataset, following the same train-test split as the student policy training. For the text-to-HOI model, we train it to generate 10 seconds of motion and use 24 generated samples for evaluation, while for future interaction prediction, the model generates 25 future frames given 10 past frames and we use 60 generated samples for evaluation. See Sec. F of the supplementary.
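The layer sizes above translate roughly to the following PyTorch modules; the observation/action dimensions, the SiLU activations, and the mean-pooling of transformer outputs are placeholders rather than the exact implementation.

```python
import torch
import torch.nn as nn

def make_teacher_policy(obs_dim, act_dim):
    """Teacher actor (and critic) MLP with hidden sizes 1024, 1024, and 512."""
    return nn.Sequential(
        nn.Linear(obs_dim, 1024), nn.SiLU(),
        nn.Linear(1024, 1024), nn.SiLU(),
        nn.Linear(1024, 512), nn.SiLU(),
        nn.Linear(512, act_dim),
    )

class StudentPolicy(nn.Module):
    """Three-layer Transformer encoder: 4 heads, hidden size 256, feed-forward 512."""

    def __init__(self, token_dim, act_dim):
        super().__init__()
        self.embed = nn.Linear(token_dim, 256)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(256, act_dim)

    def forward(self, tokens):               # tokens: (batch, num_tokens, token_dim)
        x = self.encoder(self.embed(tokens))
        return self.head(x.mean(dim=1))      # mean-pool tokens before the action head
```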

Figure 5: We recover plausible object rotations (bottom) that are challenging for motion capture due to the equivariant geometries of objects, which result in the object sliding on the ground (top).
Method | Time | $E_h$ | $E_o$
SkillMimic [143] | 12.2 | 7.2 | 13.4
InterMimic (Ours) w/o IET | 40.3 | 6.7 | 9.9
InterMimic (Ours) w/o PSI | 36.1 | 6.6 | 10.2
InterMimic (Ours) | 42.6 | 6.4 | 9.2
Table 1: Quantitative comparison between the teacher policy from InterMimic and SkillMimic [143] on imitating data extracted from the BEHAVE [5] dataset, involving a single subject interacting with a yoga mat. We ablate our proposed approach by removing interaction early termination and physical state initialization.

4.1 Quantitative Evaluation

Each dataset column reports Succ. / Time / $E_h$ / $E_o$.

PPO | Reference Distillation | Policy Distillation | Architecture | OMOMO-Train [70] | OMOMO-Test [70] | OMOMO-Test [70] (w $\times 10$) | HOI-Diff [103] | InterDiff [160]
✓ | × | × | MLP | 23.9 / 101.6 / 7.2 / 15.6 | 9.6 / 85.3 / 7.5 / 16.2 | 3.9 / 71.2 / 7.5 / 17.9 | 0.0 / 0.0 / – / – | 6.7 / 11.7 / 6.2 / 16.4
× | ✓ | ✓ | MLP | 54.5 / 139.9 / 7.1 / 11.0 | 54.3 / 140.2 / 7.1 / 11.2 | 15.5 / 91.7 / 9.3 / 13.7 | 4.2 / 84.8 / 10.1 / 9.7 | 65.0 / 27.4 / 7.5 / 13.4
✓ | ✓ | × | MLP | 71.7 / 152.8 / 8.9 / 12.7 | 91.6 / 173.7 / 8.5 / 13.2 | 45.8 / 127.6 / 9.1 / 14.9 | 8.3 / 130.9 / 10.1 / 13.8 | 73.3 / 28.9 / 6.9 / 14.4
✓ | ✓ | ✓ | MLP | 90.7 / 168.0 / 5.5 / 9.7 | 95.5 / 173.9 / 5.4 / 11.9 | 62.6 / 140.9 / 6.6 / 14.5 | 12.5 / 121.4 / 8.6 / 12.1 | 75.0 / 29.1 / 6.2 / 13.5
✓ | ✓ | ✓ | Transformer | 88.8 / 167.0 / 6.0 / 10.2 | 98.1 / 176.5 / 5.9 / 11.3 | 56.8 / 134.7 / 6.6 / 13.2 | 12.5 / 119.0 / 8.5 / 12.6 | 76.7 / 29.3 / 6.4 / 13.3
Table 2: Quantitative evaluation of large-scale interaction imitation using OMOMO [70], kinematic generations from HOI-Diff [103], and InterDiff [160]. Additionally, we evaluate on the test set with objects whose weights are ten times greater than those used during training.

Table 1 shows that the baseline struggles with MoCap imperfections, e.g., incorrect hand positioning, and thus results in clearly shorter tracking durations. In contrast, our method maintains reference tracking for longer durations and produces interactions that closely match the reference. Table 2 shows that our method consistently outperforms baselines in both training data imitation and out-of-distribution generalization, including interactions from the test set and from kinematic generations. We discuss the effectiveness of specific design choices in Sec. 4.3.

4.2 Qualitative Evaluation

Figure 4 shows a representative sequence from the experiment in Table 1, illustrating how our teacher policy corrects interactions that PhysHOI [142] fails to track robustly – our method effectively withstands and corrects incorrect hand positioning and floating contacts in the reference. Beyond obvious errors, our method also rectifies the rotation of symmetric objects that MoCap inaccurately depicts as sliding along the ground, shown in Figure 5. Figure 6 presents additional examples that complement Figure 1, demonstrating how our approach integrates with kinematic generators for future interaction prediction and text-to-interaction synthesis. This zero-shot generalization extends to novel objects unseen during training (Figure 7), highlighting the effectiveness of our object geometry and contact-encoded representation, as well as the large-scale training.

Figure 6: Zero-shot integration with a text-to-HOI model HOI-Diff [103] (Top), using ‘Kick the large box’ as the prompt, and an interaction prediction model InterDiff [160] (Bottom), where gray meshes are past states and colored illustrate future generations.

4.3 Ablation Study

Effectiveness of PSI and IET. We conduct an ablation study, as demonstrated in Table 1, comparing the full approach to “Ours w/o PSI”. The results validate that Physical State Initialization (PSI) is effective by mitigating inaccuracies in the motion capture data. We also observe reduced effectiveness without our interaction early termination, as training often spends rollouts on irrelevant periods.

Effectiveness of Reference Distillation. Compared to directly scaling imitation from MoCap with potential imperfections (line 1 in Table 2), using references refined by the teacher policy (line 3) achieves consistently better performance on all metrics. The improvement is even more pronounced on the test set, where, without reference distillation, the policy struggles with unseen shapes, while retargeting by reference distillation eliminates the difficulty.

Effectiveness of Joint PPO and DAgger Updates. As shown in Table 2, training a policy from scratch (line 3) or relying solely on policy distillation (DAgger, line 2) fails to achieve optimal performance. While supervised skill learning lays the groundwork, additional PPO fine-tuning is crucial for resolving conflicts among teacher policies. This is important because our subject-based clustering may not effectively distinguish between different interaction patterns, and ambiguity arises when multiple teachers produce different actions for similar motions.

Effectiveness of Transformer for Policy Learning. From Table 2, we see that using a Transformer policy (line 5) outperforms MLP-based approaches, particularly on the test set and out-of-distribution cases generated by the kinematic model. We attribute this to the Transformer’s inductive bias in sequential modeling and its capacity to incorporate longer-term observations, enabling it to handle complex spatio-temporal dependencies more effectively.

Figure 7: Zero-shot generalization of our student policy on novel objects from BEHAVE [5] and HODome [180].

5 Conclusion

In this work, we introduce a framework for synthesizing realistic human-object interactions that are both physically grounded and generalizable. Unlike previous methods, our approach leverages a rich repository of imperfect MoCap data to facilitate the learning of various interaction skills across a wide variety of objects. To address inaccuracies in the MoCap data, we propose contact-guided rewards and optimize trajectory collection, enabling teacher policies to recover missing physical details in the original data. These teacher policies are used to train student policies within a distillation framework that combines policy distillation and reference distillation, thus enabling efficient skill scaling. Our approach shows zero-shot generalizability, which effectively bridges the gap between imitation and generative capabilities by integrating with kinematic generation. We believe that this framework can be adapted for whole-body loco-manipulation for real-world robots, enabling them to handle objects with human-like dexterity and nuance.

References

  • Akkerman et al. [2024] Rick Akkerman, Haiwen Feng, Michael J Black, Dimitrios Tzionas, and Victoria Fernández Abrevaya. InterDyn: Controllable interactive dynamics with video diffusion models. arXiv preprint arXiv:2412.11785, 2024.
  • Bae et al. [2023] Jinseok Bae, Jungdam Won, Donggeun Lim, Cheol-Hui Min, and Young Min Kim. Pmp: Learning to physically interact with environments using part-wise motion priors. In SIGGRAPH, 2023.
  • Barquero et al. [2024] German Barquero, Sergio Escalera, and Cristina Palmero. Seamless human motion composition with blended positional encodings. In CVPR, 2024.
  • Ben et al. [2025] Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. HOMIE: Humanoid loco-manipulation with isomorphic exoskeleton cockpit. arXiv preprint arXiv:2502.13013, 2025.
  • Bhatnagar et al. [2022] Bharat Lal Bhatnagar, Xianghui Xie, Ilya Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. BEHAVE: Dataset and method for tracking human object interactions. In CVPR, 2022.
  • Braun et al. [2024] Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, and Otmar Hilliges. Physically plausible full-body hand-object interaction synthesis. In 3DV, 2024.
  • Cao et al. [2024] Jinkun Cao, Jingyuan Liu, Kris Kitani, and Yi Zhou. Multi-modal diffusion for hand-object grasp generation. arXiv preprint arXiv:2409.04560, 2024.
  • Cen et al. [2024] Zhi Cen, Huaijin Pi, Sida Peng, Zehong Shen, Minghui Yang, Shuai Zhu, Hujun Bao, and Xiaowei Zhou. Generating human motion in 3d scenes from text descriptions. In CVPR, 2024.
  • Chao et al. [2021] Yu-Wei Chao, Jimei Yang, Weifeng Chen, and Jia Deng. Learning to sit: Synthesizing human-chair interactions via hierarchical control. In AAAI, 2021.
  • Chen et al. [2024] Ling-Hao Chen, Shunlin Lu, Wenxun Dai, Zhiyang Dou, Xuan Ju, Jingbo Wang, Taku Komura, and Lei Zhang. Pay attention and move better: Harnessing attention for interactive motion generation and training-free editing. arXiv preprint arXiv:2410.18977, 2024.
  • Cheng et al. [2024] Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. In CoRL, 2024.
  • Chernyadev et al. [2024] Nikita Chernyadev, Nicholas Backshall, Xiao Ma, Yunfan Lu, Younggyo Seo, and Stephen James. Bigym: A demo-driven mobile bi-manual manipulation benchmark. arXiv preprint arXiv:2407.07788, 2024.
  • Christen et al. [2022] Sammy Christen, Muhammed Kocabas, Emre Aksan, Jemin Hwangbo, Jie Song, and Otmar Hilliges. D-grasp: Physically plausible dynamic grasp synthesis for hand-object interactions. In CVPR, 2022.
  • Christen et al. [2024] Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, and Bugra Tekin. DiffH2O: Diffusion-based synthesis of hand-object interactions from textual descriptions. In SIGGRAPH Asia, 2024.
  • Chu et al. [2024] Kun Chu, Xufeng Zhao, Cornelius Weber, Mengdi Li, Wenhao Lu, and Stefan Wermter. Large language models for orchestrating bimanual robots. arXiv preprint arXiv:2404.02018, 2024.
  • Cong et al. [2024] Peishan Cong, Ziyi Wang, Zhiyang Dou, Yiming Ren, Wei Yin, Kai Cheng, Yujing Sun, Xiaoxiao Long, Xinge Zhu, and Yuexin Ma. LaserHuman: Language-guided scene-aware human motion generation in free environment. arXiv preprint arXiv:2403.13307, 2024.
  • Corona et al. [2020] Enric Corona, Albert Pumarola, Guillem Alenya, and Francesc Moreno-Noguer. Context-aware human motion prediction. In CVPR, 2020.
  • Cui et al. [2024] Jieming Cui, Tengyu Liu, Nian Liu, Yaodong Yang, Yixin Zhu, and Siyuan Huang. AnySkill: Learning open-vocabulary physical skill for interactive agents. In CVPR, 2024.
  • Dahiya et al. [2009] Ravinder S Dahiya, Giorgio Metta, Maurizio Valle, and Giulio Sandini. Tactile sensing—from humans to humanoids. IEEE Transactions on Robotics, 26(1):1–20, 2009.
  • Dai et al. [2024a] Sisi Dai, Wenhao Li, Haowen Sun, Haibin Huang, Chongyang Ma, Hui Huang, Kai Xu, and Ruizhen Hu. Interfusion: Text-driven generation of 3d human-object interaction. In ECCV, 2024a.
  • Dai et al. [2024b] Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, and Yansong Tang. Motionlcm: Real-time controllable motion generation via latent consistency model. In ECCV, 2024b.
  • Daiya et al. [2024] Divyanshu Daiya, Damon Conover, and Aniket Bera. COLLAGE: Collaborative human-agent interaction generation using hierarchical latent diffusion and language models. arXiv preprint arXiv:2409.20502, 2024.
  • Dao et al. [2024] Jeremy Dao, Helei Duan, and Alan Fern. Sim-to-real learning for humanoid box loco-manipulation. In ICRA, 2024.
  • Diller and Dai [2024] Christian Diller and Angela Dai. CG-HOI: Contact-guided 3d human-object interaction generation. In CVPR, 2024.
  • Ding et al. [2025] Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, and Donglin Wang. Humanoid-VLA: Towards universal humanoid control with visual integration. arXiv preprint arXiv:2502.14795, 2025.
  • Fan et al. [2023] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, and Otmar Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In CVPR, 2023.
  • Fu et al. [2024] Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. HumanPlus: Humanoid shadowing and imitation from humans. In CoRL, 2024.
  • Gao et al. [2024a] Jianfeng Gao, Xiaoshu Jin, Franziska Krebs, Noémie Jaquier, and Tamim Asfour. Bi-KVIL: Keypoints-based visual imitation learning of bimanual manipulation tasks. In ICRA, 2024a.
  • Gao et al. [2024b] Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, and Jiangmiao Pang. CooHOI: Learning cooperative human-object interaction with manipulated object dynamics. In NeurIPS, 2024b.
  • Ghosh et al. [2023] Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. IMoS: Intent-driven full-body motion synthesis for human-object interactions. In Computer Graphics Forum, 2023.
  • Ghosh et al. [2024] Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. ReMoS: 3d motion-conditioned reaction synthesis for two-person interactions. In ECCV, 2024.
  • Gleicher [1997] Michael Gleicher. Motion editing with spacetime constraints. In Proceedings of the 1997 symposium on Interactive 3D graphics, 1997.
  • Gu et al. [2025] Zhaoyuan Gu, Junheng Li, Wenlan Shen, Wenhao Yu, Zhaoming Xie, Stephen McCrory, Xianyi Cheng, Abdulaziz Shamsah, Robert Griffin, C. Karen Liu, Abderrahmane Kheddar, Xue Bin Peng, Yuke Zhu, Guanya Shi, Quan Nguyen, Gordon Cheng, Huijun Gao, and Ye Zhao. Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning. arXiv preprint arXiv:2501.02116, 2025.
  • Guo et al. [2022] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In CVPR, 2022.
  • Haarnoja et al. [2024] Tuomas Haarnoja, Ben Moran, Guy Lever, Sandy H Huang, Dhruva Tirumala, Jan Humplik, Markus Wulfmeier, Saran Tunyasuvunakool, Noah Y Siegel, Roland Hafner, et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. Science Robotics, 9(89):eadi8022, 2024.
  • Hassan et al. [2021] Mohamed Hassan, Duygu Ceylan, Ruben Villegas, Jun Saito, Jimei Yang, Yi Zhou, and Michael J Black. Stochastic scene-aware motion prediction. In ICCV, 2021.
  • Hassan et al. [2023] Mohamed Hassan, Yunrong Guo, Tingwu Wang, Michael Black, Sanja Fidler, and Xue Bin Peng. Synthesizing physical character-scene interactions. In SIGGRAPH, 2023.
  • He et al. [2024a] Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. OmniH2O: Universal and dexterous human-to-humanoid whole-body teleoperation and learning. arXiv preprint arXiv:2406.08858, 2024a.
  • He et al. [2024b] Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human-to-humanoid real-time whole-body teleoperation. In IROS, 2024b.
  • He et al. [2024c] Wenkun He, Yun Liu, Ruitao Liu, and Li Yi. SyncDiff: Synchronized motion diffusion for multi-body human-object interaction synthesis. arXiv preprint arXiv:2412.20104, 2024c.
  • Holden et al. [2017] Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017.
  • Hong et al. [2019] Seokpyo Hong, Daseong Han, Kyungmin Cho, Joseph S Shin, and Junyong Noh. Physics-based full-body soccer motion control for dribbling and shooting. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019.
  • Hou et al. [2023] Zhi Hou, Baosheng Yu, and Dacheng Tao. Compositional 3d human-object neural animation. arXiv preprint arXiv:2304.14070, 2023.
  • Huang et al. [2024a] Buzhen Huang, Chen Li, Chongyang Xu, Liang Pan, Yangang Wang, and Gim Hee Lee. Closely interactive human reconstruction with proxemics and physics-guided adaption. In CVPR, 2024a.
  • Huang et al. [2024b] Binghao Huang, Yixuan Wang, Xinyi Yang, Yiyue Luo, and Yunzhu Li. 3d-vitac: Learning fine-grained manipulation with visuo-tactile sensing. In CoRL, 2024b.
  • Huang et al. [2022] Yinghao Huang, Omid Taheri, Michael J. Black, and Dimitrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction. In GCPR, 2022.
  • [47] Inspire-robots. Smaller and higher-precision motion control experts. https://inspire-robots.store/.
  • Ji et al. [2024] Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang. ExBody2: Advanced expressive humanoid whole-body control. arXiv preprint arXiv:2412.13196, 2024.
  • Jiang et al. [2023] Nan Jiang, Tengyu Liu, Zhexuan Cao, Jieming Cui, Yixin Chen, He Wang, Yixin Zhu, and Siyuan Huang. CHAIRS: Towards full-body articulated human-object interaction. In ICCV, 2023.
  • Jiang et al. [2024a] Nan Jiang, Zimo He, Zi Wang, Hongjie Li, Yixin Chen, Siyuan Huang, and Yixin Zhu. Autonomous character-scene interaction synthesis from text instruction. In SIGGRAPH Asia, 2024a.
  • Jiang et al. [2024b] Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction modeling. In CVPR, 2024b.
  • Jiang et al. [2024c] Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. DexMimicGen: Automated data generation for bimanual dexterous manipulation via imitation learning. arXiv preprint arXiv:2410.24185, 2024c.
  • Juravsky et al. [2024] Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. SuperPADL: Scaling language-directed physics-based control with progressive supervised distillation. In SIGGRAPH, 2024.
  • Kareer et al. [2024] Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning via egocentric video. arXiv preprint arXiv:2410.24221, 2024.
  • Khirodkar et al. [2024] Rawal Khirodkar, Jyun-Ting Song, Jinkun Cao, Zhengyi Luo, and Kris Kitani. Harmony4D: A video dataset for in-the-wild close human interactions. In NeurIPS, 2024.
  • Kim et al. [2024a] Hyeonwoo Kim, Sookwan Han, Patrick Kwon, and Hanbyul Joo. Zero-shot learning for the primitives of 3d affordance in general objects. arXiv preprint arXiv:2401.12978, 2024a.
  • Kim et al. [2024b] Hyeonwoo Kim, Sookwan Han, Patrick Kwon, and Hanbyul Joo. Beyond the contact: Discovering comprehensive affordance for 3d objects from pre-trained 2d diffusion models. In ECCV, 2024b.
  • Kim et al. [2025] Hyeonwoo Kim, Sangwon Beak, and Hanbyul Joo. DAViD: Modeling dynamic affordance of 3d objects using pre-trained video diffusion models. arXiv preprint arXiv:2501.08333, 2025.
  • Kim et al. [2024c] Jeonghwan Kim, Jisoo Kim, Jeonghyeon Na, and Hanbyul Joo. ParaHome: Parameterizing everyday home activities towards 3d generative modeling of human-object interactions. arXiv preprint arXiv:2401.10232, 2024c.
  • Kim et al. [2023] Taeksoo Kim, Shunsuke Saito, and Hanbyul Joo. NCHO: Unsupervised learning for neural 3d composition of humans and objects. In ICCV, 2023.
  • Kim et al. [2016] Yeonjoon Kim, Hangil Park, Seungbae Bang, and Sung-Hee Lee. Retargeting human-object interaction to virtual avatars. IEEE transactions on visualization and computer graphics, 22(11):2405–2412, 2016.
  • Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • Kraft [1994] Dieter Kraft. Algorithm 733: Tomp–fortran modules for optimal control calculations. ACM Transactions on Mathematical Software (TOMS), 20(3):262–281, 1994.
  • Kulkarni et al. [2024] Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, and Leonidas Guibas. NIFTY: Neural object interaction fields for guided human motion synthesis. In CVPR, 2024.
  • Lee and Joo [2023] Jiye Lee and Hanbyul Joo. Locomotion-Action-Manipulation: Synthesizing human-scene interactions in complex 3d environments. In ICCV, 2023.
  • Lee et al. [2006] Kang Hoon Lee, Myung Geol Choi, and Jehee Lee. Motion patches: building blocks for virtual environments annotated with motion data. In SIGGRAPH. 2006.
  • Lee et al. [2010] Yoonsang Lee, Sungeun Kim, and Jehee Lee. Data-driven biped control. In SIGGRAPH. 2010.
  • Li et al. [2024a] Chuqiao Li, Julian Chibane, Yannan He, Naama Pearl, Andreas Geiger, and Gerard Pons-Moll. Unimotion: Unifying 3d human motion synthesis and understanding. arXiv preprint arXiv:2409.15904, 2024a.
  • Li et al. [2024b] Hongjie Li, Hong-Xing Yu, Jiaman Li, and Jiajun Wu. ZeroHSI: Zero-shot 4d human-scene interaction by video generation. arXiv preprint arXiv:2412.18600, 2024b.
  • Li et al. [2023] Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023.
  • Li et al. [2024c] Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C Karen Liu. Controllable human-object interaction synthesis. In ECCV, 2024c.
  • Li et al. [2024d] Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, and Yuke Zhu. OKAMI: Teaching humanoid robots manipulation skills through single video imitation. In CoRL, 2024d.
  • Li and Dai [2024] Lei Li and Angela Dai. GenZI: Zero-shot 3d human-scene interaction generation. In CVPR, 2024.
  • Li et al. [2024e] Quanzhou Li, Jingbo Wang, Chen Change Loy, and Bo Dai. Task-oriented human-object interactions generation with implicit neural representations. In WACV, 2024e.
  • Li et al. [2024f] Ronghui Li, Youliang Zhang, Yachao Zhang, Yuxiang Zhang, Mingyang Su, Jie Guo, Ziwei Liu, Yebin Liu, and Xiu Li. InterDance: Reactive 3d dance generation with realistic duet interactions. arXiv preprint arXiv:2412.16982, 2024f.
  • Liu et al. [2024a] Fukang Liu, Zhaoyuan Gu, Yilin Cai, Ziyi Zhou, Shijie Zhao, Hyunyoung Jung, Sehoon Ha, Yue Chen, Danfei Xu, and Ye Zhao. Opt2skill: Imitating dynamically-feasible whole-body trajectories for versatile humanoid loco-manipulation. arXiv preprint arXiv:2409.20514, 2024a.
  • Liu et al. [2024b] Hanchao Liu, Xiaohang Zhan, Shaoli Huang, Tai-Jiang Mu, and Ying Shan. Programmable motion generation for open-set motion control tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1399–1408, 2024b.
  • Liu and Hodgins [2017] Libin Liu and Jessica Hodgins. Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics (TOG), 36(3):1–14, 2017.
  • Liu and Hodgins [2018] Libin Liu and Jessica Hodgins. Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.
  • Liu et al. [2022] Siqi Liu, Guy Lever, Zhe Wang, Josh Merel, SM Ali Eslami, Daniel Hennes, Wojciech M Czarnecki, Yuval Tassa, Shayegan Omidshafiei, Abbas Abdolmaleki, et al. From motor control to team play in simulated humanoid football. Science Robotics, 7(69):eabo0235, 2022.
  • Liu et al. [2024c] Yunze Liu, Changxi Chen, Chenjing Ding, and Li Yi. PhysReaction: Physically plausible real-time humanoid reaction synthesis via forward dynamics guided 4d imitation. In ACMMM, 2024c.
  • Liu et al. [2024d] Yun Liu, Bowen Yang, Licheng Zhong, He Wang, and Li Yi. Mimicking-bench: A benchmark for generalizable humanoid-scene interaction learning via human mimicking. arXiv preprint arXiv:2412.17730, 2024d.
  • Liu et al. [2024e] Yun Liu, Haolin Yang, Xu Si, Ling Liu, Zipeng Li, Yuxiang Zhang, Yebin Liu, and Li Yi. Taco: Benchmarking generalizable bimanual tool-action-object understanding. In CVPR, 2024e.
  • Liu et al. [2025] Yunze Liu, Changxi Chen, and Li Yi. Interactive humanoid: Online full-body motion reaction synthesis with social affordance canonicalization and forecasting. In 3DV, 2025.
  • Lou et al. [2024] Zhenyu Lou, Qiongjie Cui, Tuo Wang, Zhenbo Song, Luoming Zhang, Cheng Cheng, Haofan Wang, Xu Tang, Huaxia Li, and Hong Zhou. Harmonizing stochasticity and determinism: Scene-responsive diverse human motion prediction. In NeurIPS, 2024.
  • Lu et al. [2025] Chenhao Lu, Xuxin Cheng, Jialong Li, Shiqi Yang, Mazeyu Ji, Chengjing Yuan, Ge Yang, Sha Yi, and Xiaolong Wang. Mobile-television: Predictive motion priors for humanoid whole-body control. In ICRA, 2025.
  • Lu et al. [2024a] Jintao Lu, He Zhang, Yuting Ye, Takaaki Shiratori, Sebastian Starke, and Taku Komura. CHOICE: Coordinated human-object interaction in cluttered environments for pick-and-place actions. arXiv preprint arXiv:2412.06702, 2024a.
  • Lu et al. [2024b] Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, and Heung-Yeung Shum. HumanTOMATO: Text-aligned whole-body motion generation. In ICML, 2024b.
  • Luo et al. [2023a] Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In ICCV, 2023a.
  • Luo et al. [2023b] Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng Xu. Universal humanoid motion representations for physics-based control. arXiv preprint arXiv:2310.04582, 2023b.
  • Luo et al. [2024a] Zhengyi Luo, Jinkun Cao, Sammy Christen, Alexander Winkler, Kris Kitani, and Weipeng Xu. Grasping diverse objects with simulated humanoids. In NeurIPS, 2024a.
  • Luo et al. [2024b] Zhengyi Luo, Jiashun Wang, Kangni Liu, Haotian Zhang, Chen Tessler, Jingbo Wang, Ye Yuan, Jinkun Cao, Zihui Lin, Fengyi Wang, et al. SMPLOlympics: Sports environments for physically simulated humanoids. arXiv preprint arXiv:2407.00187, 2024b.
  • Lv et al. [2024] Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, et al. HIMO: A new benchmark for full-body human interacting with multiple objects. In ECCV, 2024.
  • Ma et al. [2024a] Junyi Ma, Jingyi Xu, Xieyuanli Chen, and Hesheng Wang. Diff-IP2D: Diffusion-based hand-object interaction prediction on egocentric videos. arXiv preprint arXiv:2405.04370, 2024a.
  • Ma et al. [2024b] Sihan Ma, Qiong Cao, Jing Zhang, and Dacheng Tao. Contact-aware human motion generation from textual descriptions. arXiv preprint arXiv:2403.15709, 2024b.
  • Makoviychuk et al. [2021] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. In NeurIPS, 2021.
  • Merel et al. [2020] Josh Merel, Saran Tunyasuvunakool, Arun Ahuja, Yuval Tassa, Leonard Hasenclever, Vu Pham, Tom Erez, Greg Wayne, and Nicolas Heess. Catch & carry: reusable neural controllers for vision-guided whole-body tasks. ACM Transactions on Graphics (TOG), 39(4):39–1, 2020.
  • Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
  • Pan et al. [2024] Liang Pan, Jingbo Wang, Buzhen Huang, Junyu Zhang, Haofan Wang, Xu Tang, and Yangang Wang. Synthesizing physically plausible human motions in 3d scenes. In 3DV, 2024.
  • Park et al. [2019] Soohwan Park, Hoseok Ryu, Seyoung Lee, Sunmin Lee, and Jehee Lee. Learning predict-and-simulate policies from unorganized human motion data. ACM Transactions on Graphics (TOG), 38(6):1–11, 2019.
  • Park et al. [2025] Sungjae Park, Seungho Lee, Mingi Choi, Jiye Lee, Jeonghwan Kim, Jisoo Kim, and Hanbyul Joo. Learning to transfer human hand skills for robot manipulations. arXiv preprint arXiv:2501.04169, 2025.
  • Pavlakos et al. [2019] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
  • Peng et al. [2023] Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, and Huaizu Jiang. HOI-Diff: Text-driven synthesis of 3d human-object interactions using diffusion models. arXiv preprint arXiv:2312.06553, 2023.
  • Peng et al. [2018] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG), 37(4):1–14, 2018.
  • Peng et al. [2021] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021.
  • Peng et al. [2022] Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions On Graphics (TOG), 41(4):1–17, 2022.
  • Petrov et al. [2023] Ilya A Petrov, Riccardo Marin, Julian Chibane, and Gerard Pons-Moll. Object pop-up: Can we infer 3d objects and their poses from human interactions alone? In CVPR, 2023.
  • Rajeswaran et al. [2018] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In RSS, 2018.
  • Razali and Demiris [2023] Haziq Razali and Yiannis Demiris. Action-conditioned generation of bimanual object manipulation sequences. In AAAI, 2023.
  • Reda et al. [2023] Daniele Reda, Jungdam Won, Yuting Ye, Michiel van de Panne, and Alexander Winkler. Physics-based motion retargeting from sparse inputs. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 6(3):1–19, 2023.
  • Romero et al. [2017] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, 36(6), 2017.
  • Ross et al. [2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
  • Schulman et al. [2016] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sferrazza et al. [2024] Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation. In RSS, 2024.
  • Si et al. [2024] Zilin Si, Gu Zhang, Qingwei Ben, Branden Romero, Zhou Xian, Chao Liu, and Chuang Gan. Difftactile: A physics-based differentiable tactile simulator for contact-rich robotic manipulation. In ICLR, 2024.
  • Smith et al. [2023] Laura Smith, J Chase Kew, Tianyu Li, Linda Luu, Xue Bin Peng, Sehoon Ha, Jie Tan, and Sergey Levine. Learning and adapting agile locomotion skills by transferring experience. In RSS, 2023.
  • Song et al. [2024] Wenfeng Song, Xinyu Zhang, Shuai Li, Yang Gao, Aimin Hao, Xia Hou, Chenglizhao Chen, Ning Li, and Hong Qin. HOIAnimator: Generating text-prompt human-object animations using novel perceptive diffusion models. In CVPR, 2024.
  • Starke et al. [2019] Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. Neural state machine for character-scene interactions. ACM Trans. Graph., 38(6):209–1, 2019.
  • Starke et al. [2020] Sebastian Starke, Yiwei Zhao, Taku Komura, and Kazi Zaman. Local motion phases for learning multi-contact character movements. ACM Transactions on Graphics (TOG), 39(4):54–1, 2020.
  • Sueda et al. [2008] Shinjiro Sueda, Andrew Kaufman, and Dinesh K Pai. Musculotendon simulation for hand animation. In SIGGRAPH. 2008.
  • Taheri et al. [2020] Omid Taheri, Nima Ghorbani, Michael J Black, and Dimitrios Tzionas. GRAB: A dataset of whole-body human grasping of objects. In ECCV, 2020.
  • Taheri et al. [2022] Omid Taheri, Vasileios Choutas, Michael J Black, and Dimitrios Tzionas. GOAL: Generating 4d whole-body motion for hand-object grasping. In CVPR, 2022.
  • Tessler et al. [2023] Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, and Xue Bin Peng. Calm: Conditional adversarial latent models for directable virtual characters. In SIGGRAPH, 2023.
  • Tessler et al. [2024] Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified physics-based character control through masked motion inpainting. ACM Transactions on Graphics (TOG), 43(6):1–21, 2024.
  • Tevet et al. [2023] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. In ICLR, 2023.
  • Tevet et al. [2025] Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit H Bermano, and Michiel van de Panne. CLoSD: Closing the loop between simulation and diffusion for multi-task character control. In ICLR, 2025.
  • Tian et al. [2024] Jie Tian, Lingxiao Yang, Ran Ji, Yuexin Ma, Lan Xu, Jingyi Yu, Ye Shi, and Jingya Wang. Gaze-guided hand-object interaction synthesis: Benchmark and method. arXiv preprint arXiv:2403.16169, 2024.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [130] Unitree. Unitree G1 humanoid agent AI avatar. https://www.unitree.com/g1/.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Villegas et al. [2021] Ruben Villegas, Duygu Ceylan, Aaron Hertzmann, Jimei Yang, and Jun Saito. Contact-aware retargeting of skinned motion. In ICCV, 2021.
  • Wagener et al. [2022] Nolan Wagener, Andrey Kolobov, Felipe Vieira Frujeri, Ricky Loynd, Ching-An Cheng, and Matthew Hausknecht. MoCapAct: A multi-task dataset for simulated humanoid control. In NeurIPS, 2022.
  • Wan et al. [2022] Weilin Wan, Lei Yang, Lingjie Liu, Zhuoying Zhang, Ruixing Jia, Yi-King Choi, Jia Pan, Christian Theobalt, Taku Komura, and Wenping Wang. Learn to predict how humans manipulate large-sized objects from interactive motions. IEEE Robotics and Automation Letters, 2022.
  • Wang et al. [2024a] Hanqing Wang, Jiahe Chen, Wensi Huang, Qingwei Ben, Tai Wang, Boyu Mi, Tao Huang, Siheng Zhao, Yilun Chen, Sizhe Yang, et al. Grutopia: Dream general robots in a city at scale. arXiv preprint arXiv:2407.10943, 2024a.
  • Wang et al. [2021] Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, and Xiaolong Wang. Synthesizing long-term 3d human motion and interaction in 3d scenes. In CVPR, 2021.
  • Wang et al. [2022a] Jingbo Wang, Yu Rong, Jingyuan Liu, Sijie Yan, Dahua Lin, and Bo Dai. Towards diverse and natural scene-aware 3d human motion synthesis. In CVPR, 2022a.
  • Wang et al. [2024b] Jiashun Wang, Jessica Hodgins, and Jungdam Won. Strategy and skill learning for physics-based table tennis animation. In SIGGRAPH, 2024b.
  • Wang et al. [2024c] Ruocheng Wang, Pei Xu, Haochen Shi, Elizabeth Schumann, and C Karen Liu. Fürelise: Capturing and physically synthesizing hand motion of piano performance. In SIGGRAPH Asia, 2024c.
  • Wang et al. [2024d] Wenjia Wang, Liang Pan, Zhiyang Dou, Zhouyingcheng Liao, Yuke Lou, Lei Yang, Jingbo Wang, and Taku Komura. SIMS: Simulating human-scene interactions with real world script planning. arXiv preprint arXiv:2411.19921, 2024d.
  • Wang et al. [2022b] Xi Wang, Gen Li, Yen-Ling Kuo, Muhammed Kocabas, Emre Aksan, and Otmar Hilliges. Reconstructing action-conditioned human-object interactions using commonsense knowledge priors. In 3DV, 2022b.
  • Wang et al. [2023] Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. PhysHOI: Physics-based imitation of dynamic human-object interaction. arXiv preprint arXiv:2312.04393, 2023.
  • Wang et al. [2024e] Yinhuai Wang, Qihan Zhao, Runyi Yu, Ailing Zeng, Jing Lin, Zhengyi Luo, Hok Wai Tsui, Jiwen Yu, Xiu Li, Qifeng Chen, et al. SkillMimic: Learning reusable basketball skills from demonstrations. arXiv preprint arXiv:2408.15270, 2024e.
  • Wang et al. [2024f] Zhenzhi Wang, Jingbo Wang, Dahua Lin, and Bo Dai. InterControl: Generate human motion interactions by controlling every joint. In NeurIPS, 2024f.
  • Wheatland et al. [2015] Nkenge Wheatland, Yingying Wang, Huaguang Song, Michael Neff, Victor Zordan, and Sophie Jörg. State of the art in hand and finger modeling and animation. In Computer Graphics Forum, 2015.
  • Won and Lee [2019] Jungdam Won and Jehee Lee. Learning body shape variation in physics-based characters. ACM Transactions on Graphics (TOG), 38(6):1–12, 2019.
  • Won et al. [2020] Jungdam Won, Deepak Gopinath, and Jessica Hodgins. A scalable approach to control diverse behaviors for physically simulated characters. ACM Transactions on Graphics (TOG), 39(4):33–1, 2020.
  • Wu et al. [2024a] Qianyang Wu, Ye Shi, Xiaoshui Huang, Jingyi Yu, Lan Xu, and Jingya Wang. THOR: Text to human-object interaction diffusion via relation intervention. arXiv preprint arXiv:2403.11208, 2024a.
  • Wu et al. [2025] Qingxuan Wu, Zhiyang Dou, Sirui Xu, Soshi Shimada, Chen Wang, Zhengming Yu, Yuan Liu, Cheng Lin, Zeyu Cao, Taku Komura, et al. DICE: End-to-end deformation capture of hand-face interactions from a single image. In ICLR, 2025.
  • Wu et al. [2022] Yan Wu, Jiahao Wang, Yan Zhang, Siwei Zhang, Otmar Hilliges, Fisher Yu, and Siyu Tang. SAGA: Stochastic whole-body grasping with contact. In ECCV, 2022.
  • Wu et al. [2024b] Zhen Wu, Jiaman Li, Pei Xu, and C Karen Liu. Human-object interaction from human-level instructions. arXiv preprint arXiv:2406.17840, 2024b.
  • Xiao et al. [2024] Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, and Jiangmiao Pang. Unified human-scene interaction via prompted chain-of-contacts. In ICLR, 2024.
  • Xie et al. [2022a] Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Chore: Contact, human and object reconstruction from a single rgb image. In ECCV, 2022a.
  • Xie et al. [2024a] Xianghui Xie, Jan Eric Lenssen, and Gerard Pons-Moll. InterTrack: Tracking human object interaction without object templates. In 3DV, 2024a.
  • Xie et al. [2024b] Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. OmniControl: Control any joint at any time for human motion generation. In ICLR, 2024b.
  • Xie et al. [2022b] Zhaoming Xie, Sebastian Starke, Hung Yu Ling, and Michiel van de Panne. Learning soccer juggling skills with layer-wise mixture-of-experts. In SIGGRAPH, 2022b.
  • Xie et al. [2023] Zhaoming Xie, Jonathan Tseng, Sebastian Starke, Michiel van de Panne, and C Karen Liu. Hierarchical planning and control for box loco-manipulation. arXiv preprint arXiv:2306.09532, 2023.
  • Xu et al. [2023a] Liang Xu, Ziyang Song, Dongliang Wang, Jing Su, Zhicheng Fang, Chenjing Ding, Weihao Gan, Yichao Yan, Xin Jin, Xiaokang Yang, et al. ActFormer: A gan-based transformer towards general action-conditioned 3d human motion generation. In ICCV, 2023a.
  • Xu and Wang [2024] Pei Xu and Ruocheng Wang. Synchronize dual hands for physics-based dexterous guitar playing. In SIGGRAPH Asia, 2024.
  • Xu et al. [2023b] Sirui Xu, Zhengyuan Li, Yu-Xiong Wang, and Liang-Yan Gui. InterDiff: Generating 3d human-object interactions with physics-informed diffusion. In ICCV, 2023b.
  • Xu et al. [2023c] Sirui Xu, Yu-Xiong Wang, and Liangyan Gui. Stochastic multi-person 3d motion forecasting. In ICLR, 2023c.
  • Xu et al. [2024a] Sirui Xu, Ziyin Wang, Yu-Xiong Wang, and Liang-Yan Gui. Interdreamer: Zero-shot text to 3d dynamic human-object interaction. In NeurIPS, 2024a.
  • Xu et al. [2024b] Zhu Xu, Qingchao Chen, Yuxin Peng, and Yang Liu. Semantic-aware human object interaction image generation. In ICML, 2024b.
  • Yang et al. [2024a] ChangHee Yang, ChanHee Kang, Kyeongbo Kong, Hanni Oh, and Suk-Ju Kang. Person in Place: Generating associative skeleton-guidance maps for human-object interaction image editing. In CVPR, 2024a.
  • Yang et al. [2024b] Jie Yang, Xuesong Niu, Nan Jiang, Ruimao Zhang, and Siyuan Huang. F-HOI: Toward fine-grained semantic-aligned 3d human-object interactions. In ECCV, 2024b.
  • Yang et al. [2024c] Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, and Zheng-Jun Zha. LEMON: Learning 3d human-object interaction relation from 2d images. In CVPR, 2024c.
  • Yang et al. [2024d] Yuhang Yang, Wei Zhai, Chengfeng Wang, Chengjun Yu, Yang Cao, and Zheng-Jun Zha. EgoChoir: Capturing 3d human-object interaction regions from egocentric views. In NeurIPS, 2024d.
  • Yang et al. [2022] Zeshi Yang, Kangkang Yin, and Libin Liu. Learning to use chopsticks in diverse gripping styles. ACM Transactions on Graphics (TOG), 41(4):1–17, 2022.
  • Ye and Liu [2012] Yuting Ye and C Karen Liu. Synthesis of detailed hand manipulations using contact sampling. ACM Transactions on Graphics (ToG), 31(4):1–10, 2012.
  • Ye et al. [2023] Yufei Ye, Xueting Li, Abhinav Gupta, Shalini De Mello, Stan Birchfield, Jiaming Song, Shubham Tulsiani, and Sifei Liu. Affordance diffusion: Synthesizing hand-object interactions. In CVPR, 2023.
  • Yu et al. [2024] Kelin Yu, Yunhai Han, Qixian Wang, Vaibhav Saxena, Danfei Xu, and Ye Zhao. Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation. In CoRL, 2024.
  • Yu et al. [2018] Wenhao Yu, Greg Turk, and C Karen Liu. Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (TOG), 37(4):1–12, 2018.
  • Yuan and Kitani [2020] Ye Yuan and Kris Kitani. Residual force control for agile human behavior imitation and extended motion synthesis. In NeurIPS, 2020.
  • Yuan et al. [2021] Ye Yuan, Shih-En Wei, Tomas Simon, Kris Kitani, and Jason Saragih. Simpoe: Simulated character control for 3d human pose estimation. In CVPR, 2021.
  • Ze et al. [2024] Yanjie Ze, Zixuan Chen, Wenhao Wang, Tianyi Chen, Xialin He, Ying Yuan, Xue Bin Peng, and Jiajun Wu. Generalizable humanoid manipulation with improved 3d diffusion policies. arXiv preprint arXiv:2410.10803, 2024.
  • Zhang et al. [2024a] Chengwen Zhang, Yun Liu, Ruofan Xing, Bingda Tang, and Li Yi. Core4d: A 4d human-object-human interaction dataset for collaborative object rearrangement. arXiv preprint arXiv:2406.19353, 2024a.
  • Zhang et al. [2024b] Chong Zhang, Wenli Xiao, Tairan He, and Guanya Shi. Wococo: Learning whole-body humanoid control with sequential contacts. arXiv preprint arXiv:2406.06005, 2024b.
  • Zhang et al. [2021] He Zhang, Yuting Ye, Takaaki Shiratori, and Taku Komura. ManipNet: neural manipulation synthesis with a hand-object spatial representation. ACM Transactions on Graphics, 40(4):1–14, 2021.
  • Zhang et al. [2023a] Haotian Zhang, Ye Yuan, Viktor Makoviychuk, Yunrong Guo, Sanja Fidler, Xue Bin Peng, and Kayvon Fatahalian. Learning physically simulated tennis skills from broadcast videos. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023a.
  • Zhang et al. [2023b] Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. NeuralDome: A neural modeling pipeline on multi-view human-object interactions. In CVPR, 2023b.
  • Zhang et al. [2023c] Jiajun Zhang, Yuxiang Zhang, Hongwen Zhang, Xiao Zhou, Boyao Zhou, Ruizhi Shao, Zonghai Hu, and Yebin Liu. Ins-hoi: Instance aware human-object interactions recovery. arXiv preprint arXiv:2312.09641, 2023c.
  • Zhang et al. [2024c] Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. HOI-M³: Capture multiple humans and objects interaction within contextual environment. In CVPR, 2024c.
  • Zhang et al. [2024d] Jiajun Zhang, Yuxiang Zhang, Liang An, Mengcheng Li, Hongwen Zhang, Zonghai Hu, and Yebin Liu. ManiDext: Hand-object manipulation synthesis via continuous correspondence embeddings and residual-guided diffusion. arXiv preprint arXiv:2409.09300, 2024d.
  • Zhang et al. [2020] Jason Y Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. Perceiving 3d human-object spatial arrangements from a single image in the wild. In ECCV, 2020.
  • Zhang et al. [2024e] Wanyue Zhang, Rishabh Dabral, Thomas Leimkühler, Vladislav Golyanik, Marc Habermann, and Christian Theobalt. ROAM: Robust and object-aware motion generation using neural pose descriptors. In 3DV, 2024e.
  • Zhang et al. [2022] Xiaohan Zhang, Bharat Lal Bhatnagar, Sebastian Starke, Vladimir Guzov, and Gerard Pons-Moll. COUCH: Towards controllable human-chair interactions. In ECCV, 2022.
  • Zhang et al. [2024f] Xiaohan Zhang, Bharat Lal Bhatnagar, Sebastian Starke, Ilya Petrov, Vladimir Guzov, Helisa Dhamo, Eduardo Pérez-Pellitero, and Gerard Pons-Moll. FORCE: Dataset and method for intuitive physics guided human-object interaction. In 3DV, 2024f.
  • Zhang et al. [2024g] Xiaohan Zhang, Sebastian Starke, Vladimir Guzov, Zhensong Zhang, Eduardo Pérez Pellitero, and Gerard Pons-Moll. SCENIC: Scene-aware semantic navigation with instruction-guided control. arXiv preprint arXiv:2412.15664, 2024g.
  • Zhang et al. [2023d] Yunbo Zhang, Deepak Gopinath, Yuting Ye, Jessica Hodgins, Greg Turk, and Jungdam Won. Simulation and retargeting of complex multi-character interactions. In SIGGRAPH, 2023d.
  • Zhang et al. [2024h] Yixuan Zhang, Hui Yang, Chuanchen Luo, Junran Peng, Yuxi Wang, and Zhaoxiang Zhang. OOD-HOI: Text-driven 3d whole-body human-object interactions generation beyond training domains. arXiv preprint arXiv:2411.18660, 2024h.
  • Zhao et al. [2024a] Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I’M HOI: Inertia-aware monocular capture of 3d human-object interactions. In CVPR, 2024a.
  • Zhao et al. [2022] Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, and Siyu Tang. Compositional human-scene interaction synthesis with semantic control. In ECCV, 2022.
  • Zhao et al. [2023] Kaifeng Zhao, Yan Zhang, Shaofei Wang, Thabo Beeler, and Siyu Tang. Synthesizing diverse human motions in 3d indoor scenes. In ICCV, 2023.
  • Zhao et al. [2024b] Kaifeng Zhao, Gen Li, and Siyu Tang. DART: A diffusion-based autoregressive motion model for real-time text-driven motion control. arXiv preprint arXiv:2410.05260, 2024b.
  • Zheng et al. [2023] Juntian Zheng, Qingyuan Zheng, Lixing Fang, Yun Liu, and Li Yi. CAMS: Canonicalized manipulation spaces for category-level functional hand-object manipulation synthesis. In CVPR, 2023.
  • Zhong et al. [2024] Lei Zhong, Yiming Xie, Varun Jampani, Deqing Sun, and Huaizu Jiang. Smoodi: Stylized motion diffusion model. In ECCV, 2024.
  • Zhou et al. [2022] Keyang Zhou, Bharat Lal Bhatnagar, Jan Eric Lenssen, and Gerard Pons-Moll. Toch: Spatio-temporal object-to-hand correspondence for motion refinement. In ECCV, 2022.

Supplementary Material

Figure A: InterMimic enables simulated humans to perform physical interactions, featuring scalable skill learning covering diverse objects.

In this supplementary, we provide additional method details and experimental setups:

  (i) Demo Video. A demonstration video (with a screenshot in Figure A) is provided at demo.mp4, as described in Sec. A.

  (ii) Simulation Setup. The environment configuration for physical HOI simulations is introduced in Sec. B.

  (iii) Reference Contact Labels. Additional information on obtaining the reference contact label $\hat{\boldsymbol{c}}_{t}$ is detailed in Sec. C.

  (iv) Reward Formulation. A comprehensive explanation of the reward design is provided in Sec. D.

  (v) Physical State Initialization & Interaction Early Termination. Further insights into these mechanisms are discussed in Sec. E.

  (vi) Implementation Details. This section (Sec. F) covers reframing our method for interaction prediction and text-guided interaction generation, as well as translating MoCap interactions into humanoid robot skills.

  (vii) Additional Experiments. Sec. G presents further qualitative results and analyzes failure cases.

  (viii) Limitations and Societal Impact. Finally, we examine the limitations of InterMimic and its potential societal implications in Sec. H.

We will release the code for this project at our webpage.


A Demo Video

In addition to the qualitative results presented in the main paper, we provide a demo video (demo.mp4) for more detailed visualizations of the tasks, further illustrating the efficacy of our approach. The demo video conveys the following key points:

  (i) Our teacher policy can imitate highly dynamic and long-term interactions, both of which are inherently challenging.

  (ii) We visualize the effectiveness of our teacher policy in HOI retargeting. Given MoCap references for humans, we successfully transfer these tasks to a humanoid robot, tolerating embodiment differences.

  (iii) Our method corrects errors in reference interactions, addressing contact penetration, floating, and jittering issues. This demonstrates how teacher-based reference distillation can provide cleaner data for student policy training.

  (iv) The baseline method PhysHOI [142] fails on sequences our approach successfully imitates, complementing Figure 4 in the main paper.

  (v) Our student policy exhibits strong scalability, effectively learning from hours of data across diverse objects and interaction skills.

  (vi) The framework grants the student policy zero-shot generalizability, enabling direct application to text-to-HOI, interaction prediction, and interactions with new skills or objects – even multiple objects not present in the training set.

Figure B: Visualization of the objects from OMOMO [70], each decomposed into 64 convex hulls for simulation.

B Setup of Physical Interaction Simulation

The reference data represent humans using the SMPL models [111, 102]. For simulation, we convert these models into box and cylindrical rigid bodies following [174, 89]. Objects are likewise converted into simulation models through convex decomposition, as illustrated in Figure B. We summarize the physics parameters for our task in Table A. We follow the physics parameters for the human specified in [142, 143], with the exception of a specialized range of motion (RoM) for the hands, detailed in Table B. Our RoM setting is biologically inspired: finger flexion and extension (bending and straightening) are fully activated, but, unlike in a real human, abduction and adduction of the metacarpophalangeal (MCP) joints are constrained to minimize the risk of finger interpenetration in the absence of a correct reference hand pose for guidance. The rationale for these RoM settings is discussed in Sec. 3.2 of the main paper and Sec. D.3 of the supplementary.

Hyperparameter Value
Sim dt 1/60 s
Control dt 1/30 s
Number of envs 8192
Number of substeps 2
Number of pos iterations 4
Number of vel iterations 0
Contact offset 0.02
Rest offset 0.0
Max depenetration velocity 100
Object & ground restitution 0.7
Object & ground friction 0.9
Object density 200
Object max convex hulls 64
Table A: Simulation hyperparameters used in Isaac Gym [96].
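
For concreteness, the following is a minimal sketch – not our released code – of how the Table A values could be passed to Isaac Gym [96]; the asset root and file name are placeholders, and the object/ground friction and restitution would additionally be written into the corresponding rigid-shape properties.

```python
from isaacgym import gymapi

gym = gymapi.acquire_gym()

# Simulation parameters from Table A (control runs at 1/30 s, i.e., every 2 sim steps).
sim_params = gymapi.SimParams()
sim_params.dt = 1.0 / 60.0
sim_params.substeps = 2
sim_params.up_axis = gymapi.UP_AXIS_Z
sim_params.gravity = gymapi.Vec3(0.0, 0.0, -9.81)
sim_params.physx.num_position_iterations = 4
sim_params.physx.num_velocity_iterations = 0
sim_params.physx.contact_offset = 0.02
sim_params.physx.rest_offset = 0.0
sim_params.physx.max_depenetration_velocity = 100.0
sim = gym.create_sim(0, 0, gymapi.SIM_PHYSX, sim_params)

# Object asset: convex decomposition into at most 64 hulls (Figure B) via V-HACD.
asset_options = gymapi.AssetOptions()
asset_options.density = 200.0
asset_options.vhacd_enabled = True
asset_options.vhacd_params.max_convex_hulls = 64
obj_asset = gym.load_asset(sim, "assets", "object.urdf", asset_options)  # placeholder path
```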
Joint x-dim y & z-dim
MCP [-55.625°, 55.625°] [-5.625°, 5.625°]
PIP [-55.625°, 55.625°] –
DIP [-5.625°, 90.000°] –
CMC [-55.625°, 55.625°] [-55.625°, 55.625°]
MCP [-5.625°, 5.625°] [-5.625°, 5.625°]
IP [-5.625°, 90.000°] [-5.625°, 5.625°]
Table B: We constrain the Range of Motion (RoM) for the joints in the index, middle, ring, and pinky fingers including: the MCP (Metacarpophalangeal) joint where the finger meets the hand, the PIP (Proximal Interphalangeal) joint as the middle joint, and the DIP (Distal Interphalangeal) joint closest to the fingertip. For the thumb, we consider the CMC (Carpometacarpal) joint at the base in the palm, the MCP connecting the thumb to the hand, and the IP (Interphalangeal) joint within the thumb. The coordinates for describing these RoMs are based on the human model from [89].
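
To make the Table B limits concrete, the sketch below expresses them as per-DoF bounds (in radians) for a non-thumb finger and clips candidate PD targets accordingly; the DoF ordering is an assumption made purely for illustration, and in the simulator the same numbers are enforced as joint limits.

```python
import math
import torch

DEG = math.pi / 180.0

# Table B bounds for one non-thumb finger, assuming a (MCP-x, MCP-y, MCP-z, PIP-x, DIP-x) ordering.
LOWER = torch.tensor([-55.625, -5.625, -5.625, -55.625, -5.625]) * DEG
UPPER = torch.tensor([ 55.625,  5.625,  5.625,  55.625, 90.000]) * DEG

def clip_finger_targets(pd_targets: torch.Tensor) -> torch.Tensor:
    """Keep RL-explored grasping inside the biologically inspired range of motion."""
    return torch.max(torch.min(pd_targets, UPPER), LOWER)

# Example: an over-abducted MCP target is pulled back into range.
print(clip_finger_targets(torch.tensor([0.3, 0.4, 0.0, 0.2, 2.0])))
```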

C Reference Contact

In this section, we detail how we extract the reference contact that formulates the state and the reward as discussed in Sec. 3.1 of the main paper. One solution involves loading the HOI data into the simulation, replaying the data, and using the force detector in Isaac Gym [96] to identify contact, as suggested by [142]. However, this approach is ineffective for imperfect MoCap data; for instance, the force detector fails to capture contact when floating artifacts occur. To address this limitation, we propose solutions tailored differently for teacher and student training:

Reference contact for the student. We query the force detector from distilled reference in the simulation rather than from MoCap data replay, as the teacher policy is capable of correcting artifacts.

Reference contact for teachers. To account for contact distance variances, we determine reference contact based on inferred dynamics from kinematics, as outlined below.

C.1 Inferring Reference Dynamics

By analyzing the object’s acceleration over time, we can approximate external forces without depending on simulated dynamics. We assume human-object interaction occurs if any of these conditions hold: (i) The object is airborne, but its acceleration deviates significantly from gravitational acceleration, indicating that an external force, e.g., human interaction is acting upon it. (ii) The object is on the ground but not static, and its acceleration significantly differs from what is expected due to friction alone, suggesting additional force input. (iii) The minimum distance between the human and object vertices is below 0.01 meters.

When any condition is met, we define the contact threshold $\sigma$ as the minimum distance between the human and object vertices, plus 0.005 meters. This adaptive threshold is essential for accommodating contact distance variations in the ground-truth MoCap data. For example, the contact promotion marker is defined as $\hat{\boldsymbol{c}}_{b}[i]=\|\hat{\boldsymbol{d}}[i]\|<\sigma$, where $i$ is the index of human rigid bodies. We integrate $\hat{\boldsymbol{c}}_{b}$ into the contact promotion reward $R^{c}_{b}$, as introduced in Sec. 3.2 of the main paper and detailed in Sec. D.2 of the supplementary. $\hat{\boldsymbol{d}}$ denotes the joint-to-object vectors defined in Sec. 3.1.
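
A minimal sketch of this labeling is given below; the tolerances for a "significant" deviation from free fall or friction-only motion (acc_tol, fric_acc) are illustrative placeholders, whereas the 0.01 m and 0.005 m constants follow the description above.

```python
import torch

GRAVITY = torch.tensor([0.0, 0.0, -9.81])

def reference_contact(joint_obj_dist, obj_acc, obj_on_ground, obj_speed,
                      acc_tol=1.0, fric_acc=2.0):
    """Adaptive reference contact labels (one frame) for teacher training.

    joint_obj_dist: (J,) distance from each human rigid body to the object surface.
    obj_acc:        (3,) object acceleration from finite differences of the reference.
    """
    min_dist = joint_obj_dist.min().item()
    # (i) airborne but not in free fall -> an external (human) force is acting on the object
    airborne_forced = (not obj_on_ground) and (obj_acc - GRAVITY).norm().item() > acc_tol
    # (ii) on the ground, moving, with acceleration beyond what friction alone explains
    ground_forced = obj_on_ground and obj_speed > 1e-3 and obj_acc[:2].norm().item() > fric_acc
    # (iii) human and object vertices are closer than 0.01 m
    close = min_dist < 0.01
    if airborne_forced or ground_forced or close:
        sigma = min_dist + 0.005          # adaptive contact threshold
        return joint_obj_dist < sigma     # (J,) boolean labels, i.e., c_hat_b
    return torch.zeros_like(joint_obj_dist, dtype=torch.bool)
```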

D Additional Details on Reward

In this section, we provide further details about the reward function used for policy training. Specifically, we describe how we balance the components of the embodiment-aware reward, formulate the contact and energy rewards, address hand interaction recovery, and explain the process of integrating all rewards into a unified scalar.

D.1 Embodiment-Aware Reward

We formulate the weight $\boldsymbol{w}_{d}$, introduced in Sec. 3.2 of the main paper, for balancing the embodiment-aware reward:

$\boldsymbol{w}_{d}[i]=0.5\times\frac{1/\|\boldsymbol{d}[i]\|^{2}}{\sum_{i}1/\|\boldsymbol{d}[i]\|^{2}}+0.5\times\frac{1/\|\hat{\boldsymbol{d}}[i]\|^{2}}{\sum_{i}1/\|\hat{\boldsymbol{d}}[i]\|^{2}},$   (2)

where $i$ is the joint index, and $\boldsymbol{d}$ and $\hat{\boldsymbol{d}}$ are vectors from the human joint to the object surface for simulation and reference, respectively, as defined in Sec. 3.1 of the main paper. The values $\|\boldsymbol{d}[i]\|^{2}$ and $\|\hat{\boldsymbol{d}}[i]\|^{2}$ are clipped by a small positive value to prevent division by zero.
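
As a sketch, Eq. (2) can be computed per frame as follows; the clipping value eps is an illustrative choice rather than the exact constant we use.

```python
import torch

def embodiment_weight(d: torch.Tensor, d_hat: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Eq. (2): per-joint weights from simulated (d) and reference (d_hat)
    joint-to-object-surface vectors of shape (J, 3)."""
    inv_sim = 1.0 / torch.clamp((d ** 2).sum(dim=-1), min=eps)      # 1 / ||d[i]||^2, clipped
    inv_ref = 1.0 / torch.clamp((d_hat ** 2).sum(dim=-1), min=eps)  # 1 / ||d_hat[i]||^2, clipped
    return 0.5 * inv_sim / inv_sim.sum() + 0.5 * inv_ref / inv_ref.sum()
```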

Our joint position and rotation tracking rewards, $R^{h}_{p}$ and $R^{h}_{\theta}$, include both body and hand joints, even when imitating datasets such as [5, 70], in which the hands always appear in flat or mean poses. This encourages the hands to maintain a reasonable default pose when the contact reward is not activated.

D.2 Contact Reward

The contact promotion cost function $E^{c}_{b}$ is designed to encourage highly probable contact, as highlighted by the red regions in Figure 3(i) of the main paper. This reward utilizes the adaptive contact marker $\hat{\boldsymbol{c}}_{b}$ described in Sec. C.1,

$E^{c}_{b}=\sum\|\hat{\boldsymbol{c}}_{b}-\boldsymbol{c}\|\odot\hat{\boldsymbol{c}}_{b},$   (3)

where $\boldsymbol{c}$ is the simulated contact extracted from the detected forces, as introduced in Sec. 3.1 of the main paper.

Contact penalties, applied to the blue regions in Figure 3(i) of the main paper, are defined using a larger, fixed threshold $\sigma_{p}=0.1$. Specifically, $\hat{\boldsymbol{c}}_{p}[i]=(\|\hat{\boldsymbol{d}}[i]\|>\sigma_{p})\land\neg\hat{\boldsymbol{c}}_{g}[i]$, where $\|\hat{\boldsymbol{d}}[i]\|$ is the distance between joint $i$ and the object surface in the reference interaction as defined in Sec. 3.1 of the main paper, and the negation $\neg$ of $\hat{\boldsymbol{c}}_{g}[i]$ indicates that rigid body part $i$ is not in contact with the ground. The penalty cost is then calculated as:

$E^{c}_{p}=\sum\|\boldsymbol{c}\|\odot\hat{\boldsymbol{c}}_{p}.$   (4)
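
Assuming binary contact vectors over the human rigid bodies, the two costs can be evaluated as in the sketch below; the hand contact guidance of Sec. D.3 follows the same pattern restricted to the hand bodies.

```python
import torch

def contact_costs(c, c_hat_b, d_hat_norm, ground_contact_ref, sigma_p: float = 0.1):
    """Contact promotion E^c_b (Eq. 3) and contact penalty E^c_p (Eq. 4).

    c:                  (J,) simulated contact from the force detector (0/1).
    c_hat_b:            (J,) adaptive reference contact labels from Sec. C.1 (0/1).
    d_hat_norm:         (J,) reference joint-to-object distances ||d_hat[i]||.
    ground_contact_ref: (J,) bodies in contact with the ground in the reference (bool).
    """
    c, c_hat_b = c.float(), c_hat_b.float()
    e_promote = ((c_hat_b - c).abs() * c_hat_b).sum()          # encourage likely contacts
    c_hat_p = (d_hat_norm > sigma_p) & (~ground_contact_ref)   # far from object, not on ground
    e_penalty = (c * c_hat_p.float()).sum()                    # penalize spurious contacts
    return e_promote, e_penalty
```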

D.3 Hand Interaction Recovery

Our hand contact guidance is defined as:

$E^{c}_{h}=\sum\|\boldsymbol{c}^{\mathrm{lhand}}-\hat{\boldsymbol{c}}^{\mathrm{lhand}}\|\odot\hat{\boldsymbol{c}}^{\mathrm{lhand}}+\sum\|\boldsymbol{c}^{\mathrm{rhand}}-\hat{\boldsymbol{c}}^{\mathrm{rhand}}\|\odot\hat{\boldsymbol{c}}^{\mathrm{rhand}},$   (5)

where $\boldsymbol{c}^{\mathrm{lhand}}$ and $\boldsymbol{c}^{\mathrm{rhand}}$ represent contact labels for the rigid body components of the hands. The reference contact markers, $\hat{\boldsymbol{c}}^{\mathrm{lhand}}$ and $\hat{\boldsymbol{c}}^{\mathrm{rhand}}$, are defined when any hand vertices are within the adaptive threshold distance $\sigma$ of the objects, as described in Sec. C.1 of the supplementary. To avoid overly aggressive hand contact that could lead to unrealistic poses, we impose range of motion constraints for the hand, as shown in Table B, ensuring that RL-explored grasping remains biologically realistic.

D.4 Energy Reward

We define the energy costs as $E^{e}_{h}=\sum\|\boldsymbol{a}_{h}\|$, $E^{e}_{o}=\sum\|\boldsymbol{a}_{o}\|$, and $E^{e}_{c}=\max\|\boldsymbol{f}\|$, where $\boldsymbol{a}_{h}$ represents the acceleration of human joints, $\boldsymbol{a}_{o}$ the object's acceleration, and $\boldsymbol{f}$ the force detected on human rigid bodies. Applying them penalizes abrupt contact and promotes smooth interactions.

D.5 Reward Aggregation

We define the weights for each cost function, including $E^{h}_{p}$, $E^{h}_{\theta}$, $E_{d}$, $E^{o}_{p}$, and $E^{o}_{\theta}$, as described in Sec. 3.2 of the main paper, along with $E^{c}_{b}$, $E^{c}_{p}$, $E^{c}_{h}$, $E^{e}_{h}$, $E^{e}_{o}$, and $E^{e}_{c}$ detailed in this supplementary, as $(\lambda^{h}_{p},\lambda^{h}_{\theta},\lambda_{d},\lambda^{o}_{p},\lambda^{o}_{\theta},\lambda_{c_{b}},\lambda_{c_{p}},\lambda_{c_{h}},\lambda^{h}_{e},\lambda^{o}_{e},\lambda^{f}_{e})$.
The final aggregated reward is computed as $R=\exp(-\lambda^{h}_{\theta}E^{h}_{\theta}-\lambda^{h}_{p}E^{h}_{p}-\lambda^{o}_{\theta}E^{o}_{\theta}-\lambda^{o}_{p}E^{o}_{p}-\lambda_{d}E_{d}-\lambda_{c_{b}}E^{c}_{b}-\lambda_{c_{p}}E^{c}_{p}-\lambda_{c_{h}}E^{c}_{h}-\lambda^{h}_{e}E^{e}_{h}-\lambda^{o}_{e}E^{e}_{o}-\lambda^{f}_{e}E^{e}_{c})$, following the multiplication-of-exponentials structure suggested in [147, 100].
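
Collecting all terms, the aggregation reduces to a weighted sum inside a single exponential, as sketched below using the weight values of Sec. F; the dictionary keys are named purely for illustration.

```python
import torch

# (lambda^h_p, lambda^h_theta, lambda_d, lambda^o_p, lambda^o_theta,
#  lambda_{c_b}, lambda_{c_p}, lambda_{c_h}, lambda^h_e, lambda^o_e, lambda^f_e), see Sec. F.
LAMBDAS = {"p_h": 30.0, "theta_h": 2.5, "d": 5.0, "p_o": 0.1, "theta_o": 5.0,
           "c_b": 5.0, "c_p": 5.0, "c_h": 3.0, "e_h": 2e-5, "e_o": 2e-5, "e_f": 1e-9}

def aggregate_reward(costs: dict) -> torch.Tensor:
    """R = exp(-sum_k lambda_k E_k); `costs` maps each key above to its scalar cost E_k."""
    total = sum(LAMBDAS[k] * costs[k] for k in LAMBDAS)
    return torch.exp(torch.as_tensor(-total))
```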

E Additional Details on Trajectory Collection

E.1 Interaction Early Termination

In Sec. 3.2 of the main paper, we introduce the termination conditions defined for human-object interaction. In addition, we use three conditions that are general to single-human imitation: (i) the joints are, on average, more than 0.5 meters from their reference; (ii) the root joint is below a height of 0.15 meters; (iii) the episode ends after 300 frames, the maximum episode length (also specified in Table C).
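
A per-environment check covering these three conditions might look as follows (a minimal sketch, not our exact implementation; the interaction-specific conditions from Sec. 3.2 would be checked alongside it).

```python
import torch

def general_early_termination(joint_pos, joint_pos_ref, root_height, frame_idx,
                              max_episode_len: int = 300) -> bool:
    """The three single-human termination conditions of Sec. E.1, per environment."""
    joint_err = (joint_pos - joint_pos_ref).norm(dim=-1).mean()    # joint_pos: (J, 3)
    return bool(joint_err > 0.5                                     # (i) avg. deviation > 0.5 m
                or root_height < 0.15                               # (ii) root below 0.15 m
                or frame_idx >= max_episode_len)                    # (iii) episode cap
```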

E.2 Physical State Initialization

Figure C: A sanity check on why Reference State Initialization (RSI) [104] can fail: we use a bar to represent the reference interaction sequence that the policy imitates, where red regions indicate that initializing there leads to immediate failure, while green regions signify that successful initialization is possible. There may be periods, shown as two gray blocks, where the policy cannot collect trajectories for updates (i.e., unreachable regions), because a successful rollout of fixed length cannot span a large region of failed RSI initializations. In real scenarios, rollouts can be suboptimal and terminate prematurely, preventing the policy from collecting sufficient trajectories for challenging periods that extend beyond the boundaries illustrated by the gray blocks.

Limitations of RSI. Figure C illustrates why Reference State Initialization (RSI) [104] is suboptimal for interaction imitation with imperfect MoCap data. In single-person MoCap scenarios, where failures are less frequent, RSI performs well; however, in the presence of MoCap errors, RSI leads to reduced experience collection, ultimately undermining performance.

Does Interaction Early Termination Help? While early termination can filter out poor initial states, excessive initialization failures lead to frequent simulation resets that significantly slow down training. Consequently, the agent spends more time restarting simulations rather than engaging in productive learning.

Step-by-step details to complement Sec. 3.2 of the main paper: (i) PSI begins by creating an initialization buffer that stores a collection of reference states from motion capture data and simulation states from previous rollouts. This buffer is used to select initialization states for future rollouts. (ii) For each new rollout, an initialization state is randomly selected from the buffer. (iii) Using the current policy, the agent performs rollouts in the simulation environment by taking actions, transitioning through states, and receiving rewards. (iv) After each rollout, the collected trajectories are evaluated based on their expected discounted rewards to update the critic network. Trajectories with expected rewards above a defined threshold are added to the PSI buffer, while older or lower-quality trajectories are removed to maintain the buffer’s size and quality. We apply PSI in a sparse manner to enhance training efficiency, with a probability of 0.005 for updating the buffer for each rollout.
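
The buffer logic can be summarized by the sketch below; the capacity and value threshold are illustrative placeholders, while the 0.005 update probability follows the text.

```python
import random

class PSIBuffer:
    """Minimal sketch of the Physical State Initialization buffer (Sec. E.2)."""

    def __init__(self, reference_states, max_size=4096, value_threshold=0.5, update_prob=0.005):
        self.states = list(reference_states)   # (i) seeded with MoCap reference states
        self.max_size = max_size
        self.value_threshold = value_threshold
        self.update_prob = update_prob

    def sample(self):
        # (ii) pick an initialization state uniformly at random for the next rollout
        return random.choice(self.states)

    def maybe_add(self, rollout_states, expected_return):
        # (iv) sparse update: only rollouts whose expected discounted return clears the
        # threshold are added; the oldest entries are evicted to keep the buffer bounded
        if random.random() < self.update_prob and expected_return > self.value_threshold:
            self.states.extend(rollout_states)
            self.states = self.states[-self.max_size:]
```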

F Additional Implementation Details

In Figures D and E, we illustrate the framework that integrates kinematic generators with our InterMimic – the policy uses the kinematic output as the input reference to imitate. Table C lists the hyperparameters used during PPO training [114]. The reward weights $(\lambda^{h}_{p},\lambda^{h}_{\theta},\lambda_{d},\lambda^{o}_{p},\lambda^{o}_{\theta},\lambda_{c_{b}},\lambda_{c_{p}},\lambda_{c_{h}},\lambda^{h}_{e},\lambda^{o}_{e},\lambda^{f}_{e})$ are set to $(30, 2.5, 5, 0.1, 5, 5, 5, 3, 2\times10^{-5}, 2\times10^{-5}, 10^{-9})$.

For evaluation on the OMOMO [70] dataset, we use Subject 9 as the base model, with teacher policies retargeting interactions from other subjects into this base.

Similar to existing motion imitation approaches [104], we use the Isaac Gym API [96] to initialize the first simulated frame to match the first reference frame – whether it comes from MoCap or a kinematic generation method. The subsequent sequence is then simulated from this starting frame.

For learning interaction skills on a humanoid robot [130, 47] from SMPL-X [102] data, we bypass external retargeting and directly learn, highlighting our framework’s integrated ability for both retargeting and imitation. Note that we model each Inspire hand with 12 actuators using PD control, without accounting for the mimic joint present in the actual setup, which could be inapplicable in real deployment. Due to the embodiment gap, the humanoid cannot be initialized to match the first SMPL-X frame. Thus, we adopt a two-stage approach: during the first 15 frames, the policy learns to stand and approach the reference’s initial pose, establishing a basis for subsequent tracking. Afterward, the policy transitions to track the reference. We rewrite the position and rotation rewards for the robot’s joints mapped to SMPL-X joints. We do not use the contact reward as we disable the self-collision, since the human reference now cannot ensure proper collision constraints for the humanoid robot. To mitigate the impact of contact artifacts in MoCap data without relying on a contact reward, we leverage teacher distillation references for training.

For interactions involving multiple objects, our framework remains unchanged except for the state and reward components related to the objects, such as $\{\boldsymbol{\theta}_{t}^{o},\boldsymbol{p}_{t}^{o},\boldsymbol{\omega}_{t}^{o},\boldsymbol{v}_{t}^{o}\}$, $\boldsymbol{d}_{t}$, and the rewards $R^{o}_{p}$, $R^{o}_{\theta}$, and $R_{d}$, which now include multiple components to represent multiple objects.

Figure D: Overview of integrating HOI-Diff [103] with InterMimic to perform text-guided interaction generation, i.e., generating interaction sequences based on text input.
Figure E: Overview of integrating InterDiff [160] with InterMimic to perform interaction prediction, i.e., generating future interactions based on past interaction frames.
Hyperparameter Value
Action distribution 153D Continuous
Discount factor γ 0.99
Generalized advantage estimation λ 0.95
Entropy regularization coefficient 0.0
Optimizer Adam [62]
Learning rate (Actor) 2e-5
Learning rate (Critic) 1e-4
Minibatch size 16384
Horizon length H 32
Action bounds loss coefficient 10
Maximum episode length 300
Table C: Hyperparameters for training teacher and student policies.

G Additional Experimental Results

In this section, we present experimental results that could not be included in the main paper due to space limitations.

Failure Cases. In Figure F, we illustrate an example where our teacher policies fail to perform successful imitation. Despite the strong adaptability of our policies, as demonstrated in Figures 1 and 5, where they effectively correct reference errors, there are limits when the reference contains too many errors. Since the reward design inherently prioritizes reference tracking, excessive errors in the reference inevitably result in failures.

HOI Retargeting. Figure G shows that our teacher policies, trained on reference data for a specific body shape, can successfully drive a human model with a body shape different from the reference in the simulator to accomplish the same task, albeit with slightly varied trajectories. This result highlights the effectiveness of our design, which integrates retargeting into interaction imitation.

Figure F: For certain references from OMOMO [70], the hand is incorrectly flipped, which leads to failure of the teacher policy. We exclude such data when training the student policy.
Figure G: Comparison between the reference interaction (human shown in green) and the simulated interaction (human shown in yellow) demonstrates that, despite the different body shapes, the simulated human driven by InterMimic successfully accomplishes the same task with different trajectories, highlighting the effectiveness of our imitation as retargeting.

H Discussion

Limitations and Future Work. One limitation, as discussed in Sec. G and illustrated in Figure F, is that our method struggles to fully correct MoCap data with significant errors. However, it also underscores a strength of our teacher-student framework: teacher policies filter out data that are too corrupted to imitate, allowing the student policy to concentrate on learning from viable samples and avoid wasting training effort on unrecoverable data.

The policy sometimes produces unnatural object support, where the human holds the object through penetration rather than friction. While we mitigate this issue by setting a high maximum depenetration velocity in simulation (see Table A) and applying a contact-based energy cost (see Sec. D.4) to discourage large forces that could cause penetration, this does not entirely solve the problem. A potential solution could involve using a signed-distance-based penetration score as a criterion for early termination.

The hand interaction recovery method is effective for the tasks explored in this paper. For tasks requiring dexterity with detailed finger motions, its benefits may be limited.

Additionally, while our method demonstrates good scalability by effectively training on hours of MoCap data involving different objects and generalizing to unseen skills and object geometries, its performance could be further improved with a larger dataset. Incorporating more diverse objects [154] would likely further enhance InterMimic’s zero-shot generalization capabilities.

Potential Negative Societal Impact. Our approach has the potential to generate vivid human-object interaction sequences, which, if misused, could lead to negative societal impacts, with the risk of creating misleading content by depicting individuals in fabricated scenarios. However, our model is designed with privacy in mind – it employs an abstract representation, using simple geometric shapes like boxes and cylinders to depict different parts. This abstraction reduces the inclusion of personally identifiable features, making it less likely for our synthesized data to be misused in ways that compromise individual identities.