CEI : A Unified Interface for Cross-Embodiment Visuomotor Policy Learning in 3D Space

Tong Wu1†, Shoujie Li1,2†, Junhao Gong1, Changqing Guo1, Xingting Li1, Shilong Mu3, Wenbo Ding1
†Equal contribution
1Shenzhen International Graduate School, Tsinghua University
2School of Mechanical and Aerospace Engineering, Nanyang Technological University
3Xspark AI

CEI transfers robot manipulation data across different embodiments.

Abstract

Teaser

Robotic foundation models trained on large-scale manipulation datasets have shown promise in learning generalist policies, but they often overfit to specific viewpoints, robot arms, and especially parallel-jaw grippers due to dataset biases. To address this limitation, we propose the Cross-Embodiment Interface (CEI), a framework for cross-embodiment learning that enables the transfer of demonstrations across different robot arm and end-effector morphologies. CEI introduces the concept of functional similarity, quantified using the Directional Chamfer Distance. It then aligns robot trajectories through gradient-based optimization and synthesizes observations and actions for unseen robot arms and end-effectors. In experiments, CEI transfers data and policies from a Franka Panda robot to 16 different embodiments across 3 tasks in simulation, and supports bidirectional transfer between a UR5+AG95 gripper robot and a UR5+Xhand robot across 6 real-world tasks, achieving an average transfer ratio of 82.4%. Finally, we demonstrate that CEI can also be extended with spatial generalization and multimodal motion generation capabilities using our proposed techniques.

Method

CEI leverages a novel notion of functional similarity, which captures shared object interaction behaviors across different end-effectors, to align demonstrations from a source embodiment to a target embodiment. This is accomplished by quantifying functional similarity using the Directional Chamfer Distance between manually selected functional representations, aligning trajectories via gradient-based optimization, and synthesizing observations and actions for the target robot.
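As a concrete illustration, the one-directional Chamfer distance at the core of this similarity measure can be expressed in a few lines. The sketch below is a minimal PyTorch version operating on two point sets; the function name and the use of squared distances are our assumptions rather than the exact implementation.

```python
# Minimal sketch of a Directional Chamfer Distance between two functional
# representations, each given as an (N, 3) / (M, 3) tensor of 3D points.
import torch


def directional_chamfer_distance(source_pts: torch.Tensor,
                                 target_pts: torch.Tensor) -> torch.Tensor:
    """One-directional Chamfer distance from source_pts (N, 3) to target_pts (M, 3).

    For every source point, take the squared distance to its nearest target
    point and average over all source points.
    """
    # Pairwise squared distances: (N, M)
    diff = source_pts.unsqueeze(1) - target_pts.unsqueeze(0)
    sq_dist = (diff ** 2).sum(dim=-1)
    # Nearest-neighbor distance for each source point, then the mean
    return sq_dist.min(dim=1).values.mean()
```

Because this measure is differentiable, it can in principle be backpropagated through the forward kinematics of the target embodiment, which is what enables the gradient-based trajectory alignment described above.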

Cross-Embodiment Transfer in Simulation

Simulation Environment

Here we demonstrate the cross-embodiment transfer capability of CEI in simulation. In 3 different tasks, the source demonstrations in the left column can be transferred to 16 target embodiments shown in the right column.


Evaluation Videos (👆Click to Select!)

Source Embodiment

Target Embodiment

Quantitative Results

Success rates across different embodiments
Table I: Success rates of CEI across the 16 different embodiment combinations.

These results indicate that despite variations in kinematics and morphology, CEI is capable of bridging the cross-embodiment gap by leveraging functional similarity. We further observe that the difficulty of cross-embodiment transfer increases with the complexity and dexterity requirements of the task.

Ablation Study on Cross-Embodiment Techniques

Ablation study on functional similarity
Table II: Ablation study on trajectory alignment across tasks and embodiments.

The results show that CEI without the Directional Chamfer Distance achieves an average success rate of only 32%, roughly half that of the full CEI. BMS completely failed in the PickCube and StackCube tasks, as it is challenging to manually determine optimal open and close poses, and the linear interpolation often leads to unstable grasps. Moreover, although BMS constrains the target embodiment to an opening degree similar to that of the source, discrepancies between the two end-effectors (e.g., the distance from the grasp point to the end-effector frame) result in frequent failures.

Ablation Study on Functional Representations

Sensitivity analysis of the functional representations
Table III: Sensitivity analysis of the functional representations.

We find that although we select three different functional representations, their success rates remain comparable, suggesting that CEI is robust to such variations and exhibits low sensitivity to the choice of functional representation.

Ablation Study on Observation Synthesis

Ablation study on observation synthesis
Table IV: Policy evaluation on synthesized data generated by CEI.

Table IV presents the policy evaluation results using synthesized cross-embodiment data. Policies trained without any augmentation fail to complete the tasks, demonstrating the necessity of targeted data augmentation for cross-embodiment generalization. Additionally, removing Inference Augmentation results in a 22% drop in success rate.

Bidirectional Transfer in Real World

Real World Tasks

We evaluate bidirectional transfer between the AG95 gripper and Xhand on 6 real-world tasks: PushCube, OpenDrawer, PlaceBird, PickCup, PackageBread, and InsertFlower. For the first three tasks, we collect 25 AG95 demonstrations and transfer to Xhand; for the latter three, we collect 25 Xhand demonstrations and transfer to AG95. DP3 policies are trained on CEI-generated data and evaluated over 10 trials per task.

Generated Data Visualization (👆Click to Select!)

Below are examples of data generated by CEI for different tasks. Select a task to view the corresponding visualization.

Source Data

Target Data

Qualitative Evaluation (👆Click to Select!)

Source Embodiment

Target Embodiment

Quantitative Results

Success rates across 6 tasks
Table V: Real world evaluation.

Table V demonstrates the bidirectional transfer capabilities of CEI on real-world tasks. We compare policies trained on synthesized data against those trained on source data. Overall, CEI reaches an average success rate of 70%, with a transfer ratio (success rate of CEI divided by that of the source embodiment) of 82.4%.
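Written explicitly, the transfer ratio reported in Table V is

\[
\text{Transfer Ratio} = \frac{\mathrm{SR}_{\text{CEI}}}{\mathrm{SR}_{\text{source}}},
\]

where \(\mathrm{SR}_{\text{CEI}}\) is the success rate of the policy trained on CEI-synthesized data for the target embodiment and \(\mathrm{SR}_{\text{source}}\) is that of the policy trained on the source-embodiment data.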

Time Cost of Transfer

Time cost for generating real-world demonstrations
Table VI: Time cost for generating real-world demonstrations.

CEI requires significantly less time than MimicGen, which depends heavily on online execution. DemoGen generates hundreds of demonstrations in one second, whereas CEI requires several minutes because it relies on gradient-based optimization.

Adaptation to External Disturbances

Failure Cases

Failure Cases

In summary, the majority of failure cases are attributable to unstable contacts or grasps. This is an inherent characteristic of geometry-based synthesis approaches, which may not fully account for physical dynamics. As discussed in the Conclusion, integrating tactile sensing would be an important direction for addressing this limitation.

Broader Applications

Spatial Generalization

Spatial Generalization

For spatial generalization, we first apply a transform with clipped linear growth to the functional representation trajectory. The target embodiment is subsequently aligned to the augmented trajectory through the standard CEI optimization procedure. The augmented point cloud is then obtained by applying the transform to the object point cloud and synthesizing the robot point cloud according to the augmented trajectory. Results show that our approach extends the policy to press the button over a wide area of the table, rather than being limited to the original position.
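To make the augmentation step concrete, the sketch below blends a sampled planar offset into the functional-representation trajectory with a weight that grows linearly and is clipped at 1, so early waypoints stay close to the original demonstration while later waypoints are fully shifted. The blending schedule, names, and shapes are assumptions, not the exact implementation.

```python
# Illustrative sketch of a clipped-linear-growth transform applied to a
# functional-representation trajectory for spatial augmentation.
import numpy as np


def augment_trajectory(traj_xyz: np.ndarray,
                       offset_xy: np.ndarray,
                       ramp_steps: int) -> np.ndarray:
    """traj_xyz: (T, 3) functional-representation positions.
    offset_xy: (2,) planar offset sampled within the workspace.
    ramp_steps: number of steps over which the offset ramps from 0 to 1.
    """
    T = traj_xyz.shape[0]
    # Linear ramp clipped to [0, 1]
    weights = np.clip(np.arange(T) / max(ramp_steps, 1), 0.0, 1.0)
    augmented = traj_xyz.copy()
    augmented[:, :2] += weights[:, None] * offset_xy[None, :]
    return augmented


# The object point cloud is shifted by the fully ramped offset, and the
# robot point cloud is re-synthesized along the augmented trajectory.
```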

Below are 10 video examples demonstrating CEI's spatial generalization ability. Click a button to play the video!

Multimodal Data Generation

Multimodal Data Generation

During data synthesis, we observe that initializing the embodiment from different joint configurations leads to multimodal alignment outcomes, as the functional correspondences between parallel grippers and dexterous hands are inherently multimodal. This property can be leveraged by manipulating the joint initialization through an Elite-based Initialization Strategy (EIS).
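As a rough sketch of how such an initialization strategy could surface different modes, the snippet below runs the CEI alignment from many random joint initializations and keeps the lowest-cost (elite) solutions; the selection rule and the `align_trajectory` interface are assumptions rather than the actual EIS implementation.

```python
# Hedged sketch of an elite-based initialization scheme for multimodal
# alignment. `align_trajectory` stands in for the gradient-based CEI
# alignment and is assumed to return (aligned_joints, alignment_cost).
import numpy as np


def elite_initializations(align_trajectory, joint_low, joint_high,
                          num_samples: int = 32, num_elites: int = 4,
                          seed=None):
    """Run alignment from many random joint initializations and keep the
    elite (lowest-cost) solutions, which typically land in different modes
    of the functional correspondence."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(num_samples):
        q_init = rng.uniform(joint_low, joint_high)
        aligned, cost = align_trajectory(q_init)
        results.append((cost, aligned))
    results.sort(key=lambda item: item[0])
    return [aligned for _, aligned in results[:num_elites]]
```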

CEI with RGB Inputs

CEI with RGB Inputs

Our method is compatible with RGB inputs by replacing the point cloud editing module with an image augmentation module. We employ the Segment Anything Model (SAM) to mask out the source robot, followed by ProPainter to reconstruct the background and object. To render the target embodiment, we retrieve the visual observation that best matches the aligned joint states from a pre-collected image-action library, using the lowest L2 distance as the selection metric, and apply SAM to obtain the segmented robot. Finally, we composite the retrieved target robot into the inpainted background to generate the synthetic 2D observation.
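The retrieval and compositing step can be sketched as follows, assuming the SAM masks and the ProPainter-inpainted background have already been computed offline; the array shapes and names are illustrative, not the exact implementation.

```python
# Illustrative sketch of nearest-neighbor retrieval from an image-action
# library followed by compositing the segmented target robot onto the
# inpainted background.
import numpy as np


def retrieve_and_composite(aligned_joints: np.ndarray,
                           library_joints: np.ndarray,
                           library_images: np.ndarray,
                           library_masks: np.ndarray,
                           inpainted_bg: np.ndarray) -> np.ndarray:
    """aligned_joints: (D,) target joint state from CEI alignment.
    library_joints: (K, D) joint states of the pre-collected library.
    library_images / library_masks: (K, H, W, 3) frames and (K, H, W)
    SAM segmentation masks of the target robot.
    inpainted_bg: (H, W, 3) background with the source robot removed.
    """
    # Nearest library frame by L2 distance in joint space
    idx = np.argmin(np.linalg.norm(library_joints - aligned_joints, axis=1))
    robot_img, robot_mask = library_images[idx], library_masks[idx]
    # Paste the segmented target robot onto the inpainted background
    mask = robot_mask[..., None].astype(bool)
    return np.where(mask, robot_img, inpainted_bg)
```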

Evaluation on PickCube

We selected the simulated PickCube task to validate this pipeline. Qualitative results confirm the pipeline's ability to synthesize visually plausible observations. We subsequently evaluated a 2D diffusion policy trained on the generated data, denoted as CEI-RGB. As illustrated in the figure, CEI-RGB achieves performance comparable to the standard CEI while operating without any depth modality.

Real-world Transfer Results

We also selected PickCup as the representative real-world task to demonstrate transfer from UR5+AG95 to UR5+Xhand. Although we have not yet trained a 2D diffusion policy on this specific task, the high realism of the synthesized images and the simulation results provide preliminary evidence that the reliance on depth cameras could be effectively eliminated.