Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies

Sun Yat-sen University, The University of Hong Kong, Shanghai Jiao Tong University, The Chinese University of Hong Kong, Shenzhen, Shanghai AI Laboratory
NeurIPS 2025
† Equal contribution ‡ Corresponding Author

Real World Demos

Card Box Stacking

Card Box Handover

Grab Roller

Abstract

Effective coordination in embodied multi-agent systems remains a fundamental challenge—particularly in scenarios where agents must balance individual perspectives with global environmental awareness. Existing approaches often struggle to reconcile fine-grained local control with comprehensive scene understanding, resulting in limited scalability and compromised collaboration quality.

In this paper, we present GauDP, a novel Gaussian-image synergistic representation that facilitates scalable, perception-aware imitation learning in multi-agent collaborative systems. Specifically, GauDP constructs a globally consistent 3D Gaussian field from decentralized RGB observations and then dynamically redistributes 3D Gaussian attributes to each agent's local perspective. All agents can thus adaptively query task-critical features from the shared scene representation while maintaining their individual viewpoints. This design facilitates both fine-grained control and globally coherent behavior without requiring additional sensing modalities (e.g., 3D point clouds).

We evaluate GauDP on the RoboFactory benchmark, which includes diverse multi-arm manipulation tasks. Our method achieves superior performance over existing image-based methods and approaches the effectiveness of point-cloud-driven methods, while maintaining strong scalability as the number of agents increases.

Motivation


Both local and global context are essential in multi-agent collaboration. Comparison of multi-agent decision-making using different types of contextual information. (a) Using only local context leads to miscoordination, such as collisions, due to a lack of global awareness. (b) Using only global context provides a holistic scene view but lacks detailed local features, resulting in inaccurate control, such as biased localization. (c) Our proposed method, GauDP, reconstructs global context from 2D local images via a shared 3D Gaussian representation and fuses it with local observations. Based solely on 2D observations, this integration enables both accurate localization and coordinated execution.

Local or Global Context Alone Is Not Enough

  • Purely local perception captures fine-grained visual details, but lacks awareness of other agents and the global scene, often leading to miscoordination.
  • Purely global perception provides a holistic view of the environment, yet misses viewpoint-specific details required for precise manipulation and control.

Limitations of Existing Multi-Agent Perception Pipelines

  • Naive fusion of multi-view images lacks explicit 3D structural constraints, making spatial reasoning unreliable.
  • Image-only methods struggle with depth and geometry understanding, while point-cloud-based approaches depend on additional sensing hardware and scale poorly.

Core Challenge in Multi-Agent Collaboration

  • How to enable agents to share a consistent global scene understanding while preserving agent-specific local precision.
  • How to achieve this with scalable computation and RGB-only observations.

Can we design a unified representation that simultaneously supports global 3D consistency and fine-grained local control using only RGB input?

Method Overview


(a) Overview of the proposed GauDP framework for multi-agent imitation learning. Each agent extracts a local context from its 2D observation. A shared 3D Gaussian field is constructed from all views to form the global context, which is fused with the local context and passed through an encoder. The resulting per-agent features are processed by a diffusion policy via cross-attention to predict actions. (b) Pipeline for constructing the global Gaussian field. Multi-view images are encoded and aggregated via cross-attention, followed by a reconstruction loss Lrec between rendered and input views to ensure consistency.
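To make the fusion step above concrete, the sketch below implements plain scaled dot-product cross-attention in NumPy: one agent's local image tokens act as queries over features attached to the shared Gaussian field, and the retrieved global context is concatenated with the local tokens. All names, shapes, and the concatenation-based fusion here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product cross-attention: each query attends over all keys
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
n_gauss, n_tokens, d = 64, 8, 32

# hypothetical global context: per-Gaussian features of the shared 3D field
gaussian_feats = rng.standard_normal((n_gauss, d))

# hypothetical local context: tokens from one agent's 2D RGB observation
local_tokens = rng.standard_normal((n_tokens, d))

# the agent queries task-critical features from the shared field,
# then fuses them with its own local tokens (here, by concatenation)
global_ctx = cross_attention(local_tokens, gaussian_feats, gaussian_feats)
fused = np.concatenate([local_tokens, global_ctx], axis=-1)
print(fused.shape)  # (8, 64)
```

In the full framework, the fused per-agent features would then condition a diffusion policy head; this toy version only shows how a shared representation can be queried from each agent's viewpoint.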

Visualization of Reconstruction Results


Simulation Performance


Real-World Performance


BibTeX

@inproceedings{wang2025gaudp,
    title={Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies},
    author={Ziye Wang and Li Kang and Yiran Qin and Jiahua Ma and Zhanglin Peng and Lei Bai and Ruimao Zhang},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
    url={https://openreview.net/forum?id=asS4W7Yw5e}
}