Single-View 3D-Aware Representations for Reinforcement Learning by Cross-View Neural Radiance Fields

Daesol Cho1,*, Seungyeon Yoo2,*, Dongseok Shim3, H. Jin Kim2
1 Georgia Institute of Technology, 2 Seoul National University, 3 Sony Group Corporation
* Equal contribution

SinCro enables superior RL performance from single-view RGB inputs and remains robust under viewpoint perturbations, without requiring camera poses or synchronized multi-view images.

Abstract

Reinforcement learning (RL) has enabled robots to develop complex skills, but its success in image-based tasks often depends on effective representation learning. Prior works have primarily focused on 2D representations, often overlooking the inherent 3D geometric structure of the world, or have attempted to learn 3D representations that require extensive resources such as synchronized multi-view images even during deployment. To address these issues, we propose a novel RL framework that extracts 3D-aware representations from single-view RGB input, without requiring camera poses or synchronized multi-view images during downstream RL. Our method employs an autoencoder architecture, using a masked Vision Transformer (ViT) as the encoder and a latent-conditioned Neural Radiance Field (NeRF) as the decoder, trained with cross-view completion to implicitly capture fine-grained, 3D geometry-aware representations. Additionally, we utilize a time contrastive loss that further regularizes the learned representation for consistency across different viewpoints, enabling viewpoint-robust robot manipulation. Our method significantly enhances the RL agent's performance in both simulation and real-world experiments, demonstrating superior effectiveness compared to prior 3D-aware representation-based methods, even when using only single-view RGB images during deployment.
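
To make the viewpoint-consistency regularization concrete, below is a minimal PyTorch sketch of an InfoNCE-style time contrastive loss, where embeddings of the same timestep observed from two different viewpoints form positive pairs and other timesteps in the batch serve as negatives. The function name, temperature value, and exact loss form are illustrative assumptions, not taken from the authors' released code.

```python
# Minimal sketch of a time contrastive loss (assumptions: PyTorch; names and
# the exact formulation are illustrative, not the authors' implementation).
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_view_a, z_view_b, temperature=0.1):
    """z_view_a, z_view_b: (B, D) embeddings of the same B timesteps,
    encoded from two different camera viewpoints."""
    z_a = F.normalize(z_view_a, dim=-1)
    z_b = F.normalize(z_view_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    # Positive pairs lie on the diagonal: same timestep, different viewpoint;
    # off-diagonal entries (other timesteps) act as negatives.
    return F.cross_entropy(logits, labels)

# Usage: encode two synchronized views with the masked ViT encoder during
# pre-training, then pull their latents together:
# loss = time_contrastive_loss(encoder(img_view1), encoder(img_view2))
```

Note that multi-view images are needed only for this pre-training objective; at deployment, the encoder consumes a single RGB view.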

Video

Novel view & dynamic scene rendering with single-view input

SinCro successfully renders the dynamic scene, even from unseen viewpoints. In contrast, all baselines show jittering or fail to capture fine-grained details of the environment. This demonstrates that SinCro captures the essential 3D information of the environment, providing a 3D-aware representation.

Real-world downstream task performance

RL evaluation

SinCro successfully performs downstream tasks with only a single-view image. In contrast, the baselines usually fail to perform the tasks despite using multiple viewpoints.


RL evaluation under perturbed viewpoints

SinCro remains robust to viewpoint perturbations, consistently succeeding in downstream tasks, whereas the baselines mostly fail. Since the perturbed viewpoints are never observed during pre-training or RL training, this robustness demonstrates the advantage of the proposed 3D-aware representation.

BibTeX

@ARTICLE{11180891,
  author={Cho, Daesol and Yoo, Seungyeon and Shim, Dongseok and Kim, H. Jin},
  journal={IEEE Robotics and Automation Letters}, 
  title={Single-View 3D-Aware Representations for Reinforcement Learning by Cross-View Neural Radiance Fields}, 
  year={2025},
  volume={10},
  number={11},
  pages={12039-12046},
  keywords={Three-dimensional displays;Neural radiance field;Cameras;Representation learning;Image reconstruction;Visualization;Robots;Robot vision systems;Training;Solid modeling;Reinforcement learning;representation learning;visual learning},
  doi={10.1109/LRA.2025.3615035}}