Panwang Panβ, Chenguo Linβ, Jingjing Zhao, Chenxin Li, Yuchen Lin, Haopeng Li, Honglei Yan, Kairun Wen, Yunlong Lin, Yixuan Yuan, Yadong Mu
This repository contains the official implementation of the paper: Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models. Diff4Splat is a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets.
Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement.
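As a mental model, the deformable 3D Gaussian field described above can be thought of as a container of static Gaussian parameters (appearance and geometry) plus a per-frame motion term. The sketch below is purely illustrative; the field names, shapes, and `at_time` helper are our assumptions, not the repository's actual data structure.

```python
from dataclasses import dataclass
import numpy as np

# Conceptual sketch of a deformable 3D Gaussian field (illustrative only;
# not the repo's actual implementation).
@dataclass
class DeformableGaussianField:
    means: np.ndarray      # (N, 3) Gaussian centers (geometry)
    scales: np.ndarray     # (N, 3) per-axis extents
    rotations: np.ndarray  # (N, 4) unit quaternions
    opacities: np.ndarray  # (N,)
    colors: np.ndarray     # (N, 3) appearance (e.g. RGB or an SH DC term)
    deltas: np.ndarray     # (T, N, 3) per-frame center offsets (motion)

    def at_time(self, t: int) -> np.ndarray:
        """Gaussian centers deformed to frame t."""
        return self.means + self.deltas[t]

N, T = 4, 2
field = DeformableGaussianField(
    means=np.zeros((N, 3)), scales=np.ones((N, 3)),
    rotations=np.tile([1.0, 0.0, 0.0, 0.0], (N, 1)), opacities=np.ones(N),
    colors=np.full((N, 3), 0.5), deltas=np.ones((T, N, 3)) * 0.1,
)
print(field.at_time(1).shape)  # prints (4, 3)
```

Because all frames share one set of static parameters and only the offsets vary, a renderer can splat any timestep directly, which is what makes a single feed-forward prediction sufficient.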
Here is our Project Page.
Feel free to contact us or open an issue if you have any questions or suggestions.
You may also be interested in our other works:
- [CVPR 2026] MoVieS: a feed-forward model for 4D dynamic reconstruction from monocular videos.
- 2026-02-21: The paper is accepted to CVPR 2026.
- 2025-11-01: Diff4Splat is released on arXiv.
- 2025-10-15: Initial codebase structure established.
- 2025-10-01: Project development started.
- Release inference scripts.
- Release training code and data preprocessing scripts.
- Release pretrained checkpoints.
- Provide a Hugging Face 🤗 demo.
- Release preprocessed dataset.
- Python >= 3.10
- PyTorch >= 2.0 (with CUDA support)
- CUDA >= 11.8
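Before installing, you can sanity-check the Python side with a stdlib-only snippet (this helper is ours, not part of the repo, and it does not require PyTorch to already be installed):

```python
import importlib.util
import sys

# Pre-flight check for the requirements above (stdlib only).
py_ok = sys.version_info >= (3, 10)
torch_installed = importlib.util.find_spec("torch") is not None
print("Python >= 3.10:", py_ok)
print("PyTorch installed:", torch_installed)
```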
```bash
# Clone the repository
git clone https://github.com/paulpanwang/Diff4Splat.git
cd Diff4Splat

# Install required packages
pip install -r settings/requirements.txt
```

The `settings/requirements.txt` includes:
```
plyfile
ipython
numpy==1.26.4
matplotlib
Pillow
opencv-python
imageio
imageio-ffmpeg
pytorch-msssim
lpips
einops
safetensors
accelerate
transformers
diffusers
omegaconf
h5py
decord
deepspeed
flow_vis
kiui
```
```bash
# Run environment test script
python tests/test_environment.py
```

Or run a quick check:
```bash
python -c "
import torch
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
# Check key imports
from src.options import opt_dict
from src.models import Wan, LRDM
# SplatRecon is available as a backward-compatible alias for LRDM
from src.models import SplatRecon
print('All imports successful!')
"
```

All checkpoints should be placed in the `resources/ckpts/` directory.
Contact the authors for access to the camera control checkpoint, or check our HuggingFace page for updates.
Once downloaded, place the checkpoint in the `resources/ckpts/` directory.
The code will attempt to download the Wan2.2-TI2V-5B base model automatically from ModelScope/HuggingFace. If automatic download fails, you can download it manually from:
- ModelScope: `Wan-AI/Wan2.2-TI2V-5B`
- Or contact the authors for the base model weights.
Default paths (can be modified in `src/options.py`):

```python
wan_dir: str = "./resources/ckpts/Wan2.2-TI2V-5B"
vae_path: str = "./resources/ckpts/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"
```

Download from HuggingFace and place in the `resources/ckpts/` directory:
```bash
# Using huggingface-hub
pip install huggingface-hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='paulpanwang/LRDM', filename='lrdm_ckpt.safetensors', local_dir='./resources/ckpts')
"
```

LRDM checkpoints will be released on HuggingFace. Stay tuned for updates.
Default path in `src/options.py`:

```python
pretrained_path: str = "./resources/ckpts/lrdm_ckpt.safetensors"
```

Configure your dataset root path in `src/options.py` or via the `DATASET_ROOT` environment variable.
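The environment-variable fallback can be pictured as below. This is an illustrative sketch only: the real fields live in `src/options.py`, and the `DataOptions` class and `dataset_root` field name here are our assumptions.

```python
import os
from dataclasses import dataclass

# Hypothetical sketch of an options field that prefers the DATASET_ROOT
# environment variable and falls back to a default path.
@dataclass
class DataOptions:
    dataset_root: str = os.environ.get("DATASET_ROOT", "./data")

opt = DataOptions()
print("dataset root:", opt.dataset_root)
```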
The following datasets are supported:
- RealEstate10K (`re10k`) - Static scenes
- TartanAir (`tartanair`) - Static scenes
- MatrixCity (`matrixcity`) - Static scenes
- DL3DV (`dl3dv`) - Static scenes
- DynamicReplica (`dynamicreplica`) - Dynamic scenes
- PointOdyssey (`pointodyssey`) - Dynamic scenes
- VKITTI2 (`vkitti2`) - Dynamic scenes
- Spring (`spring`) - Dynamic scenes
- Stereo4D (`stereo4d`) - Dynamic scenes
Dataset paths can be configured in `src/options.py`.
```bash
# Single GPU training
python src/train_wan_cc.py \
    --config_file configs/train.yaml \
    --tag wan_camera_control_test \
    --output_dir ./out \
    --max_train_steps 100 \
    --max_val_steps 1
```

Or use the provided script:

```bash
bash scripts/train_camcc.sh
```

Edit `configs/train.yaml`:
```yaml
opt_type: "wan2.2_ti2v_5b"
optimizer:
  name: "adamw"
  lr: 0.0004
  betas: [0.9, 0.95]
  weight_decay: 0.05
lr_scheduler:
  name: "cosine_warmup"
  num_warmup_steps: 1000
train:
  batch_size_per_gpu: 8
  gradient_accumulation_steps: 1
  epochs: 10
  ...
val:
  batch_size_per_gpu: 1
```

This step trains a TinyVAE to align its latents with the Wan VAE's latents (16-dim, 4x temporal, 8x spatial compression).
Training pipeline:

```
Images -> WanVAE (fixed, no grad) -> Latent -> TinyVAE (trainable) -> Images -> LRDM (fixed)
```
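The alignment objective above can be illustrated with a toy example: a frozen "teacher" encoder (standing in for the Wan VAE) produces 16-dim target latents, and a small trainable mapper (standing in for the TinyVAE side) is fit to match them with an MSE loss. Everything here (shapes, the linear mapper, the optimizer) is a simplified assumption for illustration, not the repo's training code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32))         # batch of flattened inputs
W_teacher = rng.normal(size=(32, 16))  # frozen teacher: inputs -> 16-dim latents
z_target = x @ W_teacher               # target latents (no gradient)

W_student = np.zeros((32, 16))         # trainable mapper
lr = 0.05
for _ in range(1000):
    z_pred = x @ W_student
    grad = x.T @ (z_pred - z_target) / len(x)  # d(MSE)/dW
    W_student -= lr * grad

mse = float(np.mean((x @ W_student - z_target) ** 2))
print("final alignment MSE:", mse)
```

The real pipeline replaces the linear maps with convolutional video autoencoders, but the objective, matching a frozen encoder's latents under MSE, is the same shape.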
A training script is provided at `src/train_latent_alignment.py`. First, ensure the LRDM checkpoint is at `resources/ckpts/lrdm_ckpt.safetensors`.
```bash
# Single GPU training
python src/train_latent_alignment.py \
    --config configs/latent_alignment.yaml \
    --output_dir ./out/latent_alignment
```

Key components:

- `src/models/latent_alignment.py` - Latent alignment models (LinearMapper, UNetMapper, TinyVAEDecoderWrapper)
- `src/models/tiny_vae.py` - TinyVAE / TAEHV (Temporal Autoencoder)
- `configs/latent_alignment.yaml` - Training configuration
The LRDM code has been fully merged into the main `src/` directory.
Use the enhanced inference script:
```bash
# Novel View Synthesis from NPZ data
python src/infer_nvs.py \
    --opt_type lrdm \
    --pretrained_path ./resources/ckpts/lrdm_ckpt.safetensors \
    --data_path /path/to/data.npz \
    --output_dir ./out/nvs_results
```

The unified model is at `src/models/lrdm.py` with class `LRDM`. `SplatRecon` is available as a backward-compatible alias.
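The backward-compatible alias is just a second name bound to the same class, so old imports keep working. A minimal illustration (the class body below is a placeholder, not the actual `LRDM` implementation):

```python
# Placeholder standing in for the Latent Dynamic Reconstruction Model.
class LRDM:
    def __init__(self, opt=None):
        self.opt = opt

# Backward-compatible alias: code written against the old name still works.
SplatRecon = LRDM
print(SplatRecon is LRDM)  # prints True
```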
```bash
# Preprocess your data into NPZ format
python src/preprocess_npz.py \
    --input_dir /path/to/images \
    --output_path ./data/preprocessed.npz

# Using LRDM for static scenes
python src/infer_nvs.py \
    --opt_type lrdm_static \
    --pretrained_path ./resources/ckpts/lrdm_ckpt.safetensors \
    --data_path ./data/preprocessed.npz \
    --output_dir ./out/reconstruction

# Using LRDM for dynamic scenes
python src/infer_nvs.py \
    --opt_type lrdm \
    --pretrained_path ./resources/ckpts/lrdm_ckpt.safetensors \
    --data_path ./data/preprocessed.npz \
    --output_dir ./out/dynamic_recon
```

We provide test scripts to verify your setup:
```bash
# Environment check
python tests/test_environment.py

# Wan model loading test
python tests/test_wan_model.py

# Latent alignment pipeline test
python tests/test_latent_alignment.py
```

See `tests/README.md` for more details.
Diff4Splat introduces a novel framework for controllable 4D scene generation:
- Video Latent Transformer: Augments video diffusion models to jointly capture spatio-temporal dependencies
- Deformable 3D Gaussian Field: Encodes appearance, geometry, and motion in a unified representation
- Single Forward Pass: Generates high-quality 4D scenes in approximately 30 seconds
- Controllable Generation: Supports camera trajectory and optional text prompts
- Explicit Representation: Produces deformable 3D Gaussian primitives
- Efficient Inference: No test-time optimization or post-hoc refinement required
- Multi-task Capability: Supports video generation, novel view synthesis, and geometry extraction
Diff4Splat demonstrates state-of-the-art performance across multiple tasks:
- Generates temporally consistent video sequences from single images
- Supports controllable camera trajectories
- Produces high-quality novel views from arbitrary camera positions
- Maintains geometric consistency across viewpoints
- Extracts accurate 3D geometry from generated scenes
- Enables downstream applications like mesh reconstruction
- Repository setup and documentation
- Inference code release
- Training scripts
- Pretrained model weights
- Training code release
- Dataset preprocessing scripts
- Comprehensive evaluation benchmarks
- Real-time inference optimization
- Multi-modal conditioning support
- Interactive demo applications
If you find our work helpful, please consider citing:
```bibtex
@inproceedings{pan2025diff4splat,
  title={Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models},
  author={Pan, Panwang and Lin, Chenguo and Zhao, Jingjing and Li, Chenxin and Lin, Yuchen and Li, Haopeng and Yan, Honglei and Wen, Kairun and Lin, Yunlong and Yuan, Yixuan and Mu, Yadong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  year={2026}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
We would like to thank the authors of MoVieS, PartCrafter, DiffSplat, and other related works for their inspiring research and open-source contributions that helped shape this project.
