DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model

Abstract

With the increasing popularity of autonomous driving based on the powerful and unified bird's-eye-view (BEV) representation, there is an urgent demand for high-quality, large-scale multi-view video data with accurate annotations. However, such data is hard to obtain because collection and annotation are expensive. To alleviate the problem, we propose DrivingDiffusion, a spatially and temporally consistent diffusion framework that generates realistic multi-view videos controlled by a 3D layout. Synthesizing multi-view videos from a 3D layout poses three challenges: 1) how to keep cross-view consistency, 2) how to keep cross-frame consistency, and 3) how to guarantee the quality of the generated instances. DrivingDiffusion addresses them by cascading a multi-view single-frame image generation step, a single-view video generation step shared by multiple cameras, and a post-processing step that handles long video generation. In the multi-view model, the consistency of multi-view images is ensured by information exchange between adjacent cameras. In the temporal model, we mainly query the information that needs attention in subsequent frame generation from the multi-view images of the first frame. We also introduce a local prompt to effectively improve the quality of generated instances. In post-processing, we further enhance the cross-view consistency of subsequent frames and extend the video length with a temporal sliding-window algorithm. Without any extra cost, our model can generate large-scale realistic multi-camera driving videos in complex urban scenes, fueling downstream driving tasks. The code will be made publicly available.

Method

[DrivingDiffusion] Training Pipeline

Diagram of the multi-view video generation framework DrivingDiffusion. For training, we separately train the multi-view model and the temporal model. These two models share similar structures, with the exception of the orange and purple components. During the inference stage, the two models are concatenated in a cascaded manner. First, the multi-view model generates the initial multi-view frame of the video. This frame is then set as the keyframe for the temporal model. Lastly, the temporal model generates video frames for each view, forming the final multi-view video.
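The cascaded inference described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: `cascade_generate`, `multi_view_model`, and `temporal_model` are hypothetical names, the two models are passed in as plain callables, and layouts and frames are treated as opaque objects.

```python
from typing import Any, Callable, Dict, List

Layout = Any  # stand-in for a per-camera 3D layout condition (assumption)
Frame = Any   # stand-in for a generated image (assumption)


def cascade_generate(
    layouts: List[Dict[str, Layout]],
    multi_view_model: Callable[[Dict[str, Layout]], Dict[str, Frame]],
    temporal_model: Callable[[Frame, List[Layout]], List[Frame]],
) -> Dict[str, List[Frame]]:
    """Chain a multi-view model and a temporal model (hypothetical interfaces).

    layouts[t][camera] is the layout for camera `camera` at time step t.
    """
    if not layouts:
        raise ValueError("at least one time step of layouts is required")

    # Step 1: the multi-view model generates the first frame for all cameras.
    first_frames = multi_view_model(layouts[0])

    # Step 2: each first frame becomes the keyframe of its own camera, and the
    # temporal model (shared by all cameras) generates the remaining frames.
    video: Dict[str, List[Frame]] = {}
    for camera, keyframe in first_frames.items():
        later_layouts = [step[camera] for step in layouts[1:]]
        video[camera] = [keyframe] + temporal_model(keyframe, later_layouts)
    return video
```

Because the per-camera loop has no dependency between cameras, the calls to `temporal_model` could be batched or run in parallel, which matches the shared single-view video model described above.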

Consistency Module & Local Prompt

[DrivingDiffusion] Long Video Generation Pipeline

Multi-view long video generation pipeline. We introduce a multi-stage inference strategy to generate long multi-view videos: 1) We first use the multi-view model to generate the first panoramic frame of the video sequence. 2) We then use the generated image from each view as input to the temporal model, which allows the sequence for each viewpoint to be generated in parallel. 3) For subsequent frames, we also run the fine-tuned model in parallel. 4) We extend the video after identifying new keyframes, in the manner of a sliding-window algorithm. Finally, we obtain the entire synthetic multi-view video.
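The sliding-window extension in step 4 can be illustrated for a single camera as below. This is only a sketch under stated assumptions: the caption does not say how new keyframes are identified, so the code simply reuses the last frame of each window as the next keyframe, and `extend_with_sliding_window` and `temporal_model` are hypothetical names.

```python
from typing import Any, Callable, List

Layout = Any  # stand-in for a per-frame layout condition (assumption)
Frame = Any   # stand-in for a generated image (assumption)


def extend_with_sliding_window(
    keyframe: Frame,
    layouts: List[Layout],
    temporal_model: Callable[[Frame, List[Layout]], List[Frame]],
    window: int,
) -> List[Frame]:
    """Generate a long single-camera video in windows of at most `window` frames.

    Assumption: the last frame generated so far serves as the next keyframe.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    frames: List[Frame] = [keyframe]
    start = 0
    while start < len(layouts):
        chunk = layouts[start:start + window]
        # Condition the next window on the most recent frame.
        frames.extend(temporal_model(frames[-1], chunk))
        start += len(chunk)
    return frames
```

Running this function once per camera, starting from the first frame produced by the multi-view model, yields one long sequence per view. The cross-view consistency enhancement mentioned in the abstract is not modeled here.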

[DrivingDiffusion-Future] Future Generation Pipeline

Diagram of the future generation pipeline. We input only the first frame and predict the following frames. We believe that, compared with raw images, which contain redundant information, the BEV layout is an intermediate representation from which the model can more easily learn the main elements of road conditions. We support both unconditional and text-controlled prediction of future scenes. For the text controller, we decouple the behavior of the ego vehicle from that of other vehicles. (Omitting the concept branch and relying solely on the visual branch for future prediction can still yield satisfactory results in the short term.)

Results

1. Visualization of Multi-View Image Generation

2. Visualization of Temporal Generation

3. Visualization of Control Capability

4. Multi-View Video Generation of Driving Scenes Controlled by 3D Layout on nuScenes

5. Multi-View Video Generation of Driving Scenes Controlled by 3D Layout on Private Dataset

6. Ability to Construct the Future

Control future video generation through text description of road conditions.

Future video generation without text description of road conditions. We found this to be an excellent way to obtain a pre-trained diffusion model.

BibTeX

If DrivingDiffusion is useful or relevant to your research, please kindly recognize our contributions by citing our paper:

@article{li2023drivingdiffusion,
  title={DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model},
  author={Xiaofan Li and Yifu Zhang and Xiaoqing Ye},
  journal={arXiv preprint arXiv:2310.07771},
  year={2023}
}