DrivingGen: Efficient Safety-Critical Driving Video Generation with Latent Diffusion Models

Zipeng Guo, Yuchen Zhou, Chao Gou*
Sun Yat-sen University

Abstract

With the increasing popularity of autonomous driving, high-quality safety-critical driving video data are in urgent demand. However, such large-scale data are hard to obtain due to the expense and risk of collection. To alleviate this problem, we propose DrivingGen, an efficient approach built upon a text-to-image (T2I) diffusion model for safety-critical driving video generation.

Our model employs a “Spatio-Temporal-then-Temporal” paradigm, learning motion priors from a local-to-global perspective. First, we design an innovative Segment Flow Module that achieves local spatio-temporal modeling by capturing the distinctive dynamic features of different video segments. Second, a lightweight Directional Consistency Attention further enhances temporal consistency from a global perspective. Additionally, we propose an efficient Temporal Shift Adapter that expands the T2I U-Net into the temporal dimension.
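To make the Temporal Shift Adapter concrete, below is a minimal PyTorch sketch of a TSM-style temporal shift over per-frame features, under the assumption that TSA exchanges information among neighboring frames by shifting a fraction of channels along the frame axis. The function name `temporal_shift`, the tensor layout, and the `shift_ratio` default are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of a TSM-style temporal shift (assumed reading of the
# Temporal Shift Adapter; not the paper's exact design).
import torch


def temporal_shift(x: torch.Tensor, shift_ratio: float = 0.125) -> torch.Tensor:
    """Shift a fraction of channels one frame forward and one frame
    backward in time, so each frame mixes features from its neighbors.

    x: video features of shape (B, T, C, H, W);
    shift_ratio: fraction of channels shifted in each direction (assumed value).
    """
    b, t, c, h, w = x.shape
    n = int(c * shift_ratio)
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]        # first n channels: shifted forward in time
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]  # next n channels: shifted backward in time
    out[:, :, 2 * n:] = x[:, :, 2 * n:]   # remaining channels: unchanged
    return out
```

Because such a shift is a pure memory operation with no learnable weights, applying it inside the frozen T2I residual blocks would add temporal mixing at negligible parameter and compute cost, which is one reason shift-based adapters are considered efficient.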

Equipped with these modules, DrivingGen outperforms state-of-the-art methods in driving video generation for safety-critical scenarios, in terms of both quality and efficiency.

Method Overview

Overall illustration of our DrivingGen framework. (a) During the training phase, the input videos are corrupted via the diffusion process, and a U-Net denoiser is trained to reconstruct them. During the inference phase, Gaussian noise is randomly sampled, and the denoising process is repeated T times. (b) The Segment Flow Module (SFM) achieves joint spatio-temporal modeling from a local perspective. (c) The Temporal Shift Adapter (TSA) is built upon the original T2I residual blocks, facilitating information exchange among neighboring frames. (d) The Directional Consistency Attention (DCA) further enhances global temporal consistency.
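For readers unfamiliar with panel (a), the following is a minimal DDPM-style sketch of the two phases: corrupting clean video latents during training and repeating the denoising step T times at inference. The `unet` and `text_emb` arguments, the linear noise schedule, and the latent layout (B, F, C, H, W) are placeholder assumptions; DrivingGen's actual schedule, denoiser, and conditioning are not specified on this page.

```python
# A minimal DDPM-style sketch of the training/inference loops in (a),
# under assumed placeholders (`unet`, `text_emb`, linear schedule).
import torch
import torch.nn.functional as F

T = 1000                                   # number of denoising steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def training_step(unet, x0, text_emb):
    """Corrupt clean video latents x0 via the diffusion process and
    train the U-Net denoiser to predict the added noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    a = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1, 1)  # broadcast over (B, F, C, H, W)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise              # forward (corruption) process
    return F.mse_loss(unet(x_t, t, text_emb), noise)


@torch.no_grad()
def sample(unet, shape, text_emb):
    """Start from randomly sampled Gaussian noise and repeat the
    denoising process T times."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = unet(x, torch.full((shape[0],), t), text_emb)
        a_t, b_t = alphas_cumprod[t], betas[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # estimate of clean latents
        mean = (a_prev.sqrt() * b_t / (1 - a_t)) * x0_pred \
             + ((1 - b_t).sqrt() * (1 - a_prev) / (1 - a_t)) * x
        x = mean + b_t.sqrt() * torch.randn_like(x) if t > 0 else mean
    return x
```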

Text-to-Video Generation Results

Collision with another oncoming vehicle

Collision with another vehicle which turns into or crosses a road

Collision with a pedestrian who crosses a road

Collision with another vehicle moving ahead or waiting

Collision with another vehicle moving laterally in the same direction

Out-of-control and leaving the roadway to the left

Out-of-control and leaving the roadway to the right

Method Comparison

MCVD [1]
PVDM [2]
VideoComposer [3]
Ours