Generative Inbetweening through Frame-wise Conditions-Driven Video Generation

Tianyi Zhu¹, Dongwei Ren¹, Qilong Wang², Xiaohe Wu¹, Wangmeng Zuo¹
¹Harbin Institute of Technology, ²Tianjin University
[Teaser video panels: Start Frame, End Frame, FILM, TRF, GI, Ours]

Abstract

Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. Although remarkable progress has been made in video generation models, generative inbetweening still faces challenges in maintaining temporal stability due to the ambiguous interpolation path between two key frames. This issue becomes particularly severe when there is a large motion gap between input frames. In this paper, we propose a straightforward yet highly effective Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Specifically, our FCVG provides an explicit condition for each frame, making it much easier to identify the interpolation path between two input frames and thus ensuring temporally stable production of visually plausible video frames. To achieve this, we suggest extracting matched lines from two input frames that can then be easily interpolated frame by frame, serving as frame-wise conditions seamlessly integrated into existing video generation models. In extensive evaluations covering diverse scenarios such as natural landscapes, complex human poses, camera movements and animations, existing methods often exhibit incoherent transitions across frames. In contrast, our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear interpolation curves.
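
The idea of interpolating matched lines frame by frame can be illustrated with a minimal sketch. It assumes the matched lines are already available as paired endpoint lists; how the lines are detected and matched is outside its scope. All names here (`interpolate_lines`, `linear_curve`, `ease_in_out_curve`) are hypothetical, and the smoothstep curve is only one possible example of a non-linear interpolation curve, not necessarily the one used by FCVG.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Line = Tuple[Point, Point]  # a line segment given by its two endpoints


def linear_curve(t: float) -> float:
    """Constant-speed interpolation curve."""
    return t


def ease_in_out_curve(t: float) -> float:
    """Smoothstep: an illustrative non-linear curve (slow start and end)."""
    return t * t * (3.0 - 2.0 * t)


def lerp_point(p: Point, q: Point, w: float) -> Point:
    """Blend two points; w=0 gives p, w=1 gives q."""
    return (p[0] + w * (q[0] - p[0]), p[1] + w * (q[1] - p[1]))


def interpolate_lines(
    start_lines: List[Line],
    end_lines: List[Line],
    num_frames: int,
    curve: Callable[[float], float] = linear_curve,
) -> List[List[Line]]:
    """Return one list of line segments per frame, key frames included.

    Assumption: start_lines[i] and end_lines[i] form a matched pair.
    """
    if len(start_lines) != len(end_lines):
        raise ValueError("matched line lists must have the same length")
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")

    frames: List[List[Line]] = []
    for k in range(num_frames):
        # Map the frame index to [0, 1], then reshape it with the curve.
        w = curve(k / (num_frames - 1))
        frames.append(
            [
                (lerp_point(a0, a1, w), lerp_point(b0, b1, w))
                for (a0, b0), (a1, b1) in zip(start_lines, end_lines)
            ]
        )
    return frames
```

Each per-frame list of segments could then be rasterized into a condition image for the corresponding frame. Passing `ease_in_out_curve` instead of the default changes only the timing of the motion, not the path the endpoints follow.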

Comparison with State-of-the-Art Methods

Case0 to Case10

[Video comparisons: each of the eleven cases shows six panels: Start Frame, End Frame, FILM, TRF, GI, Ours]

Ablation Study

The "w/o Pose" and "w/o Matching" variants remove the human pose condition and the line matching condition, respectively. The line matching condition governs the overall motion of the scene, while the pose condition improves the details of human movement.

Condition Components

[Video panels: w/o Control, w/o Matching, w/o Pose, Ours]

Control Weight

[Video panels: Start Frame, End Frame, weight=0.5, weight=1]

Generalization to Animation

[Animation examples: four rows, each showing two pairs of Input Frames and FCVG Inbetweening Results]

Limitation

[Two video examples, each with panels: Start Frame, End Frame, weight=1, weight=0.5]