ITE Technical Group Submission System
Conference Paper Information
Online Proceedings

Paper Abstract and Keywords
Presentation 2024-12-22 13:30
Continuity Video-to-Video style transfer using Stable Diffusion Models
YU HAO, Liu JiaQing, Li YinHao, Chen YanWei (ISSE)
Abstract (in English) Video Style Transformation with Stable Diffusion Model
Yu Hao† Liu Jiaqing† Li Yinhao† Chen Yanwei†
† College of Information Science and Engineering, Ritsumeikan University,
2-150 Iwakura-cho, Ibaraki, Osaka 567-8570, Japan
E-mail: †is0650is@ed.ritsumei.ac.jp
Keywords: Stable Diffusion, Text-to-Image (T2I), Video-to-Video (V2V), Style Transformation

1. Introduction
In the field of video style transformation, many methods focus on enabling models to understand temporal information, adopting approaches to transform the style of entire videos. These methods, however, face limitations in video length, demand significant computational resources, and often fail to ensure sufficient continuity across frames. Hence, efficient methods capable of processing long videos while maintaining high continuity are needed.

This study proposes a video generation method that uses frame-to-frame relational information, such as edge maps, depth maps, and pose maps, as guidance in the diffusion model's generation process. The method builds on a low-computation text-to-image (T2I) model, achieving video generation that maintains both image quality and motion consistency.



2. Related Work

2.1 Stable Diffusion [1]
A latent diffusion technique that iteratively removes noise from a noisy image, guided by text or image prompts, to generate the target image.
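The iterative denoising idea can be illustrated with a toy numeric sketch. This is not the actual latent-diffusion U-Net: `predict_noise` is a hypothetical stand-in that treats a prompt-dependent scalar as the "clean" target, purely to show how repeated noise subtraction converges.

```python
import random

def predict_noise(x, t, prompt):
    # Hypothetical stand-in for the trained U-Net noise predictor:
    # it assumes the noise is proportional to the distance from a
    # prompt-dependent target value.
    target = float(len(prompt))  # toy "clean" value derived from the prompt
    return [(xi - target) * 0.5 for xi in x]

def denoise(x, prompt, steps=10):
    # DDPM-style sampling loop: repeatedly subtract the predicted noise,
    # moving the sample toward the prompt-conditioned target.
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, prompt)
        x = [xi - ei for xi, ei in zip(x, eps)]
    return x

noisy = [random.gauss(5.0, 1.0) for _ in range(4)]
clean = denoise(noisy, "cat")
```

Each step halves the distance to the target here, so after 10 steps the samples are essentially converged; the real model instead predicts Gaussian noise in latent space with a learned network.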

2.2 ControlNet
A method that injects additional conditions (e.g., edge or pose maps) into the diffusion model's denoising process by directly adding them with learned weights, enhancing control over the denoising result.
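The core operation, adding a weighted condition branch onto the backbone features, reduces to a weighted residual. A minimal sketch (function name and flat feature lists are illustrative, not ControlNet's real API):

```python
def apply_control(base_features, control_features, weight):
    # ControlNet-style guidance: the condition branch's features are
    # scaled by a weight and added onto the backbone's features.
    return [b + weight * c for b, c in zip(base_features, control_features)]

# A stronger weight pushes generation harder toward the condition.
guided = apply_control([0.2, 0.5], [1.0, 0.0], weight=0.8)
```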

2.3 AnimateDiff [3]
This model incorporates a temporal Transformer module, enabling prediction of future frames from past ones. It generates multiple noisy images from text prompts and produces short, continuous videos.
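The temporal module attends across the same spatial location over all frames of a clip. A pure-Python sketch of that idea, reduced to scalar features per frame (the real module operates on multi-channel latents with learned projections):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(frame_feats):
    # Self-attention over the time axis: each frame's feature becomes a
    # softmax-weighted combination of all frames' features, which
    # smooths values across time and couples neighboring frames.
    out = []
    for q in frame_feats:
        weights = softmax([q * k for k in frame_feats])
        out.append(sum(w * v for w, v in zip(weights, frame_feats)))
    return out
```

Because every output is a convex combination of the inputs, the attended features stay within the range of the original frames, which is the mechanism that encourages temporal continuity.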



3. Proposed Method
This study introduces a method that uses a sliding window to select 16 frames (the current frame and the next 15 frames) as a group. For each frame in the group, edge, depth, and pose features are extracted and converted into embeddings that ControlNet can fuse. Temporal relational information between frames is extracted by AnimateDiff's temporal Transformer module and injected into the denoising U-Net, guiding image generation during noise removal. This approach ensures higher continuity across frames while preserving inter-frame relations (see Fig. 1).

Fig. 1. Overview of the Proposed Method
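The frame-grouping step can be sketched as follows. The text does not state whether consecutive windows overlap, so a stride equal to the window size is assumed here; the last, shorter group keeps whatever frames remain.

```python
def sliding_windows(frames, size=16):
    # Group each frame with the following size-1 frames.
    # Assumption: non-overlapping windows (stride == size); the paper
    # does not specify the stride.
    return [frames[i:i + size] for i in range(0, len(frames), size)]

# The long-video experiment has 790 frames, giving 50 groups.
groups = sliding_windows(list(range(790)))
```

Processing the video group by group is what removes the hard limit on video length: memory cost depends on the window size, not on the total frame count.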

3.1 Guiding Generation Using Video Frame Features
The proposed method, built on an image generation model, processes videos by splitting them into frames. To maintain video continuity, it is crucial to accurately capture features such as outlines, motion, and body-part information. Experimental results confirm that edge information extracted with Canny, pose information extracted with OpenPose, and depth information extracted with LeReS effectively reinforce these features and contribute to video continuity. Adding these features as weights in the ControlNet model retains important information during noise removal, and comparison experiments demonstrate their significance in video generation (see Fig. 3). Additionally, masking unnecessary regions to zero effectively reduces computational load.
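The masking step amounts to zeroing feature positions outside the regions of interest. A minimal sketch on a nested-list "feature map" (the real maps are tensors produced by Canny/OpenPose/LeReS):

```python
def apply_mask(feature_map, mask):
    # Zero out positions flagged as unnecessary (mask value 0) so that
    # later stages contribute nothing there, reducing computation.
    return [[v if keep else 0.0 for v, keep in zip(row, mrow)]
            for row, mrow in zip(feature_map, mask)]

masked = apply_mask([[1.0, 2.0],
                     [3.0, 4.0]],
                    [[1, 0],
                     [0, 1]])
```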

3.2 Classification of Prompts
Prompts contain continuous information such as the main character's appearance and movements. Since movement descriptions vary by frame, the MotionCLIP model [4] generates text descriptions corresponding to the motion features in each frame, labeling them. User-specified appearance prompts can then be combined with frame-specific motion prompts, guiding noise removal more accurately. As a result, images corresponding to the original video frames are generated.
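The prompt combination itself is simple string assembly: one shared appearance prompt plus one motion label per frame. In this sketch the motion labels are hypothetical placeholders for what a MotionCLIP-style captioner would produce.

```python
def build_frame_prompts(appearance, motion_labels):
    # Combine the user-specified appearance prompt with a per-frame
    # motion label (stubbed here; the paper derives these with MotionCLIP).
    return [f"{appearance}, {motion}" for motion in motion_labels]

prompts = build_frame_prompts(
    "2D anime-style girl, red eyes, white sailor uniform",
    ["raising both arms", "turning left"],  # hypothetical motion labels
)
```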

4. Experiments

4.1 Experimental Setup
Models Used: Stable Diffusion 1.5, ControlNet 1.1
Input Video Resolution: 1024×1024
Output Video Resolution: 1024×1024
Long Video Experiment: 26.3 seconds (790 frames)
Short Video Experiment: 8.5 seconds (259 frames)
Input Prompt: “2D anime-style girl, dancing, red eyes, white sailor uniform, red skirt”

4.2 Results
Video quality was evaluated using three CLIP Scores to measure continuity, fidelity to the source video, and relevance to the prompt (see Fig. 2).

Fig. 2. Long Video and Short Video Results
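The exact CLIP Score definitions used are not given in the abstract; one common choice for the continuity score is the mean cosine similarity between CLIP embeddings of consecutive frames, sketched here with plain lists standing in for real CLIP vectors.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def continuity_score(frame_embeddings):
    # Mean similarity of consecutive frames; higher means smoother
    # frame-to-frame transitions.
    sims = [cosine(u, v)
            for u, v in zip(frame_embeddings, frame_embeddings[1:])]
    return sum(sims) / len(sims)
```

The fidelity and prompt-relevance scores follow the same pattern, comparing each generated frame's embedding against the source frame's embedding or the text prompt's embedding, respectively.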

Feature-map significance was examined through comparison experiments and visualization (see Fig. 3).

Fig. 3. Comparison Experiments and Visualizations

Generated frames displayed greater detail than the originals, and the resulting videos appeared visibly sharper than the source (see Fig. 4).

Fig. 4. Comparison of Source Video (Left) and Generated Video (Right)

5. Conclusion
The experimental results demonstrate that adding edge, motion, and region recognition weights during noise removal ensures video continuity. These features were confirmed as crucial factors in maintaining video coherence. Furthermore, appropriate prompt classification enabled more accurate noise removal, yielding good results for videos of varying lengths. This validated the effectiveness of the style transformation approach for video frames.

Future work will explore the influence of additional features on video continuity and adjust their weights based on prompt content to enhance control over the model. This advancement aims to enable the general public to create AI-assisted artwork.

6. References

1. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. High-resolution image synthesis with latent diffusion models. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 10684–10695, 2022.
2. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Lopes, R. G., Ayan, B. K., & Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. *Advances in Neural Information Processing Systems*, 35:36479–36494, 2022.
3. Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., & Dai, B. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. *arXiv preprint*, Jul 2023.
4. Tevet, G., Gordon, B., Hertz, A., Bermano, A. H., & Cohen-Or, D. MotionCLIP: Exposing Human Motion Generation to CLIP Space. *arXiv preprint*, 2022.
Keyword (in English) Stable Diffusion / Video to Video / Text to Video / Img to Img / Video Style Transformation
Reference Info. ITE Tech. Rep.

Conference Information
Committee KANSAI  
Conference Date 2024-12-22 - 2024-12-22 
Place (in English) Osaka Metropolitan University, I-site Namba 
Topics (in English) The Institute of Image Information and Television Engineers, Kansai chapter, Workshop for young researchers 
Paper Information
Registration To KANSAI 
Conference Code 2024-12-KANSAI 
Language Japanese 
Title (in English) Continuity Video-to-Video style transfer using Stable Diffusion Models 
Keyword(1) Stable Diffusion  
Keyword(2) Video to Video  
Keyword(3) Text to Video  
Keyword(4) Img to Img  
Keyword(5) Video Style Transformation  
1st Author's Name YU HAO  
1st Author's Affiliation Information Systems Science and Engineering Course (ISSE)
2nd Author's Name Liu JiaQing  
2nd Author's Affiliation Information Systems Science and Engineering Course (ISSE)
3rd Author's Name Li YinHao  
3rd Author's Affiliation Information Systems Science and Engineering Course (ISSE)
4th Author's Name Chen YanWei  
4th Author's Affiliation Information Systems Science and Engineering Course (ISSE)
Speaker Author-1 
Date Time 2024-12-22 13:30:00 
Presentation Time 15 minutes 
Registration for KANSAI 




The Institute of Image Information and Television Engineers (ITE), Japan