Video Style Transformation with Stable Diffusion Model
Yu Hao†, Liu Jiaqiang†, Li Yinhao†, Chen Yanwei†
† College of Information Science and Engineering, Ritsumeikan University,
2-150 Iwakura-cho, Ibaraki, Osaka 567-8570, Japan
E-mail: †is0650is@ed.ritsumei.ac.jp
Keywords: Stable Diffusion, Text-to-Image (T2I), Video-to-Video (V2V), Style Transformation
1. Introduction
In the field of video style transformation, many methods focus on enabling models to understand temporal information, adopting approaches to transform the style of entire videos. These methods, however, face limitations in video length, demand significant computational resources, and often fail to ensure sufficient continuity across frames. Hence, efficient methods capable of processing long videos while maintaining high continuity are needed.
This study proposes a video generation method that uses frame-to-frame relational information such as edge maps, depth maps, and pose maps as guidance in the diffusion model's generation process. The method leverages a low-computation text-to-image (T2I) model, achieving video generation that maintains both quality and motion consistency.
2. Related Work
2.1 Stable Diffusion [1]
A technique that generates a target image by progressively removing noise from a noisy image, guided by text or image prompts.
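As a rough illustration of this text-to-image generation step (not the configuration used in this study; the model ID and parameters below are assumptions), the public diffusers API can be sketched as follows:

```python
# Minimal text-to-image sketch with the diffusers library (illustrative only;
# model ID, dtype, and step count are assumptions, not this paper's settings).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The text prompt conditions the iterative removal of noise from a random latent.
image = pipe("2D anime-style girl, white sailor uniform",
             num_inference_steps=30).images[0]
image.save("t2i_sample.png")
```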
2.2 ControlNet
A method that adds extra conditions (e.g., edge, depth, or pose maps) as weighted terms during the diffusion model's noise removal, strengthening their influence and providing finer control over generation.
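Conceptually, the control branch produces feature residuals that are added, with a weight, to the U-Net's intermediate features during noise removal. A minimal sketch of this idea (a simplification, not the library's actual internals; names are illustrative):

```python
# Simplified sketch of ControlNet-style conditioning: the residual computed
# from the extra condition (edge / depth / pose map) is added to the U-Net
# features with a conditioning weight, steering the noise prediction.
import torch

def apply_control(unet_features: torch.Tensor,
                  control_residual: torch.Tensor,
                  conditioning_scale: float = 1.0) -> torch.Tensor:
    # A larger scale strengthens the influence of the additional condition.
    return unet_features + conditioning_scale * control_residual
```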
2.3 AnimateDiff [3]
This model incorporates a temporal Transformer structure, enabling prediction of future frames from past ones. From a text prompt, it denoises multiple noisy images jointly and produces short, temporally continuous videos.
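A simplified sketch of a temporal attention block of this kind (illustrative only; AnimateDiff's actual motion module differs in detail):

```python
# Temporal self-attention over the frame axis: each frame in a window attends
# to the others, which is what gives the generated clip its continuity.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels); attention runs along the frame axis.
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection keeps per-frame content intact
```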
3. Proposed Method
This study introduces a method that uses a sliding window to select 16 frames (the current frame and the following 15) as a group. For each frame in the group, edge, depth, and pose features are extracted and converted into embeddings that can be fused via ControlNet. Temporal relational information between frames is extracted by the temporal Transformer structure of AnimateDiff [3] and incorporated into the denoising U-Net, guiding image generation during noise removal. This approach ensures high continuity across frames while preserving frame-to-frame relations (see Fig. 1).
Fig. 1. Overview of the Proposed Method
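For concreteness, the sliding-window grouping can be sketched as follows (a minimal illustration; the stride and function names are assumptions, not taken from the paper's implementation):

```python
# Group video frames into overlapping windows of 16 (current frame + next 15).
from typing import Iterator, List, Sequence

def sliding_windows(frames: Sequence, window: int = 16,
                    stride: int = 1) -> Iterator[List]:
    last_start = max(len(frames) - window, 0)
    for start in range(0, last_start + 1, stride):
        yield list(frames[start:start + window])

# Each yielded group is then passed through feature extraction (edge, depth,
# pose) and denoised with the temporal module providing cross-frame guidance.
```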
3.1 Guiding Generation Using Video Frame Features
The proposed method, built on an image generation model, processes videos by splitting them into frames. To maintain video continuity, it is crucial to accurately capture features such as outlines, motion, and part information. Experimental results confirm that edge information extracted with Canny, pose information extracted with OpenPose, and depth information extracted with LeReS effectively reinforce these features and contribute to video continuity. By injecting these features as weighted conditions through ControlNet, important information is retained during noise removal. Comparative experiments further demonstrate the significance of these features in video generation (see Fig. 3). Additionally, masking unnecessary regions to zero effectively reduces the computational load.
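A hedged sketch of the per-frame feature extraction, using OpenCV for Canny edges and the controlnet_aux detectors as stand-ins for the OpenPose and LeReS extractors (the package choice, thresholds, and model repository are assumptions, not necessarily the paper's setup):

```python
import cv2
import numpy as np
from PIL import Image
from controlnet_aux import OpenposeDetector, LeresDetector

openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
leres = LeresDetector.from_pretrained("lllyasviel/Annotators")

def extract_conditions(frame_bgr: np.ndarray):
    # Edge map via Canny (thresholds are illustrative).
    edges = cv2.Canny(frame_bgr, 100, 200)
    frame_pil = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    pose = openpose(frame_pil)    # pose / motion map
    depth = leres(frame_pil)      # depth map
    return edges, pose, depth
```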
3.2 Classification of Prompts
Prompts include continuous information such as the main character's appearance and movements. Since movement descriptions vary from frame to frame, the MotionCLIP model [4] is used to generate text descriptions corresponding to the motion features of each frame and to label the frames with them. This allows a user-specified appearance prompt to be combined with frame-specific motion prompts, guiding noise removal more accurately. As a result, images corresponding to the original video frames are generated.
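As a small illustration of this prompt composition (names are hypothetical; `motion_labels` stands in for the MotionCLIP-derived descriptions):

```python
# Combine the user-specified appearance prompt with a per-frame motion label,
# yielding one prompt per frame to guide its denoising.
appearance_prompt = ("2D anime-style girl, red eyes, "
                     "white sailor uniform, red skirt")

def build_frame_prompts(motion_labels: list[str]) -> list[str]:
    return [f"{appearance_prompt}, {label}" for label in motion_labels]

# e.g. build_frame_prompts(["raising both arms", "turning to the left"])
```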
4. Experiments
4.1 Experimental Setup
Models Used: Stable Diffusion 1.5, ControlNet 1.1
Input Video Resolution: 1024×1024
Output Video Resolution: 1024×1024
Long Video Experiment: 26.3 seconds (790 frames)
Short Video Experiment: 8.5 seconds (259 frames)
Input Prompt: “2D anime-style girl, dancing, red eyes, white sailor uniform, red skirt”
4.2 Results
Video quality was evaluated with three CLIP-based scores measuring frame-to-frame continuity, fidelity to the source video, and relevance to the prompt (see Fig. 2).
Fig. 2. Long Video and Short Video Results
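As a hedged sketch of how such CLIP-based measures could be computed (the paper's exact formulations may differ; the model ID and function names below are assumptions):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embeds(frames):
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def text_embed(prompt):
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def clip_scores(generated_frames, source_frames, prompt):
    gen = image_embeds(generated_frames)
    src = image_embeds(source_frames)
    txt = text_embed(prompt)
    continuity = (gen[:-1] * gen[1:]).sum(-1).mean()  # adjacent generated frames
    fidelity = (gen * src).sum(-1).mean()             # generated vs. source frames
    relevance = (gen @ txt.T).mean()                  # generated frames vs. prompt
    return continuity.item(), fidelity.item(), relevance.item()
```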
The significance of the feature maps was examined through comparative experiments and visualization (see Fig. 3).
Fig. 3. Comparative Experiments and Visualizations
The generated frames displayed greater detail than the originals, and the resulting videos were of higher resolution than the source (see Fig. 4).
Fig. 4. Comparison of Source Video (Left) and Generated Video (Right)
5. Conclusion
The experimental results demonstrate that adding edge, motion, and region recognition weights during noise removal ensures video continuity. These features were confirmed as crucial factors in maintaining video coherence. Furthermore, appropriate prompt classification enabled more accurate noise removal, yielding good results for videos of varying lengths. This validated the effectiveness of the style transformation approach for video frames.
Future work will explore the influence of additional features on video continuity and adjust their weights based on prompt content to enhance control over the model. This advancement aims to enable the general public to create AI-assisted artwork.
6. References
1. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. High-resolution image synthesis with latent diffusion models. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 10684–10695, 2022.
2. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Lopes, R. G., Ayan, B. K., & Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. *Advances in Neural Information Processing Systems*, 35:36479–36494, 2022.
3. Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., & Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. *arXiv preprint*, Jul 2023.
4. Tevet, G., Gordon, B., Hertz, A., Bermano, A. H., & Cohen-Or, D. MotionCLIP: Exposing Human Motion Generation to CLIP Space. *arXiv preprint*, 2022. |