Video Style Transformation with Stable Diffusion Model
Yu Hao†, Liu Jiaqiang†, Li Yinhao†, Chen Yanwei†
† College of Information Science and Engineering, Ritsumeikan University,
2-150 Iwakura-cho, Ibaraki, Osaka 567-8570, Japan
E-mail: †is0650is@ed.ritsumei.ac.jp
Keywords: Stable Diffusion, Text-to-Image (T2I), Video-to-Video (V2V), Style Transformation
1. Introduction
In the field of video style transformation, many methods focus on enabling models to understand temporal information, adopting approaches to transform the style of entire videos. These methods, however, face limitations in video length, demand significant computational resources, and often fail to ensure sufficient continuity across frames. Hence, efficient methods capable of processing long videos while maintaining high continuity are needed.
This study proposes a video generation method that uses frame-to-frame relational information such as edge maps, depth maps, and pose maps as guidance in the diffusion model's generation process. The method leverages a low-computation text-to-image (T2I) model, achieving video generation that maintains both quality and motion consistency.
2. Related Work
2.1 Stable Diffusion [1]
A technique that generates a target image by progressively removing noise from a noisy image, guided by text or image prompts.
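As a rough illustration of this text-to-image generation step (not the configuration used in this study; the model ID and parameters below are assumptions), the public diffusers API can be sketched as follows:

```python
# Minimal text-to-image sketch with the diffusers library (illustrative only;
# model ID, dtype, and step count are assumptions, not this paper's settings).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The text prompt conditions the iterative removal of noise from a random latent.
image = pipe("2D anime-style girl, white sailor uniform",
             num_inference_steps=30).images[0]
image.save("t2i_sample.png")
```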
2.2 ControlNet
A method that adds extra conditions (e.g., edge, depth, or pose maps) as weighted terms during the diffusion model's noise removal, strengthening their influence and providing finer control over generation.
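Conceptually, the control branch produces feature residuals that are added, with a weight, to the U-Net's intermediate features during noise removal. A minimal sketch of this idea (a simplification, not the library's actual internals; names are illustrative):

```python
# Simplified sketch of ControlNet-style conditioning: the residual computed
# from the extra condition (edge / depth / pose map) is added to the U-Net
# features with a conditioning weight, steering the noise prediction.
import torch

def apply_control(unet_features: torch.Tensor,
                  control_residual: torch.Tensor,
                  conditioning_scale: float = 1.0) -> torch.Tensor:
    # A larger scale strengthens the influence of the additional condition.
    return unet_features + conditioning_scale * control_residual
```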
2.3 AnimateDiff [3]
This model incorporates a temporal Transformer structure, enabling prediction of future frames from past ones. From a text prompt, it denoises multiple noisy images jointly and produces short, temporally continuous videos.
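A simplified sketch of a temporal attention block of this kind (illustrative only; AnimateDiff's actual motion module differs in detail):

```python
# Temporal self-attention over the frame axis: each frame in a window attends
# to the others, which is what gives the generated clip its continuity.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels); attention runs along the frame axis.
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection keeps per-frame content intact
```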
3. Proposed Method
This study introduces a method that uses a sliding window to select 16 frames (the current frame and the following 15) as a group. For each frame in the group, edge, depth, and pose features are extracted and converted into embeddings that can be fused via ControlNet. Temporal relational information between frames is extracted by the temporal Transformer structure of AnimateDiff [3] and incorporated into the denoising U-Net, guiding image generation during noise removal. This approach ensures high continuity across frames while preserving frame-to-frame relations (see Fig. 1).
Fig. 1. Overview of the Proposed Method
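For concreteness, the sliding-window grouping can be sketched as follows (a minimal illustration; the stride and function names are assumptions, not taken from the paper's implementation):

```python
# Group video frames into overlapping windows of 16 (current frame + next 15).
from typing import Iterator, List, Sequence

def sliding_windows(frames: Sequence, window: int = 16,
                    stride: int = 1) -> Iterator[List]:
    last_start = max(len(frames) - window, 0)
    for start in range(0, last_start + 1, stride):
        yield list(frames[start:start + window])

# Each yielded group is then passed through feature extraction (edge, depth,
# pose) and denoised with the temporal module providing cross-frame guidance.
```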
3.1 Guiding Generation Using Video Frame Features
The proposed method, built on an image generation model, processes videos by splitting them into frames. To maintain video continuity, it is crucial to accurately capture features such as outlines, motion, and part information. Experimental results confirm that edge information extracted with Canny, pose information extracted with OpenPose, and depth information extracted with LeReS effectively reinforce these features and contribute to video continuity. By injecting these features as weighted conditions through ControlNet, important information is retained during noise removal. Comparative experiments further demonstrate the significance of these features in video generation (see Fig. 3). Additionally, masking unnecessary regions to zero effectively reduces the computational load.
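A hedged sketch of the per-frame feature extraction, using OpenCV for Canny edges and the controlnet_aux detectors as stand-ins for the OpenPose and LeReS extractors (the package choice, thresholds, and model repository are assumptions, not necessarily the paper's setup):

```python
import cv2
import numpy as np
from PIL import Image
from controlnet_aux import OpenposeDetector, LeresDetector

openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
leres = LeresDetector.from_pretrained("lllyasviel/Annotators")

def extract_conditions(frame_bgr: np.ndarray):
    # Edge map via Canny (thresholds are illustrative).
    edges = cv2.Canny(frame_bgr, 100, 200)
    frame_pil = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    pose = openpose(frame_pil)    # pose / motion map
    depth = leres(frame_pil)      # depth map
    return edges, pose, depth
```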
3.2 Classification of Prompts
Prompts include continuous information such as the main character's appearance and movements. Since movement descriptions vary from frame to frame, the MotionCLIP model [4] is used to generate text descriptions corresponding to the motion features of each frame and to label the frames with them. This allows a user-specified appearance prompt to be combined with frame-specific motion prompts, guiding noise removal more accurately. As a result, images corresponding to the original video frames are generated.
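As a small illustration of this prompt composition (names are hypothetical; `motion_labels` stands in for the MotionCLIP-derived descriptions):

```python
# Combine the user-specified appearance prompt with a per-frame motion label,
# yielding one prompt per frame to guide its denoising.
appearance_prompt = ("2D anime-style girl, red eyes, "
                     "white sailor uniform, red skirt")

def build_frame_prompts(motion_labels: list[str]) -> list[str]:
    return [f"{appearance_prompt}, {label}" for label in motion_labels]

# e.g. build_frame_prompts(["raising both arms", "turning to the left"])
```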
4. Experiments
4.1 Experimental Setup
Models Used: Stable Diffusion 1.5, ControlNet 1.1
Input Video Resolution: 1024×1024
Output Video Resolution: 1024×1024
Long Video Experiment: 26.3 seconds (790 frames)
Short Video Experiment: 8.5 seconds (259 frames)
Input Prompt: “2D anime-style girl, dancing, red eyes, white sailor uniform, red skirt”
4.2 Results
Video quality was evaluated with three CLIP-based scores measuring frame-to-frame continuity, fidelity to the source video, and relevance to the prompt (see Fig. 2).
Fig. 2. Long Video and Short Video Results
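As a hedged sketch of how such CLIP-based measures could be computed (the paper's exact formulations may differ; the model ID and function names below are assumptions):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embeds(frames):
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def text_embed(prompt):
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def clip_scores(generated_frames, source_frames, prompt):
    gen = image_embeds(generated_frames)
    src = image_embeds(source_frames)
    txt = text_embed(prompt)
    continuity = (gen[:-1] * gen[1:]).sum(-1).mean()  # adjacent generated frames
    fidelity = (gen * src).sum(-1).mean()             # generated vs. source frames
    relevance = (gen @ txt.T).mean()                  # generated frames vs. prompt
    return continuity.item(), fidelity.item(), relevance.item()
```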
The significance of the feature maps was examined through comparative experiments and visualization (see Fig. 3).
Fig. 3. Comparative Experiments and Visualizations
The generated frames displayed greater detail than the originals, and the resulting videos were of higher resolution than the source (see Fig. 4).
Fig. 4. Comparison of Source Video (Left) and Generated Video (Right)
5. Conclusion
The experimental results demonstrate that adding edge, motion, and region recognition weights during noise removal ensures video continuity. These features were confirmed as crucial factors in maintaining video coherence. Furthermore, appropriate prompt classification enabled more accurate noise removal, yielding good results for videos of varying lengths. This validated the effectiveness of the style transformation approach for video frames.
Future work will explore the influence of additional features on video continuity and adjust their weights based on prompt content to enhance control over the model. This advancement aims to enable the general public to create AI-assisted artwork.
6. References
1. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. High-resolution image synthesis with latent diffusion models. *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 10684–10695, 2022.
2. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Lopes, R. G., Ayan, B. K., & Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. *Advances in Neural Information Processing Systems*, 35:36479–36494, 2022.
3. Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., & Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. *arXiv preprint*, Jul 2023.
4. Tevet, G., Gordon, B., Hertz, A., Bermano, A. H., & Cohen-Or, D. MotionCLIP: Exposing Human Motion Generation to CLIP Space. *arXiv preprint*, 2022. |