Continuity-Preserving Video-to-Video Style Transfer Using Stable Diffusion Models

Yu Hao; Liu Jiaqing; Li Yinhao; Chen Yanwei

ITE (Institute of Image Information and Television Engineers) Workshop Presentation Application System
Paper details

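Abstract / Keywords
Presentation	2024-12-22 13:30	Continuity-Preserving Video-to-Video Style Transfer Using Stable Diffusion Models	○Yu Hao, Liu Jiaqing, Li Yinhao, Chen Yanwei (Ritsumeikan Univ.)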
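Abstract	(translated from the Japanese)
Continuity-Preserving Video-to-Video Style Transfer Using Stable Diffusion Models
Yu Hao†, Liu Jiaqing†, Li Yinhao†, Chen Yanwei†
† College of Information Science and Engineering, Ritsumeikan University, 2-150 Iwakura-cho, Ibaraki, Osaka 567-8570, Japan
E-mail: †is0650is@ed.ritsumei.ac.jp
Keywords: Stable Diffusion, Text to Image (T2I), Video to Video (V2V), style transfer

1. Introduction
In video style transfer, many methods give the model the ability to understand temporal information and then transform the style of the whole video at once. This approach limits the length of video that can be processed and requires an enormous amount of computation. It also cannot sufficiently guarantee the continuity of the video. A method that processes long videos efficiently while maintaining high continuity is therefore needed.
This study proposes a video generation method that, in addition to the text prompt, uses inter-frame relational information such as edge maps, depth maps, and pose maps as guidance for the generation process of the diffusion model, and thereby controls a computationally inexpensive text-to-image (T2I) model. Incorporating inter-frame relational information into the generation process yields videos that preserve both quality and motion consistency.

2. Related Work
2.1 Stable Diffusion [1]
A method that starts from a noise image and gradually removes the noise, guided by text or image prompts, to generate the target image.
2.2 ControlNet
A method that adds the weights of additional conditions directly to the denoising process of a diffusion model, strengthening the influence of those conditions and making the denoising process more controllable.
2.3 Animatediff
A model that uses a Temporal Transform structure to gain the ability to predict future frames from past frames. This makes it possible to generate multiple noise images from a text prompt and produce a short, continuous video.

3. Proposed Method
The proposed method uses a sliding window to select 16 frames that include the current frame (the current frame and the following 15 frames) as one group. For each frame in the group, edge, depth, and pose features are extracted, and each is converted by ControlNet into a form that can be fused (an embedding). Inter-frame relational information is extracted from the Temporal Transform structure of Animatediff. Both are added to the diffusion denoising U-Net, so that image generation takes these features into account during denoising. This approach preserves the relationships between frames and produces videos with higher continuity (see Fig. 1).
Fig. 1. Overview of the proposed method

3.1 Guiding generation with video frame features
Because the method is built on an image generation model, the video must be decomposed into frames for processing. To preserve continuity, it is considered important to capture the person's contour, motion, and body-part information appropriately. The experiments confirmed that edge information extracted with Canny, motion information extracted with OpenPose, and depth information extracted with LeReS effectively strengthen the weights of the contour, motion, and body-part features and help maintain continuity. Adding these features as weights through ControlNet models therefore allows generation to retain this information during denoising. The control experiments also indicate that these features are important for video generation (see the control experiment in Fig. 3). Masking unneeded regions to zero was also found to be an effective way to reduce the computational load.

3.2 Prompt classification
Prompts contain information that is continuous across frames, such as the main character's appearance and motion. Because the words that describe the person's motion differ from frame to frame, MotionCLIP, which the authors describe as a model that generates text from images, is used to generate text corresponding to each frame's motion-feature image, and this text is attached to the frame as a label. When each frame is processed, the appearance prompt specified by the user is combined with the motion description for that frame, which guides denoising more accurately. This makes it possible to generate images that correspond to the frames of the source video.

4. Experiments
4.1 Experimental setup
Models: Stable Diffusion 1.5, ControlNet 1.1
Input video size: 1024×1024
Output video size: 1024×1024
Long-video experiment: 26.3 s (790 frames)
Short-video experiment: 8.5 s (259 frames)
Input prompt: "2D anime-style girl, dancing, red eyes, white sailor uniform, red skirt"

4.2 Results
Video quality was evaluated with three CLIP Scores that measure continuity, fidelity to the source video, and relevance to the prompt (see Fig. 2).
Fig. 2. Long video and short video
The importance of the feature maps was examined with control experiments and their visualization (see Fig. 3).
Fig. 3. Control experiments and visualization of each stage
In addition, the generated frames were confirmed to be richer in detail than the original frames, and the generated video to be of higher resolution than the source video (see Fig. 4).
Fig. 4. Comparison of the source video (left) and the generated video (right)

5. Conclusion
The experimental results show that adding weights for edge, motion, and region recognition to the denoising process secures the continuity of the video, and confirm that these features are important factors in maintaining it. Appropriate prompt classification also made denoising more accurate and gave good results for videos of different lengths. These findings indicate that the frame-wise approach to video style transfer is effective.
Future work will examine how further features affect video continuity and will adjust the weights of these features according to the content of the prompt, with the aim of increasing the influence of the prompt and improving the controllability of the model. The authors hope this will make AI-assisted art creation accessible to the general public.

6. References
[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
[2] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[3] Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. Jul. 2023.
[4] Guy Tevet, Brian Gordon, Amir Hertz, Amit H. Bermano, Daniel Cohen-Or. MotionCLIP: Exposing Human Motion Generation to CLIP Space. arXiv:2203.08063, 2022.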
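
The frame grouping described in Section 3 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract fixes only the window length (16 frames, the current frame plus the following 15). The function name `sliding_windows`, the `stride` parameter, and the truncation of windows near the end of the video are assumptions introduced here.

```python
from typing import Iterator, List

WINDOW_SIZE = 16  # from the abstract: the current frame plus the next 15


def sliding_windows(
    num_frames: int, window: int = WINDOW_SIZE, stride: int = 1
) -> Iterator[List[int]]:
    """Yield groups of frame indices, each starting at a 'current' frame.

    Assumptions (not stated in the abstract): the window advances by
    `stride` frames, and windows that would run past the last frame are
    truncated rather than padded.
    """
    if num_frames < 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames must be >= 0; window and stride must be > 0")
    for start in range(0, num_frames, stride):
        # Clamp the end of the window to the last available frame.
        yield list(range(start, min(start + window, num_frames)))


if __name__ == "__main__":
    # Frame counts taken from the experimental setup (Section 4.1).
    for name, frames in (("long", 790), ("short", 259)):
        overlapping = sum(1 for _ in sliding_windows(frames, stride=1))
        disjoint = sum(1 for _ in sliding_windows(frames, stride=WINDOW_SIZE))
        print(name, overlapping, disjoint)
```

With a stride of 1 every frame starts its own group, so the 790-frame video yields 790 overlapping groups. With a stride of 16 the groups do not overlap, giving 50 groups for the 790-frame video and 17 for the 259-frame video. Which stride the authors used is not stated.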
	(English)
Video Style Transformation with Stable Diffusion Model
Yu Hao†, Liu Jiaqing†, Li Yinhao†, Chen Yanwei†
† College of Information Science and Engineering, Ritsumeikan University, 2-150 Iwakura-cho, Ibaraki, Osaka 567-8570, Japan
E-mail: †is0650is@ed.ritsumei.ac.jp
Keywords: Stable Diffusion, Text-to-Image (T2I), Video-to-Video (V2V), Style Transformation

1. Introduction
In the field of video style transformation, many methods enable the model to understand temporal information and transform the style of the entire video at once. These methods are limited in the video length they can handle, demand significant computational resources, and often fail to ensure sufficient continuity across frames. Efficient methods that can process long videos while maintaining high continuity are therefore needed.
This study proposes a video generation method that uses frame-to-frame relational information such as edge maps, depth maps, and pose maps as guidance in the diffusion model's generation process. The method controls a low-computation text-to-image (T2I) model and achieves video generation that maintains both quality and motion consistency.

2. Related Work
2.1 Stable Diffusion [1]
A technique that gradually removes noise from a noisy image, guided by text or image prompts, to generate the target image.
2.2 ControlNet
A method that adds the weights of additional conditions directly to the diffusion model's noise removal process, strengthening the influence of those conditions and making noise removal more controllable.
2.3 Animatediff
A model that incorporates the Temporal Transform structure, enabling prediction of future frames from past ones. It generates multiple noisy images based on text prompts and produces continuous short videos.

3. Proposed Method
The proposed method uses a sliding window to select 16 frames (the current frame and the next 15 frames) as a group. For each frame in the group, edge, depth, and pose features are extracted and converted by ControlNet into a form that can be fused (an embedding). Temporal relational information between frames is extracted from the Temporal Transform structure of Animatediff. Both are incorporated into the diffusion noise removal U-Net, where they guide image generation during noise removal. This approach preserves the relations between frames and yields higher continuity across them (see Fig. 1).
Fig. 1. Overview of the Proposed Method

3.1 Guiding Generation Using Video Frame Features
The proposed method is based on an image generation model, so videos are processed by splitting them into frames. To maintain video continuity, it is crucial to capture features such as outlines, motions, and body-part information accurately. Experimental results confirm that edge information extracted using Canny, motion information extracted using OpenPose, and depth information extracted using LeReS effectively reinforce these features and contribute to video continuity. Adding these features as weights through the ControlNet model allows this information to be retained during noise removal. Contrast experiments further indicate the significance of these features in video generation (see Fig. 3). Masking unnecessary regions to zero was also found to reduce the computational load.

3.2 Classification of Prompts
Prompts include information that is continuous across frames, such as the main character's appearance and movements. Since movement descriptions vary by frame, the MotionCLIP model is used to generate text descriptions corresponding to the motion features in each frame, and the frames are labeled with them. This makes it possible to combine the user-specified appearance prompt with the frame-specific motion prompt, guiding noise removal more accurately. As a result, images corresponding to the original video frames are generated.

4. Experiments
4.1 Experimental Setup
Models Used: Stable Diffusion 1.5, ControlNet 1.1
Input Video Resolution: 1024×1024
Output Video Resolution: 1024×1024
Long Video Experiment: 26.3 seconds (790 frames)
Short Video Experiment: 8.5 seconds (259 frames)
Input Prompt: “2D anime-style girl, dancing, red eyes, white sailor uniform, red skirt”

4.2 Results
Video quality was evaluated using three CLIP Scores that measure continuity, fidelity to the source video, and relevance to the prompt (see Fig. 2).
Fig. 2. Long Video and Short Video Results
Feature map significance was examined through contrast experiments and visualization (see Fig. 3).
Fig. 3. Contrast Experiments and Visualizations
Generated image frames displayed greater detail than the originals, and the resulting videos were higher resolution than the source (see Fig. 4).
Fig. 4. Comparison of Source Video (Left) and Generated Video (Right)

5. Conclusion
The experimental results show that adding edge, motion, and region recognition weights during noise removal secures video continuity, and confirm that these features are crucial factors in maintaining it. Appropriate prompt classification also enabled more accurate noise removal, yielding good results for videos of varying lengths. These findings support the effectiveness of the frame-wise style transformation approach.
Future work will explore the influence of additional features on video continuity and adjust their weights based on prompt content, to strengthen the influence of the prompt and improve control over the model. The aim is to enable the general public to create AI-assisted artwork.

6. References
1. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
2. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Lopes, R. G., Ayan, B. K., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
3. Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., & Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint, Jul 2023.
4. Tevet, G., Gordon, B., Hertz, A., Bermano, A. H., & Cohen-Or, D. MotionCLIP: Exposing Human Motion Generation to CLIP Space. arXiv preprint, 2022.
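
Section 4.2 reports three CLIP Scores (continuity, fidelity to the source video, and relevance to the prompt) but the abstract does not give their formulas. One common way to define such scores is the mean cosine similarity between CLIP embeddings, and the sketch below assumes that definition. It takes embeddings as plain lists of floats, so obtaining them from a CLIP model is outside its scope, and all function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def continuity_score(frames: Sequence[Vector]) -> float:
    """Assumed definition: mean similarity of each frame to the next one."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    pairs = zip(frames[:-1], frames[1:])
    return sum(cosine(a, b) for a, b in pairs) / (len(frames) - 1)


def fidelity_score(generated: Sequence[Vector], source: Sequence[Vector]) -> float:
    """Assumed definition: mean similarity of generated frame i to source frame i."""
    if len(generated) != len(source) or not generated:
        raise ValueError("need two non-empty frame lists of equal length")
    return sum(cosine(g, s) for g, s in zip(generated, source)) / len(generated)


def prompt_score(frames: Sequence[Vector], text: Vector) -> float:
    """Assumed definition: mean similarity of each frame to the prompt embedding."""
    if not frames:
        raise ValueError("need at least one frame")
    return sum(cosine(f, text) for f in frames) / len(frames)
```

Under these assumed definitions all three scores lie in [-1, 1], and a video whose consecutive frames have identical embeddings has a continuity score of 1.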
Keywords	Stable Diffusion / Video to Video / Text to Video / Img to Img / Video Style Transformation
Bibliographic information	ITE Technical Report

Workshop information
Workshop	KANSAI
Dates	2024-12-22 - 2024-12-22
Venue	Osaka Metropolitan University, I-site Namba
Theme	The Institute of Image Information and Television Engineers, Kansai Chapter, Workshop for Young Researchers
Paper details
Workshop applied to	KANSAI
Conference code	2024-12-KANSAI
Language of paper	Japanese
Title (Japanese)	Stable Diffusionモデルを用いた連続性のあるビデオ画風変換
Title (English)	Continuity Video-to-Video style transfer using Stable Diffusion Models
Keyword (1)	Stable Diffusion
Keyword (2)	Video to Video
Keyword (3)	Text to Video
Keyword (4)	Img to Img
Keyword (5)	Video Style Transformation
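Author 1 name	Yu Hao (于浩)
Author 1 affiliation	College of Information Science and Engineering, Ritsumeikan University (abbr. Ritsumeikan Univ.) / Information Systems Science and Engineering Course (abbr. ISSE)
Author 2 name	Liu Jiaqing (劉家慶)
Author 2 affiliation	College of Information Science and Engineering, Ritsumeikan University (abbr. Ritsumeikan Univ.) / Information Systems Science and Engineering Course (abbr. ISSE)
Author 3 name	Li Yinhao (李印豪)
Author 3 affiliation	College of Information Science and Engineering, Ritsumeikan University (abbr. Ritsumeikan Univ.) / Information Systems Science and Engineering Course (abbr. ISSE)
Author 4 name	Chen Yanwei (陳延偉)
Author 4 affiliation	College of Information Science and Engineering, Ritsumeikan University (abbr. Ritsumeikan Univ.) / Information Systems Science and Engineering Course (abbr. ISSE)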
Presenter	Author 1
Presentation date and time	2024-12-22 13:30:00
Presentation length	15 min
Workshop applied to	KANSAI