
Hunyuan Video Fine-Tuning


A powerful toolkit for fine-tuning Hunyuan Video with LoRA, plus advanced video inference and automatic captioning via QWEN-VL. This guide focuses on the most important aspect: how to run fine-tuning (training) and generation (inference) using Cog, with detailed explanations of all parameters.


Table of Contents

  • Quick Start
  • Installation & Setup
  • Training
  • Inference
  • Tips & Tricks
  • License

Quick Start

  1. Place your training videos in a ZIP file. Optionally include <video_name>.txt captions alongside each <video_name>.mp4, e.g.:

    your_data.zip/
    ├── dance_scene.mp4
    ├── dance_scene.txt
    ├── city_stroll.mp4
    └── ...
    

    Tip: You can use create-video-dataset to easily prepare your training data with automatic QWEN-VL captioning, or package the files yourself (a minimal packaging sketch follows this list).

  2. Install Cog and Docker.

  3. Run the training example command (see below).

  4. After training, run the inference example command to generate a new video.
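
If you would rather assemble the archive yourself, here is a minimal packaging sketch (assuming your clips and optional captions live in a local folder called dataset/):

cd dataset
# Each .txt caption shares the base name of its .mp4 clip.
zip -r ../your_data.zip .
cd ..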


Installation & Setup

  1. Install Docker (required by Cog).
  2. Install Cog from cog.run:
    curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
    chmod +x /usr/local/bin/cog
    pip install cog
  3. Clone or download this repository.
  4. From the project root directory, you can run Cog commands with parameters:
    # For training:
    sudo cog train -i "input_videos=@your_videos.zip" -i "trigger_word=MYSTYLE"
    
    # For inference:
    sudo cog predict -i "prompt=your prompt here" -i "replicate_weights=@/tmp/trained_model.tar"
  5. See below for detailed parameter explanations and more examples.
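
Before running any of the commands above, you can sanity-check the installation by printing the tool versions (assuming cog and docker are on your PATH):

cog --version
docker --version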

Training

Training Command

Use:

sudo cog train \
  -i "input_videos=@your_videos.zip" \
  [other parameters...]

The result of training is saved to /tmp/trained_model.tar containing:

  • LoRA weights (.safetensors)
  • (Optional) ComfyUI-compatible LoRA
  • Any logs or training artifacts
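
To check what a run actually produced, you can list (or extract) the archive before using it, assuming the default output path:

tar -tf /tmp/trained_model.tar
# Optionally unpack it to inspect the .safetensors file directly:
mkdir -p /tmp/trained_model && tar -xf /tmp/trained_model.tar -C /tmp/trained_model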

You can use this output directly in inference by passing it to the replicate_weights parameter:

sudo cog predict \
  -i "prompt='Your prompt here'" \
  -i "replicate_weights=@/tmp/trained_model.tar" \
  [other parameters...]

Training Parameters

Below are the key parameters you can supply to cog train. All parameters have validated types and ranges:

• input_videos (Path)

  • Description: A ZIP file containing videos (and optional .txt captions).
  • Example: -i "input_videos=@my_videos.zip"

• trigger_word (str)

  • Description: A "fake" or "rare" word that represents the style or concept you're training on.
  • Default: "TOK"
  • Example: -i "trigger_word=STYLE3D"

• autocaption (bool)

  • Description: Whether to auto-caption your videos using QWEN-VL.
  • Default: True
  • Example: -i "autocaption=false"

• autocaption_prefix (str)

  • Description: Text prepended to all generated captions (helps set consistent context).
  • Default: None
  • Example: -i "autocaption_prefix='A cinematic scene of TOK, '"

• autocaption_suffix (str)

  • Description: Text appended to all generated captions (helps reinforce the concept).
  • Default: None
  • Example: -i "autocaption_suffix='in the art style of TOK.'"

• epochs (int)

  • Description: Number of full passes (epochs) over the dataset.
  • Range: 1–2000
  • Default: 16

• max_train_steps (int)

  • Description: Limit the total number of steps (each step processes one batch). -1 for unlimited.
  • Range: -1–1,000,000
  • Default: -1

• rank (int)

  • Description: LoRA rank. Higher rank can capture more detail but also uses more resources.
  • Range: 1–128
  • Default: 32

• batch_size (int)

  • Description: Batch size (frames per iteration). Lower for less VRAM usage.
  • Range: 1–8
  • Default: 4

• learning_rate (float)

  • Description: Training learning rate.
  • Range: 1e-5–1
  • Default: 1e-3

• optimizer (str)

  • Description: Which optimizer to use. Usually "adamw8bit" is a good default.
  • Choices: ["adamw", "adamw8bit", "AdaFactor", "adamw16bit"]
  • Default: "adamw8bit"

• timestep_sampling (str)

  • Description: Sampling strategy across diffusion timesteps.
  • Choices: ["sigma", "uniform", "sigmoid", "shift"]
  • Default: "sigmoid"

• consecutive_target_frames (str)

  • Description: How many consecutive frames to pull from each video.
  • Choices: ["[1, 13, 25]", "[1, 25, 45]", "[1, 45, 89]", "[1, 13, 25, 45]"]
  • Default: "[1, 25, 45]"

• frame_extraction_method (str)

  • Description: How frames are extracted: from the start of each clip (head), in fixed chunks (chunk), with a sliding window (slide), or sampled uniformly (uniform).
  • Choices: ["head", "chunk", "slide", "uniform"]
  • Default: "head"

• frame_stride (int)

  • Description: Stride used for slide-based extraction.
  • Range: 1–100
  • Default: 10

• frame_sample (int)

  • Description: Number of samples used in uniform extraction.
  • Range: 1–20
  • Default: 4

• seed (int)

  • Description: Random seed. Use <= 0 for truly random.
  • Default: 0

• hf_repo_id (str)

  • Description: If you want to push your LoRA to Hugging Face, specify "username/my-video-lora" (see the combined example after this parameter list).
  • Default: None

• hf_token (Secret)

  • Description: Hugging Face token for uploading to a private or public repository.
  • Default: None
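
For instance, pushing the trained LoRA to Hugging Face only requires adding the two Hub parameters to a normal training run (the repo ID and token below are placeholders):

sudo cog train \
  -i "input_videos=@your_videos.zip" \
  -i "trigger_word=MYSTYLE" \
  -i "hf_repo_id=username/my-video-lora" \
  -i "hf_token=hf_xxxxxxxxxxxxxxxx"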

Examples

  1. Simple Training

    sudo cog train \
      -i "input_videos=@your_videos.zip" \
      -i "trigger_word=MYSTYLE" \
      -i "epochs=4"

    This runs 4 epochs with default batch size and autocaption.

  2. Memory-Constrained Training

    sudo cog train \
      -i "input_videos=@your_videos.zip" \
      -i "rank=16" \
      -i "batch_size=1" \
      -i "gradient_checkpointing=true"

    Uses a lower rank and smaller batch size to reduce VRAM usage, plus gradient checkpointing.

  3. Motion-Focused Training

    sudo cog train \
      -i "input_videos=@videos.zip" \
      -i "consecutive_target_frames=[1, 45, 89]" \
      -i "frame_extraction_method=slide" \
      -i "frame_stride=10"

    Extracts frames in sliding windows to capture more motion variety.

  4. Quick Test Run

    sudo cog train \
      -i "input_videos=@test.zip" \
      -i "rank=16" \
      -i "epochs=4" \
      -i "max_train_steps=100" \
      -i "batch_size=1" \
      -i "gradient_checkpointing=true"

    Minimal training to verify your setup and data.

  5. Style Focus

    sudo cog train \
      -i "input_videos=@style.zip" \
      -i "consecutive_target_frames=[1]" \
      -i "frame_extraction_method=uniform" \
      -i "frame_sample=8" \
      -i "epochs=16"

    Optimized for learning static style elements rather than motion.


Inference

Inference Command

Use:

sudo cog predict \
  -i "prompt='Your prompt here'" \
  [other parameters...]

The generated video is saved to the output directory (usually /src or /outputs inside Docker), and Cog returns the path.

Inference Parameters

Below are the key parameters for cog predict:

• prompt (str)

  • Description: Your text prompt for the scene or style.
  • Example: -i "prompt='A cinematic shot of a forest in MYSTYLE'"

• lora_url (str)

  • Description: URL or Hugging Face repo ID for the LoRA weights.
  • Example: -i "lora_url='myuser/my-lora-repo'"

• lora_strength (float)

  • Description: How strongly the LoRA style is applied.
  • Range: -10.0–10.0
  • Default: 1.0

• scheduler (str)

  • Description: The diffusion sampling/flow algorithm.
  • Choices: ["FlowMatchDiscreteScheduler", "SDE-DPMSolverMultistepScheduler", "DPMSolverMultistepScheduler", "SASolverScheduler", "UniPCMultistepScheduler"]
  • Default: "DPMSolverMultistepScheduler"

• steps (int)

  • Description: Number of diffusion steps.
  • Range: 1–150
  • Default: 50

• guidance_scale (float)

  • Description: How strongly the prompt influences the generation.
  • Range: 0.0–30.0
  • Default: 6.0

• flow_shift (int)

  • Description: Adjusts motion consistency across frames.
  • Range: 0–20
  • Default: 9

• num_frames (int)

  • Description: Total frames in the output video.
  • Range: 1–1440
  • Default: 33

• width (int), height (int)

  • Description: Dimensions of generated frames.
  • Range: width (64–1536), height (64–1024)
  • Default: 640x360

• denoise_strength (float)

  • Description: Controls how strongly noise is applied each step: 0 = minimal noise, 2 = heavy noise.
  • Range: 0.0–2.0
  • Default: 1.0

• force_offload (bool)

  • Description: Offload layers to CPU for lower VRAM usage.
  • Default: True

• frame_rate (int)

  • Description: Frames per second in the final video.
  • Range: 1–60
  • Default: 16

• crf (int)

  • Description: H.264 compression quality (CRF); lower values mean higher quality and larger files.
  • Range: 0–51
  • Default: 19

• enhance_weight (float)

  • Description: Strength of optional enhancement effect.
  • Range: 0.0–2.0
  • Default: 0.3

• enhance_single (bool) & enhance_double (bool)

  • Description: Whether to enable enhancement on single frames or across pairs of frames.
  • Default: True, True

• enhance_start (float) & enhance_end (float)

  • Description: Control when in the video enhancement starts or ends (fractional times, 0.0–1.0 range).
  • Default: 0.0 (enhance_start), 1.0 (enhance_end)

• seed (int)

  • Description: Random seed for reproducible output.
  • Default: random if not provided

• replicate_weights (Path)

  • Description: Path to a local .tar containing LoRA weights from replicate training.
  • Default: None

Examples

  1. Basic Inference with Local LoRA

    sudo cog predict \
      -i "prompt='A serene lake at sunrise in the style of MYSTYLE'" \
      -i "lora_url='local-file.safetensors'" \
      -i "width=512" \
      -i "height=512" \
      -i "steps=30"
  2. Advanced Motion and Quality

    sudo cog predict \
      -i "prompt='TOK winter cityscape, moody lighting'" \
      -i "lora_url='myuser/my-lora-repo'" \
      -i "steps=50" \
      -i "flow_shift=15" \
      -i "num_frames=80" \
      -i "frame_rate=30" \
      -i "crf=17" \
      -i "lora_strength=1.2"

    Here, we use more frames, a higher frame rate, and a lower CRF for higher quality.

  3. Using Replicate Tar

    sudo cog predict \
      -i "prompt='An astronaut dancing on Mars in style TOK'" \
      -i "replicate_weights=@trained_model.tar" \
      -i "guidance_scale=8" \
      -i "num_frames=45"

    Instead of lora_url, we pass a local .tar with LoRA weights.

  4. Quick Preview

    sudo cog predict \
      -i 'steps=30' \
      -i 'width=512' \
      -i 'height=512' \
      -i 'num_frames=33' \
      -i 'force_offload=true'
  5. Smooth Motion

    sudo cog predict \
      -i 'scheduler=FlowMatchDiscreteScheduler' \
      -i 'flow_shift=15' \
      -i 'frame_rate=30' \
      -i 'num_frames=89'

Tips & Tricks

  1. Reduce OOM Errors

    • Use a smaller batch_size or lower rank during training.
    • Enable force_offload=true during inference.
  2. Better Quality

    • Increase steps and guidance_scale.
    • Use a lower crf (e.g., 17 or 18).
  3. Faster Training

    • For smaller datasets, reduce epochs.
    • Increase learning_rate slightly (e.g., 2e-3) while monitoring for overfitting.
  4. Motion Emphasis

    • Use frame_extraction_method=slide or consecutive_target_frames=[1, 25, 45] during training for improved motion consistency.
    • Adjust flow_shift (5–15 range) during inference.
  5. Style Activation

    • Always include your trigger_word in the inference prompt.
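
For example, a reproducible styled render combines the trigger word with a fixed seed (MYSTYLE, the prompt, and the seed value below are placeholders):

sudo cog predict \
  -i "prompt='A rainy neon street at night in the style of MYSTYLE'" \
  -i "replicate_weights=@/tmp/trained_model.tar" \
  -i "seed=42"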

License

This project is released under the MIT License.
Please see the LICENSE file for details.