FramePack

Packing Input Frame Context for Video Generation

A next-frame prediction neural network structure that generates videos progressively. FramePack compresses input contexts to a constant length, so the generation workload is invariant to video length.

Video Diffusion, But Feels Like Image Diffusion

FramePack is a next-frame prediction neural network structure that generates videos progressively. It compresses input contexts to a constant length, making the generation workload invariant to video length.

  • Process a large number of frames, even on laptop GPUs
  • Requires only 6 GB of GPU memory
  • Can be trained with much larger batch sizes, similar to image diffusion training
  • Generate 1-minute, 30 FPS videos (1800 frames)
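The headline frame count follows from simple arithmetic, sketched below for clarity:

```python
# Quick arithmetic behind the headline claim: a 1-minute clip at 30 FPS.
duration_s = 60
fps = 30
total_frames = duration_s * fps
print(total_frames)  # 1800
```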

FramePack Overview

Key Features

Minimal Memory Requirements

Generate 60-second, 30 FPS videos (1800 frames) with a 13B model using only 6 GB of VRAM. Even laptop GPUs can handle it.

Instant Visual Feedback

As a next-frame prediction model, you'll directly see the generated frames, getting plenty of visual feedback throughout the entire generation process.
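To illustrate why next-frame prediction gives immediate feedback, here is a toy Python sketch (the function and frame names are hypothetical, not FramePack's actual API): each frame can be handed to the UI as soon as it is sampled.

```python
# Toy sketch (hypothetical names, not FramePack's actual API): a next-frame
# predictor can yield each frame as soon as it is sampled, so the UI can
# show progress immediately instead of waiting for the whole video.

def generate(num_frames):
    for t in range(num_frames):
        frame = f"frame_{t}"   # stand-in for one sampled video frame
        yield frame            # caller can display it right away

shown = []
for frame in generate(5):
    shown.append(frame)        # e.g. update a live preview widget here

print(shown)
```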

Compressed Input Context

Compresses input contexts to a constant length, making generation workload invariant to video length and supporting ultra-long video generation.
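To see why a compressed context makes the workload length-invariant, here is an illustrative sketch assuming a geometric compression schedule (the schedule and token counts are made up for illustration, not the paper's exact values): older frames receive exponentially stronger compression, so the total context length converges instead of growing with the video.

```python
# Illustrative sketch, not the official implementation: assume frame i
# (counting back from the newest) is compressed by a factor of 2**i.
# The resulting geometric series keeps the total context length bounded
# no matter how many frames the video has.

def packed_context_length(num_frames, tokens_per_frame=1536):
    # tokens_per_frame is a hypothetical number chosen for illustration
    return sum(tokens_per_frame // (2 ** i) for i in range(num_frames))

short_ctx = packed_context_length(8)      # 8-frame history
long_ctx = packed_context_length(1800)    # 1-minute video at 30 FPS
print(short_ctx, long_ctx)                # nearly identical totals
```

Because the series is geometric, even an 1800-frame video needs less than twice one frame's worth of uncompressed context.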

Standalone Desktop Software

Provides a feature-complete desktop application with a minimal, standalone, high-quality sampling system and built-in memory management.

Amazing Demos

Graceful Dance

The girl dances gracefully, with clear movements, full of charm.

Energetic Dance

The man dances energetically, leaping mid-air with fluid arm swings and quick footwork.

Get Started

A one-click package will be released soon. Please check back later.

# We recommend using a dedicated Python 3.10 environment
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt

# Start the GUI
python demo_gradio.py

Research Paper


Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

FramePack is a video generation method that compresses input contexts to a constant length, making the generation workload invariant to video length. Learn about our methods, architecture, and experimental results in detail.