Minor - Fix CUDA OOM for 16GB VRAM GPUs#14

Open
changhaowuwu wants to merge 2 commits into Tele-AI:main from changhaowuwu:main
Conversation


changhaowuwu commented Feb 21, 2026

Fix CUDA OOM for 16GB GPUs
Problem
telestylevideo_inference.py consistently hit torch.OutOfMemoryError on GPUs with 16GB VRAM (e.g. RTX 4080). Two issues:

1. Both models on GPU simultaneously — the VAE (3.5 GB) and transformer (3.5 GB) were both loaded to the GPU at init, leaving insufficient headroom for activation tensors during the transformer forward pass (340 MiB for patch_embedding2) and VAE encoding (8.7 GiB for 129-frame 3D convolutions).
2. Stale GPU processes — interrupted runs (Ctrl+C, OOM kills → exit code 137) left zombie processes holding 11+ GiB of VRAM, starving subsequent runs.

Solution
Model offloading — only one large model (VAE or transformer) resides on the GPU at any time; the two are swapped between CPU and GPU at each inference stage.
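The swap can be sketched roughly as below. This is an illustrative outline, not the PR's actual code: the helper name, the `fn` callback, and the stage boundaries are assumptions.

```python
import torch

# Use the GPU when available; fall back to CPU so the sketch stays runnable.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def run_offloaded(model, other, fn, *args):
    """Run one inference stage with `model` on the accelerator while the
    other large model (`other`) stays parked on CPU."""
    other.to("cpu")                   # offload the model this stage doesn't need
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached blocks before loading
    model.to(DEVICE)
    out = fn(*args)                   # e.g. transformer denoising or VAE decode
    model.to("cpu")                   # offload immediately after the stage
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return out
```

With this pattern the transformer runs its 25 denoising steps alone on the GPU, and only afterwards is the VAE brought back for decoding.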

Additional optimizations:

  • VAE tiling (enable_tiling()) — tiles 3D convolutions spatially so encoding 129 frames doesn't require one massive allocation
  • Latent-only pipeline output — pipeline returns raw latents (output_type="latent") so VAE decode happens after the transformer is offloaded
  • Automatic stale process cleanup — _kill_stale_gpu_processes() queries nvidia-smi at startup and kills any leftover GPU processes
  • Graceful exit handlers — atexit + SIGINT/SIGTERM handlers call torch.cuda.empty_cache()
  • PYTORCH_ALLOC_CONF=expandable_segments:True set programmatically to reduce fragmentation
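For the stale-process cleanup, a helper along the lines of `_kill_stale_gpu_processes()` might look like the sketch below. The `nvidia-smi` query flags are real options; the exact filtering logic in the PR is an assumption.

```python
import os
import signal
import subprocess

def kill_stale_gpu_processes():
    """Best-effort cleanup of leftover GPU compute processes at startup."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
            text=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return  # no NVIDIA driver or no visible GPU: nothing to clean up
    for line in out.splitlines():
        line = line.strip()
        if not line.isdigit():
            continue                      # skip blank lines / "[N/A]" entries
        pid = int(line)
        if pid == os.getpid():
            continue                      # never kill the current process
        try:
            os.kill(pid, signal.SIGTERM)  # ask the stale process to exit
        except (ProcessLookupError, PermissionError):
            pass
```

SIGTERM gives the stale process a chance to run its own exit handlers (and thus `torch.cuda.empty_cache()`) before the new run starts allocating.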

Testing
Verified end-to-end on a 16GB GPU with the default 129-frame, 720×1248 configuration. All 25 diffusion steps complete at ~77s/step with peak VRAM usage under 12 GiB.

- Enable VAE tiling to reduce peak memory during encode/decode
- Keep only one large model (VAE or transformer) on GPU at a time,
  swapping between CPU and GPU at each inference stage
- Return raw latents from pipeline and decode separately after
  offloading transformer, avoiding both models on GPU simultaneously
- Auto-kill stale GPU processes from previous interrupted runs at startup
- Add atexit/signal handlers for graceful GPU memory cleanup
- Set PYTORCH_ALLOC_CONF=expandable_segments:True by default
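The last bullet amounts to two lines, which must run before the first `import torch` so the allocator picks up the setting. Note a naming caveat: newer PyTorch releases read `PYTORCH_ALLOC_CONF`, while older ones only read `PYTORCH_CUDA_ALLOC_CONF`, so setting both is a conservative default (an assumption here, not something the PR states).

```python
import os

# Set allocator options before torch is imported so they take effect;
# setdefault leaves any value the user already exported untouched.
for var in ("PYTORCH_ALLOC_CONF", "PYTORCH_CUDA_ALLOC_CONF"):
    os.environ.setdefault(var, "expandable_segments:True")
```

`expandable_segments:True` lets the caching allocator grow existing segments instead of requesting new fixed-size ones, which reduces fragmentation for workloads with varying allocation sizes.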
changhaowuwu changed the title from "Minor - Fix CUDA OOM for 16GB GPUs" to "Minor - Fix CUDA OOM for 16GB VRAM GPUs" on Feb 21, 2026