Minor - Fix CUDA OOM for 16GB VRAM GPUs#14

Open
changhaowuwu wants to merge 2 commits into Tele-AI:main from changhaowuwu:main
Conversation


changhaowuwu commented Feb 21, 2026

Fix CUDA OOM for 16GB GPUs
Problem
telestylevideo_inference.py consistently hit torch.OutOfMemoryError on GPUs with 16GB VRAM (e.g. RTX 4080). Two issues:

1. Both models on GPU simultaneously — the VAE (3.5 GB) and transformer (3.5 GB) were both loaded to the GPU at init, leaving insufficient headroom for activation tensors during the transformer forward pass (340 MiB for patch_embedding2) and VAE encoding (8.7 GiB for 129-frame 3D convolutions).
2. Stale GPU processes — interrupted runs (Ctrl+C, OOM kills → exit code 137) left zombie processes holding 11+ GiB of VRAM, starving subsequent runs.

Solution
Model offloading — only one large model (VAE or transformer) resides on the GPU at any time; the two are swapped between CPU and GPU at each inference stage.
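The swap can be sketched roughly as below. This is an illustrative outline, not the PR's actual code: the helper name, the `fn` callback, and the stage boundaries are assumptions.

```python
import torch

# Use the GPU when available; fall back to CPU so the sketch stays runnable.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def run_offloaded(model, other, fn, *args):
    """Run one inference stage with `model` on the accelerator while the
    other large model (`other`) stays parked on CPU."""
    other.to("cpu")                   # offload the model this stage doesn't need
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached blocks before loading
    model.to(DEVICE)
    out = fn(*args)                   # e.g. transformer denoising or VAE decode
    model.to("cpu")                   # offload immediately after the stage
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return out
```

With this pattern the transformer runs its 25 denoising steps alone on the GPU, and only afterwards is the VAE brought back for decoding.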

Additional optimizations:

  • VAE tiling (enable_tiling()) — tiles 3D convolutions spatially so encoding 129 frames doesn't require one massive allocation
  • Latent-only pipeline output — pipeline returns raw latents (output_type="latent") so VAE decode happens after the transformer is offloaded
  • Automatic stale process cleanup — _kill_stale_gpu_processes() queries nvidia-smi at startup and kills any leftover GPU processes
  • Graceful exit handlers — atexit + SIGINT/SIGTERM handlers call torch.cuda.empty_cache()
  • PYTORCH_ALLOC_CONF=expandable_segments:True set programmatically to reduce fragmentation
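For the stale-process cleanup, a helper along the lines of `_kill_stale_gpu_processes()` might look like the sketch below. The `nvidia-smi` query flags are real options; the exact filtering logic in the PR is an assumption.

```python
import os
import signal
import subprocess

def kill_stale_gpu_processes():
    """Best-effort cleanup of leftover GPU compute processes at startup."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
            text=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return  # no NVIDIA driver or no visible GPU: nothing to clean up
    for line in out.splitlines():
        line = line.strip()
        if not line.isdigit():
            continue                      # skip blank lines / "[N/A]" entries
        pid = int(line)
        if pid == os.getpid():
            continue                      # never kill the current process
        try:
            os.kill(pid, signal.SIGTERM)  # ask the stale process to exit
        except (ProcessLookupError, PermissionError):
            pass
```

SIGTERM gives the stale process a chance to run its own exit handlers (and thus `torch.cuda.empty_cache()`) before the new run starts allocating.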

Testing
Verified end-to-end on a 16GB GPU with the default 129-frame, 720×1248 configuration. All 25 diffusion steps complete at ~77s/step with peak VRAM usage under 12 GiB.

- Enable VAE tiling to reduce peak memory during encode/decode
- Keep only one large model (VAE or transformer) on GPU at a time,
  swapping between CPU and GPU at each inference stage
- Return raw latents from pipeline and decode separately after
  offloading transformer, avoiding both models on GPU simultaneously
- Auto-kill stale GPU processes from previous interrupted runs at startup
- Add atexit/signal handlers for graceful GPU memory cleanup
- Set PYTORCH_ALLOC_CONF=expandable_segments:True by default
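The last bullet amounts to two lines, which must run before the first `import torch` so the allocator picks up the setting. Note a naming caveat: newer PyTorch releases read `PYTORCH_ALLOC_CONF`, while older ones only read `PYTORCH_CUDA_ALLOC_CONF`, so setting both is a conservative default (an assumption here, not something the PR states).

```python
import os

# Set allocator options before torch is imported so they take effect;
# setdefault leaves any value the user already exported untouched.
for var in ("PYTORCH_ALLOC_CONF", "PYTORCH_CUDA_ALLOC_CONF"):
    os.environ.setdefault(var, "expandable_segments:True")
```

`expandable_segments:True` lets the caching allocator grow existing segments instead of requesting new fixed-size ones, which reduces fragmentation for workloads with varying allocation sizes.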
changhaowuwu changed the title from "Minor - Fix CUDA OOM for 16GB GPUs" to "Minor - Fix CUDA OOM for 16GB VRAM GPUs" on Feb 21, 2026