Fix for Black Image Issue on Tesla V100 (Volta Architecture) #1292

@tylike

Description

Issue Description

When using Q4_0 quantized models on Tesla V100 (Volta architecture), the generated images are completely black. This issue does not occur on newer GPUs like RTX 4060 Ti (Ada Lovelace architecture).

Root Cause Analysis

The V100 (Volta architecture) Tensor Cores can experience FP16 overflow when using CUBLAS_COMPUTE_16F (FP16 accumulation) for matrix multiplication. The intermediate calculation values can exceed the FP16 representable range (max ~65504), causing overflow and resulting in black images.

Newer architectures (Turing, Ampere, Ada Lovelace) have better hardware-level overflow handling mechanisms, which is why they don't exhibit this issue.

Proposed Fix

File Location

ggml/src/ggml-cuda/ggml-cuda.cu

Modification (around line 1310)

Add GGML_CUDA_CC_VOLTA to the condition for FP32 accumulation:

```cpp
// Before:
if (GGML_CUDA_CC_IS_CDNA(cc) || GGML_CUDA_CC_IS_RDNA4(cc)) {
    // FP32 accumulation path
    ...
}

// After:
if (GGML_CUDA_CC_IS_CDNA(cc) || GGML_CUDA_CC_IS_RDNA4(cc) || cc == GGML_CUDA_CC_VOLTA) {
    // FP32 accumulation path
    ...
}
```

Technical Details

| Setting | Input Format | Accumulator Precision | Output Format |
|---|---|---|---|
| CUBLAS_COMPUTE_16F | FP16 | FP16 | FP16 |
| CUBLAS_COMPUTE_32F | FP16 | FP32 | FP16/FP32 |

V100 Tensor Cores support FP16 input + FP32 accumulation with the same throughput (125 TFLOPS) as FP16 accumulation.
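For reference, this is roughly how the two compute types differ at the cuBLAS API level. This is an illustrative fragment (not the actual ggml call site, and it needs a GPU, a `cublasHandle_t`, and device buffers to run); note that with `CUBLAS_COMPUTE_32F` the `alpha`/`beta` scalars are `float` even though A, B, and C stay in FP16:

```cpp
// Illustrative only: FP16 inputs with an FP32 accumulator via cublasGemmEx.
const float alpha = 1.0f, beta = 0.0f;
cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
             m, n, k,
             &alpha,
             A, CUDA_R_16F, lda,              // FP16 input
             B, CUDA_R_16F, ldb,              // FP16 input
             &beta,
             C, CUDA_R_16F, ldc,              // FP16 output
             CUBLAS_COMPUTE_32F,              // FP32 accumulation (the fix)
             CUBLAS_GEMM_DEFAULT_TENSOR_OP);  // still runs on Tensor Cores
```

Switching only the compute type (and the scalar types) keeps the same memory footprint as the FP16 path, since all matrices remain FP16 in device memory.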

Performance Comparison

Tested on Tesla V100-SXM2-32GB with Z-Image Turbo (Q4_0), 512x512, 4 steps:

| Metric | Before Fix (with --type bf16) | After Fix (default) |
|---|---|---|
| Model Loading | 7.44s | 2.48s |
| Text Encoding | 213ms | 68ms |
| Sampling | 5.68s | 1.63s |
| VAE Decoding | 0.26s | 0.26s |
| Total Generation | 6.23s | 1.99s |
| VRAM Usage | 20.3 GB | 7.7 GB |

The fix provides:

  • 3x faster generation time
  • 62% less VRAM usage
  • No need for --type bf16 workaround

GPU Architecture Reference

| GPU | Compute Capability | Architecture | BF16 Support |
|---|---|---|---|
| Tesla V100 | 7.0 | Volta | No |
| RTX 2080 Ti | 7.5 | Turing | No |
| RTX 3090 | 8.6 | Ampere | Yes |
| RTX 4060 Ti | 8.9 | Ada Lovelace | Yes |

Disclaimer

Important: This fix was identified and implemented with the assistance of AI (Claude/GPT). I am not a CUDA programming expert and cannot guarantee that this modification:

  1. Is the optimal solution for the problem
  2. Does not have unintended side effects on other operations
  3. Is compatible with all use cases and hardware configurations

I am sharing this information for reference purposes only. The maintainers should evaluate whether this approach is appropriate for inclusion in the main codebase. If there are better solutions or if this modification could cause issues elsewhere, please disregard this suggestion.

Related Information

  • The --type bf16 workaround forces full BF16 computation, which works but significantly increases memory usage and computation time
  • This fix allows V100 users to use quantized models (Q4_0, Q8_0, etc.) efficiently without the BF16 workaround
  • The modification only affects Volta architecture (compute capability 7.0)
