Z-Image-Turbo Hosted

Overview

This repository hosts a fine-tuned version of the Z-Image-Turbo model, specifically the training adapter from ostris/zimage_turbo_training_adapter. The original Z-Image-Turbo is developed by Tongyi-MAI and available at Tongyi-MAI/Z-Image-Turbo.

Why This Model?

Z-Image-Turbo is a state-of-the-art text-to-image diffusion model based on a Single-Stream Diffusion Transformer (S3-DiT) architecture. It offers several advantages:

  • Efficiency: Distilled to run with only 8 function evaluations (NFEs), enabling sub-second inference on high-end GPUs.
  • Quality: Excels in photorealistic image generation, bilingual text rendering (English and Chinese), and prompt adherence.
  • Scalability: Supports resolutions up to 1024x1024 pixels.
  • Compatibility: Works with guidance_scale=0.0 for Turbo variants, reducing computational overhead.

We chose this model for our project due to its balance of speed and quality, making it ideal for real-time applications and local inference on consumer hardware like the RTX 3090.

The training adapter enhances the base model by providing fine-tuned weights for specific use cases, improving adaptability without retraining from scratch.

Technical Details

Model Architecture

  • Base Model: Z-Image-Turbo (6B parameters)
  • Architecture: Single-Stream Diffusion Transformer (S3-DiT)
  • Training Data: Not specified in public docs, but likely large-scale image-text pairs for photorealism.
  • Quantization: The hosted version supports quantization for reduced memory usage (e.g., 8-bit or 4-bit using bitsandbytes).

Hosting Process

  1. Selection: Identified Z-Image-Turbo as the best fit for our needs, based on benchmarks showing a superior speed/quality trade-off compared to models like FLUX or SDXL.
  2. Source: Used the training adapter from ostris for pre-fine-tuned weights.
  3. Authentication: Logged into Hugging Face using a personal access token.
  4. Repository Creation: Created a new model repository on Hugging Face.
  5. Download: Downloaded all model files (safetensors, config, etc.) from the source repo.
  6. Upload: Uploaded the files to the new repo using the Hugging Face Hub API (see the sketch after this list).
  7. Documentation: Added this README with citations to original authors.
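
The hosting steps above can be scripted with the huggingface_hub client. The sketch below is illustrative rather than the exact script used; it assumes the repository names shown in this README and a valid access token.

    from huggingface_hub import login, create_repo, snapshot_download, upload_folder

    # Step 3: authenticate with a personal access token (placeholder shown)
    login(token="hf_...")

    # Step 4: create the destination repository (no-op if it already exists)
    create_repo("RayyanAhmed9477/Z-Image-Turbo-Hosted", repo_type="model", exist_ok=True)

    # Step 5: download all model files (safetensors, config, etc.) from the source repo
    local_dir = snapshot_download("ostris/zimage_turbo_training_adapter")

    # Step 6: upload the downloaded files to the new repo
    upload_folder(repo_id="RayyanAhmed9477/Z-Image-Turbo-Hosted", folder_path=local_dir)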

Quantization Techniques

To enable local inference on hardware with limited VRAM, we support various quantization methods:

  • BitsandBytes (Recommended):

    • 8-bit: Reduces memory by ~50%, minimal quality loss.
    • 4-bit: Further reduction to roughly 25% of the original memory, using NF4 or FP4 configurations (a 4-bit sketch follows this list).
    • Code:
      from diffusers import ZImagePipeline, BitsAndBytesConfig  # use diffusers' own BitsAndBytesConfig for diffusion models
      quantization_config = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True
      # Depending on the diffusers version, the config may need to be applied to the
      # transformer component rather than passed at the pipeline level.
      pipe = ZImagePipeline.from_pretrained("RayyanAhmed9477/Z-Image-Turbo-Hosted", quantization_config=quantization_config)
      
  • GGUF Quantization:

    • For extremely low-VRAM setups (as little as ~4GB), use stable-diffusion.cpp with GGUF versions.
    • Download from community repos like jayn7/Z-Image-Turbo-GGUF.
  • FP8 Quantization:

    • 8-bit float for balanced performance.
    • Available in repos like T5B/Z-Image-Turbo-FP8.
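
As referenced under BitsandBytes above, a minimal 4-bit NF4 sketch might look like the following (FP4 is selected with bnb_4bit_quant_type="fp4"); depending on the diffusers version, the config may need to be applied to the transformer component instead of the pipeline:

    import torch
    from diffusers import ZImagePipeline, BitsAndBytesConfig

    # 4-bit NF4 quantization; compute in bfloat16 to limit quality loss
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    pipe = ZImagePipeline.from_pretrained(
        "RayyanAhmed9477/Z-Image-Turbo-Hosted",
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )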

Benchmarks and Comparisons

  • vs. FLUX: Z-Image-Turbo offers faster inference (8 NFEs vs. FLUX's 28-50) with comparable quality for photorealism.
  • vs. SDXL: Better prompt adherence and bilingual support; distilled for efficiency.
  • Performance on RTX 3090:
    • Full precision: 5-10s per image, 12GB VRAM.
    • 8-bit quantized: 6-8s, 6GB VRAM.
    • Quality drop: barely perceptible (under 5%).

Installation Guide

  1. Install dependencies:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    pip install git+https://github.com/huggingface/diffusers
    pip install transformers accelerate bitsandbytes
    
  2. Load and run:

    from diffusers import ZImagePipeline
    import torch

    # Load in bfloat16; see Quantization Techniques for lower-VRAM options
    pipe = ZImagePipeline.from_pretrained("RayyanAhmed9477/Z-Image-Turbo-Hosted", torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    # Turbo variants are distilled for few-step sampling with guidance disabled (guidance_scale=0.0)
    image = pipe(prompt="A futuristic cityscape", height=1024, width=1024, num_inference_steps=9, guidance_scale=0.0).images[0]
    image.save("output.png")
    
  3. For a web UI, wrap the pipeline in a small Gradio app (a minimal sketch follows).
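
A minimal Gradio sketch, assuming the pipe object from step 2 is already loaded; the layout and parameters are illustrative only:

    import gradio as gr

    def generate(prompt: str):
        # Reuses the `pipe` loaded in step 2
        return pipe(prompt=prompt, height=1024, width=1024,
                    num_inference_steps=9, guidance_scale=0.0).images[0]

    demo = gr.Interface(fn=generate, inputs=gr.Textbox(label="Prompt"),
                        outputs=gr.Image(label="Generated image"))
    demo.launch()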

System Requirements

  • GPU: NVIDIA with at least 16GB VRAM (e.g., RTX 3090)
  • RAM: 64GB recommended
  • Software: Python 3.8+, PyTorch 2.0+, diffusers library
  • OS: Windows/Linux with CUDA 11.8+

Performance

  • Inference Time: ~5-10 seconds per 1024x1024 image on RTX 3090
  • Memory Usage: ~12GB (bfloat16), reducible with quantization
  • Throughput: ~0.1-0.2 images/second
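
These figures are approximate; latency and peak VRAM on your own hardware can be measured roughly as sketched below (assumes the pipeline from the installation guide is already loaded on CUDA):

    import time
    import torch

    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    _ = pipe(prompt="A futuristic cityscape", height=1024, width=1024,
             num_inference_steps=9, guidance_scale=0.0)
    torch.cuda.synchronize()
    print(f"Latency: {time.perf_counter() - start:.1f} s")
    print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")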

Troubleshooting

  • Out of Memory: Use quantization or CPU offloading (pipe.enable_model_cpu_offload()).
  • Slow Inference: Enable Flash Attention (pipe.transformer.set_attention_backend("flash")) or compile the transformer (pipe.transformer.compile()).
  • Quality Issues: Increase num_inference_steps or use higher precision.
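
Putting these remedies together, a low-VRAM, speed-tuned setup might look like the sketch below; set_attention_backend and Module.compile assume a recent diffusers and PyTorch release, and the "flash" backend requires flash-attn to be installed:

    import torch
    from diffusers import ZImagePipeline

    pipe = ZImagePipeline.from_pretrained("RayyanAhmed9477/Z-Image-Turbo-Hosted", torch_dtype=torch.bfloat16)

    # Out of memory: stream components between CPU and GPU instead of pipe.to("cuda")
    pipe.enable_model_cpu_offload()

    # Slow inference: switch the attention backend and compile the transformer
    pipe.transformer.set_attention_backend("flash")
    pipe.transformer.compile()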

Citations

Hosted by RayyanAhmed9477, with all credit to the original creators: Tongyi-MAI (Z-Image-Turbo) and ostris (training adapter).

License

Refer to the original repositories for licensing information.


tags:

  • text-to-image
  • diffusion
  • z-image-turbo
  • photorealism
  • quantized