Instructions to use HuggingFaceM4/idefics-9b-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceM4/idefics-9b-instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceM4/idefics-9b-instruct")

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-9b-instruct")
model = AutoModelForImageTextToText.from_pretrained("HuggingFaceM4/idefics-9b-instruct")
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceM4/idefics-9b-instruct with vLLM:
Install with pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceM4/idefics-9b-instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceM4/idefics-9b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker
```shell
docker model run hf.co/HuggingFaceM4/idefics-9b-instruct
```
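The vLLM curl call can also be scripted with Python's standard library. This is a minimal sketch: the endpoint and payload mirror the curl example above, and nothing is sent until you uncomment the `urlopen` call (which assumes the server from the previous step is running locally).

```python
import json
from urllib import request

# Build the same OpenAI-compatible completion request as the curl example above.
payload = {
    "model": "HuggingFaceM4/idefics-9b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}
req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment once the vLLM server is up:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```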
- SGLang
How to use HuggingFaceM4/idefics-9b-instruct with SGLang:
Install with pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "HuggingFaceM4/idefics-9b-instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceM4/idefics-9b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "HuggingFaceM4/idefics-9b-instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HuggingFaceM4/idefics-9b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use HuggingFaceM4/idefics-9b-instruct with Docker Model Runner:
```shell
docker model run hf.co/HuggingFaceM4/idefics-9b-instruct
```
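The Transformers snippet above stops at loading the model. For context, IDEFICS takes interleaved image-and-text prompts; here is a sketch of the structure the processor expects (the image URL and wording are placeholders, and the generate step is shown commented out since it requires the 9B weights):

```python
# IDEFICS prompts interleave text and images in a single list per example.
# The image URL below is a placeholder; images may also be PIL.Image objects.
prompts = [
    [
        "User: What is in this image?",
        "https://example.com/cat.jpg",
        "<end_of_utterance>",
        "\nAssistant:",
    ],
]

# With the processor and model loaded as above, inference would look like:
# inputs = processor(prompts, return_tensors="pt").to(model.device)
# generated_ids = model.generate(**inputs, max_new_tokens=50)
# print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```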
No output generated with sample code on non-quantised model
Hi and thanks for this brilliant model.
I have been running your Colab notebook and it works like a charm on Google Colab. I have also tried to reproduce it on my server with 8x NVIDIA RTX A6000. With the exact same code from the notebook I receive the exact same output:
Question: What's on the picture? Answer: Kittens.
But whatever I do, if I use the unquantised idefics-9b or idefics-9b-instruct instead of the quantised model, I only ever receive:
Question: What's on the picture? Answer:
The only difference between the Colab code and my code is the removal of quantization_config=bnb_config from the IdeficsForVisionText2Text.from_pretrained(...) parameter list. A colleague found their own way of running the model with the code you provided and independently reproduced the exact same issue (Question: What's on the picture? Answer:). I've tried different GPUs and different servers, but without the quantised model I am unable to produce any output. The model loads into memory and is accessed during inference; it just does not generate, return, or display any new tokens (I have also increased max_new_tokens to 50 and tried other prompts, such as the Pokémon example).
Any help would be appreciated.
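For readers comparing the two paths: the Colab's bnb_config is a bitsandbytes quantization config. A sketch of the working (quantised) versus failing (unquantised) loads is below; the config values are typical 4-bit settings and an assumption here, not copied from the notebook.

```python
import torch
from transformers import BitsAndBytesConfig, IdeficsForVisionText2Text

# A typical bitsandbytes 4-bit setup for IDEFICS demos
# (assumed values; check the notebook you are running).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Quantised load (the working case):
# model = IdeficsForVisionText2Text.from_pretrained(
#     "HuggingFaceM4/idefics-9b-instruct",
#     quantization_config=bnb_config,
#     device_map="auto",
# )

# Unquantised load (the failing case) simply drops quantization_config;
# loading in half precision is the usual alternative:
# model = IdeficsForVisionText2Text.from_pretrained(
#     "HuggingFaceM4/idefics-9b-instruct",
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# )
```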
Hi @Pwicke ,
That indeed does not sound right.
Could you say more about your environment? In particular, your transformers and tokenizers versions?
I'll try to reproduce the error.
Thank you for your response.
- accelerate 0.24.0.dev0
- bitsandbytes 0.41.1
- nvidia-cublas-cu12 12.1.3.1
- python 3.10.12
- sentencepiece 0.1.99
- tokenizers 0.14.1
- torch 2.1.0
- transformers 4.35.0.dev0
Upgrading transformers to 4.37 can solve this problem.