Instructions for using HuggingFaceH4/starchat-beta with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceH4/starchat-beta with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/starchat-beta")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/starchat-beta")
A full generation example using StarChat's dialogue format appears after these lists.
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceH4/starchat-beta with vLLM:
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HuggingFaceH4/starchat-beta"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/starchat-beta",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

Use Docker
docker model run hf.co/HuggingFaceH4/starchat-beta
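The server started above speaks the OpenAI-compatible API, so it can also be called from Python rather than curl. A minimal sketch using the official openai client (an assumption on my part, not part of the page: it requires openai>=1.0, and the api_key value is a placeholder since a local server does not check it):

from openai import OpenAI  # pip install openai

# Point the client at the local vLLM server (port 8000 by default).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="HuggingFaceH4/starchat-beta",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(response.choices[0].text)

The same client works against the SGLang server below; only the base_url port changes (30000 instead of 8000).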
- SGLang
How to use HuggingFaceH4/starchat-beta with SGLang:
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/starchat-beta" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/starchat-beta",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/starchat-beta" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "HuggingFaceH4/starchat-beta",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

- Docker Model Runner
How to use HuggingFaceH4/starchat-beta with Docker Model Runner:
docker model run hf.co/HuggingFaceH4/starchat-beta
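As referenced in the Transformers section above, here is a minimal end-to-end generation sketch. The <|system|>/<|user|>/<|assistant|> turns closed by <|end|> follow the dialogue template stated on the model card (double-check against the current card); the dtype, device settings, and sampling parameters are illustrative assumptions, not tuned recommendations.

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat-beta",
    torch_dtype=torch.bfloat16,  # assumption: fits your GPU; see the 4-bit discussion below otherwise
    device_map="auto",
)

# StarChat-beta expects its dialogue template, with each turn terminated by <|end|>.
prompt = "<|system|>\n<|end|>\n<|user|>\nWrite a Python function that reverses a string.<|end|>\n<|assistant|>"

output = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    eos_token_id=pipe.tokenizer.convert_tokens_to_ids("<|end|>"),  # stop at end-of-turn
)
print(output[0]["generated_text"])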
Inference VRAM Size
Hello,
Thank you for such a tremendous contribution! I have tried running inference on my RTX 4090 (24 GB VRAM) to no avail, so I used TheBloke's GGML and GPTQ conversions, which work great but are very slow. That is in direct contrast to your StarChat playground, which is lightning fast...
I would like to try inference with this repo's native weights on a GPU to get somewhere in the ballpark of your playground's speed, but how many GB do I need? Do I need to rent something like an A100 80GB?
Ditto. I have the same question.
I'm running it on an A100 80GB, and most of the time it uses 30GB of VRAM, peaking at 48GB.
@valdanito thank you
If you want to save money, you can load it in 4-bit mode; then you need only about 10GB of GPU RAM.
More info: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-beta",
    load_in_4bit=True,
    device_map="auto",
)
@Maxrubino what versions of related quantization dependencies are you running? I get this exception on the last line:
TypeError: GPTBigCodeForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit'
transformers==4.30.2
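One thing to try (a hedged sketch, not a confirmed fix): that TypeError usually means the load_in_4bit kwarg fell through to the model's constructor instead of being consumed by the quantization path, which points at missing or outdated bitsandbytes/accelerate installs. Upgrading those and passing an explicit BitsAndBytesConfig, the more recent idiom for the same thing, may help:

# pip install -U transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Explicit 4-bit quantization config instead of the bare load_in_4bit kwarg.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-beta",
    quantization_config=bnb_config,
    device_map="auto",
)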