Character talks instead of me and is basically chatting with itself.

by Vulonkul - opened Apr 2, 2023

Discussion

Vulonkul

Apr 2, 2023

I'm not really looking for a tutorial, but anything I try leads to conversations like this:
[ME] [Real user input]
[Char] [Generated Answer]
[ME] [Bot pretending to continue the conversation from the generated answer]
[Char] [etc.]
Until it gets bored. Here is my formatting, since I assume that is the issue.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-2.7b")
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-2.7b")

def get_answer(user_input, history, persona):
    # Combine persona and history into one prompt string
    history.append("You: " + user_input)
    input_text = persona + " " + " ".join(history) + " AI-chan: "

    # Encode the context the generation is conditioned on
    input_ids = tokenizer.encode(input_text, truncation=True, add_special_tokens=True, return_tensors='pt')

    # max_length is dropped: max_new_tokens takes precedence when both are set
    output = model.generate(input_ids=input_ids, pad_token_id=tokenizer.eos_token_id,
        temperature=0.7,
        max_new_tokens=500,
        repetition_penalty=1.2,
        do_sample=True,
        top_k=50,
        top_p=0.8)

    # Assumed final step (not shown in the original post): decode only the new tokens
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

persona = "AI-chan's persona: AI-chan is a cheerful person who loves to make others smile. She is an optimist who loves to spread happiness and positivity wherever she goes."
history = ["AI-chan: What's your name? You: My name is Creator!", "AI-chan: Nice to meet you Creator! You: Nice to meet you too AI-chan!", "AI-chan: What shall we talk about? You: I don't know, you tell me...", "AI-chan: We can talk about anime. You: Ok, what's your favourite anime?"]
user_input = input("You: ")
response = get_answer(user_input, history, persona)

I'd really appreciate it if someone who has faced this issue could say how they dealt with it while I continue to experiment. I will update this if I get an answer!

11b

Pygmalion org Apr 2, 2023

This isn't actually a problem with your prompt. These older models (basically anything other than the dev branch of the 6B) are unsupervised fine-tunes, so they learn to spit out an entire conversation instead of just a single response. A way to work around this is to stop generation as soon as the output reaches "\nYou:", then trim that out before returning the text to the user; see this code for an example.
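As a rough illustration of that workaround (not the code linked above), here is a minimal pure-Python sketch. The helper names `trim_at_stop` and `collect_until_stop` are made up for this example, and the default stop string `"\nYou:"` assumes turns are separated by newlines; if your prompt joins turns with spaces, you would likely need a different stop string.

```python
def trim_at_stop(text, stop_strings=("\nYou:",)):
    """Cut text at the earliest stop string and strip surrounding whitespace.

    Hypothetical helper: the stop strings depend on your prompt format.
    """
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


def collect_until_stop(chunks, stop_strings=("\nYou:",)):
    """Accumulate streamed text chunks, stopping once a stop string appears.

    `chunks` is any iterable of decoded text pieces (an assumption; how you
    obtain them depends on your generation setup).
    """
    text = ""
    for chunk in chunks:
        text += chunk
        # Check the accumulated text, since a stop string can span two chunks.
        if any(stop in text for stop in stop_strings):
            break
    return trim_at_stop(text, stop_strings)


if __name__ == "__main__":
    raw = " We could talk about anime!\nYou: Ok, sure.\nAI-chan: Great!"
    print(trim_at_stop(raw))
```

With Transformers, the early-stop half could be wired into `model.generate` through a custom `StoppingCriteria`, while `trim_at_stop` would be applied to the decoded reply before it is appended to `history`.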

Vulonkul

Apr 2, 2023

Oh that's tremendous help. Thanks a lot! I hope this thread helps others who are struggling with similar issues.

Vulonkul changed discussion status to closed Apr 2, 2023
