hyx21 committed
Commit 1d5249b · verified · 1 Parent(s): dd100b9

Update README.md

Files changed (1):
  1. README.md +9 -3

README.md CHANGED
@@ -53,7 +53,7 @@ MiniCPM 4 is an extremely efficient edge-side large model that has undergone eff
 
 ## Usage
 
-### Inference with [vLLM](https://github.com/vllm-project/vllm)
+### Using Eagle Speculative Decoding with [vLLM](https://github.com/vllm-project/vllm)
 For now, you need to install the latest version of vLLM.
 ```
 pip install -U vllm \
@@ -61,7 +61,7 @@ pip install -U vllm \
     --extra-index-url https://wheels.vllm.ai/nightly
 ```
 
-Then you can inference MiniCPM4-8B with vLLM:
+Then you can use Eagle Speculative Decoding to run inference on MiniCPM4-8B with vLLM. Use `speculative_config` to set the draft model.
 ```python
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
@@ -77,7 +77,13 @@ llm = LLM(
     trust_remote_code=True,
     max_num_batched_tokens=32768,
     dtype="bfloat16",
-    gpu_memory_utilization=0.8,
+    gpu_memory_utilization=0.8,
+    speculative_config={
+        "method": "eagle",
+        "model": "openbmb/MiniCPM4-8B-Eagle-vLLM",
+        "num_speculative_tokens": 2,
+        "max_model_len": 32768,
+    },
 )
 sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
 
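
For reference, here is the example as it reads after this commit, stitched into one runnable sketch. The hunks above elide the target model id and the generation call, so the model id `openbmb/MiniCPM4-8B`, the chat-template step, and the demo prompt below are assumptions; the `LLM(...)` arguments and `speculative_config` values are copied verbatim from the diff.

```python
# Hedged sketch: reassembles the post-commit README example end to end.
# Assumed (not shown in this diff): the target model id, the chat-template
# call, and the demo prompt. Everything else is copied from the hunks above.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "openbmb/MiniCPM4-8B"  # assumption: target model id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_batched_tokens=32768,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    # Draft-model settings added by this commit: vLLM loads the Eagle head
    # as the draft model and verifies its proposed tokens with the target.
    speculative_config={
        "method": "eagle",
        "model": "openbmb/MiniCPM4-8B-Eagle-vLLM",
        "num_speculative_tokens": 2,
        "max_model_len": 32768,
    },
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)

# Assumption: format a single-turn chat prompt with the model's template.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    tokenize=False,
    add_generation_prompt=True,
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

With `"num_speculative_tokens": 2`, the Eagle draft head proposes two tokens per decoding step and the target model accepts or rejects them in a single verification pass, trading a small verification overhead for fewer sequential decoding steps.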