Conversation
|
Tried this on a more realistic example and got worse performance. I think I'll need to tune or implement a heuristic for draft models, similar to https://huggingface.co/blog/assisted-generation
|
Added the adaptive heuristic and it does do better, but it is still occasionally slower even with temperature=0; will need to investigate. |
|
Highly appreciated PR. Is it possible to make |
|
@oobabooga I saw, I was looking at the hf implementation as a reference. I could add it as a general |
|
@oobabooga going to merge this now. For updating the draft model or its properties without re-creating the entire |
|
Awesome, thanks @abetlen! |
Uses prompt lookup decoding, but the draft model class can be extended to support almost any existing speculative decoding method.
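For readers unfamiliar with the technique, here is a rough sketch of what prompt lookup decoding does to propose draft tokens: it looks for an earlier occurrence of the trailing n-gram in the context and drafts the tokens that followed it. This is an illustration only, not the code from this PR; the function name `prompt_lookup_draft`, its parameter names and defaults, and the choice to prefer the most recent match are all assumptions.

```python
from typing import List, Sequence


def prompt_lookup_draft(
    input_ids: Sequence[int],
    max_ngram_size: int = 2,   # hypothetical default, for illustration
    num_pred_tokens: int = 10,  # hypothetical default, for illustration
) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the context."""
    ids = list(input_ids)
    n = len(ids)
    # Try longer n-grams first, since a longer match is a stronger signal.
    for ngram_size in range(min(max_ngram_size, n - 1), 0, -1):
        tail = ids[n - ngram_size:]
        # Scan earlier positions from most recent to oldest (an assumption of
        # this sketch); the trailing n-gram itself is excluded from the scan.
        for start in range(n - ngram_size - 1, -1, -1):
            if ids[start:start + ngram_size] == tail:
                # Draft the tokens that followed the earlier occurrence.
                end = start + ngram_size + num_pred_tokens
                return ids[start + ngram_size:end]
    # No earlier occurrence found: propose nothing and fall back to normal decoding.
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 1, 2]
    print(prompt_lookup_draft(context, max_ngram_size=2, num_pred_tokens=2))
```

The target model then verifies the drafted tokens in a single forward pass and keeps the longest accepted prefix, so a wrong draft costs some wasted compute but should not change the output under greedy sampling.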
Server Usage
Python Usage
Performance
This is a very dumb / easy example but it looks like it's working!
With prompt lookup decoding
Without prompt lookup decoding
Closes #675