I’m just a hobbyist in this topic, but I would like to share my experience with using local LLMs for very specific generative tasks.
Predefined formats
With prefixes, we can essentially write the start of the LLM’s response ourselves, without the model actually generating it. For example, when we want it to respond with bullet points, we can set the prefix to be -
(A dash and space). If we want JSON, we can use ` ` `json\n{
as a prefix, to make it think that it started a JSON markdown code block.
If you want a specific order in which the JSON is written, you can set the prefix to something like this:
` ` `json
{
"first_key":
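As a rough sketch, prefixes like these can be built with a small helper. The function name `json_prefix` and its parameters are made up for illustration and are not part of any library:

```python
import json

FENCE = "`" * 3  # three backticks, built this way so this example can sit inside a code block

BULLET_PREFIX = "- "  # a dash and a space, for bullet point answers


def json_prefix(first_key: str, fenced: bool = True) -> str:
    """Build a prefix that opens a JSON object and pins its first key.

    Hypothetical helper: `fenced` controls whether the prefix also opens
    a markdown code block tagged as json.
    """
    head = f"{FENCE}json\n" if fenced else ""
    # json.dumps quotes and escapes the key so the prefix stays valid JSON.
    return f"{head}{{\n{json.dumps(first_key)}:"
```

The model then continues right after the colon, so the first key is guaranteed to be the one you chose.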
Translation
Let’s say you want to translate a given text. Normally you would prompt a model like this
Translate this text into German:
` ` `plaintext
[The text here]
` ` `
Respond with only the translation!
Or maybe you would instruct it to respond using JSON, which may work a bit better. But what if it gets the JSON key wrong? What if it adds a little ramble in front of or after the translation? That’s where prefixes come in!
You can leave the prompt exactly as is, maybe instructing it to respond in JSON:
Respond in this JSON format:
{"translation":"Your translation here"}
Now, you can pretend that the LLM already responded with part of the message, which I will call a prefix. The prefix for this specific use case could be this:
{
"translation":"
Now the model thinks that it already wrote these tokens, and it will continue the message from right where it thinks it left off. The LLM might generate something like this:
Es ist ein wunderbarer Tag!"
}
To get the complete message, simply combine the prefix and the generated text to result in this:
{
"translation":"Es ist ein wunderbarer Tag!"
}
To minimize inference costs, you can add "}
and "\n}
as stop tokens, to stop the generation right after it has finished the JSON entry.
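Here is a minimal sketch of the "combine prefix and generated text" step. It assumes that when a stop token fires, the stop text itself is not part of the returned output, so it has to be added back before parsing; check how your backend behaves. The name `assemble` is hypothetical:

```python
import json

PREFIX = '{\n"translation":"'


def assemble(prefix: str, generated: str, closing: str = "") -> dict:
    """Join prefix, model output and an optional closing string, then parse.

    `closing` re-adds whatever a stop token cut off (assumption: the
    backend does not include the stop text in its output).
    """
    return json.loads(prefix + generated + closing)


# Case 1: no stop token, the model closed the JSON object itself.
full = assemble(PREFIX, 'Es ist ein wunderbarer Tag!"\n}')

# Case 2: generation was stopped at '"\n}', so we close the object ourselves.
cut = assemble(PREFIX, "Es ist ein wunderbarer Tag!", '"\n}')
```

Parsing the result right away also tells you immediately if the model produced something that is not valid JSON, because `json.loads` raises an error.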
Code completion and generation
What if you have an LLM that wasn’t trained on code completion tokens? We can get a similar effect to the trained tokens using an instruction and a prefix!
The prompt might be something like this
` ` `python
[the code here]
` ` `
Look at the given code and continue it in a sensible and reasonable way.
For example, if I started writing an if statement,
determine if an else statement makes sense, and add that.
And the prefix would then be the start of a code block and the given code like this
` ` `python
[the code here]
This way, the LLM thinks it already rewrote everything you did, but it will now try to complete what it has written. We can then add \n` ` `
as a stop token to make it only generate code and nothing else.
This approach for code generation may be more desirable, as we can tune its completion using the prompt, like telling it to use certain code conventions.
Simply giving the model a prefix of ` ` `python\n
makes it start generating code immediately, without any preamble. Again, adding the stop keyword \n` ` `
makes sure that no postamble is generated.
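The code completion setup above could be put together roughly like this. The function name `completion_request` and the shape of the returned dict are assumptions for illustration; adapt them to whatever your backend expects:

```python
FENCE = "`" * 3  # three backticks, built this way so this example can sit inside a code block

INSTRUCTION = "Look at the given code and continue it in a sensible and reasonable way."


def completion_request(code: str, lang: str = "python") -> dict:
    """Build the user prompt, the assistant prefix and the stop token.

    Hypothetical helper: the prefix repeats the opening fence and the
    user's code, so the model picks up exactly where the code ends.
    """
    block_start = f"{FENCE}{lang}\n{code}"
    user_prompt = f"{block_start}\n{FENCE}\n{INSTRUCTION}"
    return {
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": block_start},  # the prefix
        ],
        # Stop as soon as the model tries to close the code block.
        "stop": ["\n" + FENCE],
    }
```

Everything the model returns is then new code that can be appended directly to what you already wrote.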
Using this in ollama
Using this “technique” in ollama is very simple, but you must use the /api/chat endpoint and cannot use /api/generate. Simply append the start of a message to the conversation passed to the model like this:
"messages":[
{"role":"user", "content":"Why is the sky blue?"},
{"role":"assistant", "content":"The sky is blue because of"}
]
It’s that simple! Now the model will complete the message with the prefix you gave it as “content”.
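For reference, a sketch of a full request using only the Python standard library. It assumes ollama is running on its default local address and that the reply contains only the continuation, not the prefix. The helper names are made up:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_payload(model, user_prompt, prefix, stop=None):
    """Build the /api/chat body with the prefix as a trailing assistant message."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": prefix},
        ],
        "stream": False,  # ask for one complete JSON reply
    }
    if stop:
        payload["options"] = {"stop": list(stop)}
    return payload


def continue_message(model, user_prompt, prefix, stop=None):
    """Send the request and return the prefix plus the model's continuation."""
    data = json.dumps(build_payload(model, user_prompt, prefix, stop)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    # Assumption: the reply holds only the newly generated text.
    return prefix + reply["message"]["content"]
```

Calling `continue_message("llama3", "Why is the sky blue?", "The sky is blue because of")` would then return the completed sentence, where `"llama3"` stands in for whatever model you have pulled.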
Be aware!
There is one pitfall I have noticed with this. You have to be aware of what the prefix gets tokenized to. Because we are setting the start of the message ourselves, it might not be tokenized the way the model would have tokenized it. This can confuse the LLM and make it generate one space too many or too few. This is mostly not an issue though, as
What do you think? Have you used prefixes in your generations before?
Very good idea. I mean there are frameworks for programmers to do exactly that, like LangChain. But I also end up doing this manually. I use Kobold.cpp and most of the time I just switch it to Story mode and I get one large notebook / text area. I’ll put in the questions, prompts, special tokens if it’s an instruct-tuned variant and start the bullet point list for it. Or click on generate after I’ve already typed in the chapter names or a table of contents. Or opened the code block with the proper markdown. So pretty much like what you outlined. It’s super useful to guide the LLM into the proper direction. Or steer it back on track with a small edit in its output, and a subsequent call to generate from there.
Could you please tell me why you chose kobold.cpp over llama.cpp? I only ever used llama.cpp so I’d like to hear from the other side!
I really like the idea of letting an LLM perform tool calls in the middle of the generation.
Like, we instruct the LLM to say what it will do, then to put the tool call into <tool></tool> tags. Then we could set </tool> as a stop keyword and insert the results into its message.
I have tried this before, but it tends to not believe what is in its own message. It tends to see the output of the tool call and go
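A rough sketch of that loop, with the model passed in as a plain function so it stays backend independent. It assumes `generate(prompt, prefix)` continues the assistant message from `prefix` and stops before `</tool>`. The `<result>` tag and all names here are made up for illustration:

```python
from typing import Callable


def run_with_tool(generate: Callable[[str, str], str], prompt: str,
                  tool: Callable[[str], str], max_calls: int = 3) -> str:
    """Let the model call a tool mid-message by stopping at </tool>."""
    message = ""  # the assistant message so far, reused as the prefix
    for _ in range(max_calls + 1):
        chunk = generate(prompt, message)
        message += chunk
        start = chunk.rfind("<tool>")
        if start == -1:
            return message  # the model finished without another tool call
        call = chunk[start + len("<tool>"):]
        # Close the tag ourselves and insert the result into the message.
        message += f"</tool>\n<result>{tool(call)}</result>\n"
    return message


# A scripted stand-in for a real model, just to show the control flow.
def fake_generate(prompt: str, prefix: str) -> str:
    if "<result>" not in prefix:
        return "Let me calculate that. <tool>2+2"
    return "The answer is 4."


def add_tool(call: str) -> str:
    return str(sum(int(part) for part in call.split("+")))


final_message = run_with_tool(fake_generate, "What is 2+2?", add_tool)
```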
Don't believe what I just said, I made that up
, even though LLMs are infamous for hallucinating…
Kobold.cpp uses llama.cpp under the hood. It just adds a few extras, a webserver and a user interface. Plus some backwards compatibility for older model file formats, and it’s relatively easy to install. But the project builds upon llama.cpp and uses that same code for inference.
This is interesting. Need to check if this is implemented in Open-WebUI.
But I think the thing which I’m hoping for most (in open-webui), is the support of draft models for speculative decoding. This would be really nice!
Edit: it seems it’s not implemented in ollama yet
This prefix feature is already in Open Web UI! There is the “Playground”, which lets you define any kind of conversation and also let it continue a message you started writing for it. The playground is really useful.
What exactly do you mean by “draft models”? I have never heard of that speculative decoding thing…
As you probably know, an LLM works iteratively: you give it instructions and it “auto-completes”, one token at a time. Every time you want to generate the next token, you have to perform the whole inference task, which is expensive.
However, verifying whether a next token is the correct one can be cheap, because you can do it in parallel. For instance, take the sentence "The answer to your query is that the sky is blue due to some physical concept". If you wanted to check whether your model would output each one of those tokens, you would split the sentence after every token, batch verify the next token for every split, and see whether it matches the sentence.
Speculative decoding is the process where a cheap and efficient draft model is used to generate a tentative output, which is then verified in parallel by the expensive model. Because the cheap draft model is many times quicker, you can get a sample output very fast and batch verify the output with the expensive model. This saves a lot of computational time because all the parallel verifications require a single forward pass. And the best part is that it has zero effect on the output quality of the expensive model. The cost is that you now have to run two models, but the smaller one may be a tenth of the size, so it possibly runs 10x faster. The closer the draft model output matches the expensive model output, the higher the inference speed gain potential.
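To make the idea concrete, here is a toy sketch of greedy speculative decoding. The "models" are plain Python functions over word lists, and the verification runs as a loop instead of a real batched forward pass, so it only shows the accept/reject logic, not the speedup. All names are made up for illustration:

```python
from typing import Callable, List, Tuple

Model = Callable[[List[str]], str]  # maps a context to its greedy next token

SENTENCE = "the sky is blue due to scattering".split()


def target_model(context: List[str]) -> str:
    """Stand-in for the expensive model: always continues SENTENCE."""
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def draft_model(context: List[str]) -> str:
    """Stand-in for the cheap model: agrees with the target except on one word."""
    token = target_model(context)
    return "green" if token == "blue" else token


def speculative_step(target: Model, draft: Model, context: List[str], k: int) -> List[str]:
    """One round: the draft proposes k tokens, the target verifies them."""
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    accepted: List[str] = []
    # A real implementation checks all positions in one batched forward pass.
    for i in range(k):
        expected = target(context + proposal[:i])
        accepted.append(expected)
        if expected != proposal[i]:
            return accepted  # first mismatch: keep the target's token and stop
    # Every draft token matched, so the target's next token comes for free.
    accepted.append(target(context + proposal))
    return accepted


def generate(target: Model, draft: Model, k: int = 4) -> Tuple[List[str], int]:
    """Decode until <eos>; returns the tokens and the number of rounds used."""
    out: List[str] = []
    rounds = 0
    while "<eos>" not in out:
        out += speculative_step(target, draft, out, k)
        rounds += 1
    return out[: out.index("<eos>") + 1], rounds


tokens, rounds = generate(target_model, draft_model)
```

The output is identical to what the target model would produce on its own, because every token is either confirmed or replaced by the target. The draft model only changes how many rounds of the expensive model are needed.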