Despite your best efforts, the LLM still isn’t behaving as expected. What should you try next? Do you edit the prompt? Change the model? Fine-tune? Any of these can be a valid option, and there is a sensible order in which to try them.
Principle V: Follow the Prompt-Fix Escalation Ladder
(This is part of an ongoing Principles of AI Engineering series; see the earlier posts for the previous principles.)
When a prompt doesn’t work as expected, I try the following fixes in order of preference:
- Expanding and rephrasing instructions.
- Adding examples.
- Adding an explanation field.
- Using a different model.
- Breaking up a single prompt into multiple prompts.
- Fine-tuning the model.
- Throwing the laptop out the window in frustration.
In some cases, the order of things to try will be different; nonetheless, having a default path saves time and preserves mental capacity for debugging. The list isn’t designed as a rigid ladder, but as a guide rope intended to keep you moving forward.
Now let’s skim over each approach. The first three fall into the bucket of Prompt Engineering and will be covered in more depth in the next chapter. Multi-prompt and Fine-tuning approaches will each have dedicated chapters.
Lightweight Approaches
Adding Instructions
The first thing to try is re-explaining to the LLM what to do via prompt instructions: add clearer directions, rephrase them, or move them around within the prompt.
Don’t hesitate to repeat or reformulate statements multiple times in different parts of the prompt; LLMs don’t get annoyed by repetition. For particularly important directives, add them at the beginning or end of the prompt for maximum effect.
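As a rough sketch (using the openai Python package; the model name, task, and instruction text here are made up for illustration), a prompt that states its key directive up front and repeats it at the end might be assembled like this:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_prompt(message: str) -> str:
    # The most important rule is stated first and repeated at the end.
    key_rule = 'Respond with valid JSON only, in the form {"sentiment": "<label>"}.'
    return (
        f"{key_rule}\n\n"
        "You will receive a customer support message. Classify its sentiment as "
        '"positive", "neutral", or "negative".\n\n'
        f"Remember: {key_rule}\n\n"
        f"Message: {message}"
    )


def classify(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you are working with
        messages=[{"role": "user", "content": build_prompt(message)}],
    )
    return response.choices[0].message.content


print(classify("The package arrived two weeks late and the box was crushed."))
```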
Adding Examples
LLMs respond very well to in-context learning (input-output examples). Examples are particularly important if you are using smaller models, which are not as naturally “intelligent” and therefore need more guidance.
Example of a Prompt with 2-shot Inference (Language Detection):
Detect the language of the text and output it in the JSON format: {"language": "name_of_language"}. If you don't know the language, output "unknown" in the language field.
Example I:
Input: Hello
Output: {"language": "English"}
Example II:
Input: EjnWcn
Output: {"language": "unknown"}
Text: {{text}}
Typically you would use 1-3 examples, though in some cases you could add more; there is evidence that performance keeps improving with a higher number of examples.
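When calling a chat model through an API, the same few-shot examples can also be supplied as alternating user/assistant messages rather than inlined into a single string. A minimal sketch using the openai Python package (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Detect the language of the text and output it in the JSON format: "
    '{"language": "name_of_language"}. If you don\'t know the language, '
    'output "unknown" in the language field.'
)


def detect_language(text: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            # The two examples from the prompt above, expressed as message pairs.
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": '{"language": "English"}'},
            {"role": "user", "content": "EjnWcn"},
            {"role": "assistant", "content": '{"language": "unknown"}'},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content


print(detect_language("Bonjour tout le monde"))  # expected: {"language": "French"}
```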
Adding an Explanation Field
LLMs, like humans, benefit from having to explain their thinking. Add an “explanation” field to your output JSON and the output will usually get better. This will also help you identify why the model is making certain decisions and adjust instructions and examples.
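For instance, the language-detection prompt above could be extended along these lines, with the explanation placed before the answer so the model has to reason first and commit to the label second:
Detect the language of the text and output it in the JSON format: {"explanation": "one sentence on how you identified the language", "language": "name_of_language"}.
Input: Bonjour
Output: {"explanation": "'Bonjour' is a common French greeting.", "language": "French"}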
In cases where the prompt uses internal documentation, ask the LLM to output the sections of the documentation it used to construct its answer. This reduces hallucinations.
You can also attempt to use a reasoning model, which produces this kind of step-by-step thinking on its own before answering.
Changing the Model
Different models excel at different types of tasks. OpenAI’s o3, for example, is strong at analyzing code, but good old 4o tends to produce better writing despite being cheaper per token. Part of the job of an AI engineer is keeping up with the strengths and weaknesses of available models as they are released and updated.
Frequently try out different models for the same task. This experimentation is much faster and safer when you have automated tests and metrics to measure each model’s “fitness” for the task.
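As a sketch of what such a check might look like, assuming the detect_language helper from the examples section and a handful of hand-labeled test cases (the cases and model names below are purely illustrative):

```python
import json

# Hypothetical hand-labeled test cases for the language-detection task.
TEST_CASES = [
    ("Hola, ¿cómo estás?", "Spanish"),
    ("Guten Morgen", "German"),
    ("qwzrtp", "unknown"),
]


def accuracy(model: str) -> float:
    """Fraction of test cases where the model returns the expected language."""
    correct = 0
    for text, expected in TEST_CASES:
        raw = detect_language(text, model=model)
        try:
            language = json.loads(raw).get("language", "")
        except json.JSONDecodeError:
            language = ""  # malformed JSON counts as a miss
        correct += language.lower() == expected.lower()
    return correct / len(TEST_CASES)


for candidate in ["gpt-4o-mini", "gpt-4o"]:
    print(candidate, accuracy(candidate))
```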
Heavyweight Approaches
Every approach until now has been relatively low cost to try. Now we are getting into the heavyweight fixes.
Breaking Up the Prompt
If one prompt can’t get the job done, why not try a system of two or more prompts? This can work effectively in some cases; the two common approaches are:
- Splitting the prompt by area of responsibility.
- Using a new prompt as a guardrail that reviews the output of the previous one.
Both approaches are introduced in the dedicated chapter on multi-prompt systems.
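A minimal sketch of the guardrail variant, again using the openai Python package with placeholder prompts and model names:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    """A single LLM call; the model name is a placeholder."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def answer_with_guardrail(question: str) -> str:
    # Prompt 1: generate a draft answer.
    draft = ask(f"Answer the customer question concisely.\n\nQuestion: {question}")

    # Prompt 2: review the draft instead of generating new content.
    verdict = ask(
        "You are reviewing a draft answer for policy violations and unsupported claims. "
        'Reply with exactly "OK" if the draft is acceptable; otherwise state the problem briefly.\n\n'
        f"Question: {question}\n\nDraft answer: {draft}"
    )
    if verdict.strip() == "OK":
        return draft
    return "Sorry, I can't answer that reliably."  # or retry, or escalate to a human
```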
Fine-Tuning
Fine-tuning is an even heavier approach than using multiple prompts. For most problems, I use it as a last resort.
Why am I hesitant to recommend fine-tuning in most cases? Fine-tuning is fundamentally a machine learning approach: it requires collecting and labeling training data, running and monitoring training jobs, and maintaining the resulting model, which is far more effort than editing a prompt. A sketch of what that workflow looks like follows the list below.
Consider fine-tuning when:
- Other techniques failed to achieve the objective.
- The problem is highly complex and specialized, and default LLM knowledge is insufficient.
- You have a high-volume use case and want to save money by using a lower-end model.
- Low latency is needed, so multiple prompts cannot be executed in sequence.
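To give a sense of the extra machinery involved, here is a rough sketch of the workflow using OpenAI’s fine-tuning API (the data is hypothetical, and the details vary by provider):

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical labeled examples, e.g. collected from production traffic.
examples = [
    {"text": "Hello", "language": "English"},
    {"text": "Bonjour", "language": "French"},
    # ...ideally hundreds more
]

# Fine-tuning data is a JSONL file with one chat transcript per line.
with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Detect the language and answer with JSON."},
                {"role": "user", "content": ex["text"]},
                {"role": "assistant", "content": json.dumps({"language": ex["language"]})},
            ]
        }
        f.write(json.dumps(record) + "\n")

# Upload the dataset and start a fine-tuning job on a lower-end base model.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # an example of a small, fine-tunable model
)
print(job.id)
```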
Conclusion
Hopefully, this article clarifies the order of steps to take when prompts don’t work as intended. First, you would typically try a prompt engineering approach. If that doesn’t work, switch the model and see if that helps. The next step is using multiple interacting prompts. Finally, consider fine-tuning if all other methods have failed.
If you’ve enjoyed this post, subscribe for more.