If you haven’t read my previous article, “PrivateGPT for Book Summarization: Testing and Ranking Configuration Variables”, you may find it beneficial to review, as I’ve defined terms there and explained how I arrived at various practices and beliefs.
If you did read that article, then you will be aware that I’ve spent a few months refining my process for summarizing books with Large Language Models (LLMs). I measured a series of parameters, including prompt templates, system prompts, user prompts, and so on.
From that preliminary round of model rankings and data collection on configuration variables, I found mistral-7b-instruct-v0.2.Q8_0.gguf to produce the highest quality bulleted notes, and I have been searching ever since for a model that bests it and still fits on my 12GB 3060. (Though OpenChat 3.5 0106 is a strong contender.)
For this ranking, I’m using that base of knowledge to assess a variety of leading 7b models. This time I’m using Ollama, as I find it simpler to use and quite performant.
I chose the following models because they either rank above Mistral 7b Instruct 0.2 on various leaderboards or are self-proclaimed as the best 7b. (Chat templates tested are in parentheses.)
openchat-3.5-0106.Q8_0.gguf (OpenChat)
snorkel-mistral-pairrm-dpo.Q8_0.gguf (Mistral)
dolphin-2.6-mistral-7b.Q8_0.gguf (Mistral)
supermario-v2.Q8_0.gguf (ChatML)
openhermes-2.5-mistral-7b.Q8_0.gguf (ChatML)
openhermes-2.5-neural-chat-7b-v3-1-7b.Q8_0.gguf (ChatML)
openhermes-2.5-neural-chat-v3-3-slerp.Q8_0.gguf (ChatML)
WestLake-7B-v2-Q8_0.gguf (ChatML, Mistral)
MBX-7B-v3-DPO.q8_0.gguf (ChatML, Mistral)
neuralbeagle14-7b.q8_0.gguf (ChatML, Mistral)
omnibeagle-7b-q8_0.gguf (ChatML, Mistral)
For some models, where I wasn’t getting the desired results, as they are mostly Mistral derived, I tested the Mistral template even though they list ChatML as their preferred input.
Write comprehensive bulleted notes summarizing the following text, with headings, terms, and key concepts in bold.\n\nTEXT:
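To make the structure concrete, here is a minimal sketch of how the instruction above can be joined with a chunk of text and wrapped in a chat template. The two template strings are the Mistral and OpenChat templates shown later in this article. The function names, the way the chunk is appended after "TEXT:", and the plain string replacement (a simplification of Ollama's Go templating) are my assumptions for illustration.

```python
# The instruction used for every model in this ranking.
INSTRUCTION = (
    "Write comprehensive bulleted notes summarizing the following text, "
    "with headings, terms, and key concepts in bold.\n\nTEXT:"
)

# Templates as they appear in the Modelfiles later in this article.
MISTRAL_TEMPLATE = "<s></s>[INST] {{ .Prompt }} [/INST]"
OPENCHAT_TEMPLATE = (
    "GPT4 Correct User: {{ .Prompt }}<|end_of_turn|>GPT4 Correct Assistant:"
)


def build_prompt(chunk: str) -> str:
    """Append one chunk of book text to the instruction.

    Assumption: the chunk follows "TEXT:" after a single space.
    """
    return f"{INSTRUCTION} {chunk.strip()}"


def render(template: str, prompt: str) -> str:
    """Approximate what the model receives by filling in the prompt slot.

    Assumption: simple substitution stands in for Ollama's templating.
    """
    return template.replace("{{ .Prompt }}", prompt)


example = render(MISTRAL_TEMPLATE, build_prompt("Some chapter text..."))
```

Rendering the same prompt through each template is a quick way to see exactly what differs between the Mistral and OpenChat formats.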
While GPT3.5 isn’t my personal baseline, it is something of an industry standard, and I would expect it to produce better results than most 7b Q8 GGUFs.
While there are no key concepts or terms in bold, the headings are in bold, and overall this is quite easy to read compared to blocks of paragraphs. Also, whether or not we find terms in bold may depend on the input text itself, whereas a bullet point summary should always include bolded headings.
I see this as a fundamental task for any Instruct model. Ideally, developers will train their models to generate these types of ideal bulleted notes. I have tons of data, with some books trained already, but it’s relatively simple to generate these notes for a book (Using Mistral 7b Instruct 0.2 with the text semantically chunked, by hand, into parts below 2.5k tokens, each).
If it’s a 300-600 page book, then it can usually be done in a single day, including pre and post-processing.
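Since I chunk by hand, a quick size check on each chunk is handy before sending it to the model. The sketch below is not a real tokenizer: it assumes roughly 4 characters per token, which is only a rule of thumb for English prose, and the function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4.0  # assumption: rough average for English, not a tokenizer
TOKEN_LIMIT = 2500     # the "below 2.5k tokens" target mentioned above


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def oversized_chunks(chunks: list[str], limit: int = TOKEN_LIMIT) -> list[int]:
    """Return the indexes of chunks estimated at or above the limit."""
    return [i for i, chunk in enumerate(chunks) if estimate_tokens(chunk) >= limit]
```

Any chunk this flags is a candidate for splitting at the nearest section break. For exact counts, the model's own tokenizer would be needed.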
Eventually, I might experiment with some fine-tuning in an attempt to improve their capacities myself.
Ollama is a command-line tool for running GGUF models, written in Go and built on llama.cpp. That’s right, it’s fast! It’s got lots of great integrations and supports the OpenAI API format, so you can use it with your favorite apps, front-ends, and frameworks.
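As an illustration of the OpenAI-style interface, here is a minimal sketch that sends one message to a local Ollama server using only the Python standard library. It assumes Ollama is running on its default port (11434) and that a model named mistralq8 has already been created. The helper names are mine.

```python
import json
import urllib.request

# Assumption: Ollama is serving locally on its default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_request(model: str, user_message: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for Ollama."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_message: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_request(model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires a running Ollama server and an existing model:
# print(chat("mistralq8", "How is the weather in San Francisco?"))
```

Because the request shape matches the OpenAI chat format, the same payload should work with most clients that let you override the base URL.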
Ollama is really easy to use from the command-line and provides a simple interface for adding new models having their own particular parameters.
A Modelfile specifies the GGUF location, template, and parameters, and Ollama uses it to save a copy of your chosen model with your specified configuration. This makes it easy to demo various models without too much fussing around with parameters.
I’ve kept the parameters the same for all models except the chat template, but I will share the template I’m using for each, so you can see precisely how I use it. You can let me know if I’d get better results from any of the following models using a different configuration.
Once you’ve made a Modelfile (as shown below for each model tested) with your chosen LLM and parameters, getting started is easy:
ollama create mistralq8 -f Modelfile
ollama run mistralq8 "How is the weather in San Francisco?"
I’ve left a more detailed walkthrough on GitHub.
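The Modelfile snippets in this article show only the template and parameters. A complete Modelfile also needs a FROM line pointing at the GGUF file. Here is a minimal sketch that assembles one, which is handy when generating a file per model. The GGUF path and the helper name are hypothetical, and the parameter values are the ones I use throughout this ranking.

```python
def render_modelfile(gguf_path: str, template: str, params: dict[str, int]) -> str:
    """Assemble Modelfile text: FROM line, TEMPLATE block, PARAMETER lines."""
    lines = [
        f"FROM {gguf_path}",
        'TEMPLATE """',
        template,
        '"""',
    ]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "./mistral-7b-instruct-v0.2.Q8_0.gguf",  # hypothetical local path
    "<s></s>[INST] {{ .Prompt }} [/INST]",
    {"num_ctx": 8000, "num_gpu": -1, "num_predict": 4000},
)

# To use it, write the text to disk and then run `ollama create`:
# with open("Modelfile", "w", encoding="utf-8") as f:
#     f.write(modelfile_text)
```

Swapping only the template argument keeps every other setting identical across models, which is the comparison I’m after here.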
Previously, I tried to give each ranking a score. It’s really hard to give a numerical score. In the future, I think I’ll try to get an LLM to rank the summaries. This time, I’ll just leave a comment on where it falls short, and what I like, without giving a numerical score to each model.
I tested each of the following models on a single book chapter, divided into 6 chunks from 1900-3000 tokens each. I’ll share a representative example output from each, and the full data will be available on GitHub, as always.
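I judged the outputs by eye, but the things I look for (bullet structure, bold headings, bold terms) can be roughly approximated in code. The following is a heuristic sketch, not the method used for this ranking. It assumes Markdown-style output, where bold is written as `**text**` and bullets start with `-`, `*`, `+`, `•`, or a number.

```python
import re

BOLD = re.compile(r"\*\*[^*\n]+\*\*")
BULLET = re.compile(r"^\s*(?:[-*+•]|\d+\.)\s+\S")


def check_format(summary: str) -> dict[str, bool]:
    """Report whether a summary looks like bulleted notes with bold text."""
    lines = [line for line in summary.splitlines() if line.strip()]
    bullet_lines = [line for line in lines if BULLET.match(line)]
    other_lines = [line for line in lines if not BULLET.match(line)]
    return {
        # Assumption: "bulleted" means at least half the lines are bullets.
        "has_bullets": bool(lines) and len(bullet_lines) >= len(lines) / 2,
        # Bold text on a non-bullet line is treated as a bold heading.
        "has_bold_heading": any(BOLD.search(line) for line in other_lines),
        # Bold text inside a bullet is treated as a bold term.
        "has_bold_terms": any(BOLD.search(line) for line in bullet_lines),
    }


def count_bold_headings(summaries: list[str]) -> int:
    """Count how many summaries contain at least one bold heading."""
    return sum(check_format(s)["has_bold_heading"] for s in summaries)
```

Running `count_bold_headings` over the six outputs for a model gives a tally comparable to the "x/6" figures I report below, though it won’t catch subtler problems like overlong lists of key terms.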
I’m sure you realize by now that, in my opinion, Mistral has the 7b to beat.
TEMPLATE """
<s></s>[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
I won’t say that Mistral does it perfectly every single time, but more often than not, this is my result. And if you look back to the GPT3.5 response, you might agree that this is better.
I was pleasantly surprised by OpenChat’s 0106. Here is a model that claims to be the best 7b, and it is at least competitive with Mistral 7b.
TEMPLATE """
GPT4 Correct User: {{ .Prompt }}<|end_of_turn|>GPT4 Correct Assistant:
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
In this small sample, it gave bold headings 4/6 times. Later, I will review it along with any other top contenders using a more detailed analysis.
Obviously, I’m biased, here, as Snorkel was trained on Mistral 7b Instruct 0.2. Regardless, I am cautiously optimistic and look forward to more releases from Snorkel.ai.
TEMPLATE """
<s></s>[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
4/6 of these summaries are spot on, but others contain irregularities such as super long lists of key terms and headings instead of just bolding them inline as part of the summary.
Here is another Mistral derivative that’s well regarded.
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
This is another decent model that’s almost as good as Mistral 7b Instruct 0.2. 3/6 summaries had proper format and bold headings, another had good format with no bold, but 2/6 were bad form all around.
This model is quite popular, both on leaderboards and among “the people” in unassociated Discord chats. I want it to be a leader in this ranking, but it’s just not.
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
3/6 results produce proper structure, but no bold text. One of them has both structure and bold text. The other two had more big blocks of text and poor structure.
I also tried a few high-ranking derivatives of OpenHermes 2.5 Mistral to see if I could get better results. Unfortunately, that was not the case.
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
None of these results were desirable.
Whatever they did, these derivatives did not improve upon the original.
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
It’s just getting worse with each new version!
I wasn’t expecting much from Mario, but it shows some promise. Meanwhile, V3 and V4 are available, but I haven’t found GGUF for those, yet.
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
Its first result was deceptively good. However, each of the following summaries deviated from the desired pattern. I’ll be on the lookout for GGUF of the newer releases. You can see here we got blocks of paragraphs with an initial bolded heading. Not really what I asked for.
I wish I had better news to share. My ideal headline is that there is an abundance of leading models that produce quality output when creating comprehensive bulleted note summaries, and it’s just so hard for me to choose among them. Unfortunately, that is not the case.
Maybe they outperform Mistral 0.2 in full form and only trail it in GGUF format? I think it’s quite likely that none of our existing evals target this type of output, but I would certainly argue that it’s a task that any leading 7b GGUF model should be able to manage.
Another thing to consider is that Mistral 7b Instruct v0.2 came out soon after Mixtral, amidst a bunch of fanfare. I think that release slipped under the radar. In fact, many of the “leading” models I’ve looked at are based on 0.1 Mistral.
Maybe things will change, and the world will realize that their best models still can’t top Mistral? Then again, maybe all those models are really good at all the other tasks I’m not asking for.
I have data, I have a pipeline, and I have an endless need to create bulleted note summaries. If you want to work with me, please reach out.
You are also welcome to check out my GitHub, check the data, and try out your own version of this experiment. I’m happy to be proven wrong.
I would like to do a deeper analysis of the best models of this round, which are really the survivors of numerous previous rounds of rankings. That analysis might include further experimentation with prompt templates, user prompts, and system prompts. I’m also interested in trying these same models with an example summary included in their context.