
Ranking 7b Q8 GGUF for Comprehensive Bulleted Notes With Ollama

by CognitiveTech, February 11th, 2024

Too Long; Didn't Read

Searching for a model that actually beats Mistral 7b Instruct 0.2. Contrary to the rankings on leaderboards, I find none. Review my methods and results, and prove me wrong!

Introduction

If you haven’t read my previous article, PrivateGPT for Book Summarization: Testing and Ranking Configuration Variables, you may find it worth reviewing: there I define terms and explain how I arrived at various practices and beliefs.


If you did read that article, then you’ll know that I’ve spent the past few months refining my process for summarizing books with Large Language Models (LLMs). I measured a series of parameters, including prompt templates, system prompts, and user prompts.


From that preliminary round of model rankings and the data collected on configuration variables, I found mistral-7b-instruct-v0.2.Q8_0.gguf to produce the highest-quality bulleted notes, and I’ve been searching ever since for a model that bests it while still fitting on my 12GB 3060. (Though OpenChat 3.5 0106 is a strong contender.)

I double dare you!! Show me a 7b outperforming Mistral for this task.


For this ranking, I’m using that base of knowledge to assess a variety of leading 7b models. This time I’m using Ollama, as I find it simpler to use and quite performant.


I chose the following models because I found them ranking above Mistral 7b Instruct 0.2 on various leaderboards, or because they were self-proclaimed as the best 7b. (Chat templates tested are shown in parentheses.)


  • openchat-3.5-0106.Q8_0.gguf (OpenChat)

  • snorkel-mistral-pairrm-dpo.Q8_0.gguf (Mistral)

  • dolphin-2.6-mistral-7b.Q8_0.gguf (Mistral)

  • supermario-v2.Q8_0.gguf (ChatML)

  • openhermes-2.5-mistral-7b.Q8_0.gguf (ChatML)

  • openhermes-2.5-neural-chat-7b-v3-1-7b.Q8_0.gguf (ChatML)

  • openhermes-2.5-neural-chat-v3-3-slerp.Q8_0.gguf (ChatML)

  • WestLake-7B-v2-Q8_0.gguf (ChatML, Mistral)

  • MBX-7B-v3-DPO.q8_0.gguf (ChatML, Mistral)

  • neuralbeagle14-7b.q8_0.gguf (ChatML, Mistral)

  • omnibeagle-7b-q8_0.gguf (ChatML, Mistral)


For some models where I wasn’t getting the desired results, I also tested the Mistral template, since most are Mistral-derived, even though they list ChatML as their preferred input.

Bullet Point Notes With Headings and Terms in Bold

Write comprehensive bulleted notes summarizing the following text, with headings, terms, and key concepts in bold.\n\nTEXT:


While GPT3.5 isn’t my personal baseline, it is something of an industry standard, and I would expect it to produce better results than most 7b Q8 GGUFs.


An example response from GPT3.5

While no key concepts or terms are in bold, the headings are, and overall this is quite easy to read compared to blocks of paragraphs. Whether or not terms end up in bold may also depend on the input text itself, whereas a bullet-point summary should always include bolded headings.

I’m Looking for Models That Produce Notes:

  • faster
  • with more detail, less filler
  • with comparable detail at longer context (currently stretching these capabilities at around 2.5k tokens of context)


I see this as a fundamental task for any instruct model. Ideally, developers will train their models to generate these types of ideal bulleted notes. I have tons of data, with some books already processed, but it’s relatively simple to generate these notes for a book (using Mistral 7b Instruct 0.2, with the text semantically chunked by hand into parts below 2.5k tokens each).


A 300-600 page book can usually be done in a single day, including pre- and post-processing.


Eventually, I might experiment with some fine-tuning to try to improve these capabilities myself.

Ollama

Ollama is a command-line tool for running GGUF models, written in Go and built on llama.cpp. That’s right: it’s fast! It has lots of great integrations and supports the OpenAI API format, so you can use it with your favorite apps, front-ends, and frameworks.

Ollama is really easy to use from the command line and provides a simple interface for adding new models, each with its own particular parameters.

Get up and running with large language models, locally.
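
Because Ollama speaks the OpenAI API format, you can point existing OpenAI client code at it. Here is a minimal sketch, not from my pipeline: it assumes Ollama is serving on its default port (11434) and that a model named mistralq8 has already been created, as shown below.

# Minimal sketch: querying a local Ollama server through its
# OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",  # the client requires a key; Ollama ignores it
)

response = client.chat.completions.create(
    model="mistralq8",  # assumes this model was created with `ollama create`
    messages=[{"role": "user", "content": "How is the weather in San Francisco?"}],
)
print(response.choices[0].message.content)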

Ollama Modelfiles

A Modelfile specifies the GGUF location, chat template, and parameters; Ollama uses it to save a copy of your chosen model under your specified configuration. This makes it easy to demo various models without too much fussing around with parameters.


I’ve kept the parameters the same for all models; only the chat template varies. I’ll share the template I’m using for each, so you can see precisely how I configured it, and you can let me know if I’d get better results from any of these models with a differently configured template.


Once you’ve made a Modelfile (as shown below for each model tested) with your chosen LLM and parameters, getting started is easy:

ollama create mistralq8 -f Modelfile
ollama run mistralq8 "How is the weather in San Francisco?"

I’ve left a more detailed walkthrough on GitHub.

The Rankings

Previously, I tried to give each model a score, but it’s really hard to assign a meaningful number. In the future, I may try getting an LLM to rank the summaries. This time, I’ll simply comment on where each model falls short and what I like, without giving a numerical score.


I tested each of the following models on a single book chapter, divided into 6 chunks of 1,900-3,000 tokens each. I’ll share a representative example output from each, and the full data will be available on GitHub, as always.
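
For reference, the per-chunk loop can be as simple as the following sketch, which uses Ollama’s /api/generate endpoint with the prompt from above. The chunks directory, file naming, and model name are illustrative assumptions, not my exact pipeline.

# Sketch of a per-chunk summarization loop against a local Ollama
# server. Assumes chapter chunks saved as chunks/01.txt ... chunks/06.txt
# and a model named "mistralq8" created beforehand with `ollama create`.
from pathlib import Path

import requests

PROMPT = ("Write comprehensive bulleted notes summarizing the following "
          "text, with headings, terms, and key concepts in bold.\n\nTEXT: ")

for chunk in sorted(Path("chunks").glob("*.txt")):
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistralq8",
            "prompt": PROMPT + chunk.read_text(),
            "stream": False,  # wait for the full completion
        },
        timeout=600,
    )
    resp.raise_for_status()
    # non-streaming responses return the generated text under "response"
    chunk.with_suffix(".notes.md").write_text(resp.json()["response"])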

Mistral 7b Instruct 0.2 Q8 GGUF

I’m sure you realize by now that, in my opinion, Mistral has the 7b to beat.

Modelfile

# FROM must point at your downloaded GGUF (the path shown is illustrative)
FROM ./mistral-7b-instruct-v0.2.Q8_0.gguf
TEMPLATE """
<s></s>[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

Mistral 7b Instruct v0.2 Result

I won’t say that Mistral does it perfectly every single time, but more often than not, this is my result. And if you look back to the GPT3.5 response, you might agree that this is better.

7b GOAT?

OpenChat 3.5 0106 Q8 GGUF

I was pleasantly surprised by OpenChat’s 0106. Its makers claim it’s the best 7b model, and it is at least competitive with Mistral 7b.

Modelfile

FROM ./openchat-3.5-0106.Q8_0.gguf
TEMPLATE """
GPT4 Correct User:  {{ .Prompt }}<|end_of_turn|>GPT4 Correct Assistant:
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

OpenChat 3.5 0106 Result

In this small sample, it gave bold headings 4/6 times. Later, I will review it along with any other top contenders using a more detailed analysis.

I like what I see, but it needs a deeper examination

Snorkel Mistral Pairrm DPO Q8 GGUF

Obviously, I’m biased here, as Snorkel was trained on Mistral 7b Instruct 0.2. Regardless, I am cautiously optimistic and look forward to more releases from Snorkel.ai.

Modelfile

FROM ./snorkel-mistral-pairrm-dpo.Q8_0.gguf
TEMPLATE """
<s></s>[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

Snorkel Mistral Pairrm DPO Result

4/6 of these summaries are spot on, but the others contain irregularities, such as extremely long lists of key terms and headings, rather than terms simply bolded inline as part of the summary.

The dark horse of this race.

Dolphin 2.6 Mistral 7B Q8 GGUF

Here is another Mistral derivative that’s well regarded.

Modelfile

FROM ./dolphin-2.6-mistral-7b.Q8_0.gguf
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

Dolphin 2.6 Mistral 7B Result

This is another decent model that’s almost as good as Mistral 7b Instruct 0.2. 3/6 summaries gave proper format and bold headings, another had good format with no bold, but 2/6 were bad form all around.

Bad form

OpenHermes 2.5 Mistral-7B Q8 GGUF

This model is quite popular, both on leaderboards and among “the people” in unaffiliated Discord chats. I want it to be a leader in this ranking, but it’s just not.

Modelfile

FROM ./openhermes-2.5-mistral-7b.Q8_0.gguf
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

OpenHermes 2.5 Mistral Result

3/6 results produced proper structure but no bold text. One of them had both structure and bold text. The other two had big blocks of text and poor structure.

Just not "there", for me.

OpenHermes 2.5 Neural Chat 7b v3.1 7B Q8 GGUF

I also tried a few high-ranking derivatives of OpenHermes 2.5 Mistral to see if I could get better results. Unfortunately, that was not the case.

Modelfile

FROM ./openhermes-2.5-neural-chat-7b-v3-1-7b.Q8_0.gguf
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

OpenHermes 2.5 Neural Chat 7b v3.1 Result

None of these results were desirable.

If I pay you $20 will you do it?

OpenHermes 2.5 Neural-Chat v3.3 Slerp Q8 GGUF

Whatever they did, these derivatives did not improve upon the original.

Modelfile

FROM ./openhermes-2.5-neural-chat-v3-3-slerp.Q8_0.gguf
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

OpenHermes 2.5 neural-chat v3.3 Slerp Result

It’s just getting worse with each new version!

I'm a very sad rater of leading language models.

Super Mario V2 Q8

I wasn’t expecting much from Mario, but it shows some promise. Meanwhile, V3 and V4 are available, but I haven’t found GGUFs for those yet.

Modelfile

FROM ./supermario-v2.Q8_0.gguf
TEMPLATE """
<|im_start|>system
You are a helpful AI writing assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }} <|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

Super Mario V2 Result

Its first result was deceptively good, but each of the following summaries deviated from the desired pattern. I’ll be on the lookout for GGUFs of the newer releases. You can see here that we got blocks of paragraphs with an initial bolded heading, which is not really what I asked for.

Example of what I don't want.


Conclusion

I wish I had better news to share. My ideal headline would be that there is an abundance of leading models producing quality comprehensive bulleted-note summaries, and that it’s hard to choose among them. Unfortunately, that is not the case.


Maybe they outperform Mistral 0.2 at full precision and only trail it in GGUF format? It’s quite likely that none of our existing evals target this type of output, but I would certainly argue it’s a task any leading 7b GGUF model should be able to manage.


Another thing to consider is that Mistral 7b Instruct v0.2 came out soon after Mixtral, amidst all that fanfare, so I think its release slipped under the radar. In fact, many of the “leading” models I’ve looked at are based on Mistral 0.1.


Maybe things will change, and the world will realize that their best models still can’t top Mistral? Then again, maybe all those models are really good at all the other tasks I’m not asking for.

Help Me Help You

I have data, I have a pipeline, and I have an endless need to create bulleted note summaries. If you want to work with me, please reach out.


You are also welcome to check out my GitHub, review the data, and try your own version of this experiment. I’m happy to be proven wrong.

Future Direction

I would like to do a deeper analysis of the best models from this round, which are really the survivors of numerous previous rounds of rankings. That analysis might include further experimentation with prompt templates, user prompts, and system prompts. I’m also interested in trying these same models with an example summary included in the context.