My personal collection of interesting models I've quantized from the past week (yes, just week)

noneabove1182@sh.itjust.works · 8 months ago

You can get the resulting PPL but that’s only gonna get you a sanity check at best, an ideal world would have something like lmsys’ chat arena and could compare unquantized vs quantized but that doesn’t yet exist

noneabove1182@sh.itjust.works · 8 months ago

My personal collection of interesting models I've quantized from the past week (yes, just week)

noneabove1182@sh.itjust.works · 9 months ago

Stop making me want to buy more graphics cards…

Seriously though this is an impressive result, “beating” gpt3.5 is a huge milestone and I love that we’re continuing the trend. Will need to try out a quant of this to see how it does in real world usage. Hope it gets added to the lmsys arena!

noneabove1182@sh.itjust.works · edit-2 9 months ago

itsme2417/PolyMind: A multimodal, function calling powered LLM webui.

noneabove1182@sh.itjust.works · 9 months ago

Introducing Nomic Embed: A Truly Open Embedding Model

noneabove1182@sh.itjust.works · 9 months ago

You shouldn’t need nvlink, I’m wondering if it’s something to do with AWQ since I know that exllamav2 and llama.cpp both support splitting in oobabooga

noneabove1182@sh.itjust.works · 10 months ago

InternLM2 models llama-fied

noneabove1182@sh.itjust.works · 10 months ago

I run my Nvidia stuff in containers to not have to deal with all the stupid shenanigans

noneabove1182@sh.itjust.works · 10 months ago

The 3060 is a nice cheap one for running okay sized models, but if you can find a way to stretch for a 3090 or a 7900 XTX you’ll be able to run these 33B models with decent quant levels

noneabove1182@sh.itjust.works · 10 months ago

First few quants are up: https://huggingface.co/bartowski/WizardCoder-33B-V1.1-exl2

4.25 should fit nicely into 24gb (3090, 4090)

Smaller sizes still being created, 3.5, 3.0, and 2.4

noneabove1182@sh.itjust.works · 10 months ago

WizardLM/WizardCoder-33B-V1.1 released!

noneabove1182@sh.itjust.works · 11 months ago

Microsoft announces WaveCoder

noneabove1182@sh.itjust.works · 11 months ago

Seems relatively uncensored, willing to answer most questions

noneabove1182@sh.itjust.works · 11 months ago

It’s definitely a little odd… I’m glad they did any kind of official release for 0.2, but yeah information is sorely lacking and would be nice to have more, especially with how revolutionary the previous one was… is this incremental? Is it a huge change? Is it just more fine tuning? Did they start from scratch? We’ll never know 🤷‍♂️

noneabove1182@sh.itjust.works · 11 months ago

Mixture of Experts Explained (Huggingface blog)

noneabove1182@sh.itjust.works · 11 months ago

Mistral releases version 0.2 of their 7B model

noneabove1182@sh.itjust.works · 11 months ago

The only concern I had was my god is it a lot of faith to put in this random twitter, hope they never get hacked lol, but otherwise yes it’s a wonderful idea, would be a good feature for huggingface to speed up downloads/uploads

noneabove1182@sh.itjust.works · 11 months ago

Yeah this seems less focused on creativity, there’s a lot of really good models out there tuned for story telling that will far exceed generalized SoTA models

noneabove1182@sh.itjust.works · 11 months ago

Mistral drops a new magnet download

noneabove1182@sh.itjust.works · 11 months ago

Better finetuning is such an important factor, i feel like the future is all of us having our own personal tunes for models that work well with our lives, and iterating for learning more basically every day is also really helpful, so the more barriers we can take down the better!

noneabove1182@sh.itjust.works · edit-2 11 months ago

Hmm had interesting results from both of those base models, haven’t tried the combo yet, will start some exllamav2 quants to test

What’s it doing well at?

quant link for anyone who may want: https://huggingface.co/bartowski/OpenHermes-2.5-neural-chat-7b-v3-1-7B-exl2

noneabove1182@sh.itjust.works · 11 months ago

Btw there’s a 16k tune available now:

https://huggingface.co/bartowski/Orca-2-13B-16k-exl2

noneabove1182@sh.itjust.works · 11 months ago

I use text-generation-webui mostly. If you’re only using GGUF files (llama.cpp), koboldcpp is a really good option

A lot of it is the automatic prompt formatting, there’s probably like 5-10 specific formats that are used, and using the right one for your model is very important to achieve optimal output. TheBloke usually lists the prompt format in his model card which is handy

Rope and yarn refer to extending the default context of a model through hacky (but functional) methods and probably deserve their own write up

noneabove1182@sh.itjust.works · 11 months ago

Yeah so those are mixed, definitely not putting each individual weight to 2 bits because as you said that’s very small, i don’t even think it averages out to 2 bits but more like 2.56

You can read some details here on bits per weight: https://huggingface.co/TheBloke/LLaMa-30B-GGML/blob/8c7fb5fb46c53d98ee377f841419f1033a32301d/README.md#explanation-of-the-new-k-quant-methods

Unfortunately this is not the whole story either, as they get further combined with other bits per weight, like q2_k is Q4_K for some of the weights and Q2_K for others, resulting in more like 2.8 bits per weight

Generally speaking you’ll want to use Q4_K_M unless going smaller really benefits you (like you can fit the full thing on GPU)

Also, the bigger the model you have (70B vs 7B) the lower you can go on quantization bits before it degrades to complete garbage

noneabove1182@sh.itjust.works · 1 year ago

If you’re using llama.cpp chances are you’re already using a quantized model, if not then yes you should be. Unfortunately without crazy fast ram you’re basically limited to 7B models if you want any amount of speed (5-10 tokens/s)

noneabove1182@sh.itjust.works · 1 year ago

I’m looking forward to trying it today, I think this might make a good RAG model based on the orca 2 paper, but testing will be needed

noneabove1182@sh.itjust.works · 1 year ago

according to the config it looks like it’s only 4096, and they specify in the arxiv that they kept the training data under that value so it must be 4096… i’m sure people will expand it soon like they have with others

noneabove1182@sh.itjust.works · edit-2 1 year ago

Orca 2: Teaching Small Language Models How to Reason

noneabove1182@sh.itjust.works · 1 year ago

got any other articles? that one doesn’t make it out to be all that bad

“Without that catalyst, I don’t see an angle to a near term mutually agreeable merger of Nintendo and MS and I don’t think a hostile action would be a good move, so we are playing the long game.”

doesn’t sound like meddling to me, just wanting to mutually merge, and who wouldn’t want that as a CEO lol

noneabove1182@sh.itjust.works · 1 year ago

They’ve been doing orders of magnitude better in recent years, I’m never thrilled about aggressive vertical integration but of all the massive corporations Microsoft is pretty high up on my personal list for trust (which yeah is a pretty low bar compared to Amazon/Google/etc)