Linked is the new repo. It’s still in relatively early stages, but it does work.
I’m using it in oobabooga text-gen-ui with the OLD GPTQ format, so not even the new stuff, and on my 3060 I see a genuine >200% increase in speed (roughly 3.5x the tokens/s):
Exllama v1
Output generated in 21.84 seconds (9.16 tokens/s, 200 tokens, context 135, seed 1891621432)
Exllama v2
Output generated in 6.23 seconds (32.10 tokens/s, 200 tokens, context 135, seed 313599079)
Absolutely crazy, and all settings are the same. It’s not just a burst at the front either; it holds up over a longer generation:
Output generated in 22.40 seconds (31.92 tokens/s, 715 tokens, context 135, seed 717231733)
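For anyone who wants to check the numbers, here is a small sketch that recomputes tokens/s from the three log lines quoted above and works out the speedup. The regex and the function name `tokens_per_second` are my own, written against the log format shown in this post, not anything taken from the webui code.

```python
import re

# Matches the "Output generated in ..." lines quoted in this post.
# The pattern is an assumption based on those three lines only.
LOG_PATTERN = re.compile(
    r"Output generated in (?P<seconds>[\d.]+) seconds "
    r"\((?P<rate>[\d.]+) tokens/s, (?P<tokens>\d+) tokens"
)


def tokens_per_second(line: str) -> float:
    """Recompute throughput from the token count and wall time in a log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    return int(match["tokens"]) / float(match["seconds"])


v1_line = "Output generated in 21.84 seconds (9.16 tokens/s, 200 tokens, context 135, seed 1891621432)"
v2_line = "Output generated in 6.23 seconds (32.10 tokens/s, 200 tokens, context 135, seed 313599079)"
v2_long_line = "Output generated in 22.40 seconds (31.92 tokens/s, 715 tokens, context 135, seed 717231733)"

v1_rate = tokens_per_second(v1_line)
v2_rate = tokens_per_second(v2_line)
v2_long_rate = tokens_per_second(v2_long_line)

# Speedup is the ratio of throughputs; the "increase" is that ratio minus one.
speedup = v2_rate / v1_rate
increase_pct = (speedup - 1) * 100

print(f"v1: {v1_rate:.2f} tokens/s, v2: {v2_rate:.2f} tokens/s")
print(f"speedup: {speedup:.2f}x ({increase_pct:.0f}% increase)")
print(f"v2 on the longer run: {v2_long_rate:.2f} tokens/s")
```

That works out to about 3.5x, or roughly a 250% increase, and the 715-token run stays within a fraction of a token per second of the short one.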
And this is using the old format. Exllama v2 also includes a new way to quantize, allowing for much more granular bitrates.
Turbo went with a really cool approach here: you set a target bits per weight, say 3.5, and it automatically assigns different quant levels to different weights to hit that target, keeping more precision in the important weights and sacrificing more on the unimportant ones. Very cool stuff!
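To give a feel for the idea, here is a toy sketch of hitting a target average bits per weight by spending more bits on "important" layers. To be clear, this is NOT exllama v2's actual algorithm: the `Layer` class, the `importance` scores, the `BIT_CHOICES` list and the greedy scoring rule are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    num_weights: int
    importance: float  # hypothetical sensitivity score, higher = more important


# Hypothetical menu of bit widths a layer can be quantized to.
BIT_CHOICES = [2, 3, 4, 5, 6, 8]


def allocate_bits(layers: list[Layer], target_bpw: float) -> dict[str, int]:
    """Greedily upgrade layers while the average stays at or under target_bpw."""
    total_weights = sum(layer.num_weights for layer in layers)
    budget = target_bpw * total_weights          # total bits we may spend
    level = {layer.name: 0 for layer in layers}  # index into BIT_CHOICES
    used = BIT_CHOICES[0] * total_weights        # everyone starts at the lowest width
    if used > budget:
        raise ValueError("target is below the smallest available bit width")

    while True:
        best_layer, best_score, best_cost = None, 0.0, 0
        for layer in layers:
            i = level[layer.name]
            if i + 1 >= len(BIT_CHOICES):
                continue  # already at the highest width
            cost = (BIT_CHOICES[i + 1] - BIT_CHOICES[i]) * layer.num_weights
            if used + cost > budget:
                continue  # this upgrade would blow the budget
            # Toy scoring rule: benefit shrinks as a layer gets more bits,
            # and is divided by how many extra bits the upgrade costs.
            score = layer.importance / (2 ** BIT_CHOICES[i]) / cost
            if score > best_score:
                best_layer, best_score, best_cost = layer, score, cost
        if best_layer is None:
            break  # no affordable upgrade left
        level[best_layer.name] += 1
        used += best_cost

    return {name: BIT_CHOICES[i] for name, i in level.items()}


# Made-up example model with three layers.
layers = [
    Layer("attn", num_weights=1000, importance=5.0),
    Layer("mlp", num_weights=3000, importance=1.0),
    Layer("embed", num_weights=1000, importance=0.2),
]
bits = allocate_bits(layers, target_bpw=3.5)
average_bpw = sum(bits[l.name] * l.num_weights for l in layers) / sum(
    l.num_weights for l in layers
)
print(bits, f"average: {average_bpw:.2f} bits per weight")
```

The point is just that a fractional target like 3.5 falls out naturally as an average over layers quantized at different whole-number widths. How the real quantizer measures importance and picks its levels is in the repo.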
Get your latest oobabooga webui and start playing!
https://github.com/oobabooga/text-generation-webui
https://github.com/noneabove1182/text-generation-webui-docker
Some models in the new format from turbo: https://huggingface.co/turboderp
I’m really interested in this. Is Exllama2 a separately trained variant of Llama2? The use restrictions of Llama2 have always irked me, and a similarly performing open variant of that architecture is very intriguing.

Never mind. This is a processor that runs the model, not the model itself. My bad.