@Kerfuffle

Kerfuffle@sh.itjust.works · 1 year ago

Smaller models (7B down to 350M) can handle long conversations better

What are you basing that on? I mean, it is true there are more small models that support very long context lengths than big ones, but it’s not really because smaller models can handle them better, but because training big models takes a lot more resources. So people usually do that kind of fine-tuning on small models since training a 70B to 32K would take a crazy amount of compute and hardware.

If you could afford fine tuning it though, I’m pretty sure the big model has at least the same inherent capabilities. Usually larger models deal with ambiguity and stuff better, so there’s a pretty good chance it would actually do better than the small model assuming everything else was equal.

Kerfuffle@sh.itjust.works · 1 year ago

Definitely very interesting, but figuring out what layers to skip is a relatively difficult problem.

I really wish they’d shown an example of the optimal layers to skip for the 13B model. Like the paper notes, using the wrong combination of skipped layers can be worse overall. So it’s not just about how many layers you skip, but which ones as well.

It would also be interesting to see if there are any common patterns in which layers are most skippable. It probably would be model architecture specific but it would be pretty useful if you could calculate the optimal skip pattern for say a 3B model and then translate that to a 30B with good/reasonable results.

Kerfuffle@sh.itjust.works · 1 year ago

So I have never once ever considered anything produced by a LLM as true or false, because it cannot possibly do that.

You’re looking at this in an overly literal way. It’s kind of like if you said:

Actually, your program cannot possibly have a “bug”. Programs are digital information, so it’s ridiculous to suggest that an insect could be inside! That’s clearly impossible.

“Bug”, “hallucination”, “lying”, etc are just convenient ways to refer to things. You don’t have to interpret them as the literal meaning of the word. It also doesn’t require anything as sophisticated as a LLM for something like a program to “lie”. Just for example, I could write a program that logs some status information. It could log that everything is fine and then immediately crash: clearly everything isn’t actually fine. I might say something about the program being “lying”, but this is just a way to refer to the way that what it’s reporting doesn’t correspond with what is factually true.

People talk so often about how they “hallucinate”, or that they are “inaccurate”, but I think those discussions are totally irrelevant in the long term.

It’s actually extremely relevant in terms of putting LLMs to practical use, something people are already doing. Even when talking about plain old text completion for something like a phone keyboard, it’s obviously relevant if the completions it suggests are accurate.

So text prediction is saying when A, high probability that then B.

This is effectively the same as “knowing” A implies B. If you get down to it, human brains don’t really “know” anything either. It’s just a bunch of neurons connected up, maybe reaching a potential and firing, maybe not, etc.

(I wouldn’t claim to be an expert on this subject but I am reasonably well informed. I’ve written my own implementation of LLM inference and contributed to other AI-related projects as well, you can verify that with the GitHub link in my profile.)

Kerfuffle@sh.itjust.works · 1 year ago

If you’re using llama.cpp, some ROCM stuff recently got merged in. It works pretty well, at least on my 6600. I believe there were instructions for getting it working on Windows in the pull.

Kerfuffle@sh.itjust.works · 1 year ago

Is there any reason why support for loading both formats cannot be included within GGML/llama.cpp directly?

It could be (and I bet koboldcpp and maybe other projects will take that route). There absolutely is a disadvantage to dragging around a lot of legacy stuff for compatibility. llama.cpp/ggml’s approach has pretty much always been to favor rapid development over compatibility.

As I understand it, the new format is basically the same as the old format

I’m not sure that’s really accurate. There are significant differences in how the model vocabulary is handled, for instance.

Even if it was true right now, in the very first version of GGUF that is merged it’ll likely be less true as GGUF evolves and the stuff it enables starts getting used more. Having to maintain compatibility with the GGML stuff would make iterating on GGUF and adding new features more difficult.

Kerfuffle@sh.itjust.works · 1 year ago

I was able to contribute a script (convert-llama-ggmlv3-to-gguf.py) to convert GGML models to GGUF so you can potentially still use your existing models. Ideally it should be used with the metadata from the original model since converting vocab from GGML to GGUF without that is imperfect. (By metadata I mean stuff like the HuggingFace config.json, tokenizer.model, etc.)

Kerfuffle@sh.itjust.works · 1 year ago

which would mean going forward no more breaking changes when new formats are worked on

I definitely wouldn’t count on that.

But it does make it much easier to do some changes, like adding/changing model specific fields which previously would have required a format change. Stuff like changing or dropping support for existing quantizations would also break stuff independent of the model format itself.

Kerfuffle@sh.itjust.works · 1 year ago

Ah, I see. Wouldn’t it be pretty easy to determine if MPS is actually the issue by trying to run the model with the non-MPS PyTorch version? Since it’s a 7B model, CPU inference should be reasonably fast. If you still get the memory leak, then you’ll know it’s not MPS at fault.

Kerfuffle@sh.itjust.works · 1 year ago

You can find the remote code in the huggingface repo.

Ahh, interesting.

I mean, it’s published by a fairly reputable organization so the chances of a problem are fairly low but I’m not sure there’s any guarantee that the compiled Python in the pickle matches the source files there. I wrote my own pickle interpreter a while back and it’s an insane file format. I think it would be nearly impossible to verify something like that. Loading a pickle file with the safety stuff disabled is basically the same as running a .pyc file: it can do anything a Python script can.

So I think my caution still applies.

It could also be PyTorch or one of the huggingface libraries, since mps support is still very beta.

From their description here: https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md#model

It doesn’t seem like anything super crazy is going on. I doubt the issue would be in Transformers or PyTorch.

I’m not completely sure what you mean by “MPS”.

Kerfuffle@sh.itjust.works · 1 year ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.

Well, you said you sometimes did that so it’s not entirely clear what conclusions you came to are based on deterministic sampling and which aren’t. Anyway, like I said, it’s not just temperature that may be causing issues.

I want to be clear I’m not criticizing you personally or anything like that. I’m not trying to catch you out and you don’t have to justify anything about your decisions or approach to me. The only thing I’m trying to do here is provide information that might help you and potentially other people get better results or understand why the results with a certain approach may be better or worse.

Kerfuffle@sh.itjust.works · edit-2 1 year ago

Another one that made a good impression on me is Qwen-7B-Chat

Bit off-topic but if I’m looking at this correctly, it uses a custom architecture which requires turning on trust_remote_code and the code that would be embedded into the models and trusted is not included in the repo. In fact, there’s no real code in the repo: it’s the just a bit of boilerplate to run inference and tests. If so, that’s kind of spooky and I suggest being careful not to run inference on those models outside of a locked down environment like a container.

Kerfuffle@sh.itjust.works · 1 year ago

For sampling I normally use the llama-cpp-python defaults

Most default settings have the temperature around 0.8-0.9 which is likely way too high for code generation. Default settings also frequently include stuff like a repetition penalty. Imagine the LLM is trying to generate Python, it has to produce a bunch of spaces before every line but something like a repetition penalty can severely reduce the probability of the tokens it basically must select for the result to be valid. With code, there’s often very little leeway for choosing what to write.

So you said:

I’m aware of how sampling and prompt format affect models.

But judging the model by what it outputs with the default settings (I checked and it looks like for llama-cpp-python it has both a pretty high temperature setting and a repetition penalty enabled) kind of contradicts that.

By the way, you might also want to look into the grammar sampling stuff that recently got added to llama.cpp. This can force the model to generate tokens that conform to some grammar which is pretty useful for code and some other stuff where the output has to conform to something. You should still carefully look at the other settings to ensure they conform to the type of result you want to generate though, the defaults are not suitable for every use case.

Kerfuffle@sh.itjust.works · 1 year ago

But I can’t accept such an immediately-noticeable decline in real-world performance (model literally craps itself) compared to previous models while simultaneously bragging about how outstanding the benchmark performance is.

Your criticisms are at least partially true and benchmarks like “x% of ChatGPT” should be looked at with extreme skepticism. In my experience as well, parameter size is extremely important. Actually, even with the benchmarks it’s very important: if you look at the ones that collect results you’ll see, for example, there are no 33B models that have a MMLU score in the 70s.

However, I wonder if all of the criticism is entirely fair. Just for example, I believe MMLU is 5-shot, ARC is 10-shot. That means there are a bunch of examples of that type of question and the correct answer before the one the LLM has to answer. If you’re just asking it a question, that’s 1-shot: it has to get it right the first time, without any examples of correct question/answer pairs. Seeing a high MMLU score doesn’t necessarily directly translate to 1-shot performance, so your expectations might not be in line with reality.

Also, different models have different prompt formats. For these fine-tuned models, it won’t necessarily just say “ERROR” if you use the wrong prompt form but the results can be a lot worse. Are you making sure you’re using exactly the prompt that was used when benchmarking?

Finally, sampling settings can also make a really big difference too. A relatively high temperature setting when generating creative output can be good but not when generating source code. Stuff like repetition, frequency/presence penalties can be good in some situations but maybe not when generating source code. Having the wrong sampler settings can force a random token to be picked, even if it’s not valid for the language, or ban/reduce the probability of tokens that would be necessary to produce valid output.

You may or may not already know, but LLMs don’t produce any specific answer after evaluation. You get back an array of probabilities, one for every token ID the model understands (~32,000 for LLaMA models). So sampling can be extremely important.

Kerfuffle@sh.itjust.works · 1 year ago

Are you using a distro with fairly recent packages? If not, then possibly you could try looking for supplementary sources that could provide more recent version. Just as an example, someone else mentioned having a similar issue on Debian. Debian tends to be very conservative about updating their packages and they may be quite outdated. (It’s possible to be on the other side of the problem, with fast moving distros like Arch but they also tend to fix stuff pretty fast as well.)

Possibly worth considering that hardware can also cause random crashes, faulty RAM, overheating GPUs, CPUs, memory or overclocking stuff beyond its limits. Try checking sensors to make sure temperatures are in a reasonable range, etc.

You can also try to determine if the times it crashes have anything in common or anything unusual is happening. I.E. playing graphics intensive games, hardware video decoding, that kind of thing. Some distros have out of memory process killers set up that have been known to be too aggressive, and processes like the WM that can control a lot of memory will sometimes be a juicy target for them.

As you probably already know if you’ve been using Linux for a while, diagnosing problems is usually a process of elimination. So you need to eliminate as many other possibilities as you can. Also, it’s general hard for people to help you with such limited information. We don’t know the specific CPU, GPU, distribution, versions of software, what you were doing when it occurred, anything like that. So we can’t eliminate many possibilities to give you more specific help. More information is almost always better when asking for technical help on the internet.

Kerfuffle@sh.itjust.works · 1 year ago

It also appears on the HF leaderboard now so you can get a very general idea of how (at least the 70b model) compares: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Kerfuffle@sh.itjust.works · 1 year ago

Its about time we start looking into alternatives to the transformer model.

People have been looking into alternatives. If you read the paper, you can see that they compare their approach to a bunch of different alternatives/modifications. Naturally they claim it comes out looking very favorable, but we’ll have to wait and see if the models/code they release actually perform as well as they’re saying and non-obvious downsides.

It’s not an easy thing to get right.

Kerfuffle@sh.itjust.works · 1 year ago

With a quantized GGML version you can just run on it on CPU if you have 64GB RAM. It is fairly slow though, I get about 800ms/token on a 5900X. Basically you start it generating something and come back in 30minutes or so. Can’t really carry on a conversation.

Kerfuffle@sh.itjust.works · 1 year ago

guanaco-65B is my favorite. It’s pretty hard to go back to 33B models after you’ve tried a 65B.

It’s slow and requires a lot of resources to run though. Also, not like there are a lot of 65B model choices.