

Chat is not very interesting to me :) and the ik_llama.cpp fork I use says sliding window attention is not supported for this model :/ I want full support developed for it, so I'm waiting for the technology to arrive that makes it possible. (Or I'll start my own fork if my patience doesn't last :D)
Reducing the batch size lowered prefill speed but freed some RAM, thanks. As for inference engines, I'll definitely try some later.
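For anyone curious, this is the kind of trade-off I mean. A sketch using mainline llama.cpp flag names (the ik_llama.cpp fork's options may differ, and `model.gguf` is a placeholder):

```shell
# Smaller batch sizes trade prefill throughput for lower memory use,
# matching the effect described above. Flag names from mainline llama.cpp;
# the fork may differ:
#   -b  / --batch-size   : logical batch size (default 2048)
#   -ub / --ubatch-size  : physical micro-batch size (default 512)
./llama-cli -m model.gguf -b 512 -ub 128 -p "Hello"
```

Dropping `-ub` is usually what cuts peak memory during prefill, at the cost of processing the prompt in more, smaller chunks.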