I was browsing Reddit (yech) while waiting for some stuff to finish when I came across this post:
https://old.reddit.com/r/LocalLLM/comments/1tek00h/why_is_llm_is_so_expensive/
The author makes a (very) interesting claim: if table stakes are $6K (they’re not… but go with it for now), then most folks are cooked from the get-go.
Personally, I have been figuring out how to get more from less. For example, people have found ways to run Qwen3.6 35B on a GTX 1060 with 6 GB of VRAM at ~20 tok/s (--ctx 64K IIRC, but go check the vids yourself).
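For the curious, here is roughly what that kind of setup looks like; a minimal sketch using llama-cpp-python, assuming you already have a GGUF quant on disk (the filename and layer count below are illustrative placeholders, not the actual settings from those videos):

```python
# Minimal sketch: partial GPU offload on a small-VRAM card via llama-cpp-python.
# The model path and n_gpu_layers value are placeholders, not verified settings.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-35b-q4_k_m.gguf",  # hypothetical local GGUF quant
    n_ctx=65536,                         # the 64K context mentioned above
    n_gpu_layers=20,                     # offload whatever fits in ~6 GB VRAM
)

out = llm("Summarize KV-cache offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The trick is the split: layers that fit go to the GPU, the rest run on CPU, and you trade some speed for being able to load the model at all.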
I think there’s a lot of juice to squeeze by turning LLMs from “all-seeing sages” into basically mouthpieces for shit that actually runs fast on regular silicon, but that’s just me and my crazy brain. YMMV.
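To make that concrete, the pattern I’m picturing is the model only parsing the request and phrasing the answer, while ordinary code does the actual work. A toy sketch (the dispatch table and the hardcoded “model output” here are stand-ins, not any real tool-calling API):

```python
# Sketch of the "mouthpiece" pattern: the LLM only decides which tool to call;
# plain, fast code does the computation. The model output below is hardcoded
# for illustration; in a real setup it would come from a small local model.
import json
import statistics

TOOLS = {
    "mean": statistics.fmean,
    "median": statistics.median,
}

# Pretend the model translated "what's the average of 3, 1, 4, 1, 5, 9?"
# into this structured tool call.
model_output = json.dumps({"tool": "mean", "args": [3, 1, 4, 1, 5, 9]})

call = json.loads(model_output)
result = TOOLS[call["tool"]](call["args"])
print(f"{call['tool']}({call['args']}) = {result}")  # cheap silicon did the math
```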


I’ve been running Qwen3.6 35B very easily, but that’s because I’ve got an ASUS Z13, which is one of the newish laptops with the AMD Ryzen AI MAX+ 395 in it. You can get that chip with 128 GB of unified memory, 96 GB of which I have dedicated as VRAM. I can also run Qwen3 Coder Next 80B. I’m not sure how many tokens per second I’m getting with Coder, but it’s fast.
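If you want an actual number, here’s a quick-and-dirty way to measure it; a sketch assuming a llama-cpp-python setup (the model path is a placeholder for whatever GGUF you’re actually running):

```python
# Rough tok/s measurement: stream tokens and time them.
# The model path is a hypothetical placeholder.
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen3-coder-80b-q4.gguf", n_ctx=8192, n_gpu_layers=-1)

start, n_tokens = time.perf_counter(), 0
for chunk in llm("Write a quicksort in Python.", max_tokens=256, stream=True):
    n_tokens += 1  # each streamed chunk is roughly one generated token
print(f"~{n_tokens / (time.perf_counter() - start):.1f} tok/s")
```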
Honestly, I think this unified memory might be the future of mobile chips, because the things I can do with it are pretty crazy. It’s not just useful for AI either; it’s in a few gaming laptops because it works really well for gaming too. But what you can do with LLMs or diffusion models is amazing. I donate compute to AI Horde, and I’m finishing image generation jobs for people in like 4 seconds.
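For anyone wondering what those Horde jobs look like from the requester’s side, here’s a rough sketch against the AI Horde v2 API as I remember it; the endpoints and payload fields are from memory and may have drifted, so verify against the official docs before relying on this:

```python
# Rough sketch of submitting an image job to the AI Horde as a client.
# Endpoints and field names are from memory; double-check the current API docs.
import time
import requests

BASE = "https://aihorde.net/api/v2"
HEADERS = {"apikey": "0000000000"}  # anonymous key; registered keys queue faster

# Submit an async image generation request.
r = requests.post(f"{BASE}/generate/async", headers=HEADERS, json={
    "prompt": "a cozy reading nook, watercolor",
    "params": {"width": 512, "height": 512, "steps": 20},
})
job_id = r.json()["id"]

# Poll until some volunteer worker (maybe that Z13!) finishes the job.
while True:
    check = requests.get(f"{BASE}/generate/check/{job_id}").json()
    if check.get("done"):
        status = requests.get(f"{BASE}/generate/status/{job_id}").json()
        print(status["generations"][0]["img"])  # link to the finished image
        break
    time.sleep(2)
```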
Good man/woman. Nerd Valhalla awaits you :)