I was browsing Reddit (yetch) while waiting for some stuff to finish when I came across this post:

https://old.reddit.com/r/LocalLLM/comments/1tek00h/why_is_llm_is_so_expensive/

The author makes a (very) interesting claim: if table stakes are $6K (they’re not… but go with it for now), then most folks are cooked from the get-go.

Personally, I have been figuring out how to get more from less. For example, people have found ways to run Qwen3.6 35B on a 6GB VRAM GTX 1060 at ~20 tok/s (--ctx 64K IIRC, but go check the vids yourself)

https://youtu.be/8F_5pdcD3HY
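
If you’re wondering how that even works: partial offload. You push however many transformer layers fit into the 6 GB of VRAM and leave the rest in system RAM, trading some speed for a model that “shouldn’t” fit. A rough sketch with llama-cpp-python (the GGUF filename and layer count are placeholders, not the exact setup from the video; tune both to your card):

```python
# pip install llama-cpp-python (use a CUDA-enabled build for GPU offload)
from llama_cpp import Llama

llm = Llama(
    model_path="qwen-35b-q4_k_m.gguf",  # placeholder: any 4-bit GGUF quant of the model
    n_gpu_layers=18,   # offload only what fits in ~6 GB of VRAM; the rest runs from system RAM
    n_ctx=65536,       # the ~64K context mentioned above; shrink this first if you run out of memory
    verbose=False,
)

out = llm("Say hi in five words.", max_tokens=16)
print(out["choices"][0]["text"])
```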

I think there’s a lot of juice to squeeze by turning LLMs from “all-seeing sages” into basically mouthpieces for shit that actually runs fast on regular silicon - but that’s just me and my crazy brain. YMMV.

  • I’ve been running Qwen3.6 35B very easily, but that’s because I’ve got an ASUS Z13, which is one of the newish laptops that have the AMD Ryzen AI MAX+ 395 in them. You can get that chip with 128 GB of unified memory, 96 GB of which I have dedicated as VRAM. I can also run Qwen3 Coder Next 80B. I’m not sure how many tokens per second I’m getting with Coder, but it’s fast.

    Honestly, I think this unified memory might be the future of mobile chips, because the things I can do with it are pretty crazy. It’s not just useful for AI either; it’s in a few gaming laptops because it also works really well for gaming. But the things you can do with LLMs or diffusion models are amazing. I donate compute to AI Horde, and I’m finishing image generation jobs for people in like 4 seconds.

  • SuspiciousCarrot78@aussie.zone (OP)

    I’m actually thinking of pivoting my router/orchestrator entirely. I think the way forward is to look at expert systems (yes, those ancient things from the long, long ago of… 1980) but with modern, user-updatable tooling, with a small LLM in the middle that the user can talk to. That is, de-emphasize the central role of the LLM entirely; make it the user-facing NLP input/output and let the real programs, running on real silicon, do the work. I might have a different use case than most, but I bet not that different (online LLM discussion seems to gravitate around users who use LLMs for coding; Anthropic and OAI internal reports say otherwise).
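
    To make that concrete, the shape I’m picturing is two cheap calls to a small local model (one to pick an intent, one to phrase the answer) wrapped around a plain dispatch table that does the real work. A rough sketch, assuming an Ollama server on localhost; the model name and the lone handler are stand-ins:

    ```python
    # Sketch: the LLM is only the mouthpiece; deterministic code does the work.
    import json
    import shutil
    import requests

    def ask(prompt: str) -> str:
        """One completion from a small local model via Ollama's /api/generate."""
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "qwen2.5:3b", "prompt": prompt, "stream": False})
        return r.json()["response"]

    def disk_space() -> dict:
        total, _, free = shutil.disk_usage("/")
        return {"free_gb": round(free / 2**30, 1), "total_gb": round(total / 2**30, 1)}

    HANDLERS = {"disk_space": disk_space}  # the "expert system": real programs live here

    def handle(user_text: str) -> str:
        intent = ask(f"Pick exactly one label from {list(HANDLERS)} for this request "
                     f"and reply with only that label: {user_text}").strip()
        evidence = HANDLERS.get(intent, disk_space)()   # deterministic work on real silicon
        return ask(f"User asked: {user_text}\nTool result: {json.dumps(evidence)}\n"
                   "Answer the user in one plain sentence.")

    print(handle("how much space is left on the server?"))
    ```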

    Ironically, I’m writing the blurb now while waiting for smoke test #90238472398 to finish.

    • randomaside@lemmy.dbzer0.com

      I’ve been saying this for a while to people. I think the long term use case for LLMs is the semantic human interface device.

      Siri, Alexa, even Google Home (whatever they called it): they all swung and missed at this. But even being able to give a computer unclear commands and get the intended result would be a huge win.

      I know the big LLM inference can do a lot more, but the cost is high for systems with that ability to reason. However, small, lightweight LLMs are actually very good for command and control.

      This is where my current homelab projects are focused.
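
      For what it’s worth, the thing that makes tiny models workable for command and control is never letting them free-form: force structured output and validate it before anything executes. A small sketch against a local Ollama endpoint (the model name and command schema are made up for illustration):

      ```python
      # Sketch: a small model as command parser only; your code decides what actually runs.
      import json
      import requests

      SCHEMA_HINT = ('Reply with JSON only, like {"device": "living_room_light", '
                     '"action": "on" | "off" | "dim", "value": 0-100 or null}.')

      def parse_command(utterance: str) -> dict:
          r = requests.post("http://localhost:11434/api/generate", json={
              "model": "qwen2.5:3b",    # any small instruct model you have pulled
              "prompt": f"{SCHEMA_HINT}\nUtterance: {utterance}",
              "format": "json",         # Ollama constrains the reply to valid JSON
              "stream": False,
          })
          cmd = json.loads(r.json()["response"])
          # validate before touching hardware: the model gets a vote, not control
          if cmd.get("action") not in {"on", "off", "dim"}:
              raise ValueError(f"refusing unexpected action: {cmd}")
          return cmd

      print(parse_command("it's way too bright in here, knock the lights down a bit"))
      ```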

      • SuspiciousCarrot78@aussie.zone (OP)

        Hey, me too :) As my school teachers used to tell me, “Great minds think alike (but fools seldom differ :)”

        For me, I’m thinking of having an LLM as one layer / one container in a homelab that does some specific stuff:

        • queries against local docs / notes / manuals / PDFs / wiki material as the trusted knowledge layer (see the sketch after this list)
        • uses tools for search, file lookup, shell, git, Docker, Home Assistant, calendar, etc.
        • a local “Codex” / wiki layer that turns my own source material into an inspectable knowledge base
        • provenance and audit trails
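
        As promised, a minimal sketch of that knowledge layer. SQLite’s built-in full-text search is enough to start with, and it stays inspectable: it’s just a file you can query, back up, and audit. The paths and note contents here are stand-ins:

        ```python
        # Sketch: notes/docs indexed with SQLite FTS5; the LLM only ever sees retrieved snippets.
        import sqlite3

        db = sqlite3.connect("knowledge.db")
        db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS notes USING fts5(path, body)")

        # stand-in corpus; in practice you'd walk your notes/Syncthing folders
        db.executemany("INSERT INTO notes VALUES (?, ?)", [
            ("hw/motherboard.md", "B550 board: front-panel header pinout, PWR_SW on pins 6-8 ..."),
            ("homelab/jellyfin.md", "Jellyfin container: port 8096, config volume at /srv/jellyfin ..."),
        ])
        db.commit()

        def retrieve(query: str, k: int = 3):
            """Top-k matching snippets, with provenance (the source path) kept alongside."""
            return db.execute(
                "SELECT path, snippet(notes, 1, '[', ']', '...', 12) "
                "FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?", (query, k)).fetchall()

        for path, snip in retrieve("front panel pinout"):
            print(path, "->", snip)  # these snippets plus their paths go into the LLM prompt
        ```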

        I want to take a screenshot of something, drop it into Syncthing from my phone, then later ask “did I fuck the pins on this?” … and for it to look up the schematics, eyeball the pins and tell me. Or I say “hey, can you grab a copy of X for me, usual params” and have the LLM instruct Sonarr/Radarr/SABnzbd to do that. (That is, make your OWN “Alexa” with an ESP32, stick it in a room and then call it when you need it.)

        So instead of asking a 70B model to “know” why your media server is down, the system checks service status, logs, last config changes, prior notes, Docker state, network state, etc., then the LLM explains the result in human language. You can probably do that with a 4B (I’m testing that assumption now).
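
        That “check first, explain second” flow is mostly plumbing. A rough sketch of it (the service name and commands are just examples, and the endpoint is the same local Ollama assumption as before):

        ```python
        # Sketch: deterministic evidence-gathering first; the small model only narrates the result.
        import subprocess
        import requests

        def run(cmd: list[str]) -> str:
            """Capture a command's output; failures become part of the evidence too."""
            p = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
            return p.stdout + p.stderr

        evidence = {
            "containers": run(["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"]),
            "recent_logs": run(["journalctl", "-u", "jellyfin", "-n", "30", "--no-pager"]),  # example service
        }

        prompt = ("The media server may be down. Based ONLY on this evidence, say what looks "
                  f"wrong in two sentences:\n{evidence}")
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "qwen2.5:3b", "prompt": prompt, "stream": False})
        print(r.json()["response"])
        ```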

        Same for “find that motherboard note,” “summarize this email thread,” “turn this into a task,” “compare this eBay listing to my saved hardware notes,” “what did I do last time this broke,” or “run the smoke test and tell me the first real failure.”

        I think small models are the shit for this, because if the model only has to classify intent, route the request, render structured evidence, and talk like a normal human… then it doesn’t need to be a giant oracle. The expensive (time-wise) part becomes less “make the model smarter” and more “build a better control plane around it.”

        Basically: local LLM as semantic HID; expert system/tool router underneath; user owns the data and the machine.

        As always, ICBW…but fuck it, I’m gonna try.

        PS: I have an idea of how to apply that to coding too… but that’s a project for much later. I’ve been cooking this shit for far too long. The next thing I wanna do is a fun project for myself (that is: ROM-hack a parachute and grappling gun into Super Mario Sunshine, so I can basically play “What if Super Mario Sunshine but actually Just Cause 2” on my Wii with the kids).

  • Denixen@feddit.nu

    I think this is where a lot of LLMs will land: local usage. At first people will try to use big general models for their rather domain-specific tasks, until they realize smaller specialized models can do the same thing, but cheaper, with no subscription or per-token costs, and with full ownership of the model rather than just renting one.

    Imagine having a model that is proprietary to your company, costs nothing but electricity to run, with no rented computing capacity, etc. You own the model, can customise it at will, and have no restrictions.

    Currently, super-large models are driven by a hope for AGI capabilities, but we are years away from that, and LLMs will never get us there; it requires different architectures.

    I run Ministral 8B on my laptop, and as long as I don’t give it overly complex tasks it can do things like translating, explaining simple things, or helping me understand functions and basic code while I am learning. If it can do web search, RAG search and indexing, you don’t need that big a model for it to work as a decent assistant.

    If you want to replace people and not just enhance their work, then at current prices I don’t think you get your bang for the buck. A model owned and run by someone else outside the company is just a very expensive consultant… they’ll leave with all their competence if you stop paying them. An in-house AI will never leave, only gets better with time, and costs nothing beyond electricity and the initial capital cost of the server.

    I give it a year or two before companies realise this, but first they need to realise that the money they spend on subscriptions and tokens isn’t an investment but a cost; they don’t get that money back. A local model is an investment that keeps on giving after the initial capital outlay, and costs much less for repetitive work.

    Mistral is betting on this and I think it will pay off. Unless I am wrong about AGI.

  • hendrik@palaver.p3x.de

    I don’t think that’s any new insight 😂. That’s how the AI game works. There have always been two classes: big corpo and the GPU-poor. Of course the big AI companies get to shape AI. Economies of scale also work in their favour. They’ve bought most of the skill. And they have all the money. They simply buy a 4x EPYC box with 3 TB of RAM connected to 16 Nvidia AI cards, and then a few hundred more nodes. You can’t even buy one. It’s just a very unequal environment if you want to compare the two.