• robber@lemmy.ml (OP) · 26 days ago

    I’d add that memory bandwidth is still a major factor: the faster the RAM, the faster the inference. I think this model would be a perfect fit for a Strix Halo or a >= 64 GB Apple Silicon machine when aiming for CPU-only inference. But note that llama.cpp does not yet support the qwen3-next architecture.
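
    To get a feel for why bandwidth dominates, here’s a rough back-of-envelope sketch. The numbers are illustrative assumptions (roughly 3B active parameters per token for a Qwen3-Next-style MoE model, 8-bit weights, typical bandwidth figures), not measurements:

    ```python
    # Back-of-envelope estimate of the bandwidth-bound token rate for CPU inference.
    # Assumption: each generated token requires streaming every active weight from
    # RAM once, so tokens/s is roughly capped by bandwidth / bytes-per-token.

    def max_tokens_per_second(bandwidth_gb_s: float,
                              active_params_b: float,
                              bytes_per_weight: float) -> float:
        bytes_per_token = active_params_b * 1e9 * bytes_per_weight
        return bandwidth_gb_s * 1e9 / bytes_per_token

    # Illustrative configurations (assumed figures, not benchmarks):
    configs = [
        ("dual-channel DDR5-5600", 89.6),
        ("Strix Halo (256-bit LPDDR5X)", 256.0),
        ("Apple M-series Max class", 400.0),
    ]
    for name, bw in configs:
        # ~3B active params at ~1 byte/weight (8-bit quantization)
        print(f"{name}: ~{max_tokens_per_second(bw, 3.0, 1.0):.0f} tok/s upper bound")
    ```

    Real throughput will land below these ceilings, but the relative ordering tracks memory bandwidth pretty closely for CPU-only inference.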

    • Multiplexer@discuss.tchncs.de · 25 days ago

      Can confirm that from my setup. Increasing parallelization beyond 3-4 concurrent threads no longer significantly increases inference speed.
      That’s a telltale sign that some of the cores are starving because the data no longer arrives from memory fast enough…
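
      A quick way to find that saturation point on your own machine is to sweep the thread count and watch where tokens/s flattens. A minimal sketch driving llama.cpp’s llama-bench (the model path and thread range are placeholders, and the exact flags may differ between versions):

      ```python
      # Sweep thread counts with llama-bench and print the raw results; the point
      # where tokens/s stops improving suggests memory bandwidth saturation.
      # Assumes llama-bench is on PATH; "model.gguf" is a placeholder path.
      import subprocess

      MODEL = "model.gguf"  # placeholder: point this at an actual GGUF file

      for threads in (1, 2, 3, 4, 6, 8, 12, 16):
          result = subprocess.run(
              ["llama-bench", "-m", MODEL, "-t", str(threads), "-n", "64"],
              capture_output=True, text=True,
          )
          print(f"--- {threads} threads ---")
          print(result.stdout.strip())
      ```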