I've experimented a bit with LM Studio and vLLM on a Ryzen AI 5 + 96GB system RAM box. The problem isn't really the inference (decode), it's the prefill that's super slow.
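If you want to see the split for yourself, here's a minimal sketch that times time-to-first-token (dominated by prefill) separately from decode throughput. It assumes an OpenAI-compatible server, which both LM Studio and vLLM expose; the URL, port, and model name are placeholders for whatever your setup uses.

```python
import json
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "your-local-model"  # placeholder: whatever the server has loaded

prompt = "word " * 2000  # long prompt so prefill cost dominates

start = time.time()
first_token_at = None
tokens = 0

with requests.post(URL, json={
    "model": MODEL,
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 128,
    "stream": True,
}, stream=True) as resp:
    for line in resp.iter_lines():
        # OpenAI-style streaming sends SSE lines: "data: {...json...}"
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        if chunk["choices"][0]["delta"].get("content"):
            if first_token_at is None:
                first_token_at = time.time()  # prefill roughly ends here
            tokens += 1  # approximate: usually one token per chunk

end = time.time()
if first_token_at:
    print(f"prefill (time to first token): {first_token_at - start:.2f}s")
    print(f"decode: {tokens / (end - first_token_at):.1f} tok/s")
```

On CPU or unified-memory setups you'll typically see the first number blow up with prompt length while the second stays roughly constant.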
I'm pretty into local models. Got a website up so others can play with them; broke it yesterday, though, hopefully fixing it today. Lately I've been messing with local models abliterated with Heretic.
I tried setups that used Ollama. Even with an RTX 3070 (8GB) and a 3080 (10GB), I wasn't able to use any models for tool calling unless Ollama offloaded a considerable amount of work to the CPU, which slowed everything to a crawl. There's a quick sanity check below if you want to test a model/VRAM combo.
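A bare request against Ollama's chat endpoint is enough to see whether a model emits structured tool calls at all before worrying about speed. The model name and the weather tool here are just placeholders; pick a model that advertises tool support.

```python
import json

import requests

URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint
MODEL = "llama3.1:8b"  # placeholder: any tool-capable model you have pulled

# OpenAI-style function schema, which Ollama's tools parameter accepts
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(URL, json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": tools,
    "stream": False,
}).json()

# A model that handled the request returns structured tool_calls
# on the message instead of (or alongside) plain text content.
print(json.dumps(resp["message"].get("tool_calls", resp["message"]), indent=2))
```

If the reply comes back as plain prose instead of a `tool_calls` array, the model (or the quant you're running) isn't doing tool calling properly, regardless of how fast it generates.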
I'm considering getting a 5090 (32GB) to try again with more recent models like GLM 4.7.
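Rough arithmetic on what fits in 32GB, assuming a ~4-5 bit quant; the bits-per-weight and overhead figures are rules of thumb, not measurements:

```python
def vram_estimate_gb(params_b, bits_per_weight=4.5, overhead_gb=2.0):
    """Ballpark VRAM need: weights at the quant's bits-per-weight,
    plus a couple of GB assumed for KV cache, activations, and
    runtime overhead. All numbers are rough assumptions."""
    weights_gb = params_b * bits_per_weight / 8  # GB per billion params
    return weights_gb + overhead_gb

# e.g. dense models at ~4.5 bpw (a typical Q4_K_M-ish quant)
for size in (8, 14, 32, 70):
    print(f"{size}B @ ~4.5bpw: ~{vram_estimate_gb(size):.0f} GB")
```

By that estimate a 32B dense model fits in 32GB with room for context, while anything in the 70B+ range still needs heavier quantization or CPU offload.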
What are you looking to do?