

Yeah I definitely get your point (and I didn’t downvote you, for the record). But I will note that ChatGPT generates text way faster than most people can read, and 4 tokens/second, while perhaps slower than reading speed for some people, is not that bad in my experience.
This isn’t really true: a lot of the newer MoE models run just fine on a CPU coupled with gobs of RAM, since only a small fraction of the parameters are active for any given token. Yes, they won’t be quite as fast as they would be on a GPU, but getting 128GB+ of VRAM is out of reach of most people.
You can even run DeepSeek R1 671B (Q8) on a Xeon or EPYC with 768GB+ of RAM, at 4-8 tokens/sec depending on configuration. A system supporting this would be at least an order of magnitude cheaper than a GPU setup capable of running the same model.
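
If you want to try something like that, here's a minimal sketch using the llama-cpp-python bindings with inference pinned to the CPU. The model path, shard naming, context size, and thread count are placeholder assumptions you'd adjust for your own quant and hardware:

    from llama_cpp import Llama

    # Hypothetical local path: point at the first shard of a multi-part
    # GGUF quant; llama.cpp loads the remaining shards automatically.
    llm = Llama(
        model_path="models/DeepSeek-R1-Q8_0-00001-of-00015.gguf",
        n_gpu_layers=0,   # CPU-only: all weights stay in system RAM
        n_ctx=4096,       # modest context window to limit KV-cache size
        n_threads=64,     # tune to the physical core count of your Xeon/EPYC
    )

    out = llm("Explain mixture-of-experts in one paragraph.", max_tokens=256)
    print(out["choices"][0]["text"])

On a box like that, CPU inference tends to be memory-bandwidth bound rather than compute bound, so on dual-socket systems how threads are pinned across NUMA nodes can noticeably move those tokens/sec numbers.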