The secret sauce is the Mixture of Experts (MoE) technique, which splits the bulk of a large LLM into many small "expert" subnetworks and activates only a handful of them for each token. This drastically reduces the amount of weight data that must move through memory per token, making it feasible to run most of the model on the processor (CPU) while offloading only the always-active parts to a modest graphics accelerator (GPU).
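To make that concrete, here is a toy sketch of MoE routing. This is an illustration, not the model's actual code: the expert count (128) and top-4 routing match OpenAI's published figures for gpt-oss-120b, but the "router" here is just a deterministic random pick standing in for the small learned network a real model uses.

```python
import random

NUM_EXPERTS = 128   # gpt-oss-120b style: many small expert networks per layer
TOP_K = 4           # only a few experts are consulted for each token

def route_token(token: str) -> list[int]:
    """Toy router: pick which experts handle this token.
    A real model uses a learned gating network; seeding a PRNG with the
    token stands in for it here, deterministically."""
    rng = random.Random(token)
    return sorted(rng.sample(range(NUM_EXPERTS), TOP_K))

for token in ["The", "secret", "sauce"]:
    print(token, "->", route_token(token))

# Only TOP_K / NUM_EXPERTS of the expert weights are touched per token,
# so the rest of the model can sit idle in slow system RAM.
active_fraction = TOP_K / NUM_EXPERTS
print(f"active expert weights per token: {active_fraction:.1%}")
```

The punchline is the last line: per token, only about 3% of the expert weights are ever read, which is why the model's effective memory traffic looks more like a small model's than a 120B one's.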

Leading AI company OpenAI recently unveiled an open LLM called gpt-oss-120b. OpenAI claims that a powerful datacenter GPU with at least 80GB of VRAM is required to run this model efficiently.
But I stumbled on a Reddit post by Wrong-Historian, who claimed that the model runs smoothly on a PC with a budget 8GB GPU and 64GB of RAM, at 25 tokens per second.
“Honestly, this 120B is the perfect architecture for running at home on consumer hardware. Somebody did some smart thinking when designing all of this!” the Redditor’s post reads.
So it happens that my computer also has 64GB of RAM and a GPU with even more VRAM than required. It’s time to cancel some subscriptions.
Testing the claims
My Windows gaming rig has a Radeon RX 7900 XT. This AMD GPU isn’t ideal for running LLMs because software support for it lags behind that of industry-leading Nvidia cards. It has 20GB of VRAM, which is nowhere near enough to fit large LLMs, but still more than the Redditor claimed was needed.
I’ve previously attempted to run over-70B-parameter models on it, like Llama or R1, but the experience was excruciating: usually around one token per second, which is less than one word per second.
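The arithmetic behind that crawl is simple: token generation is roughly bound by how fast the active weights can be streamed from memory once per token. The sketch below works this out; the 4-bit quantization (~0.5 bytes per parameter) and the ~60 GB/s dual-channel system RAM bandwidth are my assumed ballpark figures, while the ~5.1B active parameters for gpt-oss-120b comes from OpenAI's published specs.

```python
def tokens_per_second(active_params_billions: float,
                      bytes_per_param: float,
                      mem_bandwidth_gbs: float) -> float:
    """Rough upper bound on generation speed: every active weight must be
    read from memory once per token, so speed <= bandwidth / bytes-per-token."""
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

# Dense 70B model, 4-bit quantization, ~60 GB/s system RAM (assumed figures)
dense = tokens_per_second(70, 0.5, 60)

# gpt-oss-120b touches only ~5.1B parameters per token thanks to MoE
moe = tokens_per_second(5.1, 0.5, 60)

print(f"dense 70B: ~{dense:.1f} tok/s, MoE (5.1B active): ~{moe:.1f} tok/s")
```

Under these assumptions a dense 70B model tops out below 2 tokens per second from system RAM, while the MoE model's ceiling is in the low twenties, which lines up with both my past experience and the Redditor's 25 tokens per second.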