Everything I've learned so far about running local LLMs

Chris Wellons shares detailed notes on his experience running local LLMs on Windows, though most of these tips apply to other operating systems as well.

This is great: there's a ton of detail here and the root recommendations are very solid. Use llama-server from llama.cpp and try ~8B models first (Chris likes Llama 3.1 8B Instruct at Q4_K_M as a first model). Anything over 10B probably won't run well on a CPU, so you'll need to consider your available GPU VRAM.
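To get a feel for why model size and quantization matter when sizing against VRAM, here's a rough back-of-the-envelope sketch. It isn't from Chris's post: the bits-per-weight figures are approximate values I'm assuming for each quantization type, the `overhead_gb` allowance for context and runtime buffers is a guess, and the function names are made up for illustration.

```python
# Rough rule of thumb: weight memory ~= parameters x bits-per-weight / 8.
# ASSUMPTION: these bits-per-weight numbers are approximations, not exact values.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,  # approximate average for this mixed 4-bit quantization
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the model weights in GB (1 GB = 10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


def fits_in_vram(params_billions: float, quant: str, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    """Crude check: weights plus a guessed allowance for context and buffers.

    ASSUMPTION: overhead_gb is a placeholder; real usage depends on context
    length and runtime settings.
    """
    return estimate_weight_gb(params_billions, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Model sizes mentioned in the post, checked against a hypothetical 8 GB GPU.
    for label, size in [("8B", 8), ("12B", 12), ("14B", 14), ("72B", 72)]:
        gb = estimate_weight_gb(size, "Q4_K_M")
        verdict = "fits" if fits_in_vram(size, "Q4_K_M", vram_gb=8) else "does not fit"
        print(f"{label} at Q4_K_M: ~{gb:.1f} GB of weights, {verdict} in 8 GB")
```

Treat the output as an order-of-magnitude guide only. Actual file sizes and memory use vary by model architecture and by how much context you ask for.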

This is neat:

Just for fun, I ported llama.cpp to Windows XP and ran a 360M model on a 2008-era laptop. It was magical to load that old laptop with technology that, at the time it was new, would have been worth billions of dollars.

I need to spend more time with Chris's favourite models, Mistral-Nemo-2407 (12B) and Qwen2.5-14B/72B.

Chris also built illume, a Go CLI tool for interacting with models that looks similar to my own LLM project.

Posted 10th November 2024 at 6:01 pm

Simon Willison’s Weblog
