A few questions have popped up recently, so I thought I would address them here.
I’ve already written about running local LLMs here. Both posts also include the reasons why you might want to do this. I’m not going to repeat that here.
- Running Large Language Models (LLMs) Locally using Ollama
- Running Large Language Models (LLMs) Locally using LM Studio
Can I run a local LLM on ???
This is the most common question I get. Someone will list their hardware and ask if they can run an LLM locally. The answer is always, “It depends”.
- What kit do you actually have?
- What sort of models do you want to use?
- How are you expecting to use them?
- Do you expect it to be fast, or is it acceptable for it to be slow?
I can’t give a definitive answer, but here are some things to consider.
What kit do you actually have?
The models you will be able to run vary a lot depending on your kit. If you have an old laptop (like me), you will be restricted to running small models, as you may be running on the CPU, using system memory. I always start with “Granite 4 H Tiny” from IBM, which is small and seems to run pretty quickly on my laptop, whilst also giving me OK results. I will use bigger models as well, but only if I’m prepared to wait for a result.
If you have a GPU, the type of models you can run will depend a lot on the amount of VRAM on the GPU. The relationship between the size of the model and the VRAM you need is not as simple as it sounds. A lot of people will just say the model has to be smaller than the VRAM of your GPU, which is kind-of correct, but you also have to consider some other things. The quantization (compression) of the model reduces the model size. KV cache compression (see TurboQuant) and the size of the context you are using affect the size of the Key-Value (KV) cache. You want everything, model weights and KV cache, to fit into the VRAM on the card. Even with a GPU you might still be limited to smaller models if you want to pass a lot of context into your queries.
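To make that a little more concrete, here is a back-of-envelope sketch of the two big pieces that have to fit in VRAM: the quantized weights and the KV cache. All the numbers (8B parameters, 4-bit quantization, 32 layers, 8 KV heads, 128 head dimension) are illustrative assumptions, not figures for any specific model.

```python
# Rough VRAM estimate: quantized weights + KV cache must both fit.
# All model dimensions below are made-up but plausible values.

def model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in GB at a given quantization."""
    # 1 billion parameters at 8 bits each is roughly 1 GB.
    return params_billion * bits_per_weight / 8

def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: 2 (keys and values) * layers * KV heads
    * head dimension * context length * bytes per value."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# Hypothetical 8B-parameter model at 4-bit quantization, 8k context.
weights = model_size_gb(8, 4)          # ~4 GB of weights
cache = kv_cache_gb(8192, 32, 8, 128)  # ~1 GB just for the KV cache
total = weights + cache                # needs to fit in VRAM, with headroom
```

Notice the KV cache grows linearly with the context length, which is why a model that fits comfortably at a small context can blow past your VRAM once you start stuffing documents into the prompt.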
Typically you will find the more parameters a model has, the bigger it is, but remember some models with a lot of parameters only activate a subset of them for each request (mixture-of-experts models), making them a lot more efficient than they would first appear.
If you have kit with unified memory, like Apple Silicon kit, the system memory can be used by either the main system or the GPU, so in some cases it’s easier to run bigger models on that kit because you effectively have more VRAM. You aren’t going to run a huge model on an 8G Neo, but you may find Apple Silicon kit works a lot better than the equivalent x86-64 kit because of unified memory.
I’ve been purposely vague here, but you hopefully get the idea that it is not as straightforward as you telling me your kit and me giving you a definitive answer. You have to play around.
What sort of models do you want to use?
Based on the previous section, you already know the limitations of your kit. When you are using cloud-based LLMs, you are often routed to different models depending on the type of work you are trying to do. Some are specialized for coding. Some are better at dealing with written text. Some are better at handling image/video data. You have to select a model that will work with your kit, and that is focussed on the type of work you are trying to do. You may find a smaller specialized model gives you better and quicker results than one of the big models. It all depends.
Play around and see what works for your requirement and kit.
How are you expecting to use them?
Asking short directed questions is very different to having huge amounts of context, based on either lots of provided information, or long running chats. Every time a new question is asked, the previous chat history becomes part of the context. You need to be careful.
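The growth is easy to underestimate, so here is a minimal sketch of it. Each turn, the full message history is resent to the model, so even short questions produce ever-bigger prompts. The token count here is a crude assumed estimate (roughly four characters per token), not a real tokenizer.

```python
# Sketch: chat history is resent on every request, so the prompt the
# model must process grows with every turn, even for short questions.

def approx_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

history = []  # list of (role, text) pairs, resent in full each turn

def send(user_text: str, fake_reply: str) -> int:
    """Record a turn and return the total tokens the model had to read."""
    history.append(("user", user_text))
    prompt_tokens = sum(approx_tokens(text) for _, text in history)
    history.append(("assistant", fake_reply))
    return prompt_tokens

# Two short questions, but the second prompt includes the whole first exchange.
t1 = send("What is a KV cache?", "It stores attention keys and values. " * 20)
t2 = send("How big does it get?", "It grows with the context length. " * 20)
```

After the second call, `t2` is much bigger than `t1`, because the first question and its long answer are now part of the context.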
Likewise, trying to run agents that can make tool calls can quickly increase the size of the context. Being extremely simplistic, for every tool call the query is halted, the tool call is made, the information returned is added to the context, and the query is run again with the additional information. The amount of context can grow substantially.
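That loop can be sketched in a few lines. This is a deliberately simplistic illustration, not a real agent framework: the tool, its name and the context handling are all invented for the example.

```python
# Simplistic sketch of the agent loop: the query pauses, a tool runs,
# its output is appended to the context, and the query is re-run with
# the extra information. The "disk_usage" tool is entirely made up.

def fake_tool(name: str, arg: str) -> str:
    """Stand-in for a real tool call (shell command, API request, etc.)."""
    tools = {"disk_usage": lambda path: f"{path}: 73% used"}
    return tools[name](arg)

def run_agent(question: str, tool_calls: list) -> list:
    """Return the context the model would see after all tool calls."""
    context = [question]
    for name, arg in tool_calls:
        result = fake_tool(name, arg)  # query halted, tool runs
        context.append(result)         # result becomes part of the context
    return context                     # the re-run query sees all of this

context = run_agent("Is /var filling up?", [("disk_usage", "/var")])
```

Every tool call adds its output to the context, so an agent making dozens of calls can blow out the context window long before it produces an answer.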
Once again, you will have to play around and see what tools and models work for your kit. You are not going to be able to run 50 OpenClaw agents against a massive model on your 10 year old crappy laptop. Just be realistic.
Do you expect it to be fast, or is it acceptable for it to be slow?
I think a lot of people try the free version of OpenAI and expect running a local model to be just as fast. Pick the right model and it could be, but most of the time you have to manage your expectations. I will often try stuff on a small model. If I get an acceptable answer, I go with it. If I don’t, I will try a larger, much slower model. Sometimes I’m prepared to wait for a superior answer.
You have to deal with the limitations of your kit.
Local LLMs are crap?
I saw this comment on a post recently and it made me kind-of mad.
“I tried most of these small models. They are all crap. Sure they might succeed in some narrow tasks, that still need their output to be verified.”
Well yes, the small models are often best suited to narrow/directed tasks. But there are hundreds/thousands of them to choose from. Find those that suit your workflows. Not every task needs a frontier model to give good results.
If you have good kit at home, like a gaming GPU or a higher spec Apple Silicon laptop, you might be able to run bigger models.
The bit about, “that still need their output to be verified”, made me want to go postal. Literally every response from an LLM needs to be verified. They hallucinate all the time. Almost everyone who tries to tell you they are near perfect has some skin in the game. They stand to make money by convincing you to buy their service/tool. The reality is they screw up all the time. Not validating the results is like Googling commands and running them directly on your production servers as root. You are going to be out of a job pretty quickly.
Conclusion
You can run LLMs locally on a wide range of kit. The result you get will vary greatly depending on your kit, the models you are trying to use and how you are trying to use them. You are not going to be challenging OpenAI/Anthropic/Google any time soon, but you might get something that works for you. If so, it will be free, private and under your control.
Just play around and be realistic. I’m not an AI expert. Just a dabbler…
Cheers
Tim…