Can I run local AI on an Apple Silicon Mac?

Yes, and Apple Silicon is one of the best consumer platforms for local AI. The unified memory architecture means your GPU and CPU share the same fast memory pool, which is well-suited to model inference. LM Studio and Jan both have native Apple Silicon builds with Metal acceleration.

What is Ollama and how does it relate to these apps?

Ollama is a local model runtime that runs in the background and exposes an OpenAI-compatible API. Many local AI apps (Jan, LobeHub, and others) can use Ollama as their model backend. Think of Ollama as a headless model server, and LM Studio/Jan/GPT4All as interfaces that can optionally connect to it.

How much storage do local models need?

A 7–8B model at Q4 quantization runs 4.66–4.92 GB depending on variant (Q4_K_M is the most common balanced choice at ~4.92 GB; Q8 is 8.54 GB for near-full quality). A 13B model at Q4 is roughly 8 GB. A 70B model at Q4 is roughly 40 GB. Models are stored in GGUF format and sit in a folder on your drive — you can delete them like any other file.

Running AI Locally: What to Actually Use

Running AI locally means the model runs on your own hardware — your laptop, desktop, or home server — with no internet connection required and no API costs. The tradeoff is that local models are generally smaller and less capable than the frontier models you get through cloud APIs, and the setup is more involved.

That said, local AI has gotten significantly better. Models like Llama 3, Mistral, and Qwen have narrowed the gap with cloud models for many everyday tasks. If you care about privacy, offline access, or eliminating ongoing API costs, local AI is worth understanding.

The two things you need to understand before starting

First: a local AI app is not the same as a local AI model. The app is the interface — LM Studio, Jan, GPT4All. The model is the file that does the actual inference — Llama 3.1 8B, Mistral 7B, Phi-3. The app downloads and manages models for you, but you are choosing both.

Second: quantization matters. Local models are compressed to fit on consumer hardware. A 'Q4' quantization is aggressively compressed for speed and small size. A 'Q8' or 'FP16' version is closer to the original quality but requires much more RAM and VRAM. For most people starting out, Q4 is the right choice.

8 GB RAM: can run 7B models at Q4 quality — capable but limited
16 GB RAM: comfortable for 7–13B models — good for most daily tasks
32 GB+ RAM or dedicated GPU: can run 30–70B models — near-frontier quality

LM Studio — the best starting point for most people

LM Studio is a polished desktop app for macOS, Windows, and Linux that handles model discovery, download, and management behind a clean interface. You can search for models by name, see hardware compatibility estimates before downloading, and start a local chat with a few clicks.

It also runs a local OpenAI-compatible server, which means any BYOK tool that supports custom API endpoints — NextChat, big-AGI, LobeHub, and many others — can point to LM Studio and use your local models through the same interface they use for cloud APIs.

LM Studio is the right starting point because it makes the learning curve gradual. You do not need to understand quantization formats, model files, or GGUF before you start. The interface guides you.

One honest caveat for technical users: LM Studio's Electron-based interface can reduce GPU throughput compared to running llama.cpp directly — some users report 20–40% lower token-per-second rates versus the CLI. If raw inference speed matters, tools like Jan (which uses llama.cpp under the hood) or running Ollama directly can be faster. For most people starting out, the difference is not noticeable enough to matter.

Jan — the most polished cross-platform option

Jan is an open-source desktop app with a similar scope to LM Studio but a different philosophy. It is more transparency-forward — you can see exactly what is happening with your models and data. The UI is clean and the model management is good.

Jan also runs an OpenAI-compatible local server (on port 1337 by default), so it works as a model backend for other tools. It has slightly less hand-holding than LM Studio on hardware compatibility detection, but the core experience is very good.

If you are specifically privacy-conscious and want everything fully open source and auditable, Jan is the better choice over LM Studio.

GPT4All — the most accessible option for non-technical users

GPT4All, from Nomic AI, is the easiest local AI app to get started with for someone who does not come from a developer background. The installer is standard, the model library is curated and small, and the interface deliberately avoids technical jargon.

It has a built-in document chat feature (LocalDocs) that lets you point it at a folder of PDFs or text files and ask questions about them — no setup required. This is GPT4All's strongest differentiator.

The tradeoff is less control. GPT4All has fewer configuration options than LM Studio or Jan, and the model selection is more curated and smaller. For technical users, this feels limiting. For people who just want it to work, it is exactly right.

Text Generation WebUI — for power users

Text Generation WebUI (oobabooga) is the original local LLM interface and still the most powerful. It runs as a web app on your local machine and exposes more controls than any other option in this category — sampling parameters, rope scaling, context length, multiple backends (llama.cpp, ExLlama, GPTQ, AWQ).

It is designed for people who want to understand and control exactly how inference works. The setup is more involved, the interface is utilitarian, and the learning curve is steeper. In exchange, you get capabilities that none of the polished apps expose.

Use Text Generation WebUI if you are experimenting with model fine-tuning, need specific backend options, or want the most flexibility. Use one of the other three options if you want to get productive quickly.

One thing to set expectations on

Local models running on consumer hardware — even with 32 GB of RAM and a mid-range GPU — are meaningfully behind frontier cloud models on complex reasoning tasks. For tasks like writing, summarisation, light coding, and document Q&A, the gap is small. For difficult code generation, multi-step reasoning, and tasks requiring broad knowledge, the gap is real.

Running locally is a genuine option for many use cases. It is not a full replacement for cloud APIs for all use cases. Most people end up using local models for some things and cloud APIs for others.