I've been running AI models locally since early 2025. It started as a curiosity — could my laptop even do it? (Answer: yes, badly at first.) Now it's my daily driver. No API calls. No rate limits. No "sorry, we're experiencing high demand" messages. Just me, my GPU, and a handful of surprisingly capable models.
This is a quick-start guide for normal people, not ML researchers. I'm assuming you know how to open a terminal and you've got a computer made in the last three years. If I can do this — and I once spent two hours debugging a YAML indentation error — you definitely can.
Why Bother Running AI Locally?
I get this question a lot. Usually from people who are genuinely curious, sometimes from people who think I'm being paranoid. Here's the short version:
Privacy. Every prompt you send to ChatGPT or Claude lives on their servers. They can train on it, they can read it, they can comply with subpoenas for it. With local models, your data never leaves your machine. For client work, personal notes, or sensitive code, this isn't a luxury. It's basic security.
No API keys, no billing. A friend of mine racked up a $487 API bill in one night because his script had a loop bug. Local models cost exactly zero dollars per token. The only recurring cost is electricity — and my desktop idles anyway.
Offline-first. Internet went down for two days during a storm last month. My local LLM kept working. That alone sold me.
No censorship or content filtering. Local models don't refuse requests because a safety filter thinks you're asking something sensitive. You get raw, unfiltered responses. (This is a genuine trade-off — be careful what you do with it.)
Nobody can take it away. When OpenAI deprecates a model or Anthropic changes its pricing yet again, your local setup doesn't care. The models live on your drive. They work the same today as they did yesterday.
What Hardware Do You Actually Need?
You don't need a $5,000 workstation. In 2026, a decent gaming laptop does fine. (I started on a MacBook Air M1 with 8 GB of RAM. It wasn't fast. It worked.)
| Tier | RAM | GPU VRAM | What You Can Run | Example Hardware |
|---|---|---|---|---|
| Minimum | 8 GB | None (CPU) | Llama 3.2 3B, Gemma 4 E4B (quantized) | Any laptop from 2022+ |
| Decent | 16 GB | 6 GB (GTX 1060+) | Llama 3.1 8B, Mistral 7B, Qwen3 8B | Budget gaming laptop |
| Good | 32 GB | 12 GB (RTX 3060+) | DeepSeek R1 Distill 14B, Gemma 4 26B (Q4) | Mid-range gaming PC |
| Great | 64 GB | 24 GB (RTX 4090) | Everything above at higher-precision quantization, fast inference | High-end workstation |
CPU-only works. It's slower, sure. But models like Llama 3.2 3B run at conversational speed on a modern Intel i7 or Apple M-series chip. I ran my first local model — Llama 2 7B at the time — on that M1 Air. It took about 8 seconds per response. Frustrating but usable.
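A rough way to sanity-check the tiers above: a model's weights take about (parameters × bits per weight ÷ 8) bytes, plus some headroom for the context cache and the runtime. Here's a minimal sketch of that arithmetic. The flat 20% overhead is my own rule of thumb, not an Ollama figure, and estimate_gb is just a helper name I made up.

```shell
# Rough memory estimate for a model, in GB.
# Usage: estimate_gb <params_in_billions> <bits_per_weight>
# Assumption: a flat 20% overhead on top of the raw weights (rule of thumb only).
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}

estimate_gb 8 4     # 8B model, 4-bit quantization  -> 4.8
estimate_gb 8 16    # same model, 16-bit weights    -> 19.2
estimate_gb 26 4    # 26B model, 4-bit quantization -> 15.6
```

If the number is bigger than your VRAM, the model can still run with part of it in system RAM, just slower. If it's bigger than VRAM and RAM combined, drop down a tier.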
Step-by-Step: Set Up Your Local AI in 20 Minutes
I've done this on Windows, macOS, and Linux. The steps are nearly identical across all three.
Step 1: Install Ollama
Ollama handles model downloading, quantization, and serving — all behind a CLI that doesn't make you want to throw your laptop out the window.
Windows/Mac: Go to ollama.com, download the installer, run it. Done.
Linux: One command:
curl -fsSL https://ollama.com/install.sh | sh
Check it's working:
ollama --version
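The version check only proves the binary is installed. To also confirm the background server is answering, you can ask its HTTP API for the version. This is a sketch that assumes the default port 11434; parse_ollama_version and check_ollama are hypothetical helper names, and the sed parsing is deliberately crude (use jq if you have it).

```shell
# Pull the version string out of the JSON that GET /api/version returns,
# e.g. {"version":"0.5.1"}. sed is used so no extra tools are required.
parse_ollama_version() {
  printf '%s\n' "$1" |
    sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Ask the local server for its version. Assumes the default port 11434.
check_ollama() {
  body=$(curl -fsS --max-time 3 http://localhost:11434/api/version) || {
    echo "Ollama is not answering on localhost:11434" >&2
    return 1
  }
  echo "Ollama server version: $(parse_ollama_version "$body")"
}
```

Paste both functions into a terminal and run check_ollama. If it complains, start the server first and try again.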
Step 2: Pull Your First Model
Start with Llama 3.1 8B. It's the sweet spot — capable enough for real work, small enough for most hardware:
ollama pull llama3.1:8b
This downloads about 4.7 GB. On decent internet, 2-5 minutes.
Test it:
ollama run llama3.1:8b
You're now talking to an AI running entirely on your machine. No internet. No account. No API key. I still find this slightly magical.
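Sanity check: Ask it "What model are you?" Small models often get this one wrong, which is normal and doesn't mean the install is broken. The default llama3.1:8b tag is already quantized to roughly 4 bits, so pulling a lower-precision tag won't make answers more reliable. If responses look off and you have memory to spare, try a higher-precision tag from the model's page on ollama.com instead.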
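Once the interactive chat works, the same model is also reachable over Ollama's local HTTP API, which is what you want for scripting. Below is a minimal sketch that posts one prompt to /api/generate with streaming off. The helper names (json_escape, build_payload, ask) are mine, and the escaping only covers quotes and backslashes, so keep prompts to a single line.

```shell
# Escape backslashes and double quotes so the text can sit inside a JSON string.
# (Newlines and control characters are not handled.)
json_escape() {
  printf '%s\n' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for POST /api/generate with streaming turned off.
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$(json_escape "$2")"
}

# Send one prompt to the local server and print the raw JSON reply.
# Assumes Ollama is listening on the default localhost:11434.
ask() {
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}

# Show the request body without touching the network.
build_payload llama3.1:8b 'Why is the sky blue?'
```

With the server running, ask llama3.1:8b 'Why is the sky blue?' prints a JSON object whose response field holds the answer.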
Step 3: Install Open WebUI
Ollama's terminal interface is fine, but Open WebUI gives you a ChatGPT-style experience. It's clean, supports multiple models, and includes user management if you want other people on your machine to have accounts.
Via Docker (easiest path):
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
No Docker? Install Docker Desktop first. Yes, Docker adds overhead. I know. But for non-Linux users, it's genuinely the simplest option. (On Linux, you can install directly — the Open WebUI docs have instructions.)
Open http://localhost:3000, create a local account, go to Settings → Models, select your Ollama model, and start chatting.
Step 4: Install AnythingLLM for RAG
This is where things get genuinely useful.
RAG — Retrieval-Augmented Generation — lets you ask questions about your own files. "Summarize this contract." "What did the team decide about the database migration?" "Find every mention of the API key rotation policy."
AnythingLLM makes this dead simple:
docker pull mintplexlabs/anythingllm
docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v anythingllm_storage:/app/server/storage -v anythingllm_hotdir:/app/collector/hotdir --name anythingllm mintplexlabs/anythingllm
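Open http://localhost:3001, connect it to Ollama, and upload documents. Because AnythingLLM is running inside Docker here, use http://host.docker.internal:11434 as the Ollama URL; plain localhost:11434 points at the container itself and only works when AnythingLLM runs outside Docker. (On Linux you may also need to add --add-host=host.docker.internal:host-gateway to the docker run command, as in the Open WebUI step.) AnythingLLM chunks your documents, embeds them, and lets you query them with your local model.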
For coding, point it at your repo. For writing, point it at your notes. For research, dump in PDFs. I've used this to query six months of Slack exports and got genuinely useful answers — not hallucinations, actual citations from real messages.
Best Models to Try Right Now
My subjective ranking, ordered by what I actually use:
| Model | Size | Best For | Min Hardware | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | General, coding, writing | 16 GB RAM / 6 GB VRAM | Meta's workhorse. Start here. |
| Gemma 4 E4B | 4B | CPU-only, quick tasks | 8 GB RAM | Google's tiny model. Runs on a Raspberry Pi 5. |
| Gemma 4 26B | 26B | Complex reasoning | 32 GB RAM / 12 GB VRAM | Beats Llama 3.1 on reasoning. My heavy-work model. |
| DeepSeek R1 Distill 14B | 14B | Math, logic, code | 32 GB RAM / 12 GB VRAM | Best math model you can run locally. Slow but worth it. |
| Mistral 7B v0.4 | 7B | Speed, multilingual | 16 GB RAM / 6 GB VRAM | If you work in multiple languages, this is the pick. |
| Qwen3 8B | 8B | Code generation | 16 GB RAM / 6 GB VRAM | Alibaba's model. Underrated for coding. |
I switch between Gemma 4 26B for real work and Llama 3.1 8B for quick tasks. DeepSeek R1 Distill when I'm stuck on math or a particularly gnarly bug.
Don't ignore the tiny models. Gemma 4 E4B runs on a Raspberry Pi 5 and handles basic Q&A fine. I have one running on an old Intel NUC in my home office — it's basically a local Siri.
Make It Useful: RAG on Your Own Files
Most tutorials stop at "you have a local chatbot, congratulations." But a chatbot that only knows its training data is a toy.
What RAG actually does: You point it at documents, it indexes them, and when you ask a question, it searches those documents for relevant chunks and feeds them to the model as context. The model answers based on your documents, not its training data.
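To make the "search, then feed as context" idea concrete, here's a toy version of the retrieval step. It is not how AnythingLLM works internally: real RAG tools compare embedding vectors, while this sketch just counts shared keywords, and best_chunk and build_rag_prompt are names I invented for the illustration.

```shell
# Toy retrieval: split a text file into paragraphs ("chunks"), score each chunk
# by how many of the question's longer words it contains, and print the winner.
# Real RAG tools use embedding similarity instead of this keyword count.
best_chunk() {   # usage: best_chunk <file> <question...>
  file=$1; shift
  awk -v q="$*" '
    BEGIN {
      RS = ""                                # paragraph mode: blank line = new chunk
      q = tolower(q); gsub(/[^a-z0-9 ]/, " ", q)
      n = split(q, words, " ")
    }
    {
      text = tolower($0); score = 0
      for (i = 1; i <= n; i++)
        if (length(words[i]) > 3 && index(text, words[i]) > 0) score++
      if (score > best) { best = score; chunk = $0 }
    }
    END { if (best > 0) print chunk }
  ' "$file"
}

# Wrap the retrieved chunk and the question into one prompt for the model.
build_rag_prompt() {   # usage: build_rag_prompt <file> <question...>
  file=$1; shift
  printf 'Answer using only this context:\n%s\n\nQuestion: %s\n' \
    "$(best_chunk "$file" "$@")" "$*"
}

# Demo with three made-up "documents".
notes=$(mktemp)
cat > "$notes" <<'EOF'
The team agreed to migrate the database to Postgres in March.

Payment processing validates the currency code in the checkout service.

Lunch on Fridays is now at noon.
EOF
build_rag_prompt "$notes" "Where is the currency code validated?"
rm -f "$notes"
```

The demo picks the payment paragraph because it shares "currency" and "code" with the question. The model then only sees that chunk, which is why answers can point back to a specific place in your files.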
Setup in AnythingLLM:
- Create a workspace
- Upload documents (PDFs, Markdown, text files)
- Wait for indexing — a few seconds to a few minutes
- Ask questions
The first time I used this properly: I had 300 pages of technical documentation for a legacy system I was maintaining. Dumped it all in. Asked "where does the payment processing logic validate the currency code?" Got a real answer in about 10 seconds, complete with a reference to page 147 of the PDF. I almost cried. (Okay, not really. But it was a good moment.)
For developers: Point AnythingLLM at your repository root. It becomes a local Copilot that understands your entire codebase without sending a single line to Microsoft or GitHub.
Things That Actually Went Wrong (So You Don't Have To)
- Model's responding but it's slow: Try the quantized version (:q4_K_M). Lower precision, faster inference, and honestly the quality difference is barely noticeable for most tasks.
- Out of memory: Model's too big. Drop down a tier. Gemma 4 E4B is the safety net; if that doesn't run, you need a different computer.
- Open WebUI can't see Ollama models: Ollama might not be running. Run ollama serve in a terminal. On Windows, the installer starts it as a service; check Task Manager.
- Docker errors on Windows: Enable WSL2, then install Docker Desktop with WSL2 backend. Annoying, yes. Worth it, also yes.
- Models hallucinating more than you'd expect: Update Ollama. Older versions used sampling settings that increased hallucination rates. Current defaults are much better.
- AnythingLLM indexing stuck: Your documents might be scanned images without embedded text. AnythingLLM doesn't do OCR — convert them to searchable PDFs first.
FAQ
Is running AI locally actually free?
Yes. The software (Ollama, Open WebUI, AnythingLLM) is free and open source. The models are free downloads. You pay for electricity and hardware. That's it.
How does local AI compare to ChatGPT?
A local Llama 3.1 8B handles structured tasks (summarization, code completion, question answering) well, but on published benchmarks it sits closer to GPT-3.5 than to GPT-4. It's weaker on creative writing and nuanced conversation. For $0/month, the trade-off is reasonable.
Can I run this on a Mac?
Yes. Apple Silicon Macs (M1 through M4) are excellent for local AI. The unified memory architecture means even the 8 GB base configurations can run smaller models. Use the macOS Ollama installer.
Is my data really staying local?
Completely, with default settings. Ollama listens only on 127.0.0.1 (port 11434) unless you change OLLAMA_HOST. Nothing goes to the internet unless you explicitly configure cloud integrations. Open WebUI and AnythingLLM are local by default. Verify with Wireshark if you don't believe me; I did.
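If Wireshark feels like overkill, a quicker first check is to look at which address the server is bound to. This sketch assumes Linux with the ss tool and Ollama's default port 11434; classify_bind and show_ollama_listener are hypothetical helper names.

```shell
# Classify a listening address: loopback means only this machine can connect.
classify_bind() {
  case "$1" in
    127.*|localhost:*|"[::1]"*) echo "loopback only" ;;
    0.0.0.0:*|"[::]"*|"*:"*)    echo "exposed to the network" ;;
    *)                          echo "bound to a specific interface" ;;
  esac
}

# Print the local address of anything listening on port 11434.
# (Linux only; column 4 of `ss -ltn` is the local address:port.)
show_ollama_listener() {
  ss -ltn | awk '$4 ~ /:11434$/ { print $4 }'
}

classify_bind 127.0.0.1:11434   # -> loopback only
classify_bind 0.0.0.0:11434     # -> exposed to the network
```

Run show_ollama_listener and pass what it prints to classify_bind. Anything other than "loopback only" means other machines on your network can reach the API.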
Do I need internet to download models?
Only for the initial download. After that, models work fully offline. I use mine on planes regularly.
Can I use multiple models at once?
Yes. Ollama serves multiple models simultaneously. Open WebUI lets you switch between them mid-conversation. Running two models side by side for comparison is genuinely useful.
Is this good enough for professional work?
For many tasks, yes. I use local models daily for code review, document summarization, and research. What they lack in polish, they make up for in privacy and reliability.
What if I need more power?
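Add a GPU. An RTX 4060 (around $300) can roughly triple inference speed over CPU-only. With an RTX 4090 (24 GB VRAM), you can run 26B-parameter models at 4-bit quantization entirely on the GPU. Unquantized 16-bit weights for a 26B model need about 52 GB, so "full precision" at that size is out of reach for a single consumer card.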
Setting up a local AI stack in 2026 is genuinely easy. Twenty years ago I spent an entire weekend configuring a Linux kernel just to get my sound card working. This took 20 minutes and the benefits are permanent.
I still use ChatGPT and Claude for some things. But my daily assistant lives on my machine. And with GPT-5.6 requiring government sign-off per customer, I'm betting a lot more people will be joining me soon.