Skip to main content

Running LLMs on a Raspberry Pi — Step-by-Step Tutorial (2026)

Running LLMs on a Raspberry Pi — Step-by-Step Tutorial (2026)



Can a $60 computer run a large language model?

Yes. And it works better than you'd expect.

In 2026, you don't need a cloud GPU cluster to run Llama 2, Phi-3, or TinyLlama. A Raspberry Pi 5 with 8GB of RAM can handle small LLMs right at the edge — no internet required.

In this tutorial, I'll show you exactly how.



What You'll Need

Hardware:
- Raspberry Pi 5 (8GB) — or Pi 4 (4GB minimum)
- 64GB microSD card (Class 10)
- 5V/5A power supply
- Active cooler or fan

Total cost: $60–120 depending on options



Which LLMs Actually Run on a Pi?

TinyLlama 1.1B — 2-3GB RAM — Good quality — Best for beginners

Phi-3 mini (4-bit) — 3-4GB RAM — Very good — Reasoning and logic

Llama 2 7B (4-bit) — 5-6GB RAM — Great — Text generation

Recommendation for first-timers: Start with TinyLlama 1.1B — it's the easiest to run.



Step 1: Set Up Your Raspberry Pi

If you already have Raspberry Pi OS installed, skip to Step 2.

Fresh setup:
1. Download Raspberry Pi Imager from raspberrypi.com
2. Choose Raspberry Pi OS Lite (64-bit)
3. Flash to microSD card
4. Enable SSH (create empty 'ssh' file in boot partition)
5. Boot and connect via SSH



Step 2: Install Dependencies

Run these commands one by one:

sudo apt update && sudo apt upgrade -y
sudo apt install git cmake build-essential -y
sudo apt install python3-pip python3-venv -y

Install Ollama (easiest method):

curl -fsSL https://ollama.com/install.sh | sh



Step 3: Download and Run Your First LLM

Start Ollama service:

ollama serve

Open a second terminal and run:

ollama run tinyllama

First run will download the model — takes 5-15 minutes depending on your internet.

Expected speed: 2-5 tokens per second on Pi 5



Step 4: Run Better Models (Optional)

After TinyLlama works, try Phi-3:

ollama run phi3:mini

Or Llama 2 (7B) if you have 8GB RAM:

ollama run llama2:7b



Step 5: Create a Simple Chat Script (Python)

Save this as chat.py:

import subprocess

def ask_llm(prompt, model="tinyllama"):
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True,
        text=True
    )
    return result.stdout

response = ask_llm("Explain edge computing in one sentence")
print(response)

Run it:

python3 chat.py



Performance Benchmarks (Real Tests)

On Raspberry Pi 5 (8GB) with active cooler:

TinyLlama 1.1B: 4-6 tokens/sec — 1-2 sec first response

Phi-3 mini (4-bit): 3-4 tokens/sec — 2-3 sec first response

Llama 2 7B (4-bit): 1-2 tokens/sec — 5-8 sec first response

On Raspberry Pi 4 (4GB): TinyLlama only (2-3 tokens/sec)





Troubleshooting

Problem: ollama: command not found
Fix: Reinstall or add ~/.local/bin to PATH

Problem: Model downloads forever
Fix: Check internet connection (WiFi on Pi can be slow)

Problem: Pi freezes or throttles
Fix: Add active cooling — thermal throttling kills performance

Problem: Out of memory error
Fix: Use smaller model or 4-bit quantized version

The #1 mistake: Using a Pi 4 with 4GB and trying to run Llama 2 7B. Don't do it.



What Can You Actually Do With an LLM on a Pi?

Local chat assistant — Yes (slow but usable)
Text summarization — Yes
Code generation — Yes (short snippets)
Real-time translation — Borderline (2-3 second delay)
Long document analysis — No (memory limit)

Best use: Offline assistant for home automation, note-taking, or learning how LLMs work.



The Hybrid Setup (My Favorite)

Run the Pi as an edge LLM server:

1. Keep Ollama running on the Pi
2. Call it from any device on your local network
3. No cloud. No API fees. No privacy concerns.

API endpoint example:

curl http://raspberrypi.local:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "What is 42?"
}'

Now every device in your house has private LLM access.




Key Takeaway

Yes — you can run LLMs on a Raspberry Pi.

It won't match ChatGPT speed. But for $60, you get:

- Complete privacy (no data leaves your home)
- No monthly subscription
- Offline capability
- A fun weekend project that teaches real AI skills

Start with TinyLlama. Upgrade to Phi-3. Then build something useful.



What's Next?

In our next post: TinyML on a Microcontroller — AI for $5



New to Edge AI? Read my beginner's guide: [Edge AI vs Cloud AI: Which One Wins in 2026?]

Comments

Popular posts from this blog

Edge AI vs. Cloud AI: Which One Wins in 2026?

​ Latency kills user experience. Cloud AI is incredibly powerful. But every millisecond spent sending data to a server and waiting for a response adds friction. That’s where The AI Edge comes in. In 2026, the debate is no longer “Is Edge AI possible?” It’s “Which approach wins for my specific use case?” Let’s break down Edge AI vs. Cloud AI — head to head. What is Cloud AI? Cloud AI processes data on remote servers (AWS, Google Cloud, Azure). Your device captures data, sends it to the cloud, and waits for the result. Examples: ChatGPT, Google Photos recognition, voice assistants (usually). Pros: · Massive compute power (GPUs/TPUs at scale) · Easy to update models centrally · Great for non-real-time tasks Cons: · High latency (100–500 ms round trips) · Requires internet always · Privacy concerns (your data leaves the device) What is Edge AI? Edge AI runs models directly on local devices — phones, cameras, sensors, or microcontrollers. No round trip to the cloud. Examples: Face unlock on...