Run LLMs Locally with Ollama - A Practical Guide

March 30, 2026 by Rap Payne

So you want to run LLMs on your own machine without sending your data to OpenAI, Anthropic, or Google? Maybe you’re tired of API costs, or you need to work offline, or you just want full control over your AI stack. Ollama makes this straightforward. Here’s everything you need to know.

What is Ollama?

Ollama is a command-line tool that lets you run large language models locally. Think of it as Docker for LLMs. It handles model downloads, manages GPU memory, and provides a simple API you can hit from any application.

Instead of this:

// Sending data to OpenAI
const response = await openai.chat.completions.create({
  model: "gpt-5",
  messages: [{ role: "user", content: "Hello" }]
});

You get this:

// Running locally with Ollama
const response = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3.1',
    prompt: 'Hello'
  })
});

No API keys. No usage limits. No data leaving your machine.

Installation

macOS

Easiest option is Homebrew:

brew install ollama

Or download the app from ollama.ai and drag it to Applications.

Linux

One-liner install:

curl -fsSL https://ollama.ai/install.sh | sh

This works on most distros (Ubuntu, Debian, Fedora, CentOS, etc.). It installs Ollama as a system service that starts automatically.

Windows

Download the installer from ollama.ai and run it. Ollama runs as a Windows service in the background.

Basic Usage

Check your Ollama version:

ollama --version

Running Your First Model

Start with a small, fast model like Phi:

ollama run phi3

First time you run this, Ollama downloads the model (about 2.2GB). This might take a few minutes depending on your internet connection. After that, it’s instant.

You’ll drop into a chat interface:

>>> Hello, who are you?
I am Phi, a large language model trained by Microsoft...

>>> /bye

Popular Models to Try

Llama 3.1 (8B parameters, good all-around choice):

ollama run llama3.1

Mistral (7B parameters, fast and capable):

ollama run mistral

Codellama (optimized for code):

ollama run codellama

Gemma 2 (Google’s model, excellent quality):

ollama run gemma2

DeepSeek Coder (best for programming tasks):

ollama run deepseek-coder

Browse all available models at ollama.ai/library.

Model Size Considerations

Models come in different sizes. Bigger is usually better but requires more RAM:

7B models: 8GB RAM minimum, fast responses
13B models: 16GB RAM, better quality
70B models: 64GB+ RAM, excellent but slow on CPU

If you have an NVIDIA GPU with enough VRAM, Ollama uses it automatically and everything runs much faster.

Estimating RAM needs

A model’s RAM requirement depends on its size and quantization level. Quantization compresses the model — less precision, less RAM, slightly lower quality.

A rough formula: RAM needed ≈ (parameters in billions) × (bits per weight) ÷ 8

Model	Quantization	Approx. RAM
7B	q4 (4-bit)	~4 GB
7B	q8 (8-bit)	~8 GB
13B	q4	~8 GB
13B	q8	~16 GB
70B	q4	~40 GB

When in doubt, check the model page on ollama.ai/library — it lists the RAM requirement for each variant.

Checking your GPU and VRAM

macOS:

system_profiler SPDisplaysDataType | grep -E "Chipset|VRAM"

Linux (NVIDIA):

nvidia-smi

Windows:

nvidia-smi

Or open Task Manager → Performance → GPU to see VRAM visually.

If nvidia-smi isn’t found, you either don’t have an NVIDIA GPU or the drivers aren’t installed. AMD GPU support in Ollama is limited on non-Linux systems.

Essential Commands

List installed models

ollama list

Pull a model without running it

ollama pull llama3.1

Delete a model to free up space

ollama rm mistral

Show model details

ollama show llama3.1

Update a model to the latest version

ollama pull llama3.1

Models improve over time. Re-pulling gets you the latest version.

Using Ollama via HTTP API

The chat interface is fine for quick tests, but you’ll usually want to call Ollama from your applications.

Ollama runs a local web server on port 11434. You can hit it with curl, Python, JavaScript, or any language that speaks HTTP.

Simple generation request

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Response:

{
  "model": "llama3.1",
  "created_at": "2025-12-27T10:30:00.000Z",
  "response": "The sky appears blue because...",
  "done": true
}

Chat API (with conversation history)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    { "role": "user", "content": "What is 2+2?" },
    { "role": "assistant", "content": "4" },
    { "role": "user", "content": "What about 2+3?" }
  ],
  "stream": false
}'

The chat endpoint maintains context across multiple turns.

Streaming responses

Set "stream": true to get responses token-by-token:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku",
  "stream": true
}'

You’ll receive multiple JSON objects, one per token:

{"model":"llama3.1","response":"Silent","done":false}
{"model":"llama3.1","response":" morning","done":false}
{"model":"llama3.1","response":" dew","done":false}
...
{"model":"llama3.1","response":"","done":true}

This is how you build that ChatGPT-style typing effect.

Using Ollama from JavaScript

Install the official client:

npm install ollama

Basic example:

import ollama from 'ollama';

const response = await ollama.chat({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Explain async/await' }],
});

console.log(response.message.content);

Streaming example:

const response = await ollama.chat({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Write a short story' }],
  stream: true,
});

for await (const part of response) {
  process.stdout.write(part.message.content);
}

The JavaScript client handles all the HTTP details and gives you a clean async interface.

Using Ollama from Python

Install the package:

uv pip install ollama # or just `pip install ollama`

Basic usage:

import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'user', 'content': 'What is recursion?'}
    ]
)

print(response['message']['content'])

Streaming:

stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Count to 10'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Creating Custom Models with Modelfiles

You can customize model behavior by creating a Modelfile. This lets you set system prompts, adjust temperature, or build specialized models.

Create a file called Modelfile:

FROM llama3.1

PARAMETER temperature 0.8
PARAMETER top_p 0.9

SYSTEM """
You are a helpful coding assistant. You provide concise, accurate code examples.
Always explain your reasoning. Format code with proper syntax highlighting.
"""

Build your custom model:

ollama create my-coding-assistant -f Modelfile

Use it:

ollama run my-coding-assistant

Your model now has custom behavior built in. This is powerful for creating domain-specific assistants.

Common Modelfile parameters

temperature: Controls randomness (0.0 = deterministic, 1.0 = creative)
top_p: Nucleus sampling threshold (0.9 is a good default)
top_k: Limits to top K tokens (50-100 is common)
presence_penalty: Discourages repeating topics
frequency_penalty: Discourages repeating exact words
num_ctx: Context window size (default 2048, increase for longer conversations)
stop: Custom stop sequences
seed: For reproducible outputs

Practical Use Cases

Local code completion

Run Codellama and hit it from your editor. No sending proprietary code to external APIs.

Offline AI assistance

Download models before traveling. Work on flights or anywhere without internet.

Rapid prototyping

No API rate limits means you can iterate fast without worrying about costs.

Data privacy

Keep sensitive data on your machine. Medical records, legal documents, financial info — none of it leaves your control.

Learning and experimentation

Try different models and parameters without burning through API credits.

Troubleshooting

Model won’t load - out of memory

You’re trying to run a model bigger than your RAM. Try a smaller variant:

Instead of llama3.1:70b, use llama3.1:8b
Check available models: ollama list

Slow performance

CPU inference is slow, especially for large models. Options:

Use a smaller model
Get a machine with more RAM
Use a GPU (NVIDIA cards work best)

Kill existing processes or change the port:

OLLAMA_HOST=0.0.0.0:11435 ollama serve

Can’t connect from another machine

By default, Ollama only listens on localhost. To expose it:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Use with caution. Anyone who can reach your machine can now use your LLM.

GPU not being detected

Check NVIDIA drivers:

nvidia-smi

If that fails, your GPU drivers aren’t installed properly. On Linux:

# Ubuntu/Debian
sudo apt install nvidia-driver-525

# Restart Ollama after installing drivers
sudo systemctl restart ollama

Model gives weird or incorrect outputs

Try adjusting temperature and top_p in your Modelfile or API calls. Lower temperature (0.3-0.5) gives more focused, deterministic responses.

Performance Tips

Use quantized models: Most Ollama models are already quantized (compressed) for faster inference. You’ll see tags like q4_0 or q8_0. Lower numbers = more compression, less quality, but faster.

Increase context window for longer conversations:

ollama run llama3.1 --num-ctx 4096

Pre-load models: The first request is slow because Ollama loads the model into memory. After that, it’s fast. Keep Ollama running and models stay loaded.

Batch similar requests: If you’re processing multiple items, send them in quick succession while the model is hot.

Comparison with Cloud APIs

When to use Ollama

You need data privacy
You want to avoid API costs
You work offline frequently
You’re prototyping and need fast iteration
You have decent hardware (16GB+ RAM, or a good GPU)

When to use cloud APIs

You need the absolute best quality (GPT-4, Claude Opus)
You don’t have strong hardware
You need guaranteed uptime and scaling
You want someone else to handle model updates

There’s no wrong choice. Many developers use both — Ollama for development and testing, cloud APIs for production.

Going Further

Running Ollama in Docker

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Useful for deployment or keeping your system clean.

Setting up auto-start

macOS: Ollama auto-starts when you install the app.

Linux: Already configured if you used the install script.

Windows: Runs as a service automatically.

To disable:

# Linux
sudo systemctl disable ollama

# macOS
launchctl unload ~/Library/LaunchAgents/com.ollama.ollama.plist

Multi-GPU setup

If you have multiple GPUs, Ollama uses them automatically. Control which GPUs with:

CUDA_VISIBLE_DEVICES=0,1 ollama serve

Conclusion

Ollama makes running LLMs locally practical. Download a model, send it prompts via HTTP, get responses back. No complex setup, no external dependencies.

Start with a small model like Phi or Mistral. Once you’re comfortable, experiment with larger models and custom Modelfiles. The flexibility is worth it.

If you’re building applications that need AI but can’t send data to external APIs, or if you’re tired of paying per-token, Ollama is your answer.

Quick Reference

# Install (macOS)
brew install ollama

# Install (Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model
ollama run llama3.1

# List models
ollama list

# Pull a model
ollama pull mistral

# Delete a model
ollama rm mistral

# API call
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello world",
  "stream": false
}'

Prerequisites

This guide assumes you have:

A Mac, Linux, or Windows machine
At least 8GB RAM (16GB+ recommended)
Basic familiarity with command line
curl or another way to make HTTP requests

Optional but recommended:

NVIDIA GPU with 8GB+ VRAM for faster inference
50GB+ free disk space if you plan to download multiple large models