How to get a ChatGPT-like service running on your own machine
“I don’t trust ChatGPT with my questions. (Or Gemini or Claude or any of them).”
“Our app is getting more users so tokens are getting expensive.”
“I need more control in the formatting of prompts and responses.”
The solution to all three problems is to run an LLM on your own machine: no per-token cost, no network latency, no data leaving your laptop, and full control over prompts and responses.
I initially thought I’d need a supercomputer with tons of disk space and GPU power. And I would, if I wanted to train a model. But to just run one, even a laptop is fine.
It’s surprisingly simple. We’ll do these four steps …
- get an engine,
- get a model,
- start up the server, and
- interact with it.
Step 1: Download the Ollama engine
Go to ollama.com and grab the installer for your operating system. Run it and follow the prompts; it’s basically just hitting the Next button a few times.
Verify the install by checking the version:
ollama --version
Step 2: Download a model
To install a model, simply run ollama pull <model_name> from any shell.
You have tons of models to choose from; browse the library at ollama.com/library for the full list.
For my purposes, llama3.1 was what I needed. I installed it with ollama pull llama3.1 and saw this output:
$ ollama pull llama3.1
pulling manifest
pulling 667b0c1932bc: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.9 GB
pulling 948af2743fc7: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.5 KB
pulling 0ba8f0e314b4: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████▏ 12 KB
pulling 56bb8bd477a5: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████▏ 96 B
pulling 455f34728c9b: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████▏ 487 B
verifying sha256 digest
writing manifest
success
Step 3: Start up the server
Simple:
ollama run llama3.1
That’s it. You now have an LLM running on your machine, served locally on localhost:11434. All that’s left is to prompt it.
Step 4: prompt the model
You can do this in two ways: interactively, or programmatically.
Interactively
You can use a simple command-line interface to interact with the model. Just type your prompt after the >>> and read the response.
$ ollama run llama3.1
>>> Write a limerick about running a marathon.
Here is a limerick about running a marathon:
There once was a runner so fine,
Whose marathon goal she did design.
She trained with great care,
And ran through the air,
And crossed the finish line in good time!
>>> Send a message (/? for help)
This is cute, but not practical for real applications. You probably did all this to power an app, so let’s see how to call your new service programmatically.
Programmatically
Any app can send a POST request to http://localhost:11434/api/generate with your prompt. Here’s an example body:
{
"model": "llama3.1",
"prompt": "Write a limerick about running",
"stream": false
}
You’ll get a response like this:
{
"model": "llama3.1",
"created_at": "2025-09-01T16:54:47.594435Z",
"response": "Here is a limerick about running:\n\nThere once was a runner so fine,\nWhose speed on the track was divine.\nShe pounded the ground,\nWith her feet all around,\nAnd finished with a happy shine.",
"done": true,
"done_reason": "stop",
"context": [128006, ... 33505],
"total_duration": 892255750,
"load_duration": 63993125,
"prompt_eval_count": 17,
"prompt_eval_duration": 126625000,
"eval_count": 45,
"eval_duration": 701249583
}
It’s plain JSON, and the data you want is in the “response” field.
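As a sketch, the whole round trip can be wrapped in a small Node helper. The names generate, buildGenerateBody, and extractText are illustrative, not part of any official Ollama client:

```javascript
// Sketch of a non-streaming call to the local Ollama server.
// These helpers are illustrative names, not an official Ollama SDK.
function buildGenerateBody(model, prompt) {
  // stream: false asks the server for one complete JSON reply
  return JSON.stringify({ model, prompt, stream: false });
}

function extractText(reply) {
  // The generated text lives in the "response" field
  return reply.response;
}

async function generate(model, prompt) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildGenerateBody(model, prompt),
  });
  return extractText(await res.json());
}
```

With the server running, `await generate("llama3.1", "Write a limerick about running")` returns just the limerick text.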
Bonus! Processing the streaming value
If you set "stream": true in your request, you’ll receive a stream of partial responses as the model generates them. This is useful for long outputs or when you want to display text in real time. Here’s a Node snippet showing how to handle that.
const response = await fetch("http://localhost:11434/api/generate", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama3.1",
prompt: "Write a limerick about running",
stream: true
}),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let fullText = "";
while (true) {
const { done, value } = await reader.read();
if (done) break;
const chunk = decoder.decode(value, { stream: true }).trim();
// each chunk may contain multiple JSON lines
for (const line of chunk.split("\n")) {
if (!line) continue;
const json = JSON.parse(line);
if (json.response) {
fullText += json.response;
process.stdout.write(json.response); // stream to console
}
}
}
console.log("\n\nFinal output:", fullText);
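One caveat: a network chunk can end in the middle of a JSON line, which would make JSON.parse throw. A small buffering helper fixes that by holding the trailing partial line until the next chunk arrives. This is a sketch, and makeLineParser is my own name for it, not an Ollama API:

```javascript
// Buffers incoming text and invokes the callback once per complete
// newline-terminated JSON object, even when a chunk splits a line.
// makeLineParser is an illustrative helper, not part of Ollama.
function makeLineParser(onJson) {
  let buffer = "";
  return (chunk) => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep the trailing partial line for the next chunk
    for (const line of lines) {
      if (line.trim()) onJson(JSON.parse(line));
    }
  };
}
```

Inside the read loop, you’d call this parser with each decoded chunk instead of splitting the chunk directly.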
Final thoughts
Running your own ChatGPT-like service is super empowering. You get privacy, control, and flexibility, and you’ll learn a ton about modern AI. If you hit a snag, feel free to reach out to me for help.