LLM Hallucination and Long Delays

17 March 2026

LLMs Don’t "Do" Things, They Talk About Doing Things

Artificial intelligence tools like ChatGPT can be incredibly helpful. They can explain ideas, rewrite emails, summarize documents, and answer questions in seconds.

But there’s something important to understand:

They don’t actually do things.

They generate text that describes doing things.

That difference matters.

One of my biggest pet peeves is when I ask an AI or an LLM (Large Language Model) to do something and it happily responds that it'll do the thing I requested, but:

it needs my go-ahead to do it
it's working, it just has a long processing period
it can do something it really can't actually do

These responses are insidious in that they imply -- no, they explicitly state -- that the LLM is going to do something. However, the LLM is fundamentally incapable of carrying out that task. This can result in a small inconvenience, or it can turn into hours or days of wasted time.

Here are some common ways AI tools can accidentally mislead people because of how they are designed, not because they are malicious.

Running Renovate Locally in Jenkins

11 March 2026

All of the repositories I own on GitHub -- public and private -- have Dependabot configured to update repository dependencies. Since almost all repos have at least MegaLinter configured to run when commits are added to a pull request, there's always something that needs to be watched. My default template repo has seven workflows, none of which I want to manually review daily, especially when there are hundreds of repositories.

I have very little problem putting source code out on GitHub that's intended for public consumption, even if I'm the only one who ever looks at that code. That said, I have a certain discomfort with storing Infrastructure as Code (IaC) into GitHub, even in private repositories.

Where it Hurts

Modern repositories multiply quietly. One service becomes three. Three become twelve. Before long, you are maintaining dozens or hundreds of repositories, each with its own workflows, linters, scanners, test runners, and release logic. Each repository may carry five, six, or seven GitHub Actions or Jenkins pipelines. Every dependency bump becomes a pull request. Every pull request triggers validation. The noise compounds.

Now multiply that by time.

A single dependency update rarely touches only one repository. Shared libraries drift. Container base images age. Transitive dependencies surface CVEs. Without automation, you are left manually scanning changelogs, running npm update, go get -u, or pip install --upgrade, committing changes, opening pull requests, and waiting for pipelines to pass. Each action may be individually small. The aggregate burden is not.

Running Ollama on TrueNAS CORE in a Jail

04 March 2026

The Hardware

NAS Server (Alhoon)

I have a pretty beefy system in my home lab to provide Network-Attached Storage (NAS) services.

The system has dual Intel(R) Xeon(R) CPU E5-2680 processors running at 2.70GHz with a total of 32 cores. The NAS server has 128GB of RAM, but no GPU. What it lacks in modernity, it makes up in volume.

The NAS is running TrueNAS CORE which is based on FreeBSD 13.1. TrueNAS SCALE runs Linux, but this is TrueNAS CORE. FreeBSD doesn't support Docker natively (Docker provides a containerized approach to workload management built around a single Linux kernel running on the host; FreeBSD runs its own kernel, not a Linux kernel). Someone could use an emulator to run a virtualized Linux kernel which could run Docker, but I suspect the multiple layers of abstraction would have a negative impact on performance.

In other words, the NAS is running FreeBSD rather than Linux; as a result, the virtualization options are different than what Linux offers, even when running on Linux-compatible hardware.

Dedicated LLM Server (Bakwas)

There are other systems in the cluster, but the one most relevant here is the dedicated LLM (Large Language Model) server. The LLM server is much newer, but as much less capacity. It is running an Intel(R) Xeon(R) E-2124G CPU clocked at 3.40GHz. It has 32GB of RAM, but it also has an NVIDIA GeForce GTX 1080 GPU with 8GB of VRAM. Clearly, this is a much smaller server, but the presence of the GPU has a tremendous impact on the system's ability to run an LLM.

Going into this project, I knew that CPU inference is slower than GPU inference. The additional cores and memory would offset some of that cost, but I wasn't sure what that impact those differences would have.

Comparison

Here's a comparison of the key differences between the two systems:

System	Role	OS	CPU	Cores	GPU	RAM	VRAM
Alhoon	NAS Server	FreeBSD	2.7	32	N/A	128GB	N/A
Bakwas	LLM Server	Linux	3.4	4	1080	32GB	8GB

The Software

Bakwas

Bakwas is running Linux, so the path to getting Ollama running was quick and direct. I ran a containerized version of Ollama, orchestrated by Docker Compose. Here's the relevant portion of my docker-compose.yml file:

---
services:
  # --------------------------------------------------
  # Local LLM - Ollama
  # --------------------------------------------------
  ollama:
    image: ollama/ollama:0.16.1@sha256:dca1224ecd799f764b21b8d74a17afdc00505ecce93e7b55530d124115b42260
    container_name: ollama
    ports:
      - 11434:11434
    # GPU access (Compose recognizes this when the NVIDIA Container Toolkit is present
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              count: all
    gpus: all
    environment:
      OLLAMA_DISABLE_TELEMETRY: "1"
      OLLAMA_NUM_GPU_LAYERS: 100
      TZ: America/New_York
      NVIDIA_VISIBLE_DEVICES: "all"
      OLLAMA_KEEP_ALIVE: 5m
    volumes:
      - ollama_models:/root/.ollama
      - /etc/localtime:/etc/localtime:ro
    networks: [web]
    restart: unless-stopped

#
# other services (Open WebUI, AnythingLLM, Qdrant, Glances, etc.) defined here
#

networks:
  web:

volumes:
  ollama_models:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /srv/ollama

Running Ollama on Bakwas was a matter of running docker compose up just like any other Docker Compose implementation.

I keep the configuration in a Gitea repository hosted locally with Renovate to maintain image versions.

Alhoon

Getting Ollama to run on FreeBSD 13.1 was a little more difficult. Because this system doesn’t have a working Vulkan implementation, Ollama fails when it tries to initialize Vulkan when a prompt was submitted. The error would look like this:

llama_model_load: error loading model: vk::createInstance: ErrorIncompatibleDriver
llama_load_model_from_file: exception loading model

Building Ollama

Because there was no option to download an Ollama package with the Vulkan integration disabled at the time of my writing this, I needed to rebuild Ollama with the Vulkan drivers turned off to get Ollama to work in my TrueNAS CORE jail,

If you don't have git, install it:

pkg install git

If you don't have the ports tree, clone it:

git clone https://git.FreeBSD.org/ports.git /usr/ports

If you do have the ports tree, update it:

cd /usr/ports
git pull

Then, build Ollama for your system:

cd /usr/ports/misc/ollama
make clean
make patch
sed -i~ -Ee 's/(vulkan)=on/\1=off/Ig' /usr/ports/misc/ollama/work/github.com/ollama/ollama@v0.3.6/llm/generate/gen_bsd.sh
make install

The make patch line creates the work/ directory so that the following update (the sed command) can update the generated files. The next line, make install performs the actual build (make build is dependency of make install) while the make install line will perform the actual installation. It's also common to pair the install target with the clean target (i.e., make install clean) to clean up after the build; I didn't do that here so that the updated files could be reviewable after the build (and installation) completed.

Starting Ollama

To start Ollama, be sure to set a few environment variables and send them along when starting ollama:

env \
  OLLAMA_HOST="0.0.0.0:11434" \
  OLLAMA_LLM_LIBRARY="cpu" \
  ollama serve

The first variable tells ollama to listen on all interfaces (if this pattern works for your environment; otherwise, consider binding to a specific interface which is protected by a packet filter).

The second variable tells ollama to only run on the CPU and not attempt to use the Vulkan drivers.

The command to start ollama is:

ollama serve

If you're pleased with how this works, you're welcome to setup an RC script so that you may use service ollama start to run the service.

I found that the FreeBSD documentation on RC scripting approach, while helpful, needed a little tweaking to make work in my environment. Also, instead of attempting to use the daemon tool to scaffold the ollama process, I used a nohup-based approach.

The core of my RC script was:

/usr/bin/env \
  OLLAMA_LLM_LIBRARY=cpu \
  OLLAMA_HOST="0.0.0.0:11434" \
  OLLAMA_MODELS="${OLLAMA_BASE}" \
  OLLAMA_TMPDIR="${OLLAMA_TMPDIR}" \
  OLLAMA_RUNNERS_DIR="${OLLAMA_RUNNERS_DIR}" \
  nohup "${OLLAMA_BIN}" serve >> "${OLLAMA_LOG}" 2>&1 &

There's some additional duct tape around that invocation to make sure the log location (OLLAMA_LOG) and model directory (OLLAMA_MODELS) existed and could be read by a dedicated ollama user.

What's provided here is a simple background start; for production consider using daemon(8) and pidfiles.

Performance Experiments

I thought that it might be possible to run Ollama on both systems with smaller, faster jobs on the dedicated LLM server (Bakwas) and more complex jobs running on the NAS (Alhoon). So, I setup Ollama on both systems and played around to see what the possibilities were.

Models

I run a bunch of different models on Bakwas, the dedicated LLM server. Most are small, typically of the 7b-q4k_m variety (i.e., 7 billion parameters, quantized to 4 bits, grouped with medium precision). They run reasonably quickly, often approaching chat-like speeds. Unfortunately, they are not able to reason very well at all.

Bakwas can also run some larger models, although they tend to run mostly on the CPU, so the speed drops off very quickly as compared to the 7b models which can run entirely on the GPU.

Sample Query

I attempted to run the following query:

How many days are between February 26, 2026 and July 14th, 2028?

The correct answer is 870 days (inclusively counted).

To make sure there weren't residual processes running or extraneous model data in memory, I bounced the Ollama process between each test.

Bakwas 7b Model

I ran the prompt twice against the llama:latest 7b model. The first response was:

There are 95 days between February 26, 2026 and July 14, 2028.

That prompt took 11 seconds to run; however, because it hadn't loaded the model from disk, I decided to run it again.

The second run took 38 seconds to run, but it drafted a Python script for me to run before providing me with its final answer:

There are 795.0 days between February 26, 2026 and July 14, 2028.

Both are way off.

Bakwas 30b Model

The next experiment was the qwen3:30b model.

This run was able to correctly calculate the right answer of 870 days. However, the real differentiator was that the 30b model ran for about 36 minutes before I had to stop the run. Looking at the thinking, it would recalculate the answer many, many, many different ways, often doubting its calculations along the way.

Bakwas 3.4b Model

The last experiment with Bakwas was the smallthinker:latest model which weighed in at 3.4b. After 171 seconds, the correct answer of 870 days was returned.

Alhoon 70b Model

My first experiment with Alhoon (the NAS server) was a 70b model (deepseek-r1:70b) with that same prompt.

As-expected, the prompt had performance along the lines of seconds per token rather than tokens per second. I'll never know how long Alhoon took because I cut it off after 10,260 seconds. When watching the thinking, it engaged in similar reasoning to the qwen3:30b model, but each fraction of a word would take several seconds to appear.

The Verdict

While the hardware wasn't sufficiently performant to make Alhoon a viable option for running prompts, it did demonstrate that it could be used. That is, running Ollama in a jail on a TrueNAS CORE system did work. I had hoped to be able to use Alhoon to run larger, more complicated prompts while acknowledging that I was trading speed for depth of processing; I did not anticipate that the gap between the two platforms was so wide.

Credits

Much of what I did was based on Thomas Spielauer's article on Running Ollama on a CPU-Only Headless FreeBSD 14 System which was critical to getting started. The sed script I drafted automated the updates to the gen_bsd.sh script while the OLLAMA_LLM_LIBRARY variable helped make sure that Ollama didn't even try to instantiate a Vulkan instance.

DevSecOps Engineer, Author, and Mentor