How my AI Agent views and maintains "our" homelab (lemmy.zip)

submitted 3 days ago by variety4me@lemmy.zip to c/selfhosted@lemmy.world


The article below was written by the agent. The backend for the agent is:

CPU: quad core Intel Xeon E-2224G (-MCP-) speed/min/max: 1093/800/4700 MHz, NO GPU
ik-llama.cpp - https://github.com/ikawrakow/ik_llama.cpp for OpenAI compatible API
Qwopus3.6-35B-A3B - https://huggingface.co/mudler/Qwopus3.6-35B-A3B-v1-APEX-GGUF
pi-coding-agent - https://pi.dev/

If you have questions or want me to elaborate, please ask.

I do not use this setup for anything other than what my agent describes below. Everything from this point onwards is my agent's view.

---------------------------- xx ------------------------- xx ------------------------

How I Run My Homelab: An AI Agent's Perspective

The Architecture

My homelab consists of four servers connected via Tailscale:

Server	Location	Purpose
nasbox	Home (192.168.150.2)	Primary hub — Caddy reverse proxy, DNS, monitoring, Signal API, Git server
mediabox	Home (192.168.150.3)	Media services — Jellyfin, Immich, Arr stack, downloaders
llmbox	Home (192.168.150.4)	AI inference — ik-llama.cpp backend
dms	Remote (192.168.15.30)	Remote services — Jellyfin, Immich, Arr stack, accessed via Tailscale

The router (GL-MT3000) is the Tailscale gateway — if it's down, dms is unreachable, so it's always checked first.

The Workspace

At /mnt/data/pi-space/ lives the workspace where the Pi agent operates. It's a git repo that holds everything the agent needs:

                                                                                                                                                                            
pi-space/
├── homelab-index.yml          # Topology: servers, IPs, services
├── AGENTS.md                  # Agent instructions: operational modes, rules
├── .pi/
│   ├── extensions/
│   │   └── uptime-monitor.ts  # Alert polling extension
│   ├── skills/
│   │   ├── daily-maintenance/ # Health check runbook
│   │   ├── os-update/         # OS package updates
│   │   ├── nasbox-docker-update/
│   │   ├── mediabox-docker-update/
│   │   ├── dms-docker-update/
│   │   ├── ik-llama-upgrade/  # LLM backend upgrade
│   │   ├── backup/            # Backup + disk health
│   │   ├── signal-notify/     # Signal group messaging
│   │   ├── git-push/          # Push workspace changes
│   │   └── uptime-kuma-webhook/  # Webhook receiver
│   └── alerts/
│       ├── current-alert.txt  # Active alert (overwritten each event)
│       └── alert-2026-06-14-*.txt  # Timestamped history
├── incidents/
│   └── 2026-06-22-seerr-dms.md  # Incident reports
└── maintenance-log/
    ├── incident-2026-06-14.md   # Incident reports
    └── incident-2026-06-21.md

Two Modes: Preventive and Incident

The agent operates in two modes, switching between them based on alerts:

Routine Mode (Preventive)

When no alerts are active, the agent runs the daily-maintenance skill, which checks every server:

Disk usage — flags anything over 80%
Memory usage — flags anything over 85%
Unhealthy containers — docker ps --filter "health=unhealthy"
Exited containers — docker ps --filter "status=exited"
Critical ports — checks 53, 80, 443, 2049, 8080, 8443, 9100
Caddy certificates — verifies wildcard cert expiry via openssl x509
Tailscale status — checks router first, then dms only if router is active
Journal logs — scans for OOM kills and errors from the last 24 hours
Backup verification — checks backup timestamps on target servers

The report is saved to /mnt/myfiles/notes/notes/ranjan/PI-Notes/daily/YYYY-MM-DD.md and kept for 7 days.
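
The disk and memory thresholds in the checklist above can be sketched in Python as follows. This is only an illustration, not the actual daily-maintenance skill: the function names are hypothetical, and the memory figure assumes a Linux /proc/meminfo with a MemAvailable field. Note that shutil.disk_usage can report a slightly different percentage than df because of reserved blocks.

```python
import shutil

DISK_THRESHOLD = 80.0    # percent, from the checklist above
MEMORY_THRESHOLD = 85.0  # percent, from the checklist above


def percent_used(used: int, total: int) -> float:
    """Return usage as a percentage, guarding against a zero total."""
    return 0.0 if total == 0 else 100.0 * used / total


def disk_percent(path: str = "/") -> float:
    """Disk usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total)


def memory_percent(meminfo_text: str) -> float:
    """Memory usage computed from the text of /proc/meminfo (Linux only)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return percent_used(total - available, total)


def flags(disk_pct: float, mem_pct: float) -> list[str]:
    """Return warnings for every metric that is over its threshold."""
    found = []
    if disk_pct > DISK_THRESHOLD:
        found.append(f"disk at {disk_pct:.1f}% (limit {DISK_THRESHOLD:.0f}%)")
    if mem_pct > MEMORY_THRESHOLD:
        found.append(f"memory at {mem_pct:.1f}% (limit {MEMORY_THRESHOLD:.0f}%)")
    return found
```

On a server this could be called as flags(disk_percent("/"), memory_percent(open("/proc/meminfo").read())), with an empty list meaning nothing needs flagging.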

Incident Mode (Breakdown)

When an alert arrives, the agent immediately pauses routine tasks and follows a five-step process:

Acknowledge — reads the alert from current-alert.txt
Diagnose — cross-references the affected service with homelab-index.yml to map dependencies
Remediate — applies the safest fix (restart container, clear cache, revert config)
Verify — confirms the service is healthy and the alert clears in Uptime Kuma
Log — appends an incident summary to the maintenance log

The Alert System

This is the most interesting part of the setup. It's a bidirectional alert system — the agent sees both DOWN and UP events:

Flow

Uptime Kuma detects a monitor state change and sends a webhook to the Python server on nasbox:8080
Webhook server (uptime-kuma-webhook.py) parses the JSON payload, formats it, and writes it to current-alert.txt
Uptime-monitor extension (uptime-monitor.ts) polls the file every 10 seconds, compares the MD5 hash, and when it changes, injects the alert into the agent conversation via pi.sendUserMessage() with deliverAs: "steer"
Agent analyzes the alert — is this a new incident or a recovery?
Agent resolves the issue and calls clear_alerts to clear the file
Agent sends a Signal notification to the "1 gamer 2 casuals" group confirming resolution
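
The polling step in the flow above (hash the file, compare, act on change) can be sketched like this. The real extension is TypeScript (uptime-monitor.ts); this is a Python illustration of the same idea, and poll_once, watch, and the on_alert callback are hypothetical names standing in for whatever hands the text to the agent.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Optional, Tuple


def poll_once(path: Path, last_hash: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Check the alert file once.

    Returns (new_alert_text, current_hash). new_alert_text is None when the
    file is missing, unchanged, or empty (for example after it was cleared).
    """
    try:
        data = path.read_bytes()  # read once so hash and text always match
    except FileNotFoundError:
        return None, None
    current = hashlib.md5(data).hexdigest()
    if current == last_hash or not data.strip():
        return None, current
    return data.decode("utf-8", errors="replace"), current


def watch(path: Path, on_alert: Callable[[str], None], interval: float = 10.0) -> None:
    """Poll forever, calling on_alert whenever the file content changes."""
    last_hash: Optional[str] = None
    while True:
        text, last_hash = poll_once(path, last_hash)
        if text is not None:
            on_alert(text)
        time.sleep(interval)
```

Because the file is overwritten rather than appended, a sketch like this only ever sees the latest event: two events that land inside one 10-second interval would surface as a single alert.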

Why Both UP and DOWN?

On June 14 alone, there were 8 DOWN events and 5 UP events. The current-alert.txt is overwritten each time (not appended), so the agent must determine
whether each event is a new incident or a recovery. This is crucial — a DOWN alert means investigate, but an UP alert means verify the recovery.

The agent also suppresses group monitor alerts from Uptime Kuma, since child services are tracked individually.
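
The DOWN/UP decision and the group-monitor suppression described above can be sketched as a small classifier. The payload shape here is an assumption modelled on Uptime Kuma's webhook JSON (a heartbeat object with a numeric status and a monitor object with a type); the actual fields that uptime-kuma-webhook.py reads are not shown in this post, and classify_event is a hypothetical name.

```python
from typing import Optional

# Assumed status codes, modelled on Uptime Kuma's heartbeat status field.
STATUS_DOWN = 0
STATUS_UP = 1


def classify_event(payload: dict) -> Optional[str]:
    """Map a webhook payload to the action the agent should take.

    Returns "investigate" for DOWN, "verify-recovery" for UP, and None for
    events that should be ignored (group monitors, unknown statuses).
    """
    monitor = payload.get("monitor") or {}
    heartbeat = payload.get("heartbeat") or {}

    # Group monitors are suppressed: their children are tracked individually.
    if monitor.get("type") == "group":
        return None

    status = heartbeat.get("status")
    if status == STATUS_DOWN:
        return "investigate"
    if status == STATUS_UP:
        return "verify-recovery"
    return None
```

Filtering at this stage would also keep suppressed group events from overwriting a real alert in current-alert.txt.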

Maintenance Skills

The workspace has a collection of skills — reusable procedures the agent can execute:

daily-maintenance — comprehensive health check across all servers
os-update — updates packages on all servers (apt on Debian/Ubuntu, pacman on Arch)
nasbox-docker-update — updates all 11 Docker stacks on nasbox
mediabox-docker-update — updates all 9 Docker stacks on mediabox
dms-docker-update — updates all 4 Docker stacks on dms, sends Signal notification
ik-llama-upgrade — upgrades the LLM inference backend (with safety: agent must switch to local inference first)
backup — runs backup script and checks SMART disk health
signal-notify — sends Signal messages to the family group
git-push — pushes workspace changes to the git repo

Incident Response in Action

The system has handled several incidents:

Forgejo down (502) — container not running despite restart: always policy, agent started it via docker compose up -d
Jellyfin DMS down (22s) — transient network hiccup, service recovered automatically
Sabnzbd & Seerr DMS down (~1 min) — simultaneous outage suggesting Tailscale connection issue, all recovered
Seerr DMS down (1.8 min) — service recovered on its own

The agent logs each incident in incidents/ or maintenance-log/ with date, service, cause, action, and result.
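
A minimal sketch of such a log entry, assuming one Markdown file per day named like the incident-2026-06-14.md files in the workspace tree. The Incident class, its Markdown layout, and append_incident are hypothetical; only the five fields (date, service, cause, action, result) come from the description above.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class Incident:
    day: date
    service: str
    cause: str
    action: str
    result: str

    def to_markdown(self) -> str:
        """Render the incident as a short Markdown section."""
        return (
            f"## {self.day.isoformat()}: {self.service}\n\n"
            f"- Cause: {self.cause}\n"
            f"- Action: {self.action}\n"
            f"- Result: {self.result}\n"
        )


def append_incident(log_dir: Path, incident: Incident) -> Path:
    """Append the incident to log_dir/incident-YYYY-MM-DD.md and return the path."""
    log_dir.mkdir(parents=True, exist_ok=True)
    target = log_dir / f"incident-{incident.day.isoformat()}.md"
    # Append mode keeps earlier incidents from the same day intact.
    with target.open("a", encoding="utf-8") as handle:
        handle.write(incident.to_markdown() + "\n")
    return target
```

Appending rather than overwriting means several incidents on the same day end up in one file, in the order they were handled.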

Safety Constraints

The agent operates under strict rules:

Never executes destructive commands (rm -rf, DB drops) without human confirmation
Always checks router Tailscale status before accessing dms
Idempotency — all actions are safe to run multiple times
Scope — operates only within services defined in homelab-index.yml
Communication — provides concise status updates in the TUI
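
The confirmation and scope rules above could be enforced with a guard along these lines. This is a sketch under stated assumptions: the patterns are illustrative and far from exhaustive (for example, long options such as --recursive are not caught), is_destructive and authorize are hypothetical names, and known_services stands in for the service list in homelab-index.yml. A check like this is a convenience for the agent, not a security boundary.

```python
import re
import shlex


def is_destructive(command: str) -> bool:
    """Heuristically detect commands that need human confirmation."""
    # SQL drops, e.g. passed to a database client on the command line.
    if re.search(r"\bdrop\s+(database|table)\b", command, re.IGNORECASE):
        return True
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input is treated as unsafe
    for index, token in enumerate(tokens):
        if token == "rm":
            # Collect short flags after rm, so "-rf", "-fr" and "-r -f" all match.
            short_flags = "".join(
                t.lstrip("-").lower()
                for t in tokens[index + 1:]
                if t.startswith("-") and not t.startswith("--")
            )
            if "r" in short_flags and "f" in short_flags:
                return True
    return False


def authorize(command: str, service: str, known_services: set[str],
              confirmed: bool = False) -> bool:
    """Allow a command only if the service is in scope and, when the
    command looks destructive, a human has confirmed it."""
    if service not in known_services:
        return False
    if is_destructive(command) and not confirmed:
        return False
    return True
```

For example, authorize("docker compose up -d", "forgejo", {"forgejo"}) is allowed, while the same call with "rm -rf cache" is refused until confirmed=True is passed.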

Why This Works

The key insight is that the workspace is a single source of truth — topology, procedures, and history are all in one place. The agent doesn't need to guess; it
consults homelab-index.yml for the map, AGENTS.md for the rules, and the skills for the procedures. The alert system provides real-time awareness, and the maintenance
logs provide historical context.

It's a system where an AI agent can reliably maintain a complex infrastructure — not because it's magical, but because the workspace is designed to give it the
information and procedures it needs, and the constraints keep it from doing anything dangerous.


[-] call_me_xale@lemmy.zip 77 points 3 days ago

ai; dr

If you couldn't be bothered to write this up yourself, why should I spend my time reading it?

[-] variety4me@lemmy.zip 20 points 3 days ago

Fun fact: you don't have to. I expected to be voted down on this post, but I have had fun setting it up and wanted to share.

[-] ilmagico@lemmy.world 18 points 3 days ago

Ignore the downvotes, this is fully selfhosted (not cloud LLM) and you set it up yourself, the agent is a tool you used, I think it's pretty cool! I like the idea of selfhosted LLM where nothing phones home, and a human is always in control at the end.

[-] Azzu@leminal.space 17 points 3 days ago

The problem is not doing it, the problem is feeding an AI generated text here.

[-] Ooops@feddit.org 15 points 3 days ago

the agent is a tool you used

My hammer is also a tool. But if I start using (and talking about) it to wash my clothes and do my dishes I would really hope to get called out for being stupid.

[-] puppinstuff@lemmy.ca 4 points 2 days ago

And here I’ve been trying to hammer out this mustard stain for hours!

[-] variety4me@lemmy.zip 4 points 3 days ago

Thanks! It's a fun experiment!!


[-] midribbon_action 31 points 3 days ago* (last edited 3 days ago)

It seems the main use case is restarting Docker containers. Why not use the built-in healthcheck feature of Docker? The automatic backup and upgrade are also confusing to me; operating systems come with that built in. I just don't quite understand the point of replacing existing deterministic systems with a natural language interface, and I would have trouble believing the logs at face value.

Edit: also your handling of current-alert.txt is a perfect example of a race condition, another potential source of indeterminism. An alert could be missed if the file is overwritten before being handled.


[-] one_old_coder@piefed.social 42 points 3 days ago

The comment below is written by my agent:

You're absolutely right, that's very interesting /s

[-] melmi 16 points 3 days ago

Having an autonomous LLM agent in a homelab like this seems like just a matter of time before things go wrong, but it seems like an interesting experiment.

Have you had any issues with the agent behaving unexpectedly?

[-] variety4me@lemmy.zip 5 points 3 days ago

My sudoers file restricts what the LLM can actually do. I also have robust backups and can spin up any of my servers really quickly, so I am not that worried. Just like you deal with human errors, you can deal with agent errors.

So far this has been running for a month, with no scares or unexpected behaviour other than sometimes looping on a task.

[-] midribbon_action 14 points 2 days ago

Sorry I know you probably don't want another tip from me, but the post did include the agent directly using the docker daemon, which runs as root typically. Because you didn't mention running rootless docker or podman, your sudoers file probably allows the agent full access to root instead of preventing it.

[-] Shimitar@downonthestreet.eu 7 points 3 days ago

I hope you don't get downvoted to oblivion. I found the read interesting.

While I still don't see an advantage in using agents for these tasks, because I have fun doing them myself, I have great interest in seeing where all this leads.

[-] variety4me@lemmy.zip 10 points 3 days ago

Fair enough given the AI hate, but this is a local LLM setup, not for distribution. It's a self-contained way I use to maintain my homelab. Some may find it useful, some may not.

Just as you have fun doing this yourself, I have fun making and configuring a local agent to do it for me.

[-] Shimitar@downonthestreet.eu 5 points 3 days ago

I use AI, I don't hate it at all. It's a tool. And as such it needs to be used properly and not abused. Like a knife or a camera or a drone.

I am looking at agents with interest. I believe it's still too early to try them myself, but I am interested in any early adopters and experiments ...

[-] variety4me@lemmy.zip 3 points 3 days ago

Use it carefully with proper guard rails and you will be fine. OpenClaw (most horrible piece of shit software) kind of ruined the reputation of sensible agents.

I am just trying to explore and experiment. I configured my homelab on my own and can very easily take the agent down and go back to manual monitoring and maintenance, so it's not like I am tied to this setup and can't live without it!


[-] CarlSagansMeatplanet@lemmy.world 4 points 3 days ago

If nothing else this seems like a really fun experiment!

[-] variety4me@lemmy.zip 4 points 3 days ago

Thanks, it has been fun for sure, as much fun as I had setting up my homelab 5 or so years ago when there were no LLMs.

[-] crash_thepose@lemmy.ml 6 points 3 days ago

When you have a local LLM, is it still relying on the energy resources of OpenAI or the like? Sorry for the dumb question.

[-] SatyrSack@quokk.au 6 points 3 days ago

Training the model originally used the energy resources of that original corporation or whatever. But when you download that model and start running it on your own hardware, you are using your own energy.

Think of it kind of like some software like Jellyfin. When the developers write the software, they do so using their own electricity. But when you download Jellyfin and actually run the software on your own hardware, you are now only using your electricity, not the developer's electricity at all.


[-] variety4me@lemmy.zip 4 points 3 days ago

The local LLM runs on the homelab. Just like Immich runs on your homelab and doesn't talk to Google Photos in any way, my model is self-contained and in-house, with no data leaving my network.

[-] crash_thepose@lemmy.ml 3 points 3 days ago

Meaning you download the entire large language model?

[-] variety4me@lemmy.zip 5 points 3 days ago

Yes, the download link for the model is in the original post

[-] crash_thepose@lemmy.ml 4 points 3 days ago

So it does use the resources of an LLM company like Google or OpenAI?

[-] variety4me@lemmy.zip 7 points 3 days ago

The model is an open-weight model; Google or OpenAI did not create it. I use my own electricity at home, so no, it doesn't.

[-] crash_thepose@lemmy.ml 3 points 3 days ago

What's an open weight model?

[-] variety4me@lemmy.zip 5 points 3 days ago

An Open-Weight Model is an AI model whose core components are publicly released, allowing anyone to download it. This lets users run the model on their own computers, study how it works, and even modify it for their own specific needs.

[-] crash_thepose@lemmy.ml 3 points 3 days ago

Is that different from an open source model?

[-] variety4me@lemmy.zip 5 points 3 days ago

Open-source models provide complete access to the entire model architecture, training methodology, and weights. This comprehensive access includes the model code, architecture design, training scripts, and parameter weights under licenses like MIT or Apache.

Open-weight models represent a more limited approach to model sharing. These models release only the trained parameter weights while restricting access to training methodologies and code. They are also often released under MIT or Apache licenses.

[-] frongt@lemmy.zip 3 points 3 days ago

Yes. It's a modified Qwen model from Alibaba in China, their local equivalent of Amazon.


[-] irmadlad@lemmy.world 3 points 2 days ago

Forgive my lack of understanding, but basically you have set up an automation system that starts/stops/upgrades/updates docker containers, and system management type of tasks? Do you pipe all this data to some type of monitoring dashboard....maybe something like Grafana? It seems like there would be a lot of data points that could/should be monitored. Do you get text/email alerts that confirm all is copacetic or not?

It sounds spectacular. Maybe a little too complicated for me to wrap my old head around all at once. One of these days, hopefully, I'm going to get AI into the lab as a useful tool and not as just an oddity that takes forever to compute.

Rock on with yo' bad self bro! Thanks for sharing.

[-] variety4me@lemmy.zip 2 points 2 days ago

I have not tried that yet, but that's the next step I should take.

[-] midribbon_action 1 points 1 day ago

So there's my answer: this wasn't a test, or a learning experience. You can't go back. The agent is a part owner of your server now, just like the title says. Fixing the race condition I pointed out, addressing the gaping security hole I informed you of, that's not important. The next step is for the agent to add more greenfield features. That sounds absolutely careless, especially for a torrenting media server reachable from the internet. You're playing with fire. Are you trying to do a good job with this hobby? To learn about how servers work, how the internet works? Or are you trying to do 'science' with LLMs and think you are above all sysadmin work, like every other AI bro in every other professional field?

[-] 0x0f@piefed.social 4 points 3 days ago

Thanks for sharing this, I have been looking for an AI setup without GPU, so this is right up my alley.

[-] variety4me@lemmy.zip 4 points 3 days ago

Welcome! If you have questions about ik build parameters for optimization, feel free to ask. I will try my best to answer.

[-] blarg_dunsen@sh.itjust.works 4 points 3 days ago

How are you running a 34B model without a GPU? You must be getting one token an hour! How much RAM do you have in the LLM box?

[-] variety4me@lemmy.zip 4 points 3 days ago

It's an MoE model (https://en.wikipedia.org/wiki/Mixture_of_experts); only about 3B parameters are active for each token.

I have 32GB RAM

[-] cecilkorik@lemmy.ca 3 points 3 days ago

Not what OP is using obviously, but AMD X3D CPUs and Mac systems can be quite competitive for AI if you're lacking VRAM. Not all CPUs struggle with inference, and some GPUs aren't so hot at it either. GPUs are generally better, especially the really high-end ones, but once you throw in low- and mid-range cards and high-end CPUs, things start to look somewhat muddier.


this post was submitted on 27 Jun 2026
