Every chapter before this one focused on building something: a profile, a skill, a cron job, a full agent team. This chapter focuses on keeping what you built running — and making sure the wrong things do not happen along the way.
Hermes runs on your machine, with access to your files, your API keys, and your shell. That power is what makes it useful. It is also what makes security and reliability non-negotiable. An agent that can run commands on your server can also run the wrong commands. An agent that holds your API keys can also leak them. A cron job that fails silently can leave you thinking work happened when it did not.
This chapter covers the defenses Hermes has built in, the things you need to do yourself, and the operational habits that keep your setup reliable over time.
Hermes uses a defense-in-depth model. That means no single layer is expected to catch everything — each layer adds protection, and a failure in one layer is caught by the next. The model stacks seven layers, from the outside in.
You do not need to configure all seven layers yourself. Layers 2 through 7 are active by default. Layer 1 (user authorization) requires you to set up allowlists for your gateway platform, which Chapter 5 covered. The rest are built in.
Hermes stores API keys and other credentials in ~/.hermes/.env, not in config.yaml. This separation matters: config.yaml contains your agent's preferences (model, toolsets, approval mode) and may be shared or exported. The .env file contains secrets and should never be shared, committed to Git, or included in a profile export.
Secret redaction is on by default. When Hermes logs tool output, model responses, or gateway messages, it scans for patterns that look like API keys, tokens, and credentials, then masks them. Short tokens are fully replaced. Longer tokens preserve the first six and last four characters so you can identify which key was used without seeing the full value. This redaction cannot be disabled at runtime — it is frozen at import time to prevent an agent from turning it off mid-session.
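The masking rule described above can be sketched in a few lines. This is an illustration of the behavior, not Hermes's actual implementation; the token pattern and the 16-character threshold are assumptions.

```python
import re

# Hypothetical pattern for API-key-like tokens; real redactors use a
# much larger curated pattern set.
TOKEN_RE = re.compile(r"\b(?:sk|key|tok)[-_][A-Za-z0-9_-]{8,}\b")

def mask(token: str) -> str:
    # Short tokens are fully replaced; longer ones keep the first six
    # and last four characters so you can tell which key was used
    # without exposing a usable value.
    if len(token) <= 16:
        return "***REDACTED***"
    return token[:6] + "..." + token[-4:]

def redact(text: str) -> str:
    return TOKEN_RE.sub(lambda m: mask(m.group(0)), text)

print(redact("calling api with sk-abcdef1234567890XYZ done"))
# -> calling api with sk-abc...0XYZ done
```

The important design property is the one the text calls out: the real redactor is frozen at import time, so nothing an agent does mid-session can swap these functions out.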
The practical rule: never put API keys in config.yaml, never commit .env to Git, and never share .env files in profile exports. Hermes separates these on purpose — follow the separation.
Before Hermes executes a command, it checks the command against a curated list of dangerous patterns. Things like deleting files, overwriting configs, installing packages, and running network operations all trigger approval. When a match is found, the agent pauses and asks you for permission.
Three approval modes are available, configured in ~/.hermes/config.yaml:
Manual: Every dangerous command requires explicit approval. The agent pauses, shows you the command, and waits. If you do not respond within the timeout (60 seconds by default), the command is denied. This is the safest mode and the one you should use unless you have a specific reason to change it.

Smart: An auxiliary LLM assesses each dangerous command. Low-risk commands (like running a Python print statement) are auto-approved. Genuinely dangerous commands are auto-denied. Uncertain cases escalate to a manual prompt. This reduces interruptions without fully disabling safety.

Off: All approval prompts are disabled. Every command executes immediately, regardless of risk. This is equivalent to YOLO mode. Use it only in disposable environments — CI containers, one-off sandboxes, environments you can recreate from scratch.
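In config.yaml this is a single setting. The key names in the fragment below are assumptions for illustration; verify the exact spelling against your own config.yaml or the documentation.

```yaml
# Hypothetical fragment of ~/.hermes/config.yaml
approval_mode: manual    # manual | smart | off
approval_timeout: 60     # seconds; an unanswered prompt is denied
```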
There is also YOLO mode, which you can toggle per-session with /yolo in the CLI or gateway. It bypasses all approval prompts for that session only. YOLO is useful for trusted, well-tested automation scripts where you know exactly what commands will run. It is dangerous for exploratory work where the agent might generate unexpected commands.
One critical detail: YOLO mode has a floor. The hardline blocklist is always on — even in YOLO mode, even with approvals set to off, certain commands are never allowed. Commands like rm -rf /, fork bombs, and raw block device overwrites are blocked unconditionally. There is no override flag. If you hit the blocklist, the tool call returns an error and nothing runs. If your workflow legitimately needs one of these commands, run it outside the agent.
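The two-tier check (patterns that trigger approval, plus an unconditional blocklist floor) can be sketched like this. The patterns shown are illustrative examples, not Hermes's real lists.

```python
import re

# Illustrative hardline blocklist: matched commands never run, in any mode.
HARDLINE_BLOCKLIST = [
    r"rm\s+-rf\s+/\s*$",           # wipe the filesystem root
    r":\(\)\s*\{\s*:\|:&\s*\}",    # classic fork bomb
    r"\bdd\s+.*of=/dev/sd",        # raw block device overwrite
]

# Illustrative dangerous patterns: matched commands pause for approval.
DANGEROUS_PATTERNS = [
    r"\brm\b", r"\bmv\b.*/etc/", r"\bpip\s+install\b", r"\bcurl\b",
]

def check(command: str, yolo: bool = False) -> str:
    # The blocklist applies unconditionally -- even in YOLO mode.
    if any(re.search(p, command) for p in HARDLINE_BLOCKLIST):
        return "blocked"
    if yolo:
        return "run"
    if any(re.search(p, command) for p in DANGEROUS_PATTERNS):
        return "ask"    # pause and wait for user approval
    return "run"

print(check("rm -rf /"))               # -> blocked
print(check("rm -rf /", yolo=True))    # -> blocked (the floor holds)
print(check("pip install requests"))   # -> ask
print(check("echo hello"))             # -> run
```

Note the ordering: the blocklist is checked before the YOLO shortcut, which is what makes it a true floor rather than another bypassable layer.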
Your agent depends on a model provider to think. If that provider goes down — rate limits, server overload, auth failures, connection drops — your agent stops. Provider fallback is the mechanism that keeps it running.
Hermes supports three layers of resilience:
Key rotation: Rotate across multiple API keys for the same provider. If one key hits a rate limit, the next key takes over. Useful if you have several keys for the same service.

Provider fallback: Automatically switch to a different provider and model when your primary fails. Configured in config.yaml under fallback_model or fallback_providers. When the primary provider errors, Hermes tries the fallback on that same turn, then restores the primary for the next turn. Your conversation continues without manual intervention.

Per-task provider resolution: Side tasks like vision, context compression, and web extraction resolve their providers independently. If your main provider is down, these tasks can still work through a different route. They degrade gracefully — if no provider is available for compression, the session continues without summarization.
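Taken together, the three layers amount to a simple loop: try each key, then each provider, then degrade. A minimal sketch, with illustrative provider names and a stubbed API call in place of the real Hermes machinery:

```python
class RateLimited(Exception):
    pass

def call_model(provider: str, key: str, prompt: str) -> str:
    # Stand-in for a real API call: pretend the primary provider is
    # rate-limited on every key and the fallback provider works.
    if provider == "primary":
        raise RateLimited(f"{provider}/{key} hit a rate limit")
    return f"answer from {provider}"

def complete(prompt, providers):
    # providers: list of (name, [keys]) tried in order.
    for name, keys in providers:
        for key in keys:                 # layer 1: rotate keys
            try:
                return call_model(name, key, prompt)
            except RateLimited:
                continue                 # keys exhausted -> next provider (layer 2)
    return None                          # layer 3: degrade gracefully

print(complete("summarize this", [("primary", ["k1", "k2"]), ("fallback", ["k3"])]))
# -> answer from fallback
```

A `None` result is the degradation case: a side task like compression simply skips its work for that turn rather than crashing the session.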
The easiest way to set up fallback is the interactive command:
```
hermes fallback
```

This opens the same provider picker as hermes model. You can also add fallbacks directly in config.yaml. The key principle: always configure at least one fallback provider. Model providers are reliable but not infallible. A single rate limit event can halt an entire cron pipeline if no fallback exists.
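A fallback entry in config.yaml might look like the following. The fallback_model key is named above; the provider/model value format shown here is an assumption for illustration.

```yaml
model: provider-a/main-model
fallback_model: provider-b/backup-model
```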
Three operational problems come up repeatedly in long-running agent setups. Here is what each one looks like and what to do about it.
If you run the messaging gateway on a VPS, it will eventually restart — either because the server reboots, the process crashes, or you update Hermes. When the gateway comes back up, it reconnects to your messaging platforms (Telegram, Discord, Slack) and resumes listening. Pending messages queue on the platform side and arrive once the gateway is back. The risk is not data loss — it is downtime. Your agents cannot respond while the gateway is down. Solution: run the gateway with automatic restart (systemd or Docker restart policy). Set up monitoring so you know when it goes down, not just when it comes back up.
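A minimal systemd unit for automatic restart looks like the sketch below. The ExecStart command, unit name, and paths are assumptions; adapt them to however you start the gateway on your server.

```ini
# /etc/systemd/system/hermes-gateway.service (hypothetical)
[Unit]
Description=Hermes messaging gateway
After=network-online.target

[Service]
ExecStart=/usr/bin/env hermes gateway
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Restart=always covers both crashes and reboots (via the [Install] target); pairing it with an external uptime check gives you the "know when it goes down" half.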
Cron jobs run unattended. If a job fails — the model provider is down, the API key is invalid, the working directory is wrong — it fails without a human watching. The job records the error in its status, but you will not see it unless you check. Solution: use the deliver field on cron jobs to send results (and errors) to a messaging platform. If the job succeeds, you get the output. If it fails, you get the error message in the same channel. No output at all is the worst case — it means the job might not have run at all. Also run hermes doctor periodically to catch configuration problems before they cause silent failures.
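As a sketch, a cron job definition with delivery configured might look like this. Only the deliver field name comes from the text above; every other field name here is hypothetical.

```yaml
name: weekly-report
schedule: "0 9 * * 1"    # Mondays at 9 AM
prompt: "Generate the weekly report"
deliver: telegram        # output on success, the error message on failure
```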
If config.yaml has a syntax error — an unclosed quote, a wrong indent, a missing colon — Hermes cannot parse it. The agent will fail to start, or it will start with default settings and ignore your custom configuration. This is especially dangerous because the failure is silent: you might not realize your approval mode, model, or toolset configuration was lost. Solution: after editing config.yaml, run hermes doctor to validate the configuration. The doctor command checks for parse errors, missing keys, invalid values, and common misconfigurations. It takes seconds and catches problems that would otherwise surface as mysterious agent behavior.
Everything that matters lives in one directory: ~/.hermes. That includes your config, your profiles, your memory, your skills, your cron jobs, and your API keys. If you lose this directory, you lose your entire setup.
Hermes has a built-in backup command that creates a zip archive of the full ~/.hermes directory:
```
hermes backup
```

This produces a timestamped zip file in your home directory. The backup excludes the hermes-agent code repository (which you can re-clone), bytecode caches, Git metadata, and transient runtime files like gateway PID files. It includes everything else: config, profiles, memory, skills, cron jobs, session databases, and your .env file with API keys.
To choose where the archive is written, pass --output:

```
hermes backup --output /path/to/backup.zip
```

To restore from a backup, use the import command:

```
hermes import /path/to/backup.zip
```

The import overlays the backup onto your current ~/.hermes directory — existing files are overwritten, but files not in the backup are preserved.
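You can also schedule the backup itself. A crontab entry like the sketch below writes a dated archive every night; the destination path is illustrative, and note that % must be escaped as \% inside crontab.

```
# crontab -e entry: nightly backup at 02:00
0 2 * * * hermes backup --output "$HOME/backups/hermes-$(date +\%F).zip"
```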
Hermes updates frequently — the project is active, with regular releases that add features, fix bugs, and improve security. Updating is straightforward, but it is worth doing carefully when you have a running setup with cron jobs and gateway connections.
```
hermes update
```

The update command pulls the latest code, reinstalls dependencies, and restarts the gateway if it is running. Before you update, follow this sequence:
1. Run hermes backup before every update. If the update changes config format, breaks a profile, or introduces a regression, you can restore from the backup. This takes seconds and saves hours.

2. Read the release notes for the version you are updating to. Look for breaking changes to config format, new required fields, or deprecated features. The hermes update command handles most migration automatically, but knowing what changed helps you spot problems faster.

3. Run hermes update. The command handles the git pull, dependency reinstall, and gateway restart. If the gateway is running, the update restarts it with the new code. Active sessions are preserved across the restart.

4. After the update, run hermes doctor. This catches configuration problems introduced by the update — new required fields, changed defaults, or format migrations that did not apply cleanly.

5. Test one manual session with each active profile. Confirm the model connects, tools work, and memory loads. Then check that your cron jobs are still listed and that the gateway is accepting messages.
```
hermes doctor
```

The doctor command is your diagnostic tool. It checks: Python and Node availability, API key configuration, model provider connectivity, toolset loading, profile integrity, and gateway status. Run it after any change that might affect your setup — updates, config edits, new profiles, or when something just feels wrong.
Security features and backup commands are necessary but not sufficient. The difference between a setup that runs for months and one that breaks weekly is operational habits. Here are the six that matter most:
1. Back up before every change. Before updating Hermes, editing config.yaml, adding a new profile, or changing a cron schedule — run hermes backup. The command takes seconds. Restoring from a backup takes minutes. Rebuilding a lost setup takes hours or days.

2. Make hermes doctor a reflex. It catches configuration problems early, before they become silent failures in cron jobs or gateway sessions. Update, then doctor.

3. Always configure a fallback provider. Model providers go down. Rate limits happen. API keys expire. A single fallback provider in your config means your agent stays running when your primary provider fails. Without one, every provider outage halts your entire pipeline.

4. Set deliver on every cron job. Silent failures are the most dangerous kind. If your Monday scheduled report fails and you do not find out until Friday, you have lost a week. Configure the deliver field so errors arrive in your Slack, Telegram, or Discord channel immediately.

5. Use manual approval mode in production. Smart approval mode is convenient for development. Manual mode is the right choice for production profiles with access to sensitive files and systems. The few extra approval clicks are cheaper than recovering from an auto-approved destructive command.

6. Prune memory monthly. Memory grows. Skills accumulate. Outdated facts and stale procedures degrade agent quality over time. Review each profile's memory files monthly. Delete contradictions, remove outdated information, and keep files compact. Accurate context produces reliable output.
Hermes provides the mechanisms — approvals, redaction, fallbacks, backups, isolation. The mechanisms work automatically when they are configured. But no mechanism replaces judgment. You decide which approval mode to use. You decide whether to configure fallback providers. You decide whether to back up before an update. The security model gives you the tools. Using them consistently is your responsibility.
The reward for that consistency is a setup that runs reliably for months. The cost of skipping it is unpredictable failures at the worst possible time — a cron job that silently stops, an API key that leaks into a log, a config change that disables your approval mode without you noticing. These are not theoretical risks. They are the normal failure modes of any long-running system. The defenses in this chapter are how you avoid them.
Your cron job runs a scheduled report every Monday at 9 AM. You check on Wednesday and realize you never received Monday's results. The job has been failing silently since you changed your API key two weeks ago. What should you add to prevent this from happening again?