Designing Hermes Agent from Scratch: A Systems Deep Dive

Jun 08, 2026

∙ Paid

If you strip away the packaging, Hermes is not a chatbot with plugins bolted on. It is a long-running control loop that sits between a language model and a curated set of side effects. Every design choice in NousResearch/hermes-agent follows from that premise: the model proposes actions, the runtime enforces budgets and guardrails, tools execute in isolated environments, and SQLite holds the truth about what actually happened.

This article walks through Hermes Agent architecture in detail. The goal is not a feature tour. It is a builder's map: where state lives, how turns commit, why prompts are split into tiers, and which patterns you can steal if you are designing your own agent runtime.

What you are building

Most "agent" demos are thin wrappers: send messages to an API, parse tool calls, call a few functions, repeat until the model stops. Hermes inverts the emphasis. The loop is the product. Tools are registered data, not hard-coded branches. Sessions survive restarts. Gateways multiplex the same loop across Telegram, Discord, Slack, and a local CLI. Subagents get their own iteration budgets, not a shared piggy bank.

Three invariants show up everywhere once you know where to look:

One internal message format. Adapters translate OpenAI chat completions, Anthropic messages, and Codex responses into the same dict-shaped history the loop consumes.
Prompt prefix stability. The system prompt is assembled once per session and rebuilt only after compression. Volatile facts (memory, timestamps) are segregated so provider-side prefix caches stay warm.
Blast-radius control via toolsets. The registry knows every tool; toolsets decide which subset a given session, subagent, or gateway channel may touch.

Figure 1 is the spine. Everything else hangs off the facade and the conversation loop.

CLI sessions, gateway adapters, and cron jobs all converge on AIAgent, a facade in run_agent.py. The heavy work lives in agent/conversation_loop.py (~4,900 lines of turn orchestration), with prompt assembly in agent/system_prompt.py, dispatch in model_tools.py, and persistence in hermes_state.py (SessionDB, schema version 14).

The facade pattern and why it matters

Early Hermes put the entire loop inside run_agent.AIAgent.run_conversation. That method grew past 3,900 lines. The refactor extracted run_conversation into agent/conversation_loop.py but kept AIAgent as the public object tests and gateways patch.

The indirection is deliberate. Production code and tests monkeypatch symbols on run_agent (handle_function_call, OpenAI, _set_interrupt). The loop resolves those through a lazy _ra() helper:

def _ra():
    import run_agent
    return run_agent

When conversation_loop needs to dispatch a tool, it calls _ra().handle_function_call(...). Patches applied to run_agent.handle_function_call still work even though the implementation body moved to model_tools.py.

For your own agent, this is the stable seam pattern: keep a thin, patch-friendly module boundary even when you split monoliths. Gateways, cron workers, and CLI entry points should depend on a small surface (AIAgent.run_conversation, session IDs, tool lists), not on internal loop helpers.

Initialization is centralized in agent/agent_init.py. An AIAgent instance carries provider credentials, tool allowlists, memory flags, compression settings, and an IterationBudget. Subagents created by delegate_task run through the same init path with a different toolset and budget cap.

One user turn, end to end

A "turn" is one user message through to a final assistant reply (possibly after many model round-trips and tool calls). Figure 2 shows the phases.

When run_conversation starts, it does more housekeeping than you might expect from a demo loop:

Binds logging context to session_id so hermes logs --session filters correctly.
Resets per-turn retry counters (_empty_content_retries, _invalid_tool_retries, etc.) so a painful previous turn does not poison the next.
Resets the iteration budget consumption for this turn (parent budget is per-turn fresh; subagents carry their own).
Runs a TCP health check on idle HTTP connections to avoid hanging on zombie sockets after provider outages.
Ensures a SessionDB row exists via _ensure_db_session().

Then the loop enters the iteration cycle:

Assemble messages — history plus system prompt. If cached, reuse agent._cached_system_prompt.
Preflight compression — if estimated tokens exceed threshold, compress before the API call (see below).
Provider call — streaming or non-streaming, with failover classification on errors.
Parse response — tool calls, reasoning blocks, empty-content guards.
Dispatch tools — parallel or serial depending on config; results appended as tool messages.
Repeat until the model returns final text or budgets/guardrails halt the turn.
Post-turn hooks — memory review nudges, skill provenance, persistence to SessionDB.

The loop is defensive in ways chat wrappers skip: surrogate sanitization on clipboard paste, vision capability downgrade after a text-only endpoint rejects images, and explicit retry paths for "model returned empty string but listed tools" failure modes common on smaller open-weight models.

Insight: turn boundaries are economic boundaries. Hermes resets retry state and iteration accounting at the turn edge because tool-heavy turns can consume dozens of inner iterations. Without that reset, a subagent-heavy turn would starve the next user message.

IterationBudget: thread-safe fuel, not a global cap

Each AIAgent owns an IterationBudget (agent/iteration_budget.py). It is a mutex-protected counter with consume(), refund(), and read-only used / remaining.

class IterationBudget:
    def __init__(self, max_total: int):
        self.max_total = max_total
        self._used = 0
        self._lock = threading.Lock()

    def consume(self) -> bool:
        with self._lock:
            if self._used >= self.max_total:
                return False
            self._used += 1
            return True

    def refund(self) -> None:
        with self._lock:
            if self._used > 0:
                self._used -= 1

Defaults from config: parent max_iterations is 90; each subagent gets delegation.max_iterations (50 by default). These budgets are independent. A parent at iteration 80 can still spawn a child with a fresh 50. Total work across the tree can exceed the parent's cap. That surprised me at first, but it matches the product goal: delegation is outsourcing, not sharing one loop counter.

execute_code (programmatic tool calling) refunds iterations because those inner steps are runtime bookkeeping, not user-visible reasoning steps.

Pattern for builders: separate user-visible iteration limits from internal runtime steps via explicit refund hooks. Otherwise agents either die mid-refactor or learn to avoid programmatic calling.

System prompt: three tiers and cache discipline

Hermes treats the system prompt as a caching asset, not a dump of everything known. build_system_prompt_parts returns stable, context, and volatile strings joined with double newlines.

Figure 3 maps the tiers.

Stable tier includes SOUL.md or default identity, Hermes self-help guidance, task-completion rules, tool-specific behavioral blocks (memory, skills, kanban worker mode), tool-use enforcement for models that need nagging, model-family operational hints (Gemini conciseness, GPT execution discipline), skills index, environment probes, active profile warnings, and platform hints for gateways.

Context tier holds caller-supplied system messages and discovered project files (AGENTS.md, .cursorrules, etc.) under TERMINAL_CWD. Gateway mode sets TERMINAL_CWD to the user's project directory so the daemon does not read rules from the install path.

Volatile tier carries memory snapshots, USER.md profile, external memory provider blocks, and a date line. Timestamps are day-granularity on purpose. Minute-level timestamps would bust prefix caches on every rebuild.

Critical invariant from the upstream docs: the prompt is built once, stored on agent._cached_system_prompt, and only invalidated via invalidate_system_prompt() after compression. Ephemeral instructions are injected at API-call time, not baked into the cached string.

For Anthropic models, apply_anthropic_cache_control marks stable blocks for provider-side caching. The engineering goal is byte-stable prefixes across turns in a session.

If you build your own agent and wonder why responses got slow and expensive after you added "helpful" dynamic sections to the system prompt, this is why. Hermes segregates volatile state so the expensive part stays stable.

Tool registry and AST discovery

Hermes does not maintain a central list of tool names in a YAML file. Each tool module self-registers at import time by calling registry.register(...) with schema, handler, toolset membership, optional check_fn, and metadata.

Discovery scans tools/*.py with AST inspection:

def _module_registers_tools(module_path: Path) -> bool:
    source = module_path.read_text(encoding="utf-8")
    tree = ast.parse(source, filename=str(module_path))
    return any(_is_registry_register_call(stmt) for stmt in tree.body)

Only top-level registry.register(...) calls count. A helper that registers inside a function is ignored, which prevents accidental double registration from lazy imports.

Figure 4 shows the boot sequence.

At process start, discover_builtin_tools() imports every qualifying module. Side effect: registration populates the singleton ToolRegistry. MCP tools register separately when MCP servers connect. check_fn callables (e.g., "is Docker running?") are TTL-cached for 30 seconds so repeated get_definitions() calls do not probe the filesystem every time.

Each ToolEntry carries JSON schema, sync or async handler, toolset label, optional dynamic_schema_overrides, and max_result_size_chars for truncation before results hit the context window.

Pattern: registration beats enumeration. New tools are new files plus one register block. model_tools.get_tool_definitions() queries the registry; nothing else needs updating.

Toolsets: product surface, not implementation detail

The registry is complete; toolsets are curatorial. Figure 5 shows layers.

toolsets.py defines named bundles: which tools belong, alias names, optional toolset-level check_fn, and CLI/gateway presets (minimal, coding, full, etc.). When you launch Hermes with --toolsets terminal,browser, you are not loading different code. You are filtering get_tool_definitions().

Why bother? Three reasons: token economics (fewer schemas in the prompt), safety (a read-only Slack bot should not expose terminal or delegate_task), and subagent contracts (delegate_task passes explicit tool lists to children).

Dynamic schema overrides matter for delegation. The delegate_task description embeds current max_concurrent_children and max_spawn_depth so the model sees accurate limits, not stale defaults from when the schema was written.

Pattern: two-layer tool model — global registry for engineers, toolsets for product and security.

Dispatch: handlefunctioncall and runasync

All tool execution funnels through model_tools.handle_function_call. It coerces stringy JSON arguments to typed values, routes Tool Search bridge calls (tool_search, tool_describe, tool_call), runs pre/post hooks, enforces edit-approval gates, and finally invokes the handler from the registry.

Figure 6 summarizes the path.

Async handlers cannot assume a running event loop. CLI sessions often have none; gateway code often has one. _run_async is the single bridge: worker threads inside async gateways, per-thread persistent loops for parallel tool executors, and a process-wide loop for plain CLI.

Tool Search collapses deferrable tools behind a small bridge surface so the model searches a catalog instead of receiving fifty schemas upfront. Bridge dispatch re-scopes definitions to the session's enabled toolsets so a restricted subagent cannot invoke tools outside its grant via tool_call.

Context compression: protect head and tail, summarize the middle

Long sessions die in one of two ways: hard context limits from the provider, or soft death by token cost. Hermes compresses proactively.

Figure 7 outlines the pipeline.

ContextCompressor tracks estimated and provider-reported token counts. When over threshold, it prunes old tool results, protects head and tail messages, summarizes the middle via an LLM call, and optionally forks a child session linked by parent_session_id in SessionDB.

Preflight compression runs at the start of a turn if estimates cross threshold, before paying for a doomed API call. After compression, retry counters reset so the model gets a fresh attempt on the shorter context.

Insight: compression is a session event, not an invisible edit. Gateways replay compression warnings through status callbacks so users know history was summarized.

SessionDB: SQLite as the agent's memory bus

JSONL transcripts do not scale to gateway concurrency, full-text search, and parent/child session chains. SessionDB (hermes_state.py) is a WAL-mode SQLite database (schema version 14) with FTS5 on messages.

Figure 8 shows the main entities.

Design choices worth copying: WAL mode for concurrent readers, FTS5 plus trigram triggers for session_search, system_prompt column for resume fidelity, parent_session_id for compression lineage, and source tagging (cli, telegram, discord, …) for filtered history.

Pattern: one durable conversation store with search and lineage, separate from execution telemetry.

Delegation: subagents as fresh AIAgent instances

delegate_task spawns a child AIAgent with restricted tools, a clean message history containing only the delegated instruction, and its own IterationBudget. Figure 9 captures the economics.

Important defaults: max_spawn_depth defaults to 1; recursive delegate_task is stripped from child tool lists; roles (leaf, orchestrator, …) normalize tool access and prompt fragments injected into the child system message.

The parent sees only the child's final text. Intermediate tool traces stay in the child session. Cross-agent file state uses task_id exposed on the parent during the turn so snapshots at child launch identify the correct isolation namespace.

Pattern: subagents are processes, not threads — separate budgets, separate tool grants, separate SessionDB rows, summarized return channel.

Skills versus tools: progressive disclosure

Tools are executable functions in the registry. Skills are documentation packages on disk, loaded through skills_list, skill_view, and skill_manage. Figure 10 shows the disclosure levels.

Figure 10: Skills progressive disclosure

Level 1: a compact skills index lives in the stable prompt. Level 2: the model calls skill_view to load full SKILL.md content when a task matches. Level 3: reference files inside the skill directory load on demand.

Skills carry provenance via write-origin tagging so automated improvements do not look like user-directed edits.

Pattern: progressive disclosure for context-heavy capabilities — index in prompt, payload on tool call.

Gateway: same loop, many surfaces

The gateway adapts chat platforms to Hermes turns. Figure 11 shows inbound and outbound flow.

Inbound: platform adapter → message guard → SessionStore → AIAgent turn → delivery.py back to the platform. The gateway sets TERMINAL_CWD so context files resolve to the user's project. Platform hints inject stable-tier prompt fragments for channel-specific behavior.

Insight: gateways fail on session affinity, not model quality. SessionDB mapping and delivery acknowledgment matter as much as model choice.

API modes: adapters converging on one history

Hermes speaks multiple provider APIs. Figure 12 shows the adapter pattern.

api_mode selects among chat_completions, anthropic_messages, and codex_responses. Each adapter translates tool call formats into the internal OpenAI-style message list the loop already understands. Failover logic classifies rate limits, context exceeded, and entitlement errors.

Pattern: internal canonical transcript plus thin outward adapters. Never fork the loop per provider.

Cron, MCP, and execution environments

Three extension points round out the runtime.

Cron (cron/scheduler.py, cron/jobs.py) fires scheduled prompts through the same AIAgent path. Jobs inherit profile-specific config and can run with skip_context_files when the task is operational rather than repo-scoped.

MCP (tools/mcp_tool.py) registers tools from external MCP servers at startup. They appear alongside builtins in the registry with explicit server provenance. MCP is opt-in at connect time because each server is a trust boundary.

Execution environments under tools/environments/ implement terminal sessions (local, Docker, Modal, SSH). Tools like execute_command route to the configured backend. task_id isolates concurrent sessions so two gateway chats do not share shell state.

These are not separate agent cores. They are alternate entrypoints and backends feeding the same loop and registry.

Memory and external providers

Built-in memory uses files under the Hermes home directory plus optional SessionDB-backed search. MemoryManager (agent/memory_manager.py) can attach external memory providers with their own system prompt blocks in the volatile tier.

Background review forks can propose memory or skill updates after a turn. Write origin tagging prevents automated suggestions from bypassing user intent.

session_search leverages FTS5 to pull prior conversations into context when the model needs institutional memory across sessions.

The design tension is familiar: more memory helps until it erases prefix cache stability. Hermes keeps memory in volatile tier and day-stable timestamps to balance recall with economics.

Guardrails, hooks, and middleware

Tool execution passes through layers many prototypes skip:

Pre-tool-call hooks for user confirmation on destructive edits
Tool guardrails with per-turn reset and halt decisions
Request middleware traces for debugging gateway sessions
Unicode and surrogate sanitization on messages and tool schemas
Invalid JSON argument repair before rejecting tool calls
Post-tool empty-response retry when the model stalls after tool results

The loop tracks housekeeping tools separately so retries behave differently when all tool calls were benign metadata fetches versus substantive edits.

This is unglamorous code. It is also what separates "works on my laptop once" from "runs on a gateway for a week."

CLI versus legacy REPL

Primary CLI entry is hermes_cli/main.py (Typer-based commands, gateway control, doctor, tools subcommands). cli.py remains as a legacy REPL path. New integrations should target the CLI module patterns: load config profile, construct AIAgent, call run_conversation, persist via SessionDB.

Slash commands (/resume, /branch, /title, /history) consult SessionDB and surface format_session_db_unavailable() with the captured init error when NFS breaks WAL.

Building an Agentic AI Security Operations Center

Hermes is a practical substrate for an Agentic AI SOC: human analysts supervise autonomous triage, investigation, and gated response while the runtime enforces tool boundaries, session lineage, and approval hooks. You are not replacing the SOC; you are adding a governed agent layer that executes runbooks, enriches alerts, and documents every step in SessionDB.

Figure 13 sketches the topology.

Figure 13: Agentic AI SOC on Hermes runtime

Design goals

An agentic SOC must satisfy constraints traditional chatbots ignore:

Least privilege. Tier-1 triage reads alerts and memory; it does not isolate hosts or edit firewall rules.
Evidence lineage. Every enrichment, query, and analyst override links to a session row searchable by case ID.
Human-in-the-loop for mutation. Containment actions pass pre-tool-call approval hooks.
Parallel enrichment. IOC lookups (DNS, WHOIS, sandbox reports) fan out via delegate_task with leaf subagents and merge summaries back to the parent case session.
Runbook fidelity. Skills encode NIST-aligned playbooks (phishing, malware, credential abuse) without stuffing megabytes of prose into the system prompt.

Layer mapping

Map Hermes primitives to SOC roles:

Ingress. SIEM webhooks and EDR alerts hit a gateway HTTP adapter or Slack slash commands. Message guards enforce API keys, IP allowlists, and per-tenant rate limits. Each alert opens or continues a SessionDB row tagged source=webhook with external_id=alert_id.

Triage agent (toolset `soc_triage`). Enable memory, web search, MCP read tools for your SIEM (Splunk search, Elastic query, Sentinel KQL via MCP), and session_search. Disable terminal, browser, and delegation. Stable-tier prompt: classify severity, deduplicate against open cases, recommend investigate vs close. Volatile tier: alert JSON snapshot and day stamp.

Investigation agent (toolset `soc_investigate`). Add delegate_task, read-only terminal (Docker sandbox with no outbound except allowlisted threat-intel APIs), and browser for sandbox portals. Context tier loads AGENTS.md for your environment's log field names. Orchestrator-role subagents spawn leaf workers per IOC cluster; each child gets 50 iterations and no recursive delegation.

Response agent (toolset `soc_respond`). Minimal tools: ticket update, webhook ack, optional integration MCP for ServiceNow/Jira. Pre-tool-call hooks require analyst confirmation in Slack before execute_command or containment MCP calls fire. Tool guardrails halt the turn if the model attempts destructive strings outside an approved playbook step.

Cron hunter. Scheduled jobs in cron/jobs.yaml run threat-hunt prompts against yesterday's logs with skip_context_files and a dedicated profile profiles/threat-hunt/. Results post to a Slack channel; analysts branch sessions with /branch when a hunt hits signal.

Implementation steps

Step 1 — Profiles and toolsets. Create ~/.hermes/profiles/soc-prod/ with separate config.yaml toolset lists:

# soc_triage excerpt — register custom toolsets in toolsets.py
enabled_toolsets:
  - memory
  - web
  - mcp_siem_read
disabled_toolsets:
  - terminal
  - browser
  - delegation

Step 2 — Skills as playbooks. Add ~/.hermes/profiles/soc-prod/skills/phishing-triage/SKILL.md with decision trees: verify SPF/DKIM anomalies, check user-reported volume, escalate if C-level targeted. The stable prompt indexes the skill; the investigate agent calls skill_view when subject lines match patterns.

Step 3 — Gateway wiring. Run the Hermes gateway with platform adapters for Slack and a small FastAPI webhook receiver that normalizes alerts to plain text plus structured metadata in the user message. Map #soc-alerts channel IDs to triage toolsets; map #soc-warroom to investigate toolsets with analyst user ID allowlists in message guards.

Step 4 — Session and case model. Store case_id in session metadata (custom field via session title prefix or memory entry). Use parent_session_id when compression forks a long investigation. Analysts run session_search for prior incidents involving the same host or user.

Step 5 — Delegation for enrichment. Parent investigate agent prompt template:

Delegate parallel leaf tasks: (1) passive DNS for each IP,
(2) VT summary if hash present, (3) user login timeline from SIEM.
Merge into single timeline; cite subagent session IDs.

Each child session persists tool traces separately; parent transcript stays readable for audit.

Step 6 — Approvals and audit. Wire pre-tool-call hooks to post Slack interactive buttons ("Approve host isolation?"). Log hook decisions to SessionDB via tool middleware traces. Export sessions for compliance with existing SIEM retention policies.

Step 7 — Observability. Bind session_id in structured logs; ship to your log stack. Track iteration budget exhaustion and compression events as SOC KPIs (mean time to enrich, cost per case).

Security patterns specific to agentic SOCs

Continue reading this post for free, courtesy of Ken Huang.

Or purchase a paid subscription.