First Principles of Building Agentic Software
Strip away the frameworks and the hype — what are the irreducible building blocks of an agent, and what does each one demand of you?
Start With Nothing
Ignore LangChain. Ignore AutoGen. Ignore every framework that ships with a diagram full of boxes and arrows. Start with what an agent actually is at its core: a loop.
observe → reason → act → observe again
That’s it. Everything else — tools, memory, multi-agent coordination, streaming — is what you put inside that loop or how you manage it. If you don’t understand the loop, no framework will save you. If you do understand it, you can build an agent with any model API and a while loop.
The Loop Is the Architecture
The observe-reason-act cycle is not a metaphor. It maps directly to code:
- Observe — assemble the context window: system prompt, conversation history, tool results from the last turn
- Reason — call the model, get a response; it either produces a final answer or a tool call
- Act — if there’s a tool call, execute it, append the result to history; go back to step 1
- Terminate — if the model produces a final answer (no tool call), return it
The loop exits in two ways: the model decides it’s done, or you force it out (max iterations, timeout, error threshold). Both exits need explicit design. Most agent bugs live in the termination logic — the model thinks it’s done when it isn’t, or can’t figure out how to stop.
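What a forced exit can look like, as a minimal sketch; the function and threshold names (max_turns, timeout_s, max_errors) are illustrative, not from any particular SDK:

import time

def should_stop(turn: int, started_at: float, errors: int,
                max_turns: int = 20, timeout_s: float = 120.0, max_errors: int = 3) -> str | None:
    # Return a reason string if the loop must be forced out, else None.
    if turn >= max_turns:
        return "max_turns"
    if time.monotonic() - started_at > timeout_s:
        return "timeout"
    if errors >= max_errors:
        return "error_threshold"
    return None

Whatever the model-decided exit looks like, this check runs every turn so the loop cannot spin forever.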
Tools Are Your API to the World
The model can only affect the world through tools. It cannot run code it wasn’t given a tool for. It cannot read a file unless you gave it a file-reading tool. It cannot know the current time unless you tell it or give it a clock.
This constraint is a gift. It means the agent’s blast radius is exactly the set of tools you provision. Design that set deliberately — like you’d design an API surface. A few principles that hold:
Names are prompts. get_file_contents and read_file invoke different model behavior even if they do the same thing. The tool name, description, and parameter names all go into the context window; the model reasons about what to call based on the language, not just the schema. Name tools for what they accomplish, not how they work.
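As a hypothetical illustration, here is the same underlying function exposed under two schemas. Everything in these dicts is prompt text the model reasons over, and the second steers tool choice far better than the first:

# Two schemas for the same function. The model never sees your code,
# only this text, so the names and descriptions do the work.
opaque = {
    "name": "proc_fs_op",
    "description": "Performs a filesystem operation.",
    "parameters": {"p": {"type": "string"}},
}

descriptive = {
    "name": "read_file",
    "description": "Return the full text contents of the file at the given path.",
    "parameters": {"path": {"type": "string", "description": "Absolute path to the file to read"}},
}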
Narrow is better than broad. A tool that does one thing predictably beats a tool that does five things conditionally. The model will misuse optionality. Give it search_notes and read_note as separate tools rather than a single notes tool with a mode parameter.
Define failure. If a tool can fail — file not found, API timeout, bad input — decide what it returns. An empty string, an error message, an exception? The model will reason about the return value. A confusing failure response leads to confused downstream reasoning.
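To make the last two principles concrete, a sketch of two narrow note tools with explicit failure values; the in-memory note store and the exact wording are illustrative:

# Narrow tools with defined failure, instead of one notes(mode=...) tool
# the model has to disambiguate.
NOTES: dict[str, str] = {}

def search_notes(query: str) -> str:
    """Return matching note titles, one per line."""
    hits = [title for title in NOTES if query.lower() in title.lower()]
    return "\n".join(hits) if hits else "No notes matched that query."

def read_note(title: str) -> str:
    """Return the full text of a note by exact title."""
    if title not in NOTES:
        return f"Error: no note titled {title!r}. Use search_notes to find a valid title."
    return NOTES[title]

Note that both failure cases return plain sentences the model can reason about, not exceptions that kill the loop.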
Context Is the Only Memory
Unless you explicitly build memory in, the model knows exactly what is in the context window — nothing more. This sounds obvious. It isn’t, in practice.
The first time you build an agent that loses track of something it “should have known,” it’s because the information wasn’t in context when the model needed it. Not a model failure — an architecture failure. You didn’t put the right thing in the window at the right time.
Memory, then, is the question of what to include in context and when. Four layers:
- In-context — everything in the current window; immediate, expensive, limited
- Retrieval — pull relevant history or documents at query time (RAG); requires knowing what’s relevant
- External state — a database the agent reads and writes; persists across sessions, requires explicit tool access
- Model weights — what the base model was trained on; free, but fixed and often wrong about your specific domain
Most agents need the first two. The third matters when the agent runs across sessions or multiple instances. The fourth is background knowledge — don’t rely on it for anything you need to be accurate.
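A minimal sketch of the first two layers working together; search_docs is a placeholder for whatever retrieval index you actually use:

def build_context(history: list[dict], user_msg: str, search_docs, k: int = 3) -> list[dict]:
    # Retrieval layer: pull the k most relevant documents for this turn.
    retrieved = search_docs(user_msg, k=k)  # stand-in returning a list of strings
    context_block = "Relevant documents:\n" + "\n---\n".join(retrieved)
    # In-context layer: whatever this returns is everything the model "knows" this turn.
    return history + [
        {"role": "user", "content": context_block},
        {"role": "user", "content": user_msg},
    ]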
The System Prompt Is the Constitution
The system prompt doesn’t tell the model what to do step by step. It sets the constraints the model reasons from. The difference matters.
A system prompt written as a script — “first do X, then do Y, then check Z” — fights against the model’s tendency to reason. You’ll spend your time patching edge cases where the sequence breaks down. A system prompt written as principles — “never write a file without reading it first”, “if search returns nothing, say so rather than guessing”, “always confirm before deleting” — gives the model something to reason from in novel situations.
The other thing the system prompt does: scopes the model’s identity. What does it know about the environment it’s operating in? What’s it optimizing for? What does failure look like? A model with no context about its environment will make up plausible-sounding context. Give it the actual context instead.
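What that can look like, as an illustrative example rather than a recommended prompt (the repository name and tool list are made up):

# A system prompt written as principles plus environment context, not a script.
SYSTEM_PROMPT = """\
You are a coding agent working inside the `billing-service` repository.
You can read files, search the codebase, run tests, and propose patches.

Principles:
- Never write a file without reading it first.
- If a search returns nothing, say so rather than guessing.
- Always confirm with the user before deleting anything.
- Prefer small, verifiable changes over large speculative ones.

You are optimizing for correct, reviewable patches. An honest "I could not
determine this" is better than a plausible-sounding guess.
"""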
Design the Action Space for Reversibility
Agents make mistakes. The model will sometimes call the wrong tool, misinterpret a result, or pursue a subtly wrong interpretation of the goal. This is not a reason to avoid agentic software — it’s a reason to design the action space so that mistakes are recoverable.
A few patterns that follow from this:
Read before write. If the agent might modify something, give it a read tool and instruct it to verify before modifying. Cheap to add, prevents a class of hard-to-recover errors.
Draft before act. For actions with external effects — sending a message, calling a webhook, committing to a database — add an intermediate draft state the agent writes to first. Review the draft, then execute.
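A sketch of draft-before-act for outbound messages; the names are illustrative, and the key property is that the agent-facing tool can only stage a message, never send one:

DRAFTS: list[dict] = []

def draft_message(to: str, body: str) -> str:
    """Agent-facing tool: stage a message instead of sending it."""
    DRAFTS.append({"to": to, "body": body, "approved": False})
    return f"Draft #{len(DRAFTS) - 1} created. It will be sent after review."

def send_approved_drafts(send_fn) -> int:
    """Trusted code path: only reviewed, approved drafts reach the real send function."""
    sent = 0
    for draft in DRAFTS:
        if draft["approved"]:
            send_fn(draft["to"], draft["body"])
            sent += 1
    return sent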
Prefer append over overwrite. An agent that appends to a log is easier to audit and reverse than one that overwrites state. When you can choose the data structure, choose the reversible one.
Hard limits, not soft guidance. If there are things the agent must never do — delete a production record, send to an external address, exceed a spend limit — enforce those in the tool implementation, not just the system prompt. The system prompt can be reasoned around. A hard check in the tool code cannot.
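For example, a spend cap enforced in the tool body rather than the prompt; the tool and the limit are illustrative:

MAX_SPEND_USD = 50.0
_spent = 0.0

def purchase_credits(amount_usd: float) -> str:
    global _spent
    if amount_usd <= 0:
        return "Error: amount must be positive."
    if _spent + amount_usd > MAX_SPEND_USD:
        # The model can argue its way around prompt guidance, but not around this branch.
        return f"Refused: this would exceed the ${MAX_SPEND_USD:.2f} spend limit."
    _spent += amount_usd
    return f"Purchased ${amount_usd:.2f} of credits (total spent: ${_spent:.2f})."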
Observability Is a First-Class Concern
A function call has a traceable stack. An agent has a reasoning chain you can’t inspect directly — you can only observe inputs and outputs at each step. This means you need to log everything the loop touches: each tool call with its arguments and return value, each model response including reasoning (if available), and the state of the context window at each turn.
Without this, debugging is archaeology. You have a final output you don’t understand and no way to reconstruct how the model got there. With it, you can see exactly which tool call returned the bad result that led to the wrong conclusion three steps later.
Structured logging per turn, not just final outputs. Build it in from the start — adding it later to an already-deployed agent is painful.
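One way to do it, sketched as JSON Lines with one record per turn; the field names and file path are illustrative:

import json
import time

def log_turn(turn: int, context_messages: list[dict], response_text: str,
             tool_calls: list[dict], log_path: str = "agent_trace.jsonl") -> None:
    # One JSON line per turn: enough to replay what the model saw and what it did.
    record = {
        "ts": time.time(),
        "turn": turn,
        "context_size": len(context_messages),
        "response": response_text,
        "tool_calls": tool_calls,  # each entry: name, args, and the returned result
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")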
What This Looks Like From Scratch
Strip it to the bones:
class MaxTurnsExceeded(Exception):
    pass

def run_agent(goal: str, tools: list, system_prompt: str, max_turns: int = 20):
    messages = [{"role": "user", "content": goal}]
    tool_map = {t.__name__: t for t in tools}  # tool dispatch is a dict lookup
    for _ in range(max_turns):
        # Observe + reason: the model sees the system prompt plus everything in messages.
        # `model.call` stands in for whichever chat-completions client you use.
        response = model.call(system=system_prompt, messages=messages)
        messages.append({"role": "assistant", "content": response})
        if not response.tool_calls:
            return response.text  # model is done: no tool call means final answer
        # Act: execute each requested tool and feed the result back into history.
        for call in response.tool_calls:
            result = tool_map[call.name](**call.args)
            messages.append({"role": "tool", "content": result, "tool_use_id": call.id})
    raise MaxTurnsExceeded()
No framework. No abstraction. The loop is visible, the termination condition is explicit, the tool dispatch is a dict lookup. Every framework wraps some version of this — understanding this version means you can debug any framework’s behavior, because you know what it’s doing underneath.