First Principles of Building Agentic Software
Strip away the frameworks and the hype — what are the irreducible building blocks of an agent, and what does each one demand of you?
Start With Nothing
Ignore LangChain. Ignore AutoGen. Ignore every framework that ships with a diagram full of boxes and arrows. Start with what an agent actually is at its core: a loop.
observe → reason → act → observe again
That’s it. Everything else — tools, memory, multi-agent coordination, streaming — is what you put inside that loop or how you manage it. If you don’t understand the loop, no framework will save you. If you do understand it, you can build an agent with any model API and a while loop.
The Loop Is the Architecture
The observe-reason-act cycle is not a metaphor. It maps directly to code:
- Observe — assemble the context window: system prompt, conversation history, tool results from the last turn
- Reason — call the model, get a response; it either produces a final answer or a tool call
- Act — if there’s a tool call, execute it, append the result to history; go back to step 1
- Terminate — if the model produces a final answer (no tool call), return it
The loop exits in two ways: the model decides it’s done, or you force it out (max iterations, timeout, error threshold). Both exits need explicit design. Most agent bugs live in the termination logic — the model thinks it’s done when it isn’t, or can’t figure out how to stop.
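What a forced exit can look like, as a minimal sketch; the function and threshold names (max_turns, timeout_s, max_errors) are illustrative, not from any particular SDK:

import time

def should_stop(turn: int, started_at: float, errors: int,
                max_turns: int = 20, timeout_s: float = 120.0, max_errors: int = 3) -> str | None:
    # Return a reason string if the loop must be forced out, else None.
    if turn >= max_turns:
        return "max_turns"
    if time.monotonic() - started_at > timeout_s:
        return "timeout"
    if errors >= max_errors:
        return "error_threshold"
    return None

Whatever the model-decided exit looks like, this check runs every turn so the loop cannot spin forever.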
Tools Are Your API to the World
The model can only affect the world through tools. It cannot run code it wasn’t given a tool for. It cannot read a file unless you gave it a file-reading tool. It cannot know the current time unless you tell it or give it a clock.
This constraint is a gift. It means the agent’s blast radius is exactly the set of tools you provision. Design that set deliberately — like you’d design an API surface. A few principles that hold:
Names are prompts. get_file_contents and read_file invoke different model behavior even if they do the same thing. The tool name, description, and parameter names all go into the context window; the model reasons about what to call based on the language, not just the schema. Name tools for what they accomplish, not how they work.
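As a hypothetical illustration, here is the same underlying function exposed under two schemas. Everything in these dicts is prompt text the model reasons over, and the second steers tool choice far better than the first:

# Two schemas for the same function. The model never sees your code,
# only this text, so the names and descriptions do the work.
opaque = {
    "name": "proc_fs_op",
    "description": "Performs a filesystem operation.",
    "parameters": {"p": {"type": "string"}},
}

descriptive = {
    "name": "read_file",
    "description": "Return the full text contents of the file at the given path.",
    "parameters": {"path": {"type": "string", "description": "Absolute path to the file to read"}},
}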
Narrow is better than broad. A tool that does one thing predictably beats a tool that does five things conditionally. The model will misuse optionality. Give it search_notes and read_note as separate tools rather than a single notes tool with a mode parameter.
Define failure. If a tool can fail — file not found, API timeout, bad input — decide what it returns. An empty string, an error message, an exception? The model will reason about the return value. A confusing failure response leads to confused downstream reasoning.
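To make the last two principles concrete, a sketch of two narrow note tools with explicit failure values; the in-memory note store and the exact wording are illustrative:

# Narrow tools with defined failure, instead of one notes(mode=...) tool
# the model has to disambiguate.
NOTES: dict[str, str] = {}

def search_notes(query: str) -> str:
    """Return matching note titles, one per line."""
    hits = [title for title in NOTES if query.lower() in title.lower()]
    return "\n".join(hits) if hits else "No notes matched that query."

def read_note(title: str) -> str:
    """Return the full text of a note by exact title."""
    if title not in NOTES:
        return f"Error: no note titled {title!r}. Use search_notes to find a valid title."
    return NOTES[title]

Note that both failure cases return plain sentences the model can reason about, not exceptions that kill the loop.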
Context Is the Only Memory
Unless you explicitly build memory in, the model knows exactly what is in the context window — nothing more. This sounds obvious. It isn’t, in practice.
The first time you build an agent that loses track of something it “should have known,” it’s because the information wasn’t in context when the model needed it. Not a model failure — an architecture failure. You didn’t put the right thing in the window at the right time.
Memory, then, is the question of what to include in context and when. Four layers:
- In-context — everything in the current window; immediate, expensive, limited
- Retrieval — pull relevant history or documents at query time (RAG); requires knowing what’s relevant
- External state — a database the agent reads and writes; persists across sessions, requires explicit tool access
- Model weights — what the base model was trained on; free, but fixed and often wrong about your specific domain
Most agents need the first two. The third matters when the agent runs across sessions or multiple instances. The fourth is background knowledge — don’t rely on it for anything you need to be accurate.
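A minimal sketch of the first two layers working together; search_docs is a placeholder for whatever retrieval index you actually use:

def build_context(history: list[dict], user_msg: str, search_docs, k: int = 3) -> list[dict]:
    # Retrieval layer: pull the k most relevant documents for this turn.
    retrieved = search_docs(user_msg, k=k)  # stand-in returning a list of strings
    context_block = "Relevant documents:\n" + "\n---\n".join(retrieved)
    # In-context layer: whatever this returns is everything the model "knows" this turn.
    return history + [
        {"role": "user", "content": context_block},
        {"role": "user", "content": user_msg},
    ]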
The System Prompt Is the Constitution
The system prompt doesn’t tell the model what to do step by step. It sets the constraints the model reasons from. The difference matters.
A system prompt written as a script — “first do X, then do Y, then check Z” — fights against the model’s tendency to reason. You’ll spend your time patching edge cases where the sequence breaks down. A system prompt written as principles — “never write a file without reading it first”, “if search returns nothing, say so rather than guessing”, “always confirm before deleting” — gives the model something to reason from in novel situations.
The other thing the system prompt does: scopes the model’s identity. What does it know about the environment it’s operating in? What’s it optimizing for? What does failure look like? A model with no context about its environment will make up plausible-sounding context. Give it the actual context instead.
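What that can look like, as an illustrative example rather than a recommended prompt (the repository name and tool list are made up):

# A system prompt written as principles plus environment context, not a script.
SYSTEM_PROMPT = """\
You are a coding agent working inside the `billing-service` repository.
You can read files, search the codebase, run tests, and propose patches.

Principles:
- Never write a file without reading it first.
- If a search returns nothing, say so rather than guessing.
- Always confirm with the user before deleting anything.
- Prefer small, verifiable changes over large speculative ones.

You are optimizing for correct, reviewable patches. An honest "I could not
determine this" is better than a plausible-sounding guess.
"""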
Design the Action Space for Reversibility
Agents make mistakes. The model will sometimes call the wrong tool, misinterpret a result, or pursue a subtly wrong interpretation of the goal. This is not a reason to avoid agentic software — it’s a reason to design the action space so that mistakes are recoverable.
A few patterns that follow from this:
Read before write. If the agent might modify something, give it a read tool and instruct it to verify before modifying. Cheap to add, prevents a class of hard-to-recover errors.
Draft before act. For actions with external effects — sending a message, calling a webhook, committing to a database — add an intermediate draft state the agent writes to first. Review the draft, then execute.
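A sketch of draft-before-act for outbound messages; the names are illustrative, and the key property is that the agent-facing tool can only stage a message, never send one:

DRAFTS: list[dict] = []

def draft_message(to: str, body: str) -> str:
    """Agent-facing tool: stage a message instead of sending it."""
    DRAFTS.append({"to": to, "body": body, "approved": False})
    return f"Draft #{len(DRAFTS) - 1} created. It will be sent after review."

def send_approved_drafts(send_fn) -> int:
    """Trusted code path: only reviewed, approved drafts reach the real send function."""
    sent = 0
    for draft in DRAFTS:
        if draft["approved"]:
            send_fn(draft["to"], draft["body"])
            sent += 1
    return sent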
Prefer append over overwrite. An agent that appends to a log is easier to audit and reverse than one that overwrites state. When you can choose the data structure, choose the reversible one.
Hard limits, not soft guidance. If there are things the agent must never do — delete a production record, send to an external address, exceed a spend limit — enforce those in the tool implementation, not just the system prompt. The system prompt can be reasoned around. A hard check in the tool code cannot.
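For example, a spend cap enforced in the tool body rather than the prompt; the tool and the limit are illustrative:

MAX_SPEND_USD = 50.0
_spent = 0.0

def purchase_credits(amount_usd: float) -> str:
    global _spent
    if amount_usd <= 0:
        return "Error: amount must be positive."
    if _spent + amount_usd > MAX_SPEND_USD:
        # The model can argue its way around prompt guidance, but not around this branch.
        return f"Refused: this would exceed the ${MAX_SPEND_USD:.2f} spend limit."
    _spent += amount_usd
    return f"Purchased ${amount_usd:.2f} of credits (total spent: ${_spent:.2f})."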
Observability Is a First-Class Concern
A function call has a traceable stack. An agent has a reasoning chain you can’t inspect directly — you can only observe inputs and outputs at each step. This means you need to log everything the loop touches: each tool call with its arguments and return value, each model response including reasoning (if available), and the state of the context window at each turn.
Without this, debugging is archaeology. You have a final output you don’t understand and no way to reconstruct how the model got there. With it, you can see exactly which tool call returned the bad result that led to the wrong conclusion three steps later.
Structured logging per turn, not just final outputs. Build it in from the start — adding it later to an already-deployed agent is painful.
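One way to do it, sketched as JSON Lines with one record per turn; the field names and file path are illustrative:

import json
import time

def log_turn(turn: int, context_messages: list[dict], response_text: str,
             tool_calls: list[dict], log_path: str = "agent_trace.jsonl") -> None:
    # One JSON line per turn: enough to replay what the model saw and what it did.
    record = {
        "ts": time.time(),
        "turn": turn,
        "context_size": len(context_messages),
        "response": response_text,
        "tool_calls": tool_calls,  # each entry: name, args, and the returned result
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")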
What This Looks Like From Scratch
Strip it to the bones:
class MaxTurnsExceeded(Exception):
    pass

def run_agent(goal: str, tools: list, system_prompt: str, max_turns: int = 20):
    messages = [{"role": "user", "content": goal}]
    tool_map = {t.__name__: t for t in tools}  # tool dispatch is a dict lookup
    for _ in range(max_turns):
        # Observe + reason: the model sees the system prompt plus everything in messages.
        # `model.call` stands in for whichever chat-completions client you use.
        response = model.call(system=system_prompt, messages=messages)
        messages.append({"role": "assistant", "content": response})
        if not response.tool_calls:
            return response.text  # model is done: no tool call means final answer
        # Act: execute each requested tool and feed the result back into history.
        for call in response.tool_calls:
            result = tool_map[call.name](**call.args)
            messages.append({"role": "tool", "content": result, "tool_use_id": call.id})
    raise MaxTurnsExceeded()
No framework. No abstraction. The loop is visible, the termination condition is explicit, the tool dispatch is a dict lookup. Every framework wraps some version of this — understanding this version means you can debug any framework’s behavior, because you know what it’s doing underneath.