Part 3 · Agent Harness Engineering

Building Your First Agent Harness

An e-commerce customer-service chatbot in about a hundred lines of Python: four tools, three hook layers, and the ratchet of what to add only when you watch it fail.

June 18, 20269 minute readAI Engineering

Two posts of theory. Time for code.

In Part 1, we landed on Agent = Model + Harness. In Part 2, we walked through the six pieces — filesystem, tools, context, memory, hooks, sub-agents — with examples from both coding agents and e-commerce chatbots. This post is where we stop reading and build one.

The goal isn't a production-ready harness. It's the smallest e-commerce chatbot that's recognizably an agent — short enough to fit in one file, real enough to fail in interesting ways. A customer can ask about a product, check an order, request a return. The harness enforces policies the model might forget: no refunds above the order value, no unapproved discounts, no competitor mentions reaching the customer.

About a hundred lines of Python. Five of the six pieces from Part 2.

Setup: Python 3.9+ and pip install anthropic. An Anthropic API key in ANTHROPIC_API_KEY. Optionally, a store_policies.md with your return windows and escalation rules. The pattern works with any tool-using model — Anthropic is the example here because the schema is clean.


The shape

Memory — load store_policies.md if it exists. — System prompt — combine agent identity + policies. — Tools — four tools: search products, check inventory, check order status, process return. — Hooks — pre-tool (refund cap), pre-response (competitor scan), back-pressure (on-stop policy check). — The loop — call the model, execute tool calls, run hooks, repeat.

No AgentClass, no framework. The harness is the loop and the things around it.


01 — Memory: load store policies

from pathlib import Path
 
POLICIES = Path("store_policies.md").read_text() if Path("store_policies.md").exists() else ""
 
SYSTEM_PROMPT = f"""You are a customer service agent for Acme Store.
Help customers check orders, find products, and process returns.
Be empathetic but concise. Never recommend competitor products.
 
## Store policies
{POLICIES}"""

Same pattern as a coding agent's AGENTS.md, different content. Keep it under 60 lines. Earn each line from a real customer complaint.

02 — Tools: four, not forty

TOOLS = [
    {"name": "search_products",
     "description": "Search the product catalog by keyword. Returns items with prices and stock.",
     "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}},
    {"name": "check_order_status",
     "description": "Look up an order by ID. Returns details and shipping status.",
     "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}},
    {"name": "check_inventory",
     "description": "Check real-time stock for a product ID and optional size/variant.",
     "input_schema": {"type": "object", "properties": {"product_id": {"type": "string"}, "variant": {"type": "string"}}, "required": ["product_id"]}},
    {"name": "process_return",
     "description": "Initiate a return. Requires order_id, reason, and refund_amount.",
     "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}, "reason": {"type": "string"}, "refund_amount": {"type": "number"}}, "required": ["order_id", "reason", "refund_amount"]}},
]

Four tools. Not forty. Remember Part 2: 10 tools performs well, 107 is total failure. In production, these call your real backend — Shopify, Stripe, your warehouse system.

03 — Hooks: probabilistic vs. deterministic

Three layers: a pre-tool hook that validates before any API fires, a pre-response hook that scans outgoing messages, and back-pressure that bounces the agent back if the response contradicts policy.

COMPETITORS = {"amazon", "walmart", "target", "shein", "temu"}
 
class BlockedAction(Exception): pass
 
# PRE-TOOL: fires before every tool call
def before_tool_call(name, args):
    if name == "process_return":
        if args.get("refund_amount", 0) > 200:
            raise BlockedAction("Refund exceeds $200 limit. Escalate to manager.")
 
# PRE-RESPONSE: fires before the message reaches the customer
def before_response(text):
    lower = text.lower()
    for comp in COMPETITORS:
        if comp in lower:
            raise BlockedAction(f"Response mentions competitor '{comp}'. Rephrase.")
    return text

Telling the model "never recommend competitors" in the prompt is probabilistic. Scanning the response before it ships is deterministic. The model can't talk its way around a hook.

04 — Tool execution (mock backend)

import json
 
CATALOG = {
    "SHOE-001": {"name": "Classic Runner", "price": 89.99, "sizes": ["8","9","10","11"]},
    "BAG-042": {"name": "Canvas Tote", "price": 34.99, "sizes": ["one-size"]},
    "JACK-007": {"name": "Denim Jacket", "price": 129.99, "sizes": ["S","M","L","XL"]},
}
ORDERS = {
    "ORD-4827": {"item": "JACK-007", "size": "M", "total": 129.99, "status": "delivered"},
    "ORD-5103": {"item": "SHOE-001", "size": "10", "total": 89.99, "status": "shipped"},
}
 
def execute_tool(name, args):
    if name == "search_products":
        q = args["query"].lower()
        return json.dumps([{"id": k, **v} for k,v in CATALOG.items() if q in v["name"].lower()] or [{"msg": "No products found."}])
    if name == "check_order_status":
        return json.dumps(ORDERS.get(args["order_id"], {"error": "Order not found."}))
    if name == "check_inventory":
        p = CATALOG.get(args["product_id"])
        if not p: return json.dumps({"error": "Product not found."})
        v = args.get("variant", "")
        return json.dumps({"in_stock": v in p["sizes"] if v else True})
    if name == "process_return":
        o = ORDERS.get(args["order_id"])
        if not o: return json.dumps({"error": "Order not found."})
        return json.dumps({"status": "approved", "label": "RET-" + args["order_id"]})
    return json.dumps({"error": f"Unknown tool: {name}"})

This is a mock backend — dictionaries pretending to be APIs. In production, swap each branch for a real HTTP call. The interface stays the same.

05 — The loop: this is the agent

from anthropic import Anthropic
 
def run_agent(message, max_turns=15):
    client = Anthropic()
    history = [{"role": "user", "content": message}]
 
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=2048,
            system=SYSTEM_PROMPT, tools=TOOLS, messages=history,
        )
        history.append({"role": "assistant", "content": response.content})
 
        if response.stop_reason == "end_turn":
            text = "\n".join(b.text for b in response.content if b.type == "text")
            try:
                return before_response(text)  # pre-response hook
            except BlockedAction as e:
                # back-pressure: bounce the agent back
                history.append({"role": "user", "content":
                    f"[hook] {e} Please rephrase your response."})
                continue
 
        results = []
        for block in response.content:
            if block.type != "tool_use": continue
            try:
                before_tool_call(block.name, block.input)  # pre-tool hook
                output = execute_tool(block.name, block.input)
            except BlockedAction as e:
                output = f"[blocked by hook] {e}"
            results.append({"type": "tool_result", "tool_use_id": block.id, "content": output})
        history.append({"role": "user", "content": results})
 
    return "I need to hand this to a human agent. One moment please."
 
if __name__ == "__main__":
    print(run_agent("I ordered a denim jacket (ORD-4827) and it's too small. Can I return it?"))

Two hook injection points. The pre-tool hook fires before every tool call — if it raises, the error goes back to the model as a tool result. The pre-response hook fires when the model tries to stop — if it raises, the error gets injected and the agent loops again. That's back-pressure: the agent can't declare itself done until the harness agrees.


What happens when you run it

$ python harness.py
 
I found your order ORD-4827 — a Denim Jacket in size M,
delivered last Tuesday. Since it's within our 30-day return
window, I've initiated a return for you.
 
Your return label is RET-ORD-4827. You'll receive a full
refund of $129.99 once we receive the item.
 
Is there anything else I can help with?

Now the interesting part. Watch it fail. Ask it to recommend a product when a competitor is cheaper. Ask it to process a refund for $300. You'll see exactly which harness piece is missing.


The ratchet, applied

What to add next — in the order you'll actually need it. Don't add anything speculatively.

Bot recommends an out-of-stock item — wire check_inventory as mandatory before every product recommendation.

Bot gives a refund above order total — enhance the hook to cross-reference the actual order, not a flat cap.

Bot mentions a competitor you hadn't listed — add it to COMPETITORS. Consider a broader scan: regex for "buy it at," "available on," "cheaper at."

Bot forgets the customer's preferences — add a recitation pattern: write a ticket_state.md after each turn, re-read before every action.

Bot promises something not in policy — strengthen the pre-response hook to cross-check claims against store_policies.md. This is where back-pressure earns its keep.

Voice assistant processes duplicate order — add idempotency check: hash(order_id + action), reject duplicates per session.

Context fills with catalog search results — graduate to sub-agents. A sub-agent searches 50 products, returns the top 3. Parent context stays clean.

Costs spike unexpectedly — instrument per-turn token counts. Two numbers: cache hit rate (target >80%) and escalation rate.

Angry customers get generic responses — add a sentiment hook after each customer message. If triggered, inject "prioritize resolution and empathy."


When to stop hand-rolling

At some point you graduate to Harness-as-a-Service:

The Claude Agent SDK — Anthropic's published harness, the same engine inside Claude Code. Loop, tools, hooks with 25 lifecycle points, context management, sandbox primitives.

The OpenAI Agents SDK — same idea, with their conventions for multi-agent handoffs.

The Codex SDK — narrower scope, ships with patterns from OpenAI's million-line experiment.

Don't graduate too early. Write the hundred-line version first. You need to understand what the SDKs are doing for you.


The model is the engine. The harness is the car.

Three posts ago, we started with two stories — a coding agent and an e-commerce chatbot — and the same lesson: the model was never the bottleneck.

You now have:

  1. The frame — Agent = Model + Harness; failures are configuration problems, not model problems.
  2. The anatomy — six pieces, each solving something the model can't do alone.
  3. A working harness — about a hundred lines, on disk, with real hooks that enforce policy deterministically.

That's enough to start. Point it at your product catalog. Watch where it breaks. Ratchet in the fix. Don't add anything you haven't earned.

Go build something. Watch it break. Fix the harness. Repeat.


The Series: I. Stop Blaming the Model · II. Inside the Harness · III. Building Your First Harness

Sources: Mitchell Hashimoto, Addy Osmani, HumanLayer, OpenAI, Vikas Sah, Bhavishya Pandit, Nader Dabit, and the awesome-harness-engineering list.