How to Build Custom AI Agents with OpenAI’s Responses API: From Single-Turn Chat to Multi-Step Autonomous Workflows

A practical, end-to-end tutorial for developers: environment setup, a basic agent, tool integration, multi-step reasoning, error handling, and deploying to production with Python examples.
- Introduction: What is an AI agent?
- 1. Setup: development environment and libraries
- 2. Building a minimal agent with the Responses API
- 3. Adding tool use: registry, secure execution, and tool results
- 4. Multi-step reasoning and plan-execute-observe loops
- 5. Robustness: validation, retries, and error handling
- 6. Production readiness: deployment, monitoring, and governance
- Appendix: utility code, Dockerfile, and testing suggestions
Introduction: What is an AI agent?
In this tutorial you’ll learn how to build customizable AI agents using OpenAI’s Responses API. In the context of this guide, an “agent” is a software component that:
- Accepts user input (a question, task, or request)
- Plans how to complete the task, possibly breaking it into steps
- Invokes external tools (search, a calculator, internal APIs, or custom code) when needed
- Observes the results of tools and iterates until it completes the task
- Returns a structured and user-friendly answer
This guide uses Python examples and assumes you have a valid OpenAI API key and a reasonable familiarity with Python development. Where appropriate, you’ll see full working snippets and suggestions for production hardening.
1. Setup: development environment and libraries
Before building an agent, prepare a development environment. The minimal requirements:
- Python 3.10+ (3.11 recommended)
- A virtual environment (venv, pipenv, or poetry)
- An OpenAI API key set as an environment variable (OPENAI_API_KEY)
- Basic libraries: openai (or the latest official OpenAI Python SDK), requests, and optionally pydantic for input/output validation
Create a project and virtual environment
# Create a project directory and a venv
python -m venv .venv
source .venv/bin/activate # macOS / Linux
.venv\Scripts\activate # Windows
# Upgrade pip and install dependencies
pip install --upgrade pip
pip install openai requests pydantic
Set your API key in your shell environment (never hard-code it in source). On macOS/Linux:
export OPENAI_API_KEY="sk-..."
On Windows (PowerShell):
$env:OPENAI_API_KEY="sk-..."
Note: the OpenAI Python SDK changes between releases. This guide assumes a recent version of the openai package, which exposes an OpenAI client class. Import and instantiate it as follows:
from openai import OpenAI
client = OpenAI() # will read OPENAI_API_KEY from the environment
Be careful with secrets and credentials. Use environment variables or a secrets manager (HashiCorp Vault, AWS Secrets Manager, etc.) in production.
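As a minimal sketch of the advice above, the helpers below read the key from the environment, fail fast with a clear message when it is missing, and mask it for logging. The names `load_api_key` and `mask_secret` are hypothetical and not part of the OpenAI SDK.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error.

    Illustrative helper only; it is not part of the OpenAI SDK.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it or load it from your secrets manager"
        )
    return key


def mask_secret(secret: str, visible: int = 4) -> str:
    """Mask a secret for logging, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling `load_api_key()` once at startup surfaces a missing key immediately instead of at the first API request.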
2. Building a minimal agent with the Responses API
We’ll start by creating a minimal agent that receives a user query and returns a direct answer using the Responses API. The agent will be modular: it will have an Agent class responsible for orchestration and a simple “LLM” wrapper for calls to the Responses API.
Minimal LLM wrapper
This wrapper centralizes API calls (you can add logging, retries, or telemetry later).
from openai import OpenAI


class LLM:
    def __init__(self, client=None, model="gpt-4o-mini"):
        self.client = client or OpenAI()
        self.model = model

    def generate(self, prompt, max_tokens=512, temperature=0.2):
        resp = self.client.responses.create(
            model=self.model,
            input=prompt,
            max_output_tokens=max_tokens,
            temperature=temperature,
        )
        # Recent SDK versions expose `output_text`, which joins all text segments
        output_text = getattr(resp, "output_text", None)
        if output_text:
            return output_text
        # Fallback: walk the `output` items and concatenate text segments ourselves
        parts = []
        for item in getattr(resp, "output", None) or []:
            for c in getattr(item, "content", None) or []:
                # Content parts are SDK objects, so use attribute access, not c["type"]
                if getattr(c, "type", None) == "output_text":
                    parts.append(c.text)
        return "".join(parts)
Simple agent orchestration
The agent will accept a user prompt, call the LLM, and return the result. This is intentionally minimal and suitable for simple Q&A and small tasks.
class SimpleAgent:
    def __init__(self, llm: LLM):
        self.llm = llm

    def handle(self, user_input: str) -> str:
        prompt = f"You are an assistant. Answer concisely and clearly:\n\nUser: {user_input}\nAssistant:"
        return self.llm.generate(prompt)


# Usage
if __name__ == "__main__":
    client = OpenAI()
    llm = LLM(client=client)
    agent = SimpleAgent(llm)
    print(agent.handle("What's a good plan to learn web development in 3 months?"))
That gets you started: a request goes to the Responses API and a text answer returns. Next, we’ll allow the agent to “use tools” — functions that perform actions like web search, database queries, or calculators.
3. Adding tool use
Agents become much more powerful when they can call external tools to retrieve facts, perform calculations, or connect to internal systems. We’ll implement:
- A tool registry to register and describe tools
- A simple protocol for the agent to request a tool call
- Secure execution and result feeding back to the LLM
There are multiple patterns to enable “tool use”. Some frameworks offer built-in tool/function calling. Here, to stay framework-agnostic and explicit, we’ll use a “structured JSON plan” approach: ask the LLM to return a JSON plan describing steps, where each step can ask the agent to call a named tool with arguments.
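To make the "structured JSON plan" idea concrete, here is a small sketch of what such a plan might look like and how it could be parsed into typed steps. The `PlanStep` dataclass, the `parse_plan` helper, and the sample values are illustrative assumptions, not part of any SDK.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Example of the kind of plan the model is asked to return (made-up values).
EXAMPLE_PLAN = """
{
  "plan": [
    {"action": "call_tool", "tool": "calculator", "args": {"expr": "123 * 456"}},
    {"action": "finish", "note": "123 * 456 = 56088"}
  ]
}
"""


@dataclass
class PlanStep:
    action: str
    tool: str = ""
    args: Dict[str, Any] = field(default_factory=dict)
    note: str = ""


def parse_plan(raw: str) -> List[PlanStep]:
    """Parse a JSON plan string into typed steps; raises on malformed input."""
    data = json.loads(raw)
    steps = []
    for item in data["plan"]:
        steps.append(
            PlanStep(
                action=item["action"],
                tool=item.get("tool", ""),
                args=item.get("args", {}),
                note=item.get("note", ""),
            )
        )
    return steps


if __name__ == "__main__":
    for step in parse_plan(EXAMPLE_PLAN):
        print(step)
```

Typed steps make it harder for a misspelled key to slip through silently, since a missing "action" raises a `KeyError` at parse time rather than at execution time.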
Tool registry
from typing import Callable, Dict, Any
import json


class Tool:
    def __init__(self, name: str, description: str, func: Callable[..., Any]):
        self.name = name
        self.description = description
        self.func = func


class ToolRegistry:
    def __init__(self):
        self.tools: Dict[str, Tool] = {}

    def register(self, tool: Tool):
        if tool.name in self.tools:
            raise ValueError(f"Tool {tool.name} already registered")
        self.tools[tool.name] = tool

    def call(self, name: str, args: dict):
        if name not in self.tools:
            raise ValueError(f"Unknown tool: {name}")
        return self.tools[name].func(**args)

    def describe(self):
        # Return a plain description for the LLM prompt
        return [
            {"name": t.name, "description": t.description} for t in self.tools.values()
        ]
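Because a registry like this passes model-supplied arguments straight into `func(**args)`, it can help to check that the argument names match the function signature before calling. The sketch below uses the standard library's `inspect` module; `check_args` and the toy `add` tool are hypothetical names used only for illustration.

```python
import inspect
from typing import Any, Callable, Dict


def check_args(func: Callable[..., Any], args: Dict[str, Any]) -> None:
    """Raise ValueError if `args` cannot be bound to `func`'s parameters."""
    try:
        inspect.signature(func).bind(**args)
    except TypeError as exc:
        raise ValueError(f"Invalid arguments for {func.__name__}: {exc}") from exc


def add(a: float, b: float) -> float:
    """Toy tool used only for this illustration."""
    return a + b


if __name__ == "__main__":
    check_args(add, {"a": 1, "b": 2})      # passes silently
    try:
        check_args(add, {"a": 1, "c": 2})  # unexpected name "c", missing "b"
    except ValueError as err:
        print(err)
```

This only checks names and arity, not types or value ranges, so it complements rather than replaces per-tool validation.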
Example tools: web_search and calculator
For demonstration, we’ll add a stubbed web search function and a small calculator. Scraping Bing or Google is not recommended in production; replace the stub with a call to an official search API.
import ast
import operator as op


def web_search_stub(query: str, top_k: int = 3):
    """
    Stubbed web search. In production, call a search API.
    Returns a list of dicts like: [{"title": "...", "snippet": "...", "url": "..."}]
    """
    # Placeholder: return a deterministic fake result for testing
    return [
        {"title": f"Result {i+1} for {query}", "snippet": f"Snippet {i+1}", "url": f"https://example.com/{i+1}"}
        for i in range(top_k)
    ]


# Only these AST operator nodes are allowed; anything else is rejected
ALLOWED_OPERATORS = {
    ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv,
    ast.Pow: op.pow, ast.USub: op.neg,
}


def calculator(expr: str):
    # Small arithmetic evaluator that walks the AST; never eval() untrusted input
    def eval_(node):
        # ast.Constant replaces the deprecated ast.Num (removed in Python 3.12)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_OPERATORS:
            return ALLOWED_OPERATORS[type(node.op)](eval_(node.left), eval_(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in ALLOWED_OPERATORS:
            return ALLOWED_OPERATORS[type(node.op)](eval_(node.operand))
        raise ValueError("Unsupported expression")

    try:
        node = ast.parse(expr, mode="eval").body
        return eval_(node)
    except Exception as e:
        raise ValueError(f"Could not evaluate expression safely: {e}") from e
Registering tools and prompting the model for plans
We’ll ask the model to produce a JSON plan. The model is given a list of available tools and a schema for the plan. The agent parses the plan, runs the tools, and returns the result. This approach gives you full control over how tools are executed (sandboxing, validation, logging).
import json
import re

tool_registry = ToolRegistry()
tool_registry.register(Tool(name="web_search", description="Search the web and return top results", func=web_search_stub))
tool_registry.register(Tool(name="calculator", description="Evaluate basic arithmetic expressions", func=calculator))


class ToolUsingAgent:
    def __init__(self, llm: LLM, tools: ToolRegistry):
        self.llm = llm
        self.tools = tools

    def _build_tool_prompt(self, user_input: str) -> str:
        tools_list = json.dumps(self.tools.describe(), indent=2)
        prompt = f"""
You are an assistant that can plan actions and call tools. Available tools:
{tools_list}
Respond with a JSON object describing your plan. Schema:
{{
  "plan": [
    {{
      "action": "think" | "call_tool" | "finish",
      "thought": "short internal reasoning (optional)",
      "tool": "tool_name (if action is call_tool)",
      "args": {{ ... }} (if action is call_tool),
      "note": "optional textual note for the user (optional)"
    }}
  ]
}}
User request: {user_input}
Remember:
- If you need to fetch facts, use "call_tool" with the web_search tool.
- If you need to compute, use "call_tool" with the calculator tool.
- Conclude with an action "finish" that summarizes the result for the user.
"""
        return prompt

    def handle(self, user_input: str):
        prompt = self._build_tool_prompt(user_input)
        raw = self.llm.generate(prompt, max_tokens=800, temperature=0.0)
        # The model should return JSON. Attempt to find and parse the JSON in the output.
        try:
            plan_json = json.loads(raw)
        except json.JSONDecodeError:
            # Try to extract a JSON substring (e.g. when the model wraps it in prose)
            m = re.search(r"\{.*\}", raw, flags=re.S)
            if not m:
                raise ValueError("LLM did not return JSON plan:\n" + raw)
            plan_json = json.loads(m.group(0))
        # Execute the plan sequentially.
        # For simplicity we assume the original plan contained all steps.
        observations = []
        final_user_message = None
        for step in plan_json.get("plan", []):
            action = step.get("action")
            if action == "think":
                # Optional internal step; log the thought for debugging
                print("LLM thought:", step.get("thought"))
            elif action == "call_tool":
                tool_name = step.get("tool")
                args = step.get("args", {})
                result = self.tools.call(tool_name, args)
                observations.append({"tool": tool_name, "args": args, "result": result})
                print(f"Tool {tool_name} returned:", result)
            elif action == "finish":
                final_user_message = step.get("note") or "Task complete."
                if observations:
                    # The plan was written before any tool ran, so ask the LLM to
                    # write the final answer from the actual observations.
                    followup_prompt = (
                        f"User request: {user_input}\n"
                        f"Observations from tool calls:\n{json.dumps(observations, indent=2)}\n\n"
                        "Write the final answer for the user based on these observations."
                    )
                    final_user_message = self.llm.generate(followup_prompt, max_tokens=400, temperature=0.0)
            else:
                raise ValueError(f"Unknown action: {action}")
        if final_user_message is None:
            final_user_message = "No final summary provided by agent."
        return final_user_message


# Example usage
if __name__ == "__main__":
    client = OpenAI()
    llm = LLM(client=client)
    agent = ToolUsingAgent(llm, tool_registry)
    print(agent.handle("What is the population of Japan and what is 123 * 456?"))
This plan-execute loop is explicit, auditable, and simple to reason about. The LLM produces a plan; your code validates and executes it. This pattern keeps the execution environment safe because you control which tools are callable and how arguments are validated.
For production use, you might prefer “function-calling” style if available in the SDK, which allows the model to return a structured function call that your client can map to a tool. The manual JSON-plan approach is intentionally portable across SDKs and versions.
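For comparison, the function-calling style might look roughly like the sketch below. The tool definition and the `function_call` / `function_call_output` item shapes reflect my understanding of the Responses API and should be checked against the current API reference. `dispatch_function_call` is a hypothetical helper, and the example runs against a hand-written item rather than a live API response.

```python
import json
from typing import Any, Callable, Dict

# Function tool definition in the shape the Responses API is understood to accept;
# verify field names against the current API reference before relying on it.
CALCULATOR_TOOL = {
    "type": "function",
    "name": "calculator",
    "description": "Evaluate basic arithmetic expressions",
    "parameters": {
        "type": "object",
        "properties": {"expr": {"type": "string"}},
        "required": ["expr"],
        "additionalProperties": False,
    },
}


def dispatch_function_call(
    item: Dict[str, Any], handlers: Dict[str, Callable[..., Any]]
) -> Dict[str, Any]:
    """Run the handler named in a function_call item and build the reply item."""
    if item.get("type") != "function_call":
        raise ValueError("Not a function_call item")
    handler = handlers.get(item["name"])
    if handler is None:
        raise ValueError(f"Unknown tool: {item['name']}")
    arguments = json.loads(item["arguments"])  # arguments arrive as a JSON string
    result = handler(**arguments)
    return {
        "type": "function_call_output",
        "call_id": item["call_id"],
        "output": json.dumps(result),
    }


if __name__ == "__main__":
    # Hand-written stand-in for an item the model might return (no API call).
    fake_item = {
        "type": "function_call",
        "call_id": "call_demo",
        "name": "calculator",
        "arguments": '{"expr": "2 + 3"}',
    }
    # Stand-in handler; a real one would wrap the calculator tool from this guide.
    handlers = {"calculator": lambda expr: {"expr": expr, "value": 5}}
    print(dispatch_function_call(fake_item, handlers))
```

In either style the key property is the same: the model only names a tool, and your code decides whether and how to run it.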
4. Multi-step reasoning and plan-execute-observe loops
Many tasks require multiple steps: gather facts, analyze, compute, and produce a final answer. The plan-execute-observe loop (also called “recurrent planning”) lets an agent interleave LLM planning with tool execution until a stopping criterion is met.
Design pattern
- Prompt LLM to produce a plan (one or more actions)
- Validate the plan format and tool arguments
- Execute the first actionable step(s)
- Collect observations from tools and append them to a context
- Re-prompt the LLM with the updated context to get the next plan
- Repeat until the LLM signals completion or a max step limit is reached
Avoid exposing the model’s chain-of-thought or internal deliberations to end users. You can keep “thoughts” in the plan for debugging but exclude them from the final output.
Example: multi-step agent loop
import json
import re


class MultiStepAgent:
    def __init__(self, llm: LLM, tools: ToolRegistry, max_steps: int = 6):
        self.llm = llm
        self.tools = tools
        self.max_steps = max_steps

    def run(self, user_input: str):
        step = 0
        context = {"user_input": user_input, "observations": []}
        while step < self.max_steps:
            prompt = self._build_prompt(context)
            raw = self.llm.generate(prompt, max_tokens=800, temperature=0.0)
            plan = self._parse_plan(raw)
            if not plan:
                raise RuntimeError("No plan returned from LLM")
            # Process first actionable item
            next_action = plan[0]
            action = next_action.get("action")
            if action == "call_tool":
                tool_name = next_action.get("tool")
                args = next_action.get("args", {})
                # Validate args (ensure types, size limits etc.)
                obs = self._safe_call(tool_name, args)
                context["observations"].append({"tool": tool_name, "args": args, "result": obs})
                step += 1
                continue
            elif action == "finish":
                return next_action.get("note") or "Done."
            elif action == "think":
                # Treat as internal; keep for debug logs only
                context.setdefault("internal_thoughts", []).append(next_action.get("thought"))
                step += 1
                continue
            else:
                raise RuntimeError(f"Unknown action from plan: {action}")
        raise RuntimeError("Max steps reached without finishing")

    def _build_prompt(self, context: dict):
        tools_list = json.dumps(self.tools.describe(), indent=2)
        observations_text = json.dumps(context["observations"], indent=2)
        return f"""
You are an assistant that outputs a plan for action in JSON form. Tools: {tools_list}
User request: {context['user_input']}
Past observations: {observations_text}
Return a JSON object with a "plan" array, where each plan item is one of:
- {{ "action": "call_tool", "tool": "tool_name", "args": {{...}} }}
- {{ "action": "think", "thought": "internal note" }}
- {{ "action": "finish", "note": "final answer for the user" }}
"""

    def _parse_plan(self, raw: str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Attempt to extract a JSON substring
            m = re.search(r"\{.*\}", raw, flags=re.S)
            if not m:
                return []
            try:
                parsed = json.loads(m.group(0))
            except json.JSONDecodeError:
                return []
        # The parser expects an object with a "plan" key, matching the prompt
        return parsed.get("plan", []) if isinstance(parsed, dict) else []

    def _safe_call(self, tool_name: str, args: dict):
        # Validate arguments size/type
        if tool_name not in self.tools.tools:
            raise ValueError("Attempt to call unknown tool")
        # Example validation: limit string length
        for k, v in args.items():
            if isinstance(v, str) and len(v) > 2000:
                raise ValueError("Argument too large")
        # Run tool and return observation
        return self.tools.call(tool_name, args)


# Example usage
if __name__ == "__main__":
    client = OpenAI()
    llm = LLM(client=client)
    agent = MultiStepAgent(llm=llm, tools=tool_registry)
    print(agent.run("Find three recent news articles about electric vehicles and summarize their points."))
Note: The agent re-prompts the LLM at each step using the growing observation log. This is robust because you can validate each tool call and decide whether to feed results back.
Limit token growth: each observation you add to the prompt increases token usage and cost. Use summarization, vector stores, or truncated histories to keep contexts manageable.
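One simple way to truncate history is to keep only the most recent observations that fit a size budget. The sketch below uses character count of the JSON encoding as a rough stand-in for tokens, which is an assumption; `trim_observations` is a hypothetical helper, not part of the agent above.

```python
import json
from typing import Any, Dict, List


def trim_observations(
    observations: List[Dict[str, Any]], max_chars: int = 4000
) -> List[Dict[str, Any]]:
    """Keep the most recent observations whose JSON encoding fits in max_chars."""
    kept: List[Dict[str, Any]] = []
    used = 0
    for obs in reversed(observations):  # walk from newest to oldest
        size = len(json.dumps(obs))
        if used + size > max_chars:
            break
        kept.append(obs)
        used += size
    kept.reverse()  # restore chronological order
    return kept
```

A prompt builder could call `trim_observations(context["observations"])` before serializing the history. Dropping old observations loses information, so summarizing them first may be preferable for long tasks.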
5. Robustness: validation, retries, and error handling
Real-world systems require robust error handling for network failures, rate limits, partial tool failures, and unexpected model outputs. This section outlines practical patterns for production-grade reliability.
API call retries and exponential backoff
Use exponential backoff for transient errors (HTTP 429 or 5xx). Below is an example decorator that performs retry with jitter:
import time
import random
from functools import wraps


def retry_on_exception(max_attempts=5, base_delay=0.5, backoff=2.0, jitter=0.1, retry_on=(Exception,)):
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            attempt = 0
            while True:
                try:
                    return f(*args, **kwargs)
                except retry_on:
                    attempt += 1
                    if attempt >= max_attempts:
                        raise
                    # Exponential delay with +/- jitter to avoid synchronized retries
                    delay = base_delay * (backoff ** (attempt - 1))
                    delay = delay * (1 + random.uniform(-jitter, jitter))
                    time.sleep(delay)
        return wrapper
    return decorator


# Example usage wrapping the LLM generate call
class LLM:
    # ... (previous code)
    @retry_on_exception(max_attempts=4, base_delay=0.5, backoff=2.0)
    def generate(self, prompt, max_tokens=512, temperature=0.2):
        resp = self.client.responses.create(
            model=self.model,
            input=prompt,
            max_output_tokens=max_tokens,
            temperature=temperature
        )
        # (parsing as before)
        ...
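To see what exponential backoff means in numbers, the sketch below computes the pre-jitter delay schedule for a given configuration and shows one way to limit retries to transient HTTP statuses. Both `backoff_delays` and `is_transient` are hypothetical helpers written for this illustration.

```python
from typing import List


def backoff_delays(
    max_attempts: int = 4, base_delay: float = 0.5, backoff: float = 2.0
) -> List[float]:
    """Delays in seconds (before jitter) slept between attempts.

    With max_attempts tries there are max_attempts - 1 sleeps.
    """
    return [base_delay * (backoff ** n) for n in range(max_attempts - 1)]


def is_transient(status_code: int) -> bool:
    """Treat HTTP 429 and 5xx as retryable, everything else as permanent."""
    return status_code == 429 or 500 <= status_code < 600


if __name__ == "__main__":
    print(backoff_delays())  # [0.5, 1.0, 2.0]
```

Retrying on every exception type also retries programming errors and invalid requests, so narrowing the retried exceptions to transient failures is usually worthwhile.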
Validation of model outputs
LLMs sometimes return malformed JSON, hallucinated tool names, or unexpected types. Validate everything before executing.
def validate_plan(plan: list, allowed_tools: list) -> None:
    if not isinstance(plan, list):
        raise ValueError("Plan should be a list")
    for i, step in enumerate(plan):
        if not isinstance(step, dict) or "action" not in step:
            raise ValueError(f"Step {i} missing action")
        action = step["action"]
        if action == "call_tool":
            if "tool" not in step or "args" not in step:
                raise ValueError(f"Step {i} missing tool or args")
            if step["tool"] not in allowed_tools:
                raise ValueError(f"Step {i} requests unknown tool: {step['tool']}")
        elif action == "finish":
            if "note" in step and len(step["note"]) > 10000:
                raise ValueError("Finish note too long")
        elif action == "think":
            continue
        else:
            raise ValueError(f"Step {i} invalid action: {action}")
Handling tool errors
Tools might fail (network errors, timeouts, exceptions). Decide for each tool whether failure should abort, retry, or be skipped. Log failures and provide meaningful diagnostics to the user.
import time


def safe_execute_tool(registry: ToolRegistry, tool_name: str, args: dict, max_retries=2):
    attempts = 0
    while attempts <= max_retries:
        try:
            return registry.call(tool_name, args)
        except Exception as e:
            attempts += 1
            # Log
            print(f"Tool {tool_name} failed on attempt {attempts}: {e}")
            if attempts > max_retries:
                # Decide: propagate a structured error to the agent
                return {"error": str(e)}
            # Linear backoff between retries
            time.sleep(0.5 * attempts)
Dealing with hallucinations and factuality
LLM responses can be plausible but incorrect. To reduce hallucinations:
- Prefer retrieving facts via trusted tools (official APIs, databases).
- Ask the model to cite sources and verify with tool calls (e.g., web_search).
- Use temperature=0 for deterministic outputs when generating structured plans.
- Validate facts against a ground-truth source where possible.
Timeouts and circuit breakers
Implement timeouts on network calls and a circuit breaker to stop repeatedly calling a failing downstream service.
import threading


def call_with_timeout(func, args=(), kwargs=None, timeout=10):
    result = {}
    kwargs = kwargs or {}

    def target():
        try:
            result['value'] = func(*args, **kwargs)
        except Exception as e:
            result['error'] = e

    # daemon=True so a stuck worker thread cannot block interpreter exit
    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The thread is not killed; it keeps running in the background
        raise TimeoutError("Operation timed out")
    if 'error' in result:
        raise result['error']
    return result.get('value')
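The timeout helper covers only half of this subsection, so here is a minimal circuit breaker sketch to go with it. After a number of consecutive failures it rejects calls for a cool-down period, then lets one trial call through. The class names, thresholds, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("Circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure will reopen the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A typical arrangement would keep one breaker per downstream service and wrap each tool call as `breaker.call(tool_func, **args)`. This sketch is not thread-safe; concurrent use would need a lock around the state updates.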
Logging, tracing, and observability
Maintain structured logs for:
- Prompts sent to the LLM (avoid logging PII in plain text; consider redaction)
- Tool calls and results
- Plan steps and validation outcomes
- Errors, stack traces, and retry attempts
For privacy and security, never log unredacted sensitive user data in plaintext. Use tokenization, pseudonymization, or redaction before storing logs that include user content.
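As a rough illustration of redaction before logging, the sketch below replaces e-mail addresses and API-key-like strings with placeholders and emits one JSON line per event. The regular expressions are deliberately simple assumptions (the key pattern assumes an "sk-" prefix) and will not catch every kind of sensitive data. `redact` and `log_record` are hypothetical helpers.

```python
import json
import re
from typing import Any

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Assumed key shape for illustration: "sk-" followed by at least 8 key characters.
API_KEY_RE = re.compile(r"sk-[A-Za-z0-9_-]{8,}")


def redact(text: str) -> str:
    """Replace e-mail addresses and API-key-like strings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return API_KEY_RE.sub("[API_KEY]", text)


def log_record(event: str, **fields: Any) -> str:
    """Build one JSON log line with every string field redacted."""
    record = {"event": event}
    for key, value in fields.items():
        record[key] = redact(value) if isinstance(value, str) else value
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(log_record("prompt_sent", prompt="Email jane@example.com the report"))
```

Pattern-based redaction is a baseline only; names, addresses, and free-form identifiers generally need a dedicated PII detection step.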
6. Production readiness: deployment, monitoring, and governance
Transitioning from a demo to production requires additional considerations: secure secrets management, scalability, rate-limiting, testing, monitoring, and compliance.
Packaging and containerization
Containerize your agent for reproducible deployments. A minimal Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies first so this layer is cached between code changes
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
ENV PYTHONUNBUFFERED=1
CMD ["python", "server.py"]
If your agent is exposed via HTTP, supply a small web server (FastAPI or Flask) that receives requests and forwards them to your agent instance. An example FastAPI app skeleton:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class RequestInput(BaseModel):
    user_input: str


class ResponseOutput(BaseModel):
    answer: str


# Assume agent is created globally
# agent = MultiStepAgent(...)

@app.post("/api/agent", response_model=ResponseOutput)
def run_agent(payload: RequestInput):
    try:
        result = agent.run(payload.user_input)
        return {"answer": result}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
Scaling and concurrency
LLM calls are often the slowest part. To scale:
- Use a job queue (Celery, RabbitMQ, or managed queues) for long-running tasks
- Run multiple worker instances and autoscale based on queue depth
- Cache frequent queries and tool results (use Redis or a CDN)
- Use rate limiting to avoid overloading the LLM or downstream services
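The caching point can be sketched with a tiny in-process cache whose entries expire after a fixed time. `TTLCache` is a hypothetical class for illustration; with several worker processes you would more likely use a shared store such as Redis, as the list above suggests.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-process cache; entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]      # fresh hit
        value = compute()        # miss or expired: recompute and store
        self._store[key] = (now, value)
        return value
```

For example, a search tool could be wrapped as `cache.get_or_compute(("web_search", query), lambda: web_search(query))`, where `web_search` stands for whatever search function you use. The cache never evicts expired keys on its own, so a long-running process would also need a size bound.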
Authentication and authorization
Expose your agent behind an authenticated API. Use OAuth, JWTs, API gateway tokens, or a managed identity provider. For internal tools, enforce least privilege for the agent’s service account.
Secrets and key management
As noted in the setup section, keep API keys and other credentials in environment variables or a secrets manager (HashiCorp Vault, AWS Secrets Manager, etc.) rather than in source code.
Testing
Cover the agent with tests at several levels:
- Unit tests for tool functions and argument validation
- Integration tests mocking the LLM responses (do not call the live API in CI)
- End-to-end tests in a staging environment using limited API keys
- Automated behavioral tests to ensure the agent doesn’t violate policies
Monitoring and SLOs
Track metrics:
- Request rate, latency per request, and 95th/99th percentile latencies
- LLM token usage and approximate cost per request
- Error rates per tool and LLM call
- Observability traces to correlate LLM prompts, tool calls, and responses
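For the percentile latencies in the list above, a nearest-rank calculation over recorded samples is one simple option. The `percentile` helper below is an illustrative sketch; a metrics system would normally compute this from histograms instead of raw samples.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("No samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [120, 95, 310, 150, 2200, 130, 140, 160, 125, 135]
    print("p95:", percentile(latencies_ms, 95))
    print("p99:", percentile(latencies_ms, 99))
```

Tail percentiles matter for agents because a single slow tool call or retry can dominate the latency of an otherwise fast request.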
Cost controls
LLMs can be expensive. Implement:
- Rate limiting per user and per service
- Cost-aware routing: prefer cheaper models for low-risk queries
- Token budgeting and summarization strategies to limit prompt sizes
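Per-user rate limiting can be sketched with a token bucket: each user gets a burst allowance that refills at a steady rate. `RateLimiter` is a hypothetical in-memory class for illustration; a multi-process deployment would need shared state.

```python
import time
from typing import Callable, Dict, Tuple


class RateLimiter:
    """Token bucket per user: `capacity` burst, refilled at `rate` tokens per second."""

    def __init__(
        self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic
    ):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # user -> (tokens, last_seen)

    def allow(self, user_id: str, cost: float = 1.0) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(user_id, (self.capacity, now))
        # Refill according to elapsed time, capped at the bucket capacity
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[user_id] = (tokens, now)
        return allowed
```

The `cost` argument lets expensive requests, such as those routed to a larger model, consume more of a user's budget than cheap ones.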
Safety, content filtering, and governance
Ensure your agent complies with policies and legal requirements:
- Apply content filters for disallowed content
- Use policy checks for privacy-sensitive requests
- Log policy-relevant decisions and maintain an audit trail
If your agent can take irreversible actions, require multi-step confirmations and human approvals for high-risk operations.
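An approval gate for high-risk tools might look like the following sketch. The `ApprovalRequired` exception, the `guarded_call` function, and the idea of a `high_risk` set of tool names are assumptions for illustration; how approval is actually collected (ticket, chat prompt, UI) is left open.

```python
from typing import Any, Callable, Dict, Set


class ApprovalRequired(Exception):
    """Raised when a high-risk tool is called without human approval."""


def guarded_call(
    tool_name: str,
    func: Callable[..., Any],
    args: Dict[str, Any],
    high_risk: Set[str],
    approved: bool = False,
) -> Any:
    """Run `func` unless the tool is high-risk and nobody has approved it."""
    if tool_name in high_risk and not approved:
        raise ApprovalRequired(f"{tool_name} needs human approval: {args}")
    return func(**args)


if __name__ == "__main__":
    def delete_record(record_id: int) -> str:
        """Toy irreversible action used only for this illustration."""
        return f"deleted {record_id}"

    try:
        guarded_call("delete_record", delete_record, {"record_id": 7}, {"delete_record"})
    except ApprovalRequired as err:
        print("Blocked:", err)
    print(guarded_call("delete_record", delete_record, {"record_id": 7},
                       {"delete_record"}, approved=True))
```

Raising an exception rather than silently skipping the call lets the surrounding loop pause the task and surface the pending action to a reviewer.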
Deployment checklist
Before going live, verify:
- Secrets are stored securely
- Rate limits and throttles are configured
- Prompts and logs do not leak PII
- Monitoring and alerts are in place
- Automated tests and a rollback plan exist
Appendix: utility code, Dockerfile, testing suggestions
This appendix collects practical snippets and suggestions you can copy into a repo.
Complete, minimal agent server using FastAPI
# server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI

# Assumption: the classes and tools from this guide live in your own module.
# The module name "agent" is hypothetical; adjust it to your project layout.
from agent import LLM, Tool, ToolRegistry, MultiStepAgent, web_search_stub, calculator

app = FastAPI()
client = OpenAI()
llm = LLM(client=client)
tool_registry = ToolRegistry()
tool_registry.register(Tool("web_search", "Search the web", web_search_stub))
tool_registry.register(Tool("calculator", "Compute arithmetic", calculator))
agent = MultiStepAgent(llm=llm, tools=tool_registry)


class RequestPayload(BaseModel):
    user_input: str


class ResponsePayload(BaseModel):
    answer: str


@app.post("/api/agent", response_model=ResponsePayload)
def run(payload: RequestPayload):
    try:
        ans = agent.run(payload.user_input)
        return {"answer": ans}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
Dockerfile (example)
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
ENV PYTHONUNBUFFERED=1
EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
Testing tips
- Mock LLM responses in unit tests by replacing LLM.generate with a deterministic stub.
- Record cassettes of tool calls (VCR-like) to simulate external services.
- Use property-based tests to exercise input validation logic.
- Run end-to-end flows in staging with low-privilege API keys and strict logging/monitoring enabled.
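The first tip can be sketched with the standard library's `unittest.mock`. The `SimpleAgent` class below is a stand-in that mirrors the minimal agent from section 2 so the example runs on its own; in a real test you would import your own agent instead.

```python
from unittest.mock import Mock


class SimpleAgent:
    """Stand-in mirroring the minimal agent from section 2."""

    def __init__(self, llm):
        self.llm = llm

    def handle(self, user_input: str) -> str:
        prompt = f"Answer concisely:\n\nUser: {user_input}\nAssistant:"
        return self.llm.generate(prompt)


def test_agent_uses_llm_output():
    fake_llm = Mock()
    fake_llm.generate.return_value = "stubbed answer"  # no network call
    agent = SimpleAgent(fake_llm)

    assert agent.handle("hello") == "stubbed answer"
    # The user's text must appear in the prompt that reached the LLM.
    sent_prompt = fake_llm.generate.call_args.args[0]
    assert "hello" in sent_prompt
    fake_llm.generate.assert_called_once()


if __name__ == "__main__":
    test_agent_uses_llm_output()
    print("ok")
```

Because the agent receives its LLM through the constructor, the same pattern works for scripted multi-step runs: set `fake_llm.generate.side_effect` to a list of canned plan strings.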
Security considerations
- Use input sanitization to prevent injection attacks in tools that execute commands or queries.
- Avoid running arbitrary code returned by the model. Always map actions to pre-defined, validated functions.
- Implement privilege checks for tool calls that can access sensitive data.
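For tools that run database queries, the input-sanitization point usually comes down to bound parameters instead of string formatting. The sketch below uses the standard library's `sqlite3` with an in-memory database; the `users` table and the `find_users` function are made up for illustration.

```python
import sqlite3
from typing import List, Tuple


def find_users(conn: sqlite3.Connection, name: str) -> List[Tuple[int, str]]:
    """Look up users by exact name using a bound parameter, never string formatting."""
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    print(find_users(conn, "alice"))
    # A classic injection string is treated as a literal name and matches nothing.
    print(find_users(conn, "x' OR '1'='1"))
```

The same principle applies to shell commands: pass arguments as a list to a fixed executable rather than interpolating model output into a command string.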
If you want to adopt more advanced agent orchestration frameworks or existing agent libraries, evaluate them for security and policy compliance before integrating them into production.
Closing notes
Building custom AI agents with OpenAI’s Responses API involves two key responsibilities:
- Designing clear and verifiable interactions between the LLM and your code (plans, tool calls, validations)
- Engineering robust, secure infrastructure around the LLM calls (retries, monitoring, secrets, testing)
The examples in this guide are intentionally explicit and framework-agnostic. They show a path toward production agents:
- Start simple: a minimal agent which uses the Responses API
- Introduce tools with a registry and explicit JSON plans
- Use a step loop with validation at each step
- Harden with retries, validation, logging, and secure deployment practices
