Your Agent's Tool Interface Is the Wrong Abstraction to Fight Over

CLI vs. function calling is a serialization debate. The real engineering is in execution boundaries, contracts, and composability.


Developers are debating whether agents should call tools via JSON-schema function calls or CLI-like interfaces. It's a real discussion — CLIs offer composability, pipes, and a decades-old contract model. JSON-schema gives you structured validation and IDE-friendly introspection. Both sides have merit.

But the debate is happening at the wrong layer.

The tool interface (how an agent formats a request) is the least interesting part of the tool execution problem. The hard parts are where the tool runs, who owns the execution context, and what happens when the tool fails. No interface format answers those questions.

The Interface Is a Serialization Detail

When an LLM "calls a tool," it's generating structured output. Whether that output looks like a CLI invocation (search --query "rust async" --limit 10) or a JSON object ({"query": "rust async", "limit": 10}), the semantics are identical. You're expressing a function name and a set of named arguments.
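To make the equivalence concrete, here is a minimal sketch that reduces both forms to the same normalized call. The `fromCli` and `fromJson` helpers are hypothetical, and the CLI parser only handles `--flag value` pairs; it is an illustration, not a real argument parser.

```typescript
// The shape both syntaxes reduce to: a function name plus named arguments.
type ToolCall = { name: string; args: Record<string, string | number> };

// Parse a CLI-style invocation such as: search --query "rust async" --limit 10
// Hypothetical minimal parser: only understands `--flag value` pairs.
function fromCli(input: string): ToolCall {
  // Split on whitespace, keeping double-quoted strings together.
  const tokens = input.match(/"[^"]*"|\S+/g) ?? [];
  const [name, ...rest] = tokens;
  if (!name) throw new Error("empty invocation");
  const args: Record<string, string | number> = {};
  for (let i = 0; i + 1 < rest.length; i += 2) {
    const key = rest[i].replace(/^--/, "");
    const raw = rest[i + 1].replace(/^"|"$/g, "");
    // Treat purely numeric values as numbers so both forms compare equal.
    args[key] = /^-?\d+(\.\d+)?$/.test(raw) ? Number(raw) : raw;
  }
  return { name, args };
}

// Parse the JSON form; the tool name travels alongside the argument object.
function fromJson(name: string, json: string): ToolCall {
  return { name, args: JSON.parse(json) };
}

const cliCall = fromCli('search --query "rust async" --limit 10');
const jsonCall = fromJson("search", '{"query": "rust async", "limit": 10}');
console.log(JSON.stringify(cliCall) === JSON.stringify(jsonCall)); // prints true
```

Everything downstream of this point sees the same `ToolCall` value, whichever syntax the model emitted.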

The real work starts after the model produces that output:

  1. Routing — Which system receives the call?
  2. Authentication — Does the caller have permission?
  3. Execution — Where does the code actually run?
  4. Error handling — What happens on failure, and who decides whether to retry?
  5. Result delivery — How does the output get back to the model?

None of these are serialization problems. Swapping JSON-schema for CLI syntax changes none of them.
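The five steps above can be sketched as one small dispatcher. Everything here is an assumption made for illustration: the `registry` map, the scope-based permission check, and the `ToolResult` shape are hypothetical, not part of any particular framework.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };
type Caller = { id: string; scopes: string[] };
type ToolResult = { ok: true; value: unknown } | { ok: false; error: string };

type ToolEntry = {
  requiredScope: string; // used for the authentication step
  run: (args: Record<string, unknown>) => unknown; // the execution step
};

// 1. Routing: a registry maps tool names to whatever handles them.
const registry = new Map<string, ToolEntry>([
  [
    "search",
    {
      requiredScope: "search:read",
      run: (args) => [`result for ${String(args.query)}`],
    },
  ],
]);

function dispatch(call: ToolCall, caller: Caller): ToolResult {
  const entry = registry.get(call.name);
  if (!entry) return { ok: false, error: `unknown tool: ${call.name}` };

  // 2. Authentication: does the caller hold the scope this tool requires?
  if (!caller.scopes.includes(entry.requiredScope)) {
    return { ok: false, error: `caller lacks scope ${entry.requiredScope}` };
  }

  try {
    // 3. Execution: the handler runs wherever this dispatcher runs.
    return { ok: true, value: entry.run(call.args) };
  } catch (err) {
    // 4. Error handling: turn the failure into data instead of crashing.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// 5. Result delivery: the ToolResult is what gets serialized back to the model.
```

Note that `dispatch` never inspects how the call was originally written. It only sees a name and arguments.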

Where Tools Run Matters More Than How You Call Them

The real architectural decision is the execution boundary. Consider the options:

Provider-side execution. The AI provider runs tools on their infrastructure. Simple to set up, but your data flows through their systems and you don't control auth, rate limiting, or error recovery.

Server-side execution. Tools run on your backend, inside your security boundary. You control database access and authentication context, and you can apply your existing observability stack.

Client-side execution. Some tools need the browser — geolocation, clipboard, user confirmations. These can't run on a server at all.

Each boundary has different trust models, latency characteristics, and failure modes. A tool that queries your production database has fundamentally different requirements than one that reads the user's clipboard, regardless of whether either is invoked via CLI syntax or JSON.

In Octavus, tools are defined declaratively in the agent protocol and then implemented wherever they need to run:

tools:
  get-user-account:
    description: Look up account information
    parameters:
      userId: { type: string }

  get-browser-location:
    description: Get the user's current location

The protocol says what the tool does and what it accepts. The implementation decides where it runs:

// Server-side: runs in your backend
const session = client.agentSessions.attach(sessionId, {
  tools: {
    'get-user-account': async (args) => {
      return await db.users.findById(args.userId);
    },
    // get-browser-location has no server handler
    // → automatically forwarded to the client
  },
});

// Client-side: runs in the browser
const chat = useOctavusChat({
  transport,
  clientTools: {
    'get-browser-location': async () => {
      // Wrap the callback-based Geolocation API in a typed Promise
      const pos = await new Promise<GeolocationPosition>((resolve, reject) => {
        navigator.geolocation.getCurrentPosition(resolve, reject);
      });
      return { lat: pos.coords.latitude, lng: pos.coords.longitude };
    },
  },
});

The agent doesn't know or care which side handled the call. The contract (name, parameters, description) is the stable surface. The execution is an infrastructure decision.
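One way to picture that fall-through is a resolver that prefers a server handler and forwards everything else. This is a hypothetical sketch of the idea, not how Octavus implements it; `makeResolver` and `forwardToClient` are names invented for the example.

```typescript
type Args = Record<string, unknown>;
type Handler = (args: Args) => Promise<unknown>;
type Resolution = { ranOn: "server" | "client"; value: unknown };

// Build a resolver that runs server handlers locally and forwards the rest.
function makeResolver(
  serverTools: Map<string, Handler>,
  forwardToClient: (toolName: string, args: Args) => Promise<unknown>,
) {
  return async (toolName: string, args: Args): Promise<Resolution> => {
    const handler = serverTools.get(toolName);
    if (handler) {
      return { ranOn: "server", value: await handler(args) };
    }
    // No server handler registered: hand the call to the client side.
    return { ranOn: "client", value: await forwardToClient(toolName, args) };
  };
}

// Stubbed wiring so the sketch runs on its own.
const resolveTool = makeResolver(
  new Map<string, Handler>([
    ["get-user-account", async (args) => ({ id: args.userId, plan: "pro" })],
  ]),
  async (toolName) => `handled in browser: ${toolName}`,
);
```

The caller of `resolveTool` gets a result either way; the `ranOn` field exists only so the example can show which side answered.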

Contracts Over Conventions

The CLI camp's strongest argument isn't syntax — it's the implicit contract model. Unix tools have man pages, --help flags, and decades of convention around stdin/stdout/stderr. You know what grep does because the interface is the documentation.

That's a real insight, but it doesn't require CLI syntax. What it requires is a well-specified contract: what does this tool do, what does it accept, what does it return, and what can go wrong?

In agent systems, that contract should live in the agent's definition — not in the tool implementation:

tools:
  create-support-ticket:
    description: >
      Creates a support ticket in the internal system.
      Use when the user needs to escalate an issue or
      request human assistance. Returns the ticket ID
      and estimated response time.
    parameters:
      summary:
        type: string
        description: Brief description of the issue
      priority:
        type: string
        description: Ticket priority (low, medium, high, urgent)

This is a contract. The LLM knows what the tool does, what arguments it needs, and what to expect back. The implementation can change (swap ticket systems, add validation, change the database) without touching the contract. That's the same property that makes CLIs composable — stable interfaces with swappable implementations.
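A contract is only useful if something enforces it. Here is a rough sketch of checking model-supplied arguments against a declared contract before any implementation runs. The `ToolContract` type, the `allowed` field, and the rule that every declared parameter is required are all assumptions for this example; the YAML above only lists the priority values in prose.

```typescript
type ParamSpec = {
  type: "string" | "number";
  description: string;
  allowed?: string[]; // assumption: an explicit list of permitted values
};
type ToolContract = { description: string; parameters: Record<string, ParamSpec> };

const createSupportTicket: ToolContract = {
  description: "Creates a support ticket in the internal system.",
  parameters: {
    summary: { type: "string", description: "Brief description of the issue" },
    priority: {
      type: "string",
      description: "Ticket priority",
      allowed: ["low", "medium", "high", "urgent"],
    },
  },
};

// Return a list of problems; an empty list means the arguments satisfy the contract.
function validateArgs(contract: ToolContract, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [key, spec] of Object.entries(contract.parameters)) {
    const value = args[key];
    if (value === undefined) {
      problems.push(`missing parameter: ${key}`);
    } else if (typeof value !== spec.type) {
      problems.push(`${key} must be a ${spec.type}`);
    } else if (spec.allowed && !spec.allowed.includes(String(value))) {
      problems.push(`${key} must be one of: ${spec.allowed.join(", ")}`);
    }
  }
  for (const key of Object.keys(args)) {
    if (!Object.prototype.hasOwnProperty.call(contract.parameters, key)) {
      problems.push(`unexpected parameter: ${key}`);
    }
  }
  return problems;
}
```

Because the check depends only on the contract, swapping the ticket system behind it requires no change here.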

The Composability Question

The other argument for CLI-style tool interfaces is composability: piping output from one tool into another, chaining commands, building up complex operations from primitives. cat file.txt | grep "error" | wc -l is elegant because each tool does one thing and the pipe is the universal connector.

Agent systems need composability too, but the pipe metaphor breaks down. Agent tool calls aren't linear pipelines; they branch and join. An agent might call three tools in parallel, use results from two of them to decide on a fourth call, and issue a fifth in response to an error from the first. That's not stdin | stdout. That's orchestration.
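A small sketch shows why a pipe cannot express this shape. The tools below (`fetchProfile`, `fetchOrders`, `fetchWeather`, `computeDiscount`) are hypothetical stubs; the point is the control flow: parallel fan-out, a retry on error, and a later call that depends on two earlier results.

```typescript
// Retry a failing call up to `attempts` times, rethrowing the last error.
async function withRetry<T>(call: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown = new Error("attempts must be at least 1");
  for (let i = 0; i < attempts; i++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Hypothetical tools, stubbed so the sketch runs without a network.
let profileAttempts = 0;
async function fetchProfile(userId: string): Promise<{ tier: string }> {
  profileAttempts += 1;
  if (profileAttempts === 1) throw new Error("timeout"); // first call fails
  return { tier: userId === "u1" ? "gold" : "basic" };
}
async function fetchOrders(userId: string): Promise<number[]> {
  return userId === "u1" ? [120, 80] : [];
}
async function fetchWeather(): Promise<string> {
  return "sunny";
}
async function computeDiscount(tier: string, total: number): Promise<number> {
  return tier === "gold" ? total / 10 : 0;
}

async function orchestrate(userId: string): Promise<number> {
  // Three calls fan out in parallel; the flaky one is retried on error.
  const [profile, orders] = await Promise.all([
    withRetry(() => fetchProfile(userId), 2),
    fetchOrders(userId),
    fetchWeather(),
  ]);
  // The fourth call depends on two of the three results.
  const total = orders.reduce((sum, amount) => sum + amount, 0);
  return computeDiscount(profile.tier, total);
}
```

In a real agent the model decides these branches at run time rather than a hard-coded function, which is exactly why the connector between tools has to carry more than a byte stream.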

This is where workers come in. A worker is an agent that a parent agent can invoke as if it were a tool. Instead of piping CLI commands, you compose agents:

workers:
  research-assistant:
    description: Researching a topic in depth
    display: stream
    tools:
      search: web-search

agent:
  model: anthropic/claude-sonnet-4-5
  system: system
  workers: [research-assistant]
  tools: [web-search, create-report]
  agentic: true

The parent agent can call the research-assistant worker the same way it calls any tool. The worker runs its own steps, uses its own model configuration, makes its own tool calls, and returns a result. That's composability at the right level — not piping text between processes, but delegating structured tasks between agents with typed inputs and outputs.
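In code terms, a worker looks like a function with typed input and output that happens to run its own steps internally. This is a conceptual sketch only; `researchAssistant`, `webSearch`, and the input and output types are hypothetical stand-ins, and a real worker would run a model loop instead of a fixed plan.

```typescript
type ResearchInput = { topic: string };
type ResearchOutput = { topic: string; findings: string[] };

// Stand-in for a real search tool.
async function webSearch(query: string): Promise<string[]> {
  return [`note about ${query}`];
}

// The worker runs its own multi-step plan and makes its own tool calls.
async function researchAssistant(input: ResearchInput): Promise<ResearchOutput> {
  const angles = ["overview", "recent changes"];
  const batches = await Promise.all(
    angles.map((angle) => webSearch(`${input.topic} ${angle}`)),
  );
  // Flatten the per-angle results into one list of findings.
  const findings = ([] as string[]).concat(...batches);
  return { topic: input.topic, findings };
}

// From the parent's point of view, the worker is one more callable entry.
const parentTools = {
  "web-search": webSearch,
  "research-assistant": researchAssistant,
};
```

The parent sees a single structured result from `parentTools["research-assistant"]`, not the individual searches the worker ran to produce it.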

What Actually Deserves Your Attention

If you're building agent systems, here's where the real engineering problems live — none of which change based on your tool calling syntax:

Execution context. Does the tool handler have access to the current user's auth context? Can it read request headers? Octavus supports this by letting you create tool handlers dynamically per request:

export async function POST(request: Request) {
  const user = await validateToken(
    request.headers.get('Authorization')
  );
  // Assumption: the session ID and payload arrive in the JSON request body
  const { sessionId, payload } = await request.json();

  const session = client.agentSessions.attach(sessionId, {
    tools: {
      'get-my-orders': async () => {
        // Tool runs with the requesting user's context
        return await db.orders.findByUser(user.id);
      },
    },
  });

  const events = session.execute(payload);
  return new Response(toSSEStream(events));
}

Error propagation. When a tool fails, the error message goes back to the LLM, which decides how to respond. That's a different error model than a CLI returning a non-zero exit code. The agent might retry, ask the user for different input, or try an alternative approach. Your error messages need to be LLM-legible, not just human-readable.
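As a rough illustration of "LLM-legible", a tool wrapper can return failures as structured data with an explicit next step instead of a bare stack trace. The field names and the timeout-based classification below are assumptions for the sketch, not a standard.

```typescript
type ModelFacingOk = { ok: true; value: unknown };
type ModelFacingError = {
  ok: false;
  error: string;
  retryable: boolean;
  suggestion: string;
};

// Turn any thrown value into something the model can reason about.
function describeFailure(err: unknown): ModelFacingError {
  const message = err instanceof Error ? err.message : String(err);
  // Hypothetical classification: treat timeouts as transient, all else as permanent.
  const retryable = /timeout|temporar/i.test(message);
  return {
    ok: false,
    error: message,
    retryable,
    suggestion: retryable
      ? "Retry the same call once."
      : "Do not retry. Ask the user for different input or try another tool.",
  };
}

// Run a tool handler and always return data, never throw.
function callTool(handler: () => unknown): ModelFacingOk | ModelFacingError {
  try {
    return { ok: true, value: handler() };
  } catch (err) {
    return describeFailure(err);
  }
}
```

An exit code tells a shell script that something failed; the `suggestion` field tells the model what to try next.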

Interactive tools. Some tool calls shouldn't auto-complete — they need the user to confirm, provide input, or make a choice. That's a state machine problem (idle → streaming → awaiting-input → streaming), not an interface format problem.
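That state machine can be written down as a transition table. The three states come from the sequence above; the event names (`send`, `tool-needs-input`, `user-responded`, `finish`) are assumptions chosen for this sketch.

```typescript
type ChatState = "idle" | "streaming" | "awaiting-input";
type ChatEvent = "send" | "tool-needs-input" | "user-responded" | "finish";

// Allowed transitions; anything not listed is rejected.
const transitions: Record<ChatState, Partial<Record<ChatEvent, ChatState>>> = {
  idle: { send: "streaming" },
  streaming: { "tool-needs-input": "awaiting-input", finish: "idle" },
  "awaiting-input": { "user-responded": "streaming" },
};

function step(current: ChatState, incoming: ChatEvent): ChatState {
  const next = transitions[current][incoming];
  if (!next) {
    throw new Error(`invalid transition: ${incoming} while ${current}`);
  }
  return next;
}

// Replay a list of events from the idle state and record each state visited.
function replay(events: ChatEvent[]): ChatState[] {
  const visited: ChatState[] = ["idle"];
  for (const incoming of events) {
    visited.push(step(visited[visited.length - 1], incoming));
  }
  return visited;
}
```

Replaying `send`, `tool-needs-input`, `user-responded` walks through idle, streaming, awaiting-input, and back to streaming, and an out-of-order event such as `user-responded` while idle is rejected.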

Tool visibility. Should the user see that a tool is running? Should they see the arguments? The results? Display modes (hidden, name, description, stream) control the user experience of tool execution. CLI vs. JSON has nothing to say here.
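As a sketch of what a display mode might control, the function below maps each of the four mode names to what the user sees. The mode names come from the text above, but the behavior assigned to each one here is an assumption for illustration, as are the `ToolActivity` fields.

```typescript
type DisplayMode = "hidden" | "name" | "description" | "stream";

type ToolActivity = {
  toolName: string;
  description: string;
  partialOutput: string;
};

// Decide what, if anything, to show the user while a tool runs.
// Assumed semantics: each mode reveals a little more than the previous one.
function renderForUser(mode: DisplayMode, activity: ToolActivity): string | null {
  switch (mode) {
    case "hidden":
      return null; // the user sees nothing
    case "name":
      return `Running ${activity.toolName}`;
    case "description":
      return activity.description;
    case "stream":
      return `${activity.description}: ${activity.partialOutput}`;
  }
}

const sampleActivity: ToolActivity = {
  toolName: "web-search",
  description: "Searching the web",
  partialOutput: "3 results so far",
};
```

The same tool call produces four different user experiences depending on the mode, and none of them depend on how the call was serialized.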

The Pattern That Matters

The CLI vs. function-calling debate recapitulates an older argument: REST vs. RPC. That debate was also nominally about interface format (URLs and verbs vs. method names and arguments), but the actual lesson was about something else: having a uniform interface with clear contracts makes systems easier to compose, debug, and evolve.

The same lesson applies to agent tools. The format of the call doesn't matter much. What matters is:

  • Declare the contract separately from the implementation. The tool's name, description, and parameters should live in the agent definition, not be inferred from code.
  • Let execution boundaries be an infrastructure decision. The same tool contract should be implementable server-side, client-side, or delegated to another agent.
  • Make composition explicit. Workers, structured handoffs, and typed inputs/outputs beat implicit piping for the kind of non-linear execution agents actually do.
  • Own your tool execution. Tools that touch your data should run on your infrastructure, inside your auth boundary, with your observability.

The interface syntax? Use whatever your model provider gives you. It's the least consequential choice in the stack.