Agents and Tool Use: Models Are More Than Chat

Intuition: give the model a pair of “hands”

Chat models can only talk; agents can also act on the world through tools: checking the weather, running code, querying databases, calling APIs. The core idea is to combine “thinking” and “acting”: the model first reasons about what information it needs, then calls a tool to obtain it, and continues reasoning from the result.

The ReAct framework makes this loop explicit: the model outputs an alternating sequence of “thought → action → observation → thought.” This mirrors how humans solve problems: not by pure contemplation, but by looking things up while reasoning.
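The loop can be sketched in a few lines. This is a minimal illustration, not the ReAct authors' implementation: the `Action: name[arg]` text format, the `weather` tool, and `scripted_model` (a stand-in for a real LLM call) are all assumptions made for the example.

```python
import re
from typing import Callable

# Hypothetical tool registry: the tool name and its canned reply are illustrative.
TOOLS: dict[str, Callable[[str], str]] = {
    "weather": lambda city: f"18C and cloudy in {city}",
}

ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.+)$", re.MULTILINE)


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: asks for a tool first, answers once it has an observation."""
    if "Observation:" not in transcript:
        return "Thought: I need the current weather.\nAction: weather[Paris]"
    return "Thought: I have what I need.\nFinal Answer: It is 18C and cloudy in Paris."


def react_loop(question: str, model: Callable[[str], str], max_steps: int = 5) -> str:
    """Alternate thought/action and observation until the model gives a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = model(transcript)          # thought, plus an action or a final answer
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action is None:
            observation = "no valid action found"
        else:
            name, arg = action.group(1), action.group(2)
            tool = TOOLS.get(name)
            observation = tool(arg) if tool else f"unknown tool '{name}'"
        # Feed the result back so the next thought can build on it.
        transcript += f"Observation: {observation}\n"
    return "stopped: step limit reached"
```

The step limit matters in practice: without it, a model that never emits a final answer would loop forever.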

Engineering view: reliability, state, and safety

In practice, the core challenge of agent systems is reliability:

Tool definition: Describe tool interfaces with JSON Schema or OpenAPI so the model knows each tool’s parameters and purpose. MCP (Model Context Protocol) aims to standardize this layer.
Call parsing: Model output is usually text; you need robust parsing to extract structured tool calls. Function-calling training lets models natively output structured calls, which is more stable than post-hoc parsing.
Error handling: Tools can fail, time out, or return errors. Agents need to retry, fall back, or switch to alternative tools.
State management: Multi-step tasks require maintaining conversation history, intermediate results, and plans. Reflexion and related work let agents self-reflect on failure causes and adjust strategy.
Safety boundaries: When agents can execute code or access external systems, permission control and sandboxing are critical. Prompt injection can let attackers hijack agent behavior, for example through instructions hidden in a web page or tool output that the agent reads.
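The first three items above (tool definition, call parsing, error handling) can be sketched together. This is a rough illustration under stated assumptions: the `get_weather` tool, the `{"name": ..., "arguments": ...}` call format, and the helper names are hypothetical, and the validator checks only required keys and primitive types rather than full JSON Schema.

```python
import json
from typing import Any, Callable

# Hypothetical tool description in JSON Schema style; names are illustrative.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

_JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def parse_call(raw: str, tool: dict) -> dict[str, Any]:
    """Parse model output as a JSON tool call and check it against the tool schema."""
    call = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(call, dict) or call.get("name") != tool["name"]:
        raise ValueError("not a call to the expected tool")
    args = call.get("arguments", {})
    schema = tool["parameters"]
    for key in schema["required"]:
        if key not in args:
            raise ValueError(f"missing argument: {key}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            raise ValueError(f"unknown argument: {key}")
        if not isinstance(value, _JSON_TYPES[spec["type"]]):
            raise ValueError(f"argument {key!r} should be {spec['type']}")
    return args


def call_with_fallback(tools: list[Callable[..., str]], args: dict, retries: int = 2) -> str:
    """Try each tool in order, retrying a failing tool before falling back to the next."""
    errors = []
    for tool in tools:
        for attempt in range(1, retries + 1):
            try:
                return tool(**args)
            except Exception as exc:  # tools may time out or raise arbitrary errors
                name = getattr(tool, "__name__", repr(tool))
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("all tools failed: " + "; ".join(errors))
```

Returning the collected error messages to the model, rather than discarding them, gives it something concrete to reason about when choosing its next step.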

Evaluating agents is harder than evaluating pure text generation: you must measure task completion rate, step efficiency, error recovery, and cost. Benchmarks like SWE-bench measure practical usefulness through real coding tasks: each task is a real GitHub issue, and a solution counts only if the repository's tests pass.
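As a rough illustration, the four measurements named above can be computed from per-task run logs. The `Episode` fields and metric definitions below are assumptions chosen for the example, not a standard benchmark format.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent run on one task; every field is a hypothetical log entry."""
    solved: bool
    steps: int
    tool_errors: int       # tool calls that failed during the run
    recovered_errors: int  # failures the agent worked around
    cost_usd: float


def summarize(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate completion rate, step efficiency, error recovery, and cost."""
    if not episodes:
        raise ValueError("need at least one episode")
    solved = [e for e in episodes if e.solved]
    total_errors = sum(e.tool_errors for e in episodes)
    total_cost = sum(e.cost_usd for e in episodes)
    return {
        "completion_rate": len(solved) / len(episodes),
        "avg_steps_when_solved": (
            sum(e.steps for e in solved) / len(solved) if solved else float("nan")
        ),
        "error_recovery_rate": (
            sum(e.recovered_errors for e in episodes) / total_errors
            if total_errors else float("nan")
        ),
        # Failed runs still cost money, so divide total spend by solved tasks only.
        "cost_per_solved_task": total_cost / len(solved) if solved else float("inf"),
    }
```

Dividing total cost by solved tasks, rather than by all tasks, keeps an agent from looking cheap simply because it gives up early.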

Research view: from single agents to multi-agent and autonomous systems

A limitation of single agents is that one model simultaneously handles planning, execution, memory, and reflection, which is error-prone and hard to scale. Multi-agent systems assign different roles to different instances: some plan, some execute, some verify, collaborating through conversation or shared state.
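The role split can be sketched with plain functions standing in for model instances. Everything here is hypothetical: the `SharedState` fields, the three role functions, and the round-based scheduling are one simple way to illustrate collaboration through shared state, not a description of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SharedState:
    """Blackboard that all roles read from and write to."""
    goal: str
    plan: list[str] = field(default_factory=list)
    results: dict[str, str] = field(default_factory=dict)
    approved: bool = False


Agent = Callable[[SharedState], None]


def planner(state: SharedState) -> None:
    # A real planner would call a model; here the plan is fixed.
    state.plan = [f"research {state.goal}", f"summarize {state.goal}"]


def executor(state: SharedState) -> None:
    # Carry out any plan step that has no result yet.
    for step in state.plan:
        if step not in state.results:
            state.results[step] = f"done: {step}"


def verifier(state: SharedState) -> None:
    # Approve only when a plan exists and every step has a result.
    state.approved = bool(state.plan) and all(s in state.results for s in state.plan)


def run_pipeline(goal: str, agents: list[Agent], max_rounds: int = 3) -> SharedState:
    """Run each role in turn, repeating until the verifier approves or rounds run out."""
    state = SharedState(goal)
    for _ in range(max_rounds):
        for agent in agents:
            agent(state)
        if state.approved:
            break
    return state
```

Because each role touches only its own part of the state, a role can be replaced or tested in isolation, which is much of the practical appeal of the split.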

A deeper question is: what are the boundaries of agent “autonomy”? When a model can decide which tools to call, which files to modify, and which data to access, how do we define and supervise its goals? This is a cross-disciplinary area spanning technology, product, and ethics.

References

ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao et al. (2022)
ReAct interleaves reasoning and acting: the LLM thinks (Thought), executes a tool call (Action), observes the result (Observation), and repeats. It is the prototype for modern AI agent designs and directly influenced frameworks such as LangChain and AutoGPT.
Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick et al. (2023)
Has the model insert API calls into its own training text and keeps only the calls that improve prediction of the following tokens, a self-supervised usefulness filter. A foundational paper for the function-calling and tool-use training paradigm, often cited as a precursor to GPT-4-style function calling.
Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn et al. (2023)
Has the agent write a natural-language "post-mortem" after a failure and injects that reflection into the next attempt's prompt. This "gradient-free self-improvement" approach is widely reused in coding agents such as SWE-agent.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. John Yang et al. (2024)
Proposes the ACI (Agent-Computer Interface) concept, emphasizing that the tools and interface an agent uses matter at least as much as the underlying model. Reports that GPT-4 with a well-designed ACI improves SWE-bench results roughly 6x, establishing an engineering methodology for coding agents.
Model Context Protocol (MCP). Anthropic (2024)
The Model Context Protocol (MCP) is an open standard proposed by Anthropic for how LLM applications communicate with external tools, data sources, and services. Through unified "resources/tools/prompts" interfaces, any MCP-compatible tool can connect to any MCP-compatible LLM application, with the aim of becoming the "USB standard" for AI tool use.