portlang

An Environment-First Agent Framework
Made by Port of Context
Technical Manifesto & Design Document (Draft 0.3, March 2026)
github.com/portofcontext/portlang

I. The Problem

Most agent frameworks are loop managers. Call the model, parse output, dispatch tool calls, append results, repeat. This works for short tasks. For anything longer (a multi-file refactor, a complex data pipeline, an overnight job), it breaks down in predictable ways.

Context fills with noise across steps. Practitioners call it context rot: as history accumulates, earlier instructions lose influence and the agent drifts. A second failure mode is shortcut-seeking: the agent finds paths that satisfy the surface objective without satisfying the real one. A common example: an agent asked to fix a bug modifies the test file to make the tests pass. Nothing in the framework stopped it.

The workarounds tell the story. The Ralph loop, a shell loop that restarts the agent to periodically flush its accumulated context, went viral because it addresses a real problem. Manually pruning conversation history between steps serves the same purpose. Both exist because loop-manager frameworks accumulate context indefinitely, and accumulated context rots.

When these runs fail, there's no structured record of what happened. You're left guessing whether the problem was the prompt, the tools, the model, or something else.

The root cause is architectural. These frameworks treat agent behavior as the product of instructions. Better prompt, better behavior. But prompts are suggestions, not guarantees. Frontier models follow roughly 150–200 instructions reliably. Past that point, a large system prompt, many tool definitions, and accumulated context history crowd each other out. For long-running tasks, "suggest better" doesn't scale.

portlang starts from a different premise: agent behavior is search through a space you define. Your job is engineering the space.

III. Design Principles

Principle 1: The environment is the product

The model is roughly the same for everyone. The thing under your control is the environment you give it: what it can see, what it can do, what counts as success, and what's physically impossible. portlang makes environment design the primary activity. The agent is a parameter you pass in.

Principle 2: Define convergence, not instructions

"Do X, then Y, then Z" is fragile when the trajectory is non-deterministic. Instead, declare what success looks like (verifiers), and what the agent can and cannot do (boundary). The agent finds its own path. You define the space it searches in.

Principle 3: Smaller context windows work better

The context window is the model's entire reality. Context rot degrades behavior even when useful information is still present, because attention spreads thin across a longer window. Past a certain size, more context means worse results. portlang treats context as a hard resource with a ceiling. When exhausted, the run ends. Compression is always lossy and the framework cannot know which information matters, so it does none.

Principle 4: Boundaries are enforced topology, not suggestions

Telling an agent "don't modify the tests" through a prompt is advice. Setting allow_write = ["src/*.py"] in a sandboxed container makes it structurally impossible. These are categorically different guarantees. RL-trained models find any accessible path to reward; the framework must make unauthorized paths inaccessible, not just discouraged.
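A minimal boundary sketch. Every key below appears in the full example later in this document; the glob value is illustrative.

[boundary]
allow_write = ["src/**/*.py"]  # the sandbox rejects writes anywhere else
network = "deny"               # no outbound requests, whatever the prompt says
max_tokens = 40000             # hard ceiling: the run ends when it is exhausted
max_steps = 20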

Principle 5: Feedback is runtime signal

Test results, linter output, and shell checks that enter the context window at each step are the inference-time equivalent of training reward. They steer the search in real time. Absent feedback, the agent wanders and corrects late. Verifiers can trigger after every tool call, not just at the end.
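For instance, a step-level check keeps the loop tight. The lint command below is an illustrative choice; the keys and trigger value are the ones used in the full example later in this document.

# Runs at every step; on failure, its output enters the context window as feedback
[[verifier]]
name = "lint-clean"
command = "ruff check src/"
trigger = "always"
description = "Lint errors surface immediately, not only at the end of the run"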

Principle 6: Every run is a dataset

Agent behavior is non-deterministic: the same field can produce different trajectories on different runs. Trajectories are structured event logs: what was called, when, at what token cost, with what outcome. They're queryable, replayable, and the input to improvement. The framework accumulates this data; you decide what to change.

IV. The Primitives

Every concept in portlang reduces to one of six things.

Field

The top-level unit. A field is a self-contained unit of work: model, goal, tools, constraints, and success criteria in one file. Like a function: declared inputs and outputs, isolated execution. Fields are the unit of composition: a pipeline is a sequence of fields, each with a fresh context window.
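The pipeline syntax is not specified in this document, so the fragment below is only a hypothetical illustration of the composition model: each stage is an ordinary field, and each stage begins with a fresh context window.

# Hypothetical pipeline file; the real syntax may differ
[[stage]]
field = "analyze.field"     # runs with a fresh context window

[[stage]]
field = "refactor.field"    # also fresh; it inherits the workspace, not the previous stage's history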

Environment

The territory the agent operates in. A filesystem root (mounted into an isolated container), a set of installed packages, network policy, and any MCP servers or custom tools. The agent cannot observe or act on anything outside the environment.

Verifier

A shell command (or schema check) that defines success. Pass = exit 0; fail = nonzero. On failure, the output enters the context window as feedback. Verifiers can trigger on stop, after every step, or after specific tool calls. Multiple verifiers together define the full convergence criterion.

Boundary

Hard limits enforced by the container sandbox: which paths can be written (glob patterns), network allow/deny, maximum tokens, maximum cost, maximum steps. These are not prompt-level instructions. Violations are rejected at runtime. The agent cannot exceed them by trying harder.

Trajectory

The complete structured event log of a run: every step, tool call, verifier result, token cost, and duration. Stored as JSON. Replayable interactively with portlang view trajectory. The raw material for convergence measurement and reflection.

Reflect

An analysis pass that reads a trajectory and produces prioritized recommendations for improving the field definition: fewer steps, lower cost, better reliability. Reflect is itself a portlang field with custom tools for trajectory access. Run automatically after each run with --auto-reflect, or on any past trajectory with portlang reflect <id>.

V. A Field in Practice

A field is a .field file, a TOML configuration that fully specifies a task. Here's a real example: finding and fixing a bug without touching the test suite.

name = "fix-jwt-validation"

[model]
name = "anthropic/claude-sonnet-4.6"
temperature = 0.2

[prompt]
goal = """
Fix the JWT expiration validation bug in auth.py.
The exp claim is being compared as a string instead of an integer.
"""
# Inject current diff before every step — agent always knows what changed
re_observation = ["git diff --stat"]

[environment]
root = "./workspace"

[boundary]
allow_write = ["auth.py"]  # sandbox-enforced: cannot write anywhere else, including the tests
network = "deny"
max_tokens = 40000
max_steps = 20
max_cost = "$0.50"

# Run tests after every step so regressions surface immediately
[[verifier]]
name = "tests-pass"
command = "pytest tests/test_auth.py -q"
trigger = "always"
description = "Tests must pass after every write"

# Confirm scope at the end: nothing outside auth.py should be touched
[[verifier]]
name = "scope-guard"
command = "git diff --name-only | grep -qvE '^auth\\.py$' && exit 1 || exit 0"
trigger = "on_stop"
description = "Only auth.py may be modified; the test suite stays untouched"

What this field defines: what the agent can see, what it can write, when it gets feedback, and what "done" means. What it does not define: which files to read first, how to find the bug, what order to make changes. The agent decides that. You defined the space.

The re_observation field deserves attention. The diff is injected before every step as a live refresh. Where the Ralph loop works by restarting the session to flush accumulated context, re_observation keeps the session running and pushes only what matters right now. The agent always has the current workspace state, and stale observations do not accumulate.

The two verifiers together close an obvious escape hatch: tests pass, and the test files themselves weren't modified to make them pass. Neither check alone is sufficient.

Structural checks

portlang check field.field validates a field before running it. It catches things like writable paths with no verifier coverage (mutations with no feedback signal), verifiers that reference files outside the environment, and re-observation budgets that would consume most of the token ceiling before the agent does any work. It runs fast, costs nothing, and prevents a class of silent failures.
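As an illustration of the first class of findings, the fragment below grants write access to a path that no verifier ever examines; portlang check would flag docs/ as a writable path with no feedback signal (paths and commands are illustrative).

[boundary]
allow_write = ["src/app.py", "docs/**"]  # docs/ is writable...

[[verifier]]
name = "tests-pass"
command = "pytest -q"                    # ...but no verifier ever looks at what lands in docs/
trigger = "on_stop"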

VI. Observability & Reflection

Agent systems are non-deterministic. The same field can succeed on one run and fail on the next. Traditional debugging (add a print statement, rerun) doesn't transfer. You need to reason about what actually happened, step by step, and across many runs.

portlang view trajectory <id>             # step through any run interactively
portlang converge field.field -n 20       # run 20 times, report convergence rate
portlang run field.field --auto-reflect   # reflect immediately after each run
portlang reflect <trajectory-id>          # analyze any past run

Trajectories

portlang view trajectory loads a run and lets you step through it interactively. At each step: what action was taken, what tool was called, what the response was, how many tokens were consumed, which verifiers ran and whether they passed. The context window at any step can be reconstructed. This makes visible the phenomena that cause drift: noisy tool output consuming thousands of tokens, a re-observation returning unexpected data, an ambiguous verifier result misinterpreted.

Convergence measurement

portlang converge -n N runs a field N times and reports the convergence rate: what fraction of runs pass all verifiers. For non-deterministic systems, this is the reliability metric that matters. "Did it work this time" is the wrong question. 60% convergence is not a production number, even if the best runs look great. Tighten boundaries, simplify re-observations, clarify verifiers, then run again. The convergence rate tells you whether the change helped.

Reflection

portlang reflect runs an analysis agent that reads a trajectory and produces prioritized recommendations grounded in the specific steps your agent took, not generic advice.

Here's an example from a real run. The agent was given a calculator tool with a Python function named execute. The goal said "use the calculator." The agent searched for "calculator," got nothing, guessed the exact tool name, and proceeded. Reflect caught it:

HIGH: The agent called ToolSearch twice before finding the calculator tool. Adding 'calculator', 'math', 'arithmetic' to the tool description eliminates both steps. Steps 1–2 are pure tool-hunting waste.
HIGH: Step 7 consumes 48,529 tokens on a narration step with no useful output. Add output_schema to enforce a concise structured stop message.
MED: Add a verifier with trigger = "on_tool:write" to catch errors immediately rather than only at stop.

The fix to the first finding was one word: renaming the Python function from execute to calculate. portlang auto-extracts the function name as the tool name, so the rename propagated automatically. This is what environment-first means in practice: the agent's behavior is a function of the environment you define, and reflect shows you which knobs to turn.

Reflect is itself a portlang field, with custom tools for loading trajectory data and submitting structured analysis. See reflect.field.

VII. What This Does Not Solve

Proxy mismatch is inherent

Verifiers are reward proxies. "All tests pass" and "the code is correct" are not the same thing. The framework makes this gap explicit and auditable: you can see exactly which verifiers ran and what they checked. It cannot close the gap. Reviewing agent output remains necessary.

Policy opacity remains

The trained model is a black box. When two signals in the context window conflict, the model resolves them according to its training, not according to your intent, and not transparently. The framework gives you every lever except the one inside the model.

Tasks without clear verifiers are harder

The framework excels when success is concrete: tests pass, output matches a schema, a file contains specific content. For creative work, open-ended research, or judgment calls, verifiers are unavoidably weak. Weak verifiers mean weak runtime feedback. Weak feedback means poorly guided search.

Context management is lossy

The framework imposes a hard token ceiling. When exhausted, the run ends. Developers who want summarization or pruning can build it into their field. The framework will not do it silently. Any compression loses information, and the framework cannot know which information matters.

• • •

The purpose of this document is to establish a foundation: a framework that takes seriously what we know about how these systems work and builds engineering discipline around it.