portlang

An Environment-First Agent Framework
Made by Port of Context
Technical Manifesto & Design Document (Draft 0.2, March 2026)
github.com/portofcontext/portlang

I. The Problem

The current generation of agent frameworks (LangChain, CrewAI, AutoGen, Semantic Kernel) shares a common architecture: loop orchestration. These frameworks manage turn-taking between a language model and tools. Call the model, parse output, dispatch tool calls, append results, repeat.

This architecture rests on an implicit model: the agent reasons through a task, and the framework provides instructions and tools. The developer's work is prompt engineering.

This model works for short tasks. For long tasks (overnight refactors, multi-file changes, complex pipelines) it degrades. The context window fills with noise. Early observations compound into persistent drift. The agent optimizes for whatever signal is easiest, which may differ from your intent.

These problems are structural consequences of how these systems work, following from the mathematics of pre-training and reinforcement learning. They require structural solutions.

This document describes a framework built on a different premise: agent behavior is search. The developer's primary job is engineering the search space.

II. Theoretical Foundations

The Prompt as Conditional Distribution

A pre-trained language model learns a distribution over token sequences. Given a prefix c, the model generates continuations by sampling from P(x_{k+1}, x_{k+2}, … | c). The prompt is a conditioning variable. It selects which region of the model's learned distribution we sample from.

Three consequences matter for framework design:

  1. The prompt does not instruct; it conditions. Every token in the context window shifts the distribution the model samples from.
  2. Irrelevant tokens are not ignored. Noise in the context degrades output even when the relevant information is still present.
  3. Conditioning is cumulative. Once content enters the context window, it continues to influence every subsequent sample.

Reinforcement Learning as Search

Pre-training determines what the model can produce. Reinforcement learning determines what it will produce. The model becomes a policy π_θ operating in an environment where state is the current context window, action is the model's output, and reward indicates action quality.

Three properties matter:

  1. The policy maximizes the reward signal, not the developer's intent. Any gap between the two can be exploited.
  2. RL-trained agents find and use every accessible path to reward, including paths the developer did not anticipate.
  3. At inference time, the effective reward signal is whatever feedback enters the context window. No feedback means no guidance.

Joint Inference: Search in a Conditioned Space

At inference, the pre-trained distribution and RL policy combine. The prompt conditions the distribution (narrowing what's reachable), and the policy navigates that space (selecting trajectories that maximize reward). The full trajectory is a sequence of (state, action, feedback) tuples, where each state is the accumulated context window.

Agents execute search: the model's learned policy navigates the space of possible trajectories toward a reward signal. Chain-of-thought tokens and planning steps are part of the trajectory. They enter the context window and re-condition the policy's next action.

III. Design Principles

Each principle is a direct consequence of the theoretical foundations. They are stated as design constraints every component must satisfy.

Principle 1: The environment is the product

The trained policy is opaque and approximately the same for everyone (Claude, GPT, Gemini). The only variables under your control are the context window and the environment. The framework must make environment design the primary activity. The agent is a parameter you pass in.

Principle 2: Declare convergence criteria

Imperative instructions ("do X then Y then Z") attempt to control the trajectory directly. But the trajectory is non-deterministic. The framework must let developers declare what success looks like (verifiers), what the agent can observe (environment), and what it cannot do (boundaries). The agent finds the trajectory. You define the space it searches in.

Principle 3: Context is a finite resource

The context window is the model's entire reality. Noise warps the distribution the model samples from. Adding irrelevant content degrades performance even when relevant information is still present. The framework must treat the context window as a finite resource with a hard ceiling. When the ceiling is reached, the run ends. Compression is always lossy.

Principle 4: Boundaries are enforced topology

Telling an agent "avoid X" through a prompt is a suggestion. Making X physically impossible through permissions is a guarantee. RL-trained agents find and exploit any accessible path to reward. The framework must enforce boundaries at the runtime level, making unauthorized trajectories structurally impossible.

Principle 5: Feedback is continuous runtime signal

Test results, linter output, and build signals that enter the context window reshape the agent's behavior at every step. They are the inference-time equivalent of the RL training reward. The framework must make feedback a first-class, continuous component of the execution loop.

Principle 6: Every run is a trajectory, and trajectories are data

Agent behavior is non-deterministic. The same field definition may produce different trajectories on different runs. Debugging agent failures requires inspecting the trajectory: what entered the context window, when the field shifted, where the search diverged. The framework must record complete trajectories as first-class, replayable, diffable data structures.

Principle 7: The framework learns from its own runs

Many properties of a good field definition cannot be determined before runtime. Which tools get used? Which observations cause divergence? How many tokens does a typical run consume? The framework must accumulate trajectory data across runs and surface patterns that help the developer refine field definitions over time. The framework counts, correlates, and reports. The developer decides what to change.

IV. The Six Primitives

Every concept in the framework reduces to one of six primitives. Each primitive corresponds to a variable in the theoretical model that the developer can control.

Field

The field is the top-level primitive. Like a function, it is a self-contained unit of work with declared inputs, outputs, and constraints. A field is defined in configuration and executed by the runtime.

Fields are the unit of composition. A pipeline is a sequence of fields. Parallel execution is multiple independent fields. The field boundary is the isolation boundary. Context does not leak between fields unless explicitly passed.

Environment

The environment is the territory the agent operates in, defined as an immutable snapshot with allowed mutation channels.

An environment specifies a filesystem snapshot (commit, container image, or file listing), a tool manifest (the set of tools that exist), and a network policy (what external resources are reachable). The agent cannot observe or act on anything outside the environment.

Environments are versioned and composable. A clean environment produces a clean context window. A noisy environment produces drift.

Verifier

A verifier is a function that evaluates current state and returns a signal: pass, fail, or numeric score. Verifier output enters the context window automatically.

Verifiers serve two purposes: they provide the runtime reward signal (the inference-time equivalent of RL training reward), and they define correctness. Strong, unambiguous feedback guides the search. Weak or absent feedback produces blind wandering.
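A verifier of this shape can be sketched in a few lines of Python. This is an illustrative sketch, not the framework's actual implementation: the names VerifierResult and run_verifier are hypothetical, and the exit-code-to-signal mapping is an assumption.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class VerifierResult:
    name: str
    passed: bool
    score: float  # 1.0 pass, 0.0 fail; a graded verifier could emit values in between
    output: str   # stdout/stderr, the text that would be injected into the context window

def run_verifier(name: str, command: str, timeout: int = 120) -> VerifierResult:
    """Run a shell-command verifier and map its exit code to a pass/fail signal."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    passed = proc.returncode == 0
    return VerifierResult(
        name=name,
        passed=passed,
        score=1.0 if passed else 0.0,
        output=(proc.stdout + proc.stderr)[-2000:],  # truncate to limit context noise
    )
```

Usage mirrors the field.toml example later in this document: run_verifier("tests-pass", "pytest tests/test_auth.py -x").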

Context Policy

The context policy has two components: a token budget (hard ceiling on context window size) and a re-observation schedule (commands that run before each step to keep context fresh).

When the token budget is reached, the run terminates. Automatic summarization and compression are disabled by default. Re-observations keep the context accurate but consume budget.
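The hard-ceiling semantics can be made concrete with a small sketch. The class name and the whitespace-based token count are assumptions for illustration; a real runtime would count with the model's tokenizer.

```python
class ContextBudgetExceeded(RuntimeError):
    """Raised when the hard ceiling is hit: the run ends, nothing is silently compressed."""

class ContextPolicy:
    def __init__(self, budget: int, re_observe: list[str]):
        self.budget = budget          # hard ceiling on context size
        self.re_observe = re_observe  # commands run before each step
        self.used = 0

    def charge(self, text: str) -> None:
        # Crude token approximation by whitespace; stands in for a real tokenizer.
        self.used += len(text.split())
        if self.used > self.budget:
            raise ContextBudgetExceeded(
                f"context budget exhausted: {self.used}/{self.budget} tokens")
```

Note that re-observations are charged like everything else: keeping the context fresh spends the same finite budget.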

Boundary

A boundary defines the hard walls of the search space: filesystem write permissions, tool restrictions, network egress policy, cost limits, and step limits.

Boundaries are enforced by the sandbox. For structured tool calls, the runtime inspects and rejects violations. For shell execution, enforcement happens at the OS level through container isolation.
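The structured-tool-call half of this enforcement might look like the following sketch. The Boundary class is hypothetical; note that fnmatch's * crosses path separators, which is a simplification of real glob semantics, and OS-level enforcement for shell execution would live in the container, not here.

```python
from fnmatch import fnmatch

class BoundaryViolation(Exception):
    """A rejected action; the rejection message re-enters the context window."""

class Boundary:
    def __init__(self, fs_write: list[str], max_steps: int):
        self.fs_write = fs_write    # glob patterns of writable paths
        self.max_steps = max_steps

    def check_write(self, path: str) -> None:
        # Reject the write before it ever reaches the sandbox.
        if not any(fnmatch(path, pattern) for pattern in self.fs_write):
            raise BoundaryViolation(f"write to {path!r} is outside the declared boundary")
```

With the example field later in this document, check_write("auth.py") passes while check_write("main.py") raises.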

Trajectory

A trajectory is the structured event log of an agent's execution. Each entry records the step number, action taken, environment response, verifier results, and token count.

Trajectories are structured data with typed fields, making them queryable, replayable, and diffable. They are the input to the adaptation system.
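A minimal sketch of such a record, assuming JSON Lines as the serialization (the field names and the dump_trajectory helper are illustrative, not the framework's schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TrajectoryStep:
    step: int                # step number in the run
    action: dict             # e.g. {"type": "tool_call", "tool": "write", "path": "auth.py"}
    response: str            # environment response injected into context
    verifier_results: dict   # verifier name -> passed
    tokens_used: int         # cumulative context size after this step

def dump_trajectory(steps: list[TrajectoryStep]) -> str:
    """One JSON object per line keeps trajectories diffable and queryable."""
    return "\n".join(json.dumps(asdict(s)) for s in steps)
```

Because every field is typed and flat, two runs can be aligned step-by-step and compared structurally, which is what the replay and diff tooling below relies on.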

V. The Configuration Layer

A field is defined in a configuration file called field.toml. The configuration is human-readable, diff-friendly, and has clear structure-to-semantics mapping.

Complete Example: Scoped Bug Fix

[field]
name = "fix-jwt-validation"
model = "claude-sonnet-4-5"
prompt = """
Fix the JWT expiration validation bug in auth.py.
The exp claim is being compared as a string instead of
an integer. Only modify auth.py and the test file.
"""

[environment]
snapshot = "git:HEAD"
tools = ["read", "write", "bash"]
ephemeral = true

[[verifier]]
name = "tests-pass"
run = "pytest tests/test_auth.py -x"
when = "after_each_write"

[[verifier]]
name = "scope-guard"
run = "git diff --name-only | grep -qvE '^(auth\\.py|tests/)' && exit 1 || exit 0"
when = "before_terminal"

[context]
budget = 32_000
re_observe = ["git diff --stat"]

[boundary]
fs_write = ["auth.py", "tests/**"]
network = "deny_all"
max_steps = 30
max_cost = "$2.00"
sandbox = "container"

Note the structure. The developer declared a search space: what the agent can see (environment), what success looks like (verifiers), how much context is available (budget), and what is physically impossible (boundary). The runtime executes the search.

VI. The Runtime

The runtime is the execution engine that turns a field definition into a trajectory.

Sandbox Architecture

Each field executes in an isolated sandbox constructed from the environment definition. The filesystem snapshot is mounted (read-only base with a copy-on-write layer for mutations), the tool manifest determines which handlers are registered, and the network policy is enforced at the namespace level.

The sandbox translates boundaries into topology. For structured tool calls (read, write), the runtime inspects parameters and rejects violations. For bash, enforcement happens at the OS level: the container has no network interface if network = "deny_all", and the filesystem mount is read-only except for allowed paths.

The Agent Loop

The agent loop is deliberately simple:

  1. Execute scheduled re-observations and inject into context. Check token budget.
  2. Invoke policy with current context. Receive action (text, tool call, or stop).
  3. Check action against boundary. If violation, reject and inject rejection into context.
  4. Dispatch action to sandbox. Receive environment response.
  5. Dispatch triggered verifiers. Inject results into context.
  6. Check token budget. If exceeded, terminate.
  7. Record step to trajectory log.
  8. Check termination conditions. If met, end. Otherwise, return to step 1.
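
The steps above can be sketched in Python. Every interface here (policy, sandbox, boundary, context policy) is a hypothetical placeholder standing in for the runtime's real components:

```python
def run_field(policy, sandbox, verifiers, context_policy, boundary, max_steps):
    """Minimal sketch of the agent loop. `policy` is a callable context -> action;
    `sandbox` executes actions; `context_policy.charge` raises at the hard ceiling."""
    context, trajectory = [], []
    for step in range(max_steps):
        for obs in sandbox.re_observe():            # 1. scheduled re-observations
            context.append(obs)
            context_policy.charge(obs)              #    budget check
        action = policy(context)                    # 2. invoke policy
        if action["type"] == "stop":                # 8. termination condition
            break
        try:
            boundary.check(action)                  # 3. boundary check
        except Exception as violation:
            context.append(f"rejected: {violation}")  # rejection re-enters context
            continue
        response = sandbox.dispatch(action)         # 4. environment response
        results = [v(sandbox) for v in verifiers]   # 5. triggered verifiers
        context.append(response)
        context_policy.charge(response)             # 6. budget check (raises at ceiling)
        trajectory.append((step, action, response, results))  # 7. record step
    return trajectory
```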

The simplicity is intentional. The loop is plumbing. The design work is in the field definition.

VII. Structural Checks

Before execution, the framework runs structural checks on the field definition (portlang check). These verify properties that can be checked without inference or natural language understanding.

What Can Be Checked Statically

The configuration parses and every section is well-formed. Every tool referenced by the field exists in the environment's tool manifest. Boundary paths and glob patterns are syntactically valid. The token budget, step limit, and cost limit are positive. Verifier commands are parseable shell.

What Cannot Be Checked Statically

Whether the prompt is clear. Whether the verifiers actually capture the developer's intent (the proxy gap). Whether the agent will converge, which tools it will use, or how many tokens a run will consume. These are runtime properties, observable only through trajectories.

VIII. Observability, Adaptation & Debugging

Agent systems are non-deterministic. The same field can produce success on one run and divergence on the next. Traditional debugging does not transfer. You need to reason about distributions of trajectories.

Trajectory Replay

Every run produces a trajectory (the structured event log). The portlang replay command loads a trajectory and lets you step through it. At each step, see the action taken, environment response, verifier results, and token count. The context window at any step can be reconstructed.

Replay makes visible the phenomena that cause drift: noisy tool output at step 5 consuming 3,000 tokens, re-observation returning unexpected data, ambiguous verifier signal misinterpreted.

Trajectory Comparison

The portlang diff command compares two trajectories from the same field definition at the structural level. It aligns by action type and identifies the first point of structural divergence.

This is structural comparison. The framework compares action types, target paths, tool names, and verifier results. Most agent failures have structural causes: a different file was read, a different tool was called, a verifier triggered at a different point.

Adaptation Through Trajectory Data

The framework accumulates trajectory data across runs. Over time, patterns emerge. The framework surfaces these as reports. The developer decides.

This requires counting, correlating, and presenting. The framework is a query engine over trajectory data. It shows what happened. The developer decides what to change.

Convergence Rate

The portlang bench command runs a field N times and reports the convergence rate: what fraction of runs pass all verifiers. It also reports average trajectory length, cost distribution, and divergence clusters.

Running a field 100 times is expensive, and the framework is honest about this. For expensive models or long-running fields, use cheaper proxies first: run 5 times, check for obvious issues, adjust, then run the larger benchmark.
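The aggregation behind such a report is simple counting over run summaries. The input schema here (per-run dicts with a pass flag, step count, and first failing verifier) is an assumption for illustration:

```python
from collections import Counter

def convergence_report(runs: list[dict]) -> dict:
    """Summarize N runs of a field: convergence rate, mean trajectory length,
    and divergence clusters keyed by the first failing verifier."""
    n = len(runs)
    converged = [r for r in runs if r["all_verifiers_passed"]]
    clusters = Counter(r.get("first_failed_verifier")
                       for r in runs if not r["all_verifiers_passed"])
    return {
        "convergence_rate": len(converged) / n,
        "mean_steps": sum(r["steps"] for r in runs) / n,
        "divergence_clusters": dict(clusters),
    }
```

Clustering failures by the first verifier that failed is one cheap way to surface patterns without any model in the loop: the framework counts, the developer interprets.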

IX. Composition & Multi-Agent

Multi-agent systems in current frameworks typically involve agents sending messages, accumulating context across boundaries. The theoretical model predicts this degrades: each agent's output enters the others' context windows as tokens, compounding noise. DeepMind tested 180 agent configurations and found sequential multi-agent tasks degraded 39-70% versus a single agent.

The framework takes a different approach: composition through fields and artifacts.

Pipelines

A pipeline is a sequence of fields connected by artifacts. Each stage is a self-contained field with its own environment, verifiers, context policy, and boundary. The output of one stage is a file (artifact) that becomes input to the next stage's environment.

Each field in a pipeline gets a fresh context window. Noise from stage 1 stays in stage 1. Only the artifact passes forward. This is the structural equivalent of process isolation.
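The artifact-passing pattern can be sketched as follows. run_stage is a hypothetical stand-in for executing one field to completion; the point is that only the artifact file crosses the stage boundary, never the context:

```python
from pathlib import Path

def run_pipeline(stages, workdir: str):
    """Chain fields through artifacts. Each stage starts with a fresh context
    and receives only the previous stage's artifact."""
    artifact = None
    for i, run_stage in enumerate(stages):
        out_path = Path(workdir) / f"stage_{i}.artifact"
        # The stage sees only the incoming artifact; noise from earlier
        # stages never enters its context window.
        out_path.write_text(run_stage(artifact))
        artifact = out_path.read_text()
    return artifact
```

Writing the artifact to disk between stages is deliberate: it makes the hand-off inspectable and gives gates (below in the original sense of this section) a concrete object to verify.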

Parallel Execution

Independent fields can run in parallel. Parallel fields are the natural pattern for tasks that decompose into independent sub-problems. Example: compiling different files where each compilation is an independent field with a clear verifier.

Parallel fields share nothing by default. If they need shared artifacts, it happens through explicit shared volumes with appropriate boundary constraints.

Gating

A gate is a verifier that blocks pipeline progression. It sits between stages and requires all verifiers to pass before the pipeline advances. Gates ensure artifacts meet quality criteria, preventing degraded output from contaminating the next field's search space.

X. What This Does Not Solve

Proxy mismatch is inherent

Verifiers are reward proxies. The gap between "all verifiers pass" and "the result is correct" will always exist in any reward-based system. The framework makes this gap visible and auditable. It cannot close it. Reviewing agent output remains necessary.

Policy opacity remains

The trained policy is a black box. The framework gives you every lever except the one inside the model. When two signals in the context window conflict, the policy resolves them according to its training. How it resolves conflicts remains partially opaque.

Complete safety requires more

Boundaries eliminate trajectories, which contributes to safety. Complete safety requires formal verification, adversarial testing, and runtime monitoring beyond what a developer framework provides. The framework is a necessary component of safe agent deployment, not a sufficient one.

Novel tasks remain hard

The framework works well on tasks with clear verifiers: coding, data transformation, structured output generation. Tasks where success is hard to specify (creative work, open-ended research, judgment calls) have weak verifiers. Weak verifiers produce weak runtime reward signals. Weak signals produce poorly guided search.

Coordination at scale is unsolved

Pipeline and parallel execution patterns handle simple composition. Complex multi-agent coordination (dynamic task allocation, negotiation, adaptive decomposition) is not yet solved here. The framework provides isolation primitives that prevent the worst failure modes. It does not solve coordination.

Context management is lossy by nature

The framework imposes a hard token budget. When exhausted, the run terminates. Developers who want summarization or pruning can build it. The framework treats lossless context compression as impossible. Any compression loses information, and the framework cannot know which information matters.

• • •

The purpose of this document is to establish a foundation: a framework that takes seriously what we know about how these systems work and builds engineering discipline around it.