Building an Intelligence Layer for Human Systems

What changed when I stopped treating agents as chat sessions and started treating them as an intelligence layer over artifacts, intentions, actions, and evaluation.

  • AI agents
  • Systems
  • Operations

The interesting shift is not from weaker models to stronger models. It is from prompting agents manually to operating systems that can hold intent, route action, and learn under governance.

For a while, my model of working with AI agents was simple: open a session, give it context, ask for help, and keep the thread alive as long as the work required.

That model works longer than people think. You can get a great deal done that way. But eventually it runs into a wall.

The problem is not just token cost, though token cost makes the problem obvious. The deeper issue is that chat is a fragile operating model. Context has to be dragged forward manually. Memory is inconsistent. Responsibilities blur. Every useful behavior depends on me remembering what to ask next and when to ask it.

At some point, I realized I was not really trying to build better prompts. I was trying to build a better operating environment.

That distinction matters because it changes what the system is for.

Hierarchy was always an information system

One reason Jack Dorsey’s essay and interview on moving from hierarchy to intelligence landed with me is that he frames hierarchy as an information-flow structure. Companies built management chains because information had to move through people. Updates, decisions, status, context, and intent all had to be relayed upward and downward in a way humans could manage at their scale.

That was a sensible solution to an older constraint.

But most modern work already produces a constant stream of artifacts:

  • messages
  • documents
  • pull requests
  • meeting transcripts
  • task updates
  • notes
  • dashboards
  • plans

Those artifacts are not just residue from the work. They are the work, or at least the legible trace of it.

If the work now leaves artifacts everywhere, then the obvious next question is whether a system can sit on top of those artifacts and help coordinate intelligence directly rather than relying entirely on humans to relay context by hand.

That is the question I keep circling.

Chat agents are not enough

A chat agent can be useful in the same way a very capable assistant is useful: you ask, it responds, and if the thread stays alive long enough it starts to feel like it understands the surrounding work.

But that understanding is usually shallow in the ways that matter operationally.

It may know what you said five turns ago, but not what role it is supposed to play in a larger system. It may remember facts from the session, but not which facts are shared state versus its local perspective. It may produce a good answer, but there is often no stable connection between that answer and the intention it was supposed to serve.

That is why I became less interested in “an agent that can answer” and more interested in a system that can:

  • hold explicit intent
  • route work when events happen
  • attach actions to responsibilities
  • evaluate whether outputs were actually good
  • improve without mutating itself invisibly

That is the difference between using agents and operating them.

What a system like this actually is

The easiest way to misunderstand a system like this is to treat it as a glorified event router or a daemon that launches prompts when files change.

Both descriptions are technically accurate, but they miss the point.

What I have actually built is a backend intelligence layer that sits behind a larger human system. It is the autonomous layer that helps run the collective intelligence of different domains: a life, a company, a project, a team, or some smaller unit within them.

There are other ways to contribute manually. I can still write directly in Obsidian. I can still talk to an agent one-on-one. I can still make decisions without the system. It is not meant to replace those modes.

It is meant to be the layer that keeps running behind the scenes:

  • watching for signals
  • routing them to the right scoped agents
  • producing artifacts
  • evaluating what came back
  • surfacing misalignment
  • proposing improvements

It is the part of the system that tries to preserve continuity even when my attention moves elsewhere.
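To make that loop concrete, here is a minimal sketch of what the background layer might look like in code. Everything in it is illustrative: the event kinds, the agent registry, and the stand-in evaluation are assumptions for the sake of the example, not the actual implementation.

```python
# Illustrative sketch only: event kinds, routes, and evaluation are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    kind: str      # e.g. "meeting_note_created", "file_changed"
    payload: dict  # the raw signal the watcher picked up


# Scoped agents registered against the kinds of signals they own.
ROUTES: dict[str, Callable[[Event], dict]] = {
    "meeting_note_created": lambda e: {"artifact": "briefing.md", "ok": True},
    "task_updated": lambda e: {"artifact": "status.md", "ok": True},
}


def run_layer(events: list[Event]) -> list[dict]:
    """Watch signals, route them to scoped agents, collect artifacts, flag gaps."""
    results = []
    for event in events:
        handler = ROUTES.get(event.kind)
        if handler is None:
            # Surface misalignment: a signal nobody is responsible for.
            results.append({"event": event.kind, "unrouted": True})
            continue
        artifact = handler(event)                        # produce an artifact
        artifact["aligned"] = artifact.get("ok", False)  # stand-in for real evaluation
        results.append(artifact)
    return results
```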

That is the point where the abstract idea starts becoming operational. If you really want a system like this, you eventually have to decide how it will represent purpose, how it will connect purpose to behavior, and how it will know whether behavior was actually useful.

The system starts with declared intent

One of the biggest changes in my own implementation is that intent is no longer implicit.

Every meaningful layer is moving toward an explicit program.md:

  • a global program
  • an org program
  • an agent-specific program

Those files are not implementation details. They are the declaration of what the system is for.

They describe:

  • objectives
  • intentions
  • constraints
  • boundaries
  • evaluation criteria
  • signals of misalignment

This matters because most automation systems know how to execute actions, but they do not know what those actions are supposed to serve. They have triggers without purpose.

The point of program.md is to make purpose first-class.
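As a hedged sketch, a program.md built around those categories might read roughly like this. The wording, the agent, and the specific fields are invented for illustration; only the category headings come from the list above.

```markdown
# program.md — customer-development agent (illustrative)

## Objectives
- Keep a current picture of open customer conversations and what they imply.

## Intentions
- intent: surface-unmet-needs
- intent: protect-founder-time

## Constraints and boundaries
- Never contact a customer directly; draft, do not send.

## Evaluation criteria
- Briefings name owners, tradeoffs, next steps, and supporting evidence.

## Signals of misalignment
- Repeated briefings that restate raw notes without synthesis.
```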

In practice that means an agent does not just “have a prompt.” It sits inside a declared context. A chief-of-staff agent is not interchangeable with a customer-development agent because they are not carrying the same intentions, the same boundaries, or the same theory of success.

That is a more important distinction than which model is running underneath.

actions.yaml is where intention becomes action

If program.md is the declaration of purpose, actions.yaml is the operational mapping from purpose to behavior.

This is where the system becomes more than a prompt runner.

The newer manifests do not just say:

  • run this prompt on this event

They now increasingly say:

  • this action has an action_id
  • it serves these intent_ids
  • it produces these artifacts or events
  • it is evaluated in this mode
  • it is judged by these quality gates

That is an important change because it creates traceability.
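For example, a single entry in such a manifest might look roughly like this. The field names follow the list above; the identifiers, trigger, and gates are invented for illustration.

```yaml
# Illustrative manifest entry; identifiers, trigger, and gates are hypothetical.
- action_id: daily-founder-briefing
  intent_ids: [surface-unmet-needs, protect-founder-time]
  trigger: meeting_note_created
  produces:
    artifacts: [briefings/daily.md]
    events: [briefing_published]
  evaluation:
    mode: rubric
    quality_gates:
      - has_owners_and_next_steps
      - cites_source_artifacts
```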

You can now ask:

  • Which actions exist for this intention?
  • Which intentions have no action behind them?
  • Which artifacts are being produced?
  • Which actions are high-volume but low-quality?
  • Which ones are expensive?
  • Which ones are drifting?

This is the point where a system starts becoming inspectable rather than magical.

And once a system is inspectable, it becomes governable.
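Concretely, once intent and actions live in files like these, most of the questions above reduce to small queries over the manifest. A minimal sketch, assuming actions.yaml entries shaped like the example earlier and a list of declared intent ids pulled from program.md:

```python
# Minimal traceability check; file layout and field names are assumptions.
import yaml  # requires PyYAML


def coverage_report(program_intents: list[str], actions_path: str = "actions.yaml") -> dict:
    """Which intentions have actions behind them, and which are orphaned?"""
    with open(actions_path) as f:
        actions = yaml.safe_load(f) or []

    covered = {i for action in actions for i in action.get("intent_ids", [])}
    return {
        "intents_without_actions": sorted(set(program_intents) - covered),
        "actions_without_intents": [
            a["action_id"] for a in actions if not a.get("intent_ids")
        ],
    }
```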

Evaluation is not a side feature

Most agent systems stop once something has run.

That is where the real difficulty starts.

The hard question is rarely “did the agent produce output?” The hard question is “did the output serve the intention behind the action?”

That has pushed my implementation toward a more serious answer to evaluation.

Instead of one generic score, the system is moving toward multiple layers:

  • structural checks
  • rubric checks
  • alignment checks
  • efficiency checks
  • intention-to-action traceability

That means “what good looks like” has to be described, not assumed.

For some actions, good means a founder-facing briefing with explicit tradeoffs, owners, next steps, and evidence. For others, good means a routing action that moved the right signal without pretending to be a polished artifact. For others, good means data preparation or extraction quality.

This sounds obvious, but most systems still collapse all of these into one vague notion of success.

The better move is the opposite one: make the judgment legible.
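One way to make the judgment legible is to represent each layer as an explicit check and record every verdict, rather than collapsing them into one score. A sketch under the assumption that an artifact is just text plus metadata; the check bodies are placeholders for real rubrics, and the metadata fields are invented for illustration.

```python
# Layered evaluation sketch; the checks are placeholders for real rubrics.
from typing import Callable

Check = Callable[[str, dict], bool]


def structural(text: str, meta: dict) -> bool:
    return bool(text.strip())                      # did we get a well-formed artifact at all?


def rubric(text: str, meta: dict) -> bool:
    required = meta.get("quality_gates", [])       # e.g. owners, next steps, evidence
    return all(gate in meta.get("satisfied_gates", []) for gate in required)


def alignment(text: str, meta: dict) -> bool:
    return bool(meta.get("intent_ids"))            # is the output traceable to a declared intention?


def efficiency(text: str, meta: dict) -> bool:
    return meta.get("cost_usd", 0.0) <= meta.get("cost_budget_usd", float("inf"))


LAYERS: dict[str, Check] = {
    "structural": structural,
    "rubric": rubric,
    "alignment": alignment,
    "efficiency": efficiency,
}


def evaluate(text: str, meta: dict) -> dict[str, bool]:
    """Return one verdict per layer instead of a single opaque score."""
    return {name: check(text, meta) for name, check in LAYERS.items()}
```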

This is where the Karpathy Loop becomes useful

One reason I keep coming back to the Karpathy Loop is that it treats natural language intent as something operational, not ornamental.

In my implementation, the loop is roughly:

program.md -> run -> evaluate -> proposal -> human approval -> updated behavior

That matters because it gives the system a way to learn without pretending that it should rewrite itself freely in production.

A lot of people hear “self-improving agents” and imagine total autonomy. I think that is mostly the wrong instinct.

The more credible version is narrower and more disciplined:

  • declare intent explicitly
  • act against the world
  • evaluate the results
  • detect repeated drift or weakness
  • generate proposals for change
  • require a human to approve those changes

That last step is not a temporary safety patch. It is the governance model.

The point is not to remove human judgment. The point is to give human judgment a better substrate:

  • better context
  • better evidence
  • better traceability
  • better candidate improvements

The system suggests. A human remains responsible for what becomes policy.
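In code terms, the governance step is simply that the loop never applies a change to program.md or actions.yaml itself; it writes the change out as a proposal and waits. A minimal sketch of that gate, with the proposal format, paths, and approval mechanism invented for illustration:

```python
# Proposal gate sketch; paths, fields, and the approval mechanism are assumptions.
import json
from pathlib import Path

PROPOSALS = Path("proposals")


def propose_change(target: str, rationale: str, diff: str) -> Path:
    """The system may only suggest: write a proposal artifact, never edit policy files."""
    PROPOSALS.mkdir(exist_ok=True)
    path = PROPOSALS / f"{len(list(PROPOSALS.glob('*.json'))) + 1:04d}.json"
    path.write_text(json.dumps({
        "target": target,        # e.g. "program.md" or "actions.yaml"
        "rationale": rationale,  # evidence from evaluation, repeated drift, etc.
        "diff": diff,            # the candidate change itself
        "status": "pending",     # flips to "approved" only when a human says so
    }, indent=2))
    return path


def apply_if_approved(proposal_path: Path) -> bool:
    """Only human-approved proposals become policy; everything else stays a suggestion."""
    proposal = json.loads(proposal_path.read_text())
    if proposal["status"] != "approved":
        return False
    Path(proposal["target"]).write_text(proposal["diff"])  # simplistic: whole-file replacement
    return True
```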

The role of the human changes

That is probably the part I care about most.

I do not think the interesting future is one where the AI “makes most of the decisions.” I think that framing causes people to either overstate what the system can do or understate what humans still need to own.

The more interesting shift is that the human role changes from:

  • manually relaying context
  • remembering every dependency
  • kicking off every next step
  • reviewing everything from scratch

to:

  • setting intent
  • defining boundaries
  • reviewing proposals
  • correcting misalignment
  • steering direction

That is a very different posture.

The system is not replacing leadership. It is changing what leadership needs to spend its time on.

If the operating environment is good, the human spends less time acting as a lossy message bus and more time acting as a governor of direction, meaning, and tradeoff.

Why this matters beyond one company

It would be easy to describe this as a company tool, but that would undersell the idea.

The same pattern applies to a life, not just a startup.

A life also has:

  • artifacts
  • competing priorities
  • recurring signals
  • explicit and implicit goals
  • operating constraints
  • drift between stated intention and actual behavior

The same is true for any meaningful project.

That is why I keep thinking about it less as a workflow tool and more as an intelligence layer for human systems. The unit is not just “the company.” The unit is any domain where intentions, actions, artifacts, and feedback need to stay coherent over time.

That is the larger ambition.

What I think comes next

I do not think we are heading toward a world where every company simply bolts an LLM onto its documentation and calls it intelligence.

The more durable systems will probably look more like this:

  • explicit declared intention
  • durable artifacts everywhere
  • scoped agents with real responsibilities
  • event-driven routing
  • evaluation tied to what good looks like
  • proposals instead of silent drift
  • human approval as governance

That is not a chatbot.

It is not just automation either.

It is an attempt to build less lossy systems for collective intelligence.

That phrase can sound grandiose if it stays abstract. What makes it real is that it starts with very mundane things: a meeting note, a file change, a morning briefing, a missed priority, a proposal for refinement, a boundary that should not be crossed.

The larger idea only becomes credible if it can survive those details.

That is what I am trying to build toward.

The interesting shift is not from weaker models to stronger models.

It is from prompting agents manually to operating systems that can hold intent, route action, evaluate results, and adapt under human governance.

That feels much closer to the future than another better chat window.