The database ledger and S3 overflow shelf for future agent runs.
Story-751 gave PRAMAAN agents a durable home: a new agents database schema for runs, sessions, turns, tool calls, and tool registry rows, plus a shared S3 helper for bodies that are too large to keep inline. The important idea is that Postgres keeps the searchable ledger, while S3 stores the heavy boxes.
Before this Story, the agent runtime had nowhere canonical to leave a trail. A lawyer could eventually ask a matter chat question, but the backend still needed tables for "which agent run happened?", "which chat session did it belong to?", "which tool calls did it make?", and "where did we put the long message body?"
That trail matters because agent work is not just a response string. It is an operational record: who asked, in which firm, against which matter, what model ran, how many tokens were used, what tools were called, what failed, and what can be safely replayed or audited later.
Story-751 was the foundation move toward a Postgres-owned agent ledger with S3 used only for bulky content. The agent runtime itself comes later. This Story built the shelves and labels before the warehouse gets busy.
The evidence room keeps a register at the desk: case number, box number, who checked it in, who touched it, and where it sits. The register does not hold the bulky evidence itself; the boxes sit on shelves. Here, Postgres is the register, and S3 is the shelf for large payloads. You search the register first, then fetch the box only when you really need it.
Why not put the whole chat transcript and every tool input directly into Postgres?
Answer: because chat turns and tool payloads can get large fast. Postgres is excellent at identity, relationships, indexes, RLS, and transaction-safe metadata. It is not where we want to stuff every large JSON body forever. Keeping small previews inline and large bodies in S3 gives us the useful database trail without turning hot tables into a storage closet.
The spec locked the ownership boundary before code started: pramaan-functions owns all migrations, including the agents schema, even though pramaan-agents will later run agent workers. That keeps schema authority in one backend repo instead of letting every runtime invent its own tables.
The planned database shape had five foundation tables: agent_run, chat_session, chat_turn, tool_call, and tool_registry. Append-heavy tables would be partitioned by time. Tenant data would use RLS through the same firm session-variable pattern as the rest of the backend. Large content would cross an S3 boundary at about 32 KB.
The plan also made one practical promise to future repos: shared Python modules would expose the schema metadata and S3 helpers so downstream agent code can import the contract without getting permission to own migrations.
The green boxes are the new Story-751 ownership: the hand-written migration, the importable metadata, and the S3 helper/settings. The white boxes are surrounding systems that either consume the schema later or enforce existing platform rules.
The original Story asked for the agents schema, five foundation tables, RLS, time-first partitioning for append-heavy tables, S3 offload helpers, initial tool registry seeds, and SQLAlchemy reflection helpers for pramaan-agents.
Prompt version tables, subagent registry tables, WebSocket connection state, production bucket provisioning, lifecycle policies, and the real runtime that writes these rows were all left for later. Story-751 is the foundation slab, not the finished building.
PR #186 added 792 lines and removed 2 lines across 6 files. The change was mostly one hand-written migration, plus two shared modules, S3 bucket settings, an Alembic autogenerate guard, and unit tests for the S3 boundary.
alembic/versions/8e7a6b5c4d3f_0044_agents_schema_foundation.py - start with the real database contract.src/shared/s3_offload.py - learn how large payloads leave Postgres and come back.tests/unit/shared/test_s3_offload.py - see the behavior the helper promises.src/shared/agents_db_models.py - read the importable SQLAlchemy table metadata.src/shared/settings.py - find the bucket name and environment fallback.alembic/env.py - see why the agents schema is excluded from autogenerate.This migration creates the agents schema, the foundation tables, RLS policies, initial month partitions for May through July 2026, runtime grants, and the first tool_registry rows for the MatterChatAgent tool names.
The most important implementation detail is that agent_run, chat_turn, and tool_call are partitioned by timestamp. That makes sense because these tables grow like an event log: new rows keep arriving, and old rows can later be managed by month.
Why does time partitioning change the shape of primary keys and foreign keys?
Because: PostgreSQL requires unique constraints on a partitioned parent table to include the partition key. A UUID-only primary key on agent_run would fight the time partition. The implementation solved that with agents.agent_run_key, a small anchor table that lets other rows reference a stable run_id while the heavy run rows stay partitioned by started_at.
This module decides whether content stays inline or moves to S3. Small strings come back as inline. Large strings are uploaded to a deterministic key shaped like agents/{firm_id}/{yyyymm}/{run_id}/{kind}/{sha256}.json.
That key shape is not decoration. It gives operations a sane shelf order: firm first, month second, run third, payload kind fourth, and content hash last. If you have ever searched a storage room where every box says "misc", this is the opposite of that.
Why does the helper compute a SHA-256 hash before storing the body?
Because: the hash gives us a stable content fingerprint. It helps name the object and gives future readers a way to reason about whether the payload they fetched is the payload the ledger said existed.
The tests cover the split that matters: inline threshold behavior, UTF-8 byte sizing, S3 upload behavior, required firm_id and run_id for offloaded content, fetching inline content, fetching S3 content, and invalid storage states.
This is the right test layer because the helper is a trust boundary. Later runtime code should not need to rediscover what "large" means or how a key is shaped. It should call the helper and get the contract.
This file defines SQLAlchemy Core table metadata for the agents schema. It is not the migration owner. Think of it as a printed map of the warehouse layout, not the construction crew that builds the warehouse.
The downstream value is practical: pramaan-agents can import the table shapes for bootstrap/reflection work without creating a second Alembic authority. That keeps schema changes flowing through pramaan-functions.
Settings gained AGENTS_BUCKET_NAME with a fallback of pramaan-agents-{env}, using PRAMAAN_ENV or ENV before falling back to local. The handler code should not read raw environment variables for this value.
The mental model is a wall socket: runtime code plugs into get_settings(). It does not pull wires out of the wall and guess voltage by reading os.environ everywhere.
The migration is hand-written, and the agents schema is intentionally excluded from SQLModel autogenerate. Without that guard, alembic check could see tables that are not in SQLModel metadata and propose nonsense drops.
Autogenerate is a measuring tape, not a judge. It only compares what it can see. If a hand-written schema is invisible to SQLModel metadata, the env guard tells Alembic not to treat that invisibility as a delete request.
The useful deviation is that the Story asked for five foundation tables, but the merged migration creates six. The extra table is agents.agent_run_key.
That was not scope creep. It was the clean way to satisfy two requirements that pull against each other: keep agent_run partitioned by started_at, and still let other tables point at a stable UUID run_id. The small anchor table holds that UUID identity; the partitioned run table holds the time-series details.
The other deviation is operational: the Story mentioned S3 bucket and lifecycle/IAM work, but the merged PR only added helper code and settings. Actual bucket provisioning and lifecycle policy remain ops work, which is the right boundary for infrastructure resources.
The main design error was caught during implementation: a partitioned Postgres parent table cannot keep a UUID-only primary key unless that uniqueness includes the partition column. Trying to make agent_run.run_id do everything would have made partitioning and references fight each other. The fix was agent_run_key.
Local full migration verification also hit an unrelated pre-existing blocker before this migration ran: revision 0037 failed because local business.clients.name contained NULL values while the migration tried to set the column NOT NULL. The Story-751 migration was smoke-tested in isolation with upgrade and downgrade inside a rollback transaction instead.
Full local repo checks were not clean for reasons outside this PR: historical lint outside the changed files, missing local pramaan_app role, DB-backed fixture failures, and mypy requiring targeted invocation. The new code still got focused ruff, pytest, mypy, import smoke, migration smoke, and diff whitespace checks.
agent_run_key looks like an extra table until you remember the Postgres partition rule. Do not delete it because the spec said "five tables"; it is the adapter between stable UUID identity and time partitioning.
Small content and large content have different storage paths. If content_storage says inline, read inline fields. If it says s3, use the S3 key and helper. Do not assume every turn body is in one place.
tool_registry is seeded/static. Runtime code should read active tools; it should not casually mutate the registry during agent execution.
The file on current main may include later Story-752 or Story-753 additions. For Story-751 history, read PR #186 and the merge commit context, not only today's file contents.
This Story intentionally stopped at the foundation. The runtime still needs to write real agent runs, chat turns, and tool calls. Bucket provisioning, IAM, and lifecycle policy need to be owned by infrastructure/ops work. Prompt version tables, subagent registry tables, and WebSocket connection state were explicitly deferred.
The next engineer should treat this schema like a ledger that is ready to be used, not like a complete agent product. The product behavior arrives when later Stories connect queue workers, WebSocket streaming, MatterChatAgent logic, and tool execution to these tables.
pramaan-functions own the agents migration even though pramaan-agents will use it?agent_run_key?If you cannot answer those without opening five tabs, ping Ankit before extending this surface. This is foundation code; small misunderstandings here become expensive later.