Story-753 * functions lane * Size M * Walkthrough

WebSocket agent bridge handlers

The browser-facing bridge between live chat and the queued agent runtime.

Built by Conductor agent on 2026-05-21

The 30-second version

Story-753 added the Lambda handlers for API Gateway WebSocket routes: $connect, $disconnect, and $default. The important idea is simple: the WebSocket edge authenticates the browser once, stores a small connection record, then later messages use that record to enqueue agent work safely. The agent runtime stays decoupled behind SQS.

PART ONE - WHAT WE PLANNED TO DO AND WHY

1 Why does this exist?

Agent chat is not a normal request/response page. A lawyer asks a question, the browser needs to stay connected, the backend needs to start slower AI work, and the answer may come back later in pieces. A plain HTTP route is the wrong shape for that job.

Story-753 builds the WebSocket perimeter for that flow. It is the place where a browser connection becomes a trusted PRAMAAN connection: user resolved, firm resolved, connection stored, and later messages turned into queue jobs for the agents Lambda.

The key product reason is that live chat needs a stable wire without making the browser wait inline for AI work. The WebSocket wire stays open. SQS carries the slower work. Postgres remembers who owns the wire.


Think of a coat-check counter

When you enter a theater, the counter checks who you are, takes your coat, and gives you a numbered claim ticket. Later, you do not re-prove your whole identity every time you ask for the coat. The ticket links you to the stored item.

agents.ws_connection is that claim ticket shelf. $connect checks the Clerk token and stores the ticket. $default uses the ticket to know which firm and user sent the message. $disconnect removes the ticket when the person leaves.

Ask yourself

Why does $connect do the hard identity work instead of letting every chat message prove itself again?

Answer: because $connect is the trust boundary. Once the connection is accepted, later WebSocket events arrive with a connection_id, not a normal browser request with stable headers. The system needs a small database record that says, "this wire belongs to this firm and this user."

2 The plan

The spec locked Pattern C from ADR-070: $connect resolves auth and stores connection context, $default enqueues an agent run to SQS, and $disconnect cleans up the connection. The agent processing itself belongs in pramaan-agents, not inside the WebSocket handler.

There was also an explicit perimeter exception to the usual "functions is internal backend" rule from ADR-022. These WebSocket routes are public browser edge routes, so they live in pramaan-functions, but they still keep the runtime discipline: small handlers, bounded work, no long AI call inline.

The planned table was agents.ws_connection. It would store connection_id, firm_id, user_id, timestamps, and an optional Clerk session id. RLS would protect firm isolation. The default handler would send SQS payloads shaped for the agents Lambda.

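[Diagram: the browser sends a Clerk JWT to router_ws.py (STORY-753: raw API Gateway handlers for $connect / $default / $disconnect). The handler verifies the token against Clerk JWKS and gets claims back, writes agents.ws_connection (STORY-753: connection_id -> firm/user, under RLS firm scope), and sends work to the SQS agent run queue. Steps are numbered 1 to 6; a dashed arrow marks later tokens returning over WebSocket.]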

The green boxes are Story-753. The browser connects with a token. The handler verifies it, stores the connection row under firm RLS, and later turns client messages into SQS jobs. The dashed return arrow is future streaming work owned by the agent runtime, not by this Story.

What the spec said would ship

The Story asked for all three handlers, the agents.ws_connection migration, server-side heartbeat support, query-string Clerk JWT validation, SQS enqueue, structured logging, metrics, and tests for auth, enqueue, ping, and disconnect cleanup.

Out of scope

The SQS consumer, the real AI agent, and the client-side WebSocket UI were separate Stories. This Story made the edge bridge and the claim-ticket shelf. It did not make the whole chat product.

PART TWO - HOW WE ACTUALLY DID IT

3 What got built

PR #187 added 984 lines across 6 files. The main file was src/app_api/router_ws.py. The rest of the change added the database migration, Clerk direct verification support, queue settings, shared agents table metadata, and unit tests.

Reading order

  1. alembic/versions/99b7d4c6a2f1_0045_agents_ws_connections.py - start with the table, RLS policy, and abandoned run status.
  2. src/app_api/router_ws.py - read the three route handlers in request order: connect, default, disconnect.
  3. src/shared/auth/clerk.py - see how direct Clerk JWKS verification supports WebSocket connect.
  4. src/shared/settings.py - inspect the Clerk and SQS settings added for this edge path.
  5. src/shared/agents_db_models.py - confirm shared agents metadata includes ws_connection and agent_run.connection_id.
  6. tests/unit/test_router_ws.py - read the behavior contract the team protected.

alembic/versions/99b7d4c6a2f1_0045_agents_ws_connections.py - Database contract

This migration creates agents.ws_connection. The primary key is connection_id, because API Gateway gives every WebSocket connection that id and sends it back on later events.

The row stores firm_id, user_id, connected_at, last_activity_at, and optional clerk_session_id. It also enables and forces RLS with a firm isolation policy, then grants the app role the needed read/write permissions.
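To make the row shape concrete, here is a minimal in-memory sketch of one agents.ws_connection record. The class name WsConnection, the string id types, and the touch() helper are assumptions for illustration; the real table comes from the migration and is protected by RLS, which a dataclass cannot show.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class WsConnection:
    """Illustrative in-memory mirror of one agents.ws_connection row."""

    connection_id: str  # primary key: the id API Gateway assigns to the wire
    firm_id: str        # firm scope used by the isolation policy
    user_id: str
    connected_at: datetime = field(default_factory=_utc_now)
    last_activity_at: datetime = field(default_factory=_utc_now)
    clerk_session_id: Optional[str] = None  # optional, as in the migration

    def touch(self) -> None:
        """Record activity, the way a ping refreshes last_activity_at."""
        self.last_activity_at = _utc_now()
```

Keying on connection_id keeps later lookups trivial: every $default and $disconnect event already carries that id.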

The same migration adds nullable connection_id to agents.agent_run and allows abandoned as a status. That is how disconnect can say, "this browser wire is gone; queued or running work tied to it should no longer be treated as active chat."

Ask yourself

Why does the connection table live in the agents schema instead of identity?

Answer: because it is runtime state for agent chat. Identity owns durable users and memberships. This table owns a temporary live wire between a browser and the agent pipeline.

src/app_api/router_ws.py - WebSocket handlers

This is the center of the Story. It uses raw API Gateway event dictionaries because WebSocket $connect, $disconnect, and $default are not normal FastAPI route calls under Mangum.
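Raw dispatch on an API Gateway WebSocket event can be sketched as below. The route key arrives in requestContext.routeKey; the handler names, response bodies, and the 400 for an unknown route are placeholders, not the names or behavior in router_ws.py.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]
Response = Dict[str, Any]


def handle_connect(event: Event) -> Response:
    return {"statusCode": 200, "body": "connected"}


def handle_default(event: Event) -> Response:
    return {"statusCode": 200, "body": "queued"}


def handle_disconnect(event: Event) -> Response:
    return {"statusCode": 200, "body": "disconnected"}


# One entry per WebSocket route key configured on the API.
ROUTES: Dict[str, Callable[[Event], Response]] = {
    "$connect": handle_connect,
    "$default": handle_default,
    "$disconnect": handle_disconnect,
}


def lambda_handler(event: Event, context: Any = None) -> Response:
    """Dispatch on the route key API Gateway puts in requestContext."""
    route_key = event.get("requestContext", {}).get("routeKey")
    handler = ROUTES.get(route_key)
    if handler is None:
        return {"statusCode": 400, "body": "unknown route"}
    return handler(event)
```

A plain dictionary lookup is enough here because the set of route keys is fixed and small.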

handler_ws_connect reads the connection_id, verifies Clerk claims, resolves the PRAMAAN user, finds an active firm membership, sets DB session scope, and upserts the connection row. If the token is missing or the user has no active firm, the handler rejects the connection.
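The connect steps above can be sketched with injected dependencies so the order of checks is visible. Everything here is illustrative: the function and parameter names are hypothetical, the 401 and 403 status codes are assumptions (the text only says the connection is rejected), a dictionary stands in for the table, and the DB session scope step is left out.

```python
from typing import Any, Callable, Dict, Optional

Event = Dict[str, Any]


def connect(
    event: Event,
    verify_token: Callable[[str], Dict[str, Any]],
    find_user_id: Callable[[Dict[str, Any]], Optional[str]],
    find_active_firm_id: Callable[[str], Optional[str]],
    store: Dict[str, Dict[str, str]],
) -> Dict[str, Any]:
    """Walk the connect steps in order, rejecting at the first failure."""
    connection_id = event["requestContext"]["connectionId"]
    token = (event.get("queryStringParameters") or {}).get("token")
    if not token:
        return {"statusCode": 401}  # assumed status for a missing token

    try:
        claims = verify_token(token)
    except ValueError:
        return {"statusCode": 401}  # assumed status for a bad token

    user_id = find_user_id(claims)
    firm_id = find_active_firm_id(user_id) if user_id else None
    if not user_id or not firm_id:
        return {"statusCode": 403}  # assumed status for no active firm

    # Upsert: a repeated connection_id overwrites the earlier row.
    store[connection_id] = {"firm_id": firm_id, "user_id": user_id}
    return {"statusCode": 200}
```

Nothing is written until every check has passed, so a rejected connection leaves no row behind.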

handler_ws_default validates the client message. A ping updates last_activity_at and posts pong. A real chat message must include matter_id, chat_session_id, and message, then it is sent to SQS with the stored firm/user context.
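A rough sketch of that validation and of how the queue payload might be assembled follows. The required keys come from the text; the {"type": "ping"} wire shape, the function names, and the exact payload keys are assumptions, since the real contract lives in router_ws.py and its tests.

```python
import json
from typing import Any, Dict, Tuple

REQUIRED_CHAT_KEYS = ("matter_id", "chat_session_id", "message")


def classify_message(raw_body: str) -> Tuple[str, Dict[str, Any]]:
    """Return ("ping", {}), ("chat", fields) or ("invalid", {"error": ...})."""
    try:
        body = json.loads(raw_body)
    except (TypeError, ValueError):
        return "invalid", {"error": "body is not valid JSON"}
    if not isinstance(body, dict):
        return "invalid", {"error": "body must be a JSON object"}

    if body.get("type") == "ping":  # assumed wire shape for a ping
        return "ping", {}

    missing = [key for key in REQUIRED_CHAT_KEYS if not body.get(key)]
    if missing:
        return "invalid", {"error": "missing: " + ", ".join(missing)}
    # Copy only the known keys; anything else in the body is dropped.
    return "chat", {key: body[key] for key in REQUIRED_CHAT_KEYS}


def build_queue_payload(fields: Dict[str, Any], context: Dict[str, str],
                        connection_id: str) -> Dict[str, Any]:
    """Combine client-supplied fields with server-side identity."""
    return {
        **fields,
        "connection_id": connection_id,
        "firm_id": context["firm_id"],  # from the stored row, never the body
        "user_id": context["user_id"],
    }
```

Copying only the three known keys means a firm_id or user_id smuggled into the message body never reaches the queue.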

handler_ws_disconnect loads the stored context, marks queued/running agent runs for that connection as abandoned, and deletes the connection row. It does not try to kill a Lambda that may already be running elsewhere.
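The disconnect bookkeeping can be pictured with in-memory stand-ins for the two tables. This is a sketch only: the real handler runs SQL under RLS, and the function name and return value here are hypothetical.

```python
from typing import Dict, List

ACTIVE_STATUSES = {"queued", "running"}


def disconnect(connection_id: str,
               connections: Dict[str, Dict[str, str]],
               runs: List[Dict[str, str]]) -> int:
    """Abandon active runs tied to the wire, then drop its row.

    Returns the number of runs marked abandoned.
    """
    abandoned = 0
    for run in runs:
        tied = run.get("connection_id") == connection_id
        if tied and run["status"] in ACTIVE_STATUSES:
            run["status"] = "abandoned"  # bookkeeping only; no Lambda is stopped
            abandoned += 1
    # pop with a default keeps a repeated $disconnect harmless
    connections.pop(connection_id, None)
    return abandoned
```

Runs that already finished keep their status, and runs on other connections are untouched.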

Gotcha

$disconnect cleanup is best-effort lifecycle bookkeeping, not a remote kill switch. Pattern C deliberately decouples browser wires from background agent execution.

src/shared/auth/clerk.py - Connect auth

Normal HTTP requests can lean on API Gateway's authorizer context. WebSocket connect is trickier: browsers do not reliably send standard auth headers during WebSocket upgrade, so the accepted V1 pattern passes the Clerk JWT as ?token=....

This file adds direct Clerk JWT verification through JWKS and a helper that converts verified claims into the existing CurrentUser shape. That keeps the rest of the code from inventing a second identity model just for WebSockets.
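A small sketch of the two ends of that helper: pulling the token out of the $connect event and mapping already-verified claims onto a user object. The CurrentUser fields shown are a hypothetical stand-in for the repo's real shape, and reading the user from sub and the session from sid is an assumption about the claims. Signature verification against JWKS is deliberately not shown.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class CurrentUser:
    """Hypothetical stand-in for the repo's existing CurrentUser shape."""

    clerk_user_id: str
    clerk_session_id: Optional[str] = None


def extract_token(event: Dict[str, Any]) -> Optional[str]:
    """Read ?token=... from a $connect event; None when absent or empty."""
    params = event.get("queryStringParameters") or {}
    return params.get("token") or None


def current_user_from_claims(claims: Dict[str, Any]) -> CurrentUser:
    """Map already-verified claims onto the shared user shape."""
    subject = claims.get("sub")
    if not subject:
        raise ValueError("verified claims carry no subject")
    return CurrentUser(clerk_user_id=subject,
                       clerk_session_id=claims.get("sid"))
```

Keeping the conversion separate from verification lets the HTTP and WebSocket paths share one user shape while verifying tokens differently.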


Think of checking an ID at the door

The doorman does not keep asking for your passport after you are inside. He checks it at entry, writes your name on the guest list, and later staff use the guest list. The query token is the ID check; ws_connection is the guest list row.

src/shared/settings.py - Runtime knobs

The settings change adds Clerk JWT issuer, audience, leeway, a derived JWKS URL, and AGENTS_RUNS_QUEUE_URL. There is also a local fallback queue URL shaped like pramaan-agents-runs-{env}.

This matters because the handler should not read environment variables directly. The repo rule is that configuration flows through Settings and get_settings(). The WebSocket path follows that rule even though it is not a FastAPI route.
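The "configuration flows through Settings" rule might look roughly like this. It is a standard-library stand-in, not the repo's Settings class: the field names, every environment variable name other than AGENTS_RUNS_QUEUE_URL, the APP_ENV switch, and the /.well-known/jwks.json path are assumptions.

```python
import os
from dataclasses import dataclass
from functools import lru_cache
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    clerk_jwt_issuer: str
    clerk_jwt_audience: str
    clerk_jwt_leeway_seconds: int
    agents_runs_queue_url: str

    @property
    def clerk_jwks_url(self) -> str:
        """Derived, not configured: assumed to live under the issuer."""
        return self.clerk_jwt_issuer.rstrip("/") + "/.well-known/jwks.json"


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from a mapping so tests can pass a plain dict."""
    env_name = env.get("APP_ENV", "local")
    return Settings(
        clerk_jwt_issuer=env.get("CLERK_JWT_ISSUER", ""),
        clerk_jwt_audience=env.get("CLERK_JWT_AUDIENCE", ""),
        clerk_jwt_leeway_seconds=int(env.get("CLERK_JWT_LEEWAY_SECONDS", "0")),
        # Fallback follows the pramaan-agents-runs-{env} shape from the text.
        agents_runs_queue_url=env.get("AGENTS_RUNS_QUEUE_URL")
        or f"pramaan-agents-runs-{env_name}",
    )


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Single cached entry point; handlers call this, never os.environ."""
    return load_settings(dict(os.environ))
```

A handler would then read get_settings().agents_runs_queue_url rather than touching the environment itself, which also makes the value easy to override in tests.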

src/shared/agents_db_models.py - Shared metadata

This file is SQLAlchemy Core metadata for the agents schema. Story-753 adds ws_connection and mirrors agent_run.connection_id.

The important mental model is that pramaan-functions owns the migration, while other agent-adjacent code may need to understand the table shape. Shared metadata is the card catalog entry; the migration is the carpenter who actually builds the shelf.

tests/unit/test_router_ws.py - Behavior checks

The tests cover the high-risk edges: connect inserts the connection row, missing token rejects, default enqueues the exact payload shape, ping posts pong, disconnect abandons runs and deletes the connection, and SQS failures return 502.

These tests use fakes for sessions, queue clients, and API Gateway management calls. That is a good fit here. The goal is to prove handler behavior without needing a live WebSocket API, Clerk tenant, SQS queue, and database for every unit test run.
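As an illustration of that style, here is a hypothetical fake queue client and two checks in the same spirit. FakeQueueClient and enqueue_run are invented for this sketch and are not the names in test_router_ws.py; the send_message keyword arguments mirror the boto3 SQS client.

```python
import json
from typing import Any, Dict, List


class FakeQueueClient:
    """Records send_message calls; can be told to fail."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.sent: List[Dict[str, Any]] = []

    def send_message(self, QueueUrl: str, MessageBody: str) -> Dict[str, str]:
        if self.fail:
            raise RuntimeError("simulated SQS outage")
        self.sent.append({"QueueUrl": QueueUrl, "MessageBody": MessageBody})
        return {"MessageId": "fake-1"}


def enqueue_run(queue: Any, queue_url: str,
                payload: Dict[str, Any]) -> Dict[str, int]:
    """Send one run to the queue; map any queue failure to 502."""
    try:
        queue.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))
    except Exception:
        return {"statusCode": 502}
    return {"statusCode": 200}


def test_enqueue_sends_exact_payload() -> None:
    queue = FakeQueueClient()
    payload = {"firm_id": "f1", "user_id": "u1", "message": "hi"}
    assert enqueue_run(queue, "q-url", payload) == {"statusCode": 200}
    # Compare the decoded body, not just the fact that a call happened.
    assert json.loads(queue.sent[0]["MessageBody"]) == payload


def test_queue_failure_returns_502() -> None:
    failing = FakeQueueClient(fail=True)
    assert enqueue_run(failing, "q-url", {}) == {"statusCode": 502}
```

The fake records what it was given, so a test can assert on the message body itself instead of on a call count.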

Ask yourself

Why assert the SQS payload shape instead of just checking that send_message was called?

Answer: because the payload is the contract with pramaan-agents. A call with the wrong keys is like mailing a package with no address: the truck moved, but the delivery still fails.

4 Deviations from the plan

The core plan held: three handlers shipped, the connection table shipped, ping/pong shipped, SQS enqueue shipped, and disconnect cleanup shipped.

The biggest practical deviation is that the implementation uses raw Lambda handler dispatch in router_ws.py, not FastAPI router declarations. That is appropriate for API Gateway WebSocket route keys, but it means this path does not look like ordinary app HTTP routes.

The spec also described server heartbeats every 9 minutes. The PR added post_heartbeat(), the helper that sends {type: "heartbeat"}, but scheduling that every 9 minutes is not done inside this handler. In Lambda, a sleeping loop would be the wrong tool. The helper is the send primitive; the orchestration remains outside this route file.
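A send primitive of that kind could look like the sketch below. The signature and the boolean return are assumptions, not the PR's post_heartbeat(); the client is injected and only needs a post_to_connection method, which is what the boto3 "apigatewaymanagementapi" client provides.

```python
import json
from typing import Any


def post_heartbeat(management_client: Any, connection_id: str) -> bool:
    """Send one heartbeat frame; return False if the send fails."""
    frame = json.dumps({"type": "heartbeat"}).encode("utf-8")
    try:
        management_client.post_to_connection(
            ConnectionId=connection_id, Data=frame
        )
    except Exception:
        # For example, the wire may already be gone.
        return False
    return True
```

Because the function sends exactly one frame and returns, whatever schedules it can call it once per live connection without holding a Lambda open.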

5 Errors hit and how we fixed them

The PR body did not record a long debugging thread, but it did record verification limits honestly.

Checks scoped to the new surface passed: a targeted ruff run, tests/unit/test_router_ws.py, git diff --check, and make openapi. Broader local checks hit pre-existing blockers outside the patch: migration/script style issues, local DB role setup problems such as a missing pramaan_app role, SQLModel typing noise, and missing botocore stubs.

The fix was not to pretend the whole repo was green. The implementer narrowed verification to the WebSocket files and named the wider blockers in the PR. That is the right move when the local environment has known debt: protect the changed surface, then report the remaining risk clearly.

6 Gotchas and surprises

Gotcha

The token is in the WebSocket query string for V1. That is not a general preference for putting secrets in URLs; it is a WebSocket upgrade constraint. Logs must never print the raw token.
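One way to honor that rule is to redact the event before it ever reaches a logger. This is a sketch under assumptions: the helper name and the "[REDACTED]" marker are hypothetical, and it covers both query parameter fields that API Gateway may include on a $connect event.

```python
from typing import Any, Dict

SENSITIVE_QUERY_KEYS = {"token"}
QUERY_FIELDS = ("queryStringParameters", "multiValueQueryStringParameters")


def loggable_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of the event with token values masked."""
    safe = dict(event)
    for field_name in QUERY_FIELDS:
        params = event.get(field_name)
        if not params:
            continue  # absent or null: nothing to mask
        safe[field_name] = {
            key: "[REDACTED]" if key in SENSITIVE_QUERY_KEYS else value
            for key, value in params.items()
        }
    return safe
```

The original event is left untouched, so the handler can still read the token after logging the redacted copy.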

Gotcha

$default does not trust the message body for firm_id or user_id. It recovers those from agents.ws_connection. The browser can send the words; the server supplies the identity.

Gotcha

ping is not just a cute keepalive. It updates last_activity_at, which gives operators and cleanup jobs a real signal about whether the wire is alive.
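As an illustration of how a cleanup job might use that signal, the sketch below picks out connections that have been idle too long. The function name and the idle threshold are hypothetical; no such job is part of this Story.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_connection_ids(last_activity: Dict[str, datetime],
                         now: datetime,
                         max_idle: timedelta) -> List[str]:
    """Return ids whose last_activity_at is older than max_idle."""
    return sorted(
        connection_id
        for connection_id, seen_at in last_activity.items()
        if now - seen_at > max_idle
    )
```

Passing now in as an argument keeps the function deterministic, which makes the threshold easy to test.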

Gotcha

Raw SQL appears in this file because it is a small Lambda edge path doing RLS session setup and simple lifecycle updates. Do not copy that as permission to bypass domain/service patterns in normal FastAPI routes.

7 What's still open

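The bridge is in place, but several neighboring pieces live in other Stories: the SQS consumer, the real AI agent, and the client-side WebSocket UI. Scheduling the 9-minute server heartbeat and streaming tokens back over the WebSocket also remain outside this Story.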

8 Check yourself

Can you answer these?

  1. Why does $connect store firm_id and user_id instead of making $default trust the browser?
  2. Why is $default enqueue-only instead of running the agent inline?
  3. What does abandoned mean after disconnect, and what does it not mean?
  4. Why does the connection table need RLS if the handler already resolves the firm?
  5. Why is a query-string JWT acceptable here but still something logs must treat carefully?
  6. Which file would you read first if ping/pong stopped working?
  7. Which payload keys form the contract between pramaan-functions and pramaan-agents?

If you cannot answer those after reading the code, ping Ankit before changing this path. WebSocket bugs look small, but they sit right on the browser trust boundary.