Story-752 • functions lane • Size M • Walkthrough

Agents streaming infrastructure

The staging AWS shell that lets PRAMAAN accept long-running agent work without making users wait on a single HTTP request.

Issue

#752

Pull Request

pramaan-functions#188

Built by

Conductor agent on 2026-05-21

The 30-second version

Story-752 did not build the agent brain. It built the staging utility room around it: an SQS queue and DLQ, an internal agents Lambda shell, a WebSocket API shell, IAM permissions, alerts, smoke tests, and docs. The important idea is simple: agent runs can be slow, so we put a durable queue and streaming path between the user and the work.

1 Why does this exist?

When a lawyer asks an AI question inside PRAMAAN, the answer may take longer than a normal web request can safely wait. The browser needs quick confirmation that the message was received, and then it needs updates as the answer is produced. That is why ADR-070 picked Pattern C: WebSocket for live updates, SQS for durable work, and Lambda for bounded execution.

Story-752 was the infrastructure Story for that pattern. Its job was to create the staging AWS resources before the real consumer and chat logic arrived. Think of it as installing the pipes, breakers, labels, and warning lights before plugging in the kitchen appliances.

🍕

A pizza shop before opening night

Before a pizza shop takes real orders, someone has to set up the phone line, order rack, oven space, delivery counter, and smoke alarm. That setup does not make pizza yet, but without it the first busy night becomes chaos. Here, the browser is the customer, the WebSocket is the phone line, SQS is the order rack, the agents Lambda is the oven station, and CloudWatch alarms are the smoke alarm.

Ask yourself

If the agent takes 45 seconds, where should the user's request wait: inside an HTTP call, or in a queue built for waiting?

Answer: in the queue. HTTP calls are like standing at a counter while everyone behind you waits. SQS is the order rack: the request is durable, retryable, and visible to operators if it gets stuck.
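To make the "quick confirmation, slow work" idea concrete, here is a minimal sketch, not the PR's code. The function name, message fields, and queue URL are hypothetical. The only real API it leans on is the boto3 SQS client's send_message(QueueUrl=..., MessageBody=...), and a fake client is passed in so the sketch runs without AWS.

```python
import json
import uuid


def accept_agent_run(sqs_client, queue_url, question):
    """Enqueue the work and return at once; the slow part happens later."""
    run_id = str(uuid.uuid4())
    body = json.dumps({"run_id": run_id, "question": question})
    # boto3 SQS clients expose send_message(QueueUrl=..., MessageBody=...).
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
    # The caller gets a quick acknowledgement instead of waiting for the answer.
    return {"status": "accepted", "run_id": run_id}


class FakeSqs:
    """Stand-in for a boto3 SQS client so the sketch runs without AWS."""

    def __init__(self):
        self.sent = []

    def send_message(self, QueueUrl, MessageBody):
        self.sent.append((QueueUrl, MessageBody))
        return {"MessageId": "fake"}


fake = FakeSqs()
# Hypothetical queue URL and question, for illustration only.
ack = accept_agent_run(fake, "https://example.invalid/runs-queue", "Summarize clause 4")
print(ack["status"], len(fake.sent))
```

The request handler's only job is to put the order on the rack and hand back a ticket number; the run id is what later updates would be matched against.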

2 The plan

The locked decisions were narrow and practical. ADR-067 said this repo should own its AWS provisioning path. Story-752's blocker section resolved that the path would be a manual GitHub Actions workflow using AWS CLI commands, not Terraform, CDK, or SAM. ADR-070 said the agents runtime must not become a public endpoint; work enters through SQS, and updates leave through WebSocket management calls.
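"Updates leave through WebSocket management calls" can be pictured with a small sketch. This is an assumption about how a later consumer might push updates, not something Story-752 shipped. The function name, payload fields, and connection id are hypothetical; the call shape mirrors boto3's apigatewaymanagementapi client, which exposes post_to_connection(ConnectionId=..., Data=...).

```python
import json


def push_update(management_client, connection_id, run_id, text):
    """Send one streaming update to a connected browser."""
    payload = json.dumps({"run_id": run_id, "delta": text}).encode("utf-8")
    # Mirrors boto3's apigatewaymanagementapi post_to_connection call.
    management_client.post_to_connection(ConnectionId=connection_id, Data=payload)
    return len(payload)


class FakeManagementApi:
    """Stand-in for the management API client so the sketch runs without AWS."""

    def __init__(self):
        self.posts = []

    def post_to_connection(self, ConnectionId, Data):
        self.posts.append((ConnectionId, Data))


api = FakeManagementApi()
# Hypothetical connection id, run id, and answer chunks.
for chunk in ["Clause 4 ", "limits liability."]:
    push_update(api, "conn-123", "run-1", chunk)
print(len(api.posts))
```

Nothing here listens on a port: the runtime only makes outbound calls, which is what keeps it from becoming a public endpoint.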

The spec asked for a staging SQS queue and DLQ, a new agents execution role, a WebSocket API, alarms, a disabled SQS-to-Lambda event source mapping, and enough smoke testing to prove the shell was real. The real WebSocket handlers, real SQS consumer, and production rollout were intentionally left for later stories.

In the architecture diagram, the green boxes are Story-752-owned infrastructure. The dashed arrow from SQS into the agents runtime matters: the SQS event source mapping exists, but it is disabled until the consumer story turns the conveyor belt on.

Ask yourself

Why create the event source mapping now if it stays disabled?

Answer: because provisioning should prove the shape of the system without accidentally processing messages before the consumer is ready. It is like installing a conveyor belt with the power switch off.
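The workflow itself uses AWS CLI commands, so the following is only an illustrative Python equivalent of the same request. It builds the keyword arguments that boto3's Lambda create_event_source_mapping call accepts. The batch size of 1 and maximum concurrency of 10 are the values the unit tests lock; the queue ARN, region, account id, and function name are hypothetical placeholders.

```python
def disabled_mapping_request(queue_arn, function_name):
    """Build keyword arguments for Lambda's create_event_source_mapping call."""
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "Enabled": False,  # the conveyor belt is installed with the power off
        "BatchSize": 1,  # one agent run per invocation
        "ScalingConfig": {"MaximumConcurrency": 10},
    }


request = disabled_mapping_request(
    "arn:aws:sqs:us-east-1:123456789012:example-runs-queue",  # hypothetical ARN
    "example-agents-function",  # hypothetical function name
)
print(request["Enabled"])
```

Turning the belt on later is then a one-field change to the existing mapping rather than a new resource.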

3 What got built

PR #188 added 1,007 lines across 3 files: one provisioning workflow, one operator document, and one focused unit test file. Read it as infrastructure first, documentation second, tests third.

Read in this order

.github/workflows/provision-agents-infra.yml — the actual staging provisioning contract.
docs/agents-infra.md — the operator map: what exists, how to smoke test, where alerts go.
tests/unit/test_agents_infra_workflow.py — the small set of invariants the team wanted locked.
ADR-070 — the reason WebSocket plus SQS is the chosen shape.
ADR-067 — the reason this repo owns the functions-side provisioning path.

.github/workflows/provision-agents-infra.yml: Provisioning workflow

This is the main artifact. It is a manual workflow_dispatch workflow limited to staging. That is important: Story-752 was not a deploy-on-every-push setup. It was a one-shot, operator-run provisioning path for the staging account.

The workflow creates or reconciles the SQS queue, DLQ, S3 offload bucket, DynamoDB WebSocket connection table, agents Lambda, empty dependency layer, bootstrap WebSocket handler Lambda, WebSocket API, SNS topic, AWS Chatbot Slack route, CloudWatch alarms, and the disabled SQS event source mapping. It also updates the existing functions role so app-side code can send messages and manage WebSocket connections later.

The agents runtime gets its own role, pramaan-agents-execution-role. That role can consume the runs queue, call Bedrock, read and write objects in the agents bucket, connect to RDS with IAM auth, write logs and traces, and manage WebSocket connections. The mental model: the agent gets a separate key ring, not the whole building master key.

Ask yourself

What stops the agents Lambda from becoming a public API by accident?

Answer: the API Gateway WebSocket API invokes the separate bootstrap WebSocket handler, not the agents runtime. The agents runtime is reached only by SQS through an event source mapping, and that mapping starts disabled. Public traffic does not get a direct door into the agents Lambda.

Gotcha

The workflow grants execute-api:ManageConnections using a wildcard API and stage pattern. That is broader than a single WebSocket API ARN, so reviewers should remember it is an operational convenience that may deserve tightening once API IDs are stable.
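To see what tightening would look like, here is a sketch that builds both the broad and the narrow resource ARN for execute-api:ManageConnections. The ARN layout follows the form AWS documents for the @connections resource. The exact wildcard string used in the workflow is not quoted in this walkthrough, so the broad form below is an assumption, and the region, account id, API id, and stage are hypothetical.

```python
def manage_connections_arn(region, account_id, api_id="*", stage="*"):
    """Resource ARN for posting to WebSocket connections.

    Leaving api_id and stage as "*" reproduces a broad grant; passing
    real values narrows the grant to a single API and stage.
    """
    return (
        f"arn:aws:execute-api:{region}:{account_id}:"
        f"{api_id}/{stage}/POST/@connections/*"
    )


# Hypothetical region, account id, API id, and stage.
broad = manage_connections_arn("us-east-1", "123456789012")
narrow = manage_connections_arn("us-east-1", "123456789012", "abc123", "staging")
print(broad)
print(narrow)
```

The only difference between the two is whether the API id and stage are pinned, which is why the tightening can wait until those ids are stable.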

docs/agents-infra.md: Operator map

This document explains how to run the workflow and what it creates. It lists the required dispatch inputs: RDS DB resource id, RDS DB user, Slack workspace id, and Slack channel id. Those are the values an operator needs in hand before pressing the button.

It also records the resource names and the smoke tests. The workflow sends one test message to SQS and opens a wscat connection to the default execute-api WebSocket URL. That proves the queue and socket shell exist, even though the real consumer and real streaming behavior arrive later.
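For orientation, the default execute-api WebSocket URL follows a fixed pattern built from the API id, region, and stage. The sketch below assembles it and prints a wscat command an operator could run by hand. The API id, region, and stage are hypothetical; the real values come from the workflow run.

```python
def default_websocket_url(api_id, region, stage):
    """Default API Gateway WebSocket endpoint for a deployed stage."""
    return f"wss://{api_id}.execute-api.{region}.amazonaws.com/{stage}"


# Hypothetical API id, region, and stage.
url = default_websocket_url("abc123", "us-east-1", "staging")
# wscat's -c flag opens a client connection to the given URL.
print(f"wscat -c {url}")
```

A successful connection shows only that the socket shell answers; it says nothing about streaming behavior, which does not exist yet.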

The docs are careful about the internal-only boundary: the agents execution role is a Lambda execution role, not something SQS assumes directly. SQS wakes the Lambda through an event source mapping; it does not borrow the Lambda's keys.
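The "who holds the keys" point shows up in the role's trust policy. The walkthrough does not quote the workflow's actual trust document, so this is the standard shape for any Lambda execution role rather than a copy of the PR: the Lambda service is the principal allowed to assume the role, and SQS does not appear at all.

```python
import json


def lambda_trust_policy():
    """Standard trust policy that lets the Lambda service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Lambda is the principal; SQS is deliberately absent.
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


policy = lambda_trust_policy()
print(json.dumps(policy, indent=2))
```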

📋

The operator clipboard

The workflow is the electrician doing the wiring. This document is the clipboard taped to the wall: breaker names, what each switch controls, and how to check whether the outlet is live.

tests/unit/test_agents_infra_workflow.py: Workflow guardrails

The tests are static checks over the workflow text. They do not call AWS. Instead, they lock the pieces that would be easy to accidentally remove during cleanup: manual dispatch, staging-only input, required resource names, disabled event source mapping, batch size 1, max concurrency 10, and AWS Chatbot alert routing.

That is the right level for this Story. The risky thing here was not business logic returning the wrong JSON. The risky thing was an infrastructure workflow drifting away from the promised shape. These tests work like a checklist inspector: they cannot tell you the building is beautiful, but they can catch a missing exit sign.
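A checklist-inspector test of this kind can be very small. The sketch below is not the PR's test file: the sample workflow text, the snippet list, and the helper name are all hypothetical, and the real tests read the committed YAML instead of an inline string. It only shows the technique of asserting that required strings are still present.

```python
# Hypothetical, heavily shortened stand-in for the real workflow file.
SAMPLE_WORKFLOW = """
on:
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        options:
          - staging
jobs:
  provision:
    steps:
      - run: |
          aws lambda create-event-source-mapping \\
            --no-enabled \\
            --batch-size 1 \\
            --scaling-config MaximumConcurrency=10
"""

# Invariants that would be easy to remove by accident during cleanup.
REQUIRED_SNIPPETS = [
    "workflow_dispatch:",
    "- staging",
    "--no-enabled",
    "--batch-size 1",
    "MaximumConcurrency=10",
]


def missing_snippets(workflow_text, required=REQUIRED_SNIPPETS):
    """Return every required snippet that no longer appears in the text."""
    return [snippet for snippet in required if snippet not in workflow_text]


def test_workflow_keeps_promised_shape():
    assert missing_snippets(SAMPLE_WORKFLOW) == []


# Simulate drift: someone flips the mapping to enabled.
drifted = SAMPLE_WORKFLOW.replace("--no-enabled", "--enabled")
print(missing_snippets(drifted))
```

The check is blunt by design: it knows nothing about AWS, only that a promised string is gone.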

Ask yourself

Why test strings in a YAML file instead of spinning up AWS in a unit test?

Answer: local tests need to be fast, cheap, and deterministic. AWS smoke tests belong inside the manual workflow because they need real account credentials and real regional resources.

4 Deviations from the plan

The biggest deviation is that the workflow provisions more than the issue's original scope table suggests. Beyond the queue, DLQ, role, WebSocket API, and alarms, it also creates the agents Lambda shell, an S3 offload bucket, a DynamoDB WebSocket connection table, an empty agents dependency layer, AWS Chatbot routing, and a bootstrap WebSocket handler Lambda. This matches the resolved blocker section, which expanded the provisioning checklist after the original scope table was written.

The SQS visibility timeout shipped as 1320 seconds, or 22 minutes, not the 16 minutes in the earlier issue text. That aligns with the later doctrine notes in the issue body and sits above Lambda's 15-minute maximum timeout, so a message should not become visible again while an invocation is still working on it.
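The arithmetic is quick to check. The 1320-second figure comes from the Story; 900 seconds is Lambda's hard ceiling on function timeout. The constant names are just labels for this sketch.

```python
VISIBILITY_TIMEOUT_SECONDS = 1320  # value shipped by Story-752
LAMBDA_MAX_TIMEOUT_SECONDS = 900  # Lambda's ceiling: 15 minutes

visibility_minutes = VISIBILITY_TIMEOUT_SECONDS / 60
margin_seconds = VISIBILITY_TIMEOUT_SECONDS - LAMBDA_MAX_TIMEOUT_SECONDS
print(visibility_minutes, margin_seconds)  # 22.0 420
```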

The DLQ poison-message smoke test from the acceptance criteria did not land. Because the event source mapping is intentionally disabled, a poison message would not be consumed and retried into the DLQ yet. The workflow proves the queue can receive and the WebSocket shell can connect; it does not prove failure movement through a live consumer.
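A toy model shows why the poison-message test cannot pass yet. This is a plain-Python simulation, not SQS itself, and the max_receive_count of 3 is a hypothetical value; the walkthrough does not state what the real redrive policy uses.

```python
def deliver(message, consumer, mapping_enabled, max_receive_count=3):
    """Return where the message ends up: 'done', 'dlq', or 'queue'."""
    if not mapping_enabled:
        # Nothing receives the message, so nothing can fail it.
        return "queue"
    for _ in range(max_receive_count):
        try:
            consumer(message)
            return "done"
        except Exception:
            # The message becomes visible again and is retried.
            continue
    # Every allowed receive failed, so the message is redriven.
    return "dlq"


def poison_consumer(message):
    raise ValueError("cannot process this message")


print(deliver("bad", poison_consumer, mapping_enabled=False))
print(deliver("bad", poison_consumer, mapping_enabled=True))
```

With the mapping disabled the poison message simply waits in the main queue, which is exactly the state Story-752 leaves staging in.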

The issue asked for IAM role policy ARNs to be documented in the repo. The committed docs list stable policy names and say the workflow summary prints exact ARNs after a run. That is useful operationally, but it is not the same as committed ARN values.

5 Errors hit and how we fixed them

The PR body does not describe a long debugging trail or review-driven rework. The meaningful implementation adjustments show up in the final shape: the workflow uses AWS Chatbot instead of a raw Slack webhook, keeps the SQS event source mapping disabled, and routes WebSocket traffic through a bootstrap handler instead of exposing the agents runtime.

There was one verification constraint worth remembering: repo-wide uv run ruff check was blocked by pre-existing lint failures in unrelated Alembic migrations and scripts. The PR instead ran these targeted checks:

uv run ruff check tests/unit/test_agents_infra_workflow.py
uv run pytest tests/unit/test_agents_infra_workflow.py -q
uv run mypy tests/unit/test_agents_infra_workflow.py
git diff --check

make openapi was not run because no FastAPI routes, DTOs, or trust-boundary payloads changed. That is not a shortcut; it follows the repo rule that OpenAPI snapshots matter when API contracts move.

6 Gotchas and surprises

Gotcha

Disabled means disabled. The SQS event source mapping is created with --no-enabled. Messages can sit in the queue, but the agents Lambda will not consume them until a later story enables the mapping.

Gotcha

The WebSocket routes have no real auth yet. They use authorization-type NONE because these are bootstrap smoke routes. Real auth and tenant handling belong with the WebSocket handler implementation.

Gotcha

The bootstrap Lambda code is generated inside the workflow. That keeps the Story narrow, but it also means the stub runtime is not normal app code you can import and unit test locally.

Gotcha

Some values are staging-account specific. Subnet ids, security group id, Lambda Insights layer ARN, and OTEL layer ARN are hard-coded for the current staging setup.

Gotcha

Each run publishes a new layer version. The empty agents dependency layer proves the wiring, but the workflow does not clean old layer versions yet.
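If a follow-up ops story does add layer cleanup, the selection logic could be as small as the sketch below. It is an assumption, not planned work: the helper name and the keep count of 3 are hypothetical. In a real script the version numbers would come from Lambda's list_layer_versions call and each result would be passed to delete_layer_version.

```python
def layer_versions_to_delete(version_numbers, keep=3):
    """Pick the old layer versions to remove, keeping the newest few."""
    newest_first = sorted(version_numbers, reverse=True)
    # Everything after the first `keep` entries is older and can go.
    return sorted(newest_first[keep:])


# Six workflow runs would have published versions 1 through 6.
stale = layer_versions_to_delete([1, 2, 3, 4, 5, 6])
print(stale)  # [1, 2, 3]
```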

7 What's still open

Story-752 deliberately stopped at infrastructure. The follow-up work is where the system becomes useful:

Enable the SQS consumer after the real agents runtime is ready.
Replace bootstrap WebSocket handlers with real connect, disconnect, and default behavior.
Add real WebSocket auth and tenant isolation at the perimeter.
Add DynamoDB permissions when handlers actually read and write connection records.
Prove DLQ movement with a live failing consumer, not just queue creation.
Tighten execute-api:ManageConnections scope once stable API ARNs are known.
Decide whether layer cleanup, S3 lifecycle, encryption, and versioning need a follow-up ops story.

8 Check yourself

Can you answer these?

Why does Pattern C need both WebSocket and SQS?
Which Lambda does API Gateway invoke, and why is that not the agents runtime?
What does it mean that the SQS event source mapping exists but is disabled?
Why is a DLQ useful for agent runs?
What did the workflow prove with wscat, and what did it not prove?
Why did this PR skip make openapi?
Which permissions look intentionally broad and should be revisited later?

If you cannot answer those yet, reread the diagram and the workflow card first. Then ask Ankit for the production mental model before changing this infrastructure.