Story-754 • 🤖 agents lane • Size L • Walkthrough

SQS consumer + Strands harness

The runtime skeleton that turns a chat message into a streaming response.

Issue

#754

Pull Request

pramaan-agents#1

Built by

Conductor agent on 2026-05-21

The 30-second version

We shipped a Lambda that listens to a queue, takes a chat message, runs it through Strands (an AI orchestration library), and streams the answer back to the user's browser one word at a time. Right now it just says hello world because the actual brain (MatterChatAgent) comes in Story-756. This Story is the plumbing.

1 Why does this thing exist?

When a lawyer types a question into the matter chat, that question has to travel from their browser to an AI model and back. But AI models are slow — they take 2 to 10 seconds. If we made the browser wait for the full answer, it would feel broken. So we stream the answer back one word at a time, the way ChatGPT does.

This Story builds the middle of that pipeline. Not the browser. Not the AI model. The middle.

🍕

Think of it like a pizza place

The customer (browser) places an order on the phone. The order taker (WebSocket) writes it down. The order goes to the kitchen queue (SQS). A chef (this Lambda) picks it up, makes the pizza (calls AI), and as each slice comes out of the oven, a runner takes it to the customer's table immediately (streaming) — instead of waiting for the full pizza to be done.

Story-754 built the chef station and the runner system. Right now the chef only hands over two slices, one that says "hello" and one that says "world". The recipe (MatterChatAgent) comes in the next Story.

Ask yourself

Why couldn't we just call the AI directly from the API? Why do we need a queue in the middle?

Answer: most AI calls finish in 2 to 10 seconds, but a slow one can take 30 seconds or more. API Gateway has a 29-second timeout. If we called AI from the API, those slow requests would die. A queue lets the API return immediately (saying "we got your message") while the AI processes it in the background, and the WebSocket pushes the answer when ready. The queue is the patience buffer.

2 The shape of what we built

The Lambda we built is the green box in the middle. It's woken up by the queue, talks to the AI, saves the result to Postgres, and pushes tokens back through the WebSocket as they come in.

3 Tour of the code

We added 992 lines across 10 files. Don't try to read them in alphabetical order — that's how engineers waste a Saturday. Read them in the order the request flows.

📚 Read in this order

src/pramaan_agents/handler.py — the front door
src/pramaan_agents/runtime/payloads.py — the data shape
src/pramaan_agents/runtime/factory.py — wiring
src/pramaan_agents/runtime/app.py — the orchestrator
src/pramaan_agents/runtime/strands_harness.py — the AI call
src/pramaan_agents/runtime/websocket.py — the streaming pipe
src/pramaan_agents/db/store.py — saving to Postgres
src/pramaan_agents/runtime/tool_hooks.py — the placeholder for Story-756

handler.py • Entry point

The Lambda runtime calls lambda_handler(event, context). The event is a dict with a Records array — each record is one chat message from SQS.

What this file does: pulls each record, parses the JSON into an AgentRunRequest, hands it to the runtime, counts successes and failures, returns a summary.

Ask yourself

Why does the handler catch Exception for valid requests but let parsing errors crash through?

Because: if the JSON is malformed, SQS should retry the message and eventually send it to the dead-letter queue. If the JSON is fine but the AI call fails, that's "valid data we processed but couldn't help" — we mark it failed and move on, so SQS doesn't endlessly retry a doomed message.
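
Here is a minimal sketch of that split between "let it crash" and "catch and count". It is not the code from the PR: parse_request, the run parameter, and the summary keys are hypothetical names chosen for illustration, and the real handler takes only (event, context).

```python
import json
from typing import Any, Callable


def parse_request(body: str) -> dict[str, Any]:
    """Parse one SQS record body. Raises ValueError on malformed input."""
    payload = json.loads(body)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    return payload


def lambda_handler(
    event: dict[str, Any],
    context: Any,
    run: Callable[[dict[str, Any]], None],  # hypothetical: injected for testing
) -> dict[str, int]:
    """Process each SQS record and return a success/failure summary."""
    succeeded = failed = 0
    for record in event.get("Records", []):
        # Deliberately NOT inside the try block: a malformed body raises,
        # the invocation fails, and SQS redelivers until the DLQ takes over.
        request = parse_request(record["body"])
        try:
            run(request)
            succeeded += 1
        except Exception:
            # Valid request, failed processing: count it and move on.
            failed += 1
    return {"succeeded": succeeded, "failed": failed}
```

The only design decision worth noticing is where the try block starts. Parsing sits outside it, processing sits inside it.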

payloads.py • Data shapes

Defines the Pydantic models that describe what a chat request looks like on the wire. AgentRunRequest has firm_id, matter_id, request_id, the user's message, and metadata.

Why Pydantic and not plain dicts? Pydantic validates the data the moment it arrives. If the JSON is missing firm_id, we want to know in microsecond 1, not in line 200 of the handler.

📦

Pydantic is the receiving dock

Imagine a warehouse. Every truck arriving has a packing slip. The receiving dock checks the slip BEFORE letting the truck into the warehouse. If something's missing or wrong, the truck gets bounced. That's Pydantic. Once a payload makes it past Pydantic, every downstream function can trust it.
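
To make the receiving dock concrete, here is a sketch of the same fail-fast check. It uses a stdlib dataclass as a stand-in so it runs without dependencies. The real payloads.py uses Pydantic, which does this (and more) for you. The field name message, the str types, and the from_json helper are assumptions; the walkthrough only names the fields loosely.

```python
import json
from dataclasses import dataclass, field
from typing import Any

REQUIRED_FIELDS = ("firm_id", "matter_id", "request_id", "message")


@dataclass(frozen=True)
class AgentRunRequest:
    """Dataclass stand-in for the Pydantic model (field types are assumed)."""

    firm_id: str
    matter_id: str
    request_id: str
    message: str
    metadata: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_json(cls, raw: str) -> "AgentRunRequest":
        """Validate at the boundary so downstream code can trust the object."""
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("payload must be a JSON object")
        missing = [name for name in REQUIRED_FIELDS if name not in data]
        if missing:
            raise ValueError(f"missing required fields: {', '.join(missing)}")
        for name in REQUIRED_FIELDS:
            if not isinstance(data[name], str):
                raise ValueError(f"{name} must be a string")
        required = {name: data[name] for name in REQUIRED_FIELDS}
        return cls(**required, metadata=data.get("metadata", {}))
```

A payload without firm_id is bounced at from_json, before any other function sees it.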

strands_harness.py • The AI boundary

This is where we call Strands — the open-source library that wraps Bedrock and gives us agent semantics. The class has TWO branches:

Stub branch (default, used in tests): emits hello and world as two tokens
Real branch (only if PRAMAAN_AGENTS_STUB_AGENT=false): calls Strands' real Agent

Ask yourself

Why have a stub at all? Why not just always call the real Strands?

Three reasons: (1) tests run offline (no AWS credentials in CI); (2) determinism (the stub always returns the same tokens, real AI varies); (3) cost (stubs are free, real calls cost money).

Gotcha

The model ID comes from the BEDROCK_MODEL_ID environment variable (read via os.environ). Right now it defaults to apac.amazon.nova-pro-v1:0 because Claude Sonnet 4.5 is blocked by an AWS Marketplace card verification (3-5 business days). When that clears, the swap is ONE environment variable. Zero code change.
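
The two branches and the model ID lookup can be sketched like this. The function names are hypothetical and the real branch is left unimplemented on purpose, because this sketch does not try to reproduce the Strands API. Treating any value other than "false" as "use the stub" is also an assumption.

```python
import os
from typing import Iterator

DEFAULT_MODEL_ID = "apac.amazon.nova-pro-v1:0"


def use_stub() -> bool:
    """Stub is the default; only the value "false" switches to the real agent."""
    flag = os.environ.get("PRAMAAN_AGENTS_STUB_AGENT", "true")
    return flag.strip().lower() != "false"


def resolve_model_id() -> str:
    """Read the model ID from the environment, falling back to the default."""
    return os.environ.get("BEDROCK_MODEL_ID", DEFAULT_MODEL_ID)


def stream_tokens(prompt: str) -> Iterator[str]:
    """Yield response tokens for a prompt, from the stub or the real agent."""
    if use_stub():
        # Deterministic, offline, free: exactly what CI needs.
        yield from ("hello", "world")
        return
    # Real branch: this is where the Strands Agent would be built with
    # resolve_model_id() and its stream consumed. Omitted in this sketch.
    raise NotImplementedError("real Strands call is not part of this sketch")
```

Because stream_tokens is a generator, both branches have the same shape from the caller's side: something you iterate over.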

websocket.py • The streaming pipe

TokenStreamer takes each token Strands emits and forwards it to the user's browser via the WebSocket API Gateway. emit(token) queues a token; flush() pushes whatever is queued. We batch a few tokens per API call to reduce overhead.

Ask yourself

What happens if the user closed their browser tab while the AI was still generating?

No crash. When we call post_to_connection on a dead connection, boto3 raises GoneException. The streamer catches that, logs it, marks the session disconnected, and tells the harness "stop, no one's listening." We don't waste tokens generating an answer no one will see.
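
A rough sketch of the batching and the dead-connection handling. To keep it runnable without AWS, the sender is an injected callable and ConnectionGone is a hypothetical stand-in for the GoneException that boto3 raises. The batch size of 3 and the auto-flush inside emit are assumptions, not details from the PR.

```python
from typing import Callable


class ConnectionGone(Exception):
    """Stand-in for boto3's GoneException on a closed WebSocket connection."""


class TokenStreamer:
    def __init__(self, post: Callable[[str], None], batch_size: int = 3) -> None:
        self._post = post  # in production this would wrap post_to_connection
        self._batch_size = batch_size
        self._buffer: list[str] = []
        self.connected = True

    def emit(self, token: str) -> bool:
        """Queue a token. Returns False once nobody is listening."""
        if not self.connected:
            return False
        self._buffer.append(token)
        if len(self._buffer) >= self._batch_size:
            return self.flush()
        return True

    def flush(self) -> bool:
        """Push queued tokens in one call. Returns False if the client left."""
        if not self.connected:
            return False
        if not self._buffer:
            return True
        payload = "".join(self._buffer)
        self._buffer.clear()
        try:
            self._post(payload)
        except ConnectionGone:
            # The tab was closed: remember it so the harness can stop early.
            self.connected = False
        return self.connected
```

The harness loop can then be as simple as "for each token, stop if emit returns False", followed by one final flush for the leftover tokens.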

store.py • Database

Talks to Postgres. Records the chat turn, tracks token counts, latency, model ID used.

Two interesting bits:

RDS IAM auth. No database password in code. The Lambda's IAM role generates a short-lived auth token at connect time (RDS IAM tokens expire after 15 minutes), so credentials rotate by default.
32 KB S3 offload. If a chat turn is bigger than 32 KB (long document quote), we store the content in S3 and put the S3 reference in Postgres. Keeps the database fast.

🗂️

S3 offload is like a card catalog

Libraries don't keep entire books in the catalog cards — they keep a reference to where the book lives on the shelf. Same idea. Postgres holds the reference card. S3 holds the book.
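
The card-catalog decision fits in a few lines. This is a sketch under assumptions: the size is measured in UTF-8 bytes, the key layout and the row shape are invented for illustration, and put_object is an injected callable standing in for the S3 upload.

```python
from typing import Callable, Optional

OFFLOAD_THRESHOLD_BYTES = 32 * 1024  # 32 KB, per the walkthrough


def prepare_turn_row(
    request_id: str,
    content: str,
    put_object: Callable[[str, bytes], None],  # hypothetical S3 uploader
) -> dict[str, Optional[str]]:
    """Return what goes into Postgres: inline content or an S3 reference."""
    body = content.encode("utf-8")
    if len(body) <= OFFLOAD_THRESHOLD_BYTES:
        # Small turn: the "book" fits on the catalog card itself.
        return {"content": content, "s3_key": None}
    # Large turn: shelve the book in S3, keep only the reference card.
    key = f"chat-turns/{request_id}.txt"  # hypothetical key layout
    put_object(key, body)
    return {"content": None, "s3_key": key}
```

Exactly one of content and s3_key is set in the returned row, which keeps the read path unambiguous.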

tool_hooks.py • Placeholder for Story-756

This file is INTENTIONALLY almost empty. It exists so Story-756 (MatterChatAgent + 7 tools) can plug in without touching the runtime code.

🔌

The wall socket pattern

You wire your house with wall sockets before you buy lamps. You don't know which lamps you'll buy, but you know the shape of the plug. tool_hooks.py is the wall socket. Story-756 plugs in the lamps.
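
One common way to build a wall socket in Python is a small registry. The sketch below is purely illustrative: the walkthrough does not say what tool_hooks.py contains, so register_tool, registered_tools, and the (str) -> str tool signature are all hypothetical.

```python
from typing import Callable

ToolFn = Callable[[str], str]

# The socket: the runtime only ever reads from this mapping.
_TOOLS: dict[str, ToolFn] = {}


def register_tool(name: str) -> Callable[[ToolFn], ToolFn]:
    """Decorator that plugs a tool into the registry under a unique name."""

    def decorator(fn: ToolFn) -> ToolFn:
        if name in _TOOLS:
            raise ValueError(f"tool already registered: {name}")
        _TOOLS[name] = fn
        return fn

    return decorator


def registered_tools() -> dict[str, ToolFn]:
    """Return a copy so callers cannot mutate the registry by accident."""
    return dict(_TOOLS)
```

With a shape like this, a later Story adds lamps by decorating functions, and the runtime code that iterates over registered_tools() never changes.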

4 Gotchas and surprises

Gotcha 1 — Strands always streams

Even when you want a single-shot answer, Strands uses ConverseStream under the hood, so there is no non-streaming path to fall back on. For Anthropic models, that means every Bedrock call needs the Anthropic use-case form approved and a valid AWS Marketplace payment method.

Gotcha 2 — IAM names are misleading

The IAM action bedrock:Converse doesn't exist. To authorize a Converse API call, you grant bedrock:InvokeModel. Likewise, a ConverseStream call is authorized by bedrock:InvokeModelWithResponseStream.
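
As a sketch, here is the policy statement built as a Python dict so the mapping is explicit. The function name is hypothetical, and the "*" resource in the usage line is a placeholder that should be narrowed to the actual model or inference-profile ARNs.

```python
import json


def bedrock_policy(resources: list[str]) -> dict:
    """Build an IAM policy document that authorizes Converse and ConverseStream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",  # authorizes Converse
                    "bedrock:InvokeModelWithResponseStream",  # authorizes ConverseStream
                ],
                "Resource": resources,
            }
        ],
    }


# Placeholder resource: replace "*" with specific ARNs before real use.
print(json.dumps(bedrock_policy(["*"]), indent=2))
```

Since Strands always streams, the second action is the one this Lambda cannot live without.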

Gotcha 3 — region prefixes differ by model age

Older Anthropic models use the apac. prefix (e.g., apac.anthropic.claude-3-5-sonnet-20240620-v1:0). Newer ones use the global. prefix (e.g., global.anthropic.claude-sonnet-4-5-20250929-v1:0). If you guess the wrong prefix, you get "model not supported" errors.
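
A small helper can make the prefix visible in logs before the first call fails. This is a sketch: the helper name is hypothetical and the set of known prefixes is an assumption that may be incomplete.

```python
from typing import Optional

# Assumed set of prefixes; extend it if your account uses others.
KNOWN_PROFILE_PREFIXES = {"us", "eu", "apac", "global"}


def split_profile_prefix(model_id: str) -> tuple[Optional[str], str]:
    """Split "apac.anthropic.x" into ("apac", "anthropic.x").

    IDs without a known prefix are returned unchanged with prefix None.
    """
    head, sep, rest = model_id.partition(".")
    if sep and head in KNOWN_PROFILE_PREFIXES:
        return head, rest
    return None, model_id
```

Logging the returned prefix next to the model ID at startup makes a wrong guess obvious at a glance.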

5 What's still open

Cost cap. Spec said abort if projected cost > $0.50/run. Implementation has the hook but not the projection logic.
Structured JSON logs. Currently Python stdlib logging. Story-760 doctrine will dictate Pino-style schema.
Live AWS smoke test. Real wscat roundtrip against staging hasn't happened — blocked on Bedrock billing verification.
MatterChatAgent. The actual agent prompt, 7 tools, 4 guardrails — all Story-756.
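
For the cost cap item above, the missing projection logic could look roughly like this. It is a sketch, not the spec: the function names are hypothetical, the projection assumes the worst case of a full-length output, and prices are passed in as parameters because no real per-token prices appear in this walkthrough.

```python
COST_CAP_USD = 0.50  # per run, from the spec


def projected_cost_usd(
    input_tokens: int,
    max_output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Worst-case cost of one run, assuming the output hits its token limit."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (max_output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


def should_abort(projected_usd: float, cap_usd: float = COST_CAP_USD) -> bool:
    """Abort only when the projection is strictly above the cap."""
    return projected_usd > cap_usd
```

The existing hook would call should_abort before the model call starts, so an over-budget run never spends anything.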

🎓 Check yourself

After reading this walkthrough, you should be able to answer:

Why is there a queue between the API and the AI Lambda?
Why does Strands have a stub branch in our harness?
What happens to the streaming if the user closes their browser mid-response?
Why do we have tool_hooks.py if it's mostly empty?
What's the difference between apac.anthropic... and global.anthropic... model IDs?
Why is the IAM action bedrock:InvokeModel when the API call is Converse?

If there's a question you can't answer, go back and re-read that section. If you still can't, ping Ankit in Slack. That's a doc gap.