The runtime skeleton that turns a chat message into a streaming response.
We shipped a Lambda that listens to a queue, takes a chat message, runs it through Strands (an AI orchestration library), and streams the answer back to the user's browser one word at a time. Right now it just says hello world because the actual brain (MatterChatAgent) comes in Story-756. This Story is the plumbing.
When a lawyer types a question into the matter chat, that question has to travel from their browser to an AI model and back. But AI models are slow: they take 2 to 10 seconds. If we made the browser wait for the full answer, it would feel broken. So we stream the answer back one word at a time, the way ChatGPT does.
This Story builds the middle of that pipeline. Not the browser. Not the AI model. The middle.
The customer (browser) places an order on the phone. The order taker (WebSocket) writes it down. The order goes to the kitchen queue (SQS). A chef (this Lambda) picks it up, makes the pizza (calls AI), and as each slice comes out of the oven, a runner takes it to the customer's table immediately (streaming), instead of waiting for the full pizza to be done.
Story-754 built the chef station and the runner system. Right now the chef just sends out one slice that says "hello" and another that says "world"; the recipe (MatterChatAgent) comes in Story-756.
Why couldn't we just call the AI directly from the API? Why do we need a queue in the middle?
Answer: AI calls can take 30 seconds. API Gateway has a 29-second timeout. If we called AI from the API, requests would die. A queue lets the API return immediately (saying "we got your message") while the AI processes it in the background, and the WebSocket pushes the answer when ready. The queue is the patience buffer.
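To make the "patience buffer" concrete, here is a minimal in-process sketch of the same split: one side accepts and returns at once, the other side does the slow work later. The standard-library queue is only a stand-in for SQS, and accept_message and drain are hypothetical names, not functions from this repo.

```python
import queue

# Hypothetical stand-in for SQS: an in-process FIFO queue.
chat_queue: "queue.Queue[dict]" = queue.Queue()


def accept_message(message: dict) -> dict:
    """API side: enqueue and return at once, without calling the AI."""
    chat_queue.put(message)
    return {"status": "accepted", "request_id": message["request_id"]}


def drain(process) -> int:
    """Worker side: handle everything currently queued; return the count."""
    handled = 0
    while True:
        try:
            message = chat_queue.get_nowait()
        except queue.Empty:
            return handled
        process(message)  # the slow AI call would happen here
        handled += 1
```

The point of the shape: accept_message never waits on process, so a 30-second AI call cannot run into the API's 29-second limit.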
The Lambda we built is the green box in the middle. It's woken up by the queue, talks to the AI, saves the result to Postgres, and pushes tokens back through the WebSocket as they come in.
We added 992 lines across 10 files. Don't try to read them in alphabetical order; that's how engineers waste a Saturday. Read them in the order the request flows.

src/pramaan_agents/handler.py: the front door
src/pramaan_agents/runtime/payloads.py: the data shape
src/pramaan_agents/runtime/factory.py: wiring
src/pramaan_agents/runtime/app.py: the orchestrator
src/pramaan_agents/runtime/strands_harness.py: the AI call
src/pramaan_agents/runtime/websocket.py: the streaming pipe
src/pramaan_agents/db/store.py: saving to Postgres
src/pramaan_agents/runtime/tool_hooks.py: the placeholder for Story-756

The Lambda runtime calls lambda_handler(event, context). The event is a dict with a Records array; each record is one chat message from SQS.
What this file does: pulls each record, parses the JSON into an AgentRunRequest, hands it to the runtime, counts successes and failures, returns a summary.
Why does the handler catch Exception for valid requests but let parsing errors crash through?
Because: if the JSON is malformed, SQS should retry the message and eventually send it to the dead-letter queue. If the JSON is fine but the AI call fails, that's "valid data we processed but couldn't help": we mark it failed and move on, so SQS doesn't endlessly retry a doomed message.
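The split between "let it crash" and "catch and count" can be sketched as below. This is an illustration under assumptions, not the code in handler.py: parse_request stands in for building an AgentRunRequest, run_agent is a hypothetical runtime call, and the summary keys are made up.

```python
import json


def parse_request(body: str) -> dict:
    """Stand-in for building an AgentRunRequest; raises on bad input."""
    data = json.loads(body)  # raises json.JSONDecodeError on malformed JSON
    if "request_id" not in data:
        raise ValueError("request_id is required")
    return data


def run_agent(request: dict) -> None:
    """Hypothetical runtime call; here it only simulates a failure."""
    if request.get("message") == "boom":
        raise RuntimeError("simulated AI failure")


def lambda_handler(event, context):
    succeeded = failed = 0
    for record in event["Records"]:
        # Deliberately outside the try: a parse error fails the invocation,
        # so SQS redelivers and eventually dead-letters the message.
        request = parse_request(record["body"])
        try:
            run_agent(request)
            succeeded += 1
        except Exception:
            # Valid request, failed run: count it and move on.
            failed += 1
    return {"succeeded": succeeded, "failed": failed}
```

Note where the try starts. Moving parse_request inside it would turn malformed messages into quiet "failed" counts instead of retries.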
payloads.py defines the Pydantic models that describe what a chat request looks like on the wire. AgentRunRequest has firm_id, matter_id, request_id, the user's message, and metadata.
Why Pydantic and not plain dicts? Pydantic validates the data the moment it arrives. If the JSON is missing firm_id, we want to know in microsecond 1, not in line 200 of the handler.
Imagine a warehouse. Every truck arriving has a packing slip. The receiving dock checks the slip BEFORE letting the truck into the warehouse. If something's missing or wrong, the truck gets bounced. That's Pydantic. Once a payload makes it past Pydantic, every downstream function can trust it.
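Here is roughly what the receiving dock looks like in code. The field names come from the description above; the field types, the name "message" for the user's text, and the parse_body helper are assumptions for illustration, and the real model in payloads.py may differ.

```python
import json
from typing import Any, Dict

from pydantic import BaseModel, Field, ValidationError


class AgentRunRequest(BaseModel):
    # Field names follow the walkthrough; the types are assumptions.
    firm_id: str
    matter_id: str
    request_id: str
    message: str
    metadata: Dict[str, Any] = Field(default_factory=dict)


def parse_body(body: str) -> AgentRunRequest:
    """Hypothetical helper: validate one SQS message body."""
    return AgentRunRequest(**json.loads(body))


def describe(body: str) -> str:
    """Show what the dock does with a good slip and a bad one."""
    try:
        request = parse_body(body)
    except ValidationError as exc:
        return f"bounced: {len(exc.errors())} problem(s)"
    return f"accepted: {request.request_id}"
```

A body missing firm_id never reaches the rest of the runtime; it is rejected at construction time with a ValidationError that names the missing field.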
strands_harness.py is where we call Strands, the open-source library that wraps Bedrock and gives us agent semantics. The class has TWO branches:

Stub mode: emits hello and world as two tokens
Real mode (PRAMAAN_AGENTS_STUB_AGENT=false): calls Strands' real Agent

Why have a stub at all? Why not just always call the real Strands?
Three reasons: Tests run offline (no AWS credentials in CI). Determinism (stub always returns same tokens, real AI varies). Cost (stubs are free, real calls cost money).
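A minimal sketch of the two-branch idea, assuming that any value other than an explicit "false" selects the stub. The function names are hypothetical, and the real branch is left as a placeholder rather than guessing at the Strands call.

```python
import os
from typing import Iterator


def use_stub() -> bool:
    # Assumption: anything other than an explicit "false" selects the stub.
    value = os.environ.get("PRAMAAN_AGENTS_STUB_AGENT", "true")
    return value.strip().lower() != "false"


def stream_tokens(prompt: str) -> Iterator[str]:
    """Yield answer tokens for a prompt, from the stub or the real agent."""
    if use_stub():
        # Deterministic, offline, free: ideal for CI.
        yield "hello"
        yield "world"
        return
    # Placeholder: the real branch would hand the prompt to Strands' Agent.
    raise NotImplementedError("real Strands Agent call goes here")
```

Because the stub output never changes, a test can assert on the exact token list without AWS credentials.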
The model ID comes from the BEDROCK_MODEL_ID environment variable. Right now it defaults to apac.amazon.nova-pro-v1:0 because Claude Sonnet 4.5 is blocked by an AWS Marketplace card verification (3-5 business days). When that clears, the swap is ONE environment variable. Zero code change.
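The "one environment variable" swap can be sketched like this. resolve_model_id is a hypothetical helper; only the variable name and the default value come from the text above.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_ID = "apac.amazon.nova-pro-v1:0"


def resolve_model_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return BEDROCK_MODEL_ID if set, else the current default."""
    env = os.environ if env is None else env
    return env.get("BEDROCK_MODEL_ID", DEFAULT_MODEL_ID)
```

Passing the environment in as an argument is a design choice for the sketch: it lets a test exercise both the default and the override without touching the process environment.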
TokenStreamer takes each token Strands emits and forwards it to the user's browser via the WebSocket API Gateway. emit(token) queues. flush() pushes. We batch a few tokens per API call to reduce overhead.
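The emit/flush batching described above might look something like this. It is a sketch, not the class in websocket.py: the injected send callable, the batch_size parameter, and the auto-flush when a batch fills are all assumptions.

```python
from typing import Callable, List


class TokenStreamer:
    """Buffers tokens and pushes them in small batches."""

    def __init__(self, send: Callable[[str], None], batch_size: int = 3):
        self._send = send              # assumed: one call per WebSocket push
        self._batch_size = batch_size  # assumed tuning knob
        self._pending: List[str] = []

    def emit(self, token: str) -> None:
        """Queue a token; push automatically once a batch is full."""
        self._pending.append(token)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Push whatever is queued as one payload; no-op when empty."""
        if self._pending:
            self._send("".join(self._pending))
            self._pending.clear()
```

Whoever drives the streamer should call flush() once more when the model finishes, so a partial final batch is not left in the buffer.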
What happens if the user closed their browser tab while the AI was still generating?
No crash. When we call post_to_connection on a dead connection, boto3 raises GoneException. The streamer catches that, logs it, marks the session disconnected, and tells the harness "stop, no one's listening." We don't waste tokens generating an answer no one will see.
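A sketch of that dead-connection handling, assuming the client is a boto3 API Gateway Management API client (or anything shaped like one). ConnectionStreamer and its send method are hypothetical names; post_to_connection and GoneException are the real boto3 pieces mentioned above.

```python
import logging

logger = logging.getLogger(__name__)


class ConnectionStreamer:
    """Pushes text to one WebSocket connection until it goes away."""

    def __init__(self, client, connection_id: str):
        self.client = client
        self.connection_id = connection_id
        self.disconnected = False

    def send(self, text: str) -> bool:
        """Return False when the caller should stop generating."""
        if self.disconnected:
            return False
        try:
            self.client.post_to_connection(
                ConnectionId=self.connection_id,
                Data=text.encode("utf-8"),
            )
        except self.client.exceptions.GoneException:
            # The browser tab is closed: log it, remember it, signal stop.
            logger.info("connection %s is gone", self.connection_id)
            self.disconnected = True
            return False
        return True
```

The boolean return is one way to say "stop, no one's listening": the generation loop can break as soon as send returns False, and later calls skip the network entirely.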
db/store.py talks to Postgres. It records the chat turn and tracks token counts, latency, and the model ID used.
Two interesting bits:
Libraries don't keep entire books in the catalog cards; they keep a reference to where the book lives on the shelf. Same idea. Postgres holds the reference card. S3 holds the book.
tool_hooks.py is INTENTIONALLY almost empty. It exists so Story-756 (MatterChatAgent + 7 tools) can plug in without touching the runtime code.
You wire your house with wall sockets before you buy lamps. You don't know which lamps you'll buy, but you know the shape of the plug. tool_hooks.py is the wall socket. Story-756 plugs in the lamps.
Even when you want a single-shot answer, Strands uses ConverseStream under the hood. That means every Bedrock call needs the Anthropic use-case form approved and a valid AWS Marketplace payment method.
The IAM action bedrock:Converse doesn't exist. To authorize a Converse API call, you grant bedrock:InvokeModel. Same for ConverseStream: you grant bedrock:InvokeModelWithResponseStream.
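As a concrete reference, a policy statement covering both calls could be built like this. The helper name is hypothetical, and the Resource value in the usage note is a placeholder that should be narrowed to the model and inference-profile ARNs actually in use.

```python
import json
from typing import Iterable


def bedrock_streaming_policy(resource_arns: Iterable[str]) -> dict:
    """Build an IAM policy document for Converse and ConverseStream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",                    # covers Converse
                    "bedrock:InvokeModelWithResponseStream",  # covers ConverseStream
                ],
                "Resource": list(resource_arns),
            }
        ],
    }


def to_json(policy: dict) -> str:
    """Serialize the policy for pasting into IaC or the console."""
    return json.dumps(policy, indent=2)
```

For example, to_json(bedrock_streaming_policy(["*"])) gives a document you can paste while experimenting; swap "*" for specific ARNs before it goes anywhere near production.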
Older Anthropic models use the apac. prefix (e.g., apac.anthropic.claude-3-5-sonnet-20240620-v1:0). Newer ones use the global. prefix (e.g., global.anthropic.claude-sonnet-4-5-20250929-v1:0). If you guess wrong, you get "model not supported" errors.
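One cheap guard against a wrong guess is to log which prefix a configured model ID carries before the first call. The helper below is a hypothetical sketch that only knows the two prefixes named above; it does not check whether a model is actually available.

```python
from typing import Optional

# Only the two prefixes named in this walkthrough; others may exist.
KNOWN_PREFIXES = ("apac", "global")


def profile_prefix(model_id: str) -> Optional[str]:
    """Return the leading prefix if it is one we recognise, else None."""
    head, _, rest = model_id.partition(".")
    if rest and head in KNOWN_PREFIXES:
        return head
    return None
```

A None result is the useful one: it means the ID has no recognised prefix, which is worth a warning in the logs before Bedrock answers with "model not supported".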
logging. Story-760 doctrine will dictate Pino-style schema.
wscat roundtrip against staging hasn't happened; blocked on Bedrock billing verification.

After reading this walkthrough, you should be able to answer:

tool_hooks.py if it's mostly empty?
apac.anthropic... and global.anthropic... model IDs?
bedrock:InvokeModel when the API call is Converse?

If you can't answer any of these, go back and re-read that section. If you still can't, ping Ankit in Slack; that's a doc gap.