This page is the single visual reference for the TalkIDE production infrastructure on DigitalOcean. It exists so the team has something concrete to hold on to and so infrastructure decisions can be made against an accurate picture rather than tribal memory.
All diagrams are inline Mermaid and live in version control next to the prose. They
describe the production environment (talkide-prod, NYC3).
1 · System Context
The platform, its actors and the external systems it depends on.
```mermaid
flowchart TB
    EU["End User (Vibecoder)"]
    OP["Platform Operator"]
    AEU["App End User"]
    TID["<b>TalkIDE Platform</b>"]
    ANT["Anthropic API"]
    STRIPE["Stripe (test mode)"]
    MG["Mailgun"]
    DO["DigitalOcean"]
    GL["GitLab"]
    PB["Porkbun (registrar)"]
    EU --> TID
    OP --> TID
    AEU --> TID
    TID --> ANT
    TID --> STRIPE
    TID --> MG
    TID --> DO
    TID --> GL
    PB -.-> DO
```
See system-context.md for the annotated version.
2 · Container Diagram
Deployable units and how they communicate. See containers.md for the
full table form.
```mermaid
flowchart LR
    Browser["Browser"]
    subgraph k8s["DigitalOcean Kubernetes — talkide-prod"]
        FE["TalkIDE FE"]
        BE["TalkIDE BE<br/>+ gateway-proxy"]
        WK["talkide-worker<br/><i>(LIVE since 2026-05-21)</i>"]
        UAPP["User-app pod"]
        NFS["NFS server"]
    end
    PGA[("PG cluster A<br/>control-plane")]
    PGB[("PG cluster B<br/>data-plane")]
    SP[("DO Spaces")]
    REG[("Container Registry")]
    ANT["Anthropic API"]
    Browser --> FE --> BE
    Browser --> UAPP
    BE --> PGA
    BE --> NFS
    BE --> WK
    BE --> UAPP
    WK --> BE
    WK --> NFS
    BE -->|gateway-proxy| ANT
    UAPP --> PGB
    UAPP --> SP
    BE --> REG
```
3 · DigitalOcean Production Topology
The most important diagram on this page — the full physical layout: one Kubernetes cluster, two Postgres clusters, NFS, Spaces, the registry, the load balancer and DNS.
```mermaid
flowchart TB
    DNS["<b>DO DNS — talkide.app</b><br/>apex · www · api · *.talkide.app (wildcard)<br/>NS: ns1/2/3.digitalocean.com"]
    LB["<b>talkide-prod-lb</b><br/>DO Load Balancer"]
    subgraph vpc["talkide-prod-vpc — private network (NYC3)"]
        direction TB
        subgraph cluster["K8s cluster: talkide-prod — node pool talkide-prod-pool-1 (s-4vcpu-8gb)"]
            direction TB
            ING["ingress-nginx<br/>(ns: ingress-nginx)"]
            subgraph nscp["ns: talkide (control-plane)"]
                BE["talkide-be<br/>Spring Boot · :8080"]
                FE["talkide-fe<br/>nginx · static"]
                NFSP["NFS server pod<br/>nfs-server-provisioner"]
            end
            subgraph nstenant["ns: {tenant}-{env} — one per tenant-environment"]
                WK["talkide-worker pod<br/><i>(LIVE since 2026-05-21)</i>"]
                UAPP["user-app pod<br/>BE+FE single pod :8080"]
                JOB["ephemeral Jobs<br/>Kaniko · gradle build/test"]
            end
        end
        subgraph pga["PG cluster A — talkide-prod-pg (control-plane)"]
            direction TB
            PGAdb[("DB: talkide<br/>+ durable session state")]
            PGApool["PgBouncer pools<br/>talkide-tx (tx, size 18)<br/>talkide (session, size 3)"]
            PGApool --> PGAdb
        end
        subgraph pgb["PG cluster B — talkide-dataplane-pg (data-plane)"]
            direction TB
            PGBpb["self-host PgBouncer<br/>edoburu · SCRAM-SHA-256<br/>:5432"]
            PGBdb[("DB: talkide_dataplane<br/>schema-per-app:<br/>tk_t{id}_p{slug}_{env}")]
            PGBpb --> PGBdb
        end
        BV["DO Block Volume<br/>talkide-prod-nfs-vol (20–50 GB)"]
    end
    SPACES["<b>DO Spaces — talkide-prod-space</b><br/>apps/user_{id}/app_{slug}/uploads · generated<br/>platform/db-backups · exports · logs · artifacts"]
    REG["<b>DO Container Registry</b><br/>registry.digitalocean.com/talkide<br/>BE · FE · talkide-worker · userapp-cache · N app images"]
    DNS --> LB --> ING
    ING --> FE
    ING --> BE
    ING --> UAPP
    BE -->|":25061 pooled · :25060 direct"| PGApool
    UAPP -->|"via self-host PgBouncer"| PGBpb
    BE -.->|"K8s API: provision ns / pods / Jobs"| nstenant
    WK -->|"gateway-proxy"| BE
    JOB -->|"push image"| REG
    BE --> REG
    NFSP --> BV
    BE -.->|RWX mount| NFSP
    WK -.->|RWX mount| NFSP
    JOB -.->|RWX mount| NFSP
    UAPP --> SPACES
```
3.1 · Database — two physically separate clusters
ADR-023 split what used to be one cluster (ADR-016) into two, after the single-cluster model hit the ~25-connection limit of a DO Basic Postgres cluster.
| Property | Value |
|---|---|
| Cluster | talkide-prod-pg (DO Managed PG 18, Basic 1 GB / 1 vCPU / 10 GB) |
| Direct port | 25060 — per-app DB provisioner, session state |
| Pooler port | 25061 — BE runtime + Liquibase |
| Pool talkide-tx | transaction mode, size 18; JDBC needs prepareThreshold=0 |
| Pool talkide | session mode, size 3; Liquibase needs pg_advisory_lock |
| Holds | platform DB talkide, durable conversation/session state |
!!! danger "Connection budget"
    Before any prod BE restart, verify the cluster A connection budget: lowering
    the talkide-tx pool from 18 has already caused a prod-down incident. Tenant app
    pods must never consume direct 25060 slots on cluster A.
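The budget can be checked with simple arithmetic before a restart. The sketch below is illustrative only: it assumes the roughly 25-connection ceiling quoted for the Basic cluster and that each pool reserves its full size in backend connections. The names `CLUSTER_A_CONNECTION_LIMIT`, `POOLS` and `direct_slots_left` are hypothetical and not part of the TalkIDE codebase.

```python
# Hypothetical pre-restart check for the cluster A connection budget.
# 25 is the approximate limit quoted on this page, not a measured value.
CLUSTER_A_CONNECTION_LIMIT = 25

# PgBouncer pool sizes taken from the cluster A table.
POOLS = {"talkide-tx": 18, "talkide": 3}


def direct_slots_left(pools: dict, limit: int = CLUSTER_A_CONNECTION_LIMIT) -> int:
    """Return how many backend connections remain for direct (:25060) clients."""
    pooled = sum(pools.values())
    if pooled > limit:
        raise ValueError(f"pools need {pooled} connections, limit is {limit}")
    return limit - pooled


print(direct_slots_left(POOLS))  # 4
```

Under these assumptions, 21 of about 25 connections are committed to the two pools, leaving about four for the direct port. This margin is small, which is consistent with the rule that tenant app pods stay off :25060.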
| Property | Value |
|---|---|
| Cluster | talkide-dataplane-pg (DO Managed PG 18) |
| DB | talkide_dataplane — schema-per-app model |
| Pooler | self-hosted PgBouncer (edoburu/pgbouncer), in-cluster Service :5432 |
| Auth | SCRAM-SHA-256 verifier in a dataplane_auth.credentials table; PgBouncer auth_query against a SECURITY DEFINER function |
| Isolation | each app gets schema tk_t<tenantId>_p<slug>_<env> + per-app role + ALTER ROLE … SET search_path |
!!! note "Why a self-host PgBouncer, not pgcat or DO Managed PgBouncer"
    DO Managed PgBouncer cannot do per-role passthrough; pgcat's auth_query is
    MD5-only, and pg_authid is unreadable even to doadmin on DO Managed PG. Only
    PgBouncer ≥ 1.14 supports client→pooler SCRAM-SHA-256 with auth_query. Full
    rationale in ADR-023 (Revision 3).
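The schema-per-app isolation row for cluster B can be sketched as a small naming helper plus the statements a provisioner might issue. This is a sketch under stated assumptions, not the actual provisioner: mapping hyphens to underscores, reusing the schema name as the role name, and the exact grants are all assumptions. The 63-byte identifier limit is standard PostgreSQL behaviour.

```python
import re

# PostgreSQL truncates identifiers longer than 63 bytes (NAMEDATALEN - 1).
MAX_IDENTIFIER_BYTES = 63


def app_schema_name(tenant_id: int, slug: str, env: str) -> str:
    """Build tk_t<tenantId>_p<slug>_<env>; the hyphen-to-underscore mapping is an assumption."""
    safe_slug = slug.lower().replace("-", "_")
    name = f"tk_t{tenant_id}_p{safe_slug}_{env.lower()}"
    if not re.fullmatch(r"[a-z0-9_]+", name):
        raise ValueError(f"unsafe characters in schema name: {name!r}")
    if len(name.encode()) > MAX_IDENTIFIER_BYTES:
        raise ValueError(f"schema name exceeds {MAX_IDENTIFIER_BYTES} bytes: {name!r}")
    return name


def isolation_statements(schema: str) -> list:
    """Illustrative DDL only; using the schema name as the role name is an assumption."""
    role = schema
    return [
        f'CREATE SCHEMA "{schema}"',
        f'CREATE ROLE "{role}" LOGIN',
        f'GRANT USAGE, CREATE ON SCHEMA "{schema}" TO "{role}"',
        f'ALTER ROLE "{role}" SET search_path = "{schema}"',
    ]


schema = app_schema_name(42, "my-shop", "prod")
print(schema)  # tk_t42_pmy_shop_prod
for statement in isolation_statements(schema):
    print(statement)
```

Validating the name before it is interpolated into DDL matters because identifiers cannot be passed as bind parameters. The length check guards against two long slugs silently truncating to the same schema name.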
3.2 · Storage strategy — separation of concerns
| Storage | Holds | Backed by |
|---|---|---|
| NFS (talkide-prod-nfs-vol) | Per-project working tree + local .git/ history + build pipeline workspace (source of truth for user code) | DO Block Volume, RWX, storageClass nfs-persistent. Pods run as non-root UID 1000+ (NFS root squashing). |
| DO Spaces (talkide-prod-space) | Per-app user uploads + generated content; platform/ prefixes for db-backups, exports, logs, artifacts | S3-compatible, versioning enabled |
| Per-app Postgres schema | Structured app data | Cluster B, schema-per-app |
| Pod-local emptyDir | Ephemeral temp files | Node-local |
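The Spaces layout can be illustrated with a key-building helper. This is a minimal sketch: it assumes `uploads` and `generated` are sibling prefixes under each app, as the topology diagram label suggests, and the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Prefixes as listed for the Spaces bucket on this page; the exact nesting is an assumption.
APP_KINDS = {"uploads", "generated"}
PLATFORM_PREFIXES = {"db-backups", "exports", "logs", "artifacts"}


def _leaf_name(filename: str) -> str:
    """Keep only the final path component so a caller cannot escape its prefix."""
    name = PurePosixPath(filename).name
    if not name or name == "..":
        raise ValueError(f"invalid filename: {filename!r}")
    return name


def app_object_key(user_id: int, slug: str, kind: str, filename: str) -> str:
    """Key for per-app content: apps/user_{id}/app_{slug}/{kind}/{filename}."""
    if kind not in APP_KINDS:
        raise ValueError(f"unknown app content kind: {kind!r}")
    return f"apps/user_{user_id}/app_{slug}/{kind}/{_leaf_name(filename)}"


def platform_object_key(prefix: str, filename: str) -> str:
    """Key for platform-level content: platform/{prefix}/{filename}."""
    if prefix not in PLATFORM_PREFIXES:
        raise ValueError(f"unknown platform prefix: {prefix!r}")
    return f"platform/{prefix}/{_leaf_name(filename)}"


print(app_object_key(7, "my-shop", "uploads", "logo.png"))
# apps/user_7/app_my-shop/uploads/logo.png
```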
3.3 · DNS & domains
| Host | Serves |
|---|---|
| talkide.app | TalkIDE FE (workspace, Studio) |
| www.talkide.app | redirect to apex |
| api.talkide.app | TalkIDE BE (REST + SSE) |
| *.talkide.app | user apps: <uuid> for DEV preview, <slug> for PROD published |
Reserved slugs (TalkIDE subdomains, mail/CDN/marketing prefixes, environment names,
auth actions) are rejected by the Create Project slug validator. The full list is in the
project root CLAUDE.md.
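The host table and the reserved-slug rule can be sketched as follows. This is an illustration, not the actual validator: the `RESERVED_SLUGS` set is a placeholder (the authoritative list is in CLAUDE.md), and the slug pattern is an assumed DNS-label shape.

```python
import re
import uuid

# Placeholder subset only; the authoritative reserved list lives in the project root CLAUDE.md.
RESERVED_SLUGS = {"www", "api"}

# Assumed slug shape: lowercase DNS label, 1 to 63 characters, no leading or trailing hyphen.
SLUG_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_slug(slug: str) -> bool:
    """Accept a slug only if it is a well-formed label and not reserved."""
    return bool(SLUG_PATTERN.fullmatch(slug)) and slug not in RESERVED_SLUGS


def classify_host(host: str, zone: str = "talkide.app") -> str:
    """Map a request host to the surface named in the DNS table."""
    if host == zone:
        return "frontend"
    if not host.endswith("." + zone):
        return "unknown"
    label = host[: -len(zone) - 1]
    if label == "www":
        return "redirect-to-apex"
    if label == "api":
        return "backend"
    try:
        uuid.UUID(label)  # DEV previews are served under a UUID label
        return "dev-preview"
    except ValueError:
        return "prod-app" if is_valid_slug(label) else "unknown"


print(classify_host("my-shop.talkide.app"))  # prod-app
```

Because the wildcard record sends every label to the same ingress, rejecting reserved names at slug creation time is what keeps a user app from shadowing `api` or `www`.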
4 · Per-tenant Runtime Topology (ADR-024 — worker LIVE, build Jobs planned)
```mermaid
flowchart TB
    subgraph cp["ns: talkide (control-plane — Spring Boot / Kotlin)"]
        BE["TalkIDE BE<br/>identity · tenant/billing data · quota authority<br/>worker orchestration"]
        GW["gateway-proxy<br/>holds raw ANTHROPIC_API_KEY<br/>rate-limit + billing accounting"]
        BE --- GW
    end
    subgraph tns["ns: {tenant}-{env} (e.g. mirek-prod) — ResourceQuota keyed by plan"]
        WK["talkide-worker pod<br/>Node/TS · Agent SDK in-process<br/>stateful: session + transcript on NFS<br/>3-week resume · survives BE redeploy"]
        BJ["gradle build/test Job<br/>ephemeral · bounded · parallel"]
        KJ["Kaniko image build Job<br/>ephemeral"]
    end
    NFS["NFS — per-project working tree"]
    ANT["Anthropic API"]
    REG["Container Registry"]
    BE -->|"create / destroy worker pod (K8s API)"| WK
    WK -->|"agent calls — HMAC token, NOT the API key"| GW
    GW -->|"proxied inference"| ANT
    WK -->|"dispatch (K8s Role-scoped)"| BJ
    WK -->|"dispatch"| KJ
    WK -.->|RWX| NFS
    BJ -.->|RWX + gradle cache PVC| NFS
    KJ --> REG
    WK -->|"SSE stream tokens + events"| BE
    WK -->|"usage / activity reporting (HMAC seam, be#104)"| BE
```
Thin-seam contract — the dividing line ADR-024 draws:
| Control-plane (BE, talkide ns) | Worker (<tenant>-<env> ns) |
|---|---|
| Identity / JWT issuance | Mara / Anthropic SDK runtime (in-process) |
| Tenant / project / billing persistence | SSE stream to the FE |
| Quota / budget authority | Dispatch of build/test K8s Jobs |
| Worker orchestration (create/destroy) | Usage / activity reporting back to control-plane |
| Gateway policy — holds the raw Anthropic key | — (worker only ever holds an internal HMAC token) |
Why hybrid topology: the agent is stateful and I/O-bound → a long-running worker pod; gradle build/test is stateless and bounded → ephemeral Jobs the cluster scheduler can parallelise and OOM-isolate; image build stays a Kaniko Job (ADR-019, unchanged).
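The key-custody side of the thin seam can be illustrated with a small HMAC-signed token: the control-plane issues it, the worker presents it, and the gateway verifies it before proxying with the raw Anthropic key. The token layout (base64 claims, a dot, a hex signature), the claim names and the function names are all hypothetical; this page does not specify the real format.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def issue_token(secret: bytes, tenant: str, env: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Control-plane side: sign a short claim set for one worker (hypothetical format)."""
    issued = time.time() if now is None else now
    claims = {"tenant": tenant, "env": env, "exp": int(issued) + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def verify_token(secret: bytes, token: str, now: Optional[float] = None) -> Optional[dict]:
    """Gateway side: return the claims if signature and expiry check out, else None."""
    body, sep, sig = token.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII type errors.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    return claims if claims["exp"] > current else None


token = issue_token(b"demo-secret", "mirek", "prod", ttl_s=60, now=1000)
print(verify_token(b"demo-secret", token, now=1001))
```

In this sketch a leaked worker token is scoped to one tenant-environment and expires on its own, while the provider key never leaves the control-plane namespace.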
5 · Deploy User-App Flow
How a project goes from Create Project to a published <slug>.talkide.app.
```mermaid
sequenceDiagram
    actor U as User
    participant BE as TalkIDE BE
    participant NFS as NFS
    participant Snap as SnapshotService
    participant Build as Kaniko Build Job
    participant Reg as DO Registry
    participant K8s as K8s API
    participant DB as Data-plane provisioner
    participant Ing as Ingress reconciler
    U->>BE: Create Project (name, slug, accent, stack)
    BE->>NFS: scaffold working tree (.talkide/, backend/, frontend/, CLAUDE.md)
    Note over BE: PROJECT row, status DRAFT
    U->>BE: Apply Version (DEV) or Publish (PROD, ADR-022)
    BE->>Snap: snapshot working tree → tarball
    BE->>Build: enqueue Kaniko Job
    Build->>NFS: read source
    Build->>Reg: push image talkide/{slug}:{version}
    Build-->>BE: image digest
    BE->>K8s: ensure namespace {tenant}-{env} (ADR-015 / ADR-026)
    BE->>DB: ensure per-app schema + role (cluster B, SCRAM)
    BE->>K8s: apply Deployment + Service (same-origin single pod :8080)
    K8s-->>BE: rollout ready
    BE->>Ing: reconcile Ingress
    alt DEV preview
        Ing->>K8s: upsert Ingress host {uuid}.talkide.app
    else PROD published
        Ing->>K8s: upsert Ingress host {slug}.talkide.app
        BE->>BE: git tag version, PROJECT.status = PUBLISHED
    end
    BE-->>U: URL live
    Note over BE,Ing: DEV per build, PROD on explicit Publish only (ADR-026 Environments)
```
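The names that appear in the deploy flow (namespace, image reference, ingress host) all derive from a handful of inputs. The sketch below shows that derivation. The `DeployTarget` type, the `deploy_target` function and its parameter names are hypothetical; only the name patterns come from the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployTarget:
    namespace: str
    image: str
    ingress_host: str


def deploy_target(tenant: str, env: str, slug: str, version: str, preview_uuid: str) -> DeployTarget:
    """Derive the names used in the deploy sequence; parameter names are hypothetical."""
    env = env.lower()
    if env not in {"dev", "prod"}:
        raise ValueError(f"unknown environment: {env!r}")
    # DEV previews are served under the UUID host, PROD under the published slug.
    label = preview_uuid if env == "dev" else slug
    return DeployTarget(
        namespace=f"{tenant}-{env}",
        image=f"talkide/{slug}:{version}",
        ingress_host=f"{label}.talkide.app",
    )


target = deploy_target("mirek", "prod", "my-shop", "v3", "123e4567-e89b-12d3-a456-426614174000")
print(target.namespace, target.image, target.ingress_host)
# mirek-prod talkide/my-shop:v3 my-shop.talkide.app
```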
6 · Diagram maintenance & known mismatches
- Mermaid, inline. All overview diagrams are inline Mermaid rendered by the mermaid2 plugin: no build step, and they version with the prose.
- Retired (2026-05-23): the legacy Structurizr model (overview/workspace.dsl) and the pre-rendered SVGs in docs/assets/diagrams/*.svg were removed. They predate the two-cluster split (ADR-023) and the worker extraction (ADR-024) and are not coming back. Authoritative sources: this page, containers.md, the inline Mermaid in each spec/UC, and ADR-023/024/026.