Infrastructure Topology -- TalkIDE Internal Docs

This page is the single visual reference for the TalkIDE production infrastructure on DigitalOcean. It gives the team one concrete, shared picture, so that infrastructure decisions are made against the actual layout rather than tribal memory.

All diagrams are inline Mermaid and live in version control next to the prose. They describe the production environment (talkide-prod, NYC3).

1 · System Context

The platform, its actors and the external systems it depends on.

flowchart TB
    EU["End User (Vibecoder)"]
    OP["Platform Operator"]
    AEU["App End User"]

    TID["<b>TalkIDE Platform</b>"]

    ANT["Anthropic API"]
    STRIPE["Stripe (test mode)"]
    MG["Mailgun"]
    DO["DigitalOcean"]
    GL["GitLab"]
    PB["Porkbun (registrar)"]

    EU --> TID
    OP --> TID
    AEU --> TID
    TID --> ANT
    TID --> STRIPE
    TID --> MG
    TID --> DO
    TID --> GL
    PB -.-> DO

See system-context.md for the annotated version.

2 · Container Diagram

Deployable units and how they communicate. See containers.md for the full table form.

flowchart LR
    Browser["Browser"]

    subgraph k8s["DigitalOcean Kubernetes — talkide-prod"]
        FE["TalkIDE FE"]
        BE["TalkIDE BE<br/>+ gateway-proxy"]
        WK["talkide-worker<br/><i>(LIVE since 2026-05-21)</i>"]
        UAPP["User-app pod"]
        NFS["NFS server"]
    end

    PGA[("PG cluster A<br/>control-plane")]
    PGB[("PG cluster B<br/>data-plane")]
    SP[("DO Spaces")]
    REG[("Container Registry")]
    ANT["Anthropic API"]

    Browser --> FE --> BE
    Browser --> UAPP
    BE --> PGA
    BE --> NFS
    BE --> WK
    BE --> UAPP
    WK --> BE
    WK --> NFS
    BE -->|gateway-proxy| ANT
    UAPP --> PGB
    UAPP --> SP
    BE --> REG

3 · DigitalOcean Production Topology

This is the most important diagram on this page: the full physical layout, with one Kubernetes cluster, two Postgres clusters, NFS, Spaces, the registry, the load balancer and DNS.

flowchart TB
    DNS["<b>DO DNS — talkide.app</b><br/>apex · www · api · *.talkide.app (wildcard)<br/>NS: ns1/2/3.digitalocean.com"]
    LB["<b>talkide-prod-lb</b><br/>DO Load Balancer"]

    subgraph vpc["talkide-prod-vpc — private network (NYC3)"]
        direction TB

        subgraph cluster["K8s cluster: talkide-prod — node pool talkide-prod-pool-1 (s-4vcpu-8gb)"]
            direction TB
            ING["ingress-nginx<br/>(ns: ingress-nginx)"]

            subgraph nscp["ns: talkide (control-plane)"]
                BE["talkide-be<br/>Spring Boot · :8080"]
                FE["talkide-fe<br/>nginx · static"]
                NFSP["NFS server pod<br/>nfs-server-provisioner"]
            end

            subgraph nstenant["ns: {tenant}-{env} — one per tenant-environment"]
                WK["talkide-worker pod<br/><i>(LIVE since 2026-05-21)</i>"]
                UAPP["user-app pod<br/>BE+FE single pod :8080"]
                JOB["ephemeral Jobs<br/>Kaniko · gradle build/test"]
            end
        end

        subgraph pga["PG cluster A — talkide-prod-pg (control-plane)"]
            direction TB
            PGAdb[("DB: talkide<br/>+ durable session state")]
            PGApool["PgBouncer pools<br/>talkide-tx (tx, size 18)<br/>talkide (session, size 3)"]
            PGApool --> PGAdb
        end

        subgraph pgb["PG cluster B — talkide-dataplane-pg (data-plane)"]
            direction TB
            PGBpb["self-hosted PgBouncer<br/>edoburu · SCRAM-SHA-256<br/>:5432"]
            PGBdb[("DB: talkide_dataplane<br/>schema-per-app:<br/>tk_t{id}_p{slug}_{env}")]
            PGBpb --> PGBdb
        end

        BV["DO Block Volume<br/>talkide-prod-nfs-vol (20–50 GB)"]
    end

    SPACES["<b>DO Spaces — talkide-prod-space</b><br/>apps/user_{id}/app_{slug}/uploads · generated<br/>platform/db-backups · exports · logs · artifacts"]
    REG["<b>DO Container Registry</b><br/>registry.digitalocean.com/talkide<br/>BE · FE · talkide-worker · userapp-cache · N app images"]

    DNS --> LB --> ING
    ING --> FE
    ING --> BE
    ING --> UAPP

    BE -->|":25061 pooled · :25060 direct"| PGApool
    UAPP -->|"via self-hosted PgBouncer"| PGBpb
    BE -.->|"K8s API: provision ns / pods / Jobs"| nstenant
    WK -->|"gateway-proxy"| BE
    JOB -->|"push image"| REG
    BE --> REG
    NFSP --> BV
    BE -.->|RWX mount| NFSP
    WK -.->|RWX mount| NFSP
    JOB -.->|RWX mount| NFSP
    UAPP --> SPACES

3.1 · Database — two physically separate clusters

ADR-023 split what used to be one cluster (ADR-016) into two, after the single-cluster model hit the ~25-connection limit of a DO Basic Postgres cluster.

| Property | Value |
| --- | --- |
| Cluster | `talkide-prod-pg` (DO Managed PG 18, Basic 1 GB / 1 vCPU / 10 GB) |
| Direct port | `25060`: per-app DB provisioner, session state |
| Pooler port | `25061`: BE runtime + Liquibase |
| Pool `talkide-tx` | transaction mode, size 18; JDBC needs `prepareThreshold=0` |
| Pool `talkide` | session mode, size 3; Liquibase needs `pg_advisory_lock` |
| Holds | platform DB `talkide`, durable conversation/session state |

!!! danger "Connection budget"
    Before any prod BE restart, verify the cluster A connection budget: lowering the `talkide-tx` pool from 18 has already caused a prod-down incident. Tenant app pods must never consume direct `25060` slots on cluster A.
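
The budget behind this warning is plain arithmetic. The following is a minimal sketch, assuming the approximate 25-connection limit quoted from ADR-023 and treating every pool slot as one backend connection. The `direct_in_use` argument is a hypothetical input, not a measured value.

```python
from dataclasses import dataclass

# Approximate connection limit of a DO Basic Postgres cluster (per ADR-023).
CLUSTER_A_LIMIT = 25

# PgBouncer pool sizes documented for cluster A.
POOLS = {"talkide-tx": 18, "talkide": 3}


@dataclass(frozen=True)
class Budget:
    pooled: int    # backend connections reserved by the PgBouncer pools
    direct: int    # connections opened on the direct port 25060
    headroom: int  # slots left under the cluster limit (negative = over budget)


def connection_budget(direct_in_use: int, limit: int = CLUSTER_A_LIMIT) -> Budget:
    """Return the remaining headroom on cluster A for a given direct-connection count."""
    pooled = sum(POOLS.values())
    return Budget(pooled=pooled, direct=direct_in_use,
                  headroom=limit - pooled - direct_in_use)


if __name__ == "__main__":
    # Hypothetical load: the DB provisioner plus one admin session on 25060.
    print(connection_budget(direct_in_use=2))
```

Under these assumptions the two pools already reserve 21 of the roughly 25 slots, which leaves only a handful for direct `25060` connections.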

| Property | Value |
| --- | --- |
| Cluster | `talkide-dataplane-pg` (DO Managed PG 18) |
| DB | `talkide_dataplane`: schema-per-app model |
| Pooler | self-hosted PgBouncer (`edoburu/pgbouncer`), in-cluster Service `:5432` |
| Auth | SCRAM-SHA-256 verifier in a `dataplane_auth.credentials` table; PgBouncer `auth_query` against a `SECURITY DEFINER` function |
| Isolation | each app gets schema `tk_t<tenantId>_p<slug>_<env>` + per-app role + `ALTER ROLE … SET search_path` |
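
The isolation row can be illustrated with a small naming helper. This is a sketch only: the hyphen-to-underscore mapping, the `_rw` role suffix and the exact statements are assumptions made for illustration, not the provisioner's actual behaviour. The 63-byte cap is PostgreSQL's default identifier limit.

```python
import re

MAX_IDENTIFIER_BYTES = 63  # PostgreSQL truncates longer identifiers by default


def schema_name(tenant_id: int, slug: str, env: str) -> str:
    """Build the per-app schema name tk_t<tenantId>_p<slug>_<env>."""
    # Assumption: hyphens in the slug become underscores so the result is a
    # plain unquoted identifier.
    safe_slug = slug.lower().replace("-", "_")
    name = f"tk_t{tenant_id}_p{safe_slug}_{env.lower()}"
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        raise ValueError(f"unsafe schema name: {name!r}")
    if len(name.encode()) > MAX_IDENTIFIER_BYTES:
        raise ValueError(f"schema name longer than {MAX_IDENTIFIER_BYTES} bytes: {name!r}")
    return name


def provisioning_sql(tenant_id: int, slug: str, env: str) -> list[str]:
    """Statements that create the per-app role and schema and pin search_path."""
    schema = schema_name(tenant_id, slug, env)
    role = f"{schema}_rw"  # hypothetical role naming convention
    return [
        f"CREATE ROLE {role} LOGIN",
        f"CREATE SCHEMA IF NOT EXISTS {schema} AUTHORIZATION {role}",
        f"ALTER ROLE {role} SET search_path = {schema}",
    ]


if __name__ == "__main__":
    for statement in provisioning_sql(42, "my-app", "prod"):
        print(statement + ";")
```

Validating the name against a strict pattern before interpolating it matters here because identifiers cannot be passed as bind parameters.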

!!! note "Why a self-hosted PgBouncer, not pgcat or DO Managed PgBouncer"
    DO Managed PgBouncer cannot do per-role passthrough; pgcat's `auth_query` is MD5-only, and `pg_authid` is unreadable even to `doadmin` on DO Managed PG. Only PgBouncer ≥ 1.14 supports client→pooler SCRAM-SHA-256 with `auth_query`. Full rationale in ADR-023 (Revision 3).
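
For orientation, this is roughly what a SCRAM-SHA-256 verifier of the kind stored in `dataplane_auth.credentials` looks like. The sketch follows the RFC 5802 key derivation and PostgreSQL's text format for stored verifiers. How TalkIDE actually generates its verifiers is not documented here, and the sketch assumes an ASCII password so that SASLprep normalisation can be skipped.

```python
import base64
import hashlib
import hmac
import os
from typing import Optional


def scram_sha256_verifier(password: str, salt: Optional[bytes] = None,
                          iterations: int = 4096) -> str:
    """Derive a verifier string: SCRAM-SHA-256$<iter>:<salt>$<StoredKey>:<ServerKey>."""
    # Assumption: ASCII password, so SASLprep would leave it unchanged.
    salt = os.urandom(16) if salt is None else salt
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()

    def b64(raw: bytes) -> str:
        return base64.b64encode(raw).decode("ascii")

    return f"SCRAM-SHA-256${iterations}:{b64(salt)}${b64(stored_key)}:{b64(server_key)}"


if __name__ == "__main__":
    # A random salt is used, so the output differs on every run.
    print(scram_sha256_verifier("example-password"))
```

The verifier contains neither the password nor the client key, which is what lets a pooler authenticate clients from a lookup table without ever seeing plaintext credentials.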

3.2 · Storage strategy — separation of concerns

| Storage | Holds | Backed by |
| --- | --- | --- |
| NFS (`talkide-prod-nfs-vol`) | Per-project working tree + local `.git/` history + build pipeline workspace; the source of truth for user code | DO Block Volume, RWX, storageClass `nfs-persistent`. Pods run as non-root UID 1000+ (NFS root squashing). |
| DO Spaces (`talkide-prod-space`) | Per-app user uploads + generated content; `platform/` prefixes for db-backups, exports, logs, artifacts | S3-compatible, versioning enabled |
| Per-app Postgres schema | Structured app data | Cluster B, schema-per-app |
| Pod-local `emptyDir` | Ephemeral temp files | Node-local |
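
The Spaces prefixes shown in the topology diagram (`apps/user_{id}/app_{slug}/…` and `platform/…`) can be captured in a pair of key builders. This is an illustrative sketch: the function names and the path-traversal check are assumptions, and only the prefix layout comes from this page.

```python
from pathlib import PurePosixPath

APP_AREAS = {"uploads", "generated"}
PLATFORM_AREAS = {"db-backups", "exports", "logs", "artifacts"}


def _clean(filename: str) -> str:
    """Normalise a relative object name and reject anything that escapes its prefix."""
    path = PurePosixPath(filename)
    if not filename or path.is_absolute() or ".." in path.parts:
        raise ValueError(f"invalid object name: {filename!r}")
    return str(path)


def app_key(user_id: int, slug: str, area: str, filename: str) -> str:
    """Key for per-app content: apps/user_{id}/app_{slug}/{area}/{filename}."""
    if area not in APP_AREAS:
        raise ValueError(f"unknown app area: {area!r}")
    return f"apps/user_{user_id}/app_{slug}/{area}/{_clean(filename)}"


def platform_key(area: str, filename: str) -> str:
    """Key for platform-level content: platform/{area}/{filename}."""
    if area not in PLATFORM_AREAS:
        raise ValueError(f"unknown platform area: {area!r}")
    return f"platform/{area}/{_clean(filename)}"


if __name__ == "__main__":
    print(app_key(7, "shop", "uploads", "img/logo.png"))
    print(platform_key("db-backups", "nightly.dump"))
```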

3.3 · DNS & domains

| Host | Serves |
| --- | --- |
| `talkide.app` | TalkIDE FE (workspace, Studio) |
| `www.talkide.app` | redirect to apex |
| `api.talkide.app` | TalkIDE BE (REST + SSE) |
| `*.talkide.app` | user apps: `<uuid>` for DEV preview, `<slug>` for PROD published |

Reserved slugs (TalkIDE subdomains, mail/CDN/marketing prefixes, environment names, auth actions) are rejected by the Create Project slug validator. The full list is in the project root CLAUDE.md.
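
The host table and the reserved-slug rule combine into a simple classification. The sketch below is illustrative: `www` and `api` come from the table, the other entries in `RESERVED_SLUGS` are placeholders rather than the real list from CLAUDE.md, and the slug pattern (a single DNS label) is an assumption.

```python
import re
import uuid

BASE_DOMAIN = "talkide.app"

# "www" and "api" are taken from the host table; the rest are illustrative
# placeholders, NOT the real reserved list.
RESERVED_SLUGS = {"www", "api", "mail", "cdn", "dev", "prod", "login"}

# Assumption: a slug is one lowercase DNS label of at most 63 characters.
SLUG_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_uuid(label: str) -> bool:
    """True if label is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(label)) == label.lower()
    except ValueError:
        return False


def classify_host(host: str) -> str:
    """Map a request host to the component that serves it."""
    host = host.lower().rstrip(".")
    if host == BASE_DOMAIN:
        return "frontend"
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        return "unknown"
    label = host[: -len(suffix)]
    if label == "www":
        return "redirect-to-apex"
    if label == "api":
        return "backend"
    if "." in label:
        return "unknown"  # assumption: only one subdomain level is served
    return "user-app-dev" if is_uuid(label) else "user-app-prod"


def validate_slug(slug: str) -> None:
    """Raise ValueError for slugs that are malformed or reserved."""
    if not SLUG_PATTERN.fullmatch(slug):
        raise ValueError(f"malformed slug: {slug!r}")
    if slug in RESERVED_SLUGS:
        raise ValueError(f"reserved slug: {slug!r}")
```

Rejecting reserved slugs at creation time is what keeps a published app from ever shadowing `api.talkide.app` or `www.talkide.app` under the wildcard record.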

4 · Per-tenant Runtime Topology (ADR-024 — worker LIVE, build Jobs planned)

flowchart TB
    subgraph cp["ns: talkide (control-plane — Spring Boot / Kotlin)"]
        BE["TalkIDE BE<br/>identity · tenant/billing data · quota authority<br/>worker orchestration"]
        GW["gateway-proxy<br/>holds raw ANTHROPIC_API_KEY<br/>rate-limit + billing accounting"]
        BE --- GW
    end

    subgraph tns["ns: {tenant}-{env} (e.g. mirek-prod) — ResourceQuota keyed by plan"]
        WK["talkide-worker pod<br/>Node/TS · Agent SDK in-process<br/>stateful: session + transcript on NFS<br/>3-week resume · survives BE redeploy"]
        BJ["gradle build/test Job<br/>ephemeral · bounded · parallel"]
        KJ["Kaniko image build Job<br/>ephemeral"]
    end

    NFS["NFS — per-project working tree"]
    ANT["Anthropic API"]
    REG["Container Registry"]

    BE -->|"create / destroy worker pod (K8s API)"| WK
    WK -->|"agent calls — HMAC token, NOT the API key"| GW
    GW -->|"proxied inference"| ANT
    WK -->|"dispatch (K8s Role-scoped)"| BJ
    WK -->|"dispatch"| KJ
    WK -.->|RWX| NFS
    BJ -.->|RWX + gradle cache PVC| NFS
    KJ --> REG
    WK -->|"SSE stream tokens + events"| BE
    WK -->|"usage / activity reporting (HMAC seam, be#104)"| BE

Thin-seam contract — the dividing line ADR-024 draws:

| Control-plane (BE, `talkide` ns) | Worker (`<tenant>-<env>` ns) |
| --- | --- |
| Identity / JWT issuance | Mara / Anthropic SDK runtime (in-process) |
| Tenant / project / billing persistence | SSE stream to the FE |
| Quota / budget authority | Dispatch of build/test K8s Jobs |
| Worker orchestration (create/destroy) | Usage / activity reporting back to control-plane |
| Gateway policy: holds the raw Anthropic key | none (the worker only ever holds an internal HMAC token) |
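
The last row is the core of the seam: the worker authenticates to the gateway with a signed token instead of the API key. A minimal sketch of such a token follows. The token layout (`payload.signature`), the claim names and the function names are hypothetical and chosen for illustration. The real format is defined by the BE.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, tenant: str, env: str, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Control-plane side: sign a short-lived token scoped to one tenant-environment."""
    now = time.time() if now is None else now
    claims = {"tenant": tenant, "env": env, "exp": int(now) + ttl_seconds}
    payload = json.dumps(claims, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_token(secret: bytes, token: str, now: Optional[float] = None) -> dict:
    """Gateway side: check signature and expiry, then return the claims."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload, signature = _unb64(payload_b64), _unb64(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    claims = json.loads(payload)
    if claims["exp"] <= now:
        raise ValueError("token expired")
    return claims
```

Because only the control-plane knows the signing secret and the Anthropic key, a compromised worker pod in this model can at most replay its own tenant-scoped token until it expires.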

Why hybrid topology: the agent is stateful and I/O-bound → a long-running worker pod; gradle build/test is stateless and bounded → ephemeral Jobs the cluster scheduler can parallelise and OOM-isolate; image build stays a Kaniko Job (ADR-019, unchanged).
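
The "ephemeral, bounded" half of this argument maps onto a few standard Kubernetes Job fields. The sketch below builds such a manifest as a plain dictionary. The image, the PVC name, the resource figures and the timeouts are all hypothetical values picked for illustration, not the planned configuration.

```python
import json


def gradle_job_manifest(namespace: str, project_id: str, task: str = "test") -> dict:
    """Build a batch/v1 Job that runs one gradle task against the project tree."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "generateName": f"gradle-{task}-",
            "namespace": namespace,
            "labels": {"talkide/project": project_id},  # hypothetical label key
        },
        "spec": {
            "backoffLimit": 0,               # bounded: no automatic retries
            "activeDeadlineSeconds": 900,    # bounded: hard wall-clock limit
            "ttlSecondsAfterFinished": 600,  # ephemeral: cleaned up after finishing
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Non-root UID, matching the NFS root-squashing note in 3.2.
                    "securityContext": {"runAsNonRoot": True, "runAsUser": 1000},
                    "containers": [{
                        "name": "gradle",
                        "image": "gradle:8-jdk21",  # hypothetical build image
                        "command": ["gradle", task, "--no-daemon"],
                        "workingDir": f"/workspace/{project_id}",
                        # The memory limit lets an OOM kill only this Job's pod.
                        "resources": {"limits": {"cpu": "2", "memory": "2Gi"}},
                        "volumeMounts": [{"name": "workspace", "mountPath": "/workspace"}],
                    }],
                    "volumes": [{
                        "name": "workspace",
                        # Hypothetical claim name for the RWX NFS volume.
                        "persistentVolumeClaim": {"claimName": "project-workspace"},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(gradle_job_manifest("mirek-prod", "demo-project"), indent=2))
```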

5 · Deploy User-App Flow

How a project goes from Create Project to a published <slug>.talkide.app.

sequenceDiagram
    actor U as User
    participant BE as TalkIDE BE
    participant NFS as NFS
    participant Snap as SnapshotService
    participant Build as Kaniko Build Job
    participant Reg as DO Registry
    participant K8s as K8s API
    participant DB as Data-plane provisioner
    participant Ing as Ingress reconciler

    U->>BE: Create Project (name, slug, accent, stack)
    BE->>NFS: scaffold working tree (.talkide/, backend/, frontend/, CLAUDE.md)
    Note over BE: PROJECT row, status DRAFT

    U->>BE: Apply Version (DEV) or Publish (PROD, ADR-022)
    BE->>Snap: snapshot working tree → tarball
    BE->>Build: enqueue Kaniko Job
    Build->>NFS: read source
    Build->>Reg: push image talkide/{slug}:{version}
    Build-->>BE: image digest

    BE->>K8s: ensure namespace {tenant}-{env} (ADR-015 / ADR-026)
    BE->>DB: ensure per-app schema + role (cluster B, SCRAM)
    BE->>K8s: apply Deployment + Service (same-origin single pod :8080)
    K8s-->>BE: rollout ready

    BE->>Ing: reconcile Ingress
    alt DEV preview
        Ing->>K8s: upsert Ingress host {uuid}.talkide.app
    else PROD published
        Ing->>K8s: upsert Ingress host {slug}.talkide.app
        BE->>BE: git tag version, PROJECT.status = PUBLISHED
    end
    BE-->>U: URL live

    Note over BE,Ing: DEV per build, PROD on explicit Publish only (ADR-026 Environments)
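
The DEV/PROD branch of the Ingress reconciler comes down to choosing the host. A minimal sketch of that choice and of the resulting `networking.k8s.io/v1` Ingress follows. The function names, the `nginx` ingress class and the Service port are assumptions. Only the host patterns and the pod port `:8080` come from this page.

```python
from typing import Optional

BASE_DOMAIN = "talkide.app"


def ingress_host(env: str, slug: str, preview_id: Optional[str] = None) -> str:
    """DEV previews are addressed by UUID, PROD publications by slug."""
    if env == "DEV":
        if not preview_id:
            raise ValueError("DEV preview needs a preview UUID")
        return f"{preview_id}.{BASE_DOMAIN}"
    if env == "PROD":
        return f"{slug}.{BASE_DOMAIN}"
    raise ValueError(f"unknown environment: {env!r}")


def ingress_manifest(namespace: str, service: str, host: str) -> dict:
    """Build an Ingress that routes one host to the user-app Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            "ingressClassName": "nginx",  # assumption: default ingress-nginx class
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {
                        "name": service,
                        "port": {"number": 8080},  # assumed to match the pod port
                    }},
                }]},
            }],
        },
    }


if __name__ == "__main__":
    host = ingress_host("PROD", slug="shop")
    print(ingress_manifest("mirek-prod", "shop", host))
```

Keeping host selection in one pure function makes the "upsert" idempotent: reconciling the same project and environment twice yields the same manifest.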

6 · Diagram maintenance & known mismatches

Mermaid, inline: all overview diagrams are inline Mermaid rendered by the mermaid2 plugin. There is no separate build step, and they version with the prose.

Retired (2026-05-23): the legacy Structurizr model (overview/workspace.dsl) and pre-rendered SVGs in docs/assets/diagrams/*.svg were removed. They predate the two-cluster split (ADR-023) and worker extraction (ADR-024) and are not coming back. Authoritative sources: this page, containers.md, the inline Mermaid in each spec/UC, and ADR-023/024/026.
