Deploy to Azure

For a single developer, docforge serve on stdio is enough — Claude Code or Cursor spawns the process. For team use, you want a hosted HTTP API so every teammate’s assistant can hit the same index.

Seven Azure resources in one resource group (~€900/month at default SKUs in West Europe with the Qwen3-Embedding-4B GPU embedder on a workload-profile environment):

  • Postgres Flexible Server (Burstable B1ms, 32 GB) with pgvector enabled at provisioning time.
  • Container App running docforge serve --api with Entra ID authentication enabled (1 vCPU / 1 GiB).
  • Container App running the Qwen3-Embedding-4B embedder on a GPU workload profile (NC8as_T4). The search API delegates embedding to this service via EMBEDDER_URL, keeping the API replicas small and fast to start.
  • Container Registry (Standard — required for the ~13.6 GB embedder image; ACR Basic’s 10 GB quota is too small).
  • Key Vault (Standard) holding CONFLUENCE_API_TOKEN, HF_TOKEN, and database credentials.
  • Log Analytics workspace (30-day retention) for Container App logs.
  • Container Apps managed environment (workload profiles: a Consumption profile for the search API, the NC8as_T4 GPU profile for the embedder).

Teammates use a lightweight local MCP client that forwards tool calls to the hosted API.

Bicep templates under deploy/azure/ in the repo cover:

  • Postgres Flexible Server with pgvector installed at provisioning time.
  • Container App environment with 1 always-on search-api replica (cold start is ~30 s of container spin-up; since the v0.3 Phase 4b embedder split, the search API no longer loads the model in-process). The GPU-backed Qwen3-Embedding-4B embedder runs with minReplicas: 2; the ~10 GB model takes 2-3 minutes to load into T4 VRAM, so keeping it always on avoids that cold start in production.
  • Managed identity for pulling from Key Vault.
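As an illustration of the replica settings above, the embedder's scale block in Bicep might look like the fragment below. This is a sketch with illustrative names, image, and API version; the actual templates live under deploy/azure/ in the repo.

```bicep
// Illustrative fragment: scale settings for the GPU embedder app.
// Resource names, profile name, and image are examples only.
resource embedder 'Microsoft.App/containerApps@2024-03-01' = {
  name: 'docforge-embedder'
  location: resourceGroup().location
  properties: {
    environmentId: env.id              // Container Apps managed environment
    workloadProfileName: 'gpu-t4'      // the NC8as_T4 workload profile
    template: {
      containers: [
        {
          name: 'embedder'
          image: 'docforgeacr.azurecr.io/docforge-embedder:latest'
        }
      ]
      scale: {
        minReplicas: 2  // keep the ~10 GB model loaded in VRAM
        maxReplicas: 2
      }
    }
  }
}
```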

Set auth.mode: entra in docforge.yml and provide AZURE_TENANT_ID + AZURE_CLIENT_ID via environment. The FastAPI app validates JWTs against your tenant’s OpenID config and logs the authenticated user_oid to query_log.
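Two pieces of that flow can be sketched in a few lines: the tenant's OpenID discovery URL (standard Microsoft identity platform v2.0 shape) and extracting the oid claim that becomes user_oid. The claims helper below deliberately skips signature verification and is for illustration only; the real API must verify the RS256 signature against the tenant's jwks_uri (for example with PyJWT's PyJWKClient) before trusting any claim.

```python
import base64
import json


def openid_config_url(tenant_id: str) -> str:
    """Discovery document listing the issuer and jwks_uri for the tenant
    (standard Microsoft identity platform v2.0 URL shape)."""
    return (
        f"https://login.microsoftonline.com/{tenant_id}"
        "/v2.0/.well-known/openid-configuration"
    )


def unverified_claims(token: str) -> dict:
    """Decode the JWT payload WITHOUT signature verification.

    Illustration only: production code must verify the signature
    against the tenant's JWKS before reading any claim. The "oid"
    claim is the Entra object ID logged as user_oid in query_log.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

After verification, the API reads the oid claim from the validated token and writes it to query_log alongside the query.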

See threat-model.md in the repo for the full trust model (single-tenant, single-company, authenticated users trusted).

Run docforge ingest from anywhere that can reach the database (a jump box, GitHub Actions runner, or the container itself). Ingest is idempotent — safe to schedule on cron.
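Because ingest is idempotent, a scheduled GitHub Actions workflow is one way to run it. The sketch below is an assumption, not shipped with the repo: the workflow name, schedule, install step, and secret names (DATABASE_URL, CONFLUENCE_API_TOKEN) are all illustrative.

```yaml
# .github/workflows/ingest.yml (illustrative sketch)
name: docforge-ingest
on:
  schedule:
    - cron: '0 3 * * *'   # nightly at 03:00 UTC
  workflow_dispatch: {}    # allow manual runs
jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install .
      - run: docforge ingest
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
          CONFLUENCE_API_TOKEN: ${{ secrets.CONFLUENCE_API_TOKEN }}
```

The runner only needs network reach to the Postgres server; since ingest is safe to re-run, a failed night simply catches up on the next scheduled run.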

  • Query telemetry: the query_log table records every search (user_oid, query, request_ms, timestamp). Retention defaults to 180 days; a cleanup loop inside the API deletes rows older than that.
  • Latency: python -m docforge.scripts.latency_report --since '7 days' prints P50/P95/P99 from query_log.request_ms.
  • Health: GET /health is unauthenticated and DB-independent; wire it to the Container App liveness probe.
  • Cold-start window. With minReplicas=1, search-api avoids container cold starts in steady state; after a deployment, the first request pays a ~30 s container spin-up (no model load, since that moved to the embedder in Phase 4b). The GPU-backed embedder takes 2-3 minutes to load the ~10 GB Qwen3-Embedding-4B model into T4 VRAM; run it with minReplicas: 2 to keep it warm. All cold-start latency is included in P95 as an honest signal.
  • Orphan pruning. When you remove a source from sources.yml, run docforge ingest --purge-orphans (dry-run) and then --confirm to delete. No auto-purge.
  • Backups. Postgres Flexible Server Standard_B1ms gets 7-day PITR by default. Test the restore procedure annually:
    az postgres flexible-server restore \
    --resource-group <rg> \
    --name <new-server-name> \
    --source-server <source-server-name> \
    --restore-time '<ISO-8601 timestamp within last 7 days>'
    The restore creates a new server; the source is untouched. After verifying the restore, drop the new server.
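The query_log retention cleanup described under query telemetry amounts to one periodic DELETE. The function below is an assumed shape (the real loop lives inside the docforge API); it is written against the generic DB-API so the same code works with psycopg against Postgres (placeholder "%s") or sqlite3 in tests (placeholder "?").

```python
import datetime

RETENTION_DAYS = 180  # matches the documented query_log default


def purge_old_queries(conn, placeholder: str = "%s",
                      retention_days: int = RETENTION_DAYS) -> int:
    """Delete query_log rows older than the retention window and
    return how many rows were removed.

    Sketch of the cleanup loop's body; column name "timestamp" is
    taken from the query_log description above.
    """
    cutoff = (datetime.datetime.now(datetime.timezone.utc)
              - datetime.timedelta(days=retention_days))
    cur = conn.cursor()
    # The cutoff is passed as an ISO-8601 string, which Postgres casts
    # to timestamptz and which compares correctly as text in sqlite
    # when all rows share the same UTC offset.
    cur.execute(
        f"DELETE FROM query_log WHERE timestamp < {placeholder}",
        (cutoff.isoformat(),),
    )
    conn.commit()
    return cur.rowcount
```

Scheduling this inside the API (rather than a separate cron job) keeps retention enforcement deployed and versioned with the service itself.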