Architecture
The data flow
1. Sources
docforge ingests from two source types:
- Confluence spaces via the REST API v2. Pages are fetched by ID (configured in `sources.yml`), authenticated with an email + API token. Content is pulled as Confluence storage-format HTML.
- Local git repositories on disk. The crawler matches configured glob patterns (default: `README.md`, `CLAUDE.md`, `docs/**/*.md`). It does not clone remote URLs: clone first, then point docforge at the checkout.
Each source gets a stable identifier (`confluence_page_id` or file path) and a SHA-256 `content_hash` computed from the raw content.
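The hashing step is straightforward; a minimal sketch (the helper name is illustrative, not docforge's actual function):

```python
import hashlib


def content_hash(raw: bytes) -> str:
    """SHA-256 hex digest of the raw source content, used for change detection."""
    return hashlib.sha256(raw).hexdigest()


# Identical content yields an identical hash, so unchanged sources
# can be skipped on the next ingest run.
h1 = content_hash(b"# README\nHello")
h2 = content_hash(b"# README\nHello")
assert h1 == h2 and len(h1) == 64  # hex digest of a 256-bit hash
```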
2. Ingest — docforge ingest
- Deduplicate. Compare `content_hash` against what's stored. Matching hashes skip re-processing.
- Parse. BeautifulSoup splits HTML into semantic sections (`<h1>`, `<h2>`, paragraphs, code blocks). Confluence macros are handled where meaningful.
- Chunk. Token-aware splitter (default 500 tokens). Respects section boundaries; splits paragraphs only when a section exceeds the limit. Section titles are prepended to each chunk for context.
- Embed. Sentence-transformers loads Qwen3-Embedding-4B (Apache 2.0, 1024-dim). Falls back to `all-MiniLM-L6-v2` (384-dim) if the primary load fails.
- Store. `sources` (metadata + hash) and `chunks` (text + embedding + HNSW index) tables in Postgres. `ON DELETE CASCADE` keeps `chunks` consistent with `sources`.
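The chunking step can be sketched as a greedy packer that keeps whole paragraphs together and prepends the section title. This is an illustration, not docforge's actual splitter, and it counts whitespace-separated words as a stand-in for a real tokenizer:

```python
def chunk_section(title: str, paragraphs: list[str], limit: int = 500) -> list[str]:
    """Greedy token-aware chunking: pack whole paragraphs until the limit,
    prepending the section title to every chunk for retrieval context.
    Word count stands in for a real tokenizer in this sketch."""
    tokens = lambda s: len(s.split())
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        # Flush the current chunk if adding this paragraph would exceed the limit.
        if current and tokens(" ".join(current)) + tokens(para) > limit:
            chunks.append(title + "\n" + "\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append(title + "\n" + "\n".join(current))
    return chunks
```

A real implementation would also split an oversized single paragraph, which this sketch leaves intact.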
Per-source errors are isolated: one bad Confluence page does not abort the run; a summary lists failures at the end.
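The error-isolation behavior amounts to a try/except around each source with a failure summary at the end. A minimal sketch (names are illustrative):

```python
def ingest_all(sources, process):
    """Process each source independently: one failure does not abort the run.
    Failures are collected and summarized at the end. Illustrative sketch,
    not docforge's actual code."""
    failures = []
    for src in sources:
        try:
            process(src)
        except Exception as exc:
            failures.append((src, exc))
    if failures:
        print(f"{len(failures)} source(s) failed:")
        for src, exc in failures:
            print(f"  {src}: {exc}")
    return failures
```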
3. Storage — Postgres + pgvector
- `sources` table: metadata (type, URL, title, tags, `content_hash`, `last_crawled_at`, status).
- `chunks` table: text, section title, 1024-dim `embedding`, foreign key to source.
- HNSW index on `embedding` for cosine-similarity search (`vector_cosine_ops`).
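In pgvector terms, that schema looks roughly like the following sketch; column names beyond those listed above are assumptions, not docforge's actual DDL:

```sql
CREATE TABLE sources (
    id              bigserial PRIMARY KEY,
    type            text NOT NULL,        -- e.g. 'confluence' or 'git'
    url             text,
    title           text,
    tags            text[],
    content_hash    text NOT NULL,
    last_crawled_at timestamptz,
    status          text
);

CREATE TABLE chunks (
    id            bigserial PRIMARY KEY,
    source_id     bigint REFERENCES sources(id) ON DELETE CASCADE,
    section_title text,
    text          text NOT NULL,
    embedding     vector(1024)
);

-- HNSW index for approximate cosine-similarity search.
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```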
The whole index fits in a Standard_B1ms Postgres Flexible Server for a corpus under ~50K chunks.
4. Serve
Two surfaces: one in-process (CLI), one hosted (multi-user team deployment):
- `docforge serve` — FastMCP server over stdio. Local single-user use (Claude Code, Cursor with MCP). Loads the embedding model in-process.
- `docforge serve --api` — FastAPI over HTTP. Hosted deployment with multiple users via Entra ID authentication. Since v0.3 Phase 4b, the API offloads embedding to a separate embedder Container App by setting `EMBEDDER_URL`. Search API replicas drop from ~2 GB RSS to ~400 MB and cold-start in ~30 s (just container spin-up; no model load). The embedder hosts the model behind a shared-secret bearer token (`EMBEDDER_TOKEN`); the GPU-backed Qwen3-Embedding-4B embedder loads the ~10 GB model into VRAM in 2-3 minutes, so run with `minReplicas: 2` to avoid cold starts in production.
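The API-to-embedder call is a plain authenticated HTTP request. A sketch of how a search replica might build it; the `/embed` path and JSON body shape are assumptions for illustration, not docforge's documented API:

```python
import json
import urllib.request


def build_embed_request(texts: list[str], url: str, token: str) -> urllib.request.Request:
    """Build the HTTP request a search replica would send to the embedder app.
    In the API process, url and token come from EMBEDDER_URL / EMBEDDER_TOKEN."""
    return urllib.request.Request(
        f"{url.rstrip('/')}/embed",               # assumed endpoint path
        data=json.dumps({"texts": texts}).encode(),
        headers={
            "Authorization": f"Bearer {token}",   # shared-secret bearer token
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Keeping the model out of the search replicas is what drives the ~2 GB to ~400 MB RSS drop: the replica only serializes text and parses vectors back.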
Both surfaces expose a single primary tool: `search_documentation(query, user_name, team_name, area_name?, limit?)`. Results include source URL, title, and section attribution.
What docforge is not
- A chat UI. docforge has no frontend; it hands context to whatever assistant calls it.
- A multi-tenant SaaS. docforge assumes a single-company trust boundary — authenticated users can query any indexed source.
- A hybrid retrieval engine. Retrieval is dense-only today (cosine similarity on embeddings). BM25 fusion is on the roadmap.
- A permission-aware RAG. There are no per-document ACLs at query time.
These are conscious scope decisions. If you need any of them, Onyx is likely a better fit.