
Architecture Session Workflow

This document defines the systematic process for conducting architecture design sessions using the Architect tool and knowledge library.

Prerequisites

  • Architect API is running and accessible
  • API base URL is set via ARCHITECT_API_URL environment variable (default: http://localhost:30010/api/v1)
  • Knowledge files are available at knowledge/
  • Knowledge embeddings are indexed (check via GET /knowledge/reindex/status; if not indexed, run POST /knowledge/reindex)

Workflow Steps

Step 1: Read Project Context

Read the project from the API:

  • GET /clients — find or create the client
  • GET /clients/{id}/projects — find or create the project
  • Note the cloud_providers array and description
  • Create a version if one doesn't exist

Step 2: Identify Architecture Pattern

Read the project description and infer the architecture pattern. Confirm with the user before proceeding.

Available patterns:

  • three-tier-web — web application with presentation, application, and data tiers
  • microservices — decomposed services with independent deployment
  • data-pipeline — data ingestion, transformation, and storage
  • static-site — pre-built assets served via CDN
  • hybrid-cloud — spanning on-prem and cloud
  • event-driven — event sourcing, CQRS, pub/sub
  • multi-cloud — multiple public cloud providers
  • edge-computing — edge/IoT with cloud aggregation
  • ai-ml-infrastructure — GPU compute, training pipelines, model serving
  • saas-multi-tenant — multi-tenant SaaS application
  • cdn-fronted-onprem — on-prem with external CDN
  • disaster-recovery-implementations — DR-focused design
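Pattern inference is ultimately a judgment call confirmed with the user, but a first-pass suggestion can be mechanical. A minimal sketch, assuming a hand-made keyword map (the keywords below are illustrative, not part of the workflow):

```python
# Naive keyword heuristic for suggesting an architecture pattern from the
# project description. The keyword map is an illustrative assumption; the
# suggestion must still be confirmed with the user per Step 2.
PATTERN_KEYWORDS = {
    "microservices": ["microservice", "service mesh", "independent deployment"],
    "data-pipeline": ["etl", "ingestion", "data pipeline", "transformation"],
    "static-site": ["static site", "cdn", "pre-built"],
    "event-driven": ["event sourcing", "cqrs", "pub/sub"],
    "three-tier-web": ["web application", "three tier", "presentation tier"],
}

def suggest_pattern(description: str):
    """Return the first pattern whose keywords appear in the description."""
    text = description.lower()
    for pattern, keywords in PATTERN_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return pattern
    return None  # no match: ask the user directly
```

When no keyword matches (or several do), fall back to asking the user — the confirmation step happens either way.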

Step 3: Load Knowledge Files

Load files in this order:

Always load:

  1. General files (knowledge/general/*.md) — all files. These ask WHAT decisions need to be made.
  2. Failure patterns (knowledge/failures/*.md) — anti-patterns to avoid.

Load based on project:

  1. Provider files (knowledge/providers/{provider}/*.md) — for each provider in the project's cloud_providers. These ask HOW to implement with that provider.
  2. Pattern file (knowledge/patterns/{pattern}.md) — for the identified pattern.

Load if applicable:

  1. Compliance files (knowledge/compliance/{framework}.md) — if compliance requirements are identified.
  2. Framework files (knowledge/frameworks/{provider}-well-architected.md) — for the Well-Architected review pass at the end.

Load for on-prem projects:

  1. On-prem specific — general/load-balancing-onprem.md, general/networking-physical.md, general/cost-onprem.md if any provider is on-prem (nutanix, vmware, openstack).
  2. Cross-cutting tools — providers/hashicorp/* if Terraform/Vault/Consul are in scope, providers/prometheus-grafana/* for on-prem observability, providers/ceph/* for Ceph storage.
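The rule-based loading above can be sketched as a single selection function. This is a hedged sketch assuming the knowledge/ layout described in this document; note the on-prem general files are already picked up by the general/*.md glob, and cross-cutting tool directories (hashicorp, prometheus-grafana, ceph) can simply be passed in as extra providers when in scope:

```python
from pathlib import Path

def select_knowledge_files(root, providers, pattern, frameworks=()):
    """Build the rule-based load list from Step 3 (sketch, layout assumed)."""
    root = Path(root)
    files = sorted(root.glob("general/*.md"))        # always: WHAT decisions
    files += sorted(root.glob("failures/*.md"))      # always: anti-patterns
    for p in providers:                              # HOW, per provider
        files += sorted(root.glob(f"providers/{p}/*.md"))
    for fw in frameworks:                            # compliance, if identified
        files.append(root / "compliance" / f"{fw}.md")
    files.append(root / "patterns" / f"{pattern}.md")
    return files
```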

Step 3b: RAG Discovery Pass

After loading knowledge files by rule, run a vector search to discover additional relevant content that the taxonomy might not connect. This supplements the rule-based loading — it does not replace it.

  1. Search with the project description:

    POST /knowledge/search
    {"query": "<project description + cloud providers + pattern>", "top_k": 20, "exclude_files": ["<already loaded files>"]}
    

  2. Review results for cross-cutting items — the vector search may surface relevant checklist items from:

     • Compliance files not initially loaded (e.g., a GDPR item triggered by a data residency mention)
     • Provider files for adjacent technologies (e.g., Ceph storage items for an OpenStack project)
     • Pattern files that partially overlap (e.g., DR items for a migration project)
     • General files with niche items that apply (e.g., virtual appliance migration for VMware projects)

  3. Load any additional files that have multiple high-scoring hits (score > 0.5). These indicate the file is genuinely relevant to the project scope.

  4. Note individual high-scoring items from files you choose not to fully load — ask these as supplementary questions during Step 4.
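The filtering rules above can be expressed directly. A sketch, assuming the /knowledge/search response is a list of hit dicts with "file" and "score" keys (shape assumed, check the API schema):

```python
from collections import Counter

def files_to_load(hits, threshold=0.5, min_hits=2):
    """Files with multiple high-scoring hits — the 'genuinely relevant' bar."""
    high = Counter(h["file"] for h in hits if h["score"] > threshold)
    return sorted(f for f, n in high.items() if n >= min_hits)

def supplementary_items(hits, loaded, threshold=0.5):
    """High-scoring items from files not fully loaded — candidates for
    supplementary questions during Step 4."""
    to_load = set(files_to_load(hits, threshold))
    return [h for h in hits
            if h["score"] > threshold
            and h["file"] not in to_load
            and h["file"] not in loaded]
```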

Step 4: Systematic Questioning

Walk through the loaded knowledge files' checklist items. Process in this order:

  1. Critical items first — items tagged [Critical] across all loaded files
  2. Recommended items — items tagged [Recommended]
  3. Optional items — items tagged [Optional], ask only if time/budget allows

For each checklist item:

  • Ask the user ONE question at a time
  • Record the question via POST /versions/{id}/questions
  • Record the answer via PATCH /versions/{id}/questions/{id}
  • If the answer implies an architectural decision, create an ADR IMMEDIATELY via POST /versions/{id}/adrs
  • Track coverage via POST /versions/{id}/coverage (record which item was addressed)
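The order of API calls per item matters (question before answer, ADR immediately, coverage last). A sketch of one pass of the loop, assuming a hypothetical thin `client` wrapper with post/patch methods over the endpoints listed in the API Reference:

```python
def process_item(client, version_id, item, get_answer):
    """One pass of the Step 4 loop: ask, record, decide, track.
    `client` is a hypothetical thin wrapper over the Architect API."""
    q = client.post(f"/versions/{version_id}/questions",
                    {"text": item["question"]})
    answer = get_answer(item["question"])           # one question at a time
    client.patch(f"/versions/{version_id}/questions/{q['id']}",
                 {"answer": answer})
    if item.get("implies_decision"):                # ADR immediately, not later
        client.post(f"/versions/{version_id}/adrs",
                    {"title": item["question"], "decision": answer})
    client.post(f"/versions/{version_id}/coverage",  # track which item was hit
                {"item": item["id"], "question_id": q["id"]})
```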

MANDATORY: RAG-Augmented Questioning

After recording each answer, check the suggestions array in the question response. These are vector search results — checklist items from non-loaded files that are semantically related to the answer. For each suggestion with score > 0.5:

  • If it raises a concern not yet addressed, ask it as a follow-up question
  • If it confirms an existing decision, note it in the coverage
  • If it's from a file not yet loaded, consider loading that file

Additionally, combine knowledge library content with general domain expertise when formulating questions and evaluating answers. The knowledge library provides specific checklist items and configuration details; general expertise provides context about why those items matter, how they interact, and what the trade-offs are. Use all three sources together:

  • Knowledge library provides the WHAT — specific items to check, configuration values, vendor-specific details
  • General expertise provides the WHY — architectural reasoning, trade-off analysis, experience-based judgment
  • RAG search provides the CONNECTIONS — related items across the library that the taxonomy doesn't link

MANDATORY: Hosting Model Question (for on-prem platforms)

If the project uses VMware, Nutanix, or OpenStack, ask this Critical question FIRST before any infrastructure questions:

  • VMware: "On-premises, VMware Cloud on AWS, Azure VMware Solution, or Google Cloud VMware Engine?"
  • Nutanix: "On-premises or Nutanix Cloud Clusters (NC2) on AWS/Azure?"
  • OpenStack: "Self-hosted or hosted OpenStack provider?"

This determines the entire infrastructure layer — on-prem requires physical hardware design, cloud-hosted eliminates it entirely.

MANDATORY: Category Coverage Gate

Before proceeding to Step 6 (diagrams), verify ALL applicable knowledge categories have been addressed. Print this full checklist and confirm each:

IMPORTANT: This checklist must be generated dynamically by scanning the knowledge/ directory. Do NOT rely on a hardcoded list — new knowledge files are added regularly. At the start of each session, scan knowledge/general/*.md, knowledge/providers/{provider}/*.md, knowledge/patterns/{pattern}.md, knowledge/compliance/*.md, and knowledge/failures/*.md to build the complete checklist for the project.

General Categories (scan knowledge/general/*.md for ALL files):
[ ] Compute — instance types, sizing, scaling, HA
[ ] Networking — segmentation, load balancing, DNS, CDN
[ ] Data — database, backup, replication, encryption
[ ] Security — access control, secrets, encryption
[ ] Observability — monitoring, logging, alerting
[ ] Disaster Recovery — RPO/RTO, failover, backup strategy
[ ] Cost — budget, estimates, optimization
[ ] Deployment — CI/CD, IaC tool, strategy
[ ] Identity — authentication, authorization
[ ] Physical Infrastructure — host sizing, network switches (if on-prem)

Conditional General Categories (include when applicable):
[ ] Inventory Analysis — if raw VM/server inventory data was provided
[ ] Virtual Appliance Migration — if virtual appliances (F5, Infoblox, etc.) exist
[ ] VDI Migration Strategy — if VDI/Horizon workloads are in scope
[ ] Multi-Site Migration Sequencing — if multiple sites are being migrated
[ ] Physical Server Scope — if physical servers exist alongside VMs
[ ] Colocation Constraints — if any sites are in colocation facilities
[ ] Facility Lifecycle — if facility lease expiry or decommission is a factor
[ ] Workload Migration — if migrating from one platform to another
[ ] Database Migration — if databases are being migrated
[ ] Capacity Planning — if sizing new infrastructure
[ ] Hardware Sizing — if on-prem hardware procurement is needed

Provider-Specific (scan knowledge/providers/{provider}/*.md for ALL files):
[ ] {provider}/compute — provider-specific compute items
[ ] {provider}/networking — provider-specific networking items
[ ] {provider}/storage — provider-specific storage items
[ ] {provider}/security — provider-specific security items
[ ] {provider}/observability — provider-specific monitoring items
[ ] {provider}/data-protection — provider-specific backup/DR items
[ ] {provider}/migration-tools — provider-specific migration tooling (if migrating)
[ ] {provider}/* — any other provider files loaded

Pattern-Specific (if applicable):
[ ] {pattern} — all checklist items from the pattern file
[ ] hypervisor-migration — if migrating between hypervisors (VMware to Nutanix, etc.)

Compliance (if applicable):
[ ] {compliance framework} — all Critical items from the compliance file

Failure Patterns:
[ ] failures/networking — verified no networking anti-patterns
[ ] failures/data — verified no data anti-patterns
[ ] failures/scaling — verified no scaling anti-patterns
[ ] failures/security — verified no security anti-patterns
[ ] failures/deployment — verified no deployment anti-patterns

Do NOT proceed to diagrams if any category shows uncovered Critical items. Record all coverage items via POST /versions/{id}/coverage.
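The gate above reduces to a simple check: no category may have uncovered Critical items. A sketch, assuming `categories` maps category name to checklist items with "id" and "severity" keys (shape assumed) and `coverage` is the set of addressed item ids from GET /versions/{id}/coverage:

```python
def coverage_gate(categories, coverage):
    """Return uncovered Critical items per category; empty dict means it is
    safe to proceed to Step 6 (diagrams)."""
    gaps = {}
    for cat, items in categories.items():
        missing = [i["id"] for i in items
                   if i["severity"] == "Critical" and i["id"] not in coverage]
        if missing:
            gaps[cat] = missing
    return gaps
```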

Step 5: Gap Analysis

After walking through all checklist items:

  • Review any unchecked items — ask if they should be addressed or deferred
  • Ask: "What else am I missing?" for novel requirements not covered by checklists
  • Check coverage via GET /versions/{id}/coverage — verify no Critical items are unaddressed
  • Check the failure patterns — verify no anti-patterns are present in the design
  • Note any gaps discovered for addition to knowledge files later

MANDATORY: RAG Gap Scan

Run a final vector search using the accumulated project context (all questions, answers, and ADR titles) to find any checklist items that were never addressed:

  1. Search with each major ADR decision:

    POST /knowledge/search
    {"query": "<ADR title + decision summary>", "top_k": 5, "exclude_files": ["<loaded files>"]}
    

  2. Search with the overall project summary — combine all answered questions into a project summary and search for anything the library covers that wasn't asked about.

  3. Flag any Critical items from the results that have no corresponding coverage entry — these are potential gaps that the rule-based loading missed.
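Flagging step 3 is a set difference between the final sweep's results and the recorded coverage. A sketch, assuming search results carry "item_id" and "severity" fields (shape assumed, check the API schema):

```python
def flag_gaps(search_results, covered_ids):
    """Critical items surfaced by the final RAG sweep that have no coverage
    entry — potential gaps the rule-based loading missed."""
    return [r for r in search_results
            if r.get("severity") == "Critical"
            and r["item_id"] not in covered_ids]
```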

Step 6: Generate Diagrams

Only after the architecture is fully specified, generate diagrams:

Engine Selection

  • Python diagrams library: when cloud-provider icons are needed (AWS, Azure, GCP, Nutanix, VMware, OpenStack icons). ALWAYS prefer this for cloud architecture diagrams.
  • D2: only for non-cloud-specific diagrams (sequence diagrams, generic flowcharts, process flows).

Detail Levels

Generate diagrams at appropriate detail levels:

  • Conceptual (sort_order: 0) — high-level overview for stakeholders. Major components and data flows only.
  • Logical (sort_order: 1) — service boundaries, network layout, AZs/regions for architects.
  • Detailed (sort_order: 2) — specific resources, security groups, configurations for engineers.
  • Security-focused (sort_order: 3) — if compliance is involved, a dedicated security controls diagram.
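The detail levels map naturally onto artifact payloads for POST /versions/{id}/artifacts. A sketch — the field names here are illustrative assumptions, so check the actual API schema:

```python
DETAIL_LEVELS = [
    ("conceptual", 0, "Major components and data flows"),
    ("logical", 1, "Service boundaries, network layout, AZs/regions"),
    ("detailed", 2, "Specific resources, security groups, configurations"),
]

def diagram_artifacts(version_name, compliance=False):
    """Build one artifact payload per detail level, adding the dedicated
    security-controls diagram only when compliance is involved."""
    levels = list(DETAIL_LEVELS)
    if compliance:
        levels.append(("security-focused", 3, "Security controls"))
    return [{"name": f"{version_name}-{level}", "sort_order": order,
             "description": desc} for level, order, desc in levels]
```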

Process

  1. Create artifact via POST /versions/{id}/artifacts
  2. Trigger render via POST /versions/{id}/artifacts/{id}/render
  3. Verify render status is "success"
  4. If render fails, fix source code and re-render

Step 7: IaC Planning

After the architecture is specified, plan the Infrastructure as Code:

  1. Ask which IaC tool(s) — this is a user decision, not assumed. Options vary by provider (see general/iac-planning.md). Create an ADR for the choice.
  2. Define module structure — group resources by tier, service, or environment.
  3. Define state management — remote backend, locking, environment separation.
  4. Create resource inventory — list every resource to be provisioned with:
     • IaC module assignment
     • Provider resource type
     • Complexity level (Simple / Moderate / Complex)
  5. Estimate IaC effort — based on resource count and complexity.
  6. Note manual steps — any resources provisioned outside IaC (bootstrap, one-time setup).
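The effort estimate can be derived mechanically from the resource inventory. A minimal sketch — the hours-per-complexity weights are illustrative placeholders, not standards:

```python
# Illustrative weights only; calibrate against the team's actual velocity.
HOURS = {"Simple": 1, "Moderate": 3, "Complex": 8}

def estimate_iac_effort(inventory):
    """Sum estimated hours over a resource inventory whose entries carry a
    "complexity" field (Simple / Moderate / Complex)."""
    return sum(HOURS[r["complexity"]] for r in inventory)
```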

Include the IaC plan in the design document.

Step 8: Generate Documents

Create supporting documents:

  • Cost Estimate — itemized cost breakdown based on the architecture decisions
  • Architecture Document — if requested, summarizing all decisions (can use POST /templates/render for starting structure)

Step 9: Well-Architected Review

Run the relevant Well-Architected Framework checklist as a final review pass:

  • Load knowledge/frameworks/{provider}-well-architected.md
  • Walk through each pillar's checklist items
  • Track any findings as questions or ADRs
  • This is a validation step, not a design step — it should confirm the design is sound

Step 10: Generate Design Document

Create a comprehensive design document artifact that compiles all project data:

  • Executive summary (project scope, providers, pattern, cost)
  • Changes from previous version (if applicable)
  • All questions and answers (table)
  • All ADRs (full text)
  • Architecture diagrams (embedded as images)
  • Infrastructure details (components table, configuration)
  • Service descriptions and dependencies
  • Implementation details per component (see below)
  • Cost estimate (with version comparison if applicable)
  • IaC plan (tool, module structure, resource inventory with complexity, effort estimate)
  • Resilience and data protection (HA, backup, RPO/RTO)
  • POC to Production gap (if POC pattern)
  • Coverage checklist (which knowledge categories were addressed, deferred, or N/A)
  • Success criteria

MANDATORY: Implementation Details from Knowledge Library + RAG

For EVERY major component in the architecture (each database, compute tier, network element, storage, security control, monitoring system), generate an implementation detail section by:

  1. Cross-reference the relevant knowledge file for that component
  2. Run a RAG search for the component to find additional relevant items across the entire library:
    POST /knowledge/search
    {"query": "<component name + technology + key decisions>", "top_k": 10}
    
    This surfaces related checklist items from compliance files, failure patterns, and other providers that the rule-based loading may not have connected.
  3. Extract Critical and Recommended checklist items that apply (from both loaded files AND RAG results)
  4. Generate a configuration table with specific settings, not just the decision name:
     • Setting name
     • Recommended value (based on project context: scale, RPO/RTO, compliance)
     • Rationale
  5. Include security configuration from the knowledge file
  6. Include monitoring/alerting recommendations
  7. Include common mistakes to avoid from failure pattern files
  8. Consult vendor documentation (via reference links) for current best practices
  9. Apply general domain expertise — the knowledge library provides specific checklist items, but implementation details should also reflect architectural best practices, performance tuning guidance, and operational experience that go beyond what any checklist covers

Example — instead of just "Aurora Multi-AZ", include:

| Setting | Value | Rationale |
| --- | --- | --- |
| Engine | Aurora PostgreSQL 16 | Latest stable |
| Instance class | db.r6g.xlarge | Sized for workload |
| Multi-AZ | Enabled | RPO requirement |
| Backup retention | 14 days | Enterprise standard |
| Encryption | KMS CMK | At-rest encryption |
| Enhanced monitoring | 60s | Performance visibility |
| Performance Insights | Enabled | Query analysis |
| Security group | Allow 5432 from app SG only | Least privilege |
| Deletion protection | Enabled | Prevent accidents |

This transforms the design document from a decision summary into an implementation-ready specification.

The design document should be auto-generated from the API data — not written manually.

Coverage Artifact: Also create a separate "Coverage Checklist" artifact that shows:

  • Every knowledge file loaded for this version
  • Every Critical item: addressed (with question ID), deferred (with reason), or N/A
  • Every Recommended item: addressed or skipped
  • Summary: X of Y Critical items addressed, Z deferred
  • Fetch via GET /versions/{id}/coverage
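The summary line is a straightforward aggregation of the coverage entries. A sketch, assuming each entry from GET /versions/{id}/coverage carries "severity" and "status" fields (shape assumed):

```python
def coverage_summary(items):
    """Render the 'X of Y Critical items addressed, Z deferred' line."""
    crit = [i for i in items if i["severity"] == "Critical"]
    addressed = sum(1 for i in crit if i["status"] == "addressed")
    deferred = sum(1 for i in crit if i["status"] == "deferred")
    return (f"{addressed} of {len(crit)} Critical items addressed, "
            f"{deferred} deferred")
```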

Step 11: Retrospective

After the session:

  • Note any knowledge gaps discovered during the session
  • Create issues or PRs to add missing items to knowledge files
  • This ensures the knowledge library improves with every project

Version Changes

When creating a new version of an existing project (e.g., migrating a component, adding a service, changing a provider), follow this process:

Step A: Identify What Changed

Document the change clearly:

  • What component is being added, removed, or replaced?
  • Why is the change being made?
  • What version did it change from?

Step B: Load Knowledge for Changed Components

Load the knowledge files specific to the new or changed components. For example:

  • Replacing in-cluster PostgreSQL with RDS → load providers/aws/rds-aurora.md
  • Adding a CDN → load providers/cloudflare/cdn-dns.md or providers/aws/cloudfront-waf.md
  • Moving from K3s to EKS → load providers/aws/containers.md

Step C: Walk Through New Checklist Items

Walk through the checklist items in the newly loaded knowledge files — these are questions that weren't relevant in the previous version but are now. Ask one at a time, create ADRs immediately.

Do not skip this step. Every component change introduces new decisions that must be explicitly addressed, not assumed.

Step D: Update Artifacts

  1. Clone artifacts from the previous version
  2. Update diagrams to reflect the change
  3. Update the cost estimate with a comparison table (old vs new)
  4. Regenerate the design document with a "Changes from vX.Y.Z" section

Step E: Create Version Change ADR

Create an ADR documenting the version change itself:

  • What changed and why
  • What was considered
  • What the consequences are (cost, complexity, risk)

API Reference

All endpoints use the base URL from the ARCHITECT_API_URL environment variable.

| Action | Method | Endpoint |
| --- | --- | --- |
| List clients | GET | /clients |
| Create client | POST | /clients |
| List projects | GET | /clients/{id}/projects |
| Create project | POST | /clients/{id}/projects |
| Create version | POST | /projects/{id}/versions |
| Create question | POST | /versions/{id}/questions |
| Answer question | PATCH | /versions/{id}/questions/{id} |
| Create ADR | POST | /versions/{id}/adrs |
| Create artifact | POST | /versions/{id}/artifacts |
| Trigger render | POST | /versions/{id}/artifacts/{id}/render |
| Export PDF | GET | /versions/{id}/artifacts/{id}/export-pdf |
| List templates | GET | /templates |
| Render template | POST | /templates/render |
| Record coverage | POST | /versions/{id}/coverage |
| Update coverage | PATCH | /versions/{id}/coverage/{id} |
| List coverage | GET | /versions/{id}/coverage |
| Coverage summary | GET | /versions/{id}/coverage/summary |
| Search knowledge (RAG) | POST | /knowledge/search |
| Reindex knowledge | POST | /knowledge/reindex |
| Reindex status | GET | /knowledge/reindex/status |
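A minimal client sketch for the endpoints above, using only the standard library. It resolves the base URL from ARCHITECT_API_URL with the documented default; authentication and error handling are out of scope here:

```python
import json
import os
import urllib.request

class ArchitectClient:
    """Thin sketch of an Architect API client (URL handling only)."""

    def __init__(self, base=None):
        self.base = (base or os.environ.get(
            "ARCHITECT_API_URL", "http://localhost:30010/api/v1")).rstrip("/")

    def url(self, path):
        return self.base + path

    def request(self, method, path, body=None):
        # JSON in, JSON out; raises on HTTP errors.
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            self.url(path), data=data, method=method,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
```

Usage follows the table directly, e.g. `client.request("POST", "/versions/7/adrs", {...})`.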

Rules

  1. Ask one question at a time — never batch questions
  2. Create ADRs immediately — every architectural decision gets an ADR, don't wait
  3. Use stack-specific icons — always prefer Python diagrams library for cloud diagrams
  4. Do thorough analysis first — don't generate diagrams until the architecture is fully specified
  5. Track everything — every question and answer goes through the API
  6. Check failure patterns — verify the design doesn't match known anti-patterns
  7. Feed back gaps — any missing knowledge items get added to the library
  8. Version changes require new questions — every component change triggers a knowledge file review and new checklist walkthrough for the changed components
  9. Generate design documents — every version gets a comprehensive design document compiled from API data
  10. IaC tool is a user decision — always ask which IaC tool(s) to use, never assume. Create an ADR for the choice.
  11. Include IaC plan — every design document includes a resource inventory with module structure and complexity estimates
  12. Consult vendor documentation — use WebFetch to check reference links in knowledge files when answering questions about pricing, feature availability, configuration, or service limits. Don't rely solely on training data.
  13. Use RAG at every stage — run vector searches after loading files (Step 3b), after each answer (Step 4 suggestions), during gap analysis (Step 5), and during artifact generation (Step 10). The knowledge library is comprehensive but taxonomically organized — RAG finds the cross-cutting connections.
  14. Combine all three knowledge sources — the knowledge library provides specific checklists, RAG provides cross-cutting discovery, and general domain expertise provides architectural reasoning and trade-off analysis. Use all three together for every decision. Never rely on only one source.