AI and ML Managed Services¶

Scope¶

Cross-provider strategy for consuming managed AI and ML services: build vs buy decisions for AI capabilities, foundation model selection (proprietary vs open-source), API-based AI vs self-hosted models, data residency and compliance for AI services, cost patterns (token-based, per-transaction, provisioned throughput), responsible AI and content safety, model evaluation and benchmarking, RAG architecture decisions. Does not cover GPU infrastructure or MLOps pipelines — see patterns/ai-ml-infrastructure.md. Does not cover provider-specific service details — see provider files.

Checklist¶

Why This Matters¶

AI managed services are the fastest-growing cloud spend category. Token-based pricing makes costs unpredictable without monitoring — a single application with poorly optimized prompts or unnecessary context can consume tens of thousands of dollars monthly. Organizations that skip model evaluation and deploy the most capable (and expensive) model for every use case overspend dramatically. Provisioned throughput commitments lock in capacity but waste money if utilization is low. The cost difference between models can be 10-100x for similar quality on narrow tasks, making model selection one of the highest-leverage cost decisions in modern cloud architectures.

Data privacy is the most common blocker for enterprise AI adoption. Many AI services process data through shared infrastructure, and model providers' data retention policies vary significantly. Regulated industries (healthcare, finance, government) require explicit confirmation that customer data is not used for model training and that data residency requirements are met. Responsible AI guardrails are not optional — without content filtering and output validation, AI services can generate harmful, biased, or factually incorrect content that creates legal and reputational risk. Organizations that defer responsible AI decisions to post-launch face costly retrofits and potential compliance violations.

Common Decisions (ADR Triggers)¶

ADR: AI Service Consumption Model¶

Context: The organization needs AI capabilities and must decide between consuming managed API services or hosting models on its own infrastructure.

Options:

Criterion	API-Based Managed Service	Self-Hosted Open-Source Models	Hybrid
Operational complexity	Low — provider manages infrastructure, scaling, and model updates	High — requires GPU provisioning, model serving (vLLM, TGI), monitoring, and updates	Medium — manage self-hosted for some workloads, API for others
Cost model	Per-token or per-request — variable, scales with usage	GPU instance hours — fixed infrastructure cost regardless of usage	Mixed — optimize each workload for its cost profile
Data privacy	Data sent to provider — must verify retention and training opt-out policies	Data stays in your infrastructure — full control over data lifecycle	Sensitive data on self-hosted, non-sensitive on API
Model selection	Limited to provider catalog (typically 5-20 models)	Any open-source model (thousands available on Hugging Face)	Best of both — proprietary quality + open-source flexibility
Latency	Network round-trip + provider queue time (50-5000ms typical)	Network-local, no queue (10-500ms typical with dedicated GPU)	Route latency-sensitive requests to self-hosted
Scaling	Automatic but subject to provider rate limits and throttling	Manual capacity planning, limited by GPU availability	Scale API usage elastically, self-hosted for baseline

Decision drivers: Data sensitivity and regulatory requirements, request volume and cost projections, latency requirements, model customization needs, team GPU/ML infrastructure expertise, and vendor lock-in tolerance.

ADR: Foundation Model Selection¶

Context: The application requires a foundation model for one or more AI tasks and must choose among available proprietary and open-source options.

Options: - Proprietary API models (Anthropic Claude, GPT-4, Gemini): Highest general quality, frequent updates, no infrastructure management. Highest per-token cost, vendor lock-in, data leaves your infrastructure. - Large open-source models (Llama 70B+, Mixtral 8x22B): Near-proprietary quality on many tasks, full data control, no per-token cost. Requires significant GPU infrastructure (4-8+ GPUs), operational expertise, and slower update cycles. - Small open-source models (Llama 8B, Mistral 7B, Gemma 7B): Good quality for narrow tasks, runs on single GPU, lowest infrastructure cost. Lower quality on complex reasoning, requires task-specific evaluation, may need fine-tuning. - Task-specific models (embedding models, classification models, code models): Optimized for specific task types, often smaller and faster. Limited to their trained task, requires separate model for each capability.

Decision drivers: Task complexity (complex reasoning favors larger models), latency budget, cost per request at projected volume, data residency requirements, and whether the task is well-defined (narrow tasks can use smaller models) or open-ended (favors larger models).

ADR: Domain Adaptation Strategy¶

Context: The foundation model needs domain-specific knowledge or behavior that it does not exhibit out of the box.

Options: - Prompt engineering: Add instructions, examples, and constraints to the prompt. Zero training cost, immediate iteration, works with any model. Limited by context window size, increases per-request token cost, prompt injection risk. - RAG (Retrieval-Augmented Generation): Retrieve relevant documents and inject them into the prompt context. Keeps knowledge up-to-date without retraining, works with any model, auditable sources. Requires vector database infrastructure, retrieval quality directly affects output quality, adds latency. - Fine-tuning: Train the model on domain-specific examples. Highest quality for well-defined tasks, reduces prompt size, can encode domain behavior. Requires labeled training data (hundreds to thousands of examples), compute cost for training, and retraining when the base model updates. - Combination (prompt engineering + RAG + fine-tuning): Layer approaches — fine-tune for behavior, RAG for knowledge, prompts for task instructions. Highest quality ceiling but highest complexity and cost.

Decision drivers: Availability and volume of training data, how frequently domain knowledge changes, acceptable latency, budget for ongoing training compute, and whether the adaptation is behavioral (fine-tuning) or knowledge-based (RAG).

ADR: Provisioned Throughput vs On-Demand¶

Context: The production workload has steady-state AI request volume and requires consistent latency and availability.

Options: - On-demand: Pay per token/request with no commitment. Zero waste at low volume, automatic access to latest models. Subject to throttling during peak demand, higher per-token cost, no latency guarantees. - Provisioned throughput: Reserve model capacity at a fixed hourly/monthly rate. Guaranteed throughput and consistent latency, lower effective per-token cost at high utilization. Wasted spend if utilization is below commitment, requires capacity planning, commitment period (1 month to 1 year). - Mixed (on-demand + provisioned baseline): Provision for steady-state, burst to on-demand for peaks. Optimizes cost and availability. More complex capacity planning, requires monitoring and adjustment.

Decision drivers: Request volume predictability, latency consistency requirements, budget for committed spend, and willingness to manage capacity planning.

ADR: Vector Database Selection for RAG¶

Context: The RAG architecture requires a vector database to store and retrieve document embeddings.

Options:

Criterion	pgvector (PostgreSQL)	Managed Vector Service (OpenSearch Serverless, Azure AI Search, Vertex Vector Search)	Dedicated Vector DB (Pinecone, Qdrant, Weaviate)
Operational overhead	Low — extension on existing PostgreSQL	Medium — managed service but separate infrastructure	Medium — additional service to manage (SaaS or self-hosted)
Scale limit	Millions of vectors (adequate for most enterprise RAG)	Billions of vectors with auto-scaling	Billions of vectors, purpose-built for scale
Query performance	Good for < 10M vectors, degrades at scale without tuning	High — optimized for vector search at scale	Highest — purpose-built indexing and retrieval
Cost	Minimal — uses existing database	Per-query + storage, can be expensive at scale	Per-vector + query, varies widely by provider
Integration complexity	Low — SQL interface, familiar tooling	Medium — provider-specific SDK and query syntax	Medium — dedicated client libraries and APIs
Hybrid search	Limited (text + vector requires additional setup)	Built-in (keyword + vector + metadata filtering)	Built-in (most support hybrid search natively)

Decision drivers: Vector count (millions vs billions), existing infrastructure (already running PostgreSQL?), query latency requirements, hybrid search needs (combining keyword and vector search), and team familiarity with the technology.

ADR: Single-Provider vs Multi-Provider AI Strategy¶

Context: The organization must decide whether to standardize on one AI provider or use multiple providers for different capabilities.

Options: - Single provider: Standardize on one platform (e.g., all on Bedrock, all on Azure OpenAI). Simplified operations, consolidated billing, consistent API patterns. Vendor lock-in, limited model selection, single point of failure for all AI capabilities. - Multi-provider with abstraction layer: Use multiple providers behind an AI gateway or abstraction layer. Access to best-of-breed models, provider failover, competitive pricing leverage. Higher operational complexity, prompt compatibility challenges across models, abstraction layer maintenance. - Multi-provider without abstraction: Use each provider's native SDK directly. Lowest initial setup, full access to provider-specific features. Tightly coupled application code, difficult to switch providers, no centralized monitoring or cost tracking.

Decision drivers: Model quality requirements (some models are only available on specific platforms), risk tolerance for vendor lock-in, operational complexity budget, and whether the organization has a broader cloud provider standardization strategy.

ADR: AI Gateway / Proxy Layer¶

Context: The organization uses AI services across multiple applications and needs centralized management of API access, cost tracking, and observability.

Options: - No gateway (direct API calls): Each application manages its own API keys, rate limiting, and error handling. Simplest initial setup, no additional infrastructure. Decentralized cost tracking, duplicated retry/fallback logic, no centralized observability. - Open-source gateway (LiteLLM, Portkey, custom): Deploy a proxy layer that normalizes API calls across providers. Centralized key management, unified logging, model routing. Requires infrastructure to host and operate, community support varies, may lag behind provider API changes. - Commercial AI gateway: Use a managed gateway service with built-in analytics, cost management, and compliance features. Richest feature set, vendor support. Additional cost, potential latency overhead, dependency on gateway provider availability.

Decision drivers: Number of AI-consuming applications, number of AI providers in use, cost visibility requirements, compliance and audit logging needs, and team capacity to operate additional infrastructure.