Red Hat OpenShift on IBM Cloud (ROKS) and IBM Cloud Kubernetes Service (IKS)

Scope

This file covers the two managed Kubernetes platforms on IBM Cloud -- Red Hat OpenShift on IBM Cloud (ROKS), the IBM-managed OpenShift offering, and IBM Cloud Kubernetes Service (IKS), the IBM-managed upstream Kubernetes offering -- including the ROKS-vs-IKS positioning decision, cluster sizing (worker-node flavors, masters, scaling), the VPC-vs-Classic deployment split that mirrors the broader IBM Cloud infrastructure divide, supported OpenShift / Kubernetes versions and the support window, networking (Calico vs OVN, VPC ALB / NLB, ingress, multi-zone), storage (VPC Block CSI, VPC File CSI, ODF), the IBM Cloud Container Registry integration including Vulnerability Advisor, multi-region and DR patterns, and the relationship with OpenShift on PowerVS for Power-anchored modernization. For platform-level IAM and account setup see providers/ibm/cloud-platform.md. For VPC, Direct Link, and Transit Gateway see providers/ibm/networking.md. For OpenShift platform fundamentals see providers/openshift/*.

Checklist

Platform Choice

[Critical] Is the ROKS vs IKS decision made against the application portfolio's container runtime and operator dependencies -- ROKS for OpenShift-specific tooling (OperatorHub, Source-to-Image, Tekton Pipelines as OpenShift Pipelines, OpenShift Service Mesh, OpenShift GitOps / ArgoCD operator, Quay, OpenShift AI, MachineSets) and for teams with existing OpenShift skills, IKS for upstream Kubernetes workloads with no OpenShift dependency and cost-sensitive deployments? (ROKS adds OpenShift license entitlement to the cluster cost. IKS is the cheaper choice when the workload does not need OpenShift-specific features. Moving from IKS to ROKS after the fact is a workload re-platform, not a cluster upgrade.)
[Critical] Is the OpenShift version chosen against the published support window -- ROKS currently supports 4.18, 4.19, and 4.20 as the actively maintained set with rolling 4.x version availability tied to Red Hat's OpenShift Container Platform release cadence, and clusters on unsupported versions face a forced upgrade or recreate? (IBM publishes per-version support tables with tentative end-of-support dates; do not lock the design to a single 4.x version without a documented upgrade plan to the next minor release.)
[Recommended] For the IKS path, is the Kubernetes version chosen against the upstream support window and IBM's published per-version support table, with cluster autoscaling and rolling worker-pool updates accounted for in the upgrade plan? (IKS supports a smaller set of upstream Kubernetes versions than the entire upstream cadence; lagging too far behind triggers a forced upgrade.)

Deployment Infrastructure

[Critical] Is the VPC vs Classic deployment chosen with explicit awareness that VPC is the strategic direction and the only path for new architectures, while Classic remains for legacy clusters in flight or for the specific patterns that still require classic infrastructure (certain bare-metal worker SKUs, classic-only Direct Link landings)? (VPC clusters use VPC-Gen2 with VPC-native networking, VPC Block / File CSI drivers, and VPC load balancers. Classic clusters use the older SoftLayer stack. Mixing both in a single environment is operationally awkward; pick one per cluster and document migration intent for any classic clusters.)
[Critical] Is the multi-zone deployment model chosen -- single-zone (development and test only), multizone within one region (production baseline, three zones in MZR), or multi-region (DR / GSLB pattern with separate clusters per region) -- and is the minimum of 3 worker nodes per zone met for HA, with the master plane spread across zones automatically by IBM? (A single-zone production cluster is an audit finding; multi-zone is the production minimum. ROKS and IKS master planes are managed and free of charge -- worker nodes are the billed component.)
[Critical] Is the worker-pool sizing modelled against application requirements -- worker flavor (CPU / memory / disk / accelerator class), worker count per zone, autoscaler bounds, and the secondary disk for container storage on classic vs VPC Block CSI on VPC -- with the per-node pod ceiling and per-pool quota considered? (Right-sizing is workload-specific; the common pitfall is choosing too-small workers and then discovering pod density limits force a re-pool. ROKS / IKS support multiple worker pools per cluster with different flavors so different workload classes can coexist.)
[Recommended] Are dedicated host groups evaluated for workloads with tenant-isolation, regulated, or compliance requirements where shared-tenant VPC infrastructure is not acceptable? (Dedicated hosts come at a price premium but provide hardware isolation -- the only path to single-tenant compute on ROKS / IKS without going to classic bare metal.)
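
The worker-pool sizing item above can be roughed out with a few lines of arithmetic. The following is a minimal sketch, not an IBM sizing tool: the function name `workers_per_zone` is hypothetical, the per-node pod ceiling must be taken from the chosen worker flavor (the 110 used in the usage lines is only the common upstream kubelet default, assumed here for illustration), and the 3-workers-per-zone floor is the HA minimum stated in the checklist.

```python
import math


def workers_per_zone(total_pods, pods_per_node, zones=3,
                     min_per_zone=3, tolerate_zone_loss=True):
    """Estimate workers needed in each zone from pod density alone.

    pods_per_node is an input assumption: look it up for the actual
    worker flavor instead of relying on a default.
    """
    if zones < 2 and tolerate_zone_loss:
        raise ValueError("cannot tolerate a zone loss with fewer than 2 zones")
    # If one zone may be lost, the surviving zones must carry every pod.
    carrying_zones = zones - 1 if tolerate_zone_loss else zones
    pods_per_zone = math.ceil(total_pods / carrying_zones)
    needed = math.ceil(pods_per_zone / pods_per_node)
    # Never go below the HA floor from the checklist.
    return max(needed, min_per_zone)


# Illustrative numbers only.
print(workers_per_zone(1200, 110))                            # 6
print(workers_per_zone(1200, 110, tolerate_zone_loss=False))  # 4
print(workers_per_zone(100, 110))                             # 3
```

This covers pod density only; CPU, memory, and disk requests per workload class still decide the flavor, and autoscaler bounds should bracket the result rather than equal it.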

Networking

[Critical] Is the cluster network plugin consistent with the cluster's deployment path -- Calico (the IKS default and ROKS classic default) or OVN-Kubernetes (the ROKS-on-VPC modern default and the OpenShift-wide strategic direction) -- and is the pod CIDR and service CIDR non-overlapping with VPC subnets, other clusters, and on-premises networks? (Cluster CIDR overlaps with on-premises ranges produce broken pod-to-on-premises connectivity over Direct Link / Transit Gateway. The default 172.17.0.0/18 pod and 172.21.0.0/16 service CIDRs frequently collide with real-world enterprise IP plans; override them explicitly at cluster creation.)
[Critical] Is the ingress and load-balancer model chosen -- VPC ALB (the modern default, IBM Cloud Application Load Balancer, integrated with Cloud Internet Services for edge protection), VPC NLB (L4, for non-HTTP workloads or static-IP requirements), OpenShift Router (HAProxy) for OpenShift-native routes, or classic NLB / ALB for classic clusters -- and is the relationship between the OpenShift Router and the IBM-managed external load balancer documented? (The router-on-OpenShift plus VPC ALB pattern is the production shape for ROKS-on-VPC; misconfigured layering produces double-NAT and broken source-IP preservation.)
[Recommended] Is outbound traffic protection enabled on ROKS-on-VPC 4.15+ clusters (the default since 4.15), with the implications understood -- egress is restricted to allow-listed destinations and additional outbound rules must be added for image registries, OperatorHub, telemetry, and external dependencies? (Outbound traffic protection is the IBM Cloud answer to "the cluster makes too many outbound calls to unknown destinations." Disabling it is a deliberate decision, not a default.)

Storage

[Critical] Is the persistent storage class selected per workload -- VPC Block CSI (per-pod block volumes, IOPS tiers, the default for stateful workloads), VPC File CSI (RWX shared filesystem for workloads needing multi-pod access), IBM Cloud Object Storage via the COS plug-in for object-store workloads, or OpenShift Data Foundation (ODF) for OpenShift-native software-defined storage with snapshots, replication, and S3-compatible buckets in-cluster? (ODF is the right answer for OpenShift-anchored stateful workloads needing snapshot and replication semantics; VPC Block CSI is the cheaper default. Mixing all four in one cluster without a documented per-workload mapping is the source of "we have storage but cannot tell what is using what" findings.)
[Recommended] Are storage tiers (IOPS profile for VPC Block, performance tier for VPC File) selected per persistent volume claim rather than relying on a single default class, and is the cost model aware of the tier-driven price delta? (A workload that needs database-tier IOPS deployed on the general-purpose tier produces a performance ticket; the reverse produces a cost ticket.)
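
A documented per-workload mapping can start as a small decision rule. The sketch below encodes only the selection criteria named in this section (object-store workloads, snapshot and replication needs, access mode); the `StorageNeed` type and `pick_storage` function are hypothetical names introduced for illustration, and the order of the checks is an assumption rather than IBM guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageNeed:
    access_mode: str                      # "RWO" or "RWX"
    object_store: bool = False            # workload talks S3-style object APIs
    needs_snapshots_or_replication: bool = False


def pick_storage(need):
    """Map a workload's storage requirement to one of the four options."""
    if need.object_store:
        return "IBM Cloud Object Storage (COS plug-in)"
    if need.needs_snapshots_or_replication:
        return "OpenShift Data Foundation (ODF)"
    if need.access_mode == "RWX":
        return "VPC File CSI"
    if need.access_mode == "RWO":
        return "VPC Block CSI"
    raise ValueError(f"unknown access mode: {need.access_mode!r}")


# Hypothetical workload inventory, for illustration only.
inventory = {
    "orders-db": StorageNeed("RWO"),
    "shared-uploads": StorageNeed("RWX"),
    "ledger-db": StorageNeed("RWO", needs_snapshots_or_replication=True),
    "report-archive": StorageNeed("RWO", object_store=True),
}
for workload, need in inventory.items():
    print(f"{workload}: {pick_storage(need)}")
```

Keeping the mapping in version control next to the cluster definition gives auditors a direct answer to which workload uses which storage option.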

IBM Cloud Container Registry

[Critical] Is IBM Cloud Container Registry (ICR) used as the primary private registry -- with Vulnerability Advisor enabled for image scanning, trusted-content policies (image signing / Notary) enabled where applicable, namespace-scoped IAM for access control, and regional registries chosen for image-pull locality -- rather than relying on the in-cluster OpenShift internal registry for cross-cluster image distribution? (The OpenShift internal registry is fine for in-cluster build outputs; it is not the right answer for organization-wide image distribution, multi-cluster pull, or external consumer access. ICR with Vulnerability Advisor is the supported IBM-managed answer.)
[Recommended] Is the image-pull-secret pattern automated -- either via the IBM-managed default pull secret reconciled per namespace, or via an external secrets controller (External Secrets Operator, IBM Secrets Manager integration) -- rather than long-lived hand-rolled docker-config secrets? (Hand-rolled pull secrets stop working silently when API keys rotate; the failure mode is "all new pods fail to start with ImagePullBackOff" hours after a routine key rotation.)

Identity, Operations, and DR

[Critical] Is cluster IAM integrated with platform IAM -- platform users granted access via Access Groups, Trusted Profiles used for compute-resource-to-IBM-Cloud-service authentication from within pods (eliminating long-lived Service ID API keys baked into deployments), and the OpenShift OAuth integration pointed at the same enterprise IdP (Entra ID, Okta, ADFS) used for the rest of IBM Cloud? (See providers/ibm/cloud-platform.md for the IAM model. The single most common ROKS / IKS audit finding is "the cluster has its own user model unrelated to enterprise IdP" -- which means employee offboarding does not actually remove cluster access.)
[Recommended] Is observability wired up -- IBM Cloud Logs (the current platform logging product, which replaces the LogDNA-based IBM Log Analysis) for cluster and application logs, IBM Cloud Monitoring (the current monitoring product, formerly IBM Cloud Monitoring with Sysdig) for metrics, and Activity Tracker for control-plane audit events -- with retention aligned to the compliance regime? (Default cluster logging captures only short-window data in-cluster; long retention requires an explicit logging instance and routing configuration.)
[Recommended] Is the multi-region DR pattern explicit -- two clusters in two regions with GSLB via Cloud Internet Services, application-level data replication (Db2 on Cloud cross-region, COS cross-region, in-app replication), and a documented runbook for fail-over -- rather than relying on cluster-level backups alone? (ROKS / IKS clusters are not themselves replicated across regions; the DR pattern is "two clusters" plus an application-level replication strategy, not a managed cross-region cluster.)

Modernization Adjacencies

[Recommended] For Power-anchored modernization (an existing IBM i / AIX core on PowerVS or on-premises Power), is the relationship between ROKS on VPC (for modernized x86 microservices) and OpenShift on PowerVS (for Power-native containerized workloads or Power-specific runtimes) understood and documented? (Both are managed OpenShift offerings, but they run on different hardware and have different licensing implications. The typical pattern is ROKS-on-VPC for the modern x86 tier with cross-cluster traffic to PowerVS-resident services over Power Edge Router (PER) + Transit Gateway; full OpenShift-on-PowerVS adoption is a narrower case.)

Why This Matters

ROKS and IKS are the IBM-managed container platforms, and the choice between them is structural, not cosmetic. ROKS bundles the OpenShift license entitlement and adds the OpenShift ecosystem (OperatorHub, Source-to-Image, OpenShift Pipelines and GitOps, OpenShift Service Mesh, OpenShift AI). IKS is upstream Kubernetes without the OpenShift surface and the OpenShift cost. Picking IKS for a workload portfolio that ends up depending on OpenShift operators is a re-platform; picking ROKS for a portfolio that never uses any OpenShift-specific feature is paying for entitlement that is not consumed. The decision is best made on the application-portfolio dependency inventory, not on team preference.

The VPC-vs-Classic choice mirrors the broader IBM Cloud platform split. VPC is the strategic direction with OVN-Kubernetes networking, VPC Block / File CSI storage, VPC ALB / NLB load balancing, and integrated IAM access. Classic is the legacy path that remains for clusters in flight and for specific bare-metal-on-classic patterns that have not been replicated on VPC. Greenfield clusters should be VPC-only. Designs that include classic clusters need a documented retirement plan; "we have classic because we have classic" is not a design.

The OpenShift version cadence on ROKS is tightly coupled to Red Hat's OCP release cadence -- IBM publishes per-version support windows with tentative end-of-support dates, and clusters on unsupported versions face a forced upgrade or a recreate. The 4.18 / 4.19 / 4.20 set is the actively-maintained envelope as of 2026; locking a design to a single version without a documented upgrade plan to the next is a multi-year operational debt. Cluster upgrades for OpenShift require admin acknowledgement of removed APIs and operator compatibility; the operations team that does not own an upgrade calendar will find one imposed.
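
An upgrade calendar starts with knowing where each cluster sits relative to the supported set. The following is a minimal sketch under stated assumptions: the supported set is hard-coded from the 4.18 / 4.19 / 4.20 envelope named above and would need to be refreshed from IBM's published support table, and `support_status` and its status labels are hypothetical names, not part of any IBM tooling.

```python
# Snapshot of the supported set quoted in the text; refresh it from the
# published per-version support table before relying on it.
SUPPORTED_MINORS = frozenset({(4, 18), (4, 19), (4, 20)})


def parse_minor(version):
    """Extract (major, minor) from strings such as '4.18.12' or '4.18'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def support_status(version, supported=SUPPORTED_MINORS):
    """Classify a cluster version against the recorded supported set."""
    current = parse_minor(version)
    if current in supported:
        # The oldest supported minor is next in line to lose support.
        if current == min(supported):
            return "supported-oldest"
        return "supported"
    if current < min(supported):
        return "unsupported"
    return "unknown"  # newer than the recorded set, so the snapshot is stale


# Hypothetical cluster inventory, for illustration only.
for name, version in {"prod-eu": "4.18.5", "dev-us": "4.17.30"}.items():
    print(f"{name}: {version} -> {support_status(version)}")
```

Clusters reported as "supported-oldest" are the ones that belong on the next upgrade window.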

Networking on ROKS / IKS has the same characteristic gotchas as every other managed Kubernetes platform, plus one that is specific to IBM Cloud: the default pod CIDR 172.17.0.0/18 and service CIDR 172.21.0.0/16 frequently collide with real-world enterprise IP plans, and overlaps with on-premises ranges produce silent pod-to-on-premises connectivity failures over Direct Link or Transit Gateway. Both CIDRs are locked at cluster creation; getting them right up front is mandatory. The ingress shape -- OpenShift Router on ROKS, ALB / NLB on IKS, CIS in front for edge protection -- needs a documented per-workload owner; misconfigured layering produces double-NAT and broken source-IP preservation that surface as application-level bugs, not network-level errors.
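
Overlap between the cluster CIDRs and the rest of the IP plan can be checked before cluster creation with the Python standard library. This is a minimal sketch: the two default CIDRs come from the paragraph above, while the `find_overlaps` helper and the enterprise ranges are hypothetical and stand in for a real IP plan.

```python
import ipaddress

# Default cluster CIDRs quoted in the text above.
DEFAULT_POD_CIDR = "172.17.0.0/18"
DEFAULT_SERVICE_CIDR = "172.21.0.0/16"


def find_overlaps(cluster_cidrs, existing_cidrs):
    """Return (cluster_cidr, existing_cidr) pairs whose ranges overlap."""
    overlaps = []
    for cluster in cluster_cidrs:
        cluster_net = ipaddress.ip_network(cluster)
        for existing in existing_cidrs:
            if cluster_net.overlaps(ipaddress.ip_network(existing)):
                overlaps.append((cluster, existing))
    return overlaps


# Hypothetical enterprise IP plan (VPC subnets, other clusters,
# on-premises ranges), for illustration only.
enterprise_plan = ["10.0.0.0/8", "172.16.0.0/12", "192.168.10.0/24"]

for cluster, existing in find_overlaps(
    [DEFAULT_POD_CIDR, DEFAULT_SERVICE_CIDR], enterprise_plan
):
    print(f"{cluster} overlaps {existing}")
```

With this sample plan both defaults fall inside 172.16.0.0/12, which is the kind of collision that argues for overriding the CIDRs explicitly at creation time.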

IBM Cloud Container Registry with Vulnerability Advisor is the supported registry for production workloads on ROKS / IKS. The OpenShift internal registry is fine for build outputs but is not the right answer for organization-wide image distribution, cross-cluster pull, or external consumer access. The image-pull-secret pattern needs automation (the IBM-managed default pull secret reconciled per namespace, or External Secrets Operator integration) rather than hand-rolled docker-config secrets; hand-rolled secrets stop working silently on API key rotation and produce multi-hour outages.

Multi-region DR is "two clusters, application-level replication, GSLB" -- not a managed cross-region cluster. The architecture choice that determines DR quality is the application-level data replication strategy, not the cluster configuration. Customers expecting managed-Kubernetes DR equivalent to RDS Multi-AZ are mis-mapping their mental model.

Common Decisions (ADR Triggers)

ROKS vs IKS -- ROKS (OpenShift entitlement included, OperatorHub, OpenShift-specific tooling, larger cost) vs IKS (upstream Kubernetes, smaller cost, no OpenShift features). Decide on application-portfolio dependency, not on team preference.
VPC vs Classic cluster -- VPC (modern, strategic, OVN networking, VPC CSI, the only path for greenfield) vs Classic (legacy, required for specific bare-metal or classic-only patterns). Greenfield should be VPC.
Multi-zone within one region vs multi-region -- multi-zone (three zones in MZR, production HA baseline, single cluster) vs multi-region (two clusters in two regions, GSLB via CIS, application-level replication, full DR). Multi-zone is the production minimum; multi-region is for tier-1 DR-required workloads.
Worker-flavor and pool topology -- single homogeneous worker pool (simple, cheaper at small scale) vs multiple specialized pools (different flavors per workload class, dedicated pools for GPU / memory-intensive / general-purpose). Specialized pools are mandatory for non-trivial workload portfolios.
VPC Block CSI vs VPC File CSI vs ODF -- Block (per-pod RWO volumes, IOPS tiers, cheaper) vs File (RWX shared filesystem, multi-pod access) vs ODF (in-cluster software-defined storage with snapshots, replication, S3 buckets, larger operational surface). Pick by access mode and snapshot requirements.
IBM Cloud Container Registry vs OpenShift internal registry vs third-party (Quay, Harbor) -- ICR (IBM-managed, Vulnerability Advisor, regional registries, default choice) vs OpenShift internal (build outputs only) vs third-party (Quay for OpenShift-native workflows, Harbor for self-hosted compliance). ICR is the default for production.
Outbound traffic protection on/off -- on (default 4.15+, restricts egress to allow-listed destinations, higher security, larger ops surface to maintain allow-lists) vs off (open egress, easier ops, weaker security posture). Production should leave it on; the allow-list maintenance is the cost of doing it right.
OpenShift OAuth pointed at enterprise IdP vs OpenShift-local users -- enterprise IdP (lifecycle managed by IdP, no orphaned cluster access on employee offboarding, mandatory for enterprise deployment) vs OpenShift-local users (break-glass admin accounts only). Local users in production is an audit-finding pattern.
ROKS-on-VPC vs OpenShift-on-PowerVS for Power-anchored modernization -- ROKS-on-VPC for the modern x86 microservice tier with cross-cluster traffic to PowerVS-resident services vs full OpenShift-on-PowerVS for Power-native containerized workloads requiring Power runtimes (ppc64le images). The first is the typical pattern; the second is a narrower case.

Reference Links

Red Hat OpenShift on IBM Cloud overview -- ROKS architecture, version support, getting started
IBM Cloud Kubernetes Service overview -- IKS architecture, version support, getting started
ROKS supported OpenShift versions -- per-version support table with end-of-support dates
Creating a VPC cluster -- VPC-Gen2 deployment, worker flavors, multizone setup
ROKS / IKS locations -- current MZR / SZR footprint per offering
VPC Block Storage CSI driver -- IOPS tiers and storage class configuration
IBM Cloud Container Registry overview -- ICR, namespaces, Vulnerability Advisor
Vulnerability Advisor -- image vulnerability scanning for ICR
OpenShift Data Foundation on ROKS -- ODF deployment on VPC clusters
IBM Cloud DNS / Global Load Balancer for multi-cluster DR -- GSLB via CIS for cross-region failover