IBM Power Management Plane (HMC, FSP, PowerVC, LPM)
Scope
This file covers the management plane of an on-premises IBM Power estate: the Hardware Management Console (HMC) and its virtual variant (vHMC), the Flexible Service Processor (FSP) firmware coordination across multi-frame fleets, IBM PowerVC for OpenStack-style provisioning and image management, Live Partition Mobility (LPM) prerequisites and operational practice, and the broader management-network and console-access design that wraps around them. It is the companion to providers/ibm/power-onprem.md, which covers the workload-side concerns (LPAR sizing, AIX/IBM i support, backup, HA, ILMT). Management-plane mistakes typically surface at migration time as "we cannot move this workload without an outage" surprises -- mismatched HMC code levels, FSP firmware drift between source and target frames, missed LPM prerequisites, ad-hoc HMC-only provisioning that does not scale. For the PowerVS-hosted parallel of PowerVC see providers/ibm/powervs.md; for Skytap-on-Azure where the management plane is provider-operated see providers/skytap/cloud.md.
Checklist
Hardware Management Console (HMC)
- [Critical] Is the HMC topology dual-HMC per managed system (one primary, one secondary, both peers with simultaneous management capability), with each HMC connected to both FSP ports on each managed system through an isolated private management network? (IBM supports a maximum of two HMCs per managed system. With dual-FSP systems (E-class), each HMC must have an active connection to both FSPs to survive an FSP failure during a planned maintenance window. A single-HMC estate cannot perform DLPAR, LPM, firmware update, or partition activation while the HMC is being serviced -- and an HMC failure during an unplanned outage extends the outage by hours.)
- [Critical] Are HMC code levels aligned with the Power generation being managed and with the firmware levels of the managed systems, per the published HMC and System Firmware Supported Combinations matrix? (HMC V10 R2 M1030 supports Power8/Power9/Power10. HMC V11 R1 M1110 supports Power9/Power10/Power11 but drops Power8 support. Power11 firmware (FW1110.20) requires HMC V11.1.1110.0 minimum. Upgrading an HMC connected to a FW1050-level system to V11 R1 M1110 requires system firmware at MH1050_058 (FW1050.20) or higher first. Mismatches block firmware updates and LPM operations.)
- [Critical] Is the choice of physical HMC (7063-CR2 appliance) vs virtual HMC (vHMC) on PowerVM vs vHMC on x86 (KVM, VMware, Hyper-V) documented per site, with the operational tradeoffs explicit? (vHMC reduces hardware footprint and simplifies DR (the appliance is a VM), but vHMC on x86 and vHMC on PowerVM use different PTF streams (7042/x86 vs 7063/POWER) and they are not interchangeable. vHMC has no graphics-adapter support, local console is command-line only, and DLPAR of the HMC LPAR itself is not supported. Production estates often run one physical HMC plus one vHMC as the secondary.)
- [Recommended] Is the HMC network design a dedicated, isolated service network -- separate from production VLANs, with each HMC running as a DHCP server on a non-overlapping IP range for FSP discovery, and routable only through audited jump hosts? (Flat networks that mix HMC, FSP, and production traffic are a common audit finding. The recommended IBM topology is a private management LAN per HMC with the FSPs as DHCP clients; production traffic should never traverse the service network.)
- [Recommended] Is HMC user access role-based (hmcsuperadmin, hmcoperator, hmcviewer, hmcservicerep, hmcpe), federated to the enterprise IdP where supported, with command-line and GUI activity captured to a centralised audit log retained for the compliance-required window? (Most estates run with a handful of shared root-equivalent accounts and no audit trail -- both a compliance finding and a forensics blind spot. HMC supports LDAP/Kerberos for authentication; centralised syslog collection from the HMC is straightforward and rarely configured.)
- [Recommended] Is the HMC backup, restore, and patch cadence operationalised -- HMC critical-console-data backup before every code update, PTF apply windows scheduled with the managed-system firmware cycle, and DR-side HMC reachable from the DR site? (An HMC failure with no recent console-data backup means rebuilding partition profiles by hand. The console-data backup is a single command (bkconsdata) and is routinely skipped.)
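The generation boundaries quoted in the code-level item above can be turned into a small pre-upgrade check. The following is a minimal sketch, not an IBM tool: the level labels, the frame names, and the unsupported_frames helper are assumptions made for this example, and the support sets contain only what this checklist states. The published supported-combinations matrix remains the authority.

```python
# Generation support per HMC level, copied from the checklist text above.
# The dictionary keys are illustrative labels, not an IBM-defined format.
HMC_SUPPORTED_GENERATIONS = {
    "V10R2M1030": {"Power8", "Power9", "Power10"},
    "V11R1M1110": {"Power9", "Power10", "Power11"},
}


def unsupported_frames(hmc_level, frames):
    """Return the names of frames whose generation the HMC level cannot manage.

    frames maps a frame name (hypothetical) to its Power generation.
    """
    supported = HMC_SUPPORTED_GENERATIONS[hmc_level]
    return sorted(name for name, gen in frames.items() if gen not in supported)


# Hypothetical mixed fleet.
fleet = {"frame-a": "Power8", "frame-b": "Power9", "frame-c": "Power10"}

print(unsupported_frames("V10R2M1030", fleet))  # []
print(unsupported_frames("V11R1M1110", fleet))  # ['frame-a']
```

A non-empty result for the intended HMC level is the signal to plan the Power8 retirement (or a separate HMC for those frames) before the upgrade is scheduled.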
Flexible Service Processor (FSP)
- [Critical] Is FSP firmware level inventoried across every managed system and aligned to a documented baseline -- with the understanding that the Power9/Power10/Power11 firmware release schedule (refreshed by IBM through 2025-2026) drives the upgrade calendar? (Firmware drift between frames in an HA pair, or between LPM source and target frames, blocks LPM and complicates support cases. IBM publishes a per-generation supported-firmware roadmap; estates that do not track it discover end-of-service after the fact.)
- [Critical] Is each firmware update classified as concurrent vs disruptive before the maintenance window is scheduled -- with explicit awareness that systems not managed by an HMC always update disruptively, that Power10 SBE (Self Boot Engine) updates inside a concurrent service pack add 3-5 minutes per processor chip at the next IPL, and that attached EMX0 PCIe Gen3 I/O expansion drawers with EMXH fanout modules force a disruptive update? (Concurrent updates run while LPARs continue to run; disruptive updates require partition shutdown. Misclassifying a disruptive update as concurrent is how outages happen.)
- [Recommended] On dual-FSP enterprise systems (E1080, E1050, E1150, E1180), is FSP failover configured and tested -- with both FSPs reachable from both HMCs -- so that a single FSP failure does not isolate the managed system from its HMCs? (Dual-FSP redundancy only works if both FSPs are wired, both HMCs see both FSPs, and the failover path has been validated. Estates often discover the secondary FSP was never cabled when the primary fails.)
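The concurrent-versus-disruptive rules in the classification item above are simple enough to encode as a pre-window check. This is a hedged sketch: the ManagedSystem fields, the frame names, and classify_update are hypothetical names for this example, and the only overrides modelled are the two this checklist names (no HMC, and an EMX0 drawer with EMXH fanout modules). The service pack's own published classification is passed in by the caller.

```python
from dataclasses import dataclass


@dataclass
class ManagedSystem:
    name: str                 # hypothetical frame name
    hmc_managed: bool         # False when no HMC manages the system
    has_emx0_with_emxh: bool  # EMX0 drawer with EMXH fanout modules attached


def classify_update(system, pack_is_concurrent):
    """Return (kind, reason) where kind is "concurrent" or "disruptive".

    pack_is_concurrent is the classification published for the service
    pack itself; the two overrides below come from the checklist text.
    """
    if not system.hmc_managed:
        return "disruptive", "system is not HMC-managed"
    if system.has_emx0_with_emxh:
        return "disruptive", "EMX0 drawer with EMXH fanout attached"
    if not pack_is_concurrent:
        return "disruptive", "service pack is published as disruptive"
    return "concurrent", "no disruptive condition found"


# Hypothetical fleet: one clean frame, one with the I/O drawer constraint.
fleet = [
    ManagedSystem("frame-a", hmc_managed=True, has_emx0_with_emxh=False),
    ManagedSystem("frame-b", hmc_managed=True, has_emx0_with_emxh=True),
]
for frame in fleet:
    kind, reason = classify_update(frame, pack_is_concurrent=True)
    print(f"{frame.name}: {kind} ({reason})")
```

Running the check against the inventory before the window is booked keeps the classification decision out of the maintenance window itself.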
PowerVC
- [Critical] Is the choice between PowerVC Standard Edition (infrastructure-as-a-service: image catalog, automated placement, multi-system management, SAN/network drivers) and PowerVC for Private Cloud (adds self-service portal, approval workflows, deploy templates, metering for chargeback) made deliberately, or is provisioning still being handled directly through HMC alone? (PowerVC is the right answer for any estate with frequent LPAR churn, self-service requirements, or more than a few frames; HMC-only provisioning becomes a bottleneck above that scale. Current version is PowerVC for Private Cloud 2.3.x (2.3.2 ships with a bundled RHEL instance, removing the BYOL requirement for the PowerVC management LPAR itself).)
- [Critical] Are the PowerVC storage and network drivers verified against the actual fabric -- supported SAN arrays (IBM FlashSystem / Storwize / DS8000, EMC VMAX/VNX/Unity/PowerMax, NetApp ONTAP, HPE 3PAR/Primera, Hitachi VSP), supported Fibre Channel switches (Brocade, Cisco MDS), supported network types (SEA, SR-IOV, NPIV)? (PowerVC drives zoning, LUN creation, LUN mapping, masking, snapshots, and LUN copy on supported arrays directly. An unsupported array works only with manual SAN provisioning and forfeits much of PowerVC's value; verify the array+firmware combination against the current PowerVC support matrix before designing on it.)
- [Recommended] Are the image catalog and capture workflow designed -- AIX mksysb-based or PowerVC capture-based golden images, image-versioning discipline, image-update cadence to absorb TLs and SPs, and IBM i image management (which has its own constraints around Licensed Internal Code and IBM i release)? (Most PowerVC deployments accumulate a sprawl of one-off images. A small curated catalog with disciplined refresh is the operational pattern that scales; an uncontrolled catalog is technical debt.)
- [Recommended] Are host groups and placement policies designed to match the operational topology -- striping policy (spread VMs across hosts for HA), packing policy (concentrate VMs to free up frames for maintenance), affinity/anti-affinity rules for application tiers, and explicit dedicated host groups for IBM i or licensing-sensitive workloads? (PowerVC's automated placement is only as good as the host-group design behind it. Leaving the default placement policy in place is functionally equivalent to manual placement.)
- [Optional] Is the OpenStack API surface used for automation (Terraform, Ansible, Tower/AAP, in-house tooling), with the awareness that PowerVC exposes a tailored OpenStack subset -- Nova, Cinder, Glance, Neutron compatible but with PowerVM-specific extensions, and not every upstream OpenStack feature is supported? (For estates standardising on Terraform or Ansible across hyperscalers, PowerVC's OpenStack API is the integration point. Plan for the gaps where upstream-OpenStack expectations do not hold.)
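The difference between the striping and packing policies named in the placement item above is easiest to see with numbers. The sketch below illustrates the intent of the two policies only; it is not PowerVC's scheduler, and the host names, core counts, and choose_host function are hypothetical.

```python
def choose_host(free_cores, needed, policy):
    """Pick a host for a new VM under a simplified placement policy.

    free_cores maps host name -> free processor cores (hypothetical data).
    policy is "striping" (spread) or "packing" (concentrate).
    Returns None when no host has enough free capacity.
    """
    candidates = {h: free for h, free in free_cores.items() if free >= needed}
    if not candidates:
        return None
    if policy == "striping":
        # Spread: the host with the most free capacity takes the next VM.
        return max(sorted(candidates), key=lambda h: candidates[h])
    if policy == "packing":
        # Concentrate: the fullest host that still fits takes the next VM.
        return min(sorted(candidates), key=lambda h: candidates[h])
    raise ValueError(f"unknown policy: {policy}")


hosts = {"frame-a": 12.0, "frame-b": 3.0, "frame-c": 7.5}

print(choose_host(hosts, 2.0, "striping"))  # frame-a
print(choose_host(hosts, 2.0, "packing"))   # frame-b
print(choose_host(hosts, 5.0, "packing"))   # frame-c
```

Packing leaves frame-a nearly empty, which is what makes it a candidate for evacuation before maintenance; striping limits how many VMs a single frame failure takes down.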
Live Partition Mobility (LPM)
- [Critical] Are LPM prerequisites met for every LPAR that will rely on it -- PowerVM Enterprise Edition licensed on both source and target frames, dual VIOS pairs at source and target, fully virtualised I/O (no dedicated adapters), shared SAN visibility from both VIOS pairs with the no_reserve attribute on all migrating disks, NPIV for SAN attach, Mover Service Partition (MSP) attribute enabled on at least one VIOS per frame, RMC connectivity between HMC and every relevant partition, and synchronised time-of-day across VIO servers? (Missing any of these is the typical "we can't LPM this LPAR" discovery, and it is always discovered when the maintenance window has already started. Validate LPM eligibility per LPAR proactively, not at the point of need.)
- [Critical] Are processor compatibility modes documented per LPAR and matched to the LPM target generation -- with the awareness that direct Power7-to-Power10 LPM is not supported (requires a two-step migration through Power8 or Power9), and that Power11 LPM compatibility follows the same generation-skip constraints? (Compatibility mode is set at LPAR activation, not migration time. Estates running long-lived LPARs in default-mode end up locked to a narrower set of target frames than they expected. The effective and configured modes must both be supported by the destination.)
- [Recommended] Are concurrent-LPM limits, MSP bandwidth, and management-network latency factored into LPM scheduling -- per-system concurrency caps (typically 4-16 concurrent LPMs depending on system class and VIOS configuration), MSP bandwidth between source and target across the management/SAN network, and round-trip latency between source and target that affects migration time but not LPM eligibility? (Bulk LPM operations to evacuate a frame for maintenance routinely exceed concurrency limits and serialise unexpectedly. Plan the evacuation window against the actual concurrency cap, not a notional one.)
- [Recommended] Is LPM positioned as a routine maintenance tool or as an occasional migration tool, with the operational discipline that comes with each choice -- if routine, every LPAR should be LPM-validated continuously and the team should evacuate frames for monthly firmware updates; if occasional, the prerequisites should be checked before each campaign rather than maintained on an ongoing basis? (Estates that "use LPM occasionally" tend to find the prerequisites broken when they need it. The fix is either to commit to routine use or to accept that LPM requires a verification step at every use.)
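The effect of the concurrency cap on a frame evacuation, as described in the scheduling item above, comes down to simple arithmetic. The following is a rough planning sketch under stated assumptions: the LPAR names, the 15-minute wave duration, and both helper functions are hypothetical, and real migration time depends on memory size, MSP bandwidth, and workload activity.

```python
import math


def plan_waves(lpars, concurrency_cap):
    """Split LPAR names into ordered waves of at most concurrency_cap each."""
    if concurrency_cap < 1:
        raise ValueError("concurrency_cap must be at least 1")
    return [lpars[i:i + concurrency_cap]
            for i in range(0, len(lpars), concurrency_cap)]


def estimate_window_minutes(lpar_count, concurrency_cap, minutes_per_wave):
    """Rough window, assuming waves run back to back at a fixed duration."""
    return math.ceil(lpar_count / concurrency_cap) * minutes_per_wave


# Hypothetical frame with 22 LPARs to evacuate.
lpars = [f"lpar{n:02d}" for n in range(1, 23)]

print(len(plan_waves(lpars, 4)))            # 6
print(len(plan_waves(lpars, 16)))           # 2
print(estimate_window_minutes(22, 4, 15))   # 90
print(estimate_window_minutes(22, 16, 15))  # 30
```

Planning against a notional cap of 16 when the frame actually sustains 4 triples the window in this example, which is the serialisation surprise the item warns about.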
Other Management-Plane Concerns
- [Recommended] Is Cloud Management Console (CMC) considered for multi-site fleet visibility (inventory, capacity, security, performance, logging across HMC-managed estates), or is fleet-wide reporting being assembled by hand from per-HMC exports? (CMC consolidates the data HMCs already collect into a single pane; it is a low-risk add to an existing estate and pays back quickly for any multi-site operations team.)
- [Recommended] Is out-of-band serial / Operations Console / 5250 access to IBM i partitions designed deliberately -- terminal server, dedicated IBM i Access Client Solutions deployment, or HMC-virtual-terminal as the primary console path -- so that an HMC outage does not leave IBM i operators unable to reach a partition? (IBM i operations tooling assumes a working 5250 path. An HMC-only console strategy is fragile; estates that rely solely on HMC vterm for IBM i discover the fragility during HMC outages.)
Why This Matters
The management plane is where Power-estate problems are silently created and visibly exposed. A workload-side audit looks clean (LPARs sized, OS supported, backups running, HA configured) while underneath, the HMC pair is mismatched on code level, half the FSPs are running firmware older than the HA partner, PowerVC has not been deployed and provisioning is a manual HMC click-path, and the LPM eligibility check has not been run in two years. None of this matters until a frame needs maintenance, a workload needs to move, or a migration project assumes LPM works -- and then all of it matters at once.
The HMC version compatibility matrix is the most frequent source of management-plane failure. HMC code, system firmware, and Power generation are pinned to each other by the published supported-combinations matrix; a stale HMC blocks the firmware update that the new HMC requires, and the only path through is a careful upgrade sequence (system firmware to a transitional level, then HMC, then system firmware to target). Estates that have not tracked the matrix as part of routine operations discover at upgrade time that they cannot upgrade without an intermediate hop. The removal of Power8 support in HMC V11 is the most recent example -- an HMC that still manages Power8 frames cannot move to V11 until those frames are retired or handed to an HMC that stays on V10, and many estates are unaware of the boundary.
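The three-phase sequence described above (transitional firmware, then HMC, then target firmware) can be sketched as an ordering function. This is an illustration under assumptions: only the FW1050.20 prerequisite for HMC V11 R1 M1110 comes from this document, while the frame names, the FW1060.10 target (a placeholder, not a recommendation), and the helper names are hypothetical. Other minimums would have to be filled in from the supported-combinations matrix.

```python
def parse_fw(level):
    """Split a level such as 'FW1050.10' into (1050, 10)."""
    release, service_pack = level.removeprefix("FW").split(".")
    return int(release), int(service_pack)


# Minimum service pack per firmware release before the HMC upgrade.
# Only the FW1050 entry comes from the text; the rest is left to the matrix.
MIN_SP_BEFORE_HMC_V11 = {1050: 20}


def upgrade_steps(frames, target_level):
    """Order the work: transitional firmware, then HMC, then target firmware."""
    steps = []
    for name in sorted(frames):
        release, service_pack = parse_fw(frames[name])
        minimum = MIN_SP_BEFORE_HMC_V11.get(release)
        if minimum is not None and service_pack < minimum:
            steps.append(f"{name}: firmware to FW{release}.{minimum}")
    steps.append("HMC: upgrade to V11 R1 M1110")
    for name in sorted(frames):
        if parse_fw(frames[name]) < parse_fw(target_level):
            steps.append(f"{name}: firmware to {target_level}")
    return steps


# Hypothetical frames; FW1060.10 is a placeholder target level.
frames = {"frame-a": "FW1050.10", "frame-b": "FW1050.20"}
for step in upgrade_steps(frames, "FW1060.10"):
    print(step)
```

In this example frame-a needs the intermediate hop to FW1050.20 before the HMC can be upgraded, while frame-b does not; both then move to the target level once the HMC is on the new code.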
FSP firmware drift across frames is the next quietly accumulating problem. Concurrent service packs hide the drift -- LPARs keep running -- but the drift compounds against LPM eligibility, against support contracts that specify minimum firmware, and against HA pairs that assume firmware parity. The Power10 SBE behaviour (concurrent service pack with SBE changes adds 3-5 minutes per chip at the next IPL) is the kind of nuance that turns a planned reboot from a 5-minute outage into a 90-minute one if not anticipated. Treating firmware update as a per-frame ad-hoc activity, rather than a fleet-wide coordinated activity managed against a documented baseline, is the operational anti-pattern.
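The SBE arithmetic behind that jump is worth making explicit. A minimal sketch, assuming a hypothetical 16-chip system and a 5-minute baseline IPL; only the 3-5 minutes per processor chip figure comes from the text, and the function name is illustrative.

```python
def sbe_ipl_overhead_minutes(processor_chips):
    """Return (low, high) extra IPL minutes at 3-5 minutes per processor chip."""
    if processor_chips < 0:
        raise ValueError("processor_chips must not be negative")
    return 3 * processor_chips, 5 * processor_chips


baseline_minutes = 5   # assumed normal reboot outage
chips = 16             # hypothetical large enterprise frame

low, high = sbe_ipl_overhead_minutes(chips)
print(low, high)                                        # 48 80
print(baseline_minutes + low, baseline_minutes + high)  # 53 85
```

On these assumptions the next reboot after a concurrent service pack with SBE changes lands between roughly 53 and 85 minutes, so the outage window should be sized from the chip count rather than from the usual reboot time.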
PowerVC is the lever that turns an HMC-driven estate into a cloud-style estate. Estates that have deployed PowerVC operate Power infrastructure with self-service, image catalogs, automated placement, and integration into hyperscaler-style tooling (Terraform, Ansible). Estates that have not deployed it click through the HMC for every LPAR change. Choosing PowerVC late -- after the operational habits have set into HMC-only patterns -- is materially harder than choosing it on day one; it is a project, not a configuration change. Migrations that involve a large Power footprint should examine PowerVC deployment as a precursor, not an afterthought.
LPM is the operational capability that separates planned maintenance from planned outages. With LPM working, a frame can be evacuated for firmware update, hardware replacement, or migration with zero workload downtime; without it, every frame maintenance is a workload-coordination exercise. The prerequisites (PowerVM Enterprise, dual VIOS, NPIV, shared SAN, MSP, processor-compatibility mode) are not exotic, but they must all hold simultaneously per LPAR, and they degrade silently when adapters are added without virtualization or when new LPARs are activated in default processor modes. The operational discipline is continuous validation, not periodic checking.
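Because the prerequisites must all hold at once for each LPAR, continuous validation amounts to an all-of check over an inventory record. The sketch below assumes the facts are already collected by some other means; the LpmFacts fields and missing_prerequisites are hypothetical names, and the check complements the HMC's own LPM validation without replacing it.

```python
from dataclasses import dataclass, fields


@dataclass
class LpmFacts:
    """Per-LPAR prerequisite flags, gathered by the estate's own inventory."""
    powervm_enterprise_both_frames: bool
    dual_vios_both_frames: bool
    io_fully_virtualised: bool
    shared_san_visible: bool
    msp_enabled_both_frames: bool
    rmc_connected: bool
    compat_mode_supported_on_target: bool


def missing_prerequisites(facts):
    """Return the names of prerequisites that do not hold, in field order."""
    return [f.name for f in fields(facts) if not getattr(facts, f.name)]


# Hypothetical LPAR that gained a dedicated adapter and lost RMC connectivity.
lpar = LpmFacts(
    powervm_enterprise_both_frames=True,
    dual_vios_both_frames=True,
    io_fully_virtualised=False,
    shared_san_visible=True,
    msp_enabled_both_frames=True,
    rmc_connected=False,
    compat_mode_supported_on_target=True,
)

print(missing_prerequisites(lpar))
# ['io_fully_virtualised', 'rmc_connected']
```

An empty list is the only passing result; any single missing item makes the LPAR ineligible, which is why silent degradation of one prerequisite is enough to break an evacuation plan.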
Common Decisions (ADR Triggers)
- Dual HMC vs single HMC -- dual HMC (peer model, redundant management plane, survives HMC failure during outage, supports rolling HMC upgrades) vs single HMC (lower cost, simpler, single point of failure during planned maintenance and unplanned outages). For any production estate with HA or LPM requirements, dual HMC is the production baseline; single HMC is acceptable only for development frames or very small estates where extended outage windows are tolerable. Document the choice and the failure-mode analysis explicitly.
- Physical HMC (7063-CR2) vs vHMC on PowerVM vs vHMC on x86 -- physical HMC (dedicated hardware, simplest support story, no host dependency) vs vHMC on PowerVM (consolidates onto existing Power frames, but introduces a circular dependency where the HMC managing a frame also runs on it) vs vHMC on x86 (consolidates onto existing x86 virtualization, independent of the Power estate, but uses a different PTF stream than vHMC on POWER and many estates run mixed vHMC types by accident). Production patterns often run one physical HMC plus one vHMC on x86 as the redundant pair to break the circular dependency.
- PowerVC deployment vs HMC-only provisioning -- PowerVC (image catalog, automated placement, self-service, OpenStack API for automation, integration into Terraform/Ansible, additional licensing and operations team commitment) vs HMC-only (no additional software, manual provisioning through HMC GUI/CLI, scales poorly above small estates). The ADR is the inflection point: above roughly 5 frames or with any self-service / chargeback / automation requirement, PowerVC is the answer; below that, HMC-only suffices.
- PowerVC Standard Edition vs PowerVC for Private Cloud -- Standard (image catalog, placement, multi-system management) vs Private Cloud (adds self-service portal, approval workflows, deploy templates, metering for chargeback). Private Cloud is required if business stakeholders will provision LPARs themselves; Standard suffices for an infrastructure team that owns provisioning.
- LPM as routine maintenance tool vs occasional migration tool -- routine (every LPAR validated continuously, frames evacuated for every firmware update, processor compatibility modes managed proactively, MSPs and dual VIOS treated as production-baseline) vs occasional (LPM available but not used for routine maintenance, prerequisites verified per campaign). The cost is the operational discipline; the payoff is eliminating planned-outage windows. For any 24x7 or financial-services workload the ADR generally lands on routine.
- Concurrent vs disruptive firmware update windows -- always plan for concurrent where possible (no workload outage, longer IPL on next reboot for SBE changes on Power10/11) vs schedule disruptive when concurrent is not available (planned outage window, full IPL, faster total time). The ADR is the policy: which workload tiers will accept concurrent firmware update without re-test, which require post-update validation, and how the disruptive-only paths (no-HMC systems, EMX0 with EMXH attached) are scheduled.
Reference Links
- HMC and System Firmware Supported Combinations -- canonical compatibility matrix between HMC code, system firmware, and Power generation
- Hardware Management Console Support and Downloads -- HMC code releases, PTFs, recommended-fixes pages
- Virtual HMC appliance (vHMC) Overview -- vHMC on PowerVM vs vHMC on x86, supported hypervisors, PTF stream separation
- Configuring Redundant HMCs -- dual-HMC topology, FSP wiring, DHCP IP-range partitioning
- Concurrent vs Disruptive Firmware Update and Upgrade on Power Systems -- classification rules including Power10 SBE and EMX0/EMXH constraints
- Power9, Power10 and Power11 System FW Release Planned Schedule (2025-2026) -- IBM-published firmware roadmap
- IBM Power Virtualization Best Practices Guide v5.0 (October 2025) -- LPAR sizing, VIOS, LPM, and management-plane best practices
- IBM PowerVC product page -- Standard Edition vs Private Cloud Edition positioning
- IBM PowerVC for Private Cloud 2.3 documentation -- current PowerVC documentation, OpenStack API surface, image catalog, placement policies
- PowerVC Lifecycle Information -- PowerVC version support windows
- Difference between PowerVC Standard and PowerVC for Private Cloud -- self-service portal, approval workflows, metering
- Live Partition Mobility (IBM Support) -- LPM prerequisites, mover service partitions, validation workflow
- Best Practices for Live Partition Mobility (LPM) Networking -- MSP bandwidth, management-network latency
- Migration combinations of processor compatibility modes -- supported source/target processor-compatibility-mode pairs across Power generations
- Requirements for IBM i LPM -- IBM-i-specific LPM prerequisites
- Cloud Management Console (CMC) -- multi-site fleet visibility and reporting layer above HMC
See Also
- providers/ibm/power-onprem.md -- workload-side concerns (LPAR sizing, AIX/IBM i support, backup/HA, ILMT, software licensing) that this file complements
- providers/ibm/powervs.md -- IBM Power Virtual Server, the IBM-Cloud-hosted parallel where the management plane (HMC, PowerVC equivalents, LPM-equivalent migration tooling) is IBM-operated
- providers/skytap/cloud.md -- Skytap on Azure (Kyndryl Cloud Uplift), where the entire Power management plane is provider-operated and customers interact through Skytap's APIs
- providers/kyndryl/private-cloud.md -- Kyndryl-managed Power estates where Kyndryl operates HMC, PowerVC, and LPM on the customer's behalf
- general/workload-migration.md -- migration wave methodology where LPM and PowerVC are the on-ramps for moving workloads between frames during a refresh
- general/disaster-recovery.md -- DR patterns where dual-HMC reachability across sites is a prerequisite for cross-site failover
- patterns/hybrid-cloud.md -- hybrid patterns where on-prem Power management plane connects to cloud-side automation toolchains via the PowerVC OpenStack API