The GCP Platform Migration Playbook — How I Move Workloads Without Breaking Production
The call usually comes after something goes wrong. A migration that was supposed to take three months is now in month seven. Two microservices are running in parallel across on-prem and GCP and nobody is sure which one is authoritative. A cutover window got missed because a dependency nobody documented turned out to be load-bearing. I’ve been brought in to rescue migrations like this more than once — and in every case, the root cause wasn’t technical complexity. It was the absence of a repeatable methodology.
GCP migrations fail at the planning layer, not the execution layer. The teams I work with are technically capable. What they’re missing is a structured framework for sequencing decisions — what to move first, how to validate before cutting over, how to handle the inevitable surprises without stopping the programme. This playbook is how I run every GCP migration engagement, from a 20-service SaaS platform to a large enterprise with hybrid connectivity requirements.
Migration is the C (Cloud-Native by design) and L (Lifecycle Operations) pillars of the SCALE Framework in practice. You’re not just lifting workloads — you’re redesigning how they operate. Done right, the destination platform is more observable, more automated, and more resilient than what you left behind. Done wrong, you’ve reproduced your on-prem problems in the cloud at higher cost.
Why Most GCP Migrations Stall
Before the methodology, it’s worth naming the failure modes I see consistently:
The “Big Bang” trap. Teams plan a single cutover weekend to move everything at once. The cutover fails at 2am, the rollback takes four hours, and the programme loses six months of political momentum.
Undocumented dependencies. Application A connects to Application B through a shared database that nobody put in the architecture diagram. You find out during the cutover window.
Terraform introduced too late. The first three workloads get stood up manually in the console to “move fast.” By workload ten, you have three different network topologies and no repeatable deployment process.
No definition of done. Teams move a workload to GCP but keep the on-prem instance running “just in case.” Six months later both are still live and the on-prem instance is handling production traffic.
The playbook below addresses each of these directly.
Phase 1: Discovery — Know What You’re Actually Moving
Duration: 2–4 weeks depending on estate size
This is the most important phase, and the one most teams rush. I never commit to a migration timeline until discovery is complete. Here’s what I’m mapping:
Workload inventory. Every application, service, and batch job — with its runtime, dependencies, data stores, and integration points documented. For an enterprise engagement I worked on recently involving a large Canadian healthcare platform moving from on-prem to GCP, the discovery phase uncovered 14 undocumented service-to-service integrations that would have broken the migration plan entirely. That two-week investment saved months.
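The inventory usually lives in a spreadsheet, but it helps to think of each row as a typed record. A minimal sketch of what I capture per workload; every field name here is illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

# One record per application, service, or batch job discovered in Phase 1.
# Field names are illustrative; shape the record to your own estate.
@dataclass
class Workload:
    name: str
    runtime: str                                         # e.g. "java11", "python3.11", "cron"
    data_stores: list[str] = field(default_factory=list)
    calls: list[str] = field(default_factory=list)       # outbound service-to-service dependencies
    integration_points: list[str] = field(default_factory=list)
    regulated_data: bool = False                         # feeds the data classification below

# Hypothetical entries, purely for illustration:
inventory = [
    Workload("billing-api", "java11", ["billing-db"], ["auth-svc"], ["sap-export"], True),
    Workload("report-cron", "cron", [], ["billing-api"], [], False),
]
```

The `calls` field is what later becomes the edge list of the dependency graph, and `regulated_data` is what routes a workload toward VPC Service Controls in the landing zone.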
Dependency mapping. I build a directed dependency graph — what calls what, what shares what, what has a hard startup order requirement. This becomes the sequencing input for Phase 3.
Data classification. What data is sensitive, regulated, or subject to residency requirements? This determines which workloads need VPC Service Controls, which projects need specific org policy constraints, and what the compliance posture of the destination landing zone needs to be.
Migration mode per workload. For each service I assign one of four modes:
Rehost (Lift & Shift): Move as-is to GCE or GKE. Fastest, least value — use only for workloads with short remaining lifespan.
Replatform: Containerize and deploy to GKE or Cloud Run with minimal code changes. The most common mode.
Refactor: Redesign to be cloud-native — stateless, event-driven, serverless. High effort, high long-term value.
Retire: Some workloads don’t need to migrate. Decommission them instead.
Output of Phase 1: A migration inventory spreadsheet, dependency graph, data classification map, and per-workload migration mode assignment. This is the document that drives everything else.
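A first-pass mode assignment can even be drafted as a heuristic before the real per-workload judgment call is made. The thresholds below are illustrative placeholders, not a formula I apply blindly:

```python
# A hedged first-pass heuristic for assigning a migration mode.
# The 12-month lifespan threshold is an assumption for illustration;
# the real assignment is a judgment made during discovery.
def assign_mode(months_remaining_life: int, containerizable: bool,
                worth_redesigning: bool, still_needed: bool) -> str:
    if not still_needed:
        return "retire"           # decommission instead of migrating
    if months_remaining_life < 12:
        return "rehost"           # lift & shift: short lifespan, minimal effort
    if worth_redesigning:
        return "refactor"         # cloud-native redesign, high long-term value
    if containerizable:
        return "replatform"       # GKE / Cloud Run with minimal code changes
    return "rehost"

print(assign_mode(36, True, False, True))   # → replatform
```

Note the ordering: retirement is checked first, because the cheapest workload to migrate is the one you delete.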
Phase 2: Foundation — Terraform First, Always
Duration: 2–3 weeks
Nothing moves until the landing zone is production-ready and fully Terraform-managed. This is non-negotiable in my engagements.
The foundation includes the resource hierarchy (org, folders, projects), Hub-and-Spoke network topology, org policy constraints, IAM model with Workload Identity Federation, VPC Service Controls for regulated data, and centralized logging and monitoring. I cover the full landing zone architecture in the GCP Landing Zone Blueprint.
The Terraform-first rule. Every resource that gets created during the migration — VPCs, GKE clusters, Cloud SQL instances, IAM bindings — is defined in Terraform before it’s applied. No exceptions. Teams that stand up resources manually to “move faster” during Phase 2 always pay the price in Phase 4 when they can’t reproduce the environment or explain it to an auditor.
Connectivity to on-prem. For hybrid migrations — where workloads need to communicate between on-prem and GCP during the transition period — I establish either Cloud Interconnect or Cloud VPN at this stage. The network path needs to be stable and tested before any workload moves.
Monitoring baseline. I deploy Datadog or Google Cloud Operations at the foundation level so that every workload migration has observability from day one. You can’t validate a cutover if you can’t see what the workload is doing.
Phase 3: Sequencing — Move in Dependency Order
Duration: Variable — typically 8–16 weeks for a mid-market platform
This is where the dependency graph from Phase 1 pays off. The sequencing rule is simple: move leaves before roots. Start with the services that have no downstream dependents, and work backwards so the core shared services move last.
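The leaves-before-roots rule is just a reverse topological sort of the dependency graph. A minimal sketch using Python's stdlib `graphlib`, with invented service names:

```python
from graphlib import TopologicalSorter

# Map each service to the services it calls (its dependencies).
# Names are invented for illustration.
calls = {
    "web-frontend": {"checkout", "catalog"},
    "checkout":     {"payments", "shared-db"},
    "catalog":      {"shared-db"},
    "payments":     {"shared-db"},
    "shared-db":    set(),
}

# static_order() yields dependencies before dependents; reversing it
# gives a leaves-first order: services nothing else calls migrate
# first, core shared services migrate last.
migration_order = list(reversed(list(TopologicalSorter(calls).static_order())))
print(migration_order)
```

A side benefit: `TopologicalSorter` raises `CycleError` on circular dependencies, which is exactly the kind of surprise you want surfaced in planning rather than in a cutover window.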
Wave planning. I group workloads into migration waves of 3–6 services. Each wave has a defined scope, a validation checklist, and a go/no-go gate before cutover. Waves typically run every 2–3 weeks.
Parallel running. For all but the simplest workloads, I run the GCP instance in parallel with the on-prem instance for a defined validation period — typically 1–2 weeks. Both instances are live, GCP is receiving a read-only or mirrored traffic subset, and we’re comparing outputs. This is the safety net that makes cutovers boring.
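The output-comparison step during parallel running can be as simple as a field-level diff of mirrored responses, with known-noisy fields excluded. Everything here, including the field names, is illustrative:

```python
# Sketch of output comparison during parallel running: the same request
# is replayed against both endpoints and the responses diffed.
# The ignore set holds fields expected to differ (timestamps, hostnames).
def compare(onprem_response: dict, gcp_response: dict,
            ignore: frozenset = frozenset({"timestamp", "host"})) -> list:
    """Return field names whose values diverge between the two instances."""
    mismatches = []
    for key in onprem_response.keys() | gcp_response.keys():
        if key in ignore:
            continue
        if onprem_response.get(key) != gcp_response.get(key):
            mismatches.append(key)
    return sorted(mismatches)

# In practice both responses come from mirrored traffic, not literals:
print(compare({"total": 100, "timestamp": 1}, {"total": 100, "timestamp": 2}))  # → []
```

An empty mismatch list over the full validation window is what earns a workload its go decision; any persistent divergence sends it back to the wave backlog.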
The validation checklist per workload:
Functional parity confirmed (automated test suite passing)
Performance benchmarks within 10% of on-prem baseline
Logging and alerting confirmed in Google Cloud Operations / Datadog
Secrets confirmed injected via Secret Manager CSI Driver (not env vars)
IAM bindings confirmed — no over-permissioned service accounts
Dependency connections confirmed from GCP to both remaining on-prem services and previously migrated GCP services
Nothing moves to cutover without a signed-off checklist.
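The gate works best when sign-off is explicit rather than implied. A sketch of the go/no-go check, with item names mirroring the checklist above; this stands in for whatever tracking tool the team actually uses:

```python
# The go/no-go gate as code: every checklist item must be explicitly
# signed off before a workload is allowed to reach cutover.
CHECKLIST = [
    "functional_parity",           # automated test suite passing
    "performance_within_10pct",    # against the on-prem baseline
    "logging_and_alerting",        # visible in Cloud Operations / Datadog
    "secrets_via_secret_manager",  # CSI Driver injection, not env vars
    "iam_least_privilege",         # no over-permissioned service accounts
    "dependency_connectivity",     # on-prem and already-migrated services reachable
]

def go_no_go(signed_off: dict) -> bool:
    missing = [item for item in CHECKLIST if not signed_off.get(item, False)]
    if missing:
        print(f"NO-GO: unsigned items: {missing}")
        return False
    return True
```

The important property is the default: an item absent from `signed_off` counts as a no, so nothing passes the gate by omission.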
Phase 4: Cutover — Make It a Non-Event
The goal of a well-run migration is that the cutover is the most boring part of the programme.
By the time you reach cutover for any given workload, you’ve been running the GCP instance in parallel for 1–2 weeks, your validation checklist is signed off, and you’ve rehearsed the DNS/traffic switch at least once in non-prod. The cutover itself should take under 30 minutes.
My cutover sequence:
Confirm on-prem instance is in read-only / quiesced state
Final data sync if applicable
Switch DNS / load balancer to GCP endpoint
Confirm traffic routing via monitoring
Run smoke tests against GCP endpoint
Declare cutover complete — do not touch on-prem instance for 48 hours
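The sequence above can be sketched as a small orchestration loop. Every step here is a stub standing in for real DNS, data-sync, and monitoring tooling, so this is a shape, not an implementation:

```python
# The cutover sequence as an orchestration loop. Each step is a stub;
# in a real engagement these call your DNS provider, sync tooling,
# and monitoring APIs. Any failure aborts before the traffic switch
# becomes irreversible.
def quiesce_onprem(w):  return True   # put on-prem instance in read-only state
def final_data_sync(w): return True   # last delta copy, if applicable
def switch_dns(w):      return True   # point DNS / load balancer at the GCP endpoint
def confirm_routing(w): return True   # watch traffic move in monitoring
def smoke_tests(w):     return True   # hit the GCP endpoint directly

STEPS = [quiesce_onprem, final_data_sync, switch_dns, confirm_routing, smoke_tests]

def cutover(workload: str) -> bool:
    for step in STEPS:
        if not step(workload):
            print(f"[{workload}] {step.__name__} failed; rollback is the DNS switch")
            return False
        print(f"[{workload}] {step.__name__} ok")
    print(f"[{workload}] cutover complete; hands off on-prem for 48 hours")
    return True
```

The ordering encodes the rollback guarantee: nothing destructive happens before the DNS switch, and nothing on-prem is touched after it.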
The 48-hour rule. I never decommission the on-prem instance immediately after cutover. It stays in a stopped-but-available state for 48 hours minimum. If something surfaces in production that wasn’t caught in parallel running, the rollback path is a single DNS switch — not a rebuild.
After 48 hours with no issues, the on-prem instance is decommissioned. This is your definition of done. Not “running on GCP” — decommissioned on-prem.
Phase 5: Optimisation — The Work That Justifies the Migration
Most migration engagements end at cutover. I push clients to schedule an optimisation phase 4–6 weeks after the final wave.
This is where the cloud-native value gets captured: right-sizing compute, enabling committed use discounts, implementing autoscaling that wasn’t possible on-prem, consolidating observability, and cleaning up any technical debt that accumulated during the migration waves. For a healthcare platform migration I ran, this phase delivered a 34% reduction in monthly GCP spend within six weeks of the final cutover — purely through right-sizing and commitment planning.
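The savings mechanics are simple arithmetic. A back-of-envelope sketch where the right-sizing reduction, committed fraction, and discount rate are all assumed placeholders, not GCP's actual published rates (which vary by machine family and commitment term):

```python
# Back-of-envelope post-migration spend after right-sizing plus
# committed use discounts. All three rate parameters are illustrative
# assumptions; substitute real figures from your own billing data.
def optimised_spend(monthly_spend: float,
                    rightsizing_reduction: float = 0.15,
                    committed_fraction: float = 0.70,
                    cud_discount: float = 0.30) -> float:
    after_rightsizing = monthly_spend * (1 - rightsizing_reduction)
    committed = after_rightsizing * committed_fraction * (1 - cud_discount)
    on_demand = after_rightsizing * (1 - committed_fraction)
    return committed + on_demand

print(round(optimised_spend(100_000), 2))  # → 67150.0
```

Even with these placeholder rates, right-sizing compounds with commitment discounts, which is why I sequence right-sizing first: committing to over-provisioned capacity locks in the waste.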
The GCP Architecture & Modernization service covers how I structure ongoing advisory engagements after the migration is complete.
The Migration Playbook at a Glance
| Phase | Key Output | Duration |
|---|---|---|
| 1. Discovery | Inventory, dependency graph, migration modes | 2–4 weeks |
| 2. Foundation | Terraform landing zone, hybrid connectivity, monitoring | 2–3 weeks |
| 3. Sequencing | Wave plan, parallel running, validation checklists | 8–16 weeks |
| 4. Cutover | Traffic switch, 48hr hold, decommission | Per wave |
| 5. Optimisation | Right-sizing, cost reduction, cloud-native improvements | 4–6 weeks post-migration |
What I Look for in a Migration That’s Already in Trouble
If you’re reading this mid-migration and things aren’t going well, here are the three questions I ask first:
Is everything in Terraform? If not, stop adding workloads and get the existing GCP resources into IaC before the configuration drift becomes unmanageable.
Is there a clear definition of done per workload? If on-prem instances aren’t being decommissioned, you don’t have a migration — you have a replication. Set the decommission date as part of the wave plan.
Is there a dependency graph? If you’re discovering integrations during cutovers, go back to Phase 1. A week of discovery now is worth months of cutover failures later.
Related reading:
How I Think About GCP Platform Architecture — The SCALE Framework — the architectural methodology that defines the destination platform
Hardening the Foundation: How I Build Secure GCP Landing Zones — the Phase 2 foundation every migration lands on
GCP VPC Service Controls — data perimeter controls for regulated workloads in the migration
Enterprise Platform Modernization — how I work with enterprise teams on large-scale GCP programmes
GCP Architecture & Modernization Services — how I engage for migration programmes
Case Studies — real migration and modernization engagements
Planning a GCP Migration? Let’s Start With Discovery.
The most expensive mistake in a GCP migration is starting the clock before you know what you’re moving. A two-week discovery engagement with me will produce a migration inventory, dependency map, and wave plan that de-risks the entire programme before a single workload moves.