The GCP Platform Migration Playbook — How I Move Workloads Without Breaking Production

The call usually comes after something goes wrong. A migration that was supposed to take three months is now in month seven. Two microservices are running in parallel across on-prem and GCP and nobody is sure which one is authoritative. A cutover window got missed because a dependency nobody documented turned out to be load-bearing. I’ve been brought in to rescue migrations like this more than once — and in every case, the root cause wasn’t technical complexity. It was the absence of a repeatable methodology.

GCP migrations fail at the planning layer, not the execution layer. The teams I work with are technically capable. What they’re missing is a structured framework for sequencing decisions — what to move first, how to validate before cutting over, how to handle the inevitable surprises without stopping the programme. This playbook is how I run every GCP migration engagement, from a 20-service SaaS platform to a large enterprise with hybrid connectivity requirements.

Migration is the C (Cloud-Native by design) and L (Lifecycle Operations) pillars of the SCALE Framework in practice. You’re not just lifting workloads — you’re redesigning how they operate. Done right, the destination platform is more observable, more automated, and more resilient than what you left behind. Done wrong, you’ve reproduced your on-prem problems in the cloud at higher cost.

Why Most GCP Migrations Stall

Before the methodology, it’s worth naming the failure modes I see consistently:

The “Big Bang” trap. Teams plan a single cutover weekend to move everything at once. The cutover fails at 2am, the rollback takes four hours, and the programme loses six months of political momentum.

Undocumented dependencies. Application A connects to Application B through a shared database that nobody put in the architecture diagram. You find out during the cutover window.

Terraform introduced too late. The first three workloads get stood up manually in the console to “move fast.” By workload ten, you have three different network topologies and no repeatable deployment process.

No definition of done. Teams move a workload to GCP but keep the on-prem instance running “just in case.” Six months later both are still live and the on-prem instance is handling production traffic.

The playbook below addresses each of these directly.

Phase 1: Discovery — Know What You’re Actually Moving

Duration: 2–4 weeks depending on estate size

The most important phase and the one most teams rush. I never commit to a migration timeline until discovery is complete. Here’s what I’m mapping:

Workload inventory. Every application, service, and batch job — with its runtime, dependencies, data stores, and integration points documented. For an enterprise engagement I worked on recently involving a large Canadian healthcare platform moving from on-prem to GCP, the discovery phase uncovered 14 undocumented service-to-service integrations that would have broken the migration plan entirely. That two-week investment saved months.

Dependency mapping. I build a directed dependency graph — what calls what, what shares what, what has a hard startup order requirement. This becomes the sequencing input for Phase 3.
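To make the graph concrete, here's a minimal sketch in plain Python, with hypothetical service names, showing how the discovered edges can be recorded and how to surface the services nothing else calls — the natural first-wave candidates:

```python
from collections import defaultdict

# Hypothetical edge list from discovery interviews and traffic analysis:
# (caller, callee) pairs. Names are illustrative, not from a real estate.
EDGES = [
    ("web-frontend", "orders-api"),
    ("orders-api", "billing-api"),
    ("orders-api", "shared-postgres"),
    ("billing-api", "shared-postgres"),
    ("reporting-job", "shared-postgres"),
]

def build_graph(edges):
    """Map each service to the set of services it depends on."""
    deps = defaultdict(set)
    for caller, callee in edges:
        deps[caller].add(callee)
        deps.setdefault(callee, set())
    return deps

def first_wave_candidates(deps):
    """Services that no other service calls: safest to move first."""
    depended_on = {d for ds in deps.values() for d in ds}
    return sorted(s for s in deps if s not in depended_on)

print(first_wave_candidates(build_graph(EDGES)))
# -> ['reporting-job', 'web-frontend']
```

In practice the edge list comes out of flow logs, APM traces, and interviews; the shape of the data structure is what matters here.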

Data classification. What data is sensitive, regulated, or subject to residency requirements? This determines which workloads need VPC Service Controls, which projects need specific org policy constraints, and what the compliance posture of the destination landing zone needs to be.

Migration mode per workload. For each service I assign one of four modes:

  • Rehost (Lift & Shift): Move as-is to GCE or GKE. Fastest, least value — use only for workloads with short remaining lifespan.

  • Replatform: Containerize and deploy to GKE or Cloud Run with minimal code changes. The most common mode.

  • Refactor: Redesign to be cloud-native — stateless, event-driven, serverless. High effort, high long-term value.

  • Retire: Some workloads don’t need to migrate. Decommission them instead.

Output of Phase 1: A migration inventory spreadsheet, dependency graph, data classification map, and per-workload migration mode assignment. This is the document that drives everything else.

Phase 2: Foundation — Terraform First, Always

Duration: 2–3 weeks

Nothing moves until the landing zone is production-ready and fully Terraform-managed. This is non-negotiable in my engagements.

The foundation includes the resource hierarchy (org, folders, projects), Hub-and-Spoke network topology, org policy constraints, IAM model with Workload Identity Federation, VPC Service Controls for regulated data, and centralized logging and monitoring. I cover the full landing zone architecture in the GCP Landing Zone Blueprint.

The Terraform-first rule. Every resource that gets created during the migration — VPCs, GKE clusters, Cloud SQL instances, IAM bindings — is defined in Terraform before it’s applied. No exceptions. Teams that stand up resources manually to “move faster” during Phase 2 always pay the price in Phase 4 when they can’t reproduce the environment or explain it to an auditor.
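One lightweight way to enforce the rule is a drift audit: compare what Terraform knows about against what actually exists. The sketch below assumes you've exported resource names from a `terraform state list`-style listing and from a live inventory export; the data and names are purely illustrative:

```python
def unmanaged_resources(terraform_state, live_inventory):
    """Return live resources with no Terraform address: drift candidates.

    terraform_state: iterable of resource names known to Terraform
    live_inventory:  iterable of resource names exported from the cloud
    """
    return sorted(set(live_inventory) - set(terraform_state))

# Hypothetical data: state listing vs. an asset-inventory export.
state = ["vpc-hub", "gke-prod", "sql-orders"]
live = ["vpc-hub", "gke-prod", "sql-orders", "vm-clicked-in-console"]

print(unmanaged_resources(state, live))
# -> ['vm-clicked-in-console']
```

Anything the audit surfaces gets imported into Terraform or decommissioned before the next wave starts.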

Connectivity to on-prem. For hybrid migrations — where workloads need to communicate between on-prem and GCP during the transition period — I establish either Cloud Interconnect or Cloud VPN at this stage. The network path needs to be stable and tested before any workload moves.

Monitoring baseline. I deploy Datadog or Google Cloud Operations at the foundation level so that every workload migration has observability from day one. You can’t validate a cutover if you can’t see what the workload is doing.

Phase 3: Sequencing — Move in Dependency Order

Duration: Variable — typically 8–16 weeks for a mid-market platform

This is where the dependency graph from Phase 1 pays off. The sequencing rule is simple: move leaves before roots. Start with the services that have no downstream dependents, then work backwards, moving the core shared services last.

Wave planning. I group workloads into migration waves of 3–6 services. Each wave has a defined scope, a validation checklist, and a go/no-go gate before cutover. Waves typically run every 2–3 weeks.
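The leaves-before-roots rule can be mechanised: a service joins a wave only once everything that calls it has already moved. Here's a hedged sketch, with hypothetical edges, that also flags circular dependencies, which must be broken before sequencing can work at all:

```python
from collections import defaultdict

def plan_waves(edges):
    """Group services into migration waves: a service moves only after
    every service that calls it has already migrated (leaves before roots)."""
    callers = defaultdict(set)  # service -> set of services that call it
    services = set()
    for caller, callee in edges:
        callers[callee].add(caller)
        services.update((caller, callee))
    waves, moved = [], set()
    while moved != services:
        wave = sorted(s for s in services - moved
                      if callers[s] <= moved)  # all callers already moved
        if not wave:
            raise ValueError("cycle detected: break it before sequencing")
        waves.append(wave)
        moved.update(wave)
    return waves

# Illustrative edge list: (caller, callee)
EDGE_LIST = [("web-frontend", "orders-api"),
             ("orders-api", "billing-api"),
             ("orders-api", "shared-postgres"),
             ("billing-api", "shared-postgres")]
print(plan_waves(EDGE_LIST))
# -> [['web-frontend'], ['orders-api'], ['billing-api'], ['shared-postgres']]
```

Note how the shared database lands in the final wave automatically — exactly the ordering the playbook calls for. Real wave plans also weigh team capacity and risk, so treat this as a starting proposal, not the schedule.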

Parallel running. For all but the simplest workloads, I run the GCP instance in parallel with the on-prem instance for a defined validation period — typically 1–2 weeks. Both instances are live, GCP is receiving a read-only or mirrored traffic subset, and we’re comparing outputs. This is the safety net that makes cutovers boring.
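A minimal sketch of the output comparison, assuming mirrored responses are captured as JSON-like records; fields that legitimately differ between environments (hostnames, timestamps) are excluded before comparing. Everything here is illustrative:

```python
import json

def mismatch_rate(onprem_responses, gcp_responses,
                  ignore_fields=("timestamp", "hostname")):
    """Compare mirrored response pairs, ignoring environment-specific fields.
    Returns the fraction of pairs whose payloads disagree."""
    def canonical(resp):
        return json.dumps(
            {k: v for k, v in resp.items() if k not in ignore_fields},
            sort_keys=True)
    mismatches = sum(canonical(a) != canonical(b)
                     for a, b in zip(onprem_responses, gcp_responses))
    return mismatches / len(onprem_responses)

# Hypothetical mirrored samples
onprem = [{"order": 1, "total": 99, "hostname": "dc1-a"},
          {"order": 2, "total": 42, "hostname": "dc1-b"}]
gcp = [{"order": 1, "total": 99, "hostname": "gke-1"},
       {"order": 2, "total": 40, "hostname": "gke-2"}]
print(mismatch_rate(onprem, gcp))
# -> 0.5
```

A non-zero mismatch rate during parallel running is a no-go signal for the wave gate, and usually points at a config or data-sync difference rather than a code bug.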

The validation checklist per workload:

  • Functional parity confirmed (automated test suite passing)

  • Performance benchmarks within 10% of on-prem baseline

  • Logging and alerting confirmed in Google Cloud Operations / Datadog

  • Secrets confirmed injected via Secret Manager CSI Driver (not env vars)

  • IAM bindings confirmed — no over-permissioned service accounts

  • Dependency connections confirmed from GCP to both remaining on-prem services and previously migrated GCP services

Nothing moves to cutover without a signed-off checklist.
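The checklist above can be expressed as a simple go/no-go gate. This sketch shows the shape of it, including the 10% performance tolerance; the check names and numbers are illustrative:

```python
def performance_within_tolerance(baseline_ms, gcp_ms, tolerance=0.10):
    """True if GCP latency is no more than `tolerance` worse than on-prem."""
    return gcp_ms <= baseline_ms * (1 + tolerance)

def checklist_gate(checks):
    """Go/no-go decision: every checklist item must pass before cutover."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    return (len(failed) == 0, failed)

# Hypothetical wave results
checks = {
    "functional_parity": True,
    "performance": performance_within_tolerance(baseline_ms=120, gcp_ms=150),
    "logging_and_alerting": True,
    "secrets_via_csi": True,
    "iam_reviewed": True,
    "dependency_connectivity": True,
}
print(checklist_gate(checks))
# -> (False, ['performance'])
```

Here the GCP latency (150ms against a 120ms baseline) breaches the 10% tolerance, so the gate blocks cutover and names the failing item.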

Phase 4: Cutover — Make It a Non-Event

The goal of a well-run migration is that the cutover is the most boring part of the programme.

By the time you reach cutover for any given workload, you’ve been running the GCP instance in parallel for 1–2 weeks, your validation checklist is signed off, and you’ve rehearsed the DNS/traffic switch at least once in non-prod. The cutover itself should take under 30 minutes.

My cutover sequence:

  1. Confirm on-prem instance is in read-only / quiesced state

  2. Final data sync if applicable

  3. Switch DNS / load balancer to GCP endpoint

  4. Confirm traffic routing via monitoring

  5. Run smoke tests against GCP endpoint

  6. Declare cutover complete — do not touch on-prem instance for 48 hours
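The sequence above is worth scripting as an ordered runner that aborts on first failure, so the rollback path is always the same DNS switch back. A sketch, with placeholder step functions standing in for real calls to your DNS provider, data-sync tooling, and smoke-test suite:

```python
def run_cutover(steps):
    """Execute cutover steps in order; stop and report on the first failure."""
    completed = []
    for name, action in steps:
        if not action():
            return {"status": "aborted", "failed_step": name,
                    "completed": completed}
        completed.append(name)
    return {"status": "complete", "completed": completed}

# Hypothetical steps: real ones would wrap your actual tooling.
steps = [
    ("quiesce_onprem", lambda: True),
    ("final_data_sync", lambda: True),
    ("switch_dns", lambda: True),
    ("confirm_traffic", lambda: True),
    ("smoke_tests", lambda: False),  # simulate a failing smoke test
]
print(run_cutover(steps))
```

With a failing smoke test, the runner reports exactly which step broke and which steps completed, which is the information the rollback decision needs at 2am.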

The 48-hour rule. I never decommission the on-prem instance immediately after cutover. It stays in a stopped-but-available state for 48 hours minimum. If something surfaces in production that wasn’t caught in parallel running, the rollback path is a single DNS switch — not a rebuild.

After 48 hours with no issues, the on-prem instance is decommissioned. This is your definition of done. Not “running on GCP” — decommissioned on-prem.

Phase 5: Optimisation — The Work That Justifies the Migration

Most migration engagements end at cutover. I push clients to schedule an optimisation phase 4–6 weeks after the final wave.

This is where the cloud-native value gets captured: right-sizing compute, enabling committed use discounts, implementing autoscaling that wasn’t possible on-prem, consolidating observability, and cleaning up any technical debt that accumulated during the migration waves. For a healthcare platform migration I ran, this phase delivered a 34% reduction in monthly GCP spend within six weeks of the final cutover — purely through right-sizing and commitment planning.
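The arithmetic behind that kind of saving is straightforward to model. The sketch below uses purely illustrative numbers (not figures from the engagement above): right-sizing trims a fraction of the bill, then a committed use discount applies to what remains:

```python
def estimated_savings(on_demand_monthly, rightsized_fraction, cud_discount):
    """Estimate the fractional saving after right-sizing, then committing.

    rightsized_fraction: portion of the bill remaining after right-sizing
    cud_discount:        discount applied to the right-sized, committed spend
    """
    rightsized = on_demand_monthly * rightsized_fraction
    committed = rightsized * (1 - cud_discount)
    return round(1 - committed / on_demand_monthly, 2)

# Illustrative: trim 20% by right-sizing, then a 20% commitment discount.
print(estimated_savings(on_demand_monthly=100_000,
                        rightsized_fraction=0.80,
                        cud_discount=0.20))
# -> 0.36
```

The two levers compound, which is why I sequence right-sizing before commitment planning: committing to over-provisioned capacity locks in the waste.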

The GCP Architecture & Modernization service covers how I structure ongoing advisory engagements after the migration is complete.

The Migration Playbook at a Glance

Phase             | Key Output                                               | Duration
1. Discovery      | Inventory, dependency graph, migration modes             | 2–4 weeks
2. Foundation     | Terraform landing zone, hybrid connectivity, monitoring  | 2–3 weeks
3. Sequencing     | Wave plan, parallel running, validation checklists       | 8–16 weeks
4. Cutover        | Traffic switch, 48hr hold, decommission                  | Per wave
5. Optimisation   | Right-sizing, cost reduction, cloud-native improvements  | 4–6 weeks post-migration

What I Look for in a Migration That’s Already in Trouble

If you’re reading this mid-migration and things aren’t going well, here are the three questions I ask first:

Is everything in Terraform? If not, stop adding workloads and get the existing GCP resources into IaC before the configuration drift becomes unmanageable.

Is there a clear definition of done per workload? If on-prem instances aren’t being decommissioned, you don’t have a migration — you have a replication. Set the decommission date as part of the wave plan.

Is there a dependency graph? If you’re discovering integrations during cutovers, go back to Phase 1. A week of discovery now is worth months of cutover failures later.

Planning a GCP Migration? Let’s Start With Discovery.

The most expensive mistake in a GCP migration is starting the clock before you know what you’re moving. A two-week discovery engagement with me will produce a migration inventory, dependency map, and wave plan that de-risks the entire programme before a single workload moves.

Explore my GCP Architecture & Modernization Services

Book a Free GCP Architecture Review
