
I’ve built variations of this platform at Tangerine Bank, Telus Health, Loblaws, and several SaaS companies. The specifics differ. The pattern doesn’t.
Every engagement starts from the same place. GCP projects created manually through the console. Terraform that technically exists but runs differently in every team. IAM bindings nobody remembers adding. CI/CD pipelines held together by one person’s tribal knowledge. Security reviewed after the fact — if at all.
This page documents the architecture I use to fix that. It covers the two-plane design, the pipeline model, the Service Golden Path, the DevSecOps pipeline sequence, the guardrail layers, and the GCP resource model. I’ve structured it as a complete reference — each section links to a deeper post where the topic warrants one.
Before getting into the architecture, it’s worth naming why so many platform efforts stall or get abandoned.
The most common failure mode I see is teams trying to build Level 3 before they’ve solved Level 1. They start designing a self-service portal, a service catalog, or an internal API layer — and spend six to twelve months building infrastructure that developers never adopt because the foundational problems (inconsistent projects, broken IAM, no pipeline standards) are still unsolved underneath.
The second failure mode is treating the platform as a product instead of a set of architectural decisions. A platform team that tries to build something impressive usually builds something unused. A platform team that tries to make the right thing the easy thing usually builds something that sticks.
The architecture in this post is designed to solve Level 1 and Level 2 problems completely — then give you a clean foundation to evolve toward Level 3 when adoption justifies it.
Everything in this architecture follows from one decision: separating the Control Plane from the Delivery Plane.
These two planes have different owners, different permission models, different release cadences, and different blast radii. Collapsing them into a single pipeline is the most common structural mistake I see on GCP platform teams — and it’s the one that turns the platform team into a bottleneck for every application team underneath them.
The Control Plane is where platform decisions live. It’s owned and operated exclusively by the platform team. Its responsibilities are project creation and governance, the IAM foundations, org-level policies, and the reusable pipeline workflows that every service consumes.
The Platform Pipeline that drives all of this runs as a dedicated privileged service account. It has approval gates before any change reaches production. Application teams never interact with it.
The Delivery Plane is where application teams own their delivery lifecycle — within the boundaries the Control Plane established.
Developers push from their IDE to their application and infrastructure repos on GitHub. GitHub Actions picks up those changes and runs the full pipeline: DevSecOps scanning, infrastructure provisioning via the Golden Path, and deployment to GKE or Cloud Run. The Delivery Plane has narrow IAM permissions scoped to its own GCP project. It cannot modify org-level policies, alter IAM foundations, or touch other teams’ resources.
The separation is what gives application teams genuine speed. They’re not waiting on the platform team for approvals. They’re operating inside a well-defined, automated boundary — and the boundary stays out of their way when they’re doing the right thing.
The Golden Path is the most important concept in this architecture. It’s what makes the Delivery Plane self-service without being a free-for-all.
A Golden Path is the opinionated, well-supported route from “I need a new service” to “I have a running, secure, observable service in GCP.” Every decision that a developer shouldn’t have to make — networking, IAM, encryption, observability wiring — is already encoded in the path.
In practice, the Golden Path works through a service configuration file that lives in the repo root. The developer declares what they need: runtime (GKE or Cloud Run), region, database requirements, caching requirements, exposure level (internal or external), and SLO targets. That declaration drives everything downstream. The pipeline parses it, selects the right Terraform modules, validates against OPA policies, provisions infrastructure, and deploys the application.
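As an illustration, a declaration of that shape might look like the sketch below. The schema and every field name here are hypothetical; the real interface is whatever the platform team publishes.

```yaml
# service.yaml — illustrative Golden Path service declaration.
# All field names are hypothetical; the platform defines the real schema.
name: payments-api
runtime: cloud-run            # or: gke
region: northamerica-northeast1
db:
  engine: postgres            # drives Cloud SQL provisioning
  tier: standard
cache:
  engine: redis               # drives Memorystore provisioning
exposure: internal            # external would provision an LB + TLS
slo:
  availability: 99.9          # drives Cloud Monitoring SLO creation
  latency_p95_ms: 300
```

The pipeline parses this file, maps each key to a Terraform module, and validates the resulting plan against policy before anything is provisioned.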
The developer never writes a firewall rule. They never configure a service account. They never touch raw Terraform. They express intent — the platform encodes execution.
End to end, the Golden Path covers everything between “I need a service” and “it’s running”: networking, IAM, encryption, observability wiring, infrastructure provisioning, and deployment.
The Platform Pipeline lives in a dedicated GitHub Platform Repo and is the only pipeline in the architecture with broad GCP permissions. It’s the engine that creates and governs every GCP project in the organisation.
When a new team needs a project, a platform engineer adds a configuration entry to the environment definitions in the Platform Repo. The Platform Pipeline takes it from there — no manual console work, no gcloud commands, no tickets to a cloud admin.
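A sketch of what one of those environment definition entries could look like — the file path, keys, and values are all illustrative, not the actual Platform Repo layout.

```yaml
# environments/projects.yaml — hypothetical environment definition entry.
# Adding a block like this is the only manual step; the Platform Pipeline
# provisions everything else from it.
- project_id: acme-payments-prod
  team: payments
  environment: prod
  folder: workloads/prod
  budget_monthly: 5000
  github_repos:
    - acme/payments-api
```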
Per project, the Platform Pipeline provisions the shared baseline: the VPC and subnets, the GKE cluster, Artifact Registry, IAM and WIF configuration, the Cloud Logging sink, and a budget alert.
Every Terraform plan in the Platform Pipeline is validated by OPA/Conftest before apply. Production changes require a manual approval gate from a platform lead. There is no fast path to production in the Control Plane.
Application teams own their service pipeline, but they don’t write it from scratch. The service pipeline calls reusable workflows from the Platform Repo. When the platform team updates a workflow, every service picks up the change automatically. This is how you maintain consistent security posture across dozens of services without a manual review process for each one.
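Mechanically, this uses GitHub’s reusable-workflows feature: a service repo’s pipeline reduces to a thin caller like the sketch below, where the organisation, repo, and workflow names are placeholders.

```yaml
# .github/workflows/service.yml in an application repo.
# "acme/platform-repo" and the workflow filename are illustrative.
name: service-pipeline
on:
  push:
    branches: [main]

jobs:
  pipeline:
    # Reusable workflow maintained by the platform team; changing it in
    # the Platform Repo updates every service that references this ref.
    uses: acme/platform-repo/.github/workflows/service-pipeline.yml@v1
    with:
      service-config: service.yaml
    secrets: inherit
```

Whether services pin to a moving tag like `@v1` or to `@main` is a deliberate trade-off between platform-controlled rollout and per-service pinning.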
The full pipeline sequence maps directly to the DevSecOps block in the architecture:
1. Secret Scanning: The first gate. Every commit is scanned for secrets, API keys, and credentials before anything else runs. A detected secret fails the pipeline immediately — no build, no deploy.
2. Unit Testing: Standard application test suite. Language-specific. Failure blocks the pipeline.
3. SAST and SCA: Static Application Security Testing scans the source code for security vulnerabilities. Software Composition Analysis scans third-party dependencies for known CVEs. Both run in parallel. Critical findings block the pipeline.
4. Container Image Build and Push: The application is containerised and pushed to the project’s Artifact Registry repository. The image tag is passed forward to the scanning stage.
5. Container Image Scanning: The built image is scanned for OS-level and application-level vulnerabilities before it can be deployed. Critical findings block the pipeline.
6. Infrastructure Provisioning: The service config file is parsed. Terraform selects the appropriate modules, generates a plan, and runs it through OPA/Conftest policy validation. On policy pass, infrastructure is provisioned or updated — Cloud SQL instance, Memorystore cluster, storage buckets, service account, DNS, TLS. This step only runs when infrastructure declarations in the service config have changed.
7. DAST: Dynamic Application Security Testing runs against a deployed instance of the service in a staging environment. Because this requires network access to the running service, it executes on the GitHub Actions Runner Controller hosted inside the VPC on GKE — not on GitHub-hosted runners.
8. Blue/Green Deployment: The application is deployed using a blue/green strategy. Traffic shifts to the new version only after health checks pass. On failure, traffic shifts back automatically.
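Inside such a pipeline, the eight stages translate naturally into GitHub Actions jobs chained with needs:, so the SAST/SCA parallelism and the hard gates fall out of the dependency graph. A sketch, with job names and the runner label illustrative and each echo standing in for the real tooling:

```yaml
# Illustrative job graph for the eight stages above.
name: service-pipeline-stages
on: workflow_call

jobs:
  secret-scan:
    runs-on: arc-runners
    steps:
      - run: echo "1. secret scanning"
  unit-test:
    needs: secret-scan
    runs-on: arc-runners
    steps:
      - run: echo "2. unit tests"
  sast:
    needs: unit-test
    runs-on: arc-runners
    steps:
      - run: echo "3a. SAST"
  sca:
    needs: unit-test          # runs in parallel with SAST
    runs-on: arc-runners
    steps:
      - run: echo "3b. SCA"
  build-push:
    needs: [sast, sca]        # both security gates must pass
    runs-on: arc-runners
    steps:
      - run: echo "4. build and push image"
  image-scan:
    needs: build-push
    runs-on: arc-runners
    steps:
      - run: echo "5. container image scanning"
  provision:
    needs: image-scan
    runs-on: arc-runners
    steps:
      - run: echo "6. terraform plan + OPA validation + apply"
  dast:
    needs: provision
    runs-on: arc-runners
    steps:
      - run: echo "7. DAST against staging"
  deploy:
    needs: dast
    runs-on: arc-runners
    steps:
      - run: echo "8. blue/green deployment"
```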
The GitHub Actions Runner Controller (ARC) runs on GKE inside the platform VPC. This is a deliberate architectural decision, not infrastructure convenience.
GitHub-hosted runners have no network route to internal GCP resources — private GKE API endpoints, Cloud SQL on private IPs, internal Cloud Run services, or private load balancers. Every organisation I’ve worked with that starts on GitHub-hosted runners eventually hits this wall when they try to run DAST or integration tests against non-public services.
ARC on GKE solves this cleanly. Pipeline jobs run as ephemeral pods inside the VPC, with full network access to internal resources. Each pod gets a short-lived identity derived from Workload Identity Federation (WIF), is destroyed after the job completes, and carries no state between runs. The runner scale set is configured with a minimum of one runner and scales up under load — cost stays low outside of active pipeline runs.
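For reference, that warm-runner behaviour maps to a handful of values in the gha-runner-scale-set Helm chart. The org URL, secret name, and scale ceiling below are placeholders:

```yaml
# values.yaml for the gha-runner-scale-set Helm chart.
# URL, secret name, and maxRunners are illustrative.
githubConfigUrl: https://github.com/acme      # org-level runner registration
githubConfigSecret: github-app-credentials    # GitHub App creds in a k8s secret
minRunners: 1                                 # one warm runner at all times
maxRunners: 20                                # scale ceiling under load
runnerScaleSetName: arc-runners               # the label workflows reference
```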
The ARC deployment itself is managed by the Platform Pipeline, so application teams never provision or manage runners. They reference the runner label in their workflow and the infrastructure is already there.
The platform only works if application teams can’t break it — accidentally or otherwise. But guardrails implemented badly become the very bottleneck the platform was supposed to eliminate. I enforce them in three layers designed to be invisible to developers doing the right thing.
The first layer is the IAM boundary between the planes. This is structural, not configurable. The Platform Pipeline service account has org-level permissions. The service pipeline service account has permissions scoped to one GCP project. Developers never get cloud credentials directly — they interact with GCP entirely through pipelines authenticated via WIF.
You cannot misconfigure your way around this layer. The boundary is in the IAM bindings, not in a policy document.
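A sketch of what that structural boundary can look like in Terraform, assuming a Workload Identity Federation pool for GitHub OIDC. The org, repo, and resource names are illustrative:

```hcl
# Illustrative WIF setup binding one service pipeline's identity to one repo.
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  # Only workflows from this org may exchange tokens at all.
  attribute_condition = "assertion.repository_owner == \"acme\""
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

resource "google_service_account" "service_pipeline" {
  account_id = "svc-pipeline-payments-api"
}

# The service account is only assumable from a single repository.
resource "google_service_account_iam_member" "pipeline" {
  service_account_id = google_service_account.service_pipeline.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/acme/payments-api"
}
```

No keys are minted anywhere in this flow; the boundary holds even if a workflow file is tampered with, because the binding lives in IAM, not in the repo.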
The second layer is policy-as-code. Every Terraform plan in both pipelines runs through Conftest before apply. The policies are maintained in the Platform Repo and cover rules such as:

- roles/owner and roles/editor are denied unconditionally

These checks run in CI and return immediate, actionable feedback. A policy violation tells the developer exactly which resource triggered which rule. No platform team review required. No ticket. No wait.
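As one concrete example, the roles/owner and roles/editor rule can be expressed as a Conftest policy evaluated against the Terraform plan JSON. A sketch, with the package layout and message text illustrative:

```rego
# policy/iam.rego — illustrative Conftest rule over a Terraform plan.
package main

denied_roles := {"roles/owner", "roles/editor"}

deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "google_project_iam_member"
  role := rc.change.after.role
  denied_roles[role]
  msg := sprintf("%s: %s is denied unconditionally; use a narrowly scoped role", [rc.address, role])
}
```

The sprintf message is what makes the feedback actionable: it names the exact resource address and the rule it tripped.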
The third layer is the configuration abstraction itself. Developers don’t get raw Terraform. They get a service configuration interface with a deliberately narrow surface area. They declare what they need — a Postgres database, a Redis cache, internal or external exposure, a security tier. The Terraform modules handle everything below that: VPC firewall rules, private service networking, service account bindings, encryption configuration, backup policies.
This layer is what gives developers real autonomy. They can move fast, provision infrastructure self-service, and never accidentally misconfigure a network or over-grant an IAM role — because those decisions are not in their surface area.
Every GCP project provisioned by the platform contains a consistent baseline. The table below shows the full resource set, who provisions each resource, and under what condition it exists.
| Resource | Provisioned by | Condition |
|---|---|---|
| GKE Cluster | Platform Module | Always |
| Artifact Registry | Platform Module | Always |
| VPC and Subnets | Platform Module | Always |
| IAM & WIF Configuration | Platform Module | Always |
| Cloud Logging Sink | Platform Module | Always |
| Budget Alert | Platform Module | Always |
| Cloud SQL | Service Module | If db: declared in service config |
| Memorystore (Redis) | Service Module | If cache: declared in service config |
| Cloud Storage Bucket | Service Module | If storage: declared in service config |
| Cloud Monitoring SLOs | Service Module | If SLO targets declared in service config |
| External Load Balancer + TLS | Service Module | If exposure: external in service config |
The design rule is clean ownership. The Platform Module provisions shared project infrastructure. The Service Module provisions service-scoped resources. Neither layer is touched manually. Everything is Terraform, everything is in version control, and the full resource history is auditable.
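Inside the Service Module, the condition column typically becomes a count guard on each resource. A sketch for the Cloud SQL case, with variable names assumed:

```hcl
# Illustrative conditional provisioning in the Service Module: the
# instance exists only when the service config declares a database.
resource "google_sql_database_instance" "service" {
  count            = var.db != null ? 1 : 0
  name             = "${var.service_name}-db"
  database_version = "POSTGRES_15"
  region           = var.region

  settings {
    tier = "db-custom-2-8192"
    ip_configuration {
      ipv4_enabled    = false        # private IP only
      private_network = var.vpc_id   # provisioned by the Platform Module
    }
  }
}
```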
Observability ownership follows the same two-plane split.
The Platform Pipeline owns the org-level log sink, per-project logging configuration, and budget alerts.
The Service Pipeline owns the Cloud Monitoring SLOs and alerts generated from the targets declared in the service config.
The practical outcome: every service has baseline observability from the moment it’s deployed, regardless of how much the development team has invested in monitoring. The platform team doesn’t chase down teams to configure log sinks. Budget alerts don’t go missing. And when an incident happens, the org-level log sink means every project’s logs are already in one place.
I’d place this architecture at Level 2 on the IDP maturity curve. It has automated project creation, self-service service deployment via the Golden Path, full DevSecOps pipeline enforcement, WIF throughout, and observability wired from configuration. Application teams are genuinely self-sufficient within a well-enforced boundary.
What it deliberately does not have: a self-service portal, a service catalog, or an internal platform API layer.
Those are Level 3 features — and they’re the right investment once you have real adoption at Level 2. I’ve seen teams spend twelve months building Level 3 infrastructure and end up with shelfware because the foundational problems were never solved. The portal is not what makes a platform succeed. Developer adoption is what makes a platform succeed — and adoption comes from solving the actual pain, not from building an impressive interface on top of unresolved chaos.
When teams are actively using the Golden Path, asking for self-service capabilities the current architecture can’t provide, and you have a platform team with capacity to maintain a portal — that’s when Level 3 makes sense. Not before.
This platform architecture is a direct implementation of the SCALE Framework I apply across my GCP engagements.
Security by Design — WIF eliminates service account key exposure. OPA/Conftest policies enforce guardrails at plan time, not review time. Org Policies prevent configuration drift. IAM boundaries between Control Plane and Delivery Plane are structural, not advisory.
Cloud-Native — GKE with ARC for pipeline execution. Workload Identity Federation for authentication. Cloud Monitoring native SLOs. Org Policies for governance. No third-party control planes required to operate at this level.
Automation/IaC — Everything is Terraform. Platform module, service module, ARC deployment, WIF configuration, monitoring resources, budget alerts — all in version control, all auditable, nothing manual.
Lifecycle Operations — The two-plane separation gives platform and service changes different release cadences, different approval gates, and different blast radii. A bad service deployment cannot affect the platform. A platform change cannot accidentally trigger a service deployment.
Elastic Scalability — The Golden Path scales horizontally with no additional platform team involvement. Onboarding a new team is a configuration entry in the Platform Repo and a service config in their repo. No manual setup. No tickets. No knowledge transfer session required.
This architecture is not the ceiling. It’s the floor — deliberately designed to be built incrementally and evolved as adoption grows.
An Internal Developer Platform is not a product. It’s a set of architectural decisions, encoded in automation. The decisions in this post — two planes, Golden Path, WIF, OPA guardrails, GitHub Actions with ARC — are the ones I’ve found to deliver adoption consistently, across startups and enterprises alike.
Make the platform boring. Let the applications be interesting.
If your team is somewhere between ClickOps and overengineered — inconsistent environments, security bolted on after the fact, pipelines only one person understands — this is the kind of engagement I run.
I typically start with a two-week platform assessment: current state architecture review, gap analysis against your security and compliance requirements, and a pragmatic roadmap your team can actually execute.
Let’s talk about your platform →