
I’ve built variations of this platform at Tangerine Bank, Telus Health, Loblaws, and several SaaS companies. The specifics differ. The pattern doesn’t.
Every engagement starts from the same place. GCP projects created manually through the console. Terraform that technically exists but runs differently in every team. IAM bindings nobody remembers adding. CI/CD pipelines held together by one person’s tribal knowledge. Security reviewed after the fact — if at all.
This page documents the architecture I use to fix that. It covers the two-plane design, the pipeline model, the Service Golden Path, the DevSecOps pipeline sequence, the guardrail layers, and the GCP resource model. I’ve structured it as a complete reference — each section links to a deeper post where the topic warrants one.
Before getting into the architecture, it’s worth naming why so many platform efforts stall or get abandoned.
The most common failure mode I see is teams trying to build Level 3 before they’ve solved Level 1. They start designing a self-service portal, a service catalog, or an internal API layer — and spend six to twelve months building infrastructure that developers never adopt because the foundational problems (inconsistent projects, broken IAM, no pipeline standards) are still unsolved underneath.
The second failure mode is treating the platform as a product instead of a set of architectural decisions. A platform team that tries to build something impressive usually builds something unused. A platform team that tries to make the right thing the easy thing usually builds something that sticks.
The architecture in this post is designed to solve Level 1 and Level 2 problems completely — then give you a clean foundation to evolve toward Level 3 when adoption justifies it.
Everything in this architecture follows from one decision: separating the Control Plane from the Delivery Plane.
These two planes have different owners, different permission models, different release cadences, and different blast radii. Collapsing them into a single pipeline is the most common structural mistake I see on GCP platform teams — and it’s the one that turns the platform team into a bottleneck for every application team underneath them.
The Control Plane is where platform decisions live. It’s owned and operated exclusively by the platform team. Its responsibilities are project creation and governance, the IAM foundations, org-level policies, and the reusable pipeline workflows that every service consumes.
The Platform Pipeline that drives all of this runs as a dedicated privileged service account. It has approval gates before any change reaches production. Application teams never interact with it.
The Delivery Plane is where application teams own their delivery lifecycle — within the boundaries the Control Plane established.
Developers push from their IDE to their application and infrastructure repos on GitHub. GitHub Actions picks up those changes and runs the full pipeline: DevSecOps scanning, infrastructure provisioning via the Golden Path, and deployment to GKE or Cloud Run. The Delivery Plane has narrow IAM permissions scoped to its own GCP project. It cannot modify org-level policies, alter IAM foundations, or touch other teams’ resources.
The separation is what gives application teams genuine speed. They’re not waiting on the platform team for approvals. They’re operating inside a well-defined, automated boundary — and the boundary stays out of their way when they’re doing the right thing.
The Golden Path is the most important concept in this architecture. It’s what makes the Delivery Plane self-service without being a free-for-all.
A Golden Path is the opinionated, well-supported route from “I need a new service” to “I have a running, secure, observable service in GCP.” Every decision that a developer shouldn’t have to make — networking, IAM, encryption, observability wiring — is already encoded in the path.
In practice, the Golden Path works through a service configuration file that lives in the repo root. The developer declares what they need: runtime (GKE or Cloud Run), region, database requirements, caching requirements, exposure level (internal or external), and SLO targets. That declaration drives everything downstream. The pipeline parses it, selects the right Terraform modules, validates against OPA policies, provisions infrastructure, and deploys the application.
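As an illustration, a declaration of that shape might look like the sketch below. The schema and every field name here are hypothetical; the real interface is whatever the platform team publishes.

```yaml
# service.yaml — illustrative Golden Path service declaration.
# All field names are hypothetical; the platform defines the real schema.
name: payments-api
runtime: cloud-run            # or: gke
region: northamerica-northeast1
db:
  engine: postgres            # drives Cloud SQL provisioning
  tier: standard
cache:
  engine: redis               # drives Memorystore provisioning
exposure: internal            # external would provision an LB + TLS
slo:
  availability: 99.9          # drives Cloud Monitoring SLO creation
  latency_p95_ms: 300
```

The pipeline parses this file, maps each key to a Terraform module, and validates the resulting plan against policy before anything is provisioned.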
The developer never writes a firewall rule. They never configure a service account. They never touch raw Terraform. They express intent — the platform encodes execution.
End to end, the Golden Path covers everything between “I need a service” and “it’s running”: networking, IAM, encryption, observability wiring, infrastructure provisioning, and deployment.
The Platform Pipeline lives in a dedicated GitHub Platform Repo and is the only pipeline in the architecture with broad GCP permissions. It’s the engine that creates and governs every GCP project in the organisation.
When a new team needs a project, a platform engineer adds a configuration entry to the environment definitions in the Platform Repo. The Platform Pipeline takes it from there — no manual console work, no gcloud commands, no tickets to a cloud admin.
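A sketch of what one of those environment definition entries could look like — the file path, keys, and values are all illustrative, not the actual Platform Repo layout.

```yaml
# environments/projects.yaml — hypothetical environment definition entry.
# Adding a block like this is the only manual step; the Platform Pipeline
# provisions everything else from it.
- project_id: acme-payments-prod
  team: payments
  environment: prod
  folder: workloads/prod
  budget_monthly: 5000
  github_repos:
    - acme/payments-api
```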
Per project, the Platform Pipeline provisions the shared baseline: the VPC and subnets, the GKE cluster, Artifact Registry, IAM and WIF configuration, the Cloud Logging sink, and a budget alert.
Every Terraform plan in the Platform Pipeline is validated by OPA/Conftest before apply. Production changes require a manual approval gate from a platform lead. There is no fast path to production in the Control Plane.
Application teams own their service pipeline, but they don’t write it from scratch. The service pipeline calls reusable workflows from the Platform Repo. When the platform team updates a workflow, every service picks up the change automatically. This is how you maintain consistent security posture across dozens of services without a manual review process for each one.
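Mechanically, this uses GitHub’s reusable-workflows feature: a service repo’s pipeline reduces to a thin caller like the sketch below, where the organisation, repo, and workflow names are placeholders.

```yaml
# .github/workflows/service.yml in an application repo.
# "acme/platform-repo" and the workflow filename are illustrative.
name: service-pipeline
on:
  push:
    branches: [main]

jobs:
  pipeline:
    # Reusable workflow maintained by the platform team; changing it in
    # the Platform Repo updates every service that references this ref.
    uses: acme/platform-repo/.github/workflows/service-pipeline.yml@v1
    with:
      service-config: service.yaml
    secrets: inherit
```

Whether services pin to a moving tag like `@v1` or to `@main` is a deliberate trade-off between platform-controlled rollout and per-service pinning.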
The full pipeline sequence maps directly to the DevSecOps block in the architecture:
1. Secret Scanning: The first gate. Every commit is scanned for secrets, API keys, and credentials before anything else runs. A detected secret fails the pipeline immediately — no build, no deploy.
2. Unit Testing: Standard application test suite. Language-specific. Failure blocks the pipeline.
3. SAST and SCA: Static Application Security Testing scans the source code for security vulnerabilities. Software Composition Analysis scans third-party dependencies for known CVEs. Both run in parallel. Critical findings block the pipeline.
4. Container Image Build and Push: The application is containerised and pushed to the project’s Artifact Registry repository. The image tag is passed forward to the scanning stage.
5. Container Image Scanning: The built image is scanned for OS-level and application-level vulnerabilities before it can be deployed. Critical findings block the pipeline.
6. Infrastructure Provisioning: The service config file is parsed. Terraform selects the appropriate modules, generates a plan, and runs it through OPA/Conftest policy validation. On policy pass, infrastructure is provisioned or updated — Cloud SQL instance, Memorystore cluster, storage buckets, service account, DNS, TLS. This step only runs when infrastructure declarations in the service config have changed.
7. DAST: Dynamic Application Security Testing runs against a deployed instance of the service in a staging environment. Because this requires network access to the running service, it executes on the GitHub Actions Runner Controller hosted inside the VPC on GKE — not on GitHub-hosted runners.
8. Blue/Green Deployment: The application is deployed using a blue/green strategy. Traffic shifts to the new version only after health checks pass. On failure, traffic shifts back automatically.
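Inside such a pipeline, the eight stages translate naturally into GitHub Actions jobs chained with needs:, so the SAST/SCA parallelism and the hard gates fall out of the dependency graph. A sketch, with job names and the runner label illustrative and each echo standing in for the real tooling:

```yaml
# Illustrative job graph for the eight stages above.
name: service-pipeline-stages
on: workflow_call

jobs:
  secret-scan:
    runs-on: arc-runners
    steps:
      - run: echo "1. secret scanning"
  unit-test:
    needs: secret-scan
    runs-on: arc-runners
    steps:
      - run: echo "2. unit tests"
  sast:
    needs: unit-test
    runs-on: arc-runners
    steps:
      - run: echo "3a. SAST"
  sca:
    needs: unit-test          # runs in parallel with SAST
    runs-on: arc-runners
    steps:
      - run: echo "3b. SCA"
  build-push:
    needs: [sast, sca]        # both security gates must pass
    runs-on: arc-runners
    steps:
      - run: echo "4. build and push image"
  image-scan:
    needs: build-push
    runs-on: arc-runners
    steps:
      - run: echo "5. container image scanning"
  provision:
    needs: image-scan
    runs-on: arc-runners
    steps:
      - run: echo "6. terraform plan + OPA validation + apply"
  dast:
    needs: provision
    runs-on: arc-runners
    steps:
      - run: echo "7. DAST against staging"
  deploy:
    needs: dast
    runs-on: arc-runners
    steps:
      - run: echo "8. blue/green deployment"
```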
The GitHub Actions Runner Controller (ARC) runs on GKE inside the platform VPC. This is a deliberate architectural decision, not infrastructure convenience.
GitHub-hosted runners have no network route to internal GCP resources — private GKE API endpoints, Cloud SQL on private IPs, internal Cloud Run services, or private load balancers. Every organisation I’ve worked with that starts on GitHub-hosted runners eventually hits this wall when they try to run DAST or integration tests against non-public services.
ARC on GKE solves this cleanly. Pipeline jobs run as ephemeral pods inside the VPC, with full network access to internal resources. Each pod gets a short-lived identity derived from Workload Identity Federation (WIF), is destroyed after the job completes, and carries no state between runs. The runner scale set is configured with a minimum of one runner and scales up under load — cost stays low outside of active pipeline runs.
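For reference, that warm-runner behaviour maps to a handful of values in the gha-runner-scale-set Helm chart. The org URL, secret name, and scale ceiling below are placeholders:

```yaml
# values.yaml for the gha-runner-scale-set Helm chart.
# URL, secret name, and maxRunners are illustrative.
githubConfigUrl: https://github.com/acme      # org-level runner registration
githubConfigSecret: github-app-credentials    # GitHub App creds in a k8s secret
minRunners: 1                                 # one warm runner at all times
maxRunners: 20                                # scale ceiling under load
runnerScaleSetName: arc-runners               # the label workflows reference
```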
The ARC deployment itself is managed by the Platform Pipeline, so application teams never provision or manage runners. They reference the runner label in their workflow and the infrastructure is already there.
The platform only works if application teams can’t break it — accidentally or otherwise. But guardrails implemented badly become the very bottleneck the platform was supposed to eliminate. I enforce them in three layers designed to be invisible to developers doing the right thing.
The first layer is the IAM boundary between the planes. This is structural, not configurable. The Platform Pipeline service account has org-level permissions. The service pipeline service account has permissions scoped to one GCP project. Developers never get cloud credentials directly — they interact with GCP entirely through pipelines authenticated via WIF.
You cannot misconfigure your way around this layer. The boundary is in the IAM bindings, not in a policy document.
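A sketch of what that structural boundary can look like in Terraform, assuming a Workload Identity Federation pool for GitHub OIDC. The org, repo, and resource names are illustrative:

```hcl
# Illustrative WIF setup binding one service pipeline's identity to one repo.
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  # Only workflows from this org may exchange tokens at all.
  attribute_condition = "assertion.repository_owner == \"acme\""
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

resource "google_service_account" "service_pipeline" {
  account_id = "svc-pipeline-payments-api"
}

# The service account is only assumable from a single repository.
resource "google_service_account_iam_member" "pipeline" {
  service_account_id = google_service_account.service_pipeline.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/acme/payments-api"
}
```

No keys are minted anywhere in this flow; the boundary holds even if a workflow file is tampered with, because the binding lives in IAM, not in the repo.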
The second layer is policy-as-code. Every Terraform plan in both pipelines runs through Conftest before apply. The policies are maintained in the Platform Repo and cover rules such as:

- roles/owner and roles/editor are denied unconditionally

These checks run in CI and return immediate, actionable feedback. A policy violation tells the developer exactly which resource triggered which rule. No platform team review required. No ticket. No wait.
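As one concrete example, the roles/owner and roles/editor rule can be expressed as a Conftest policy evaluated against the Terraform plan JSON. A sketch, with the package layout and message text illustrative:

```rego
# policy/iam.rego — illustrative Conftest rule over a Terraform plan.
package main

denied_roles := {"roles/owner", "roles/editor"}

deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "google_project_iam_member"
  role := rc.change.after.role
  denied_roles[role]
  msg := sprintf("%s: %s is denied unconditionally; use a narrowly scoped role", [rc.address, role])
}
```

The sprintf message is what makes the feedback actionable: it names the exact resource address and the rule it tripped.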
The third layer is the configuration abstraction itself. Developers don’t get raw Terraform. They get a service configuration interface with a deliberately narrow surface area. They declare what they need — a Postgres database, a Redis cache, internal or external exposure, a security tier. The Terraform modules handle everything below that: VPC firewall rules, private service networking, service account bindings, encryption configuration, backup policies.
This layer is what gives developers real autonomy. They can move fast, provision infrastructure self-service, and never accidentally misconfigure a network or over-grant an IAM role — because those decisions are not in their surface area.
Every GCP project provisioned by the platform contains a consistent baseline. The table below shows the full resource set, who provisions each resource, and under what condition it exists.
| Resource | Provisioned by | Condition |
|---|---|---|
| GKE Cluster | Platform Module | Always |
| Artifact Registry | Platform Module | Always |
| VPC and Subnets | Platform Module | Always |
| IAM & WIF Configuration | Platform Module | Always |
| Cloud Logging Sink | Platform Module | Always |
| Budget Alert | Platform Module | Always |
| Cloud SQL | Service Module | If db: declared in service config |
| Memorystore (Redis) | Service Module | If cache: declared in service config |
| Cloud Storage Bucket | Service Module | If storage: declared in service config |
| Cloud Monitoring SLOs | Service Module | If SLO targets declared in service config |
| External Load Balancer + TLS | Service Module | If exposure: external in service config |
The design rule is clean ownership. The Platform Module provisions shared project infrastructure. The Service Module provisions service-scoped resources. Neither layer is touched manually. Everything is Terraform, everything is in version control, and the full resource history is auditable.
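Inside the Service Module, the condition column typically becomes a count guard on each resource. A sketch for the Cloud SQL case, with variable names assumed:

```hcl
# Illustrative conditional provisioning in the Service Module: the
# instance exists only when the service config declares a database.
resource "google_sql_database_instance" "service" {
  count            = var.db != null ? 1 : 0
  name             = "${var.service_name}-db"
  database_version = "POSTGRES_15"
  region           = var.region

  settings {
    tier = "db-custom-2-8192"
    ip_configuration {
      ipv4_enabled    = false        # private IP only
      private_network = var.vpc_id   # provisioned by the Platform Module
    }
  }
}
```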
Observability ownership follows the same two-plane split.
The Platform Pipeline owns the org-level log sink, per-project logging configuration, and budget alerts.
The Service Pipeline owns the Cloud Monitoring SLOs and alerts generated from the targets declared in the service config.
The practical outcome: every service has baseline observability from the moment it’s deployed, regardless of how much the development team has invested in monitoring. The platform team doesn’t chase down teams to configure log sinks. Budget alerts don’t go missing. And when an incident happens, the org-level log sink means every project’s logs are already in one place.
I’d place this architecture at Level 2 on the IDP maturity curve. It has automated project creation, self-service service deployment via the Golden Path, full DevSecOps pipeline enforcement, WIF throughout, and observability wired from configuration. Application teams are genuinely self-sufficient within a well-enforced boundary.
What it deliberately does not have: a self-service portal, a service catalog, or an internal platform API layer.
Those are Level 3 features — and they’re the right investment once you have real adoption at Level 2. I’ve seen teams spend twelve months building Level 3 infrastructure and end up with shelfware because the foundational problems were never solved. The portal is not what makes a platform succeed. Developer adoption is what makes a platform succeed — and adoption comes from solving the actual pain, not from building an impressive interface on top of unresolved chaos.
When teams are actively using the Golden Path, asking for self-service capabilities the current architecture can’t provide, and you have a platform team with capacity to maintain a portal — that’s when Level 3 makes sense. Not before.
This platform architecture is a direct implementation of the SCALE Framework I apply across my GCP engagements.
Security by Design — WIF eliminates service account key exposure. OPA/Conftest policies enforce guardrails at plan time, not review time. Org Policies prevent configuration drift. IAM boundaries between Control Plane and Delivery Plane are structural, not advisory.
Cloud-Native — GKE with ARC for pipeline execution. Workload Identity Federation for authentication. Cloud Monitoring native SLOs. Org Policies for governance. No third-party control planes required to operate at this level.
Automation/IaC — Everything is Terraform. Platform module, service module, ARC deployment, WIF configuration, monitoring resources, budget alerts — all in version control, all auditable, nothing manual.
Lifecycle Operations — The two-plane separation gives platform and service changes different release cadences, different approval gates, and different blast radii. A bad service deployment cannot affect the platform. A platform change cannot accidentally trigger a service deployment.
Elastic Scalability — The Golden Path scales horizontally with no additional platform team involvement. Onboarding a new team is a configuration entry in the Platform Repo and a service config in their repo. No manual setup. No tickets. No knowledge transfer session required.
This architecture is not the ceiling. It’s the floor — deliberately designed to be built incrementally and evolved as adoption grows.
An Internal Developer Platform is not a product. It’s a set of architectural decisions, encoded in automation. The decisions in this post — two planes, Golden Path, WIF, OPA guardrails, GitHub Actions with ARC — are the ones I’ve found to deliver adoption consistently, across startups and enterprises alike.
Make the platform boring. Let the applications be interesting.
If your team is somewhere between ClickOps and overengineered — inconsistent environments, security bolted on after the fact, pipelines only one person understands — this is the kind of engagement I run.
I typically start with a two-week platform assessment: current state architecture review, gap analysis against your security and compliance requirements, and a pragmatic roadmap your team can actually execute.
Let’s talk about your platform →