SRE & Observability on GCP

I help engineering teams design reliable GCP platforms by building proper observability, monitoring, and operational practices into the system architecture from the start. My focus is on making production systems measurable, debuggable, and resilient by design — not by heroics. I’m Amit Malhotra, a Principal GCP Architect based in Toronto with 20+ years in IT and 6+ years hands-on with Google Cloud Monitoring, distributed tracing, SLO design, and GKE observability. I’ve designed and operated observability stacks for regulated enterprises and SaaS platforms across Canada and the USA — environments where production incidents have real business and compliance consequences.

Observability and SRE practices are the operational layer of the SCALE Framework — the Lifecycle Operations pillar that ensures platforms stay reliable, debuggable, and continuously improving after they’re built and deployed. Every platform I design has observability woven in from the architecture stage, not instrumented as an afterthought.

The Problem Most Teams Face

Why Production Reliability Stays a Problem — Even After You Add Monitoring

Most GCP platforms have monitoring. Very few have observability. The difference is whether your monitoring only tells you that something is wrong, or tells you why it’s wrong and where. I’m typically brought in when one or more of these patterns has taken hold:

  • Monitoring exists but nobody trusts it — dashboards full of metrics that don’t clearly reflect user-facing health, alert thresholds set arbitrarily, and engineers who’ve learned to ignore the noise because most alerts don’t indicate real problems
  • Alerts are so noisy they’re ignored — alert fatigue from dashboards that fire on every resource spike rather than on actual user impact, with on-call engineers conditioned to silence alerts rather than investigate them
  • Logs are scattered and inconsistent — application logs in Cloud Logging, infrastructure logs in a separate project, GKE node logs somewhere else, and no consistent log format or correlation strategy that lets you trace a request across services during an incident
  • No SLOs or error budgets — reliability managed by intuition and incident count rather than explicit service level objectives that define what ‘good enough’ looks like and give engineering teams a structured way to balance reliability investment against feature velocity
  • Incidents are handled reactively — no defined incident response process, no runbooks, no clear escalation path, and post-incident reviews that produce action items nobody follows up on because there’s no mechanism to ensure they get done
  • Root causes are rarely addressed — the same categories of incidents recurring because investigations stop at the immediate fix rather than the systemic cause, and no continuous improvement loop to drive reliability engineering work

The result is a platform that technically works — but is operationally fragile, with a reliability posture that depends on individual engineers rather than the system itself.

What I Typically Work On

The SRE and Observability Work I Do

  • Observability Strategy Design: I design the end-to-end observability strategy for your GCP platform — what to instrument, how to structure logs, what metrics reflect real system health, how to implement distributed tracing across services, and how to make the observability data actionable for both on-call engineers and engineering leadership.

  • Google Cloud Monitoring Implementation: I implement Cloud Monitoring correctly — metric collection from GKE workloads, Cloud Run services, and GCP managed services; custom dashboards that show system health in terms of user impact rather than raw resource utilisation; and alerting policies tied to SLOs rather than arbitrary thresholds.

  • SLO and Error Budget Design: I define service level objectives with your engineering and product teams — what the right reliability targets are for each service, how to measure them accurately with Cloud Monitoring, and how to implement error budget tracking that gives engineering teams a structured framework for reliability investment decisions.

  • Centralised Logging and Tracing: I design and implement centralised logging across all GCP projects — consistent log formats, log sinks to a dedicated observability project, log-based metrics, and distributed tracing with Cloud Trace or OpenTelemetry so you can follow a request from the load balancer through every downstream service during an incident.

  • Incident Response Workflows: I design incident response processes that work in practice — clear severity definitions, escalation paths, runbook templates, incident commander roles, and blameless postmortem practices that produce systemic improvements rather than individual blame.

  • Operational Dashboards: I build environment-level and service-level dashboards that give platform teams and engineering leadership a clear, accurate picture of system health — SLO burn rate dashboards, GKE cluster health views, service dependency maps, and cost visibility dashboards that make the platform’s operational state visible at a glance.
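
As one concrete illustration of the consistent log format this work produces, here is a minimal Python sketch of a structured JSON log formatter with a propagated request ID. The `JsonFormatter` class, the field names, and the `checkout` logger name are illustrative assumptions, not a prescribed schema:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent field names,
    so severity, message, and request_id can be indexed and correlated."""
    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # request_id is attached via the `extra` kwarg; default to "-"
            "request_id": getattr(record, "request_id", "-"),
        }
        return json.dumps(entry)

def make_logger(name="checkout"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger

# Usage: one request_id propagated through every log line for a request.
logger = make_logger()
request_id = str(uuid.uuid4())
logger.info("payment authorised", extra={"request_id": request_id})
```

On GKE and Cloud Run, JSON written to stdout in this shape is parsed into the log entry’s `jsonPayload`, and a recognised `severity` field sets the entry’s severity, which is what makes log-based metrics and cross-service correlation practical.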

Observability Is Not a Tool — It’s an Architectural Property

The difference between a platform where you can diagnose a production incident in 10 minutes and one where it takes 3 hours is almost never the monitoring tooling. It’s whether the system was designed to be observable — whether the right signals were instrumented, logs were structured to be searchable, traces were propagated across service boundaries, and dashboards were built around user-facing health rather than infrastructure metrics.

  • Metrics that reflect real system health — SLO-aligned metrics measuring user-facing availability and latency, not just CPU and memory utilisation. Cloud Monitoring custom metrics from GKE workloads and Cloud Run services instrumented with meaningful health signals.

  • Logs that support debugging — structured JSON logs with consistent field names, request IDs propagated across service calls, error context that includes enough information to diagnose without needing to reproduce the issue, and log-based metrics that surface error rates in dashboards.

  • Traces that show service interactions — distributed tracing with Cloud Trace or OpenTelemetry propagated across all services, so you can see the full request path, identify latency hotspots, and understand how a slow downstream dependency is affecting user-facing response times.

  • Alerts tied to user impact — alerting policies based on SLO burn rate rather than resource thresholds, so alerts fire when users are actually being affected and stay quiet when resource spikes don’t translate to degraded experience.

  • Dashboards for platform visibility — environment-level dashboards showing the health of the whole platform, service-level dashboards showing individual service SLOs, and GKE node and namespace dashboards for infrastructure debugging — all built around the questions engineers actually ask during incidents.

The goal: you can understand exactly what your system is doing at any moment — and diagnose any production issue without guesswork.
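
The burn-rate alerting described above reduces to a simple calculation. A minimal Python sketch, assuming a 99.9% availability SLO over a 30-day window; the 14.4 threshold is the widely cited fast-burn value (2% of a 30-day budget consumed in one hour), and the two-window condition is what keeps brief spikes from paging anyone:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.
    A burn rate of 1.0 exhausts the budget exactly at the end of the SLO window."""
    budget = 1.0 - slo_target
    return error_rate / budget

def should_page(fast_window_error_rate: float,
                slow_window_error_rate: float,
                slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only when both a short and a long window
    show a high burn rate, filtering out transient spikes."""
    return (burn_rate(fast_window_error_rate, slo_target) >= threshold
            and burn_rate(slow_window_error_rate, slo_target) >= threshold)

# 2% errors in both windows against a 99.9% SLO is a 20x burn rate: page.
# 0.05% errors is a 0.5x burn rate: stay quiet.
```

In practice this condition lives in a Cloud Monitoring alert policy rather than application code; the sketch only shows why the alert stays quiet on resource spikes that don’t translate into user-facing errors.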

Turning Reliability Into an Engineering Discipline

SRE practices give engineering teams a framework for making reliability decisions systematically rather than reactively. Combined with a well-designed GCP platform foundation, these practices are the difference between a team that manages reliability through heroics and one that engineers it. I apply:

  • Service Level Objectives: Explicit, measurable reliability targets for each service — defined in terms of user-facing outcomes (availability, latency, error rate) rather than infrastructure metrics. SLOs give engineering teams a shared language for reliability and a data-driven framework for prioritising reliability work versus feature development.

  • Error Budgets: The operational complement to SLOs — error budgets define how much unreliability a service is allowed before reliability work takes priority over feature work. Teams with healthy error budgets can deploy aggressively. Teams burning their error budget slow down and invest in reliability. This makes reliability decisions explicit and data-driven rather than subjective.

  • Blameless Postmortems: Post-incident review processes designed to identify systemic causes rather than assign individual blame — structured postmortem templates, action item tracking, and follow-through mechanisms that ensure postmortem learnings actually improve the platform rather than being filed and forgotten.

  • Capacity Planning: GCP resource capacity planning informed by traffic projections, load testing results, and historical scaling patterns — so your platform handles growth without emergency scaling events or over-provisioning that inflates costs.

  • Incident Management Processes: Defined incident severity levels, escalation paths, on-call rotation design, runbook templates for common failure modes, and clear incident commander responsibilities — so the first minutes of an incident are spent diagnosing rather than figuring out who does what.

  • Continuous Improvement Loops: Reliability review cadences, SLO tracking dashboards, error budget burn reports, and quarterly reliability retrospectives that create a continuous feedback loop between production incidents and platform improvements.
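
The arithmetic behind error budgets is simple enough to sketch. Assuming an availability SLO measured over a rolling 30-day window (the function names are illustrative):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed downtime, in minutes, over the window for an availability SLO."""
    return (1.0 - slo_target) * window_days * 24 * 60

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget left after the downtime observed so far."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget

# A 99.9% SLO over 30 days allows 43.2 minutes of downtime;
# a 20-minute incident leaves roughly 54% of the budget.
```

The value of the calculation is the conversation it forces: a team that can see 54% of its budget remaining mid-window can make an explicit, data-driven call on whether the next sprint goes to features or to reliability work.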

The Observability Stack I Design for GCP Platforms

The SRE and observability architectures I design for GCP platforms typically include these components — tailored to your platform’s scale, team structure, and operational requirements:

  • Google Cloud Monitoring and Logging: Cloud Monitoring as the primary metrics platform — Workload Metrics from GKE, custom metrics from applications, Cloud Run service metrics, and GCP managed service metrics all surfaced in a consistent monitoring workspace. Cloud Logging for centralised log aggregation across all GCP projects with log-based alerting and log-based metrics.

  • Distributed Tracing: Cloud Trace or OpenTelemetry instrumentation across all services — trace context propagated through HTTP headers and gRPC metadata so every request can be followed end-to-end. Trace sampling configured to capture 100% of error traces and a representative sample of successful requests without prohibitive cost.

  • Health Checks and Probes: GKE liveness and readiness probes configured correctly for every workload — not the default settings, but probes tuned to the actual startup time and health semantics of each service. Cloud Load Balancing health checks aligned with backend service health requirements.

  • Centralised Alerting: Alert policies managed as code — Terraform-managed Cloud Monitoring alert policies with SLO burn rate conditions, notification channels configured for the right escalation paths, and alert documentation that tells on-call engineers what to do when an alert fires.

  • Environment-Level Dashboards: Separate dashboards for platform health, individual service SLOs, GKE cluster and namespace health, and cost visibility — built in Cloud Monitoring and accessible to both the engineering team and engineering leadership without requiring access to raw metrics.

  • Automated Remediation: Where appropriate — GKE Horizontal Pod Autoscaler and Vertical Pod Autoscaler for capacity management, Cloud Run concurrency and scaling configuration for serverless workloads, and automated rollback triggers in CI/CD pipelines when error rates exceed defined thresholds after a deployment.
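
The automated rollback trigger mentioned above boils down to a guard like the following. This is a sketch, not a definitive implementation; the thresholds, names, and minimum sample size are illustrative assumptions to be tuned per service:

```python
def should_roll_back(baseline_error_rate: float,
                     current_error_rate: float,
                     request_count: int,
                     min_requests: int = 100,
                     absolute_threshold: float = 0.05,
                     relative_factor: float = 3.0) -> bool:
    """Decide whether a newly deployed revision should be rolled back.
    Rolls back only when the post-deploy error rate is above an absolute
    ceiling AND well above the pre-deploy baseline, with a minimum sample
    size so a handful of early requests can't trigger a false rollback."""
    if request_count < min_requests:
        return False  # not enough traffic yet to judge the new revision
    if current_error_rate < absolute_threshold:
        return False  # error rate acceptable in absolute terms
    return current_error_rate >= relative_factor * baseline_error_rate
```

Requiring both an absolute and a relative condition matters: a service with a chronically noisy 4% baseline shouldn’t roll back every deploy, and a service whose errors triple from 0.1% to 0.3% may still be comfortably inside its SLO.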

Is SRE and Observability the Right Investment Right Now?

SRE and observability work delivers the most value when there are production workloads that need to be reliable and a team that needs the tools and practices to keep them that way. This is a good fit for:

  • Teams running real production systems on GCP — with real users, real SLAs, and real consequences when things go wrong. The Enterprise Platform Modernization and SaaS & Technology Platforms pages cover the production reliability context in more detail.

  • Organisations with frequent incidents — where the same categories of production issues keep recurring and there’s no structured framework for understanding why or preventing recurrence

  • Companies scaling traffic or users — where the current informal operational practices worked at previous scale but are showing strain as the platform grows and more engineers are on-call

  • Platform teams owning shared infrastructure — where reliability of the platform directly affects the productivity and deployment velocity of every application team using it. See the Platform Engineering service for how SRE integrates with IDP design.

  • Engineering teams building operational maturity — moving from informal, reactive incident management to structured SRE practices with SLOs, error budgets, and continuous improvement cadences

This is not the right fit for simple static websites, one-off internal tools, or teams without production traffic. If you’re earlier in your GCP journey, observability is still built into every platform engagement I run through the GCP Architecture & Modernization service.

LET’S TALK

Want a GCP Platform That’s Reliable by Design — Not by Luck?

Good observability and SRE practices make production incidents shorter, less frequent, and less stressful — because the platform tells you what’s wrong and your team knows what to do. I start with a free 30-minute architecture review: an honest look at your current monitoring and operational practices, what’s creating the most risk, and what a properly instrumented GCP platform looks like for your environment. You work directly with me, Amit Malhotra, throughout — no account layer, no hand-offs.


Buoyant Cloud Inc | SRE & Observability on GCP