Systems In The Wild

Architecture observations on complex distributed systems.
SPOFs in Modern Cloud-Native Architectures

The SPOFs You Did Not Design

Single points of failure are one of the oldest concepts in systems engineering. They are also one of the most misunderstood in modern architectures. Cloud-native platforms were designed to eliminate them. Redundancy, replication, distribution across zones and regions. The assumption is that if no single component is irreplaceable, the system has no SPOF. That assumption is structurally incomplete. What changed is not the presence of single points of failure. What changed is where they live, how they manifest, and why they remain invisible until an incident exposes them. ...

May 4, 2026 · 9 min · 1870 words · Andre Rocha
Cost Optimization vs Risk Concentration

Cost Optimization vs Risk Concentration in Hosted Control Planes

Hosted control planes are presented as a cost optimization strategy. They are also a risk consolidation strategy. The industry treats these as separate conversations. One belongs to FinOps reports. The other belongs to architecture reviews. ...

May 1, 2026 · 7 min · 1484 words · Andre Rocha

What Breaks When: An Interactive Cluster Failure Explorer

Let me show you something. Architecture diagrams are static. They show components, boundaries, and arrows. What they do not show is what happens when one of those components fails. This one does. I laid out five OpenShift cluster patterns side by side. None of them are hypothetical; I pulled each from real production environments: single cluster with multiple node pools, hosted control planes, ACM-federated fleets, air-gapped stacks, and isolated compliance zones. Click any component and watch what propagates. The red pulse is the blast radius: everything impacted, directly or indirectly, when that component fails (FN-0002). ...

April 18, 2026 · 3 min · 555 words · Andre Rocha
Hidden Reliability Risks

The Hidden Reliability Risks in Multi-Cluster Kubernetes

Multi-cluster Kubernetes is often introduced as a solution to failure. In practice, it does something more subtle. It changes the shape of failure. Failures do not disappear. They stop being local, predictable, and contained. They become distributed, indirect, and delayed. The most dangerous part is not the failure itself. These failure modes share a pattern: they rarely appear in architecture diagrams, do not violate best practices, and only become visible under specific lifecycle events. ...

April 6, 2026 · 6 min · 1170 words · Andre Rocha
Cloud-Native Fragility

Cloud-Native, Same Old Fragility

Modern systems are distributed. But fragility didn’t disappear. It just became harder to see. They run across clusters, regions, providers. They are observable, containerized, orchestrated. ...

March 23, 2026 · 3 min · 549 words · Andre Rocha
Platform Governance

Translating OpenShift Health into Business Risk

The gap no one owns. Most OpenShift environments can report their health status with precision. Very few can report their risk position with confidence. Clusters expose thousands of signals: node conditions, operator status, etcd latency, certificate countdowns… The data exists. What rarely exists is a structured translation layer between platform health and business risk. ...

March 4, 2026 · 10 min · 1980 words · Andre Rocha
Platform Governance

Why Most OpenShift DR Strategies Fail at Executive Level

Most enterprise OpenShift disaster recovery strategies are designed to satisfy audits, not to survive real incidents. They describe recovery procedures, declare RPO and RTO targets, and satisfy audit checklists. What they rarely do is demonstrate recovery capability under realistic conditions. This distinction matters more than it appears. Having a D.R. plan and having D.R. capability are fundamentally different things. The first is a document. The second is a measurable organizational competence that requires investment, testing, and continuous validation. ...

March 2, 2026 · 10 min · 1951 words · Andre Rocha
Platform Governance

Platform Governance as a Control System in Multi-Cluster Kubernetes

Does it really matter? Let’s explore five items and try to answer that question. 1. Multi Clusters. Organizations operating multi-cluster Kubernetes fleets face a structural risk that is rarely discussed in architectural reviews: governance gaps that remain invisible until an audit fails or an incident escalates. The cost is measurable. Undetected configuration drift increases incident blast radius. Inconsistent RBAC baselines extend audit preparation from days to weeks. Clusters onboarded without active policy enforcement create compliance blind spots that accumulate silently. ...

February 26, 2026 · 5 min · 1036 words · Andre Rocha