By Joleen Engela, Customer Success Manager
If your risk program still stumbles despite “textbook” frameworks, you’re not alone. In 2025, most firms I work with aren’t suffering from a lack of standards (ISO 31000, COSO, NIST, you’ve got them). The friction lives elsewhere, in the architecture: how data moves, how services depend on suppliers, how alerts roll up, and how people make decisions inside tightly coupled systems. Frameworks describe what good looks like; architecture determines whether you can actually do it.
Below is a practical field guide to diagnosing risk as a system problem and redesigning your information, process, and vendor architecture so your chosen frameworks finally deliver.
1) Risk is emergent, not modular, so treat it like a system
Modern enterprises are complex, tightly coupled systems. Small changes propagate in non-linear ways, which is why incidents rarely look like your playbooks. Safety scientists and systems engineers have said this for decades: complex systems fail in surprising ways (Cook), accidents often emerge from interactions rather than single “root causes,” and resilient organizations anticipate, monitor, respond, and learn continuously. If you apply a neat framework to a messy system without changing the underlying architecture, you get brittle compliance, not resilience.
What this means: The goal isn’t just “more controls.” It’s structural: map dependencies, reduce single points of failure, improve observability, and close the loop from signals → decisions → outcomes.
2) Your frameworks aren’t the problem, your execution surface is
ISO 31000 tells you risk management should be integrated, structured, and customized. COSO reframes risk as part of strategy and performance. Those are good north stars. But neither ISO nor COSO can make incomplete data, siloed tooling, or fragile vendor chains behave. That’s on your execution surface—the architecture that turns policy into behavior. (CISA)
Four architectural layers where “good frameworks” go to die
- Data & reporting architecture
If your risk registers, control libraries, incidents, test results, and vendor records live in spreadsheets and disconnected point tools, you can’t aggregate or escalate fast enough. Banking supervisors learned this the hard way, codifying BCBS 239 to force better risk data aggregation and reporting after the financial crisis. Even a decade later, many banks still struggle to comply, precisely because their data and IT architectures are not fit for purpose. - Process & control architecture
Paper-perfect controls fail when they’re not embedded in live workflows. NIST’s systems security and cyber-resilience guidance is explicit: resilience must be designed into systems (identity, configuration, monitoring, recovery), not added as an afterthought. - Service & technology architecture
Operational resilience policies in the UK (FCA/PRA) push firms to define important business services, set impact tolerances, and then prove in testing that they can stay within those tolerances. If your service architecture and run-books don’t align to that view, you can’t evidence resilience “in action.” - Vendor & ecosystem architecture
DORA and critical third-party regimes exist because third-party concentration and digital supply chain risk are structural. If your vendor architecture assumes a stable, benign supply chain, your risk program will always be surprised.
3) “But we followed the framework” vs. “Why it still failed”: systemic patterns
a) Data that won’t join up
When risk, compliance, security, continuity, and third-party teams each maintain their own lists, you get reporting by copy-paste. BCBS 239 calls for completeness, accuracy, and timeliness, none of which are possible on islands of spreadsheets. Supervisors still report gaps across the industry, underlining that the problem is architectural, not conceptual. (Deloitte)
b) Hidden interdependencies
The 2024 CrowdStrike endpoint update incident that triggered widespread Windows crashes is a textbook example of ecosystem coupling: a single vendor misstep propagates globally in minutes. Organizations without accurate dependency maps and fail-safes found themselves grounded. That’s not a “control testing” failure; it’s an architecture failure.
c) Software supply chain fragility
From SolarWinds to Log4j, third-party components sit inside your estate, often invisibly. CISA and ENISA have both urged better software bills of materials (SBOMs), supplier oversight, and continuous monitoring to cope with this reality. Frameworks alone don’t discover embedded risk; system visibility does.
d) Cloud concentration and critical third parties
Financial regulators worry about systemic crises if one of a small set of infrastructure providers hiccups. The UK authorities have proposed designations and standards for critical third parties; ENISA warns about cloud concentration risk for Europe. If your architecture assumes “the cloud never fails,” you’re designing for wishful thinking.
e) Testing that proves plans, not capability
UK rules require firms to demonstrate they can stay within impact tolerances for their important services. If your testing only validates whether documents exist, you’ll pass a paper audit but fail in production. Scenario-based, cross-functional exercises, run on real systems with real vendors, reveal the truth.
4) How to re-architect for risk: a practical blueprint
A. Build a service-centric map of your enterprise
- Identify important business services (not just applications). For each service, map customers, outcomes, dependencies, RTO/RPO, and third-party links. Tie risks and controls to services, not just departments. This aligns how you operate with how the FCA/PRA expect you to evidence resilience.
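To make the service-centric view concrete, here is a minimal sketch (in Python) of what one register entry could hold. The class and field names (impact_tolerance_mins, vendors, and so on) are illustrative assumptions, not a prescribed schema; the point is that risks, controls, and vendors hang off the service, not the org chart.

```python
# Minimal sketch of a service-centric register entry (illustrative field names,
# not a prescribed schema). Risks, controls, and vendors attach to the service.
from dataclasses import dataclass, field

@dataclass
class BusinessService:
    name: str                    # e.g. "Retail payments"
    owner: str                   # an accountable individual, not a team alias
    impact_tolerance_mins: int   # maximum tolerable disruption for this service
    rto_mins: int                # recovery time objective
    rpo_mins: int                # recovery point objective
    dependencies: list[str] = field(default_factory=list)  # upstream platforms (identity, messaging, data stores)
    vendors: list[str] = field(default_factory=list)        # third parties the service cannot run without
    risks: list[str] = field(default_factory=list)          # risk IDs tied to this service
    controls: list[str] = field(default_factory=list)       # control IDs tied to this service

payments = BusinessService(
    name="Retail payments",
    owner="Head of Payments Operations",
    impact_tolerance_mins=120,
    rto_mins=60,
    rpo_mins=15,
    dependencies=["identity-provider", "core-banking-db"],
    vendors=["AcmeCloud", "PayRail Ltd"],
    risks=["RSK-014"],
    controls=["CTL-203", "CTL-311"],
)
```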
B. Fix your risk data architecture (or your reports will keep lying)
- Adopt BCBS 239 principles beyond banking: complete, accurate, timely data with unified taxonomies for risks, controls, obligations, and incidents. Treat data lineage and quality checks as controls. If you can’t answer “what changed, where, and why?” in seconds, you don’t have audit-ready architecture.
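As a sketch of what “lineage as a control” can look like, the snippet below keeps an append-only change log next to each risk record so “what changed, where, and why?” becomes a lookup rather than an investigation. The field names and the in-memory list are assumptions for illustration; in practice this sits inside whatever platform holds your register.

```python
# Hedged sketch: an append-only change log on risk records. Never update history
# in place; every change is a new event with who, what, and why.
import datetime

def record_change(change_log: list[dict], record_id: str, field_name: str,
                  old_value, new_value, changed_by: str, reason: str) -> None:
    """Append an immutable change event for one record."""
    change_log.append({
        "record_id": record_id,
        "field": field_name,
        "old": old_value,
        "new": new_value,
        "changed_by": changed_by,
        "reason": reason,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

def history(change_log: list[dict], record_id: str) -> list[dict]:
    """Answer 'what changed, where, and why?' for one record."""
    return [event for event in change_log if event["record_id"] == record_id]

log: list[dict] = []
record_change(log, "RSK-014", "residual_rating", "medium", "high",
              changed_by="j.engela", reason="Vendor concentration identified in Q3 review")
print(history(log, "RSK-014"))
```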
C. Invest in observability, because you can’t manage what you can’t see
- Traditional monitoring often fails in distributed, hybrid, and AI-accelerated stacks. Observability provides end-to-end insight and anomaly detection across cloud/edge/legacy, shrinking mean time to detect and reducing cascade risk.
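A toy example of the idea, assuming a single “golden signal” (latency): baseline recent readings and flag a reading that deviates sharply. Real observability stacks combine metrics, traces, and logs and do far more, but the loop is the same: baseline, detect deviation early, act before the cascade.

```python
# Illustrative only: a naive anomaly check on one latency signal.
import statistics

def latency_anomaly(samples_ms: list[float], latest_ms: float, sigmas: float = 3.0) -> bool:
    """Flag the latest reading if it sits more than `sigmas` standard
    deviations above the recent baseline."""
    baseline = statistics.mean(samples_ms)
    spread = statistics.pstdev(samples_ms) or 1.0  # avoid zero spread on flat baselines
    return latest_ms > baseline + sigmas * spread

recent = [110, 118, 105, 121, 115, 109, 117]
print(latency_anomaly(recent, 180.0))  # True: degradation worth investigating early
```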
D. Treat third-party risk as part of the system, not a questionnaire
- Under DORA, ICT third-party risk and incident reporting are not optional. Extend your service maps to vendors: tier by criticality, define failure playbooks, test switch-overs, and require SBOMs for critical software. Don’t wait for annual attestations to discover exposure.
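Here is a hedged sketch of two pieces of that: tiering vendors by criticality (and flagging critical vendors without a tested failover), and checking an SBOM’s component list against a watchlist. The vendor names, tiers, and watchlist entries are invented for illustration; a real program would work from CycloneDX or SPDX SBOMs and live vulnerability feeds.

```python
# Hedged sketch: tiered vendor records plus a naive SBOM watchlist check.
VENDORS = {
    "AcmeCloud":   {"tier": 1, "services": ["Retail payments"], "failover_tested": True},
    "PayRail Ltd": {"tier": 1, "services": ["Retail payments"], "failover_tested": False},
    "MailBlast":   {"tier": 3, "services": ["Marketing"],       "failover_tested": False},
}

# Illustrative known-vulnerable component (name, version)
WATCHLIST = {("log4j-core", "2.14.1")}

def untested_critical_vendors(vendors: dict) -> list[str]:
    """Tier-1 vendors whose failover path has never been exercised."""
    return [name for name, v in vendors.items() if v["tier"] == 1 and not v["failover_tested"]]

def exposed_components(sbom: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Components in the SBOM that match the watchlist."""
    return [component for component in sbom if component in WATCHLIST]

print(untested_critical_vendors(VENDORS))                                        # ['PayRail Ltd']
print(exposed_components([("log4j-core", "2.14.1"), ("requests", "2.32.0")]))    # [('log4j-core', '2.14.1')]
```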
E. Engineer for failure (on purpose)
- Borrow from Site Reliability Engineering and chaos engineering: set SLOs and error budgets for important services, inject failures in a controlled way, and learn before production does. This is how you move from “checklist confidence” to observed capability.
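The arithmetic behind error budgets fits in a few lines. The sketch below assumes a 99.9% availability SLO over a 30-day window, both illustrative numbers; the useful part is the conversation it forces when the remaining budget runs low.

```python
# Minimal error-budget arithmetic for one availability SLO (illustrative numbers).
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in the window for an availability SLO such as 0.999."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_so_far_min: float, window_days: int = 30) -> float:
    """What is left of the budget after downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_so_far_min

print(error_budget_minutes(0.999))    # ~43.2 minutes per 30 days
print(budget_remaining(0.999, 30.0))  # ~13.2 minutes left: time to slow risky changes
```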
F. Bake resilience into system design, not just policy
- Use NIST SP 800-160 and NIST cyber-resilience guidance to specify design controls (identity, configuration baselines, segmentation, backup/restore, recovery orchestration) as architectural requirements. Then verify them continuously.
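Continuous verification can start small. The sketch below treats a configuration baseline as data and reports drift against it; the baseline keys (MFA, TLS floor, backup retention) are illustrative assumptions, not a NIST-mandated list.

```python
# Hedged sketch: a configuration baseline expressed as data, checked continuously.
BASELINE = {
    "mfa_required": True,          # illustrative design requirements,
    "tls_min_version": "1.2",      # not an authoritative control set
    "backup_retention_days": 35,
}

def drift(observed: dict) -> dict:
    """Return every setting that deviates from the approved baseline."""
    return {key: {"expected": expected, "observed": observed.get(key)}
            for key, expected in BASELINE.items() if observed.get(key) != expected}

observed_config = {"mfa_required": True, "tls_min_version": "1.0", "backup_retention_days": 35}
print(drift(observed_config))  # flags the weakened TLS setting for remediation
```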
5) A 90/180-day action plan to turn frameworks into outcomes
First 90 days — Stabilize the foundations
- Name the services: Publish a first cut of important business services with owners and impact tolerances; align with board-approved definitions.
- Unify the language: Adopt a single taxonomy for risk, controls, obligations, and incidents; de-duplicate registers; and implement change logs. BCBS 239-style data discipline starts here.
- Map critical dependencies: For each service, record first-order vendors and upstream platforms (identity, messaging, data stores). Identify obvious single points of failure and “one-vendor” concentrations.
- Stand up observability for the crown jewels: Instrument your top services for end-to-end visibility, latency, errors, saturation, and dependency health, so you can actually see degradation early.
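To illustrate the dependency-mapping step above: once you have even a first-cut map of services to vendors, single points of failure and “one-vendor” concentrations fall out of a few lines of code. The data here is invented for the sketch.

```python
# Sketch: spotting "one-vendor" concentrations from a first-cut service -> vendor map.
from collections import defaultdict

SERVICE_VENDORS = {
    "Retail payments":     ["AcmeCloud", "PayRail Ltd"],
    "Customer onboarding": ["AcmeCloud"],
    "Statements":          ["AcmeCloud"],
}

def vendor_concentration(service_vendors: dict) -> dict:
    """Count how many important services lean on each vendor."""
    counts = defaultdict(int)
    for vendors in service_vendors.values():
        for vendor in set(vendors):
            counts[vendor] += 1
    return dict(counts)

def single_points_of_failure(service_vendors: dict) -> list[str]:
    """Services that depend on exactly one vendor: obvious SPOF candidates."""
    return [service for service, vendors in service_vendors.items() if len(set(vendors)) == 1]

print(vendor_concentration(SERVICE_VENDORS))     # {'AcmeCloud': 3, 'PayRail Ltd': 1}
print(single_points_of_failure(SERVICE_VENDORS)) # ['Customer onboarding', 'Statements']
```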
Next 90–180 days — Prove capability
- Run live, cross-functional scenario tests: Validate at least two plausible scenarios (e.g., critical vendor outage; credential provider failure). Exercise playbooks, communications, and fallback paths. Gather evidence that ties to impact tolerances.
- Close the software supply-chain gap: Require SBOMs for critical components, subscribe to vulnerability feeds, and test your patch/feature-flag strategy on non-prod first. (Log4j showed how “invisible” components bite.)
- Operationalize DORA-aligned oversight: For in-scope EU entities, implement ICT incident classification, reporting workflows, and third-party monitoring aligned to DORA timelines.
- Publish board-grade metrics: Use service-centric KPIs/KRIs such as time to detect, time to recover, % of services inside tolerance, % of critical vendors with tested failover, and risk data freshness (a BCBS 239 proxy).
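Two of those KPIs, sketched with invented inputs, to show that once the service and vendor maps exist, the board metrics are a query rather than a quarterly scramble.

```python
# Hedged sketch: board-grade percentages computed straight from the service and vendor maps.
SERVICES = [
    {"name": "Retail payments",     "inside_tolerance": True},
    {"name": "Customer onboarding", "inside_tolerance": False},
    {"name": "Statements",          "inside_tolerance": True},
]
CRITICAL_VENDORS = [
    {"name": "AcmeCloud",   "failover_tested": True},
    {"name": "PayRail Ltd", "failover_tested": False},
]

def pct(items: list[dict], flag: str) -> float:
    """Share of items (as a percentage) where the given flag is true."""
    return 100.0 * sum(1 for item in items if item[flag]) / len(items)

print(f"Services inside tolerance: {pct(SERVICES, 'inside_tolerance'):.0f}%")                      # 67%
print(f"Critical vendors with tested failover: {pct(CRITICAL_VENDORS, 'failover_tested'):.0f}%")   # 50%
```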
6) Architecture patterns that actually lower systemic risk
Pattern 1: Service-first governance
Make services, not org charts, the unit of accountability. Each service has an owner, risk profile, controls, vendors, SLOs, and test schedule. This mirrors how supervisors (FCA/PRA) and SRE disciplines think about reliability.
Pattern 2: Two-plane design, control & execution
Separate the control plane (policy, approvals, monitoring, automation) from the execution plane (apps, data paths). Use the control plane to enforce guardrails (identity, config drift, deployment policies) and to auto-collect evidence for audits. NIST advocates this kind of engineered assurance.
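As an illustration of a control-plane guardrail that collects its own evidence, the sketch below gates a deployment request against a toy policy and records the decision. The policy fields and the in-memory evidence list are assumptions; a real implementation would live in your CI/CD and GRC tooling.

```python
# Illustrative control-plane guardrail: the decision itself becomes audit evidence.
import datetime
import json

POLICY = {"require_change_ticket": True, "block_friday_deploys": True}  # toy policy
EVIDENCE: list[str] = []                                                # stand-in for an evidence store

def gate_deployment(request: dict) -> bool:
    """Apply guardrails in the control plane and record the outcome."""
    approved = True
    if POLICY["require_change_ticket"] and not request.get("change_ticket"):
        approved = False
    if POLICY["block_friday_deploys"] and request.get("weekday") == "Friday":
        approved = False
    EVIDENCE.append(json.dumps({
        "request": request,
        "approved": approved,
        "checked_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }))
    return approved

print(gate_deployment({"service": "Retail payments", "change_ticket": None, "weekday": "Friday"}))  # False
```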
Pattern 3: Diversity & graceful degradation
Avoid monocultures: two DNS providers, multi-region data replicas, and tested read-only service modes. ENISA and the UK CTP work highlight why concentration risk must be addressed structurally, not with paper mitigations.
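A toy version of graceful degradation, with hypothetical provider functions: try the primary, fall back to the secondary, and if both fail, serve a last-known-good value in read-only mode rather than going dark.

```python
# Sketch of graceful degradation: fail over across providers, then degrade to read-only.
def cached_balance(account: str) -> float:
    """Stand-in for a replicated, possibly stale, last-known-good cache."""
    return 1234.56

def fetch_balance(account: str, providers: list) -> dict:
    for provider in providers:
        try:
            return {"mode": "live", "balance": provider(account)}
        except Exception:
            continue  # a real system would also log the failure before trying the next provider
    return {"mode": "read-only", "balance": cached_balance(account)}

def primary(account: str) -> float:
    raise ConnectionError("primary region unavailable")      # simulated outage

def secondary(account: str) -> float:
    raise ConnectionError("secondary also degraded")          # simulated outage

print(fetch_balance("ACC-001", [primary, secondary]))  # {'mode': 'read-only', 'balance': 1234.56}
```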
Pattern 4: Continuous learning loops
Adopt Hollnagel’s four potentials: anticipate (horizon scanning), monitor (observability), respond (playbooks & drills), learn (post-incident reviews). This turns incidents into capability improvements rather than blame cycles.
7) What “good” looks like: evidence regulators and auditors actually trust
- Timely, explainable risk data: A single view of risks/controls with lineage and change history (BCBS 239).
- Service-mapped resilience: Important services with impact tolerances, tested scenarios, and documented results (FCA/PRA).
- Third-party oversight in motion: Continuous monitoring, rapid classification/reporting, and tested contingency paths (DORA).
- Engineered resilience: NIST-aligned security/resilience controls embedded in system design, validated by observability and drills (NIST SP 800-160 / cyber-resilience).
8) A final word on “framework vs. architecture”
If you feel like you’re “doing the right things” and still losing sleep, you probably are, because the problem isn’t your checklist. It’s that risk is a system property, and systems answer only to architecture. The good news: when you shift from policy-first to design-first, your frameworks stop feeling like friction and start surfacing as evidence of competence.
Or as Richard Cook put it: in complex systems, failure is the normal state; the discipline is designing so that the inevitable doesn’t become existential.
Ready to make your frameworks work because your architecture does? CLDigital 360 helps teams map important services, unify risk data, monitor critical vendors, and prove capability with real-world testing and auditable evidence, so regulators, auditors, and customers see resilience in action.
See how a system-first architecture turns risk into reliability. Request a personalized demo.