Advantages of a comprehensive redundancy approach in system design
In today's interconnected world, the reliability and resilience of digital systems have never been more critical. A recent example of what happens when backup systems are neglected is a government agency whose public services were unavailable for several days. This underscores the importance of system redundancy: duplicating parts of a system so that a spare can take over if something fails, with the goal of reducing or eliminating downtime during unplanned outages.
The concept of redundancy has evolved in response to the growth of distributed infrastructure, cloud services, and hybrid models. Traditional approaches, such as synchronous hardware failover, no longer match the complexity and scale of modern systems on their own. Instead, a service-centric, environment-aware approach is now considered best practice.
This modern approach focuses redundancy at the service or application level, rather than just on hardware or individual systems. It involves designing redundancy specific to the criticality and failure impact of each service, including supporting systems like source code repositories and deployment pipelines, which are often overlooked but are essential to service continuity.
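One way to make this concrete is to keep an inventory of services and their supporting systems and check it against a per-criticality policy. The sketch below is only an illustration: the `Component` type, the `MIN_ZONES` thresholds, and the component names are all hypothetical, and it checks direct dependencies only.

```python
from dataclasses import dataclass, field

# Hypothetical policy: minimum number of zones required per criticality tier.
MIN_ZONES = {"high": 2, "medium": 1, "low": 1}


@dataclass
class Component:
    name: str
    criticality: str                 # "high", "medium" or "low"
    zones: set[str]                  # zones/regions where the component runs
    depends_on: list[str] = field(default_factory=list)


def find_redundancy_gaps(components: list[Component]) -> list[str]:
    """Report components, or their direct dependencies, that fall short of policy."""
    by_name = {c.name: c for c in components}
    gaps = []
    for comp in components:
        required = MIN_ZONES[comp.criticality]
        if len(comp.zones) < required:
            gaps.append(
                f"{comp.name} runs in {len(comp.zones)} zone(s), needs {required}"
            )
        # A dependency inherits the requirement of the service that relies on it.
        for dep_name in comp.depends_on:
            dep = by_name[dep_name]
            if len(dep.zones) < required:
                gaps.append(
                    f"{comp.name} depends on {dep.name}, which runs in "
                    f"{len(dep.zones)} zone(s), needs {required}"
                )
    return gaps


inventory = [
    Component("checkout", "high", {"zone-a", "zone-b"},
              depends_on=["git-repo", "deploy-pipeline"]),
    Component("git-repo", "low", {"zone-a"}),
    Component("deploy-pipeline", "low", {"zone-a", "zone-b"}),
]

for gap in find_redundancy_gaps(inventory):
    print(gap)
```

In this made-up inventory the source code repository looks fine when judged by its own "low" rating, but it is flagged because a high-criticality service depends on it, which is the kind of overlooked supporting system described above.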
Hybrid and multi-cloud features such as elastic resource allocation and network virtualization (e.g., VLANs, VXLANs, overlay networks, SD-WAN) are also used to increase flexibility and isolate failures without duplicating hardware. Integrated platforms that unify virtual machine and container workloads, like Red Hat OpenShift Virtualization, can streamline management and improve resilience by consolidating compute, storage, and orchestration under one system.
Incorporating Zero Trust security principles and strict access policies means that every user, device, and connection is validated. Workload traffic is also segmented, which limits lateral movement and enhances resilience in hybrid environments. Edge computing can play a strategic role as well, either by providing local redundancy for tasks that must keep running offline or by making edge networks part of the wider redundancy solution.
However, investing in redundancy that outstrips the risk is wasteful, while insufficient redundancy can be considered negligent. To strike the balance, businesses should determine the right level of redundancy for their tech stack by outlining the criticality of each system and quantifying the potential impact of its failure on customers and revenue. Cross-region replication, limited multi-cloud escape hatches, and rigorous offline restore tests are recommended.
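The trade-off between over- and under-investing can be sketched numerically. The example below is a simplified model, not a sizing method from the article: it assumes replicas fail independently (which does not hold for replicas sharing a zone, region, or control plane), and the availability, revenue-loss, and cost figures are invented for illustration.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass(frozen=True)
class Service:
    name: str
    revenue_loss_per_hour: float   # estimated business impact of an outage
    replica_availability: float    # availability of one replica, between 0 and 1


def combined_availability(single: float, replicas: int) -> float:
    """Availability when any one of N independent replicas can serve traffic."""
    return 1.0 - (1.0 - single) ** replicas


def expected_annual_loss(service: Service, replicas: int) -> float:
    """Expected yearly outage cost for a given replica count."""
    unavailability = 1.0 - combined_availability(service.replica_availability, replicas)
    return unavailability * HOURS_PER_YEAR * service.revenue_loss_per_hour


def cheapest_replica_count(service: Service, replica_cost_per_year: float,
                           max_replicas: int = 5) -> int:
    """Pick the replica count that minimises running cost plus expected loss."""
    def total_cost(n: int) -> float:
        return n * replica_cost_per_year + expected_annual_loss(service, n)
    return min(range(1, max_replicas + 1), key=total_cost)


checkout = Service("checkout", revenue_loss_per_hour=10_000.0, replica_availability=0.99)
wiki = Service("internal-wiki", revenue_loss_per_hour=10.0, replica_availability=0.99)

print(cheapest_replica_count(checkout, replica_cost_per_year=20_000.0))
print(cheapest_replica_count(wiki, replica_cost_per_year=20_000.0))
```

Under these assumed numbers the revenue-critical service justifies a second replica while the internal wiki does not, which mirrors the point that redundancy should follow criticality rather than be applied uniformly.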
Netflix introduced the Chaos Monkey tool for chaos engineering, which forces systems to prove their redundancy in real time by randomly disabling production instances. This kind of testing reflects a wider shift in redundancy, away from traditional hardware failover and towards software-defined, policy-driven, and workload-specific resilience strategies that suit the complexity and scale of modern hybrid and virtualized infrastructures.
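The idea of randomly disabling instances can be illustrated with a small in-memory simulation. This is not Chaos Monkey's code or API; `ServicePool`, `chaos_round`, and `survives_chaos` are hypothetical names used only to show the principle that a service should keep answering after a random instance is removed.

```python
import random


class ServicePool:
    """Toy stand-in for a load-balanced group of instances."""

    def __init__(self, instance_ids):
        self.healthy = set(instance_ids)

    def terminate(self, instance_id):
        self.healthy.discard(instance_id)

    def handle_request(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy instances left")
        # Deterministic choice keeps the sketch easy to reason about.
        return f"served by {min(self.healthy)}"


def chaos_round(pool: ServicePool, rng: random.Random) -> str:
    """Terminate one randomly chosen healthy instance and return its id."""
    victim = rng.choice(sorted(pool.healthy))
    pool.terminate(victim)
    return victim


def survives_chaos(instance_ids, kills: int, seed: int = 0) -> bool:
    """Return True if the pool still serves requests after each random kill."""
    rng = random.Random(seed)
    pool = ServicePool(instance_ids)
    for _ in range(kills):
        if not pool.healthy:
            return False
        chaos_round(pool, rng)
        try:
            pool.handle_request()
        except RuntimeError:
            return False
    return True


print(survives_chaos(["i-1", "i-2", "i-3"], kills=2))
print(survives_chaos(["i-1"], kills=1))
```

A real experiment would target live infrastructure with safeguards and monitoring; the simulation only shows the shape of the check, namely that redundancy is demonstrated by surviving the failure rather than assumed from the architecture diagram.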
As IT leaders shift to hybrid and cloud infrastructure, virtualization, and containerization, they must change their approach to redundancy. Simply duplicating servers or physical components is no longer enough. The key to success is defining and automating redundancy at the service level while keeping visibility and control over every layer, from network to application.
In a cloud environment, IT leaders must manage redundancy per service or product rather than per server. Facebook's October 2021 outage is a case in point: a faulty configuration change led to the withdrawal of the BGP routes to the company's DNS servers, which left its services unreachable and highlighted the need for local routing backups and internal DNS redundancy. Cloud platforms provide availability zones, but zones do not guarantee immunity against regional control-plane outages or long-haul fiber cuts.
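Name-resolution redundancy can be sketched as an ordered chain of resolvers with a local table as the last resort. The example is a minimal illustration under stated assumptions: the resolver functions are injected stand-ins rather than real DNS clients, and the hostname and address (taken from the documentation range 192.0.2.0/24) are invented.

```python
from typing import Callable, Sequence

Resolver = Callable[[str], str]


class ResolutionError(Exception):
    """Raised when every configured resolver has failed."""


def resolve_with_fallback(hostname: str, resolvers: Sequence[Resolver]) -> str:
    """Try each resolver in order and return the first successful answer."""
    failures = []
    for resolver in resolvers:
        try:
            return resolver(hostname)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures.append(exc)
    raise ResolutionError(
        f"all {len(failures)} resolver(s) failed for {hostname}"
    )


def make_static_resolver(table: dict[str, str]) -> Resolver:
    """Last-resort resolver backed by a locally stored table."""
    def resolve(hostname: str) -> str:
        try:
            return table[hostname]
        except KeyError:
            raise OSError(f"{hostname} not in local table") from None
    return resolve


def unreachable_primary(hostname: str) -> str:
    # Simulates internal DNS being cut off by a routing failure.
    raise OSError("primary DNS unreachable")


local_backup = make_static_resolver({"auth.internal.example": "192.0.2.10"})

print(resolve_with_fallback("auth.internal.example",
                            [unreachable_primary, local_backup]))
```

In practice one of the resolvers could wrap `socket.gethostbyname`, and the local table would need its own update process so that it does not go stale; both are left out to keep the sketch free of network access.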
Organizations often overlook the importance of validating behaviour in cloud-native failovers. A manufacturing firm suffered significant revenue loss during an extended outage because it relied on redundancy within a single cloud provider. Effective redundancy should focus not only on hardware, but also on business-critical dependencies across availability zones, clouds, and even the edge.
According to Chris Astley, head of cloud at KPMG UK, redundancy means that if one component fails, another can come in to keep the system working. Robust system redundancy can save time and money during outages, but it is not a silver bullet: IT leaders must test their solutions regularly to ensure they remain fit for purpose and continue to deliver reliability and resilience.
- A service-centric, environment-aware approach to infrastructure redundancy is now paramount, as the government agency's multi-day service outage shows.
- Modern redundancy is designed around the criticality and failure impact of each service, including supporting systems such as source code repositories and deployment pipelines, and draws on hybrid and multi-cloud features like elastic resource allocation and network virtualization.
- Zero Trust security principles and service-level automation of redundancy help maintain system reliability and resilience, especially in hybrid and virtualized infrastructures, and should be tested regularly to ensure they are fit for purpose.