Advantages of Implementing a Comprehensive Redundancy Plan

Businesses increasingly rely on distributed infrastructure, cloud services, and hybrid models to run their operations. That shift raises the stakes for system redundancy, which keeps the business running and prevents costly downtime.

Redundancy: A Necessary Evolution

Redundancy, in essence, means keeping duplicate components within a system so that one can take over when another fails. Traditionally, redundancy focused on individual systems or servers. The concept has since evolved into service-level redundancy, which also covers every supporting system a service depends on, such as deployment pipelines and repositories.
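
As a rough illustration of a component "taking over" after a failure, the sketch below tries a list of redundant replicas in order and returns the first successful answer. The names call_with_failover, primary, and standby are hypothetical and exist only for this example; a production system would typically add health checks, timeouts, and retries.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica fails to serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed: {errors}")


# Hypothetical replicas: the primary is down, the standby is healthy.
def primary() -> str:
    raise ConnectionError("primary is down")


def standby() -> str:
    return "served by standby"


result = call_with_failover([primary, standby])
print(result)
```

The caller never needs to know which replica answered, which is the property redundancy is meant to provide.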

The Importance of Redundancy

The purpose of redundancy is to reduce downtime and, in some cases, prevent it entirely. A clear benefit of robust system redundancy is the time and money saved when an outage occurs. For instance, Facebook's October 2021 outage began when a faulty configuration change caused its BGP routes to be withdrawn, which left its DNS servers unreachable. The incident shows that even hyperscalers can fail without local routing backups or internal DNS redundancy.

Best Practices for Redundancy in Hybrid and Edge Cloud Environments

Plan Redundancy & Disaster Recovery: Design hybrid cloud systems with built-in redundancy using tools like Azure Site Recovery and backup services to maintain business continuity during failures.
Secure Connectivity & Identity Management: Use VPN, private communication channels, and unified identity platforms with multi-factor authentication to secure redundant pathways in hybrid environments.
Consistent Governance & Monitoring: Apply unified governance policies across environments and use monitoring tools to detect issues proactively.
Service-Specific Redundancy: Shift redundancy focus from just systems/servers to service-level redundancy, accounting for all supporting systems.
Edge Computing Redundancy: Implement local redundancy at the edge for offline capabilities or leverage the distributed edge network to maintain continuity.
Resource Optimization & Automation: Right-size resources, use auto-scaling, load balancing, and automate monitoring and alerts to maintain efficient and fault-tolerant operations.

Assessing the Appropriate Redundancy Level

To determine the right level of redundancy, businesses should:

Define Business Impact: Evaluate impact and tolerance to downtime for each workload or service to prioritize redundancy efforts.
Analyze Workload Characteristics: Consider data sensitivity, integration complexity, and performance requirements to choose redundancy methods.
Balance Costs vs. Availability: Use cost management tools to find the redundancy level that meets availability targets without waste.
Tailor to Hybrid and Edge Needs: Because hybrid and edge environments vary widely, redundancy should be customized to local constraints, connectivity, and operational needs.

In a cloud environment, redundancy is managed per service or product rather than per server, which means the supporting systems behind each service need the same level of redundancy as the service itself. Attack surface heat maps and downtime cost modeling can help businesses choose an appropriate redundancy level without overengineering or overspending.
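
Downtime cost modeling can be as simple as comparing the expected cost of outages against the cost of extra replicas. The sketch below is one minimal way to do that, under a strong assumption: replicas fail independently, so n replicas that are each available a fraction a of the time give a combined availability of 1 - (1 - a)^n. The function names and all figures are made up for illustration; real replicas often share failure causes, which makes this estimate optimistic.

```python
HOURS_PER_YEAR = 8760


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` copies, assuming independent failures."""
    return 1 - (1 - single) ** replicas


def expected_annual_cost(single: float, replicas: int,
                         downtime_cost_per_hour: float,
                         replica_cost_per_year: float) -> float:
    """Expected yearly downtime cost plus the yearly cost of running the replicas."""
    downtime_hours = (1 - combined_availability(single, replicas)) * HOURS_PER_YEAR
    return downtime_hours * downtime_cost_per_hour + replicas * replica_cost_per_year


def cheapest_replica_count(single: float, downtime_cost_per_hour: float,
                           replica_cost_per_year: float, max_replicas: int = 5) -> int:
    """Replica count between 1 and max_replicas with the lowest expected cost."""
    return min(
        range(1, max_replicas + 1),
        key=lambda n: expected_annual_cost(
            single, n, downtime_cost_per_hour, replica_cost_per_year),
    )


# Illustrative figures only: 99% availability per replica,
# 10,000 per hour of downtime, 20,000 per replica per year.
best = cheapest_replica_count(0.99, 10_000, 20_000)
print(best)
```

With these illustrative figures, two replicas come out cheapest: a third replica costs more to run than the small amount of downtime it removes, which is the overengineering the model is meant to catch.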

Netflix's Approach to Redundancy: Chaos Engineering

Netflix, a pioneer in cloud services, introduced the Chaos Monkey tool for chaos engineering. It randomly disables production instances to verify that redundancy actually works under real conditions. This proactive approach helps businesses find weaknesses before primary and backup systems fail together, a scenario that could leave public services unavailable for several days.
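
The idea behind this kind of experiment can be sketched in a few lines. The toy simulation below is not Netflix's Chaos Monkey and does not use its API; the Instance, Service, and chaos_round names are hypothetical. It disables one randomly chosen instance and then checks that the service still answers from a surviving one.

```python
import random


class Instance:
    """A toy service instance that can be switched off."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Service:
    """Routes each request to the first healthy instance."""

    def __init__(self, instances: list[Instance]) -> None:
        self.instances = instances

    def handle(self, request: str) -> str:
        for instance in self.instances:
            if instance.alive:
                return instance.handle(request)
        raise RuntimeError("no healthy instances left")


def chaos_round(service: Service, rng: random.Random) -> Instance:
    """Disable one randomly chosen healthy instance and return it."""
    victim = rng.choice([i for i in service.instances if i.alive])
    victim.alive = False
    return victim


rng = random.Random(42)  # seeded so the experiment is repeatable
service = Service([Instance("a"), Instance("b"), Instance("c")])
victim = chaos_round(service, rng)
response = service.handle("GET /health")
print(f"disabled {victim.name}; {response}")
```

If the final request raised an error instead of returning a response, the experiment would have exposed a redundancy gap before a real outage did.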

Conclusion

In summary, implementing redundancy across hybrid cloud and edge environments demands a service-oriented, multi-layered approach in which security, monitoring, and automation operate within a governance framework. Decisions about the level of redundancy should be driven by business impact, workload needs, and cost, so that resilience does not come with unnecessary overhead.

In a cloud environment, it is crucial to ensure not only the redundancy of individual systems or servers but also the service-level redundancy of supporting systems, such as deployment pipelines and repositories.
To prevent outages and the costs they bring, businesses should build in robust redundancy, including local routing backups, internal DNS redundancy, and redundancy at the edge, and support it with disaster recovery tooling, multi-factor authentication for secure connectivity, unified governance policies, and monitoring that detects issues early.