Solutions Architect's Handbook

Disaster recovery and business continuity

In the previous section, you learned about using high availability and fault tolerance to handle application uptime. However, the entire region where your data center is located may go down because of a massive power grid outage, an earthquake, or a flood, and your global business must still continue running. For such situations, you need a disaster recovery plan that ensures business continuity by preparing sufficient IT resources in an entirely different region, possibly in a different country or continent.

When planning disaster recovery, a solution architect must understand the organization's Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is how much downtime a business can sustain without significant impact. RPO is how much data loss, measured as a window of time before the disaster, a business can tolerate. Lower RTO and RPO targets cost more, so it is essential to understand whether the business is mission-critical and needs minimal RTO and RPO.
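To make the two objectives concrete, the following minimal Python sketch checks a single incident against them. The names `RecoveryObjectives` and `evaluate_incident`, and the timestamps in the usage note, are hypothetical and chosen only for illustration; the sketch assumes that data loss is measured as the time between the last usable recovery point (backup or replica) and the start of the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business (hypothetical container type)."""
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data-loss window


def evaluate_incident(objectives: RecoveryObjectives,
                      last_recovery_point: datetime,
                      outage_start: datetime,
                      service_restored: datetime) -> dict:
    """Compare the measured downtime and data-loss window with the targets."""
    # Downtime: how long the service was unavailable.
    downtime = service_restored - outage_start
    # Data-loss window: changes made after the last recovery point are lost.
    data_loss_window = outage_start - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }
```

For example, with an RTO of 4 hours and an RPO of 1 hour, an outage that starts at 10:00, follows a backup taken at 09:30, and ends at 13:00 gives 3 hours of downtime and a 30-minute data-loss window, so both objectives are met.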

The following architecture diagram shows a multi-site disaster recovery architecture where the primary data center is in Ireland, Europe, and the disaster recovery site is in Virginia, USA, hosted on the AWS public cloud. In this case, your business will continue operating even if something happens to the entire European region or to the public cloud. Because the disaster recovery plan is multi-site, it achieves minimal RTO and RPO, which means little to no outage and data loss:

Hybrid multi-site disaster recovery architecture

The following are the most common disaster recovery plans, all of which you will learn about in Chapter 12, DevOps and Solution Architecture Framework:

  • Backup and Store: This plan is the least costly and has the highest RTO and RPO. In this plan, all server machine images and database snapshots are stored in the disaster recovery site. In the event of a disaster, the team restores the environment in the disaster recovery site from these backups.
  • Pilot Light: In this plan, all server machine images are stored as a backup, and a small database server is maintained in the disaster recovery site with continual data sync from the main site. Other critical services, such as Active Directory, may run on small instances. In the event of a disaster, the team brings up the servers from the machine images and scales up the database. Pilot Light is a bit more costly than Backup and Store but has lower RTO and RPO.
  • Warm Standby: In this plan, all the application servers and the database server run at lower capacity in the disaster recovery site and continuously sync with the main site. In the event of a disaster, the team scales up all the servers and databases. Warm Standby is costlier than Pilot Light but has lower RTO and RPO.
  • Multi-site: This plan is the most expensive and has near-zero RTO and RPO. In this plan, a replica of the main site is maintained in the disaster recovery site with equal capacity and actively serves user traffic. In the event of a disaster, all traffic is routed to the alternate location.
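The trade-off in the list above (each plan costs more than the previous one and lowers RTO and RPO) can be sketched as a simple selection rule: pick the cheapest plan whose recovery figures still meet the business targets. In the Python sketch below, the cost ordering comes from the list, but the RTO and RPO durations attached to each plan are hypothetical placeholders, not figures from this book or from any cloud provider; real values depend on the workload and must be measured. The names `DrPlan`, `PLANS`, and `cheapest_plan` are likewise illustrative.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class DrPlan:
    name: str
    cost_rank: int           # 1 = least costly, 4 = most expensive
    typical_rto: timedelta   # hypothetical placeholder value
    typical_rpo: timedelta   # hypothetical placeholder value


# Ordered from least to most costly; the durations are illustrative assumptions.
PLANS = (
    DrPlan("Backup and Store", 1, timedelta(hours=24), timedelta(hours=24)),
    DrPlan("Pilot Light", 2, timedelta(hours=4), timedelta(minutes=15)),
    DrPlan("Warm Standby", 3, timedelta(minutes=30), timedelta(minutes=5)),
    DrPlan("Multi-site", 4, timedelta(minutes=1), timedelta(0)),
)


def cheapest_plan(required_rto: timedelta,
                  required_rpo: timedelta,
                  plans: Sequence[DrPlan] = PLANS) -> Optional[DrPlan]:
    """Return the least costly plan that meets both targets, or None."""
    candidates = [
        plan for plan in plans
        if plan.typical_rto <= required_rto and plan.typical_rpo <= required_rpo
    ]
    return min(candidates, key=lambda plan: plan.cost_rank, default=None)
```

Under these placeholder figures, a business that needs an RTO of 1 hour and an RPO of 10 minutes would land on Warm Standby, while one that can tolerate two days of downtime and data loss would land on Backup and Store. A result of `None` signals that no listed plan meets the targets as modeled.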

Organizations often choose a less costly option for disaster recovery, but whichever option they choose, it is essential to test it regularly and make sure failover works. The team should add a routine checkpoint to its operational excellence practices to make sure business continuity is maintained in the event of a disaster.
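One way to turn the routine checkpoint into something a team can automate is a small check that flags when the last failover test is too old. The Python sketch below is only an illustration: the function name `drill_overdue` and the 90-day default interval are assumptions, not recommendations from this book, and each organization should set its own interval.

```python
from datetime import date, timedelta


def drill_overdue(last_drill: date,
                  today: date,
                  max_interval: timedelta = timedelta(days=90)) -> bool:
    """Return True when the last failover drill is older than max_interval.

    The 90-day default is a hypothetical value chosen for illustration.
    """
    return today - last_drill > max_interval
```

A scheduled job could call this check with the date of the last successful failover test and raise an alert when it returns `True`.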