Azure for Architects

Azure high availability

Achieving high availability and meeting stringent SLA requirements is difficult. Azure provides many features that enable high availability for applications, from the host and guest OS to applications built on its PaaS offerings. Architects can use these features to achieve high availability in their applications through configuration, instead of building these capabilities from scratch or depending on third-party tools.

In this section, we will look at the features and capabilities provided by Azure to make applications highly available. Before we get into the architectural and configuration details, it is important to understand concepts related to Azure's high availability.

Concepts

The fundamental concepts provided by Azure to attain high availability are as follows:

Availability sets

The fault domain

The update domain

Availability zones

As you know, it's very important that we design solutions to be highly available. The workloads might be mission-critical and require highly available architecture. We will take a closer look at each of the concepts of high availability in Azure now. Let's start with availability sets.

Availability sets

High availability in Azure is primarily achieved through redundancy. Redundancy means that there is more than one resource instance of the same type that takes control in the event of a primary resource failure. However, just having more similar resources does not make them highly available. For example, there could be multiple VMs provisioned within a subscription, but simply having multiple VMs does not make them highly available. Azure provides a resource known as an availability set, and having multiple VMs associated with it makes them highly available. A minimum of two VMs should be hosted within the availability set to make them highly available. All VMs in the availability set become highly available because they are placed on separate physical racks in the Azure datacenter. During updates, these VMs are updated one at a time, instead of all at the same time. Availability sets provide a fault domain and an update domain to achieve this, and we will discuss this more in the next section. In short, availability sets provide redundancy at the datacenter level, similar to locally redundant storage.

It is important to note that availability sets provide high availability within a datacenter. If an entire datacenter is down, then the availability of the application will be impacted. To ensure that applications are still available when a datacenter goes down, Azure has introduced a new feature known as availability zones, which we will learn about shortly.

If you recall the list of fundamental concepts, the next one in the list is the fault domain. The fault domain is often denoted by the acronym FD. In the next section, we will discuss what the FD is and how it is relevant while designing highly available solutions.

The fault domain

Fault domains (FDs) represent a group of VMs that share a common power source and network switch. When a VM is provisioned and assigned to an availability set, it is hosted within an FD. Each availability set has either two or three FDs by default, depending on the Azure region. Some regions provide two, while others provide three FDs in an availability set. FDs are non-configurable by users.

When multiple VMs are created, they are placed on separate FDs. If the number of VMs exceeds the number of FDs, the additional VMs are placed on existing FDs. For example, with five VMs and two FDs, some FDs will host more than one VM.

FDs are related to physical racks in the Azure datacenter. FDs provide high availability in the case of unplanned downtime due to hardware, power, or network failures. Since each VM is placed on a different rack with different hardware, a different power supply, and a different network, the other VMs continue running if a rack fails.

The next one in the list is the update domain.

The update domain

An FD takes care of unplanned downtime, while an update domain handles downtime from planned maintenance. Each VM is also assigned an update domain, and all the VMs within an update domain reboot together. There can be as many as 20 update domains in a single availability set. Update domains are non-configurable by users. When multiple VMs are created, they are placed on separate update domains. If more than 20 VMs are provisioned in an availability set, they are placed in a round-robin fashion across these update domains. From Service Health in the Azure portal, you can check planned maintenance details and set alerts.
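To make the round-robin placement concrete, here is a minimal sketch of how VMs in an availability set end up spread across fault and update domains. This is an illustrative simulation, not Azure's actual allocator; the function name and domain counts are assumptions for the example.

```python
# Illustrative sketch (NOT Azure's actual allocator): VMs in an availability
# set are spread across fault domains and update domains in round-robin order.
def place_vms(vm_count, fault_domains=2, update_domains=5):
    """Return {vm_index: (fault_domain, update_domain)} using round-robin placement."""
    return {vm: (vm % fault_domains, vm % update_domains) for vm in range(vm_count)}

placement = place_vms(7)
# With 7 VMs, 2 FDs, and 5 UDs: VM 0 and VM 2 share FD 0, while
# VM 0 and VM 5 share UD 0, so no single rack or patch wave takes them all down.
print(placement)
```

Running this shows why a rack failure (one FD) or a patch reboot (one UD) never takes out every VM at once, as long as there are at least two VMs in the set.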

In the next section, we will be covering availability zones.

Availability zones

This is a relatively new concept introduced by Azure and is very similar to zone redundancy for storage accounts. Availability zones provide high availability within a region by placing VM instances on separate datacenters within the region. Availability zones are applicable to many resources in Azure, including VMs, managed disks, VM scale sets, and load balancers. The complete list of resources that are supported by availability zones can be found at https://docs.microsoft.com/azure/availability-zones/az-overview#services-that-support-availability-zones. Being unable to configure availability across zones was a gap in Azure for a long time, and it was eventually fixed with the introduction of availability zones.

Each Azure region comprises multiple datacenters equipped with independent power, cooling, and networking. Some regions have more datacenters, while others have fewer. These datacenters within the region are known as zones. To ensure resiliency, there is a minimum of three separate zones in all enabled regions. Deploying VMs in availability zones ensures that these VMs are in different datacenters and are on different racks and networks. The datacenters in a region are connected by high-speed networks, so there is minimal communication lag between these VMs. Figure 2.1 shows how availability zones are set up in a region:

Availability zones in an Azure region
Figure 2.1: Availability zones in a region

You can find more information about availability zones at https://docs.microsoft.com/azure/availability-zones/az-overview.

Zone-redundant services replicate your applications and data across availability zones to protect from single points of failure. 

If an application needs higher availability and you want to ensure that it is available even if an entire Azure region is down, the next rung of the ladder for availability is the Traffic Manager feature, which will be discussed later in this chapter. Let's now move on to understanding Azure's take on load balancing for VMs.

Load balancing

Load balancing, as the name suggests, refers to the process of balancing a load among VMs and applications. With one VM, there is no need for a load balancer because the entire load is on a single VM and there is no other VM to share the load. However, with multiple VMs containing the same application and service, it is possible to distribute the load among them through load balancing. Azure provides a few resources to enable load balancing:

Load balancers: The Azure load balancer helps to design solutions with high availability. It is a layer 4 load balancer that distributes incoming traffic among healthy instances of services defined in a load-balanced set. Layer 4 load balancers work at the transport level of the Open Systems Interconnection (OSI) model and use network-level information, such as IP addresses and ports, to decide the target for an incoming request. Load balancers are discussed in more detail later in this chapter.

Application gateways: Azure Application Gateway delivers high availability to your applications. It is a layer 7 load balancer that distributes incoming traffic among healthy instances of services. Layer 7 load balancers work at the application level and use application-level information, such as cookies, HTTP, HTTPS, and sessions, to route incoming requests. Application gateways are discussed in more detail later in this chapter. They are also used when deploying Azure Kubernetes Service, specifically for scenarios in which ingress traffic from the internet should be routed to the Kubernetes services in the cluster.

Azure Front Door: Azure Front Door is very similar to application gateways; however, it does not work at the region or datacenter level. Instead, it helps in routing requests across regions globally. It has the same feature set as that provided by application gateways, but at the global level. It also provides a web application firewall for the filtering of requests and provides other security-related protection. It provides session affinity, TLS termination, and URL-based routing as some of its features.

Traffic Manager: Traffic Manager helps in the routing of requests at the global level across multiple regions based on the health and availability of regional endpoints. It supports doing so using DNS redirect entries. It is highly resilient and has no service impact during region failures as well.

Since we've explored the methods and services that can be used to achieve load balancing, we'll go ahead and discuss how to make VMs highly available.

VM high availability

VMs provide compute capabilities. They provide processing power and hosting for applications and services. If an application is deployed on a single VM and that machine is down, then the application will not be available. If the application is composed of multiple tiers and each tier is deployed on its own single instance of a VM, even downtime for a single VM instance can render the entire application unavailable. Azure provides a 99.9% SLA even for single-instance VMs, provided they use premium storage for their disks. Azure provides a higher SLA for VMs that are grouped together in an availability set: 99.95% for VMs that are part of an availability set with two or more VMs, and 99.99% if the VMs are placed across availability zones. In the next section, we will be discussing high availability for compute resources.
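These SLA percentages are easier to reason about when converted into permissible downtime. The short sketch below computes the maximum monthly downtime each tier allows, assuming a 30-day month for simplicity:

```python
# Convert an SLA percentage into the maximum downtime it permits per month.
# A 30-day month (43,200 minutes) is assumed for illustration.
def monthly_downtime_minutes(sla_percent, minutes_per_month=30 * 24 * 60):
    return minutes_per_month * (1 - sla_percent / 100)

for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% SLA -> up to {monthly_downtime_minutes(sla):.2f} minutes of downtime per month")
```

So moving from a single VM (99.9%) to availability zones (99.99%) shrinks the allowed monthly downtime from roughly 43 minutes to under 5.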

Compute high availability

Applications demanding high availability should be deployed on multiple VMs in the same availability set. If applications are composed of multiple tiers, then each tier should have a group of VMs on their dedicated availability set. In short, if there are three tiers of an application, there should be three availability sets and a minimum of six VMs (two in each availability set) to make the entire application highly available.

So, how does Azure actually provide an SLA and high availability to the VMs in an availability set?

Here, the concepts that we considered earlier come into play: fault and update domains. When Azure sees multiple VMs in an availability set, it places those VMs on separate FDs. In other words, these VMs are placed on separate physical racks instead of the same rack. This ensures that at least one VM continues to be available even if there is a power, hardware, or rack failure. There are two or three FDs in an availability set and, depending on the number of VMs in the set, the VMs are either placed on separate FDs or assigned to them in a round-robin fashion. This ensures that high availability is not impacted by the failure of a rack.

Azure also places these VMs on a separate update domain. In other words, Azure tags these VMs internally in such a way that these VMs are patched and updated one after another, such that any reboot in an update domain does not affect the availability of the application. This ensures that high availability is not impacted because of the VM and host maintenance. It is important to note that Azure is not responsible for OS-level and application maintenance.

With the placement of VMs in separate fault and update domains, Azure ensures that all VMs are never down at the same time and that they are alive and available for serving requests, even though they might be undergoing maintenance or facing physical downtime challenges:

VM distribution across update and fault domains
Figure 2.2: VM distribution across fault and update domains

Figure 2.2 shows four VMs (two have Internet Information Services (IIS) and the other two have SQL Server installed on them). Both the IIS and SQL VMs are part of availability sets. The IIS and SQL VMs are in separate FDs and different racks in the datacenter. They are also in separate update domains.

Figure 2.3 shows the relationship between fault and update domains: 

Layout of update and fault domains in an availability set
Figure 2.3: Layout of update domains and FDs in an availability set

So far, we have discussed achieving high availability for compute resources. In the next section, you will learn how high availability can be implemented for PaaS.

High-availability platforms

Azure has provided a lot of new features to ensure high availability for PaaS. Some of them are listed here:

Containers in app services

Azure Container Instances groups

Azure Kubernetes Service

Other container orchestrators, such as DC/OS and Swarm

Another important platform that brings high availability is Service Fabric. Both Service Fabric and container orchestrators such as Kubernetes ensure that the desired number of application instances are always up and running in an environment. What this means is that even if one of the instances goes down in the environment, the orchestrator will know about it by means of active monitoring and will spin up a new instance on a different node, thereby maintaining the ideal and desired number of instances. It does this without any manual or automated interference from the administrator.
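The "maintain the desired count" behavior described above can be sketched as a toy reconciliation loop. This is only in the spirit of an orchestrator; the function and instance IDs are illustrative, not a real Kubernetes or Service Fabric API:

```python
# Toy reconciliation loop in the spirit of an orchestrator: whenever the
# number of running instances drops below the desired count, new instances
# are spun up until the desired count is restored.
def reconcile(running, desired_count):
    """Return the new set of running instance IDs after one reconcile pass."""
    running = set(running)
    next_id = 0
    while len(running) < desired_count:
        while next_id in running:  # find a free ID for the replacement instance
            next_id += 1
        running.add(next_id)       # "spin up" a replacement
    return running

instances = {0, 1, 2}
instances.discard(1)                 # instance 1 crashes
instances = reconcile(instances, 3)  # the orchestrator restores the desired count
print(sorted(instances))
```

A real orchestrator does the same comparison continuously between observed state and desired state, which is why no administrator intervention is needed.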

While Service Fabric allows any type of application to become highly available, orchestrators such as Kubernetes, DC/OS, and Swarm are specific to containers. Also, it is important to understand that these platforms provide features that enable rolling updates, rather than a big bang update that might affect the availability of the application.

When we were discussing high availability for VMs, we took a brief look at what load balancing is. Let's take a closer look at it to better understand how it works in Azure.

Load balancers in Azure

Azure provides two resources that have the functionality of a load balancer. It provides a layer 4 load balancer, which works at the transport layer of the OSI model, and a layer 7 load balancer (the application gateway), which works at the application and session layers.

Although both application gateways and load balancers provide the basic features of balancing a load, they serve different purposes. There are a number of use cases in which it makes more sense to deploy an application gateway than a load balancer.

An application gateway provides the following features that are not available with Azure load balancers:

Web application firewall: This is an additional firewall on top of the OS firewall, and it inspects incoming messages. This helps in identifying and preventing common web-based attacks, such as SQL injection, cross-site scripting, and session hijacking.

Cookie-based session affinity: Load balancers distribute incoming traffic to service instances that are healthy and relatively free. A request can be served by any service instance. However, there are applications that need advanced features in which all subsequent requests following the first request should be processed by the same service instance. This is known as cookie-based session affinity. An application gateway provides cookie-based session affinity to keep a user session on the same service instance using cookies.

Secure Sockets Layer (SSL) offload: The encryption and decryption of request and response data is performed by SSL and is generally a costly operation. Web servers should ideally be spending their resources on processing and serving requests, rather than the encryption and decryption of traffic. SSL offload helps in transferring this cryptography process from the web server to the load balancer, thereby providing more resources to web servers serving users. The request from the user is encrypted but gets decrypted at the application gateway instead of the web server. The request from the application gateway to the web server is unencrypted.

End-to-end SSL: While SSL offload is a nice feature for certain applications, there are certain mission-critical secure applications that need complete SSL encryption and decryption even if traffic passes through load balancers. An application gateway can be configured for end-to-end SSL cryptography as well.

URL-based content routing: Application gateways are also useful for redirecting traffic to different servers based on the URL content of incoming requests. This helps in hosting multiple services alongside other applications.

Azure load balancers

An Azure load balancer distributes incoming traffic based on the transport-level information that is available to it. It relies on the following features:

An originating IP address

A target IP address

An originating port number

A target port number

The protocol type, either TCP or UDP
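These fields form the flow's 5-tuple, which the load balancer hashes to pick a backend. The sketch below shows the general idea; the hash function and backend addresses are illustrative, and the real distribution algorithm is Microsoft's own:

```python
# Sketch of 5-tuple hashing, the idea behind the Azure load balancer's
# default distribution mode. The hash and addresses here are illustrative;
# the actual implementation is internal to Azure.
import hashlib

def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol, backends):
    """Deterministically map a flow's 5-tuple to one backend instance."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{protocol}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return backends[digest % len(backends)]

backends = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]
# Packets belonging to the same flow always land on the same backend:
b1 = pick_backend("203.0.113.7", 50123, "52.1.2.3", 443, "TCP", backends)
b2 = pick_backend("203.0.113.7", 50123, "52.1.2.3", 443, "TCP", backends)
print(b1, b1 == b2)
```

Because the mapping is a pure function of the 5-tuple, every packet of a given connection reaches the same VM, while different connections spread across the pool.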

An Azure load balancer can be a private load balancer or a public load balancer. A private load balancer can be used to distribute traffic within the internal network. As this is internal, there won't be any public IPs assigned and they cannot be accessed from the internet. A public load balancer has an external public IP attached to it and can be accessed via the internet. In Figure 2.4, you can see how internal (private) and public load balancers are incorporated into a single solution to handle internal and external traffic, respectively:

Distributing traffic using Azure load balancers
Figure 2.4: Distributing traffic using Azure load balancers

In Figure 2.4, you can see that external users are accessing the VMs via the public load balancer, and then the traffic from the VM is distributed across another set of VMs using an internal load balancer.

We have compared how Azure load balancers differ from application gateways. In the next section, we will discuss application gateways in more detail.

The Azure Application Gateway

An Azure load balancer helps us to enable solutions at the infrastructure level. However, there are times when advanced services and features are required from a load balancer. These advanced services include SSL termination, sticky sessions, advanced security, and more. The Azure application gateway provides these additional features; it is a layer 7 load balancer that works with the application and session payloads at the upper layers of the OSI model.

Application gateways have more information compared to Azure load balancers in order to make decisions on request routing and load balancing between servers. Application gateways are managed by Azure and are highly available.

An application gateway sits between the users and the VMs, as shown in Figure 2.5:

Connecting users and VMs through Azure Application Gateway
Figure 2.5: An Azure application gateway

Application gateways are a managed service. They use Application Request Routing (ARR) to route requests to different services and endpoints. Creating an application gateway requires a private or public IP address. The application gateway then routes the HTTP/HTTPS traffic to configured endpoints.

An application gateway is similar to an Azure load balancer from a configuration perspective, with additional constructs and features. Application gateways can be configured with a front-end IP address, a certificate, a port configuration, a back-end pool, session affinity, and protocol information.

Another service that we discussed in relation to high availability for VMs was Azure Traffic Manager. Let's try to understand more about this service in the next section.

Azure Traffic Manager

After gaining a good understanding of both Azure load balancers and application gateways, it's time to get into the details of Traffic Manager. Azure load balancers and application gateways are much-needed resources for high availability within a datacenter or region; however, to achieve high availability across regions and datacenters, there is a need for another resource, and Traffic Manager helps us in this regard.

Traffic Manager helps us to create highly available solutions that span multiple geographies, regions, and datacenters. Traffic Manager is not like a load balancer. It uses the Domain Name System (DNS) to redirect requests to an appropriate endpoint determined by the health and configuration of that endpoint. Traffic Manager is not a proxy or a gateway, and it does not see the traffic passing between the client and the service; it simply redirects requests to the most appropriate endpoints.

Azure Traffic Manager helps to control the traffic that is distributed across application endpoints. An endpoint can be termed as any internet-facing service hosted inside or outside of Azure.

Endpoints are internet-facing, reachable public URLs. Applications are provisioned within multiple geographies and Azure regions. Applications deployed to each region have a unique endpoint referred to by DNS CNAME. These endpoints are mapped to the Traffic Manager endpoint. When a Traffic Manager instance is provisioned, it gets an endpoint by default with a .trafficmanager.net URL extension.
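The CNAME chain described above can be modeled as a simple lookup table. The hostnames below are hypothetical, and in reality the resolution is done by DNS servers rather than application code; this only illustrates how a Traffic Manager hostname resolves through to a regional endpoint:

```python
# Simplified view of Traffic Manager's DNS-based redirection.
# Hostnames are hypothetical; real resolution happens in DNS, and Traffic
# Manager substitutes the healthy regional endpoint it has selected.
cnames = {
    "www.contoso.com": "contoso.trafficmanager.net",
    "contoso.trafficmanager.net": "contoso-westeurope.azurewebsites.net",  # chosen healthy endpoint
}

def resolve(name):
    """Follow CNAME records until the final endpoint is reached."""
    while name in cnames:
        name = cnames[name]
    return name

print(resolve("www.contoso.com"))
```

Once the client has the final hostname, it talks to that regional endpoint directly; Traffic Manager is out of the data path entirely.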

When a request arrives at the Traffic Manager URL, it finds the most appropriate endpoint in its list and redirects the request to it. In short, Azure Traffic Manager acts as a global DNS to identify the region that will serve the request.

However, how does Traffic Manager know which endpoints to use and redirect client requests to? There are two aspects that Traffic Manager considers to determine the most appropriate endpoint and region.

Firstly, Traffic Manager actively monitors the health of all endpoints. It can monitor the health of VMs, cloud services, and app services. If it determines that the health of an application deployed to a region is not suitable for redirecting traffic, it redirects the requests to a healthy endpoint.

Secondly, Traffic Manager can be configured with routing information. There are six traffic routing methods available in Traffic Manager, which are as follows:

Priority: This should be used when all traffic should go to a default endpoint, and backups are available in case the primary endpoints are unavailable.

Weighted: This should be used to distribute traffic across endpoints evenly, or according to defined weights.

Performance: This should be used for endpoints in different regions, and users should be redirected to the closest endpoint based on their location. This has a direct impact on network latency.

Geographic: This should be used to redirect users to an endpoint (Azure, external, or nested) based on the nearest geographical location. This can help in adhering to compliance related to data protection, localization, and region-based traffic collection.

Subnet: This is a new routing method and it helps in providing clients with different endpoints based on their IP addresses. In this method, a range of IP addresses are assigned to each endpoint. These IP address ranges are mapped to the client IP address to determine an appropriate returning endpoint. Using this routing method, it is possible to provide different content to different people based on their originating IP address. 

Multivalue: This is also a new method added in Azure. In this method, multiple endpoints are returned to the client and any of them can be used. This ensures that if one endpoint is unhealthy, then other endpoints can be used instead. This helps in increasing the overall availability of the solution.
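Two of these routing methods, priority and weighted, can be sketched in a few lines. These are toy versions for illustration only; the real service makes these decisions inside DNS based on its own health probes, and the endpoint tuples below are an assumed representation:

```python
import random

# Toy versions of two Traffic Manager routing methods (illustrative only).
def route_priority(endpoints):
    """endpoints: list of (name, priority, healthy); the lowest-priority healthy endpoint wins."""
    healthy = [e for e in endpoints if e[2]]
    return min(healthy, key=lambda e: e[1])[0]

def route_weighted(endpoints, rng=random.Random(42)):
    """endpoints: list of (name, weight, healthy); pick a healthy endpoint in proportion to weight."""
    healthy = [e for e in endpoints if e[2]]
    names = [e[0] for e in healthy]
    weights = [e[1] for e in healthy]
    return rng.choices(names, weights=weights, k=1)[0]

# Priority routing: the primary is unhealthy, so traffic fails over to the backup.
eps = [("primary", 1, False), ("backup", 2, True)]
print(route_priority(eps))

# Weighted routing: "eu" receives roughly three times as much traffic as "us".
print(route_weighted([("eu", 3, True), ("us", 1, True)]))
```

Note that in both cases unhealthy endpoints are filtered out first, which mirrors how Traffic Manager combines health monitoring with the configured routing method.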

It should be noted that after Traffic Manager determines a valid healthy endpoint, clients connect directly to the application. Let's now move on to understand Azure's capabilities in routing user requests globally.

In the next section, we will be discussing another service, called Azure Front Door. This service is like Azure Application Gateway; however, there is a small difference that makes this service distinct. Let's go ahead and learn more about Azure Front Door.

Azure Front Door

Azure Front Door is the latest offering in Azure that helps route requests to services at a global level instead of a local region or datacenter level, as in the case of Azure Application Gateway and load balancers. Azure Front Door is like Application Gateway, with the difference being in the scope. It is a layer 7 load balancer that helps in routing requests to the nearest best-performing service endpoint deployed in multiple regions. It provides features such as TLS termination, session affinity, URL-based routing, and multiple site hosting, along with a web application firewall. It is similar to Traffic Manager in that it is, by default, resilient to entire region failures and it provides routing capabilities. It also conducts endpoint health probes periodically to ensure that requests are routed to healthy endpoints only.

It provides four different routing methods:

Latency: Requests will route to endpoints that will have the least latency end to end.

Priority: Requests will route to a primary endpoint and to a secondary endpoint in the case of the failure of the primary.

Weighted: Requests will route based on weights assigned to the endpoints.

Session Affinity: Requests in a session will end up with the same endpoint to make use of session data from prior requests. The original request can end up with any available endpoint.
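The latency method above amounts to picking the healthy endpoint with the lowest measured latency. A minimal sketch, with made-up latency figures and an assumed data shape, looks like this:

```python
# Sketch of latency-based routing as Front Door describes it: among endpoints
# that pass their health probes, pick the one with the lowest latency.
# The latency numbers and region names are made up for illustration.
def pick_lowest_latency(endpoints):
    """endpoints: {name: (latency_ms, healthy)} -> name of the best healthy endpoint."""
    healthy = {name: latency for name, (latency, ok) in endpoints.items() if ok}
    return min(healthy, key=healthy.get)

endpoints = {
    "westeurope": (18, True),
    "eastus": (95, True),
    "southeastasia": (160, False),  # failed its health probe, so it is excluded
}
print(pick_lowest_latency(endpoints))
```

As with Traffic Manager, health filtering happens before the routing decision, so an unhealthy region is never chosen even if it would otherwise be the fastest.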

Deployments looking for resilience at the global level should include Azure Front Door in their architecture, alongside application gateways and load balancers. In the next section, you will see some of the architectural considerations that you should account for while designing highly available solutions.