Don't Let Your .NET Applications Fail: Resiliency with Polly

One aspect of application development that is often overlooked, especially by beginner developers, is application resilience.

A lot of tutorials often focus on the happy path of execution, omitting the details of potential errors that can occur.

Example

Consider the following, slightly simplified, ASP.NET MVC example:

[HttpPost]
public async Task<ActionResult<ResponseModel>>
 CreateOrderAsync(OrderModel orderModel)
{
    var cart = await _cartService.GetCartItemsAsync(UserId);
    if (cart.Items.Count == 0)
    {
        return new ResponseModel
        {
            // ... omitted for brevity ...
        };
    }

    var orderEntries = cart.Items.Select(c => c.ToDbModel(UserId));
    var order = new Order
    {
        UserId = UserId,
        DatePlaced = DateTime.UtcNow,
        Entries = orderEntries,
        CartIdempotencyToken = cart.IdempotencyToken
    };

    _context.Orders.Add(order);
    await _context.SaveChangesAsync();
    // User should no longer have the items in their cart 
    // after they've placed an order
    await _cartService.EmptyAsync(UserId, cart.IdempotencyToken);
    return new ResponseModel
    {
        // ... omitted for brevity ...
    };
}

Apart from the absence of some obvious error handling (what happens if the user's cart can't be found?), the code looks decent enough at first glance. It is able to retrieve the entities from the CartService, map them to the database entities and store them as part of the Order entity.

I've tested it, it works!

Sure enough, the code is correct algorithmically - it does exactly what you've asked it to do. You have tested it with various inputs and concluded that no matter what data you give it, the processing will be done correctly. So what's the problem?

Async - state, trapped in time

The request/response model and async/await make the code look linear. It's pretty obvious where the data is coming from, and where it goes. It's sure convenient.

However, if we don't look into the nature of asynchronous processing, it is very easy to miss an important detail - asynchronous processing is stretched in time and usually involves third-party resources that can fail at any point. A resource that was available just a moment ago, when we were executing the top of the function, may very well be down by the time we reach the next call.

This service is not alone in the world - in this case it interacts with the CartService (which may make calls to a microservice over the network) and the database. It becomes pretty obvious that the author of this code example was focused on an ideal condition in which both of them are always available and never return errors. However, the reality is a lot more complicated than that - there may be network problems that make the service unreachable, or the service may straight-up refuse to process the request correctly due to issues of its own (remember the fallacies of distributed computing?).

Although the result of a happy path is correct, we haven't even thought about a plethora of potential issues:

What is the time requirement for this endpoint? Could it be that, after a certain period of time, it's better to give up on processing the request and return an error telling the client to try again later (a timeout)?
What happens if the cart data request fails? Is it safe for us to retry it? How many times? What kind of retry intervals are safe to use without overwhelming the upstream service?
What if the database store operation fails? What kind of response should the user get? Are we able to retry it too? (For example, if it failed due to a network problem).
What if the cart cleaning operation fails? Is it essential to clear the cart, or can we keep it in the worst-case scenario? Can we retry it?

OK, it's complicated. Is there a better way?

Sure is! Polly comes to the rescue!

Polly is a resilience and transient-fault-handling library that allows us to very easily express the policies that will help to deal with various issues.

With Polly, it becomes very easy to describe retries, timeout, caching, and many more policies or their combinations.

Building and using policies

One thing that you should decide right away is whether your policy is going to be asynchronous or synchronous, because depending on your choice of policy builder method you will get back either a Policy or an AsyncPolicy instance, and using these two together can be quite challenging.

Usually though - you'll be making asynchronous calls through your policies so let's use that as an example.

// Let's build our simple timeout policy
// This policy will timeout after 3 seconds
var timeoutPolicy = Policy.TimeoutAsync(3);

// Note that this also supports optimistic cancellation
var res = await timeoutPolicy
	.ExecuteAsync(ct => TestAsync(ct), CancellationToken.None);

Ok, so what is going on in this example? We build a policy that specifies a timeout rule, and on the next line we are using that policy to call an asynchronous method called TestAsync(...). This method also supports optimistic cancellation (we explicitly notify it when it's time to stop through the CancellationToken) and we are making use of that.

AsyncPolicy.ExecuteAsync has an overload that gives us access to an internal CancellationToken of the policy and we can pass that to our method to achieve the desired result.

However, notice how I've passed CancellationToken.None as a second parameter? That's right, if you wish, Polly allows you to also use your own CancellationToken that will be linked to the internal one to terminate the execution even sooner. Pretty awesome!
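To make the two cancellation sources concrete, here is a minimal sketch assuming Polly v7. The body of TestAsync is a hypothetical stand-in (the original does not show it), and TimeoutDemo and RunAsync are names made up for illustration. As far as I know, Polly's timeout policy surfaces its own timeout as TimeoutRejectedException, while cancellation through the caller's token surfaces as an ordinary OperationCanceledException, so the two cases can be told apart:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class TimeoutDemo
{
    // Hypothetical stand-in for TestAsync: about 5 seconds of work in
    // 500 ms steps. It checks the token between steps and passes it on,
    // so it stops shortly after cancellation is requested.
    public static async Task<string> TestAsync(CancellationToken ct)
    {
        for (var step = 0; step < 10; step++)
        {
            ct.ThrowIfCancellationRequested();
            await Task.Delay(TimeSpan.FromMilliseconds(500), ct);
        }
        return "done";
    }

    public static async Task<string> RunAsync(CancellationToken callerToken)
    {
        var timeoutPolicy = Policy.TimeoutAsync(3);
        try
        {
            // ct is the policy's token, linked to callerToken
            return await timeoutPolicy
                .ExecuteAsync(ct => TestAsync(ct), callerToken);
        }
        catch (TimeoutRejectedException)
        {
            // The policy's 3-second limit elapsed first
            return "timed out";
        }
        catch (OperationCanceledException)
        {
            // The caller's own token was cancelled first
            return "cancelled by caller";
        }
    }
}
```

In an ASP.NET Core controller the caller's token would typically be the request's cancellation token, so that work stops when the client disconnects.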

Basic retries

As discussed earlier, Polly supports a lot of things out of the box, but for now, let's focus on the most basic example - retries with exponential backoff.

From the official Polly wiki:

// Retry a specified number of times, using a function to 
// calculate the duration to wait between retries based on 
// the current retry attempt (allows for exponential backoff)
// In this case will wait for
//  2 ^ 1 = 2 seconds then
//  2 ^ 2 = 4 seconds then
//  2 ^ 3 = 8 seconds then
//  2 ^ 4 = 16 seconds then
//  2 ^ 5 = 32 seconds
Policy
  .Handle<SomeExceptionType>()
  .WaitAndRetryAsync(5, retryAttempt => 
	TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) 
  );

In this example, we can see a policy that will retry your code at most 5 times, increasing the delay before each new attempt. This is very useful in situations when you don't want to overwhelm the upstream servers with retries.

The exponential backoff mechanism will allow your system to balance out and find a suitable rate of calls to upstream servers, even if they are experiencing temporary problems or load spikes. Note that it is beneficial to introduce some randomness (jitter) into the retry policy, so that many clients that failed at the same moment do not all retry at the same moment too.

This also partially helps to reduce the possibility that your service will cause a denial of service for the upstream server. More on that in the next chapter. Take extra care when retrying calls to services with side effects (e.g. sending emails to users through an SMTP service), because an exception does not always mean that the action was not executed by the service, and you may unintentionally execute it multiple times while retrying a failed call.
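The jitter suggestion above can be sketched as a small delay function. This is only one possible scheme (exponential base delay plus up to one second of random noise); the Backoff class and WithJitter method are hypothetical names, not part of Polly:

```csharp
using System;

public static class Backoff
{
    // System.Random is not thread-safe, so access is guarded by a lock.
    private static readonly Random Rng = new Random();

    // Returns 2^retryAttempt seconds plus a random 0-999 ms of jitter,
    // so that clients which failed together do not retry together.
    public static TimeSpan WithJitter(int retryAttempt)
    {
        int jitterMs;
        lock (Rng)
        {
            jitterMs = Rng.Next(0, 1000);
        }

        return TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
             + TimeSpan.FromMilliseconds(jitterMs);
    }
}
```

Since WaitAndRetryAsync accepts a function from the retry attempt number to a TimeSpan, this method can be passed in place of the lambda from the wiki example, e.g. .WaitAndRetryAsync(5, Backoff.WithJitter).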

Circuit breaker

Sometimes, if the rate of failures is too high, it's probably a good idea to give the upstream servers some time to recover while partially degrading the functionality of your own application.

Imagine this scenario - we have a factory that makes car engines. These travel through various assembly steps on a conveyor belt and are then checked at the end to ensure quality. If the manufacturing yield is high enough (say only 1 in 10000 engines is defective), just removing the defective part is a good enough solution. On the other hand, if the failure rate is above 30%, something is definitely wrong and it's worth stopping the whole conveyor for inspection.

With web services we can do exactly the same thing - if we see that the failure rate of our requests is too high, maybe it's not worth making a request at all? Let's give our upstream servers some time to deal with whatever issue they are having while degrading our application a little bit.

It may not be suitable for all scenarios, but if the upstream service provides non-critical functionality (say recommendations for a purchased product in an e-shop), it is useful to temporarily disable that feature while showing the users a pop-up explaining that the service is experiencing temporary high load.

The circuit breaker policy does exactly that - it allows us to temporarily stop making upstream calls when the failure rate is above a certain threshold, or when a certain number of consecutive exceptions of a specified type occur.

var policy = Policy
  .Handle<HttpRequestException>()
  .CircuitBreaker(
    exceptionsAllowedBeforeBreaking: 2, 
    durationOfBreak: TimeSpan.FromMinutes(1)
  );

In this example - if two consecutive calls through this policy throw an exception of type HttpRequestException, the circuit will break and stay broken for a duration of 1 minute, meaning that any call made through this policy in that interval will throw a BrokenCircuitException.

It's up to the application developer to properly handle this exception and possibly return some kind of meaningful message to the client describing what exactly has happened.
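As an illustration of handling that exception, here is a minimal sketch assuming Polly v7 and its asynchronous CircuitBreakerAsync variant. The RecommendationsGateway class, the GetOrDegradeAsync method and the message text are hypothetical; the upstream call is passed in as a delegate so the sketch does not assume any particular HTTP client setup:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public static class RecommendationsGateway
{
    public const string DegradedMessage =
        "Recommendations are temporarily unavailable.";

    // The breaker must be shared between calls: it is the policy
    // instance that counts the consecutive failures.
    private static readonly AsyncCircuitBreakerPolicy Breaker = Policy
        .Handle<HttpRequestException>()
        .CircuitBreakerAsync(
            exceptionsAllowedBeforeBreaking: 2,
            durationOfBreak: TimeSpan.FromMinutes(1));

    public static async Task<string> GetOrDegradeAsync(
        Func<Task<string>> fetch)
    {
        try
        {
            return await Breaker.ExecuteAsync(fetch);
        }
        catch (BrokenCircuitException)
        {
            // The circuit is open: the upstream call was not even
            // attempted, so degrade gracefully instead of failing.
            return DegradedMessage;
        }
    }
}
```

Note that the failures that trip the breaker still surface as HttpRequestException; only the calls made while the circuit is open are turned into the degraded response here.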

Policy wrapping

I won't be explaining all of the policy variants available, but by now I hope you have seen how powerful these are. But wait, there is more! You can wrap one policy with another to achieve even more complex behavior.

Consider this example with two separate policies:

// Timeout policy, 
// requests cancellation when the execution time exceeds 
// a specified amount.
var timeoutPolicy = Policy.TimeoutAsync(3);

// Fallback policy
// If an exception of a specified type occurs 
// during method execution,
// it will return a predefined result
var fallbackPolicy = Policy<string>
	.Handle<SomeException>()
	// Polly's timeout policy signals a timeout by throwing
	// TimeoutRejectedException (namespace Polly.Timeout)
	.Or<TimeoutRejectedException>()
	.FallbackAsync("Fallback result");

The first one will just cancel the execution of a method as discussed earlier, while the other one is a bit more interesting - in case the method throws an exception of type SomeException or TimeoutRejectedException (which is what the timeout policy throws when the time limit is exceeded), it will return the predefined result "Fallback result" instead. But what if we could combine these two? Can we do that? Easy!

var combined = fallbackPolicy.WrapAsync(timeoutPolicy);
var result = await combined.ExecuteAsync(...);

And that's it, now we have a policy that will either return the original result of a method, or a fallback result if the operation times out or throws SomeException.

The order of the wraps will actually affect the behavior, so make sure to pay close attention to it because it may give you unexpected results. In the example above, fallbackPolicy is the outer one (it will operate on the results returned or exceptions thrown by the timeoutPolicy), and the timeoutPolicy will operate on the results of the method passed to .ExecuteAsync(...).
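The effect of the wrap order can be sketched end to end, assuming Polly v7. WrapOrderDemo and the deliberately slow delegate are made up for illustration, and the timeout is shortened to 200 ms so the example finishes quickly. To my understanding, Polly's timeout policy reports an elapsed timeout as TimeoutRejectedException, which is the exception the outer fallback needs to handle:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class WrapOrderDemo
{
    public static async Task<string> RunAsync()
    {
        // Inner policy: gives the delegate 200 ms to finish
        var timeoutPolicy = Policy
            .TimeoutAsync(TimeSpan.FromMilliseconds(200));

        // Outer policy: sees whatever the timeout policy throws
        var fallbackPolicy = Policy<string>
            .Handle<TimeoutRejectedException>()
            .FallbackAsync("Fallback result");

        // fallback (outer) wraps timeout (inner)
        var combined = fallbackPolicy.WrapAsync(timeoutPolicy);

        return await combined.ExecuteAsync(async ct =>
        {
            // Hypothetical slow operation that honours cancellation
            await Task.Delay(TimeSpan.FromSeconds(5), ct);
            return "Real result";
        }, CancellationToken.None);
    }
}
```

With this order the slow call should be cut off by the timeout and replaced by the fallback value. If the two policies were wrapped the other way round, the fallback would sit inside the timeout and would never see the timeout exception, so the caller would get the exception instead.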

Integrations with HttpClient

With the Microsoft.Extensions.Http.Polly package installed, you can call the .AddPolicyHandler(...) method on your IHttpClientBuilder instances to handle some trivial cases, like responses with 5XX or 408 status codes, and retry with a chosen strategy.

This lifts this concern from the layers that use these HttpClient instances.

The Polly.Extensions.Http package even provides that behavior out of the box with its HttpPolicyExtensions.HandleTransientHttpError() method.

Consider this example code:

public void ConfigureServices(IServiceCollection services)
{
	...
	services.AddHttpClient<T>(client => {...}) 
	// You may want to do some additional configuration 
	// on your http clients (like base address)
		.AddPolicyHandler(GetHttpRetryPolicy());
	...
}

private static IAsyncPolicy<HttpResponseMessage> GetHttpRetryPolicy()
{
	return HttpPolicyExtensions.HandleTransientHttpError()
	.RetryAsync(3);
}

In this example, we are using a function provided for us by Polly.Extensions.Http to get the policy with the default behavior that handles various HTTP status codes and simply retries the operation.

Do note, however, that for the .AddPolicyHandler() extension to work you'll need to configure a typed or named HTTP client (the parameterless implementation of .AddHttpClient() just returns IServiceCollection). More information about named and typed clients can be found in the official Microsoft documentation.

Some tools have built-in resilience mechanisms

Polly is not the only way to get resilience - if you look closely into some of the tools you are using already, you might discover that they also provide resilience mechanisms. One notable example can be found in EF Core when using MS SQL Server.

Let's take a look at the configuration:

// Startup.cs from any ASP.NET Core Web API
public class Startup
{
    // Other code ...
    public void ConfigureServices(IServiceCollection services)
    {
        // ...
        services.AddDbContext<CatalogContext>(options =>
        {
            options.UseSqlServer(Configuration["ConnectionString"],
            sqlServerOptionsAction: sqlOptions =>
            {
                sqlOptions.EnableRetryOnFailure(
                maxRetryCount: 10,
                maxRetryDelay: TimeSpan.FromSeconds(30),
                errorNumbersToAdd: null);
            });
        });
    }
//...
}

In this example, a failed database operation will be retried no more than 10 times, with a maximum delay of 30 seconds between retries. errorNumbersToAdd specifies additional SQL Server error codes that will be handled by this retry policy.

How we use it

To simplify the reuse of common policies, we at TeleSoftas use a policy registry. It is exactly what it sounds like - a registry of policies that you can address by a unique (e.g. string) key. It is very convenient to register the policy registry inside of the DI container at the start of the application, add policies, and then resolve the needed ones later through the IPolicyRegistry<TKey> interface.

Adding a registry

To add the registry to your DI container, simply call services.AddPolicyRegistry(); in your Startup.ConfigureServices(...).

Registering policies

To register a policy, simply call the policyRegistry.Add(TKey key, TPolicy policy) method. The example below shows us how to register a policy that we have already seen - a retry policy that handles transient HTTP errors. This configuration introduces a 2-second delay between retries and retries at most 3 times.

policyRegistry.Add(PolicyNameConstants.Transient,
		HttpPolicyExtensions
		.HandleTransientHttpError()
		.WaitAndRetryAsync(3, (i) => TimeSpan.FromSeconds(2)));

Retrieving and using the policy

Simply inject your policy registry into the service where you need it and request the needed policy by its key like so:

public class MyService
{
	// HandleTransientHttpError() builds policies over HttpResponseMessage
	private readonly IAsyncPolicy<HttpResponseMessage> _policy;
	
	public MyService(IReadOnlyPolicyRegistry<string> policyRegistry, ...)
	{
		_policy = policyRegistry
		.Get<IAsyncPolicy<HttpResponseMessage>>(PolicyNameConstants.Transient);
		...
		// Rest is omitted
		...
	}
	
	public async Task<HttpResponseMessage> DoStuffAsync()
	{
		var response = await _policy.ExecuteAsync(async () => 
		{
			// Do the work; the delegate must return
			// the HttpResponseMessage it produced
		});
		// Do something else
		return response;
	}
}

Final thoughts

First and foremost - get to know your tools. Some of them already provide near-effortless ways to improve your app's stability. Adding a retry policy takes only a few lines of code, but it might save you a lot of headaches.

Second - make sure to focus not only on the happy path of execution but carefully plan the failure and graceful degradation strategies, especially when dealing with external services.

I hope that with this brief introduction to resilience policies I've persuaded you to go through your code and identify spots for potential improvement, where any potential errors are simply dismissed and not handled properly.