Transient Error Handling with Polly Part 2

Dec 3, 2020 19:17 · 2907 words · 14 minute read

On today’s Visual Studio Toolbox, Carl continues showing us Polly, a tool we can use to handle transient errors in our apps. [MUSIC] >> Hi, welcome to Visual Studio Toolbox. I’m your host Robert Green, and this is part two of our look at Polly, which is a tool that gives us the ability to handle transient errors. You have services in the Cloud talking to each other, and sometimes they fail. How do you handle that? We’re going to continue with Carl Franklin. >> Hi. >> Hey, Carl. >> Hey.

00:34 - >> In part 1, we did an overview of Polly: what’s the issue it’s trying to solve, and we started looking at how it works. We’re just going to pick up where we left off in this episode. >> That’s right. If you don’t know what this is, go back and watch the last episode, but the short answer is Polly is an open source project that my company, App vNext, took over and enhanced. It’s now in the .NET Foundation. It is also part of .NET Core. If you use the HttpClient factory, you can configure policies automatically with that. Also, it’s very popular.
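Here’s a minimal sketch of what that HttpClient factory wiring might look like, assuming the Microsoft.Extensions.Http.Polly package; the client name “MyApi” and the retry settings are illustrative, not from the episode:

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Polly;

// Register a named HttpClient whose transient HTTP errors (5xx
// responses, 408s, HttpRequestExceptions) go through a Polly retry
// policy: three attempts, 200 ms apart.
var services = new ServiceCollection();
services.AddHttpClient("MyApi")
    .AddTransientHttpErrorPolicy(p =>
        p.WaitAndRetryAsync(3, _ => TimeSpan.FromMilliseconds(200)));
```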

01:14 - It’s getting about 150,000 downloads a day. It’s good, it’s clean. A lot of people use it. In the last episode, I did three demos. Now, we’re moving on to the next one, which is wait and retry with enough retries that it actually succeeds. In the last demo, we had wait and retry three times with a 200 millisecond delay in between each retry. Now, we’re going to retry enough times that it actually works.

01:46 - >> So again, we’re making a call, which is always going to fail, but a policy gives us the ability to specify what to do about it, rather than sitting there forever and not providing any information: try a number of times, wait and retry. So this is really: you’ve got an error, a service isn’t talking to another service, and you’re setting up policies to say, well, what am I going to do about it automatically? >> Exactly. We’re calling this web API where we’re passing a value and getting a reply with that value back, but it’s programmed to fail after the fourth request within a five-second window. Requests one through three within five seconds work. After that, they don’t.

02:32 - What we’re doing here is we’re using a wait-and-retry-async policy handling all exceptions, and of course, we can be specific about the exceptions we want to handle. We’re going to retry 20 times, and every time there’s a retry, we’re going to wait 200 milliseconds. This is what happens in between those retries: this is where the exception handler tells the user what’s going on. So we’re basically showing, in a console application, those messages in yellow. Again, in the last demo we did, this number was three. We were retrying three times and it was failing. Now, we’re retrying 20 times and it will succeed after 20 retries, because that’s enough time to cover it. Essentially, what’s going to happen is we just make an HTTP request and we wait for a while, and then it comes back that it worked. >> In the previous examples, we never saw 4, 5, 6, 7, 8, 9, 10, 11, 12.
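Roughly what that policy might look like in code; this is a sketch rather than the demo’s exact source, and CallWebApiAsync is a hypothetical stand-in for the demo’s HTTP request:

```csharp
using System;
using Polly;

// Handle any exception; retry up to 20 times, waiting 200 ms between
// attempts, and report each retry to the console in yellow.
var policy = Policy
    .Handle<Exception>()
    .WaitAndRetryAsync(
        retryCount: 20,
        sleepDurationProvider: _ => TimeSpan.FromMilliseconds(200),
        onRetry: (exception, delay, attempt, context) =>
        {
            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.WriteLine($"Attempt {attempt}: {exception.Message}. Retrying...");
            Console.ResetColor();
        });

// var response = await policy.ExecuteAsync(() => CallWebApiAsync(value));
```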

03:41 - They just never came back, right? >> Right. >> Now, we’re waiting long enough and retrying long enough that they eventually succeed? >> Right. Again, this code isn’t prescriptive. This isn’t saying this is what you should do. We’re just exercising all the different policies so that you can get a handle on what they do and how they behave. The next one is wait and retry forever. Now, why would you do that? You’re asking for trouble, don’t you think? >> Yeah.

04:11 - >> However, there is a really good use case for this one and it’s just wait-and-retry-forever-async, 200 milliseconds between each one. I was actually writing a WPF app that was like a wizard. On one page, you were gathering data from a device. It was a Kinect, actually, a Microsoft Kinect. When you hit “Next”, it took all that data and submitted it to an API in the Cloud, and then it couldn’t actually go to the next part of the wizard until we got a response from that server with the magic numbers and the results, because the server had some magic sauce that it applied. The application could not continue.

04:58 - There is nothing you can do in this application until we get a response, so wait and retry forever is good. We’re basically putting up a thing that says, “Retrying, and if it doesn’t work, check your internet connection and just come back later.” We were saving the state of it so we could just go back in and run it, and it would pick up. But there might be some situations where you don’t want to fail, you just want to wait and retry forever. That’s [inaudible]. It’s going to look exactly the same as the last demo, which retried 20 times because we knew that was enough.
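A minimal sketch of that retry-forever setup with cancellation, under the same assumptions as before (CallWebApiAsync is hypothetical):

```csharp
using System;
using System.Threading;
using Polly;

// Retry forever, 200 ms between attempts, until the call succeeds
// or the user cancels via the CancellationToken.
var cts = new CancellationTokenSource();

var policy = Policy
    .Handle<Exception>()
    .WaitAndRetryForeverAsync(
        _ => TimeSpan.FromMilliseconds(200),
        (exception, delay) =>
            Console.WriteLine("Retrying... check your internet connection."));

// Passing the token lets a Cancel button stop the retries at any time:
// var response = await policy.ExecuteAsync(ct => CallWebApiAsync(ct), cts.Token);
// cts.Cancel();
```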

05:33 - Now, we’re not specifying 20, we’re just retrying until it actually succeeds. A little bit different. >> You could add code into that that asks the user, do you want to keep retrying? >> Sure. >> It’s cancellable, right? The user [inaudible] wait forever. >> Right. We have a cancellation token. Yeah. >> Okay. >> At any time, they can press the button that fires that cancellation. The next one, this is an interesting one and this is all code.

06:05 - It’s a wait and retry with an exponential back-off. I think this is as close to prescriptive as we get. This is a really good way to do wait and retry. The whole idea is that every time through the retry, we change the number of seconds or whatever. We exponentially increment the timeout is what I’m saying. The first time it’s 200 milliseconds, then it’s 400 milliseconds, then it’s 800 milliseconds, etc. This is all just done with a little bit of code. This is a wait and retry, and I think we have a maximum number of six retries here, but you could do a wait and retry forever with an incremental back-off, but you can see how it slows down, 200, 400, 800, 1,600. >> The idea being that this thing’s not responsive, so rather than just nag, nag, nag, you start slowing down, because if it hasn’t come back in 800, it’s not going to come back in 900? >> Correct. >> That’s the theme there? Okay. >> Yeah. >> That’s cool. >> This is a really good strategy, but you can also combine this with other things, like a circuit breaker, which I think is the next thing that we’re talking about.
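A sketch of that back-off calculation, assuming the same wait-and-retry overload; doubling from 200 ms gives the 200, 400, 800, 1,600 progression described here:

```csharp
using System;
using Polly;

// Exponential back-off: 200, 400, 800, 1600, 3200, 6400 ms across a
// maximum of six retries.
var backoffPolicy = Policy
    .Handle<Exception>()
    .WaitAndRetryAsync(
        retryCount: 6,
        sleepDurationProvider: attempt =>
            TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1)));
```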

07:27 - These circuit breakers are what I was talking about before with I Love Lucy: Lucy and Ethel handling the strawberries that are coming down the line. The chocolate-covered strawberries are coming down the conveyor belt, and they’re coming too fast, so they’re throwing them over there. What are they doing with all those requests? The idea is that when you have a downstream service that’s struggling for whatever reason, Azure, AWS, whatever might have rebooted, or there’s a problem with your service or, hey, maybe your credit card expired and they decided to shut it off on you. I don’t know what it was. There’s some problem with that service. If all these other services start hammering that service, it amounts to a denial of service attack. They’ll never recover, right? >> Right.

08:16 - >> Rather than continuing to send even with a timeout, even with an exponential timeout, you can break the circuit. What that means is that when the circuit is open, no calls go through. But remember this is happening at the policy level. It’s not going to fail to the client, but it is going to just wait, and you basically tell it how long it needs to wait before it closes the circuit again. That’s what happens. It gets a little more complex demo-wise, but it’s a very, very powerful tool, the circuit breaker. It’s a well-known pattern too.

08:59 - If you think about it, this is a great example of nesting policies. We’ve got a wait and retry policy that waits 200 milliseconds and then keeps retrying. Then we have a circuit breaker policy and we’re going to break if the action fails four times in a row. That’s what this four here is. We’re going to wait three seconds after that and then we’re going to do a test run and see if it works. If it doesn’t work, we’re going to keep the circuit open.

09:41 - If it does work, we’re going to close the circuit and allow everything to go through. Now, this is a different metaphor than database connections, which are exactly opposite. When a database connection is open, you can use it. When it’s closed, you can’t. When a circuit is closed, that means electricity is running through it. It works. When a circuit is open, that’s like a circuit breaker. There’s nothing going through it. It’s a little bit of a different metaphor, but you get the idea. Now, look at this. In our try and catch, we have our wait and retry policy execute. Then inside that, we have the circuit breaker policy execute. So they’re nested, and yes, I’ll show you how to clean this up in a minute. But that’s the whole idea, is that you can nest these policies right from outside to inside. Watch this.
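In code, the nesting might look roughly like this; the retry timing and breaker settings mirror the numbers described above, and CallWebApiAsync is again a hypothetical stand-in:

```csharp
using System;
using Polly;

// Outer policy: keep retrying, 200 ms between attempts.
var waitAndRetry = Policy
    .Handle<Exception>()
    .WaitAndRetryForeverAsync(_ => TimeSpan.FromMilliseconds(200));

// Inner policy: break the circuit after 4 consecutive failures, stay
// open for 3 seconds, then allow a trial call (the half-open state).
var circuitBreaker = Policy
    .Handle<Exception>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 4,
        durationOfBreak: TimeSpan.FromSeconds(3));

// Nested execution, outside to inside:
// await waitAndRetry.ExecuteAsync(() =>
//     circuitBreaker.ExecuteAsync(() => CallWebApiAsync()));
```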

10:37 - Again, we’re going to get some different colors here, but I’ll explain what happens. All right. So that’s enough. We can just look at what happens here. The first three work okay. Now, we have our wait and retry policy: too many requests. After four failures, the circuit breaker kicks in and says it’s breaking the circuit. This is 1, 2, 3, 4. This is the power of async with console applications.

11:09 - After four, it says it’s breaking the circuit for three seconds, and then we get these exceptions that fail. Then the next one, it’s called a half-open circuit: we’re making a trial call. That call worked, so we’re closing the circuit again and everything works. During this whole time, during this time here, we are not sending any requests through. Even though the wait and retry is trying to resend, the circuit isn’t allowing them through. >> That is cool.

11:49 - Of course, for your particular app and your particular scenario, you can play around with the actual policies and how many times you want to retry and how long you want to wait, based on the application, based on user preferences really. >> Yeah, exactly. >> It’s an interesting way to do it. You can ask people, how long do you want to sit there twiddling your thumbs before we give up? >> The cool part is there’s a way that you can update those variables while the application is running, because there’s a configuration store that you can just change and it will populate. You don’t have to stop the application just to change the policy. >> That’s great. >> The next one is, I told you we would clean this up. This is a thing called a policy wrap, and policy wrap is a part of Polly where we have our two policies.

12:47 - This is exactly the same as the last one, a wait and retry policy and our circuit breaker policy. But now we’re using a policy wrap. We’re saying Policy.WrapAsync. Here’s the outer policy and here’s the inner policy. >> Okay. Cool. >> Now, instead of having these nested, we just call the policy wrap’s ExecuteAsync. >> Got it. >> Ain’t that cool? >> Yeah. Very cool. >> The result is exactly the same as the last demo. >> It’s just cleaner code. >> It’s cleaner code. It’s easier to read. Think if you have three nested. >> Yeah. Exactly. >> Junior programmer, you get to debug that.
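The wrapped version might look roughly like this, reusing the two policies from the previous sketch (still hypothetical code, not the demo’s source):

```csharp
using System;
using Polly;
using Polly.Wrap;

// The same two policies as the nested version.
var waitAndRetry = Policy
    .Handle<Exception>()
    .WaitAndRetryForeverAsync(_ => TimeSpan.FromMilliseconds(200));
var circuitBreaker = Policy
    .Handle<Exception>()
    .CircuitBreakerAsync(4, TimeSpan.FromSeconds(3));

// Policy.WrapAsync lists policies outermost first: wait-and-retry on
// the outside, circuit breaker on the inside.
AsyncPolicyWrap policyWrap = Policy.WrapAsync(waitAndRetry, circuitBreaker);

// One call replaces the nested ExecuteAsync calls:
// await policyWrap.ExecuteAsync(() => CallWebApiAsync());
```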

13:35 - Now, we’ve got a wrap with three: a wait and retry, a circuit breaker, and a fallback, and the fallback policy is like the last resort. That’s when you throw up your hands and say, “I’m done. This didn’t work. We’re finally going to report an exception to the user,” but we want to do that in a nice way. We want to control the message that goes to the user, rather than showing whatever error our infrastructure or service gives us. You want to tell the user, nicely, that this failed and sorry, try back in an hour. >> Didn’t work. Nobody knows why. >> Didn’t work.

14:20 - Everything works, but you see the fallback catch is filled with, let me see: “The circuit is now open and not allowing calls,” and then the response is “Please try again later.” You can substitute whatever message you want here. So that’s what the fallback is. Let me just show this real quick. Here’s our wait and retry, here’s our circuit breaker, and now here’s our fallback policy, where we’re handling a broken circuit exception. We’re saying “Please try again later”; we substituted that message.
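A sketch of such a fallback policy, assuming the call returns a string; BrokenCircuitException is what Polly’s circuit breaker throws while the circuit is open:

```csharp
using Polly;
using Polly.CircuitBreaker;

// When the circuit breaker reports an open circuit, substitute a
// friendly message instead of surfacing the raw exception.
var fallbackForCircuitBreaker = Policy<string>
    .Handle<BrokenCircuitException>()
    .FallbackAsync("Please try again later.");
```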

15:00 - Then essentially we have a fallback for any exception. We have a fallback for the circuit breaker and a fallback for any exception, which we just did. Now, get this, we have two wraps. We have a wrap that wraps a wrap: the resilience strategy, which is the wait and retry and the circuit breaker, and we have another one that wraps the fallback for the circuit breaker. >> Is there any limit to the number of wrappings you can do? >> No. >> Cool.
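Putting the whole thing together might look like this sketch, again with hypothetical names and settings rather than the demo’s exact code:

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

// The resilience strategy: wait-and-retry wrapping the circuit breaker.
var waitAndRetry = Policy
    .Handle<Exception>()
    .WaitAndRetryForeverAsync(_ => TimeSpan.FromMilliseconds(200));
var circuitBreaker = Policy
    .Handle<Exception>()
    .CircuitBreakerAsync(4, TimeSpan.FromSeconds(3));
var resilienceStrategy = Policy.WrapAsync(waitAndRetry, circuitBreaker);

// Two fallbacks: one for a broken circuit, one for anything else.
var fallbackForCircuitBreaker = Policy<string>
    .Handle<BrokenCircuitException>()
    .FallbackAsync("Please try again later.");
var fallbackForAnyException = Policy<string>
    .Handle<Exception>()
    .FallbackAsync("Sorry, something went wrong. Please try again later.");

// A wrap that wraps a wrap, outermost first.
var policyWrap = fallbackForAnyException.WrapAsync(
    fallbackForCircuitBreaker.WrapAsync(resilienceStrategy));

// await policyWrap.ExecuteAsync(() => CallWebApiAsync());
```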

15:34 - >> We have the fallback for any exception wrapping the fallback for circuit breaker, wrapping my resilience strategy, which is these two. There are essentially five policies going on here. Then we use the policy wrap just like before, in one. Very cool stuff. I mean, there’s more to it. How much time do we have? >> We’re at 15, so we should probably wrap up at this point. >> What I want to tell people is that these two demos right here, bulkhead isolation, I just want to explain what that is. So a bulkhead on an ocean liner, let’s not use the Titanic as an example because it didn’t have bulkheads, or maybe it did, but I don’t know. Modern ocean liners separate their hull into compartments. So think of them like sealed-off rooms. If it was just one big hull, if they hit an iceberg or got a torpedo anywhere in the hull, the whole ship would sink. But if they’re cordoned off into these sections, one section might get filled up with water, but it wouldn’t sink the whole ship. So that’s the metaphor.

16:51 - The metaphor is if you’ve got a service which is calling two downstream services, and one of those services goes down, you don’t want that to affect the other service. How it can affect it is that all the resources go to retrying the service that’s down, and then the service that’s actually not down becomes a victim of that, because you’re using all these server resources for the failing one and the healthy one gets none of them. That’s what that demo does, and you can explore that in the samples on your own time. Think of it like multithreading, but in the context of a service call. It’s a lot easier than multithreading. >> It seems like Polly is very easy to use and extremely powerful.
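For reference, a bulkhead policy in Polly is nearly a one-liner; the numbers here are illustrative, and CallDownstreamServiceAsync is a hypothetical stand-in:

```csharp
using Polly;

// Allow at most 10 concurrent calls to the downstream service, queue
// up to 20 more, and reject the rest (BulkheadRejectedException)
// rather than letting one failing service starve everything else.
var bulkhead = Policy.BulkheadAsync(
    maxParallelization: 10,
    maxQueuingActions: 20);

// await bulkhead.ExecuteAsync(() => CallDownstreamServiceAsync());
```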

17:41 - I think in a nutshell, it gives us the ability to handle these transient errors when a service is failing and do something way more useful for the user than just showing a spinning circle while everybody waits for the service to come back. Is that a good summary of what it is? >> That is probably the best summary I’ve ever heard. I mean, that’s essentially what you want to do. You don’t want to just let failures happen, especially when you’re in the middle of service-to-service communication and something happens with service A; nobody’s there to press the retry button. You have to account for those things. The other thing that I didn’t mention is that the whole idea of the Chaos Monkey, the Netflix Chaos Monkey, is built in to Polly.

18:30 - There’s another project that you can use with Polly, it’s called Simmy, that you can use to do random delays and failures, just to test out the resiliency of your system. It’s like a complete package. It’s really good stuff. >> Awesome. Thanks so much for coming on and showing this to us. We’ll have links to all the demos and the repo where you can get it. Highly recommend everybody start playing around with this. This is really cool stuff. All right. Hope you guys enjoyed that and we will see you next time on Visual Studio Toolbox. [MUSIC]