Breaking Down the Breaker

15 February 2017

I recently built and released a new project in my new favorite programming language: Elixir. The project is an HTTP request circuit breaker, simply named Breaker because I own an extremely creative automatic name generator.

What’s a “Circuit Breaker”?

The circuit breaker pattern is used to protect your application from failing calls to some remote application. It wraps some remote resource (say, an external web service or a database) and can help your application fail fast and gracefully degrade when that remote resource is misbehaving. Martin Fowler has a much more detailed explanation.

It can wrap anything that isn’t absolutely essential to the functionality of your application, but you wouldn’t wrap calls to your primary database if you absolutely need it to render the page being requested. In that case, you want your application to fail outright and hope that the problem resolves itself or that someone will be along soon to fix it.

A good real-world example can be found using Netflix. I’m sure you’ve been watching Netflix and fast-forwarding or rewinding through an episode of your current favorite TV show, like Once Upon A Time, and suddenly, the thumbnails aren’t there to show you a glimpse of where you are in the episode. It’s possible, and likely, that the service Netflix uses to calculate and/or send you those thumbnails was down or took too long to respond. But your Netflix client didn’t hang and wait for them to load; it just decided we’re not having thumbnails right now and went on letting you fast-forward and rewind. This is graceful degradation, and the circuit breaker pattern is one of the big keys to making this happen. In fact, Netflix is really good at this kind of thing because their application is composed of lots and lots of distinct services and is bigger than you probably think it is.

A Circuit Breaker can protect your user experience

When properly configured, wrapping calls to a remote service in a circuit breaker can make sure your users aren’t waiting too long for non-essential functionality. A good example of this would be a recommendation service for an eCommerce site. When someone looks at their previous orders or their wish list, it’s helpful to show them other products that they might like, based on products they’ve already purchased or plan to purchase. Unfortunately, preparing the page requires you to send a call to the Recommendation Service with a list of the customer’s recent purchases or the contents of their wish list. Then, the Recommendation Service has to respond with some products, but it might take some time to find the right products. When your application is getting lots and lots of requests, the Recommendation Service can slow down to a crawl, taking upwards of 5 seconds to respond because of its calculation-heavy nature. Meanwhile, your user is waiting longer than 5 seconds to load a page that could be loaded in 1 second, and they get frustrated with their experience.

You could enforce a shorter timeout for requests to the Recommendation Service, but now every user is still waiting too long and your system is not providing any relief to the poor, bogged-down Recommendation Service, so that doesn’t really accomplish anything.

You could load the recommendations after the page has loaded (via some AJAX call), but that may cause some screen reflow issues on different devices and can result in a poor user experience anyway. Not to mention that this wouldn’t work for more complex workflows.

On the other hand, you could wrap the calls in a Circuit Breaker for the Recommendation Service. The Circuit Breaker will record the results of each request to the Recommendation Service and then fail fast when it looks like the Recommendation Service is in trouble. For example, if requests start to time out too often, it won’t bother issuing the HTTP request and will just return an error instead. It’ll wait a while before allowing another HTTP request to go through, giving the Recommendation Service some time to catch up and cool off. This means that, when the Recommendation Service is severely overloaded, your pages can still load in the 1 second your users expect.
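To make the fail-fast idea concrete, here’s a minimal sketch. It is not Breaker’s actual API: the FailFastSketch module, its :open and :closed states, and the slow_service function are all made up for illustration.

```elixir
defmodule FailFastSketch do
  # Hypothetical illustration, not Breaker's actual API.
  # The first argument is the circuit state: :closed lets requests
  # through, :open refuses them.

  # Open circuit: return an error right away without calling the function.
  def call(:open, _request_fun), do: {:error, :open_circuit}

  # Closed circuit: run the request and pass its result through.
  def call(:closed, request_fun) when is_function(request_fun, 0) do
    request_fun.()
  end
end

# Stand-in for a slow HTTP call to the Recommendation Service.
slow_service = fn ->
  Process.sleep(50)
  {:ok, ["product-1", "product-2"]}
end

IO.inspect(FailFastSketch.call(:closed, slow_service))
IO.inspect(FailFastSketch.call(:open, slow_service))
```

The second call never touches slow_service at all, which is where the time savings come from when the remote service is struggling.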

This does require some application design thinking

If you read all of that and thought something along the lines of, “Yeah, but that means I have to plan for instances when the remote service is down,” then you’re on the right track.

You absolutely do need to plan for that case and you absolutely should.

When building your application (or applying the circuit breaker pattern to it) you should keep in mind which calls are essential and which are “nice-to-haves”. This empowers your application to have lots of cool features (like recommendations, recent purchases from friends, etc.) while not missing essential business functionality (like purchasing products) when things aren’t at 100%.
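As a rough sketch of what that planning can look like in code, the page below always renders its essential data and treats recommendations as optional. The OrderPageSketch module and the shape of the result tuples are assumptions for illustration, not part of Breaker.

```elixir
defmodule OrderPageSketch do
  # Hypothetical page-building code; the names here are illustrative only.

  # Orders are essential and always included; recommendations are a
  # "nice-to-have" that depends on how the remote call went.
  def build(orders, recommendation_result) do
    %{orders: orders, recommendations: recommendations(recommendation_result)}
  end

  # The remote call succeeded: show what came back.
  defp recommendations({:ok, products}), do: products

  # The remote call failed, or the breaker refused it: render without them.
  defp recommendations({:error, _reason}), do: []
end

IO.inspect(OrderPageSketch.build(["order-42"], {:ok, ["product-7"]}))
IO.inspect(OrderPageSketch.build(["order-42"], {:error, :open_circuit}))
```

Either way the customer gets their order history; only the extra feature disappears when things aren’t at 100%.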

That being said, different circuit breaker implementations calculate the breaker’s health differently, so let’s take a look at my library, Breaker, and get into some code stuff.

How Breaker calculates health

Breaker uses a rolling window, based on Netflix’s Hystrix. Essentially, we want to tell if a remote service is healthy or not based on the errors we record in some recent window of time. Two options configure the window, bucket_length and window_length, and there is also an option for the error_threshold and an option for the timeout. These are the options you probably want to play with to optimize your circuit breaker.

timeout and error_threshold are probably self-explanatory: the maximum time to wait for a request and the ratio of errors allowed before we decide to back off. The other two options are a bit more complex.

Both options are measurements: bucket_length is measured in time (ms) and window_length is measured in buckets. Buckets, then, allow us to further break down the window and are how we roll the window, discounting older requests and making room for new ones as time passes. Defaults say that buckets are 1 second in length, that the window contains 10 buckets, and that we’ll tolerate 5% of requests being errors. Let’s say we have an example application that needs to make about 10 requests per second.
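To put numbers on those defaults, here’s a small sketch of the arithmetic. The defaults map is only an illustration of the values described above, not necessarily how Breaker stores its options, and requests_per_second is the example traffic figure.

```elixir
# Default values as described above; the map itself is an assumption
# made for illustration.
defaults = %{
  bucket_length: 1_000,  # milliseconds per bucket
  window_length: 10,     # buckets per window
  error_threshold: 0.05  # tolerated ratio of errors
}

# Total time covered by the window, in milliseconds.
window_ms = defaults.bucket_length * defaults.window_length

# Our example application makes about 10 requests per second.
requests_per_second = 10
requests_in_window = requests_per_second * div(window_ms, 1_000)

IO.puts("window covers #{window_ms} ms, about #{requests_in_window} requests")
```

So with the defaults, health is judged over the most recent 10 seconds, or roughly 100 requests for this example.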

After starting the circuit breaker in our application, each bucket is filling up with our 10 requests. None of these are errors yet, so we have an effective error rate of 0%. After 10 seconds, we have more buckets than our window_length allows, so we remove the last one (the oldest) and the requests counted there are no longer used to calculate the error rate. Those counts are lost to the void. While things are going well, we’re counting about 100 requests at a time.
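Here’s one way that rolling could be sketched. This is a simplified illustration under my own assumptions (a list of buckets, newest first, each a map of counts) and Breaker’s internals may well differ; the RollingWindowSketch module and its functions are hypothetical.

```elixir
defmodule RollingWindowSketch do
  # Hypothetical sketch of a rolling window; Breaker's internals may differ.
  # The window is a non-empty list of buckets, newest first. Each bucket
  # counts total requests and errors for one bucket_length of time.

  def new_bucket, do: %{total: 0, errors: 0}

  # Record one request outcome (:ok or :error) in the newest bucket.
  def record([current | older], outcome) do
    errors = if outcome == :error, do: current.errors + 1, else: current.errors
    [%{current | total: current.total + 1, errors: errors} | older]
  end

  # Called once per bucket_length: start a fresh bucket and keep only the
  # newest `window_length` buckets, so the oldest counts fall away.
  def roll(window, window_length) do
    Enum.take([new_bucket() | window], window_length)
  end
end

# One bucket's worth of healthy traffic, then a roll to the next bucket.
window = [RollingWindowSketch.new_bucket()]
window = Enum.reduce(1..10, window, fn _, w -> RollingWindowSketch.record(w, :ok) end)
window = RollingWindowSketch.roll(window, 10)
IO.inspect(window)
```

Once the list holds window_length buckets, every further roll drops the oldest one, which is the "lost to the void" step.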

Then, things start picking up. We’re still sending about 100 requests per window, but another application is also calling our external service, and it has started to slow down trying to handle those calls in addition to our own. We start seeing an occasional timeout error, only once every 10 seconds or so, putting our error rate at 1%. Since 10 seconds is how long our window is, the bucket containing the error gets rotated out about when a new timeout gets recorded.

But things get worse and we start seeing more timeouts and, finally, we have more than 5 timeouts in our 100 requests, tripping the breaker. Now, when the application makes a request, it just receives a %Breaker.OpenCircuitError{} immediately instead of waiting the default 3 seconds for a timeout. As buckets are rotated out, when the error rate drops below the 5% mark, we’ll allow more requests through. If those start timing out again, the breaker may trip again. Or, if that was enough time for the external service to catch up, maybe we’ll see fewer timeouts and things will go back to normal.
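The tripping decision in that walkthrough boils down to comparing an error ratio against the threshold. Here’s a hedged sketch of that check: the TripCheckSketch module, the {total, errors} tuple shape, and the strict "greater than" comparison are my assumptions for illustration, not Breaker’s confirmed implementation.

```elixir
defmodule TripCheckSketch do
  # Hypothetical helper; names and data shapes are assumptions.
  # `buckets` is a list of {total_requests, error_count} tuples,
  # one per bucket currently in the window.

  def error_rate(buckets) do
    {total, errors} =
      Enum.reduce(buckets, {0, 0}, fn {t, e}, {total, errors} ->
        {total + t, errors + e}
      end)

    if total == 0, do: 0.0, else: errors / total
  end

  # Treat the circuit as open once the error rate exceeds the threshold.
  def open?(buckets, error_threshold) do
    error_rate(buckets) > error_threshold
  end
end

# 1 timeout in 100 requests: an occasional error.
occasional = List.duplicate({10, 0}, 9) ++ [{10, 1}]

# 6 timeouts in 100 requests: more than the 5% we tolerate.
struggling = List.duplicate({10, 0}, 4) ++ List.duplicate({10, 1}, 6)

IO.inspect(TripCheckSketch.open?(occasional, 0.05))
IO.inspect(TripCheckSketch.open?(struggling, 0.05))
```

Because the rate is recomputed from whatever buckets remain in the window, the circuit closes again on its own once the buckets holding the errors have rotated out.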

Given that example, the values for the timeout, error_threshold, bucket_length, and window_length options can really make a difference. Take a look at the project README for more tips on adjusting these values.

Added Overhead

I’m sure you’re now thinking, “This sounds like it’ll add some overhead to my application’s requests.”

Well, you’re right, any extra calculations will add overhead. So, I ran some benchmarks to make sure that the overhead isn’t intolerable, or that we can at least fix something we can measure. You can run the benchmarks too by cloning Breaker and running mix bench.

It turns out, one of the reasons I like Elixir is that it’s fast. On my pitiful machine (Core 2 Duo 2.1 GHz, 4 GB RAM with Chrome and Spotify open):

To give a little more insight:

So, these aren’t super simple operations, strictly speaking.

Essentially, we’re adding about a millisecond to each request for the opportunity to save thousands of milliseconds when things aren’t going perfectly. I’m sure there’s also some opportunity for improvement, as I haven’t optimized anything.

I’d love to hear from you and get feedback about the design or implementation (or your use) of this library. My contact information is at the bottom of the page!
