Monday, June 20, 2016

When things may get out of control: circuit breakers in practice. Hystrix.

It is amazing how tightly interconnected modern software systems are. Mostly every simple application has dependency on some external service or component, not to mention emerging at a great pace Internet of Things (or simply IoT) movement. It is good and not so at the same time, let us see why ...

There are many use cases when relying on other services, provided by someone externally or internally, makes a lot of sense (messaging, billing, taxes, payments, analytics, logistics, ...) but under the hood every such integration poses risks to our applications: they become dependent on availability and operationability of those services. Network latency, spikes of load, just banal software defects, each of these unknowns can bring our applications on its knees, making users and partners dissatisfied, to say it mildly.

The good news are there is a pattern we can employ to mitigate the risks: circuit breaker. Firstly explained in great details in the Release It! book by Michael T. Nygard, circuit breakers became the de-facto solution for dealing with external services. The idea is pretty simple: track the state of the external service on a given time interval to collect the knowledge about its availability. If the failure is being detected, circuit breaker opens, signalling that external service should better not be invoked for some time.

There are plenty of circuit breaker implementations available but because we are on JVM, we are going to talk about three of those: Netflix Hystrix, Akka and Apache Zest. To keep the posts considerably short, the topic of our discussion is going to be split in two parts: Netflix Hystrix followed by Akka and Apache Zest.

To show off circuit breakers in action, we are going to build a simple client around https://freegeoip.net/: public HTTP web API for software developers to search the geolocation of IP addresses. The client will return just brief geo-details about particular IP or hostname, wrapped into GeoIpDetails class:

@JsonIgnoreProperties(ignoreUnknown = true)
public final class GeoIpDetails {
    private String ip;
    @JsonProperty("country_code") private String countryCode;
    @JsonProperty("country_name") private String countryName;
    private double latitude;
    private double longitude;
}
So let us get started ...

Undoubtedly, Netflix Hystrix is the most advanced and thoroughly battle-tested circuit breaker implementation at the disposal of Java developers. It is built from the ground up to support asynchronous programming paradigm (heavily utilizing RxJava for that) and to have a very low overhead. It is more than just circuit breaker, it is full-fledged library to tolerate latency and failures in distributed systems, but we will touch upon basic Netflix Hystrix concepts only.

Netflix Hystrix has surprisingly simple design and is built on top of command pattern, with HystrixCommand in its core. Commands are identified by keys and are organized in groups. Before we are going to implement our own command, it is worth to talk about how Hystrix isolates the external service integrations.

Essentially, there are two basic strategies which Hystrix supports: offload the work somewhere else (using dedicated thread pool) or do the work in the current thread (relying on semaphores). Using dedicated thread pools, also known as the bulkhead pattern, is the right strategy to use in most use cases: the calling thread is unblocked, plus the timeout expectations could be set as well. With semaphores, the current thread are going to be busy till the work is completed, successfully or not (timeouts are claimed to be also supported since 1.4.x release branch but there are certain side effects).

Enough theory for now, let us jump into the code by creating our own Hystrix command class to access https://freegeoip.net/ using Apache HttpClient library:

public class GeoIpHystrixCommand extends HystrixCommand<String> {
    // Template: http://freegeoip.net/{format}/{host}
    private static final String URL = "http://freegeoip.net/";
    private final String host;
 
    public GeoIpHystrixCommand(final String host) {
        super(
            Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("GeoIp"))
                .andCommandKey(HystrixCommandKey.Factory.asKey("GetDetails"))
                .andCommandPropertiesDefaults(
                    HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(5000)
                        .withMetricsHealthSnapshotIntervalInMilliseconds(1000)
                        .withMetricsRollingStatisticalWindowInMilliseconds(20000)
                        .withCircuitBreakerSleepWindowInMilliseconds(10000)
                    )
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("GeoIp"))
                .andThreadPoolPropertiesDefaults(
                    HystrixThreadPoolProperties.Setter()
                        .withCoreSize(4)
                        .withMaxQueueSize(10)
                )
        );
        this.host = host;
    }
 
    @Override
    protected String run() throws Exception {
        return Request
            .Get(new URIBuilder(URL).setPath("/json/" + host).build())
            .connectTimeout(1000)
            .socketTimeout(3000)
            .execute()
            .returnContent()
            .asString();
    }
}

The first thing to get from this snippet is that Hystrix commands have a myriad of different properties which are initialized in the constructor. Command group and key, set to "GeoIp" and "GetDetails" respectively, we have already mentioned. Thread pool key, set to "GeoIp", and thread pool properties (for example, core pool size and maximum queue size) allow to tune thread pool configuration, the default execution isolation strategy used by Hystrix. Please notice that multiple commands may refer to the same thread pool (shedding the load for example), but semaphores are not shared.

Other GeoIpHystrixCommand command properties, arguably most important ones, would need some explanation:

  • executionTimeoutInMilliseconds sets the hard limit on overall command execution before timing out
  • metricsHealthSnapshotIntervalInMilliseconds indicates how often the status of the underlying circuit breaker should be recalculated
  • metricsRollingStatisticalWindowInMilliseconds defines the duration of rolling window to keep the metrics for the circuit breaker
  • circuitBreakerSleepWindowInMilliseconds sets the amount of time to reject requests for opened circuit breaker before trying again

It is worth to mention that Hystrix has sensible default value for every property so you are not obliged to provide them. However, those defaults are quite aggressive (in a very good sense) so you may need to relax some. Hystrix has a terrific documentation which talks about all the properties (and their default values) in details.

Another option which Hystrix incorporates is fallback in case the command execution was not successful, timed out or circuit breaker is tripped. Although fallback is optional, it is very good idea to have one, in case of https://freegeoip.net/ we may just return an empty response.

    @Override
    protected String getFallback() {
        return "{}"; /* empty response */
    }

Great, we have our command, and now what? There are multiple ways Hystrix command could be invoked. The most straightforward one is just synchronous execution using execute() method, for example:

public class GeoIpService {
    private final ObjectMapper mapper = new ObjectMapper();
 
    public GeoIpDetails getDetails(final String host) throws IOException {
        return mapper.readValue(new GeoIpHystrixCommand(host).execute(), 
            GeoIpDetails.class);
    }
}

In case of asynchronous execution, Hystrix has a couple of options, ranging from bare Java's Future to RxJava's Observable, for example:

public Observable<GeoIpDetails> getDetailsObservable(final String host) {
    return new GeoIpHystrixCommand(host)
        .observe()
        .map(result -> {
             try {
                 return mapper.readValue(result, GeoIpDetails.class);
              } catch(final IOException ex) {
                  throw new RuntimeException(ex);
              }
        });
}

The complete sources of the project example is available on Github.

If your project is built on top of very popular Spring Framework, there is a terrific out-of-the box Hystrix support using convenient (auto)configuration and annotations. Let us take a quick look on the same command implementation using Spring Cloud Netflix project (certainly, along with Spring Boot):

@Component
public class GeoIpClient {
    @Autowired private RestTemplate restTemplate;

    @HystrixCommand(
        groupKey = "GeoIp",
        commandKey = "GetDetails",
        fallbackMethod = "getFallback",
        threadPoolKey = "GeoIp",  
        commandProperties = {
            @HystrixProperty(
                name = "execution.isolation.thread.timeoutInMilliseconds", 
                value = "5000"
            ),
            @HystrixProperty(
                name = "metrics.healthSnapshot.intervalInMilliseconds", 
                value = "1000"
            ),
            @HystrixProperty(
                name = "metrics.rollingStats.timeInMilliseconds", 
                value = "20000"
            ),
            @HystrixProperty(
                name = "circuitBreaker.sleepWindowInMilliseconds", 
                value = "10000"
            )
        },
        threadPoolProperties = {
            @HystrixProperty(name = "coreSize", value = "4"),
            @HystrixProperty(name = "maxQueueSize", value = "10")
        }
    )
    public GeoIpDetails getDetails(final String host) {
        return restTemplate.getForObject(
            UriComponentsBuilder
                .fromHttpUrl("http://freegeoip.net/{format}/{host}")
                .buildAndExpand("json", host)
                .toUri(), 
            GeoIpDetails.class);
    }
 
    public GeoIpDetails getFallback(final String host) {
        return new GeoIpDetails();
    }
}

In this case the presence of Hystrix command is really hidden so the client just dials with a plain, injectable Spring bean, annotated with @HystrixCommand and instrumented using @EnableCircuitBreaker annotation.

And last, but not least, there are quite a few additional contributions for Hystrix, available as part of the Hystrix Contrib project. The one we are going to talk about first is hystrix-servo-metrics-publisher which exposes a lot of very useful metrics over JMX. It is essentially a plugin which should be explicitly registered with Hystrix, for example here is one of the ways to do that:

HystrixPlugins
    .getInstance()
    .registerMetricsPublisher(HystrixServoMetricsPublisher.getInstance());

When our application is up and running, here is how it looks like in JVisualVM (please notice that the com.netflix.servo MBean is going to appear only after the first Hystrix command execution or instrumented method invocation so you may not see it immediately on application start):

When talking about Hystrix, it is impossible not to mention Hystrix Dashboard: terrific web UI to monitor Hystrix metrics in real time.

Thanks again to Spring Cloud Netflix, it is very easy to integrate it into your applications using just @EnableHystrixDashboard annotation and another project from Hystrix Contrib portfolio, hystrix-metrics-event-stream which exposes Hystrix metrics over event stream. The complete version of the Spring-based project example is available on Github.

Hopefully at this point you would agree that, essentially, every integration with external services (which are most of the time just a black boxes) introduces instability into our applications and may cause cascading failures and serious outages. With this regards, Netflix Hystrix could be a life saver, worth adopting.

In the next part we are going to look at another circuit breaker implementations, namely the one available as part of Akka toolkit and Apache Zest.

All projects are available under Github repository.