
Reliability Engineering Exercise

This is a simple application for the Spring Reliability Engineering exercise. To see my answers to the questions, head over to the Questions.

Running this Application

This application uses Datadog as its monitoring solution, so a Datadog API key is required to run it. However, it is straightforward to switch to any monitoring solution supported by Micrometer.

Running Locally
# If you don't have redis
brew install redis

# or you can use docker
docker-compose up -d 

export DATADOG_API_KEY=YOUR_API_KEY

./gradlew bootRun

# To view all available metrics
curl http://localhost:8081/actuator/metrics
Load Tests

For simplicity and explainability, load tests were constructed in JMeter. In the future, this project will likely include Gatling load tests as well.

jmeter -n -t <LOAD-TEST>.jmx -Jhost=<HOST-OF-APP> -Jport=<PORT-OF-APP>

Background

This repo is an example of a simple Spring application that might exhibit two specific latency and throughput behaviors.

These behaviors are:

1. A higher mean in latency than the 99th percentile.

This behavior requires a small subset of requests to be extreme outliers in the latency distribution. This long-tail distribution of latencies results in a mean that is actually above even the 99th percentile of requests. A simple JUnit test that shows how this is possible is viewable here.
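The arithmetic behind this can be sketched in a few lines of plain Java (the sample values here are illustrative, not measurements from the app): 99 fast requests plus one extreme outlier produce a mean two orders of magnitude above the 99th percentile.

```java
import java.util.Arrays;

public class LatencyStats {
    // Arithmetic mean of the samples.
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    // Nearest-rank 99th percentile: the value at or below which 99% of samples fall.
    static double p99(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        return sorted[(int) Math.ceil(0.99 * sorted.length) - 1];
    }

    public static void main(String[] args) {
        // 99 requests at 10 ms and a single 100,000 ms outlier.
        double[] latencies = new double[100];
        Arrays.fill(latencies, 10.0);
        latencies[99] = 100_000.0;

        // mean = (99 * 10 + 100000) / 100 = 1009.9 ms, while p99 = 10 ms.
        System.out.printf("mean=%.1f ms, p99=%.1f ms%n",
                mean(latencies), p99(latencies));
    }
}
```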

To recreate this behavior in a load test, an endpoint was created that irregularly (roughly 1 in 100 requests) requires an expensive operation. This operation could be a cache miss, a slow network operation, or even a periodic expensive computation. The load test regularly calls this endpoint, and the occasional latency outlier drags the mean above even the 99th percentile. A graph of an hour of this load test executing is shown below.
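The real handler lives in the Spring app; as a hedged sketch of the idea (a deterministic counter stands in for the irregular trigger, and the returned numbers stand in for real latencies), the decision logic looks roughly like this:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: the actual "/occasional-image-processing"
// endpoint is implemented in the Spring app itself.
public class OccasionalSlowPath {
    private final AtomicLong requestCount = new AtomicLong();

    /** Simulates handling one request and returns its latency in ms. */
    long handle() {
        if (requestCount.incrementAndGet() % 100 == 0) {
            return 2_000;   // rare expensive operation: cache miss, slow
                            // network call, or periodic heavy computation
        }
        return 10;          // common fast path
    }
}
```

With these stand-in numbers, one slow request per hundred is enough to pull the mean to roughly 30 ms, well above the 10 ms fast path that the lower percentiles report.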

Mean Above 99

Notice that the average (gray line) usually remains above even the purple 99th percentile. This is the result of the orange line, which represents the max latency and is an order of magnitude greater than both the 99th percentile and the average latency. (The graph is on a logarithmic scale.) In a more realistic scenario, the max latency would not be so static.

For reference, the throughput graph across the same period is shown below.

Throughput

The Load Test

The load test for this scenario executes one thread group with multiple threads that consistently poll the "/occasional-image-processing" endpoint.

2. Periodic and regular spikes in throughput and latency.

(Note: the following is a purely theoretical situation that most likely was not the cause of the sample graphs for this exercise.)

To recreate this behavior, a Spring WebFlux app is used that has a simulated caching layer. The vast majority of requests either do not need the cache or use a value found in the cache. A small subset of requests are 'cache misses' that require additional processing or an expensive network lookup.

To simulate the irregular 'cache miss', a Spring caching layer with a Redis backend is utilized. The time to live (TTL) on the cache is set to 1 minute, so during the load test values regularly expire and require 'cache miss' lookups. A 20-second Thread.sleep is added to the code path that is executed when a lookup is required. This sleep obviously delays the response of the request that hits the 'cache miss'; however, the thread pause also has implications for other requests.
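The Spring/Redis wiring is in the repo; the expiry behavior it relies on can be sketched with a hypothetical in-memory stand-in (the one-minute TTL matches the description above, everything else is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the Redis-backed Spring cache described above.
// A miss happens on first access or once the TTL elapses; in the real app
// the miss path also does a 20-second Thread.sleep.
public class TtlCache {
    private final long ttlMillis;
    private final Map<String, Long> writtenAt = new HashMap<>();
    private final Map<String, String> values = new HashMap<>();
    int misses = 0;

    TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // 'now' is passed in so expiry is easy to demonstrate deterministically.
    String get(String key, long now) {
        Long t = writtenAt.get(key);
        if (t == null || now - t >= ttlMillis) {
            misses++;                               // absent or expired: cache miss
            values.put(key, expensiveLookup(key));  // real app sleeps ~20 s here
            writtenAt.put(key, now);
        }
        return values.get(key);
    }

    private String expensiveLookup(String key) {
        return "value-for-" + key;
    }
}
```

Under steady load, each cached key therefore forces one slow miss roughly every minute, which is the periodic trigger the load test exercises.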

Spring WebFlux utilizes an intentionally small number of threads and achieves powerful concurrency by relying on the reactive paradigm and nonblocking I/O. The Thread.sleep is a blocking operation that prevents the thread from processing new or existing requests. In this simulated setup, a queue of requests builds up behind the sleeping thread. In a real WebFlux application, blocking network calls, database lookups, or even computationally expensive operations could achieve the same result and completely exhaust a thread.

When the cache miss and thread sleep are over, the thread can process the backed-up requests. Because the application was unable to process these requests while they were waiting, the 'waiting time' is not included in the latency measurement or the throughput calculation. The application quickly works through the queued requests and includes them in the monitoring calculations. This flurry of requests results in the spikes in throughput. The behavior is evident in the sample graphs below, generated from a simple load test.
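The backlog effect can be illustrated with a toy single-threaded simulation (all numbers here are made up, not measurements from the app): requests arriving while the only worker is blocked for 20 seconds complete in a burst the moment it unblocks.

```java
// Toy model: one worker thread is blocked for the first 20 s; requests
// arrive every 100 ms and take 1 ms of actual processing. The completions-
// per-second counter is flat at zero, then spikes when the thread unblocks.
public class BacklogSpike {
    static int[] completionsPerSecond(int blockedMs, int arrivalGapMs,
                                      int processingMs, int lastArrivalMs) {
        int[] perSecond = new int[(lastArrivalMs + blockedMs) / 1000 + 2];
        long workerFreeAt = blockedMs;   // worker does nothing before this
        for (int arrival = 0; arrival <= lastArrivalMs; arrival += arrivalGapMs) {
            long start = Math.max(arrival, workerFreeAt);
            long finish = start + processingMs;
            workerFreeAt = finish;
            perSecond[(int) (finish / 1000)]++;
        }
        return perSecond;
    }

    public static void main(String[] args) {
        int[] perSecond = completionsPerSecond(20_000, 100, 1, 21_900);
        for (int s = 18; s <= 21; s++) {
            System.out.println("second " + s + ": " + perSecond[s] + " completions");
        }
    }
}
```

In this toy model the per-request processing time (and hence the measured latency) is still only 1 ms, mirroring how the queued waiting time stays out of the timer while the burst shows up as a throughput spike.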

Spikes In Throughput

Notice the periodic spikes in throughput are correlated with massive spikes in max latency.

In this example, the mean is not typically higher than the 99th percentile. However, it would be possible to adjust the ratios to generate a mean that is higher than the 99th percentile. A graph of the mean and the 99th percentile is shown below.

Mean and 99

The Load Test

The load test for this scenario executes two thread groups. One thread group consistently polls the "/cache-lookup" endpoint that occasionally triggers a cache miss, simulating the subset of traffic that occasionally requires expensive thread-blocking operations. Another thread group simulates consistent, high-volume, low-latency traffic by hitting the "/no-lookup" endpoint. Because calls to this endpoint may be stuck waiting for the expensive operation to finish, the HTTP calls in this thread group have low timeouts.

Contributors

matthewmcnew
