Iteration 2: Design for Failure

Deadline: 1st December 2021.

Please report any issues or unclear points, and I will update the exercise continuously.

Learning objective

Improve Availability (Stability) of SkyCave by introducing timeouts (or circuit breakers), and introducing (simple) health monitoring. (To limit workload substantially, you only need to address integration points related to your group's own microservice.)

Prerequisites

You have a working SkyCave microservice architecture from the previous exercise, including your own service and those of two other groups (one other group if you are a one-person group).

Exercise

Implement a safe failure mode behaviour in your SkyCave for all integration points that interact with your own group's REST service (that is, 3-4 integration points). The requirements are to implement the TimeOut stability pattern (optionally the CircuitBreaker pattern), more specifically:
- A maximum response time of 5 seconds for any service call (integration point) to your group's own service, in case it is slow to respond or the connection is broken.
- Argue for, and implement, 'graceful degradation' behaviour for your service's functionality. That entails two requirements: A) that 'daemon' and 'cmd' never fail (no thread fails with an exception or is slow to respond) when interacting with your group's REST service, and B) that a 'best-effort' answer is relayed back to the 'cmd' and thus the player, outlining the issue at hand and recommending a proper action.
- Optional/alternative: Instead of implementing the Timeout pattern, guard all integration points with a circuit breaker. Parameters are: a maximum response time of 5 seconds; a maximum of 3 attempts before going to the OPEN state; and a 20 second delay before attempting the HALFOPEN state. Communicate the breaker's state back to the player in 'cmd'.
See hints and guides below.
Highly Optional: Implement safe failure modes for the integration points to the two other groups' REST services. (No real new learning by doing it, but the pleasure of a really robust SkyCave system.)
Augment your service with a path '(hostname:port)/health' that presents health information, as outlined by Nygard (p. 169). Update the compose-file for the SkyCave architecture so your group's own REST service is guarded by healthchecks and suitable restart policies.
Optional: Also guard the SkyCave daemon and, 'even more optionally', the consumed services with healthchecks - you need to negotiate deadlines with collaborating groups to facilitate smooth integration regarding the consumed services.

Hand-in:

Continue the Word or PDF report, following the updated Mandatory Report Template (LaTeX template), by writing Chapter 2. The template contains placeholder sections that include a short text explaining what I expect you to deliver. Hand in by submitting to BS.

Evaluation:

Your report is initially evaluated pass/not pass (with some ideas for improvement). The final report (which also includes the next mandatory exercise) is evaluated along with the oral defense for a final grade for this course.

Hints and Guides

Safe Failure Modes

Note that, in contrast to the 'timeout-quote-service' exercise from the seminar, in which you could get away with just returning an 'it-did-not-work-sorry' quote from the quote service, in this case you have to do something in the daemon's architecture (the PlayerServant or, perhaps better, the invokers) to properly address the safe failure mode. It is not sufficient to return, say, a 'RoomRecord' with a description "this room does not exist", right? So - catch connection/timeout exceptions from your library, convert them to CaveIPCExceptions (or your own variants - I have made a subclass called CaveFailureModeException) and let the server side catch them to provide proper feedback to the 'cmd'.
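The advice to catch connection/timeout exceptions and convert them can be illustrated with a minimal, library-independent sketch. It is only one possible approach: most HTTP client libraries offer their own connect and read timeout settings, which are usually the first thing to configure. The names 'TimeoutGuard' and 'FailureModeException' are hypothetical stand-ins (in the daemon you would use CaveIPCException or your own subclass), and the timeout is passed in milliseconds so the 5 second limit becomes 5000.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for CaveIPCException or a subclass of it.
class FailureModeException extends RuntimeException {
    FailureModeException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Wraps one integration point so the caller never waits longer than the limit.
class TimeoutGuard {
    // Daemon threads, so a hung remote call cannot keep the JVM alive.
    private final ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "integration-point");
        thread.setDaemon(true);
        return thread;
    });
    private final long timeoutMillis;

    TimeoutGuard(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    <T> T call(String serviceName, Callable<T> remoteCall) {
        Future<T> future = executor.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the caller moves on
            throw new FailureModeException(serviceName + " did not respond within "
                    + timeoutMillis + " ms. Please try again later.", e);
        } catch (ExecutionException e) {
            // The remote call itself failed, e.g. connection refused.
            throw new FailureModeException(serviceName + " is unavailable: "
                    + e.getCause().getMessage(), e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new FailureModeException(serviceName + " call was interrupted.", e);
        }
    }
}
```

A caller would wrap each integration point, for example 'guard.call("room-service", () -> fetchRoom(position))' where 'fetchRoom' is whatever method performs the REST call, and let the resulting exception propagate to the server side, which turns it into a reply the 'cmd' can present to the player.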

Implementation hint: The 'cmd' receives marshalled 'ReplyObject's, and if the 'statusCode' of one is outside the 200-299 interval, the client side Broker library will instead throw a 'frds.broker.IPCException', which is caught in the CmdInterpreter's 'readEvalLoop()' method. This is a proper place to handle safe failure modes by informing the player that something unusual happened and that other actions need to be taken. Have a look at the test case 'TestCmdFailureHandling' in the client project.
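For the optional circuit breaker variant of the exercise, the three parameters (3 failed attempts, 20 second delay, HALFOPEN trial) can be read as a small state machine. The sketch below is a simplified illustration, not a reference solution: the class names are hypothetical, the clock is injected so the delay can be tested without waiting, and the breaker does not enforce the 5 second response limit itself, so the guarded call is assumed to time out on its own. It also lets several threads make trial calls in the half-open state, which a production-grade breaker would restrict.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Thrown without contacting the service while the breaker is open.
class CircuitOpenException extends RuntimeException {
    CircuitOpenException(String message) {
        super(message);
    }
}

class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // e.g. 3 attempts
    private final long openDelayMillis;   // e.g. 20_000 ms
    private final LongSupplier clock;     // e.g. System::currentTimeMillis
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openDelayMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openDelayMillis = openDelayMillis;
        this.clock = clock;
    }

    // Reports the current state; OPEN becomes HALF_OPEN once the delay has passed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openDelayMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> remoteCall) {
        if (state() == State.OPEN) {
            throw new CircuitOpenException("Circuit is OPEN; the service is not being called.");
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failures++;
        // A failed trial call in HALF_OPEN reopens the circuit immediately.
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

With 'new CircuitBreaker(3, 20_000, System::currentTimeMillis)', the value returned by 'state()' is what could be reported back to the player in 'cmd'.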

Healthcheck

You have to add an additional GET /health path to your REST service, but you are not required to update the first API specification section of your report.
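As a rough illustration of a GET /health path, here is a minimal sketch using the HTTP server bundled with the JDK. This is an assumption made for the sake of a self-contained example: your REST service most likely uses another web framework, in which case the same idea becomes one extra route. The class name 'HealthEndpoint' and the reported fields (status, version, uptime) are hypothetical; consult Nygard for what a health check ought to report.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.Instant;

class HealthEndpoint {
    // Builds the JSON body; the fields are illustrative, not a required format.
    static String healthJson(String version, Duration uptime) {
        return String.format(
                "{\"status\":\"UP\",\"version\":\"%s\",\"uptimeSeconds\":%d}",
                version, uptime.getSeconds());
    }

    // Starts a server that answers GET /health; port 0 picks a free port.
    static HttpServer start(int port, String version) throws IOException {
        Instant startedAt = Instant.now();
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", exchange -> {
            try {
                if (!"GET".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // method not allowed, no body
                    return;
                }
                Duration uptime = Duration.between(startedAt, Instant.now());
                byte[] body = healthJson(version, uptime).getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }
}
```

A container healthcheck can then request this path at regular intervals and treat any non-200 answer, or no answer at all, as unhealthy.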

If you add healthchecks to the SkyCave daemon itself, you either just update the 'CaveUriTunnelServerRequestHandler' code, or add a new implementation/subclass, which forces you to overwrite the SKYCAVE_SERVERREQUESTHANDLER_IMPLEMENTATION in the CPF. To inspect the state of the container, the command

docker inspect --format='{{json .State.Health}}' (container-id)

comes in handy (or use 'docker ps').

Academic writing

If you are new to writing reports in an academic setting, please consult my review guide. Basically, you should write clearly and concisely, demonstrate systematic work and document your work convincingly. Easier said than done...