In this article we will have a look at what happens when a reactive application stops working the way we want to, what type of issues we can encounter and what are the main causes for that. Furthermore, we will explore different ways of adding traceability to the Reactive Pipeline, see what is happening in it and then move on to techniques we can deploy to tackle and solve these issues. In the second part we will discuss what happens when we manually deal with threads and schedulers and what is the right course of action for that kind of issues.
Anatomy of the Reactive Pipeline
Before diving into what kind of issues can happen in the pipeline and what we can do to solve them, it’s important to understand what are the main components of a Reactive Pipeline and how subscribers interact with it.
Once we understand the normal working flow of a reactive application, we can explore its different modes of failure and start evaluating possible solutions.
Broadly speaking the Reactive Pipeline can be chronologically divided into three phases:
- Assembly phase
- Subscription phase
- Runtime phase
Each one of them is specific to a different lifecycle of the pipeline, and it is as follows:
- The Assembly phase is when the Project Reactor framework actually builds the pipeline, and the end result of this is the creation of the Publisher object (Mono/Flux).
This phase is called declarative because the developer declares the chain of operators that transform the given Publisher, but nothing is executed at this point. Source
- The Subscription phase is when the Project Reactor framework creates the Subscriber object and the end result is a Subscription object that will then receive the results of the Reactive Pipeline in which the data is fed from the Publisher object.
This phase is the trigger that enables the Reactive Pipeline to start the data flow and actually produce some results that can be consumed.
- The Runtime phase or Execution phase represents the moment when the pipeline is already publishing data from one end (the upstream) to another end (the downstream/subscriber) and it’s where the errors can manifest.
This is the phase where we can mitigate/resolve most of the issues we will encounter when working with Reactive Applications.
Now that we have the concept of the different phases clear, let’s have a look at the concrete issues that can happen and how we can mitigate them in a Reactive Application.
Types of Reactive Issues and Their Mitigations
In general, there are many types of errors that an application can encounter during its lifecycle, but what we are interested in are the errors that are directly related to the reactive nature of the application, and as such are linked with the Spring Boot 2 framework and more specifically with the way Project Reactor works.
As soon as you introduce asynchronous code, you create a boundary in the program between the code that schedules and the code that asynchronously executes, and this “misalignment” of code and times of execution is what makes the evergreen stacktrace much less useful, unfortunately for us.
Given that we cannot actually rely on the stacktrace to convey meaningful information and give us a clue on what can be the source of the issue, due to the asynchronous nature of the application, we need to take advantage of different operators and techniques to triage, debug and mitigate these issues.
On top of the less usefulness of the stacktrace in the reactive world, there’s also the compound issue of the different phases (Assembly vs Subscription) that makes it even harder to understand where the error occurs, as we will just see Project Reactor code in the stacktrace and no code of our own.
So, what techniques can we deploy to trace, debug and then mitigate the errors that can occur in the Reactive Application we just built?
The list below is a good summary of them:
- The “log()” method
- The “doOnError()” method
- The global, stop-the-world and performance heavy “onOperatorDebug()”
- The less invasive “checkpoint()”
These four techniques will allow us to gain a better insight on what’s happening in our pipeline, what kind of data is flowing through it and which operators are called for each subscriber, both in a global and in a “pipeline local” way.
The “log()” method
The “log()” method is a quick way to get an easy bird eye’s view of what is going on at one step of your sequence. Use it liberally during development, with the possibility of specifying a “name” to each log call to differentiate them, making it easier to identify which part of the pipeline emitted that data that’s been logged.
The types of operations that are logged by the “log()” method are:
Each of these operators will be logged every time they occur, and also, it’s possible to specify a custom logger as well as whitelist just some of these signals if we are not interested in all of them.
This is the simplest and most straightforward operator that can be used when debugging issues in the Reactive Pipeline, but nonetheless is the cheapest one in terms of performance impact (compared to the global hook installer) and it can always be switched off if needed.
The purpose of this operator is answering the questions “How do I know what is happening when we subscribe to the Publisher?” and “I want to see the event sequence of this stream.”
With it, we can clearly see what is happening when we subscribe (in fact, we can also see when we are subscribing thanks to the logging of the “onSubscribe” event) and we can also observe the data that’s flowing through the pipeline via the logging of the “onNext” and “request” events.
All in all, this is a very easy to use, yet moderately powerful operator that will make it easier for the developer to precisely understand what is going on in the Reactive Pipeline.
The “doOnError()” method
This operator is intended to be used when the pipeline throws an error and we want to consume that error, most commonly for logging purposes.
The way the operator is used is very straightforward and we can pass a consumer function to it which will receive the Throwable/Exception object that represents the error and we will have a chance to act on it.
While it’s useful to log information about the error that occurred, this is not the clearest way to obtain more insights on what actually failed, as the Throwable/Exception object will not contain a clear stacktrace for the reasons we mentioned before.
What we will achieve will be adding more context to an error that was going to be logged anyway (unless explicitly disabled) with the hope that it could be useful when looking for the root cause.
This is pretty much all that needs to be said about this operator, it’s useful as an additional tool when debugging reactive issues, but it’s not that useful when used on its own.
The “onOperatorDebug()” method
This operator is probably the most powerful one we have at our disposition to deploy and use when debugging reactive issues.
What it does is that every single operator in the whole application will log a detailed, in-order stacktrace of what led to a given failure so that we can understand precisely where the issue occurred, where the failure was and what operators and in which order they have been called before the error happened.
This operator enables a recorder that is capable of capturing assembly stack trace information when the Publisher is created and also some very important information like the operator description, the chain order and so on.
If an “onError” event reaches one operator, it will attach that assembly stacktrace to the onError ’s Throwable (as a suppressed Exception). As a result, when you see the stacktrace you’ll get a more complete picture of both the runtime AND the assembly, which is something that greatly helps understand the issue. Issue.
This operator is so powerful also because it is a cross-cutting tool that can show the developer what was planned (Assembly phase) and what actually happened (Runtime phase), and any difference between the two is where our bug/issue will be found.
The power of this operator lies in the fact that all the other operators in our entire application will be instrumented to log stacktraces that have been re-ordered so we can clearly see what led to a failure and where exactly our code stopped working.
The downside of this operator is surely the heavy performance impact it has on the application, that’s why it’s discouraged from being used in production.
As long as we are developing, we can enable it to verify that our assumptions about errors are correct and that we truly mitigated the root cause of a given issue.
The “checkpoint()” method
This operator is intended to provide the same functionality of the “onOperatorDebug()” method but only for a given pipeline and not for the whole application.
What this means is that the performance impact has been greatly reduced if not almost eliminated because now the operator stacktraces are wrapped and instrumented just for a given pipeline, which it’s much better in terms of performance hit to the whole application.
There are different variants of the “checkpoint()” operator, but most of the time, what we would use is the one which takes a description argument, that will be printed in case error occurred in upstream of the checkpoint.
By using the checkpoint() operator, it is possible to activate the assembly trace capture only at that specific point in the pipeline, allowing for a much less impactful performance hit.
While it is true that the shit from imperative/blocking to reactive programming brings in a lot of performance enhancements and gains, it’s also true that the debugging experience changes a lot and we need to be ready for those changes and new techniques.
We explored the tools available for debugging reactive applications in the context of Spring WebFlux, examining each tool and possible scenarios where it was appropriate to use them.
In the next article, the second part, we will have a look at what happens when we introduce manual threads and schedulers into our own reactive application.
Stay tuned for the second part next week!