Tracking down problems with Jaeger
Hunter
A Debugging Case
The microservices architecture of an application shown here also makes it clear why debugging problems in this kind of environment can quickly become a Herculean task. A simple example illustrates the point: Imagine a user who accesses a distributed application through its web front end and wants to query the data stored there. Instead of the expected data, however, they only see an error message. So where does the problem lie?
The potential sources of error are almost infinite. The component that accepts the request from the web front end, or the load balancer upstream of it, might not be working properly. The load balancer between the first and second components might be misconfigured and simply drop the request. The cause could just as easily be network problems between the physical systems on which the application's components run – and these components are likely to be packaged in containers.
Other possibilities are that the request reaches the database, but the database cannot respond appropriately, or that the persistent storage available to the database is faulty. Access to that storage over the network might not be working because one of the switches in the setup has failed and is generating junk. The database might even be delivering the desired data, which then gets stuck somewhere on the way back.
How is an admin or developer, for whom a running cloud-ready application already looks like the proverbial black box, supposed to find the root of the trouble?
Conventional Means Will Fail
Classic approaches such as reading the logfiles generated by individual parts of the application often end in frustration because the components do not generate any log messages at all – not because their developers failed to implement logging properly, but because, in a distributed system, individual containers often have no persistent storage for logs. Even if the app were capable of logging, it would not know where to write the logs, at least not locally.
Classic monitoring approaches are also difficult or impossible to implement, because event monitoring or simply collecting metrics is of little practical help when you are debugging an individual problem. After all, you gain nothing if your monitoring reports that all systems are go simply because it is unable to detect problems with individual requests. To cut a long story short, microservices applications require a complete rethink of debugging and monitoring, which is where OpenTelemetry and its Jaeger implementation come into play.
Standard and Application
Right from the outset, I need to clarify that OpenTelemetry and Jaeger are not identical, although the terms are often used synonymously.
OpenTelemetry is a standard [1] that describes a communication interface that application developers can integrate into their products to communicate with the outside world over a defined protocol. The goal is always to collect and export telemetry data (logs, metrics, and tracing information from data streams) in a standardized format. On top of that, OpenTelemetry now also offers a variety of client libraries for integration into programming languages such as Go or Python.
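To give you an idea of what this looks like in code, the following Python sketch creates nested spans with the OpenTelemetry SDK; the service name, the span names, and the console exporter (which simply prints finished spans) are assumptions chosen for this example rather than anything prescribed by the standard.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider that identifies this service and prints
# finished spans to stdout for demonstration purposes.
provider = TracerProvider(resource=Resource.create({"service.name": "web-frontend"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each unit of work becomes a span; nested spans model the call chain
# from the incoming request down to the database query.
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.method", "GET")
    with tracer.start_as_current_span("query-database"):
        pass  # the actual database call would go here

Each span records timing and context information, and the parent-child relationship between spans is what later makes it possible to follow a request across component boundaries.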
Jaeger is a concrete implementation of tracing functions in line with the OpenTelemetry standard; it provides a framework that helps admins collect and evaluate OpenTelemetry tracing data [2].
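To hand the spans from the previous sketch to Jaeger instead of printing them, you would swap in an OTLP exporter, because recent Jaeger releases accept the OpenTelemetry protocol (OTLP) directly. The endpoint below assumes a Jaeger instance running locally with its OTLP gRPC port on 4317.

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Ship spans via OTLP/gRPC to a Jaeger collector assumed to be
# listening on localhost:4317.
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

The collected traces can then be searched and visualized in the Jaeger web interface, which the all-in-one distribution exposes on port 16686 by default.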
What sounds abstract and complicated in theory is far easier to understand in practice. It is therefore worth returning to the fictitious microservices application from the beginning of this article and the failure scenario described for it to illustrate the benefits of OpenTelemetry and Jaeger.