Tracking down problems with Jaeger
Hunter
Jaeger as a Framework
In the example here, after integrating the OpenTelemetry framework, the admin or developer has an application that generates metrics, logging, and tracing data. However, that is only half the battle, because you also need to display and make sense of the data. This is where Jaeger comes into play as one possible implementation of a tracing framework. Although Jaeger initially offered client libraries of its own, it has since retired them in favor of OpenTelemetry's counterparts.
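To give a rough idea of what such an integration looks like in code, the following Go sketch registers an OTLP exporter and emits a single span. The endpoint (localhost:4317), service name, and span name are assumptions for a locally reachable Jaeger or OpenTelemetry collector, not details taken from this article.

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC; localhost:4317 is assumed to be a
	// locally reachable Jaeger or collector endpoint.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// The tracer provider batches spans and tags them with a service name.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("example-service"),
		)),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Wrap a hypothetical unit of work in a span.
	tracer := otel.Tracer("example")
	_, span := tracer.Start(ctx, "do-work")
	// ... application logic ...
	span.End()
}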
What is left is the Jaeger server, which itself comprises several components. Interestingly, these do not just run on a single host: A Jaeger agent supports the clients on the target systems (or in the target containers), which often exist in large numbers. Its task is as simple as can be: to field the data created and collected by the clients. In container applications, the agent is usually integrated as another sidecar. In practice, this ensures that the application itself does not need to know where to send the acquired trace, log, and metrics data.
What seems trivial is of great importance in practice: Thanks to the Jaeger agent, the Jaeger configuration is autonomous and independent of the application's own configuration, which means it can also be changed on the fly.
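As a small, hedged illustration of this decoupling: The OTLP exporters of the OpenTelemetry Go SDK fall back to standard environment variables such as OTEL_EXPORTER_OTLP_ENDPOINT when no endpoint is set in code, so the destination (a local agent or collector sidecar, for instance) can be swapped without rebuilding the application. The endpoint value shown is purely illustrative.

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
)

func main() {
	ctx := context.Background()

	// No endpoint is hard-coded here; the exporter reads it from the
	// environment instead, e.g.:
	//
	//   OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 ./my-app
	//
	// Redirecting the data to a different sidecar or collector therefore
	// requires no change to the application itself.
	exp, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer exp.Shutdown(ctx)
}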
From the App to the User
What happens to the data once it reaches the Jaeger agent depends on the setup. Several options initially rely on the Jaeger collector, which gathers and samples all the available data from the various agent instances. Sampling is primarily intended to remove redundant information from the acquired data and thus reduce the overall volume. Rather than following rigid rules, the collector uses adaptive sampling; the administrator or developer can influence this process in a separate configuration, if needed.
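One way this influence reaches the application is remote sampling, where the SDK periodically pulls the strategies served by the Jaeger back end. The following sketch assumes the opentelemetry-go-contrib jaegerremote sampler and a sampling endpoint on localhost:5778; the service name, refresh interval, and fallback rate are made up for the example.

package main

import (
	"time"

	"go.opentelemetry.io/contrib/samplers/jaegerremote"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider() *sdktrace.TracerProvider {
	// Periodically fetch sampling strategies from the Jaeger sampling
	// endpoint (localhost:5778 is an assumption); until the first fetch
	// succeeds, fall back to sampling 1 percent of all traces.
	sampler := jaegerremote.New("example-service",
		jaegerremote.WithSamplingServerURL("http://localhost:5778/sampling"),
		jaegerremote.WithSamplingRefreshInterval(30*time.Second),
		jaegerremote.WithInitialSampler(sdktrace.TraceIDRatioBased(0.01)),
	)

	// All spans created through this provider are subject to the
	// remotely controlled sampling decisions.
	return sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
}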
From the collector, the data then moves on to a database: Jaeger relies on persistent storage in the background to save and process the acquired information. The choices here, however, are Elasticsearch, Cassandra, or Kafka rather than legacy relational databases. Once the data is stored, the developers recommend optimizing the database content with Apache Spark; suitable Spark jobs are available in the Jaeger repository.
Finally, the Jaeger query component reads data from the trace database on the basis of user-defined parameters. All visualization tools, including Jaeger's own, always access the query component and never the database directly. The database obviously is under tremendous load anyway, with new traces arriving on one side and queries for existing data coming in from the query component on the other. Jaeger's own user interface (UI; Figure 3), by the way, is just such a client: It retrieves its data from the query component.
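A short sketch makes this division of labor tangible: Any external tool asks the query service for trace data instead of the database. The HTTP endpoint used below is the internal interface the bundled UI itself talks to (not a stable, documented API), and the port and service name are assumptions.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Ask jaeger-query (assumed to listen on localhost:16686) for a few
	// recent traces of a given service; this is the same endpoint the
	// Jaeger UI queries behind the scenes.
	url := "http://localhost:16686/api/traces?service=example-service&limit=5"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // raw JSON describing the matching traces
}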
To Cache or Not To Cache?
Because of the potentially high base load, the Jaeger developers provide an option for buffering the database with a (possibly additional) Kafka instance. The collector then writes its data to this buffer instead of directly to the main database, and the data is pushed asynchronously from the intermediary Kafka instance to the persistent database in the background.
In this setup, the additional analytics load no longer hits the back-end database directly: With Apache Flink, the jobs can process the data straight from the buffering Kafka instance instead. This relatively complex architecture does have practical uses. Large microservice applications often produce large volumes of traces with millions of spans in a very short time, and the interplay of the tools mentioned here keeps such a setup usable and stable.