A new approach to more attractive histograms in Prometheus

Off the Chart

Switchover

To use native histograms in Prometheus, you need to enable them with the --enable-feature=native-histograms flag. With the flag set, Prometheus scrapes its targets in the Protobuf format. You will want to use the latest available version of Prometheus, because this feature is considered experimental and is still subject to ongoing change. In Go, the client_golang library supports native histograms as of v1.14; you need to modify the code as shown in Listing 2.

Listing 2

Mod for Native Histograms

requestDuration: prometheus.NewHistogramVec(
  prometheus.HistogramOpts{
    Name:    "prometheus_http_request_duration_seconds",
    Help:    "Histogram of latencies for HTTP requests.",
-   Buckets: []float64{.1, .2, .4, 1, 3, 8, 20, 60, 120},
+   NativeHistogramBucketFactor: 1.1,
+   NativeHistogramMaxBucketNumber: 150,
  },
  []string{"handler"},
)

In slightly simplified terms, native histograms use exponentially spaced buckets to cover the entire float64 range of values. The width of the buckets increases by a constant factor that can be adjusted to balance overhead against accuracy. A factor of 1.1 (each successive bucket is at most 10% wider than the previous one) provides a good compromise. The library rounds the configured factor down to the nearest value of the form 2^(2^-n), here 2^(1/8) or roughly 1.09, which results in exactly eight buckets per power of two (e.g., between 1 and 2, 2 and 4, 4 and 8, etc.).
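The relationship between the configured factor and the eight buckets per power of two can be checked numerically. The following is a minimal sketch, not the library's code: the function names schemaForFactor, growthFactor, and upperBound are hypothetical, and the clamping range of -4 to 8 for the resolution n is an assumption taken from the client_golang documentation.

```go
package main

import (
	"fmt"
	"math"
)

// schemaForFactor returns the resolution n whose growth factor 2^(2^-n) is
// the largest one not exceeding the requested factor. The result is clamped
// to [-4, 8] (assumed range), so very small factors end up at n = 8.
func schemaForFactor(factor float64) (int, error) {
	if !(factor > 1) || math.IsInf(factor, 1) {
		return 0, fmt.Errorf("factor must be finite and > 1, got %v", factor)
	}
	// 2^(2^-n) <= factor  <=>  n >= -log2(log2(factor))
	n := int(math.Ceil(-math.Log2(math.Log2(factor))))
	if n > 8 {
		n = 8
	}
	if n < -4 {
		n = -4
	}
	return n, nil
}

// growthFactor is the ratio between consecutive bucket bounds at resolution n.
func growthFactor(n int) float64 {
	return math.Exp2(math.Exp2(float64(-n)))
}

// upperBound is the upper boundary of bucket i at resolution n: 2^(i * 2^-n).
func upperBound(n, i int) float64 {
	return math.Exp2(float64(i) * math.Exp2(float64(-n)))
}

func main() {
	n, err := schemaForFactor(1.1)
	if err != nil {
		panic(err)
	}
	fmt.Printf("resolution n=%d, effective factor=%.4f\n", n, growthFactor(n))
	// List the bucket bounds that lie between 1 (exclusive) and 2 (inclusive).
	for i := 1; upperBound(n, i) <= 2; i++ {
		fmt.Printf("bucket %d: upper bound %.4f\n", i, upperBound(n, i))
	}
}
```

For a factor of 1.1, the sketch arrives at n = 3 and lists eight bucket bounds, the last of which is 2.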

You can specify the maximum number of buckets with the NativeHistogramMaxBucketNumber option. Without this setting, the number of buckets could grow uncontrollably and cause high memory usage. With a limit in place, the native histogram can recalculate its bucket boundaries and merge adjacent buckets, if needed, to reduce memory usage.
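One way to picture the merging step is to lower the resolution by one level, which squares the growth factor and folds each pair of neighboring buckets into one. The sketch below illustrates only that idea under the bucket definition (base^(i-1), base^i]; the names halveResolution and reduceToLimit are hypothetical, the lower resolution limit of -4 is an assumption, and the library may apply other strategies before reducing resolution.

```go
package main

import "fmt"

// halveResolution merges a sparse bucket map (index -> count) into the next
// coarser resolution. Squaring the base means bucket i lands in bucket
// ceil(i/2), which equals (i+1)>>1 for signed integers in Go.
func halveResolution(buckets map[int]uint64) map[int]uint64 {
	merged := make(map[int]uint64, len(buckets)/2+1)
	for idx, count := range buckets {
		merged[(idx+1)>>1] += count
	}
	return merged
}

// reduceToLimit halves the resolution until the number of populated buckets
// fits the limit or the coarsest resolution (assumed to be -4) is reached.
// It returns the merged buckets and the resulting resolution.
func reduceToLimit(buckets map[int]uint64, schema, maxBuckets int) (map[int]uint64, int) {
	for len(buckets) > maxBuckets && schema > -4 {
		buckets = halveResolution(buckets)
		schema--
	}
	return buckets, schema
}

func main() {
	// Eight populated buckets between 1 and 2 at resolution 3.
	buckets := map[int]uint64{1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}
	merged, schema := reduceToLimit(buckets, 3, 4)
	fmt.Printf("resolution %d, %d buckets: %v\n", schema, len(merged), merged)
}
```

The total observation count is preserved; only the precision with which each observation can be located drops.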

Results

Figure 3 shows a comparison for computing the 95th percentile of a specific query for identical data. At each point in time on the x axis, the corresponding y value is the threshold below which 95 percent of the requests fell at that time. On the left you see the computations with a legacy histogram, and on the right with the native variant.

Figure 3: A comparison computing the 95th percentile of a specific query over time with legacy (left) and native (right) histograms. © PromCon EU 2022 [3]

The comparison shows that native histograms provide a far more accurate estimate of the percentile value. In contrast, legacy histograms can misleadingly suggest the values are identical before and after a peak. In fact, the percentile value in the example is about 350ms before the peak and about 500ms after it. The legacy histogram does not show this difference of 150ms.
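The effect can be reproduced with a small bucket lookup. In the sketch below, the legacy bucket layout is hypothetical and chosen only so that 350ms and 500ms share one bucket; the article does not state which buckets were used for Figure 3. The native lookup assumes resolution 3 (eight buckets per power of two) and the index formula ceil(log2(v) * 2^n); the function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// nativeIndex returns the exponential bucket index for a positive value v
// at resolution n, using the definition (base^(i-1), base^i].
func nativeIndex(v float64, schema int) int {
	return int(math.Ceil(math.Log2(v) * math.Exp2(float64(schema))))
}

// legacyIndex returns the position of the first upper bound >= v in a sorted
// list of legacy bucket bounds; len(bounds) stands for the +Inf bucket.
func legacyIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

func main() {
	// Hypothetical legacy layout in seconds, not taken from Figure 3.
	legacy := []float64{0.1, 0.25, 1, 2.5}
	for _, v := range []float64{0.35, 0.5} {
		fmt.Printf("%.2fs: legacy bucket %d, native bucket %d\n",
			v, legacyIndex(legacy, v), nativeIndex(v, 3))
	}
}
```

With this layout, both values fall into the same legacy bucket and are therefore indistinguishable to any quantile estimate, whereas the native lookup places them in different buckets.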

Conclusions

The Prometheus data model has experienced a major change with the addition of native histograms [3]. This new feature offers a more accurate representation of the data and allows for superior computation of percentiles compared with legacy histograms. The move to Protobuf to collect metrics is a significant upgrade.

The Author

Julien Pivotto has been instrumental in the development and evolution of Prometheus as the maintainer. He is one of the founders of O11y (https://o11y.eu), a company that provides premium support for various open source monitoring tools such as Prometheus, Thanos, and Grafana.
