Monitoring in the Google Cloud Platform
Cloud Gazer
Creating Dashboards
The Google Cloud Platform already gives you a large number of dashboards for monitoring, but you might still want to create your own. To do so, go to the Operations | Monitoring | Dashboards page, which shows all existing dashboards. For inspiration, you can also browse the large library of sample templates on the SAMPLE LIBRARY tab.
To create your own dashboard, press Create Dashboard and assign a name. Now select the elements you want to show on the dashboard from the list of charts. The selection list is quite extensive: In addition to line diagrams, you can choose from stack and bar charts, heat maps, tables, gauge charts, scorecards, text, and warnings.
For example, you could start by adding a line diagram as an element and titling it CPU load, selecting VM Instance as the resource type, and setting the metric to CPU load (1m). Now you can add a second chart by clicking ADD CHART at the top (Figure 2). Again, choose a line diagram and specify Received packets as the label. The resource type is again VM Instance, and the metric this time is Received packets (gce instance).
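The same two-chart dashboard can also be described as a JSON document instead of clicking through the GUI. The following is a minimal sketch, assuming the layout of the Cloud Monitoring dashboards API; the display name and output file name are made up, and the metric types (agent.googleapis.com/cpu/load_1m and compute.googleapis.com/instance/network/received_packets_count) are my assumption about which metrics sit behind the labels CPU load (1m) and Received packets. The CPU load metric is only reported if a monitoring agent runs on the VM.

```python
import json


def line_chart(title: str, metric_type: str, aligner: str) -> dict:
    """Build one line-chart widget for VM instances (gce_instance resources)."""
    metric_filter = f'metric.type="{metric_type}" resource.type="gce_instance"'
    return {
        "title": title,
        "xyChart": {
            "dataSets": [
                {
                    "plotType": "LINE",
                    "timeSeriesQuery": {
                        "timeSeriesFilter": {
                            "filter": metric_filter,
                            "aggregation": {
                                "alignmentPeriod": "60s",
                                "perSeriesAligner": aligner,
                            },
                        }
                    },
                }
            ]
        },
    }


# Assumed metric types; check them against the metric list in your project.
dashboard = {
    "displayName": "My VM dashboard",  # hypothetical name
    "gridLayout": {
        "widgets": [
            line_chart("CPU load", "agent.googleapis.com/cpu/load_1m", "ALIGN_MEAN"),
            line_chart(
                "Received packets",
                "compute.googleapis.com/instance/network/received_packets_count",
                "ALIGN_RATE",
            ),
        ]
    },
}

dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

If you save the output to a file (say, dashboard.json), a command such as gcloud monitoring dashboards create --config-from-file=dashboard.json should be able to import it, provided the definition matches what the API expects in your project.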
Monitoring Services
Classic monitoring often focuses on measured values at the resource level. In this case, it was information such as CPU utilization, network packets transmitted, or memory utilization. However, modern applications consist of a large set of individual components, so it is not very useful to look at a very large set of metrics individually. It makes more sense to look at the application as a whole and, in particular, measure the application's most important values – availability and latency.
To explain how this works, I'll look at a sample application that is publicly available on GitHub and is based on Google's App Engine service. To deploy it, open the cloud shell in the GCP console and clone the repository as follows:
git clone https://github.com/haggman/HelloLoggingNodeJS.git
Next, choose the region that suits your use case (I chose Western Europe) and deploy the application with the commands
cd HelloLoggingNodeJS
gcloud app create --region=europe-west3
gcloud app deploy
The deployment process takes one or two minutes, and after it completes, you will see the URL of the application in the shell output. On opening the web page, no big surprises jump out: It's yet another "Hello World!" Now the task is to generate some load on the application with the code snippet
while true; do curl -s https://$DEVSHELL_PROJECT_ID.appspot.com/random-error -w '\n' ;sleep .1s; done
The cloud shell now continuously generates log output, which you can safely ignore. To monitor the service, you need to familiarize yourself with the definitions of service-level indicators (SLIs) and service-level objectives (SLOs). An SLI measures the reliability of a service. To compute it, select a metric, divide the number of positive events by the number of valid events, and multiply by 100: SLI (%) = (Positive events / Valid events) x 100.
The classic SLIs are availability and latency, for which you define limits that must not be violated. On this basis, you can define SLOs as an agreement to comply with certain metric values. SLOs need to be measurable, documented, and shared among the stakeholders. SLOs are also often part of a service-level agreement (SLA) that documents the key performance indicators (KPIs) the customer expects from a provider.
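The SLI formula is easy to check with a few lines of Python. This is only an illustrative sketch: The function name and the request counts are made up and are not taken from the sample application.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as the percentage of good events among all valid events."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events * 100


# Hypothetical counts: 10,000 valid requests, 10 of which returned an error.
availability = sli_percent(good_events=9_990, valid_events=10_000)
print(f"Availability SLI: {availability:.2f}%")
```

Note that the denominator counts valid events only; requests you decide to exclude from the measurement do not enter the calculation at all.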
Creating SLOs
In the App Engine application [1] cloned earlier in this article, approximately every 1,000th call throws an error. To set up service monitoring, go to the left-hand menu in the GCP GUI and select Operations | Monitoring | Services. You will immediately notice that the App Engine application is already displayed on the dashboard (Figure 3). Click on the Default link to view details about the service and start creating an SLO by pressing +CREATE SLO on the right side. Again, this opens a wizard:
- In the first step, select Availability as the metric. Leave the Request-based checkbox checked at the bottom, which means that availability is calculated as the ratio of successful requests to all requests over the entire period. Pressing Continue takes you to the next page.
- Now check the SLI settings. If you wait a while, you will see a preview of the requests.
- On the third page, configure the SLO, setting the Period type to Rolling , the Period length to seven days, and a Performance goal of 99.5 percent. Clicking Continue takes you to the next step in the wizard.
- The last page gives you an overview with the option of viewing the configuration you created in JSON. Choosing CREATE SLO closes the wizard.
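The JSON view on the last wizard page should look roughly like the structure below. This is a sketch based on my understanding of the ServiceLevelObjective resource in the Cloud Monitoring API, not a copy of the wizard's output; the display name is made up, and the wizard may add further fields.

```python
import json

SECONDS_PER_DAY = 24 * 60 * 60

# Request-based availability SLO: 99.5 percent over a rolling seven-day period.
slo = {
    "displayName": "99.5% availability, rolling 7 days",  # hypothetical name
    "goal": 0.995,
    # Durations are written as a number of seconds followed by "s".
    "rollingPeriod": f"{7 * SECONDS_PER_DAY}s",
    "serviceLevelIndicator": {
        # A basic SLI lets the service derive availability from its own requests.
        "basicSli": {"availability": {}},
    },
}

slo_json = json.dumps(slo, indent=2)
print(slo_json)
```

Keeping the definition in a file like this makes it easier to review the goal and period together with the stakeholders and to recreate the SLO in another project.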
When you look at the SLO, everything seems to be in the green zone. The service-level indicator, the error budget, and the alerts are as expected. The concept of the error budget comes from Google and describes the measurable number of errors, or the percentage of time a service may be unavailable, that is tolerated before the SLO is violated. In this example, this value is 100 percent minus 99.5 percent, or 0.5 percent. A product team can spend the current error budget on maintenance work, for example. When the budget is exhausted, the work has to wait until the error budget has replenished.
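To make the error budget more tangible, the following sketch converts the 99.5 percent goal into a number of tolerated failed requests. The function names and the traffic figures (one million requests in seven days, 1,000 of them failed) are purely hypothetical.

```python
def allowed_failures(goal: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates within the period."""
    if not 0 < goal < 1:
        raise ValueError("goal must be a fraction between 0 and 1")
    return (1.0 - goal) * total_requests


def budget_remaining(goal: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1.0 - failed_requests / allowed_failures(goal, total_requests)


GOAL = 0.995                  # performance goal from the wizard
TOTAL_REQUESTS = 1_000_000    # hypothetical traffic in the seven-day period
FAILED_REQUESTS = 1_000       # hypothetical: about one error per 1,000 calls

budget = allowed_failures(GOAL, TOTAL_REQUESTS)
remaining = budget_remaining(GOAL, TOTAL_REQUESTS, FAILED_REQUESTS)
print(f"Tolerated failures: {budget:.0f}")
print(f"Error budget remaining: {remaining:.0%}")
```

With these assumed numbers, the service may fail about 5,000 requests per period, and an error rate of one in 1,000 consumes roughly a fifth of that budget.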
Regardless of the values, you might want to send warnings if an SLO is not reached. To do this, click the CREATE SLO ALERT link on the SLO. You again need to configure a number of parameters:
- Lookback duration determines the period of time over which you want to look at data. Longer values are especially interesting for compliance, and shorter values give you a quicker warning if errors occur. For this example, I'll just go for 10 (hours).
- Burn rate threshold determines how fast the error budget may be used up before an alert is triggered. A burn rate of 1 means that the error budget is used up exactly at the end of the period. I configured 1.5 here.
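The relationship between the error rate, the burn rate, and the time until the budget is gone can be sketched as follows. The goal, period, and threshold come from the example; the observed error rate of 1 percent is a hypothetical value, and the functions are illustrations rather than the exact computation the monitoring service performs.

```python
def burn_rate(error_rate: float, goal: float) -> float:
    """Observed error rate relative to the error rate the SLO tolerates."""
    return error_rate / (1.0 - goal)


def days_until_exhausted(rate: float, period_days: float) -> float:
    """Days until the full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return period_days / rate


GOAL = 0.995        # performance goal of the SLO
PERIOD_DAYS = 7     # rolling period length
THRESHOLD = 1.5     # burn rate threshold configured for the alert

# Hypothetical: 1 percent of requests failed during the lookback duration.
observed = burn_rate(error_rate=0.01, goal=GOAL)
should_alert = observed > THRESHOLD
days_left = days_until_exhausted(observed, PERIOD_DAYS)

print(f"Burn rate: {observed:.2f}, alert: {should_alert}")
print(f"Budget exhausted after about {days_left:.1f} days")
```

At the configured threshold of 1.5, a full budget would last 7 / 1.5, or about 4.7 days, which is why values above 1 are worth a warning.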
The next steps regarding the Notification Channel should be familiar, and you can go on to complete the wizard.
If you want to provoke more errors, you can modify the application accordingly. If you open the index.js file in the cloud shell and search for random-error, you can massively increase the number of errors in this line by reducing the number 1,000 to a lower value (e.g., 20). You can then use the gcloud app deploy command to redeploy the application. As soon as you call the application again, the measured availability and the remaining error budget should drop rapidly.
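A quick calculation shows why this change has such a drastic effect. The sketch assumes that errors occur randomly with a probability of one in N calls, which is my simplification of "approximately every 1,000th call"; it does not reproduce the code in index.js.

```python
def expected_availability(one_error_in: int) -> float:
    """Expected share of successful calls if one call in N fails on average."""
    if one_error_in < 1:
        raise ValueError("one_error_in must be at least 1")
    return 1.0 - 1.0 / one_error_in


GOAL = 0.995  # performance goal configured in the SLO wizard

for n in (1000, 20):
    availability = expected_availability(n)
    verdict = "meets" if availability >= GOAL else "violates"
    print(f"1 error in {n} calls: {availability:.1%} availability, {verdict} the goal")
```

With one error in 1,000 calls, the application stays comfortably above the goal; with one error in 20 calls, the expected availability is far below it.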