System logging for data-based answers
Log Everything
Oh, I'm a lumberjack, and I'm okay,
I sleep all night and I work all day.
– from "Lumberjack" by Monty Python
Can't you just imagine yourself in the wilds of British Columbia swinging your ax, breathing fresh air, sleeping under the stars?!!! I can't either, but Monty Python's "Lumberjack" song has a strong message for admins, particularly HPC admins – Log Everything.
Why log everything? Doesn't that require a great deal of work and storage? The simple answer to both questions is yes. In fact, you might need to start thinking about a small logging cluster running alongside the HPC computational cluster. Such a setup gives you the data to answer questions.
Answering questions is the cornerstone of running HPC systems. These questions include those from users, such as, "Why is my application not running?" or "Why is my application running slowly?" or "Why did I run out of space?" Logging also answers system administrator questions such as, "What commands did the user run?" or "Which nodes were allocated to the user during their run?" or "Is the user storing a bunch of Taylor Swift videos?"
If you haven't read about the principle of Managing Up [1], you should. One of the keys to this dynamic is anticipating questions your manager might ask, whether something seemingly as simple as "How's the cluster running?" or something with a little more meat to it, such as "Why isn't Ms. Johnson's application running?" or perhaps the targeted question, "How could you screw up so badly?" Implicit in these questions are questions from your manager's manager, and on up the chain. Managing up means anticipating the questions or situations that might be encountered up the management chain (answering the Bobs' question about what you actually do). More than likely, management is not being abusive; rather, several people have taken responsibility for spending a great deal of money on your fancy cluster, and they want to know how it's being utilized and whether it's worth the investment.
The way to answer these questions is to have data. Data-based answers are always better than guesses or suppositions. What's the best way to have data? Be a lumberjack and log everything.
Logging
Regardless of what you monitor, you need to be a lumberjack and log it. HPC systems can be running a few nodes or tens of thousands of nodes. The metrics for each node need to be monitored and logged.
The first step in logging is deciding how to write the logs. For example, you could write the logs as a non-root user to a file located somewhere on a common cluster filesystem. A simple way to accomplish this is to create a special user, perhaps lumberjack, and have this user write logs to their /home directory, which is mounted across the cluster.
The logs written by this user should have node-specific file names, which allows you to determine the source of the messages. You should also put a time stamp on each log entry so that you can reconstruct a time history of events.
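A minimal sketch of such a per-node log writer follows. The directory and the logged metric are assumptions for illustration; in production, the directory would live under the shared /home/lumberjack:

```shell
#!/bin/bash
# Sketch: append one time-stamped entry to a node-specific logfile.
# LOGDIR is a demo default here; a real deployment would point it at
# a directory on the shared cluster filesystem (e.g., the lumberjack
# user's home directory).
LOGDIR="${LOGDIR:-/tmp/lumberjack-logs}"
mkdir -p "$LOGDIR"

# One file per node (hostname in the name), one time stamp per entry.
echo "$(date '+%Y-%m-%d %H:%M:%S') load: $(cat /proc/loadavg)" \
    >> "$LOGDIR/$(hostname -s).log"
```

Run from cron on each node, a script like this builds up a per-node time history that any node can read from the shared filesystem.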
A good alternative to writing logs to a user directory is the really cool Linux tool logger [2], which allows a user to write a message to the system logs. For example, you could easily run the command
$ logger "Just a notification"
to write a message to the standard system log /var/log/syslog on each node. The syslog daemon adds a time stamp to each entry by default. You can direct messages to a different log as well, but note that logger's -f <file> option, despite appearances, reads the message from <file> rather than writing to it; to route entries to a log of your choosing, assign them a facility and priority with the -p option and add a matching rule to the syslog configuration.
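If you want logger messages in a dedicated file, one approach (assuming rsyslog is the local syslog daemon) is to log to a spare facility such as local0 and route that facility in a drop-in configuration file. The file name, facility, and tag below are illustrative assumptions:

```
# /etc/rsyslog.d/10-cluster.conf (hypothetical file name):
# write everything logged to the local0 facility to its own file
local0.*    /var/log/cluster.log
```

A monitoring script could then write to that file with a command like logger -p local0.info -t lumberjack "health check passed".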
If you haven't noticed yet, logger writes the messages to the local logs, so each node has its own log. However, you really want to collect the logs in a single location to parse them together; therefore, you need a way to gather the logs from all of the nodes to a central location.
A key to good logging habits is to copy or write logs from remote servers (compute nodes) to a central logging server, so you have everything in one place, making it easier to make sense of what is happening on the server(s). You have several ways to accomplish this, ranging from easy to a bit more difficult.
Remote Logging the Easy Way
The simple way to perform remote logging comes with some risk: Configure a cron job on every node that periodically copies the node's system logs to the centralized log server. The risk is that logs are copied only at the interval specified in the cron job, so if something happens to the node between copies, the most recent messages for that node never reach the log server.
A simple script for copying the logs would likely use scp to copy the logs securely from the local node to the log server. You can copy whatever logs or files you like. A key consideration is what you name the files on the log server. Most likely, you will want to put the node name in the names of the logfiles. For example, the name might be node001-syslog, which allows you to store the logs without worrying about overwriting files from other nodes.
Another key consideration is to include the time stamp when the log copy occurs, which, again, lets you keep files separate without fear of overwriting and makes the creation of a time history much easier.
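A minimal sketch of such a copy script follows; the log host name and destination directory are assumptions for illustration:

```shell
#!/bin/bash
# Sketch: copy the local syslog to a central log server, embedding
# the node name and a time stamp in the destination file name so
# copies from different nodes (and different times) never collide.
copy_logs() {
    local loghost=$1    # central logging server (hypothetical name)
    local dest=$2       # destination directory on $loghost
    local stamp node
    stamp=$(date '+%Y%m%d-%H%M%S')
    node=$(hostname -s)
    scp /var/log/syslog "${loghost}:${dest}/${node}-syslog-${stamp}"
}

# Example call (hypothetical host and path):
#   copy_logs loghost /var/log/cluster
```

A matching cron entry, such as */15 * * * * /usr/local/sbin/copy_logs.sh, would then run the copy every 15 minutes.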
Remote Logging with rsyslog
Another popular option is rsyslog (remote syslog) [3], an open source tool for forwarding log messages to a central server over an IP network. It is very configurable, using the /etc/rsyslog.conf file and the files in the /etc/rsyslog.d/ directory to define the various configuration options. Because the tool is so configurable and flexible, be sure to read the man pages very carefully.
You can get started fairly easily with rsyslog by using the defaults. On the remote host that collects the logs, you begin by editing the /etc/rsyslog.conf file, uncommenting the following lines:
$ModLoad imtcp
$InputTCPServerRun 514
These lines tell rsyslog to load its TCP input module and listen on TCP port 514, the conventional port for this purpose. After the change, you should restart the rsyslog service (e.g., with systemctl restart rsyslog).
On every node that is to send its logs to the logging node, you need to make some changes. First, in the file /etc/rsyslog.d/loghost.conf, make sure you have a line such as
*.* @@<loghost>:514
where <loghost> is the name of the logging host (use either the IP address or a resolvable hostname), *.* refers to all logging facilities and priorities, the @@ portion tells rsyslog to use TCP for log transfer (a single @ would tell it to use UDP), and 514 is the TCP port. After making this change, restart the rsyslog service on every node that sends its logs to the logging server.
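Pushing that restart out to all of the compute nodes is easy to script. The following sketch assumes passwordless SSH from the head node, systemd on the nodes, and a hypothetical file listing one node name per line:

```shell
#!/bin/bash
# Restart rsyslog on every node named in a node list file.
# Assumes passwordless SSH and systemd on the nodes.
restart_rsyslog_all() {
    local nodefile=$1 node
    while read -r node; do
        [ -z "$node" ] && continue    # skip blank lines
        ssh "$node" 'systemctl restart rsyslog' \
            || echo "restart failed on $node" >&2
    done < "$nodefile"
}

# Example call (hypothetical node list file):
#   restart_rsyslog_all /etc/cluster/nodes
```

On larger clusters, a parallel shell such as pdsh accomplishes the same fan-out in a single command.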
In the logfiles on the logging server, the hostname of the node will appear, so you can differentiate logs on the basis of hostnames.
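If you would rather keep each node's messages in a separate file on the server, rsyslog can also split the incoming stream by hostname. A minimal example in rsyslog's legacy template syntax (the template name and path are assumptions for illustration):

```
$template PerHost,"/var/log/remote/%HOSTNAME%.log"
*.* ?PerHost
```

These lines would go in /etc/rsyslog.conf (or a file under /etc/rsyslog.d/) on the logging server, before any catch-all rules.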
You can use either of these approaches, or one that you create, to store all of the system logs in a central location (a logging server). Linux comes with standard logs that can be very useful [4]; alternatively, you might want to think about creating your own logs. In either case, you can log whatever information you feel is needed. The next few sections present some options you might want to consider.