The Fine Art of Troubleshooting
Welcome
System troubleshooting is an art. It is a science. And, sometimes, it's brute force.
Junior system administrators have often asked, "How do you troubleshoot a problem when you have no clue where to start?" My answer has never changed: Start with the simple things first. This advice has helped me resolve every problem I've ever encountered over the past 20 years. Sure, some problems are difficult to solve, and some even seem impossible, but if you start with the simple things first, your chances of success are very high.
People in general tend to complicate problems and solutions. They tend to reach for the least probable cause for a problem and then apply the least likely solution to resolve it. I guess it's just human nature to assume that there is no easy problem or easy solution. I have found just the opposite. Most of the problems that I've seen have a reasonable cause and a relatively simple solution. I've been on many root cause analysis and postmortem calls where I said, "I rebooted the system and everything came back as it should." Of course, I always had to explain why that resolution was the correct one, and it was usually met with unhealthy skepticism and much criticism.
I can't count the number of times I heard, "Well, rebooting fixed the issue temporarily, but you didn't really resolve the problem or apply a permanent fix to it." My task was to restore service and not to spend days or weeks researching a memory leak in an application. A reboot fixed the problem. Subsequent reboots will continue to resolve the problem. Until the developers fix the application, rebooting is the correct response to the problem.
System administrators, especially junior admins, love to see long uptimes for systems. It is impressive to see a system that has an uptime of 500+ days. Everyone loves the bragging rights of long uptimes. I once worked on a system that had an uptime of more than 1,300 days – a Sun Enterprise 450 running Solaris 2.6 (SunOS 5.6) and an Oracle database. After such a long time, no one dared reboot it. Who knows how long it had been since any real patching had been done? I inherited the system and insisted that we reboot it at once. In fact, I insisted that the operating system, the database, and the firmware all had to be patched or updated, and that the system had to be rebooted as necessary until it was completely up to date. I also stated that the server had to be completely shut down and cold booted.
You can only imagine the resistance and harsh words flung at me from every direction. "Unix systems don't need to be rebooted," I heard in loud voices. "This isn't a Windows server, Ken," they said with zeal. I ignored them all and quietly stated that yes, in fact, all systems, regardless of the operating system, need to be rebooted on a regular basis. They scoffed. "Well, this one's been up for almost four years with no need for a reboot." My protests went unheeded. That system eventually received updates from a junior system administrator who hadn't heard all those heated conversations and who, per protocol, rebooted the system. We had a lot of problems with the database after that reboot, and no one could troubleshoot them: the database software was so far out of date that no one left at the company knew what to do. Someone also performed a shutdown on the system and found that one of the RAID drives had failed – an undetermined amount of time ago. After that problem was resolved, we were told that the system would be decommissioned because of its age. I installed a fresh complement of applications on newer hardware and all was well with the world.
A new policy went into effect soon thereafter that required all systems to be patched and rebooted once a month. Sometimes the best resolution to a problem is to allow complete failure. So, for you junior system administrators, as well as you old salts, take a lesson from my book of frustrations. When troubleshooting any problem, use this order: Warm boot. Cold boot. Leather boot. Simple solutions and regular maintenance often prevent complex problems from occurring. How many times do we have to relearn this lesson?
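As a minimal sketch of how a monthly reboot policy like the one above might be checked on a Linux host, the Python below reads /proc/uptime and flags a system that has been up longer than the limit. The 30-day limit and all of the function names are assumptions made for illustration; they are not part of the policy described in this column.

```python
from datetime import timedelta

# Assumption: "once a month" is approximated as 30 days.
MAX_UPTIME = timedelta(days=30)


def parse_proc_uptime(text: str) -> timedelta:
    """Convert the contents of /proc/uptime into a timedelta.

    On Linux the first field of /proc/uptime is the number of
    seconds since boot; the second field (idle time) is ignored.
    """
    seconds_since_boot = float(text.split()[0])
    return timedelta(seconds=seconds_since_boot)


def reboot_overdue(uptime: timedelta, limit: timedelta = MAX_UPTIME) -> bool:
    """Return True when the system has been up longer than the limit."""
    return uptime > limit


def current_uptime(path: str = "/proc/uptime") -> timedelta:
    """Read the uptime of the local host (Linux only)."""
    with open(path) as handle:
        return parse_proc_uptime(handle.read())
```

On a Linux system, `reboot_overdue(current_uptime())` returns True once the host has gone more than 30 days without a reboot, which is the point at which a scheduled maintenance window would be due under this assumed limit.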
Ken Hess • Senior Editor