20080513

What is 'Normal', anyway?

Yow! It's already well into May, and April got by without a post. It was a busy month, with a LOT of discovery tasks happening. I'll try to get a few more posts in this month to summarize. But first, I wanted to get this thought online...

Those who know me are not surprised if they come into a dark wiring closet and find me sitting on a milk crate, staring at the 'blinky lights'. (Frankly, I've startled plenty of people who don't know me this way!) Most just write it off as a fixation on blinky lights, but it is not. This is how I can 'visually spot trends' of activity on devices and interfaces. It's also a chance to remember which link status indicators are on, off, or in an alarm condition. And, I'm also listening... to fans, air conditioner inlets, hard drive spindle bearings, modem speakers, relays clicking.

This is how I get to know what "Normal" looks like, sounds like, and smells like in the data center and wiring closets. (Yes, smell... does it 'usually' smell damp in this room? Is the 'ozone' smell something that's always here, or does it indicate a component failure? Scent is a strong trigger for memory!)

Knowing what is 'Normal' helps us spot what is unusual. When I have a network failure, I can go to the associated wiring closet and look, listen, and smell...I don't need to ponder "has that always been like this?", because I'll remember. "That light is usually blinking...so there isn't traffic on that interface!" Knowing what is normal is a key to fast troubleshooting.

RRDtool has been a great resource for graphing monitoring data, allowing you to visualize 'normal', to see 'now', and to identify trends. You'll find it underneath MANY open source tools, such as MRTG, Cricket, and Cacti.
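The same 'normal' versus 'now' comparison can be done with numbers as well as graphs. Here is a minimal Python sketch of the idea, not of how any of the tools above work internally. The function names, the three-standard-deviation tolerance, and the sample readings are all made up for illustration.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize readings collected while things were healthy."""
    return mean(samples), stdev(samples)

def is_unusual(value, baseline, tolerance=3.0):
    """True if value is more than `tolerance` standard deviations from the mean."""
    avg, spread = baseline
    return abs(value - avg) > tolerance * spread

# Made-up interface throughput readings (Mbit/s) taken on an ordinary day.
normal_mbps = [40, 42, 38, 41, 39, 43, 40]
baseline = build_baseline(normal_mbps)

print(is_unusual(41, baseline))  # a typical reading
print(is_unusual(2, baseline))   # the light that has stopped blinking
```

The tolerance is a judgment call. The point is only that 'unusual' can't be defined until 'normal' has been recorded.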

But, can you tell what's 'normal' with a serial console? Yes, you can! The key is, you need to LOOK when things are operating normally. Look when the system/device/network is idle some night or weekend. Look again when backups are running. Look again when the network is busy, but not failing. Then, compare your notes, or, your LOGS!

You can do a LOT interactively, using a simple terminal emulator and a cable. You have a lot more flexibility with a Console Server and multiple Telnet sessions (for example, you can monitor many consoles simultaneously, and cut-and-paste between them). The real benefits are had when you combine the console servers with a Serial Console Management Application such as Conserver, or ConsoleWorks, and you can compare historic data with today's results.

I have a handful of devices which have "diagnostics" ports that only the field engineers will use. They are not for normal use by customers. However, when you connect to these ports, you can find some of the devices are 'beaconing' about events, or reporting regular status messages. Even if the port doesn't say much, if you hit a carriage return, you'll probably get a prompt...doing this occasionally will tell you it's still alive. And, sometimes, you can also leverage the simple diagnostics that are built in. It's good to have a baseline log of what the devices are reporting to these diagnostics when they are healthy, so you can compare them if the device starts misbehaving.
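Comparing a healthy baseline log against today's output can be as simple as asking "which messages have I never seen before?" Here is a rough Python sketch of that idea. The timestamp pattern, the function names, and the sample log lines are all hypothetical, so adjust them to whatever your own devices actually print.

```python
import re

# Assumed timestamp layout (YYYY-MM-DD HH:MM:SS); change this to match your logs.
TIMESTAMP = re.compile(r"^\[?\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\]?\s*")
NUMBER = re.compile(r"\d+")

def normalize(line):
    """Strip the leading timestamp and mask digits so repeated messages match."""
    line = TIMESTAMP.sub("", line.strip())
    return NUMBER.sub("#", line)

def new_messages(baseline_lines, current_lines):
    """Return message shapes seen in current_lines but never in the baseline."""
    known = {normalize(line) for line in baseline_lines if line.strip()}
    unseen = []
    for line in current_lines:
        shape = normalize(line)
        if shape and shape not in known and shape not in unseen:
            unseen.append(shape)
    return unseen

# Made-up sample data standing in for a saved baseline and today's capture.
baseline = [
    "2008-05-01 02:00:00 STATUS OK temp=41 fan=3200",
    "2008-05-01 02:05:00 STATUS OK temp=42 fan=3180",
]
current = [
    "2008-05-13 09:00:00 STATUS OK temp=44 fan=3100",
    "2008-05-13 09:05:00 FAN FAIL unit=2",
]

print(new_messages(baseline, current))  # ['FAN FAIL unit=#']
```

Masking the digits means a routine status line with a slightly different temperature still counts as 'normal', while a message of a new shape stands out.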

But, to do any of these things, you need to start by looking when things are working as expected, doing their job the way you want them to. That's when you want to get your first look, and save that data for later comparison. (Speaking of comparing log data, try SPLUNK! Check out splunkbase, and the SPLUNK Forums! Splunk is worth a few blog articles by itself...later.)

The bottom line: Do you KNOW what you are missing? Do you WANT to know? Then LOOK!

-Z-