20070831

When did my console service become Critical Infrastructure?

Many Conserver instances were started as an experiment: something added to an existing console server deployment, usually installed on a seldom-used machine, probably older hardware, and maybe without vendor support. But it was certainly "enough to do the job, for now."

Once it was working, a few other folks started to use it. Then you offered to add more ports into the config for other system and network administrators. Soon, you've pulled in most of the original console server ports in the shop, and you're buying more console servers, and you're starting to look for more RAM and bigger hard drives. You're wondering if you are backing this system up, since the logs have become useful data. Then, one day, some data retention policy comes along, and you realize your console data just became Vital Records, and needs to be protected. It's time to think about an upgrade, and a service contract, and a way to write directly to archival media.

Why didn't you think about those things sooner? That's the topic for today's blog. What started as a demonstration just became critical infrastructure at one site I support. Here's what we are considering as we look at making this Conserver deployment a "Production Service".

Supported Hardware. Rather than a hand-me-down or a "Frankenstein machine", consider getting newer hardware. This will give you a longer product life (read that as "you can get tech support and replacement parts from the vendor" for 3-4 years), and you should consider the support contract, since this will likely become Critical Infrastructure if it hasn't already.

If you are trying to do this "on the cheap", you could try using a hand-me-down machine. Make sure you get a spare chassis (with power supply and motherboard), and as many spare drives and as much RAM as you can find! Remember, older drives and RAM get more expensive once they are no longer the new stuff! You'll also need to be able to service your own gear, on your own time.

Redundant Power Supplies. Unless your data center has fancy power distribution units that source two circuits to a single power cable, you should consider using a chassis that has dual power supplies. Make sure that the chassis can run fine (fully configured) on just one power supply! You should make sure that you are sourcing the power supplies from two different circuits. Also, find out whether the power supplies have to be on the same PHASE of power, and find out BEFORE you plug them in. (Have I mentioned the value of a support contract for your hardware?)

I need RAM. Lots and lots of RAM. But how much is "lots"? This depends on the number of consoles you plan to support, including your "someday" scenarios. Remember that Conserver starts with its own master process, and then spawns a child for (generally) every 16 ports. Your OS will want some. And any other tools and scripts will need some. (If you are going to be editing large log files, searching large data sets, or processing many large log files, you're going to want a LOT of RAM, so that you avoid swapping memory to the hard drives.) Don't skimp on RAM.
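
To put rough numbers on that, here's the back-of-the-envelope math as a sketch. The 16-consoles-per-child figure is conserver's usual default (it can be tuned, so check your build), and the per-process memory figure is a placeholder guess, not a measurement.

    #!/usr/bin/perl
    # estimate_procs.pl -- back-of-the-envelope sizing, not a benchmark.
    use strict;
    use warnings;
    use POSIX qw(ceil);

    my $consoles    = 1024;  # the "someday" port count
    my $per_child   = 16;    # consoles per child process (the usual default)
    my $mb_per_proc = 10;    # GUESS at resident size per process; measure your own

    my $children  = ceil($consoles / $per_child);
    my $processes = $children + 1;                 # children plus the master
    my $ram_mb    = $processes * $mb_per_proc;

    printf "%d conserver processes, roughly %d MB of RAM for conserver alone\n",
        $processes, $ram_mb;
    # ...then add the OS, Splunk, editors, and log-crunching scripts on top of that.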

Dedicated Log Data Drive(s). You don't want your logs on the same drive as your main OS. (If your console logs fill the disk over a holiday weekend, and your system can't write its own system logs, your Monday morning is going to be a LONG morning!) How much space do you need? This is the hardest thing to estimate, since it really depends on how much you use your consoles. But here is a good ballpark to start: estimate 20 MBytes for every console you plan to support, plus 1-2 GBytes as a buffer for when some logs start filling up faster than you expected.
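
If you'd rather script that ballpark than do it on a napkin, it's a one-screen calculation. The 20 MB per console and the buffer are just the starting estimates above, not measured values.

    #!/usr/bin/perl
    # estimate_logspace.pl -- ballpark for the dedicated log partition.
    use strict;
    use warnings;

    my $consoles  = 1024;   # planned port count
    my $mb_each   = 20;     # per-console starting estimate
    my $buffer_gb = 2;      # slack for logs that grow faster than expected

    my $gb_needed = ($consoles * $mb_each) / 1024 + $buffer_gb;
    printf "Plan on at least %.0f GB for the live log partition\n", $gb_needed;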

Since Conserver can auto-rotate the logfiles when they reach a certain size, you want to consider how large a file you can open with the editing tools you want to use. I rotate my logs at 10-20 MB, and I use grep and Perl to find things. But if you use other tools that can't open files larger than, say, 5 MB, then you should adjust your log rotation size accordingly.
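
For what it's worth, here is roughly what that size-based rotation looks like in conserver.cf on the 8.x releases; treat the option name and the size suffix as from-memory, and confirm against the conserver.cf man page for your version before copying it.

    default * {
        logfile /var/consoles/&;    # '&' is replaced with the console name
        logfilemax 20m;             # rotate a console's log when it nears 20 MB
    }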

Use RAID to help protect your data. Even basic mirroring of your log drive will help protect the data set (your vital log files). If you are capturing a lot of very busy log files, you might want to consider striping the data. You may also want to consider hardware RAID support. (This can save your system some CPU overhead, but it also adds some hardware complication. Still, many sites are successfully using hardware RAID. If this is your first attempt to use hardware RAID, you should consider getting the hardware support contract for your machine.)

Consider using a RAID pair to protect your OS, scripts and tools as well! Many newer machines can host five 2.5-inch drives in a 1-Rack-Unit chassis. That gives you a pair for the OS, and a pair for log data.

Backing up your critical log files to archival media. CD media is cheap, but so is DVD media. (Heck, DVD drives are cheap, too!) You can get about 4.7 GBytes on a single-layer DVD, and most laptops and other workstations can read them. It's an ideal medium for the day when the auditor comes and asks you to produce the log files for certain machines across a range of dates.
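
When burn day arrives, the write itself is small enough to script. Here's a sketch using growisofs from the dvd+rw-tools package, wrapped in Perl so it can live with the rest of my scripts; the device path and archive directory are placeholders for your own, not anything conserver sets up for you.

    #!/usr/bin/perl
    # burn_logs.pl -- sketch: master a directory of compressed console logs
    # onto a single-layer DVD in one pass.  Assumes dvd+rw-tools is installed;
    # the device and directory below are site-specific placeholders.
    use strict;
    use warnings;

    my $device  = '/dev/dvd';                # your burner
    my $archive = '/var/consoles/archive';   # staged, gzipped, rotated logs

    # -Z starts a new session; -R and -J add Rock Ridge and Joliet extensions
    # so the disc reads cleanly on both Unix and Windows workstations.
    system('growisofs', '-Z', $device, '-R', '-J', $archive) == 0
        or die "growisofs failed: $?\n";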

Remember, too, that most log data is going to be ASCII text. It's VERY compressible, and you can use Perl and cron jobs to compress newly rotated log files to a gzip version. This will let you store more log data for longer, and lets you delete some of the older uncompressed versions. (Compressing the log files also means you'll be backing up the compressed logs, so you can fit more of them on the DVD media, which means some savings in the number of discs you need to write over time.)
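
Here's the shape of that cron-driven Perl job. The log directory, the rotated-file naming pattern, and the one-day age check are all assumptions about your own setup; the point is simply "gzip anything that has been rotated and isn't compressed yet."

    #!/usr/bin/perl
    # compress_rotated.pl -- run nightly from cron; gzip rotated console logs
    # that are at least a day old.  The directory and the YYYYMMDD naming
    # pattern are examples, not conserver defaults -- adjust for your site.
    use strict;
    use warnings;

    my $logdir = '/var/consoles';

    opendir(my $dh, $logdir) or die "Can't read $logdir: $!\n";
    for my $file (readdir $dh) {
        next unless $file =~ /\.\d{8}$/;        # e.g. a log rotated to name.20070831
        my $path = "$logdir/$file";
        next unless -f $path && -M $path > 1;   # leave today's rotation alone
        next if -e "$path.gz";                  # already compressed
        system('gzip', $path) == 0 or warn "gzip failed on $path: $?\n";
    }
    closedir $dh;

A crontab entry along the lines of "15 3 * * * /usr/local/bin/compress_rotated.pl" keeps it out of sight and out of mind.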


So, what would I recommend? Let me start by giving you the information I'm basing the decision on, and then I'll tell you.

Given:
650 console ports today, could grow to 1024
(20 MB x 1,024 = 20 GB of drive, plus overhead, 25 GB minimum)
I want to store large amounts of compressed log files.
(3 yr retention needs for some files, 60 GB minimum)
1024 consoles means 65 processes, minimum, but more if my processes are really busy due to verbose logging. Could be 128+ Conserver processes later.
I want to run Splunk for log checking (RAM and drive implications.)
I also need drive space for backup, and log report manipulation.

What I'm proposing:
Dell 2950, with two 1.8 GHz Quad-core processors (Energy Smart)
Pick your OS (I'll pick Suse Linux, 3 yr license for the OS support)
8 GB of RAM (four 2 GB Dual Ranked DIMMs, Energy Smart)
Hardware RAID
76 GB, 15k-RPM drive threesome for the OS
146 GB, 10k-RPM drive threesome for the logs
Dual PS

This gives me a pair of drives for the OS, plus an on-site warm-spare, plus a pair for the logs with a warm spare. The drives are hot-swappable, so if I have a failed drive that RAID cannot recover, I'll yank the failed drive and swap in the warm spare, then let RAID rebuild it, and then I'll call Dell support for a next-day replacement, for the next three years. If I were really worried about hardware failure, I might opt for the 4-hour on-site contract.

Total cost comes in at $8.5k (US) today, from Dell, though the price may be better through other channels. (Spread over the three-year support term, that works out to about $2,900 per year, or about $240 per month, to support 1,000 vital ports in the shop.)

Can you get by for less? Certainly. But, what is the cost to you if you lose log files you need to be retaining? What will the business impact be if the server is down for a day or more, and your administrators have to scramble to manage their servers and network gear "the old-fashioned way"? The cost isn't too bad, for Critical Infrastructure, now that it's proven itself to be useful and reliable. Shouldn't you put it on reliable hardware as well?

-Z-
