subject: Downtime, What’s That?
posted: Thu, 20 Sep 2007 11:28:32 +0100


[One of the FreeBSD servers I run has an uptime of 350 days and
counting; it has never been rebooted since it was first switched on.
Another machine, admittedly running Windows 98SE, used to crash
daily. I installed FreeBSD on it, and it has now been up 24 days
without a reboot. A third box had an onboard SCSI controller and dual
processors; I put FreeBSD on it, and it automatically detected and
used the SCSI and both processors without any help from me.
FreeBSD rocks. See: http://www.cyberdelix.net/tech/bsddiary.htm - Stu]

http://www.processor.com/editorial/article.asp?article=articles/P2935/21p35/21p35.asp&guid=

August 31, 2007 - Vol. 29 Issue 35
Page(s) 9 in print issue

Downtime, What's That?
by Julie Sartain

Tips For Improving Server Uptime
In April 2007, Co-op Atlantic (a wholesaler offering products and
services to 135 co-ops and 237,000 co-op members) was offline for 18
hours: its database server failed, and then its backup server failed.
The next month, the entire June issue of Business 2.0 magazine was
accidentally deleted with no backup because of a failed backup
server. The following month, customers of the largest Web and
application hosting company in Sydney, Australia, WebCentral, went
without email service for a week due to server failure.

How is this possible? According to Bruce Taylor, chief strategist at
the Uptime Institute (www.uptimeinstitute.com), maintaining the
integrity of server farms and data centers is now (or should be) a
matter of corporate governance and risk management. And developing
extensive and comprehensive business continuity and disaster recovery
plans is essential to companies' survival.

"Protecting systems is accomplished by eliminating single points of
failure in servers, storage systems, and networks," says Taylor.
"Redundant hardware and networks, along with alternate sources of
electrical power and network connections, make the data center
resilient. IT operating procedures reinforce the data center's
robustness."

Taylor suggests that IT managers responsible for the physical
facilities infrastructure must begin to look at their data centers
from a "whole-system" perspective and should firmly integrate IT
infrastructure planning with site infrastructure subsystem planning
(primarily power and cooling systems).

Master Your Code

"Push all of your infrastructure with your code," says Rick Bentley,
CEO of Connexed Technologies (www.connexed.com). "Connexed, like many
other companies, is an SaaS [software as a service] provider. This
means that we not only have to manage our own hosted infrastructure,
we have to write and test our own code. Like many companies, we have
multiple environments (development, QA, stage, and production). If we
were to just push code from one environment to the next, we would be
testing the code but not the infrastructure."

For example, continues Bentley, after you buy that big, new, and
expensive storage array, you connect it to your development
environment. Once everything is working, you connect it to your QA
environment (with the next code build). Then you connect it to stage
and finally to production. The first time you connect it to your
development environment, you will probably make some mistakes.

"Then," says Bentley, "you might have to reboot the development boxes
more than once (causing downtime each time). By the time you've
deployed the hardware for the fourth time, when you push the hardware
to production, things should go much more smoothly than they did the
first time, minimizing downtime."

Consider An Open-Source OS

According to Matt Olander, CTO at iXsystems (www.ixsystems.com), a
single, simple way to increase server uptime is to choose an OS
that is robust, reliable, and stable. The open-source FreeBSD
(www.freebsd.org) OS, with its roots in BSD Unix, has a history of
solid development and is world-renowned for its performance and
stability. "Many large installation system administrators consider
FreeBSD to be one of the Internet's best-kept secrets," says Olander.

"Organizations such as Yahoo!, Juniper Networks, Network Appliance,
IronPort, Isilon, The Weather Channel, and NASA all rely on FreeBSD
to deliver enterprise-class products and services, protect their
networks, and serve millions of Web pages a day. FreeBSD increases
uptime by focusing on development practices that produce well-tested
code that is reliable, stable, and secure. FreeBSD is the perfect
choice for the enterprise data center, as well as for small to medium
businesses," continues Olander.

"Agreed," says Bentley. "We all have fixed budgets. Are you running
Oracle on Solaris? How much would you save if you moved to PostgreSQL
or MySQL on Linux? If you have enough CPUs that you're paying
license fees on, you could probably hire another full-time DBA to
improve your architecture so things are more reliable to begin with."

Bentley notes: Why spend money on proprietary monitoring software
when there are open-source alternatives such as Nagios
(www.nagios.org) or Cacti (cacti.net)? With the money you save using
open source, you can spend more on hardware. "Money equals uptime.
Spend it wisely," he says.

Eliminate Single Points Of Failure

"In critical computing environments, redundancy is a must," says
Taylor. "Many supposedly 'redundant' systems still contain single
points of failure; for example, how useful are a server's dual power
supplies if they're both connected to a single (failure-prone) PDU?
How useful are your redundant PDUs if they both draw power from the
same UPS system? If there's a small fire or accident in your
facility, will you find that both your wire and your backup wire run
side-by-side in the same conduit?"
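
Taylor's questions amount to tracing each "redundant" feed upstream
and looking for anything the feeds have in common. A minimal sketch of
that check follows. The component names (PDU-1, UPS-1, and so on) and
the helper shared_components are hypothetical illustrations, not part
of any tool mentioned in the article.

```python
def shared_components(paths: list[list[str]]) -> set[str]:
    """Return the components that appear on every power path.

    Anything returned here is a single point of failure: if it goes
    down, all of the supposedly redundant paths go down with it.
    """
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    return common


# Hypothetical topology: two power supplies on separate PDUs that
# nevertheless share one UPS and one utility feed.
psu_a = ["PDU-1", "UPS-1", "utility-feed-A"]
psu_b = ["PDU-2", "UPS-1", "utility-feed-A"]

print(sorted(shared_components([psu_a, psu_b])))
```

An empty result means the listed paths share nothing, which is the
"two independent power paths" condition. The check is only as good as
the inventory fed into it; a shared conduit that nobody wrote down
will not show up.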

According to Taylor, backup systems fail because of either poor
practices or single points of failure. For data centers that demand
maximum reliability (to the Institute's Tier III or Tier IV
fault-tolerance specifications), there must be two independent power paths
all the way from the grid to the back of the server.

Don't Forget Security

"Downtime is not always caused by hardware failure," says iXsystems'
Olander. "Security must be a primary concern in server uptime. Add
redundant firewalls to protect the network from malicious attacks
that can cause interrupted service."

The free, open-source CARP (Common Address Redundancy Protocol)
manages failover at the intersection of Layers 2 and 3 of the OSI
model (the link layer and the IP layer), continues Olander. CARP
allows a backup host to assume the identity of the primary host.
Combined with pf (packet filter), the free and open-source firewall
in the FreeBSD and OpenBSD (www.openbsd.org) operating systems, it
provides an excellent technique for building scalable redundant
firewalls that help keep the servers in a network safe and secure.
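
How a backup "assumes the identity" of the primary can be illustrated
with a deliberately simplified model. In CARP, each host in a
redundancy group advertises at an interval of advbase seconds plus
advskew/256 seconds, and the host advertising most frequently is
preferred as master. The sketch below models only that preference
rule. The class CarpHost, the function elect_master, and the host
names fw1 and fw2 are hypothetical, and real CARP also involves
authentication, preemption settings, and timeouts that are ignored
here.

```python
from dataclasses import dataclass


@dataclass
class CarpHost:
    name: str
    advbase: int = 1    # base advertisement interval, in seconds
    advskew: int = 0    # extra skew, in 1/256ths of a second
    alive: bool = True  # simplification: is the host advertising at all?

    @property
    def interval(self) -> float:
        """Effective advertisement interval in seconds."""
        return self.advbase + self.advskew / 256


def elect_master(hosts):
    """Pick the live host with the shortest advertisement interval.

    Returns None if no host in the group is alive.
    """
    candidates = [h for h in hosts if h.alive]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h.interval)


# Hypothetical two-firewall group: fw1 is preferred, fw2 is the backup.
fw1 = CarpHost("fw1", advskew=0)
fw2 = CarpHost("fw2", advskew=100)
group = [fw1, fw2]

print(elect_master(group).name)  # fw1 holds the shared address

fw1.alive = False                # simulate the primary failing
print(elect_master(group).name)  # fw2 takes over the shared address
```

The point of the model is that clients keep talking to one shared
address the whole time; only the machine answering for it changes.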

"Many free resources are available on the Internet that can assist
administrators with deploying scalable FreeBSD or OpenBSD servers
running pf firewalls using CARP for failover. An open-source
redundant firewall built on commodity server hardware can help
protect against interrupted service from malicious attacks and easily
be scaled as traffic grows," concludes Olander.

------

sidebar: Biggest Immediate Payback

"Think cluster," says Rick Bentley, CEO of Connexed Technologies
(www.connexed.com). "Remember when RAID stood for Redundant Array of
Inexpensive Disks? Many cheap drives in RAID are more reliable than
one expensive drive. Think of your servers the same way. Rather than
buying expensive servers with dual hot-swappable power supplies, hot-
swappable fan banks, etc. for $3,000, $5,000, or more, why not buy
several boxes for $1,000 each and put them in a cluster?"

For the same amount of money, notes Bentley, would you rather have
three times the computing power you need (which is great if you get a
big load peak) and occasionally have to swap out the motherboard on
one of your 1U pizza boxes, or buy fewer, more expensive servers,
essentially placing more eggs in fewer baskets?
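
Bentley's eggs-and-baskets argument can be put into rough numbers. The
sketch below assumes that boxes fail independently and that each one
is up with a fixed probability; both are simplifying assumptions, and
the availability figures used are hypothetical, not taken from the
article. The helper name at_least_k_up is likewise an illustration.

```python
from math import comb


def at_least_k_up(n: int, k: int, a: float) -> float:
    """Probability that at least k of n boxes are up.

    Assumes each box is up with probability a, independently of the
    others (a binomial model).
    """
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


# Hypothetical figures: one premium server at 99.9% availability versus
# three cheap boxes at 99% each, where any one box can carry the load.
single_premium = at_least_k_up(1, 1, 0.999)
cheap_cluster = at_least_k_up(3, 1, 0.99)

print(f"one premium server: {single_premium:.6f}")
print(f"three cheap boxes : {cheap_cluster:.6f}")
```

Under these assumptions the cluster comes out ahead, but the
independence assumption carries the result: boxes that share a power
feed, a switch, or a bad software push do not fail independently.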

Think about your database, continues Bentley. Is it on one big,
monolithic box? What happens if/when that box fails? Do you have
another big, expensive box waiting on standby? How fast does that
standby kick in? How do you know if it will work? Do you have the
guts to occasionally pull the plug? Two boxes set up in automated
failover mean, basically, that you're running an experiment on
failover time the first time your primary box fails.

"If you set up five boxes in a DB cluster," concludes Bentley, "you
will feel confident enough to pull the plug on any one of them, any
time you want, because now you have a fully tested system."

---
* Origin: [adminz] tech, security, support -
http://cyberdelix.net/adminz/
