Downtime is the Enemy

by Josh Liberman, President of Net Sciences, Inc.

Who hasn’t heard “the network is down right now” at least once or twice? But what if you really could operate without any downtime? Airlines, web retailers and others lose hundreds of thousands of dollars per minute of downtime, and they have budgets in the millions to deal with it. So how do you build a network that doesn’t go down on a much smaller budget? This is the question faced by New Mexico 811 (NM811). The Common Ground Alliance’s (CGA) best practices call for disaster recovery and hardware designed to tolerate the failure of any single component. The New Mexico Public Regulation Commission and NM811’s customers expect high reliability. The answer is in the design of NM811’s new campus, the choice of superior hardware, a highly fault-tolerant architecture and process automation. And the right partner to make it all happen.

NM811’s new two-acre campus was laid out with two buildings separated by over 600 feet and connected by a high-speed fiber optic link. The main building (Building A) is 16,000 square feet on two floors, housing public use spaces on the first floor and NM811 operations on the secure second floor. The “pillbox” (Building B) is 2,500 square feet and serves several purposes: it provides “offsite” data storage, hosts a second telco entrance for the main building and houses the backup generator. It also acts as a “warm spare” site, housing the duplicate server system and providing the space and workstations for operators and managers to work from should the main building be unavailable.

The servers at each site are of a specialized, highly reliable design known as Intel Modular Servers (IMS). Each IMS has four compute modules, or blades, three of which house dual CPUs and 48 GB of memory. These modules reside in a chassis that connects them to a shared array of drives (known as a SAN) that supports all the modules. The IMS also provides redundant power supplies, fans and drive controllers. One compute module in each IMS acts as Command and Control, managing the other compute modules and handling server and data replication to keep the servers at the two sites in sync with each other. The three other compute modules form a Microsoft Hyper-V cluster, sharing resources and acting as a unit in case of hardware failure. They handle the heavy lifting of file sharing, email, databases, web hosting and more.

Fault Tolerant Design is the Answer
The network is designed to withstand the failure of a component on many different levels, both “physically” and “logically.” Physically, there are multiple compute modules, network connections, hard drives and controllers, power supplies and cooling fans working together at all times. This makes it highly unlikely that a single component failure can drop the network. At the “logical” level, all servers are “virtualized”: each can run on any of the three clustered compute modules and can be moved to another module should any of the hardware fail. The SAN-based storage is also virtualized, so that should a drive that the servers run from, or one that houses data, start to fail or even just begin to run out of space, the others can chip in and carry that weight until a replacement arrives or space is expanded.
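To put a rough number on why duplicated components make an outage so unlikely, here is a minimal sketch of the standard redundancy arithmetic. It assumes that each component fails independently and that any one working component is enough to keep the service up. The function name and the 99 percent figure are illustrative assumptions, not measurements from NM811’s hardware.

```python
def parallel_availability(component_availability: float, count: int) -> float:
    """Availability of `count` redundant components when any one suffices.

    Assumes independent failures: the group is down only if every
    component is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    chance_all_down = (1.0 - component_availability) ** count
    return 1.0 - chance_all_down


# Illustrative only: a hypothetical part that is up 99% of the time.
for n in (1, 2, 3):
    print(n, "component(s):", round(parallel_availability(0.99, n), 6))
```

Under those assumptions, going from one such part to two cuts the expected downtime by a factor of one hundred. Real components can share causes of failure (a fire, a power event), which is one reason the second building matters.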

But wait, as they say, there’s more. Each building houses its own server, workstations, security and network equipment, and has its own independent Internet access. The buildings are physically linked by a very fast 10-gigabit fiber optic connection that lets replication of all the data occur in “near real time,” keeping the two sites in nearly perfect sync. This means that if one part of the infrastructure in the first building fails, there is a spare that, in most cases, comes online automatically. However, should the failure be due to a serious disaster that takes out the entire primary building, staff can move to the second building and be fully back online in a matter of a few hours. For icing on the cake, there is extensive battery backup and a diesel generator with automatic startup, should a power outage outlast the battery capacity.

Automation and Monitoring Seal the Deal
The site-to-site replication between the buildings is performed between the Modular Servers on a scheduled basis. This data backup and replication is completely automated, allowing nearly real-time replication of both servers and workstations. Also, since the system has to gather data and provide access to remote users over the Internet, there is redundancy and automated failover there as well. NM811 has two different Internet providers using different data paths, two very powerful, linked firewalls, two linked secure remote access (SSL VPN) devices, and a resilient data architecture both within and across the buildings. Finally, the entire infrastructure of the new NM811 network will be monitored 24/7, with everything from drive space on the servers to the temperature and humidity of the computing room under constant surveillance. This level of vigilance allows NM811 to attain even higher levels of reliability and uptime from its investment, and with automated alerting, its support team (Net Sciences, Inc.) can respond quickly and proactively to any issues that might arise. Monitoring the site this way also makes for the fastest possible repair times: because a failure is often identified before anyone is dispatched, the needed parts can be on hand on the first trip to the site.
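The kind of threshold check that sits behind automated alerting can be sketched in a few lines. This is a simplified illustration, not the monitoring product NM811 uses. The metric names, the limits and the `check_readings` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name, e.g. "room_temp_f"
    low: float    # lowest acceptable reading
    high: float   # highest acceptable reading


def check_readings(readings: dict[str, float],
                   thresholds: list[Threshold]) -> list[str]:
    """Return one alert message per metric that is missing or out of range."""
    alerts = []
    for t in thresholds:
        value = readings.get(t.metric)
        if value is None:
            # A sensor that stops reporting is itself worth an alert.
            alerts.append(f"{t.metric}: no reading received")
        elif not (t.low <= value <= t.high):
            alerts.append(f"{t.metric}: {value} outside {t.low} to {t.high}")
    return alerts


# Illustrative limits only; real values depend on the equipment.
limits = [
    Threshold("drive_free_pct", 15.0, 100.0),
    Threshold("room_temp_f", 60.0, 80.0),
    Threshold("room_humidity_pct", 30.0, 60.0),
]
sample = {"drive_free_pct": 9.5, "room_temp_f": 72.0}
for message in check_readings(sample, limits):
    print(message)
```

In this sample, the low drive space and the missing humidity reading would each raise an alert, while the temperature reading passes quietly.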