I wanted to cover in this blog entry a topic that is near and dear to my heart.
Nearly six months ago WP-ORG had a network outage that was one of the longest we have had since we started operating in the mid-1990s. My first concern was getting operational again, but I was also concerned about why we went down and, more importantly, what kept us from becoming fully operational in a timely manner.
Megan, Warren, and Steve Brunasso did a yeoman’s job in getting us back up and running again. The greatest concerns were the root cause of the outage and what prevented a timely restoration of services. As for the root cause, we cannot be 100% certain, but all indications are that we had a massive failure of our network attached storage (NAS).
A NAS is like the hard drive in your computer, but instead of operating inside your computer, it operates “attached” to the network. Each of the servers that serve up content for various network functions (web, mail, etc.) gets its content from the NASs. NASs are by design generally robust in terms of fault tolerance and redundancy. As an example, a NAS will normally have an array of hard disc drives that make up its storage capacity. A single hard disc drive failure is generally not a problem. The NAS will indicate to the operator that a hard drive has failed and needs to be replaced. While the single hard drive is in a failed state, the NAS continues to function, albeit slightly more slowly. Once the failed drive is replaced, the NAS is back to 100% efficiency.
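For readers who want to see how an array can keep working with one drive missing, here is a minimal Python sketch of one common approach, XOR parity. It is an illustration only: I am not claiming our NAS used this particular scheme, and the function names and sample data are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Hypothetical data spread across three drives, with parity on a fourth.
drives = [b"WEB1", b"MAIL", b"DATA"]
parity = xor_blocks(drives)

# Simulate losing the second drive and rebuilding its contents.
recovered = rebuild_missing([drives[0], drives[2]], parity)
```

With a single parity block, only one missing drive can be reconstructed this way. If two drives in the same array are lost or corrupted at once, there is not enough information left to rebuild, and the content has to come back from a backup.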
Likewise, other major components of the NAS, such as the power supply, have redundant capabilities. Some components in the NAS do not, and if they fail, the entire NAS may fail. Normally, things like the NAS back-plane and sometimes the NAS controller board are single points of failure. It was never determined exactly where the direct point of failure occurred, but what did occur was corruption of multiple hard drive arrays, and the entire NAS had to be rebuilt from scratch. Rebuilding a NAS takes a lot of time. This leads to the second topic area: the time required to regain full operational capability.
We first had to bring a new NAS online, and quickly. Some of the components had to be obtained, and Megan had to drive for hours to get them immediately. Once all of the necessary components were installed in the NAS, the array (the assemblage of individual hard disc drives) had to be rebuilt. Once the array was rebuilt, the content had to be placed back on it. This took the most time of the entire operation. We are re-examining different storage and backup technologies in this area to dramatically reduce the total time needed to restore full operational capability.
I have been associated with WP-ORG since its early days in the mid to late 1990s in both a non-advisor and advisor capacity. I have seen WP-ORG’s capabilities grow over the years, and we have gone from a back closet operation in Ditus Bolanos’s house to a substantial presence in a server colocation facility in Austin, Texas (Data Foundry). Along the way we have tried to grow with our membership’s needs while always being cost conscious in our equipment and software purchases. We decided early on to standardize on open source software (which is generally free), and it has mostly met our needs. Where open source software did not meet our needs, we purchased or modified existing software to get the job done. Hardware has been a different matter.
Network hardware such as servers, switches, routers, etc., is inherently expensive. There is really no way around that cold hard fact. In order to keep costs down, we have made a conscious decision to not necessarily go with the newest and fastest equipment. This tends to save a lot of money in hardware expense, but it does so at a potential hidden cost. In the network industry there is a term called “IT refresh.” Here is an article that covers some of the basic concepts of IT refresh and how it impacts an organization:
The underlying principle is that you have to set a target for the maximum age of your equipment and abide by it. This target varies based upon the specific type and criticality of the equipment you are analyzing. Certain types of equipment are inherently reliable and tend to continue running without interruption year in and year out. One such example is found here with a Cisco router running 13 years without a reboot:
Other devices on the network tend to be more complex and inherently less reliable. Depending upon the type and manufacturer of the equipment, you can go a long way toward increasing reliability. Data servers tend to fall in this category. There are many potential single points of failure, and some of the higher end manufacturers recognize this and try to build in redundancies for critical components that fail more often, such as power supplies, on-board discs, network cards, etc. Still, you cannot build a server that is completely fault tolerant; it doesn’t exist. In the past we have tried to buy highly fault tolerant servers that were not necessarily new. Generally, this has worked well, but it carried risk.
This leads me to the final point of where we are headed. In a perfect world, we would buy the newest, best, most highly fault tolerant network gear available. Unfortunately, that would not be terribly affordable for our membership. We are now looking at all of our network gear to see how well it measures up against a realistic IT refresh policy, which for us would mean a maximum equipment age of somewhere between 5 and 8 years. In certain narrow circumstances we may need to go older than that, but if we do, we will likely procure additional spares to reduce our downtime. Our analysis of all our critical network gear shows an increased need for newer and better equipment. In the coming fund drives we will be addressing those needs with increased donations allocated to the capital expense of new equipment. We welcome your support and suggestions.
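As a rough illustration of what checking gear against a refresh target looks like, here is a minimal Python sketch. The inventory names, purchase dates, and the choice of 8 years as the cutoff are all hypothetical, chosen only for the example; they are not our actual equipment list or a final policy.

```python
from datetime import date

MAX_AGE_YEARS = 8  # assumed cutoff: upper end of the 5 to 8 year range

def age_in_years(purchased, today):
    """Whole years elapsed between the purchase date and today."""
    years = today.year - purchased.year
    if (today.month, today.day) < (purchased.month, purchased.day):
        years -= 1  # this year's purchase anniversary has not happened yet
    return years

def overdue_for_refresh(inventory, today, max_age=MAX_AGE_YEARS):
    """Return names of items older than max_age years, oldest first."""
    overdue = [
        (age_in_years(bought, today), name)
        for name, bought in inventory.items()
        if age_in_years(bought, today) > max_age
    ]
    return [name for _, name in sorted(overdue, reverse=True)]

# Hypothetical inventory: item name -> purchase date.
inventory = {
    "nas-1": date(2009, 3, 1),
    "web-server-1": date(2015, 6, 15),
    "core-switch": date(2006, 11, 20),
}
flagged = overdue_for_refresh(inventory, today=date(2019, 1, 1))
```

Anything the check flags would be a candidate either for replacement or, where we choose to keep it running past the target, for stocking spares.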