We are currently dealing with an issue on our network which caused a complete outage earlier this afternoon. This is being worked on at our highest priority and we have staff onsite working to rectify it. Further details will be made available as and when we can; however, please appreciate that we cannot provide detailed explanations via ticket, e-mail or telephone at the moment, as our priority is to have every member of staff working on resolving this matter. No ETA has been set for full rectification, but our aim is to have all services working again as fast as possible.
15:40 Update:
The issue we have experienced today is that the power to our core network rack tripped, causing a complete network outage. Staff onsite investigated immediately and identified this as soon as the rack was examined. When we tried to restore power, the core power distribution system for that rack failed; a fault in this system was likely also the cause of the initial trip. This left us with no power to our core network.
As a result, systems are currently being re-routed and power is being brought back to systems and servers one by one. We have staff onsite, and additional staff are en route to the data centre with replacement power distribution hardware, as we carry spares in stock for all of our infrastructure.
We are therefore working as fast as possible, first to restore connectivity to all systems still affected and then to replace the faulty hardware, bringing services back to their normal level of operation.
17:10 Update:
We are hoping to have the majority of servers operational in the next 10-15 minutes, as power is temporarily being diverted to them.
22:25 Update:
Staff onsite are in the process of changing over the power distribution hardware, which will result in a temporary outage to all services. This will last no longer than 5-10 minutes and, once completed, all services will be fully resumed.
00:30 Update:
We can confirm that all faulty hardware has now been changed and normal service has resumed.
Mon 29th Update:
Following the issues experienced yesterday, we would like to provide a further update. Investigation of the failed equipment found that our main power distribution system had partially failed, providing power to some systems and not others. This is a very rare failure and we will be looking to send the hardware to the vendor for investigation. Because of the way this hardware is integrated into our systems, replacing it is not an easy task: it provides power to our core systems, so these cannot be powered up whilst the hardware is being replaced. The replacement also physically takes considerable time. However, as a direct result of this issue, we redesigned our systems in the process of replacing the failed component. In the unlikely event that such a failure should occur again, the replacement time for the power distribution hardware has been shortened from approx 3-4 hours to approx 10 minutes, which is a vast improvement.
This was an unexpected failure and not something that is easy to prevent: it was a power issue caused by a component failure, which could not have been foreseen. We will, however, be fully analysing our network to see where further redundancy can be added, as well as putting further procedures in place for dealing with such an event, which should drastically reduce the impact of a similar problem by getting services back online within minutes.
We would also like to thank you all for your cooperation and understanding whilst this matter was being attended to yesterday.
Sunday, August 28, 2011