Yesterday was frantic for us as we scrambled to identify connectivity issues affecting our web site & hosted customers.
Because neither the Azure management portal nor the Azure status page reported any problems (both showed all services as operational), we initially thought the connectivity issues had been introduced by a security update we rolled out earlier on the 19th. This was the only recent change made to the servers. The update was applied to resolve a critical vulnerability within Windows secure channel communication related to MS14-066.
We started rolling out mitigations for this vulnerability at 3.15am GMT on the 19th to ensure minimum disruption to our customers. This work was completed by 5am GMT. This did require a restart of our servers but everything went smoothly as we had tested the update process beforehand. We tested after applying this security update and all servers were responding correctly.
At 11am GMT on the same day (the 19th) we received an email notification from a monitoring service we use, informing us that several of our hosted customer URLs were returning 404 & 500 errors from every location. We immediately checked the Azure status page and everything showed as all green / fully operational. We tested customer sites in a browser and, sure enough, they would not load and we could not connect remotely to the servers. Although we had tested after the update, we then suspected the problem might be related to the security update applied a few hours earlier, as this was the only recent change to our servers.
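For anyone running their own checks, the "errors from every location" signal above can be sketched in a few lines of Python. This is only an illustration, not the monitoring service we use: the function names, the location names and the choice to treat exactly 404 and 500 as failures are all assumptions made for the example.

```python
from typing import Dict, List

# Status codes counted as failures. 404 and 500 are the codes named in the
# post; treating exactly these two as failures is an assumption.
FAILURE_CODES = {404, 500}


def failing_locations(results: Dict[str, int]) -> List[str]:
    """Return the probe locations (sorted) whose last status code was a failure."""
    return sorted(loc for loc, code in results.items() if code in FAILURE_CODES)


def fails_everywhere(results: Dict[str, int]) -> bool:
    """True only if there is at least one probe and every probe failed."""
    return bool(results) and len(failing_locations(results)) == len(results)


# Hypothetical probe results: location name -> HTTP status code.
sample = {"london": 500, "new-york": 404, "singapore": 500}
print(failing_locations(sample))
print(fails_everywhere(sample))
```

A failure seen from every location points at the servers or the platform underneath them rather than at one network path, which is why this alert sent us straight to the Azure status page.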
To troubleshoot further we then contacted the Azure support team. We have a paid support subscription with Microsoft, so we received a call back from an Azure engineer within 2 hours. The engineer was from the third-party consultants Microsoft uses, but they were very helpful. We worked with the Azure support engineer to bring the affected servers back online. The solution proposed was one we had heard many times before: “turn it off and on again” (oh really, we pay for that?). We restarted the affected virtual machines and indeed this did seem to resolve the problems.
At this point we didn't know of the other general Azure storage issues, as the support engineer didn't inform us of any issues affecting other Azure customers.
This allowed us to bring a subset of hosted customers back online at 2pm GMT yesterday; however, we didn't have full service for all hosted customers until 4pm GMT yesterday. We continued to verify all servers were fully working yesterday evening and so far everything is OK.
Due to the lack of updates on the Azure health & status pages, we only discovered yesterday evening that Azure was actually experiencing general problems with Windows Azure Storage Services that were affecting various services in Europe, Asia & the US. The pieces started to fall into place - the issues were not due to the security update applied earlier on the 19th.
We received the following email from Microsoft alongside many other updates after we raised several tickets...
"Jason Zander, Corporate VP of Microsoft Azure, has posted a blog to provide an Update on Azure Storage Service Interruption. This blog provides a high level description of the issue and what has been done to date. Please accept our apologies for this interruption."
These problems unfortunately affected all our hosted customers, even those who have high availability sets configured, as both the primary data center (North Europe) and the sister data center (West Europe) were affected by these storage issues.
I wanted to sincerely apologize to all our hosted customers for these issues. We pay a premium to Microsoft every month for high availability and dedicated support to guarantee a reliable service for our customers - nothing is more important to us - however we feel we've been let down on this occasion.
Generally we have found Azure to be very reliable. We used Azure ourselves for 2 years internally before choosing the platform for our customers. This is the first major issue we've encountered in 4 years of using Azure. We do believe in Azure, but issues like this unfortunately knock our confidence and the confidence of our customers.
We were impressed with Microsoft’s response times. Once they knew we were experiencing issues they kept us fully up to date via email, and we also exchanged several emails & calls. They emailed us a direct link to Jason’s official blog post (mentioned above), which is how we first learnt of the general issues affecting all Azure customers.
That said, we will always continue to evaluate our suppliers, and if we encounter any further significant service interruption we may consider moving to an alternative platform. We would like to give Azure a chance, as it’s often better the devil you know and we have faith in the folks at Microsoft.
I created a forum post yesterday which also contains additional information on the security update we applied.
You can also find further links below to general news articles from around the web related to the recent Azure outage. We will continue to monitor our hosting provisions here and will always do what’s best for our hosted customers. If we can assist with any questions, please don’t hesitate to comment below or email me directly on firstname.lastname@example.org