Deleting a Storage Account associated with a StorSimple Volume Container disables ALL volumes on the device
I came across the following interesting situation with an 8600 StorSimple device running software version 3.0 (17759).
All iSCSI volumes from the StorSimple device in question are down at the Windows 2012 R2 host (about a dozen volumes in this case). If you create a new volume and present it to the Windows host, attempting to partition it (GPT) fails with the error message ‘disk not ready’.
Two additional observations were made:
- All cloud snapshots failed about a month before this incident.
- Software update 3.0 was applied to this device at roughly the same time the incident occurred. The accompanying firmware update was not applied.
The timeline is as follows:
- Storage account was deleted prior to 6/28/2016 (not showing in Operation Logs which are kept for 90 days)
- 60+ days later (8/28/2016) all cloud snapshots started to fail. The error message suggests failure to access the storage account
- 82+ days later around 9/20/2016, users started to report volumes not available
- 9/24/2016, software update 3.0 was applied from the classic portal
- About a dozen volumes were provisioned from this device to one Windows 2012 R2 host. Volume Containers were associated with 3 Storage Accounts in the same subscription
- One of the 3 Storage Accounts (the one on the top of the above image) was missing. Apparently it was inadvertently deleted.
- Get-HCSSystem showed normal device condition
- iSCSI connectivity, including the iSCSI initiator and MPIO configuration, was reviewed and tested and showed no issues.
- Ping (Test-Connection) and tracert.exe (Trace-HCSRoute) from each of the host iSCSI interfaces to each of the device iSCSI interfaces and back came back OK.
- Test-HCSMConnection showed no problems.
- Test-HcsStorageAccountCredential against the 2 existing Storage Accounts showed no problem.
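The day counts in the timeline can be checked with a little date arithmetic. This is only a sketch: it assumes 6/28/2016 as the latest possible deletion date (the deletion happened on or before that day), so each interval is a lower bound. The variable names are mine.

```python
from datetime import date, timedelta

# Dates from the timeline. The Storage Account was deleted on or before
# 6/28/2016, so the intervals computed here are minimums.
latest_deletion = date(2016, 6, 28)
snapshots_failed = date(2016, 8, 28)
volumes_failed = date(2016, 9, 20)

days_to_snapshot_failure = (snapshots_failed - latest_deletion).days
days_to_volume_failure = (volumes_failed - latest_deletion).days

# Operation Logs keep 90 days of events, so an event dated 6/28/2016
# is no longer listed after this day.
last_day_in_logs = latest_deletion + timedelta(days=90)

print(days_to_snapshot_failure)  # 61
print(days_to_volume_failure)    # 84
print(last_day_in_logs)          # 2016-09-26
```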
Troubleshooting and Root Cause Analysis
I initially suspected that the Storage Account keys had been changed without being synchronized with the StorSimple Manager service, cutting off the device from its Storage Accounts. That would explain both the failure of all volumes and the cloud snapshot failures, since both need to read from and write to Storage Accounts.
- However, Operation Logs showed no events or records related to change of Storage Account keys associated with a StorSimple volume.
- Operation Logs showed no event/record of Storage Account deletion.
- Synchronizing the Storage Account keys of the 2 existing Storage Accounts did not solve the problem.
After I opened a ticket with Microsoft, they obtained a device Support Package and found that the device was constantly trying, and failing, to reach the volume whose Storage Account had been deleted. Those repeated failures were preventing the device from serving the remaining, unaffected volumes.
Steps to reproduce the problem
- Create 3 Storage Accounts
- Create 3 Volume Containers, each using a separate Storage Account
- Create 3 volumes, 1 in each volume container
- Present all volumes to a Windows 2012 R2 host, bring them online, partition and format them, and copy test data
- Delete 1 Storage Account
- All 3 volumes will fail (inaccessible) after some time (see questions and answers section below about how much time)
Create a Storage Account with the same name as the one that was accidentally deleted, and synchronize the keys with the StorSimple Manager service
Although this solution frees the device to serve the volumes whose Storage Accounts were not deleted, it does not restore the volume(s) whose data was lost when their Storage Account was deleted. The data on such volumes needs to be restored from a snapshot.
Questions and answers:
- If the Storage Account had been deleted 60+ days before cloud snapshots started to fail, what prompted the cloud snapshot failure if it was caused by the Storage Account deletion?
- If the Storage Account had been deleted 82+ days before volumes started to fail, what prompted the volume failure if it was caused by the Storage Account deletion?
What happened here is that, eventually, all the failed authentication attempts to the deleted storage account filled up the barrier queue (the queue to the cloud). Once the queue fills beyond a certain point, it becomes completely stuck, and anything in line behind it cannot get through. Only once the barrier queue was completely overrun with connection failures to the deleted storage account was all other cloud traffic affected. This has the same effect as losing the cloud connection, and because this is a hybrid appliance, it can cause many different issues, such as the unavailable volumes and failed backups we saw here.
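To make the delayed failure easier to picture, here is a toy model of a single shared, bounded queue. It is purely illustrative and is not how the StorSimple barrier queue is actually implemented: the capacity, the tick loop, and the account names sa1/sa2/sa3 are all assumptions of mine. Requests to the deleted account never succeed and stay queued for retry, so they slowly eat up the queue until healthy requests can no longer get in.

```python
def simulate(ticks, capacity, accounts, deleted):
    """Toy model: one bounded queue to the cloud shared by all accounts."""
    queue = []
    served = {name: 0 for name in accounts}
    rejected = {name: 0 for name in accounts}

    for _ in range(ticks):
        # Every account submits one request per tick, if there is room.
        for name in accounts:
            if len(queue) < capacity:
                queue.append(name)
            else:
                rejected[name] += 1

        # Healthy requests complete and leave the queue. Requests to a
        # deleted account fail and stay queued, waiting to be retried.
        still_queued = []
        for name in queue:
            if name in deleted:
                still_queued.append(name)
            else:
                served[name] += 1
        queue = still_queued

    return served, rejected


accounts = ["sa1", "sa2", "sa3"]

# All three accounts exist: nothing is ever rejected.
healthy_served, healthy_rejected = simulate(10, 5, accounts, deleted=set())

# sa1 is deleted: sa2 and sa3 work for a few ticks, then get locked out.
broken_served, broken_rejected = simulate(10, 5, accounts, deleted={"sa1"})

print(healthy_served, healthy_rejected)
print(broken_served, broken_rejected)
```

In the second run the healthy accounts are served normally at first and are only rejected once the stuck requests have filled the queue, which mirrors the gap between the deletion and the volume failures.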
Recommendations to Microsoft
- Log events of Storage Account key changes in Operation Logs
- Currently 90 days' worth of events show up in Operation Logs. It would be helpful if that retention period were configurable by the client on each subscription
- Make the device Support Package available to the client without the need for a key from Microsoft. In this case, information available only in the Support Package held the key to the workaround/solution.
- Update the device software so that loss of a Storage Account affects only its associated volumes, not all volumes (perhaps a separate queue per volume container instead of a queue per device)
- Update the StorSimple Manager service and/or Storage Account so that a Storage Account cannot be deleted if there’s an associated StorSimple Volume Container
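The last recommendation amounts to a simple pre-delete check. The sketch below shows the idea with in-memory data only; the class and function names are hypothetical and do not correspond to any real Azure or StorSimple API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeContainer:
    # Hypothetical record: a volume container and the account it uses.
    name: str
    storage_account: str


class StorageAccountInUseError(Exception):
    """Raised when a delete would orphan one or more volume containers."""


def delete_storage_account(account, existing_accounts, volume_containers):
    """Remove `account` only if no volume container still points to it."""
    in_use_by = [vc.name for vc in volume_containers
                 if vc.storage_account == account]
    if in_use_by:
        raise StorageAccountInUseError(
            f"{account} is still used by: {', '.join(in_use_by)}")
    existing_accounts.discard(account)


existing_accounts = {"sa1", "sa2", "sa3", "unused"}
containers = [VolumeContainer("vc1", "sa1"),
              VolumeContainer("vc2", "sa2"),
              VolumeContainer("vc3", "sa3")]

delete_storage_account("unused", existing_accounts, containers)  # allowed

try:
    delete_storage_account("sa1", existing_accounts, containers)
    blocked = False
except StorageAccountInUseError as err:
    blocked = True
    print(err)
```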
StorSimple is an amazing Hybrid Storage array that extends on-prem storage seamlessly to the cloud. The 5000 and 7000 series models require the setup of a cloud storage account prior to array installation. To set up a storage account in Azure for a StorSimple array:
- Log in to your Azure Management portal at https://manage.windowsazure.com/
You need to have at least one active Azure subscription before you start.
- Click Storage on the left, then click New at the bottom:
- Type in a name for the new account you wish to create. The name must use lowercase letters and numbers only. Pick an Azure data center, typically one that’s physically close to your location to get better latency. Pick a subscription. Pick a replication setting. Locally-redundant gives you 3 copies of your data in the data center you selected. Geo-redundant gives you 3 additional copies in another Azure data center. Geo-redundant is typically twice the cost of a locally-redundant storage account, is the default option, and is well worth the additional 2.25 cents per GB in my opinion.
In a minute or two, Azure will finish creating the storage account. Click the account name:
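If you generate account names in a script, the naming rule can be checked before you reach the portal. This is a small sketch of my own, based on Azure's rule that storage account names are 3 to 24 characters long and contain only lowercase letters and numbers; the function name is made up.

```python
import re

# 3 to 24 characters, lowercase letters and digits only.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name):
    """Return True if `name` satisfies the storage account naming rule."""
    return _ACCOUNT_NAME.fullmatch(name) is not None


for candidate in ["storsimple01", "StorSimple01", "my-account", "ab"]:
    print(candidate, is_valid_storage_account_name(candidate))
```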
- Next click Dashboard, and click Manage Access Keys at the bottom:
- Copy the account name and the primary access key. You will need them to set up your StorSimple array later.
Secure this information because it provides access to your Azure data. Data can be accessed by using either the primary or the secondary key. Each key is 88 characters long and is made up of upper and lower case letters, digits, and special characters. Having 2 keys allows you to change keys without interrupting access for the applications or machines that use the account. For example, if a key is compromised and an application or machine is using the primary key, you can:
- Regenerate the secondary key
- Replace the key in the script/application/machine using the storage account with the new secondary key (no access interruption)
- Regenerate the primary key
Now you have changed your account keys without any service interruption.
Note that this applies to StorSimple 5000 and 7000 series only. StorSimple 8000 series is set up differently. See this post for more details.
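The rotation steps above can be walked through with a small in-memory model. This is only a sketch: the `StorageAccount` class is hypothetical and does not call Azure, and it assumes keys are 64 random bytes encoded as Base64, which is what yields 88 characters.

```python
import base64
import secrets


def new_key():
    # 64 random bytes encode to 88 Base64 characters (including padding).
    return base64.b64encode(secrets.token_bytes(64)).decode("ascii")


class StorageAccount:
    """Hypothetical stand-in for an account with two access keys."""

    def __init__(self):
        self.keys = {"primary": new_key(), "secondary": new_key()}

    def regenerate(self, which):
        self.keys[which] = new_key()
        return self.keys[which]

    def authenticate(self, key):
        return key in self.keys.values()


account = StorageAccount()
client_key = account.keys["primary"]   # the application uses the primary key
compromised_key = client_key           # ...and that key has leaked

# Step 1: regenerate the secondary key.
fresh_secondary = account.regenerate("secondary")
works_after_step_1 = account.authenticate(client_key)

# Step 2: switch the application to the new secondary key.
client_key = fresh_secondary
works_after_step_2 = account.authenticate(client_key)

# Step 3: regenerate the primary key, which invalidates the leaked one.
account.regenerate("primary")
works_after_step_3 = account.authenticate(client_key)
leaked_key_still_works = account.authenticate(compromised_key)

print(len(client_key), works_after_step_1, works_after_step_2,
      works_after_step_3, leaked_key_still_works)
```

The application can authenticate after every step, while the leaked key stops working only at the last one.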