I came across the following interesting situation with an 8600 StorSimple device running software version 3.0 (17759).
All iSCSI volumes from the StorSimple device in question are down at the Windows 2012 R2 host (about a dozen volumes in this case). If you create a new volume and present it to the Windows host, attempting to partition it (GPT) fails with the error message ‘disk not ready’.
Two additional observations were made:
- All cloud snapshots failed about a month before this incident.
- Software update 3.0 was applied to this device at roughly the same time the incident occurred. The accompanying firmware update was not applied.
The timeline is as follows:
- Storage account was deleted prior to 6/28/2016 (not showing in Operation Logs which are kept for 90 days)
- 60+ days later (8/28/2016), all cloud snapshots started to fail. The error message suggested a failure to access the Storage Account
- 82+ days later, around 9/20/2016, users started to report that volumes were not available
- 9/24/2016, software update 3.0 was applied from the classic portal
- About a dozen volumes were provisioned from this device to one Windows 2012 R2 host. Volume Containers were associated with 3 Storage Accounts in the same subscription
- One of the 3 Storage Accounts (the one on the top of the above image) was missing. Apparently it was inadvertently deleted.
- Get-HCSSystem showed normal device condition
- iSCSI connectivity, including the iSCSI initiator and MPIO configuration, was reviewed and tested, and showed no issues.
- Ping (Test-Connection) and tracert.exe (Trace-HCSRoute) from each of the host iSCSI interfaces to each of the device iSCSI interfaces and back came back OK.
- Test-HCSMConnection showed no problems.
- Test-HcsStorageAccountCredential against the 2 existing Storage Accounts showed no problem.
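The checks above all passed, in part because only the 2 surviving Storage Accounts were tested. A complementary check that needs no device access is to see whether each account's public blob endpoint still resolves in DNS. The following is a minimal sketch, not a StorSimple tool. It assumes the standard `<account>.blob.core.windows.net` endpoint and that a deleted account's name stops resolving. The function name and its `resolve` parameter are hypothetical.

```python
import socket

def missing_accounts(account_names, resolve=socket.getaddrinfo):
    """Return the account names whose public blob endpoint does not resolve.

    Assumption: a deleted Storage Account's DNS name no longer resolves.
    A resolution failure can also mean a local DNS problem, so treat the
    result as a hint, not as proof of deletion.
    """
    missing = []
    for name in account_names:
        host = f"{name}.blob.core.windows.net"
        try:
            resolve(host, 443)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing
```

Calling `missing_accounts([...])` with the names of the Storage Accounts behind each Volume Container returns the subset that no longer resolves. An empty list means every endpoint was found.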
Troubleshooting and Root Cause Analysis
I initially suspected that the Storage Account keys had been changed without being synchronized with the StorSimple Manager service, cutting the device off from its Storage Accounts. That would explain both the failure of all volumes and the cloud snapshot failure, since both need to read from and write to Storage Accounts.
- However, Operation Logs showed no events or records related to change of Storage Account keys associated with a StorSimple volume.
- Operation Logs showed no event/record of Storage Account deletion.
- Synchronizing the Storage Account keys of the 2 existing Storage Accounts did not solve the problem.
After I opened a ticket with Microsoft, they obtained a device Support Package and found that the device was constantly trying, and failing, to reach the volume whose Storage Account had been deleted, which prevented it from serving the remaining unaffected volumes.
Steps to reproduce the problem
- Create 3 Storage Accounts
- Create 3 Volume Containers, each using a separate Storage Account
- Create 3 volumes, 1 in each volume container
- Present all volumes to a Windows 2012 R2 host, online, partition, format, copy test data
- Delete 1 Storage Account
- All 3 volumes will fail (inaccessible) after some time (see questions and answers section below about how much time)
Create a Storage Account with the same name as the one that was accidentally deleted, and synchronize the keys with the StorSimple Manager service
Although this solution frees the device to serve the volumes whose Storage Accounts have not been deleted, it does not restore the volume(s) whose data was lost when their Storage Account was deleted. The data on such volumes needs to be restored from a snapshot.
Questions and answers:
- If the Storage Account was deleted 60+ days before cloud snapshots started to fail, what prompted the cloud snapshot failure, if it was caused by the Storage Account deletion?
- If the Storage Account was deleted 82+ days before volumes started to fail, what prompted the volume failure, if it was caused by the Storage Account deletion?
What happened here is that the failed authentication attempts to the deleted Storage Account gradually filled up the barrier queue (the device's queue to the cloud). Once the queue fills beyond a certain point, it becomes completely stuck, and anything in line behind it cannot get through. Only when the barrier queue was completely overrun with failed connections to the deleted Storage Account was all other cloud traffic affected. This has the same effect as losing the cloud connection, and because this is a hybrid appliance, it can cause many different issues, such as the unavailable volumes and failed backups we saw here.
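The delayed failure described above can be illustrated with a toy model. This is only a sketch of the general idea of a bounded queue shared by all accounts. It is not StorSimple's actual implementation, and the capacity, the one-request-per-tick arrival rate, and the function name are all assumptions made for illustration.

```python
from collections import deque

def simulate_shared_queue(accounts, deleted, capacity, ticks):
    """Toy model of one bounded cloud queue shared by every Volume Container.

    Each tick, every account submits one request. Requests for healthy
    accounts complete and leave the queue; requests for a deleted account
    fail, stay queued for retry, and keep occupying a slot.
    Returns, per account, the first tick at which a request was rejected
    because the queue was full (None if that never happened).
    """
    queue = deque()
    first_rejected = {name: None for name in accounts}
    for tick in range(ticks):
        for name in accounts:
            if len(queue) >= capacity:
                if first_rejected[name] is None:
                    first_rejected[name] = tick
            else:
                queue.append(name)
        # Drain: healthy requests complete, failed ones remain for retry.
        queue = deque(n for n in queue if n in deleted)
    return first_rejected

if __name__ == "__main__":
    # Hypothetical account names; "sa1" plays the deleted Storage Account.
    print(simulate_shared_queue(["sa1", "sa2", "sa3"], {"sa1"}, capacity=5, ticks=10))
```

In this model the healthy accounts keep working for several ticks while stuck requests accumulate, and then every account is rejected. That mirrors the weeks-long gap between the deletion and the outage. With no deleted account, the queue never fills.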
Recommendations to Microsoft
- Log events of Storage Account key changes in Operation Logs
- Currently, 90 days' worth of events show up in Operation Logs. It would be helpful if that retention period were configurable by the client on each subscription
- Make the device Support Package available to the client without the need for a key from Microsoft. In this case, information available only in the Support Package held the key to the workaround/solution.
- Update the device software so that loss of a Storage Account affects only its associated volumes, not all volumes (perhaps a separate queue per Volume Container instead of one queue per device)
- Update the StorSimple Manager service and/or the Storage Account so that a Storage Account cannot be deleted if there’s an associated StorSimple Volume Container
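The per-container queue suggestion can be sketched with a toy model in which each Volume Container has its own bounded queue. This is a hypothetical illustration of the isolation idea only, not a description of how the device works or would be changed. The capacity, tick-based arrivals, and function name are assumptions.

```python
def simulate_per_container_queues(accounts, deleted, capacity, ticks):
    """Toy model: one bounded queue per Volume Container (per account).

    Each tick, every account submits one request to its own queue.
    Healthy requests complete at once and free their slot; requests for a
    deleted account fail and stay queued, filling only that queue.
    Returns, per account, the first tick at which a request was rejected
    because its queue was full (None if that never happened).
    """
    queued = {name: 0 for name in accounts}  # stuck requests per queue
    first_rejected = {name: None for name in accounts}
    for tick in range(ticks):
        for name in accounts:
            if queued[name] >= capacity:
                if first_rejected[name] is None:
                    first_rejected[name] = tick
                continue
            queued[name] += 1
            if name not in deleted:
                queued[name] -= 1  # healthy request completes, slot is freed
    return first_rejected

if __name__ == "__main__":
    # Hypothetical account names; "sa1" plays the deleted Storage Account.
    print(simulate_per_container_queues(["sa1", "sa2", "sa3"], {"sa1"}, capacity=5, ticks=10))
```

In this model only the queue of the deleted account fills up and starts rejecting requests. The other accounts are never rejected, which is the behavior the recommendation asks for.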