In mid-February 2017, Microsoft released StorSimple software version 4.0 (17820). This release includes firmware and driver updates that require Maintenance mode and the serial console.
Trying out the new cmdlets, I found that Get-HCSControllerReplacementStatus returns a message like:
Get-HCSRehydrationJob returns no output (no restore jobs are running).
Invoke-HCSDiagnostics seems pretty useful and returns output similar to:
The cmdlet takes a little while to run. In this case it took 14 minutes and 38 seconds:
It returns data in several sections:
System Information section:
This is output similar to what we get from the Get-HCSSystem cmdlet for both controllers.
Update Availability section:
This is output similar to that of the Get-HCSUpdateAvailability cmdlet, although the MaintenanceModeUpdatesTitle property is unexpectedly empty.
Cluster Information section:
This is newly exposed information. I’m guessing this is the output of some Get-HCSCluster cmdlet, but this is pure speculation on my part. I’m also guessing that this is a list of clustered roles in a traditional Server 2012 R2 failover cluster.
Service Information section:
This is also newly exposed information. Get-Service is not an exposed cmdlet.
Failed Hardware Components section:
This is newly exposed information. This device is in good working order, so the items in this list may be false warnings.
Firmware Information section:
This output is similar to what we get from the Get-HCSFirmwareVersion cmdlet.
Network Diagnostics section:
Most of this information is not new, but it’s nicely bundled into one section.
Performance Diagnostics section:
Finally, this section provides new information about read and write latency to the configured Azure Storage accounts.
The full list of exposed cmdlets in Version 4.0 is:
19 December 2016
In a conference call, the Microsoft Azure StorSimple product team explained:
- “The maximum recommended full backup size when using an 8100 as a primary backup target is 10TiB. The maximum recommended full backup size when using an 8600 as a primary backup target is 20TiB”
- “Backups will be written to array, such that they reside entirely within the local storage capacity”
Microsoft acknowledges the difficulty caused by the 200 TB maximum provisionable space on an 8100 device. This limit restricts the ability to over-provision thin-provisioned tiered iSCSI volumes when significant deduplication/compression savings are expected, for example with Veeam backup copy job files kept for long-term retention.
- When used as a primary backup target, StorSimple 8k devices are intended for SMB clients with backup files under 10TB/20TB for the 8100/8600 models respectively
- Compared to using an Azure A4 VM with attached disks (page blobs), StorSimple provides 7-22% cost savings over 5 years
15 December 2016
On 13 December 2016, Microsoft announced support for using StorSimple 8k devices as a backup target. Many customers have asked for StorSimple to support this workload. The StorSimple hybrid cloud storage iSCSI SAN features automated tiering at the block level from its SSD to SAS to Azure tiers. This makes it a perfect fit as Primary Data Set storage for unstructured data such as file shares. It also features cloud snapshots, which provide the additional functionality of data backup and disaster recovery. That’s primary storage, secondary storage (short term backups), long term storage (multiyear retention), off site storage, and multi-site storage, all in one solution.
However, the features that make the device a good fit for the primary data set/unstructured data pose significant difficulties when trying to use it as a backup target, such as:
- Automated tiering: Many backup software packages (like Veeam) perform operations such as forward incremental backups, synthetic fulls, and backup copy jobs for long term retention, all of which scan/access files that are typically dozens of TB each. This causes the device to tier data to Azure and back to the local device in a way that slows things down to a crawl. DPM is even worse, specifically in the way it allocates/controls volumes.
- The arbitrary maximum allocatable space for a device (200TB for an 8100 device, for example) makes it practically impossible to use the device as a backup target for long term retention.
- Example: a 50 TB volume for which we need to retain 20 copies for long term backup. Even if the change rate is very low and the actual data after deduplication and compression of the 20 copies is 60 TB, we cannot provision 20x 50 TB volumes, or a 1 PB volume. This makes the maximum workload size around 3TB if long term retention requires 20 recovery points. 3TB is far too small a limit for enterprise clients who simply want to use Azure for long term backup where a single backup file is 10-200 TB.
- The specific implementation of the backup catalog and who (the backup software versus StorSimple Manager service) has it.
- Single unified tool for backup/recovery – now we have to use the backup software and StorSimple Manager, which do not communicate and are not aware of each other
- Granular recoveries (single file/folder). Currently to recover a single file from snapshot, we must clone the entire volume.
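The text does not derive the "around 3TB" figure in the retention example above. One reading, and it is only an assumption, is that every retained copy must fit at full (pre-deduplication) size in a single volume of at most 64 TB, the maximum volume size quoted later in this post. A minimal Python sketch of that arithmetic, with helper names invented for illustration:

```python
# Assumption: the ~3 TB limit comes from fitting every retained copy,
# at full pre-deduplication size, into one maximum-size volume.
MAX_VOLUME_TB = 64          # maximum volume size quoted in this post
DEVICE_MAX_TB_8100 = 200    # maximum allocatable space on an 8100


def max_workload_tb(capacity_tb: float, recovery_points: int) -> float:
    """Largest workload whose full copies all fit in capacity_tb."""
    if recovery_points <= 0:
        raise ValueError("recovery_points must be positive")
    return capacity_tb / recovery_points


single_volume_limit = max_workload_tb(MAX_VOLUME_TB, 20)
whole_device_limit = max_workload_tb(DEVICE_MAX_TB_8100, 20)
space_needed_for_50tb = 50 * 20  # TB to provision for 20 copies of 50 TB

print(f"Single 64 TB volume, 20 copies: {single_volume_limit:.1f} TB")
print(f"Whole 8100 (200 TB), 20 copies: {whole_device_limit:.1f} TB")
print(f"Provisioning needed for 50 TB x 20: {space_needed_for_50tb} TB")
```

Under this reading the single-volume limit is the binding one. Spreading copies across the whole 8100 would raise the ceiling, but not by enough for a 10-200 TB backup file.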
In this article published 6 December 2016, Microsoft lays out their reference architecture for using a StorSimple 8k device as a Primary Backup Target for Veeam.
There are a number of best practices for configuring Veeam and StorSimple in this use case, such as disabling deduplication, compression, and encryption on the Veeam side, dedicating the StorSimple device to the backup workload, …
The interesting part comes in when you look at scalability. Here’s Microsoft’s listed example of a 1 TB workload:
This architecture suggests provisioning 5*5TB volumes for the daily backups and a 26TB volume for the weekly, monthly, and annual backups:
This 1:26 ratio between the Primary Data Set and Vol6 (used for the weekly, monthly, and annual backups) suggests that the maximum supported Primary Data Set is 2.46 TB, because the maximum volume size is 64 TB and 64 / 26 ≈ 2.46.
This reference architecture suggests that this solution may not work for a file share that is larger than 2.5TB or that may need to grow beyond 2.5TB.
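As a quick check of the volume-size limit, here is a small Python sketch. It assumes Vol6 must scale linearly at 26 TB per 1 TB of Primary Data Set, as in Microsoft's 1 TB example; the function name is made up for illustration.

```python
MAX_VOLUME_TB = 64            # maximum StorSimple volume size quoted above
VOL6_TB_PER_PRIMARY_TB = 26   # Vol6 size for a 1 TB Primary Data Set


def max_primary_for_vol6(max_volume_tb: float, ratio: float) -> float:
    """Largest Primary Data Set whose Vol6 still fits in one volume."""
    return max_volume_tb / ratio


limit_tb = max_primary_for_vol6(MAX_VOLUME_TB, VOL6_TB_PER_PRIMARY_TB)
print(f"Maximum Primary Data Set: {limit_tb:.2f} TB")
```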
Furthermore, this reference architecture suggests that the maximum Primary Data Set cannot exceed 2.66TB on an 8100 device, which has 200TB maximum allocatable capacity, reserving 64TB to be able to restore the 64TB Vol6
It also suggests that the maximum Primary Data Set cannot exceed 8.55TB on an 8600 device, which has 500TB maximum allocatable capacity, reserving 64TB to be able to restore the 64TB Vol6
Even if we consider cloud snapshots to be used only in case of total device loss – disaster recovery, and we allocate the maximum device capacity, the 8100 and 8600 devices can accommodate 3.93TB and 9.81TB respectively:
Although the allocation of 51TB of space to back up 1 TB of data resolves the tiering issue noted above, it significantly erodes the value proposition provided by StorSimple.
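The device-level limits above can be recomputed with a short Python sketch. It assumes the allocation scales linearly from the 1 TB example (5 x 5 TB daily volumes plus a 26 TB Vol6, so 51 TB per 1 TB of primary data); the names are illustrative only. The results agree with the figures quoted above to within about 0.01 TB, so the original numbers may have been rounded differently.

```python
DAILY_VOLUMES = 5
DAILY_VOLUME_TB = 5
VOL6_TB = 26
# Space allocated per 1 TB of Primary Data Set: 5 * 5 + 26 = 51 TB
ALLOCATED_PER_PRIMARY_TB = DAILY_VOLUMES * DAILY_VOLUME_TB + VOL6_TB

RESTORE_RESERVE_TB = 64                     # kept free to restore a 64 TB Vol6
DEVICE_MAX_TB = {"8100": 200, "8600": 500}  # maximum allocatable capacity


def max_primary_tb(device_max_tb: float, reserve_tb: float = 0) -> float:
    """Largest Primary Data Set that fits after setting aside reserve_tb."""
    return (device_max_tb - reserve_tb) / ALLOCATED_PER_PRIMARY_TB


for model, capacity in DEVICE_MAX_TB.items():
    with_reserve = max_primary_tb(capacity, RESTORE_RESERVE_TB)
    without_reserve = max_primary_tb(capacity)
    print(f"{model}: {with_reserve:.2f} TB with a 64 TB reserve, "
          f"{without_reserve:.2f} TB using the whole device")
```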
Server 2016 has enhanced and added new features to Storage Spaces. Most notable is the introduction of Storage Spaces Direct, Storage Replica, and Storage QoS. This post explores upgrading a physical Server 2012 R2 that uses a mirrored tiered storage space.
After installing Server 2016 (Desktop Experience) and choosing to keep ‘nothing’, Server Manager shows the old Storage Pool from the prior installation of Server 2012 R2 under File and Storage Services\Volumes\Storage Pools.
To recover the Storage Pool, its virtual disks, and all data, follow these steps:
- Set Read-Write access
- Upgrade the Storage Pool version. Note that this step is irreversible.
- Right click on each virtual disk and attach it
- Finally, in Disk Management, right click on each virtual disk and bring it online
The virtual disks retain the drive letters and volume labels assigned to them in the old 2012 R2 server. All data is intact.
Test-HcsStorageAccountCredential is a function in the HCS (Hybrid Cloud Storage) PowerShell module.
This module is only available on a StorSimple device. The purpose of this function is to test connectivity and authentication to an Azure Storage account or other supported public clouds’ storage containers. This may be needed during device deployment to troubleshoot connectivity issues, specifically Storage Account access.
The cmdlet/function has 3 parameter sets. When using the ‘name’ parameter set, we may see several outputs like:
Once a volume container is created that uses a newly created Storage Account, Test-HcsStorageAccountCredential returns a different output:
The above output indicates that the StorSimple device can access the Storage Account successfully. What’s indicative of success here is NOT the ‘HcsErrorMessage: Success’ message; it is the ‘StatusCode: 0’ message.
Now, if you change the Storage account keys (password portion of the credential needed to access it), the Test-HcsStorageAccountCredential returns output similar to:
The HcsErrorMessage and the HttpMessage above seem to be accurate.
After synchronizing the Storage Account keys with the StorSimple Manager service, deleting the volume container associated with the Storage Account, and deleting the Storage Account, the Test-HcsStorageAccountCredential returns output similar to:
The above message is a bit confusing. I would expect to see a message similar to the red error message above, indicating that the Storage Account does not exist. ‘HcsErrorMessage: Success’ here is inaccurate. On the bright side, ‘HttpMessage: ResourceNotFound’ is accurate.
In one scenario, where volume container creation fails with error 502, the Test-HcsStorageAccountCredential returns output similar to:
Again, ‘HcsErrorMessage: Success’ here is inaccurate. This particular error turned out to be caused by misconfigured proxy settings on the device: NTLM was specified instead of None and no username/password was provided, even though the proxy did not require or use any authentication.
The PowerShell commands to use are:
Get-HCSWebProxy # to view current Proxy settings
Set-HCSWebProxy -ConnectionURI 'http://myproxy.mydomain.com:8000' -Authentication None # to configure the device to use Proxy
Enable-HCSWebProxy # to enable Proxy use
When using the Test-HcsStorageAccountCredential function/cmdlet with the ‘name’ parameter set, any StatusCode value other than 0 indicates failure to connect and/or authenticate to the Storage Account. ‘HcsErrorMessage: Success’ may be inaccurate.
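For anyone scripting around captured output, the rule above can be expressed as a small Python helper. This is a sketch only: it assumes the cmdlet output has been saved as text with `Name: Value` lines like those quoted in this post, and the sample strings are made up for illustration, not real device output.

```python
def parse_fields(output_text: str) -> dict:
    """Parse 'Name: Value' lines into a dict (assumed output format)."""
    fields = {}
    for line in output_text.splitlines():
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def credential_test_succeeded(output_text: str) -> bool:
    """True only when StatusCode is 0; HcsErrorMessage is ignored."""
    status = parse_fields(output_text).get("StatusCode")
    try:
        return status is not None and int(status) == 0
    except ValueError:
        return False  # StatusCode present but not a number


# Hypothetical samples; the nonzero StatusCode is a placeholder value.
ok_sample = "StatusCode: 0\nHcsErrorMessage: Success"
bad_sample = ("StatusCode: 1\nHcsErrorMessage: Success\n"
              "HttpMessage: ResourceNotFound")

print(credential_test_succeeded(ok_sample))
print(credential_test_succeeded(bad_sample))
```

The helper deliberately never looks at HcsErrorMessage, since that field reported Success in the failing cases described above.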
Deleting Storage Account associated with a StorSimple Volume Container disables ALL volumes on the device
I came across the following interesting situation with an 8600 StorSimple device running software version 3.0 (17759).
All iSCSI volumes from the StorSimple device in question are down at the Windows 2012 R2 host (about a dozen volumes in this case). If you create a new volume and present it to the Windows host, attempting to partition it (GPT) fails with the error message ‘disk not ready’.
Two additional observations were made:
- All cloud snapshots failed about a month before this incident.
- Software update 3.0 was applied to this device at roughly the same time the incident occurred. The accompanying firmware update was not applied.
The timeline is as follows:
- Storage account was deleted prior to 6/28/2016 (not showing in Operation Logs which are kept for 90 days)
- 60+ days later (8/28/2016) all cloud snapshots started to fail. The error message suggests failure to access the storage account
- 82+ days later around 9/20/2016, users started to report volumes not available
- 9/24/2016, software update 3.0 was applied from the classic portal
- About a dozen volumes were provisioned from this device to one Windows 2012 R2 host. Volume Containers were associated with 3 Storage Accounts in the same subscription
- One of the 3 Storage Accounts (the one on the top of the above image) was missing. Apparently it was inadvertently deleted.
- Get-HCSSystem showed normal device condition
- iSCSI connectivity including iSCSI initiator and MPIO configuration were reviewed, tested and showed no issues.
- Ping (Test-Connection) and tracert.exe (Trace-HCSRoute) from each of the host iSCSI interfaces to each of the device iSCSI interfaces and back came back OK.
- Test-HCSMConnection showed no problems.
- Test-HcsStorageAccountCredential against the 2 existing Storage Accounts showed no problem.
Troubleshooting and Root Cause Analysis
I initially suspected that the Storage Account keys were changed without getting synchronized with the StorSimple Manager service, cutting off the device from its Storage Accounts. That would explain both the failure of all volumes and the cloud snapshot failure, since both need to read and write to Storage Accounts.
- However, Operation Logs showed no events or records related to change of Storage Account keys associated with a StorSimple volume.
- Operation Logs showed no event/record of Storage Account deletion.
- Synchronizing the Storage Account keys of the 2 existing Storage Accounts did not solve the problem.
After I opened a ticket with Microsoft, they obtained a device Support Package and found that the device appeared to be constantly trying and failing to reach the volume whose Storage Account was deleted, which in turn caused it to fail to serve the remaining unaffected volumes.
Steps to reproduce the problem
- Create 3 Storage Accounts
- Create 3 Volume Containers, each using a separate Storage Account
- Create 3 volumes, 1 in each volume container
- Present all volumes to a Windows 2012 R2 host, online, partition, format, copy test data
- Delete 1 Storage Account
- All 3 volumes will fail (inaccessible) after some time (see questions and answers section below about how much time)
Create a Storage Account with the same name as the one that was accidentally deleted, and synchronize the keys with the StorSimple Manager service
Although this solution frees the device to serve the volumes whose Storage Accounts have not been deleted, it does not restore the volume(s) whose data was lost when their Storage Account was deleted. Data for such volumes needs to be restored from a snapshot.
Questions and answers:
- If the Storage Account was deleted 60+ days before cloud snapshots started to fail, what prompted the cloud snapshot failure if that was caused by Storage Account deletion?
- If the Storage Account was deleted 82+ days before volumes started to fail, what prompted volume failure if that was caused by Storage Account deletion?
What happened here is that, over time, the failed authentication attempts to the deleted Storage Account filled up the barrier queue (the queue to the cloud). Once the queue fills beyond a certain point, it becomes completely stuck, and anything in line behind it is unable to get through. Only when the barrier queue was completely overrun with these failed connections to the deleted Storage Account was all other cloud traffic affected. This has the same effect as losing the cloud connection, and because this is a hybrid appliance, that can cause many different issues, such as the unavailable volumes and failed backups we saw here.
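To make the blocking behaviour described above concrete, here is a toy Python model. It is not StorSimple's implementation: the queue capacity, the retry behaviour, and the container names are all invented for illustration. It compares one bounded queue shared by all volume containers with one bounded queue per container, the idea floated in the recommendations below.

```python
from collections import Counter


def simulate(ticks, containers, deleted, capacity, per_container_queues):
    """Return the number of served requests per container.

    Each tick every container submits one request. A request to a healthy
    container is served in the same tick. A request to a container whose
    Storage Account was deleted never completes and keeps its queue slot.
    """
    stuck = Counter()   # queue key -> slots held by stuck requests
    served = Counter()  # container name -> requests served
    for _ in range(ticks):
        for name in containers:
            key = name if per_container_queues else "shared"
            if stuck[key] >= capacity:
                continue  # queue is full, the request cannot get in
            if name in deleted:
                stuck[key] += 1  # failed attempt stays queued
            else:
                served[name] += 1
    return served


containers = ["A", "B", "C"]
shared = simulate(10, containers, {"C"}, capacity=4, per_container_queues=False)
isolated = simulate(10, containers, {"C"}, capacity=4, per_container_queues=True)

for name in containers:
    print(f"{name}: shared queue served {shared[name]}, "
          f"per-container queues served {isolated[name]}")
```

In this toy model the shared queue stops serving the healthy containers once stuck requests fill it, which mirrors the delay between the Storage Account deletion and the volume failures. With per-container queues, only the container with the deleted Storage Account is affected.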
Recommendations to Microsoft
- Log events of Storage Account key changes in Operation Logs
- Currently 90 days’ worth of events show up in Operation Logs. It would be helpful if that retention period were configurable by the client on each subscription
- Make the device Support Package available to the client without the need for a key from Microsoft. In this case, information available only in the Support Package held the key to the workaround/solution.
- Update the device software so that loss of a Storage Account affects only its associated volumes not all volumes (Perhaps a separate queue per volume container instead of a queue per device)
- Update the StorSimple Manager service and/or Storage Account so that a Storage Account cannot be deleted if there’s an associated StorSimple Volume Container
You may need to move your StorSimple 8k iSCSI SAN from one physical location to another. Assuming the move is not to another continent or thousands of miles away, I recommend the following process:
- On the file servers that receive iSCSI volumes from this StorSimple device, open Disk Management, and offline all volumes from this StorSimple device
- (Optional) In the classic portal, under the device/maintenance page, install the latest Software and Firmware update. This step is unrelated to the move; it is here to take advantage of the downtime window to update the device. It may take 1-12 hours, and may require access to the device serial interface.
- Ensure that you have the Device Administrator password. You’ll need that to change the device IP configuration for the new site. If you don’t have it, you can reset it by going into the classic portal, under the device/configuration page.
- Power down the device: in the classic portal, under device/maintenance, click Manage Controllers at the bottom, shut down Controller0, then repeat to shut down Controller1
- After the device is powered down, toggle the power buttons on the back of the PCMs to the off position. Do the same for the EBOD enclosure if this is an 8600 model device.
- Move the device to the new location
- Rack, cable, and power on the device by toggling the power buttons on the back of the PCMs.
- In the serial console,
- Type 1 to login with full access, enter the device Administrator password.
- Type in Invoke-HCSSetupWizard, enter the new information for data0 interface: IP, mask, gateway, DNS server, NTP server, Proxy information if that’s needed for Internet access in the new site (Proxy URL as http://my.proxy.domain.com:8888, authentication is typically T for NTLM, Proxy username and password if needed by your Proxy – Proxy must be v1.1 compliant)
- Back in the classic portal, you should see your device back online. Go to the device/configuration page and update any settings as needed, such as the controller0 and controller1 fixed IPs and the iSCSI interface configuration if that has changed.
- If the same file servers have moved with the StorSimple device,
- Bring the file servers online and change the IP configuration as needed
- Verify iSCSI connectivity to the StorSimple device
- Verify iSCSI initiator configuration
- Online the iSCSI volumes
- Test file access
This post lists StorSimple software versions, their release dates, and major new features for reference. Microsoft does not publish release dates for StorSimple updates. The release dates below are from published documentation and/or first hand experience. They may be off by up to 15 days.
- Version 4.0 (17820) – released 12 February 2017 – see release notes, and this post.
- Major new features: Invoke-HCSDiagnostics new cmdlet, and heatmap based restores
- Version 3.0 (17759) – released 6 September 2016 – see release notes, and this post.
- Major new features: The use of a StorSimple as a backup target (as of 9/9/2016 it’s unclear what that means)
- Version 2.2 (17708) – see release notes
- Version 2.1 (17705) – see release notes
- Version 2.0 (17673) – released January 2016 – see release notes, this post, and this post
- Major new features: Locally pinned volumes, new virtual device 8020 (64TB SSD), ‘proactive support’, OVA (preview)
- Version 1.2 (17584) – released November 2015 – see release notes, this post, and this post
- Major new features: (Azure-side) Migration from legacy 5k/7k devices to 8k devices, support for Azure US GOV, support for cloud storage from other public clouds such as AWS/HP/OpenStack, update to latest API (this should allow us to manage the device in the new portal, yet this has not happened as of 9/9/2016)
- Version 1.1 (17521) – released October 2015 – see release notes
- Version 1.0 (17491) – released 15 September 2015 – see release notes and this post
- Version 0.3 (remains 17361) – released February 2015 – see release notes
- Version 0.2 (17361) – released January 2015 – see release notes and this post
- Version 0.1 (17312) – released October 2014 – see release notes
- Version GA (General Availability – 0.0 – Kernel 6.3.9600.17215) – released July 2014 – see release notes – This is the first Windows OS based StorSimple software after Microsoft’s acquisition of the StorSimple company.
- When Microsoft acquired the StorSimple company, the StorSimple 5k/7k series ran Linux OS based StorSimple software version 188.8.131.52 – August 2012
This post describes one experience of updating a StorSimple 8100 series device from version 0.2 (17361) to the current (as of 8 September 2016) version 3.0 (17759). It’s worth noting that:
- StorSimple 8k series devices that shipped in mid 2015 came with software version 0.2
- Typically, the device checks periodically for updates and when updates are found a note similar to this image is shown in the device/maintenance page:
- The device admin then picks when to deploy the updates by clicking the INSTALL UPDATES link. This kicks off an update job, which may take several hours.
- This update method is known as updating StorSimple device using the classic Azure portal, as opposed to updating the StorSimple device using the serial interface by deploying the update as a hotfix.
- Released updates may not show up, in spite of scanning for updates manually several times:
The image above was taken on 9 September 2016 (update 3.0 is the latest at this time). It shows that no updates are available even after scanning for updates several times. The reason is that Microsoft deploys updates in a ‘phased rollout’, so they’re not available in all regions at all times.
- Updates are cumulative. This means that a device running version 0.2, for example, can be upgraded directly to 3.0 without first being manually updated to any intermediary version.
- An update may include one or both of the following 2 types:
- Software updates: This is an update of the core 2012 R2 server OS that’s running on the device. Microsoft identifies this type as a non intrusive update. It can be deployed while the device is in production, and should not affect mounted iSCSI volumes. Under the covers, the device controller0 and controller1 are 2 nodes in a traditional Microsoft failover cluster. The device uses the traditional Cluster Aware Update to update the 2 controllers. It updates and reboots the passive controller first, fails over the device (iSCSI target and other clustered roles) from one controller to the other, then updates and reboots the second controller. Again this should be a no-down-time process.
- Maintenance mode updates: These are updates to shared components in the device that require down time. Typically we see LSI SAS controller updates and disk firmware updates in this category. Maintenance mode updates must be done from the serial interface console (not Azure web interface or PowerShell interface). The typical down time for a maintenance mode update is about 30 minutes, although I would schedule a 2 hour window to be safe. The maintenance mode update steps are:
- On the file servers, offline all iSCSI volumes provisioned from this device.
- Log in to the device serial interface with full access
- Put the device in Maintenance mode: Enter-HcsMaintenanceMode, wait for the device to reboot
- Identify available updates: Get-HcsUpdateAvailability, this should show available Maintenance mode updates (TRUE)
- Start the update: Start-HcsUpdate
- Monitor the update: Get-HcsUpdateStatus
- When finished, exit maintenance mode: Exit-HcsMaintenanceMode, and wait for the device to reboot.