Around mid-February 2017, Microsoft released StorSimple software version 4.0 (17820). The release includes firmware and driver updates that require Maintenance mode and the serial console.
Trying the new cmdlets, the Get-HCSControllerReplacementStatus cmdlet returns a message like:
The Get-HCSRehydrationJob cmdlet returns no output (no restore jobs are running).
The Invoke-HCSDiagnostics cmdlet seems pretty useful and returns output similar to:
The cmdlet takes a little while to run. In this case it took 14 minutes and 38 seconds:
It returns data in several sections:
System Information section:
This is output similar to what we get from the Get-HCSSystem cmdlet for both controllers.
Update Availability section:
This output is similar to that of the Get-HCSUpdateAvailability cmdlet, although the MaintenanceModeUpdatesTitle property is empty !!??
Cluster Information section:
This is newly exposed information. I’m guessing this is the output of some Get-HCSCluster cmdlet, but this is pure speculation on my part. I’m also guessing that this is a list of clustered roles in a traditional Server 2012 R2 failover cluster.
Service Information section:
This is also newly exposed information. Get-Service is not an exposed cmdlet.
Failed Hardware Components section:
This is newly exposed information. This device is in good working order, so the entries in this list may be false warnings.
Firmware Information section:
This output is similar to what we get from the Get-HCSFirmwareVersion cmdlet.
Network Diagnostics section:
Most of this information is not new, but it’s nicely bundled into one section.
Performance Diagnostics section:
Finally, this section provides new information about read and write latency to the configured Azure Storage accounts.
The full list of exposed cmdlets in Version 4.0 is:
19 December 2016
In a conference call, the Microsoft Azure StorSimple product team explained:
- “The maximum recommended full backup size when using an 8100 as a primary backup target is 10TiB. The maximum recommended full backup size when using an 8600 as a primary backup target is 20TiB”
- “Backups will be written to array, such that they reside entirely within the local storage capacity”
Microsoft acknowledged the difficulty that results from the maximum provisionable space being 200 TB on an 8100 device. This limit restricts the ability to over-provision thin-provisioned tiered iSCSI volumes when significant deduplication/compression savings are expected, for example with Veeam long term backup copy job files.
- When used as a primary backup target, StorSimple 8k devices are intended for SMB clients with backup files under 10TB/20TB for the 8100/8600 models respectively
- Compared to using an Azure A4 VM with attached disks (page blobs), StorSimple provides 7-22% cost savings over 5 years
15 December 2016
On 13 December 2016, Microsoft announced the support of using StorSimple 8k devices as a backup target. Many customers have asked for StorSimple to support this workload. StorSimple hybrid cloud storage iSCSI SAN features automated tiering at the block level from its SSD to SAS to Azure tiers. This makes it a perfect fit for Primary Data Set for unstructured data such as file shares. It also features cloud snapshots which provide the additional functionality of data backup and disaster recovery. That’s primary storage, secondary storage (short term backups), long term storage (multiyear retention), off site storage, and multi-site storage, all in one solution.
However, the same features that make the device handy for a primary data set of unstructured data pose significant difficulties when trying to use it as a backup target, such as:
- Automated tiering: Many backup software packages (like Veeam) perform operations such as forward incremental, synthetic full, and backup copy jobs for long term retention, all of which scan/access files that are typically dozens of TB each. This causes the device to tier data to Azure and back to the local device in a way that slows things down to a crawl. DPM is even worse, specifically in the way it allocates/controls volumes.
- The arbitrary maximum allocatable space for a device (200TB for an 8100 device, for example) makes it practically impossible to use the device as a backup target for long term retention.
- Example: 50 TB volume, need to retain 20 copies for long term backup. Even if the change rate is very low and the actual bits after deduplication and compression of 20 copies amount to 60 TB, we cannot provision 20x 50 TB volumes, or a 1 PB volume. This makes the maximum workload size around 3TB if long term retention requires 20 recovery points. 3TB is far too small a limit for enterprise clients who simply want to use Azure for long term backup where a single backup file is 10-200 TB.
- The specific implementation of the backup catalog and who (the backup software versus StorSimple Manager service) has it.
- Single unified tool for backup/recovery – now we have to use the backup software and StorSimple Manager, which do not communicate and are not aware of each other
- Granular recoveries (single file/folder). Currently to recover a single file from snapshot, we must clone the entire volume.
In this article published 6 December 2016, Microsoft lays out their reference architecture for using StorSimple 8k device as a Primary Backup Target for Veeam
There are a number of best practices relating to how to configure Veeam and StorSimple in this use case, such as disabling deduplication, compression, and encryption on the Veeam side, dedicating the StorSimple device for the backup workload, …
The interesting part comes in when you look at scalability. Here’s Microsoft’s listed example of a 1 TB workload:
This architecture suggests provisioning 5*5TB volumes for the daily backups and a 26TB volume for the weekly, monthly, and annual backups:
This 1:26 ratio between the Primary Data Set and Vol6 used for the weekly, monthly, and annual backups suggests that the maximum supported Primary Data Set is 2.46 TB (maximum volume size is 64 TB) !!!???
This reference architecture suggests that this solution may not work for a file share that is larger than 2.5TB or may need to be expanded beyond 2.5TB
Furthermore, this reference architecture suggests that the maximum Primary Data Set cannot exceed 2.66TB on an 8100 device, which has 200TB maximum allocatable capacity, reserving 64TB to be able to restore the 64TB Vol6
It also suggests that the maximum Primary Data Set cannot exceed 8.55TB on an 8600 device, which has 500TB maximum allocatable capacity, reserving 64TB to be able to restore the 64TB Vol6
Even if we consider cloud snapshots to be used only in case of total device loss – disaster recovery, and we allocate the maximum device capacity, the 8100 and 8600 devices can accommodate 3.93TB and 9.81TB respectively:
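The arithmetic behind these limits can be sketched in a few lines of Python. This is a rough sketch that assumes the 51 TB of provisioned space per 1 TB of primary data (five 5 TB daily volumes plus one 26 TB volume) scales linearly with the primary data set. The constant and function names are mine, not Microsoft's.

```python
# Assumed sizing, taken from the 1 TB reference example described above:
# five 5 TB daily volumes plus one 26 TB volume per 1 TB of primary data.
DAILY_TB_PER_PRIMARY_TB = 5 * 5       # 25 TB
ARCHIVE_TB_PER_PRIMARY_TB = 26        # Vol6 (weekly, monthly, annual)
TOTAL_TB_PER_PRIMARY_TB = DAILY_TB_PER_PRIMARY_TB + ARCHIVE_TB_PER_PRIMARY_TB

MAX_VOLUME_TB = 64                          # maximum single volume size
DEVICE_MAX_TB = {"8100": 200, "8600": 500}  # maximum allocatable capacity


def max_primary_by_volume_limit():
    """Largest primary data set whose Vol6 still fits in one 64 TB volume."""
    return MAX_VOLUME_TB / ARCHIVE_TB_PER_PRIMARY_TB


def max_primary_by_device_limit(model, restore_reserve_tb=MAX_VOLUME_TB):
    """Largest primary data set after reserving space to restore Vol6."""
    usable_tb = DEVICE_MAX_TB[model] - restore_reserve_tb
    return usable_tb / TOTAL_TB_PER_PRIMARY_TB


if __name__ == "__main__":
    print(f"Volume-size limit: {max_primary_by_volume_limit():.2f} TB")
    for model in DEVICE_MAX_TB:
        reserved = max_primary_by_device_limit(model)
        unreserved = max_primary_by_device_limit(model, restore_reserve_tb=0)
        print(f"{model}: {reserved:.2f} TB with reserve, "
              f"{unreserved:.2f} TB without")
```

The results land within about 0.01 TB of the figures quoted above; the small differences are presumably down to rounding.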
Although allocating 51TB of space to back up 1 TB of data resolves the tiering issue noted above, it significantly erodes the value proposition provided by StorSimple.
Deleting Storage Account associated with a StorSimple Volume Container disables ALL volumes on the device
I came across the following interesting situation with an 8600 StorSimple device running software version 3.0 (17759).
All iSCSI volumes from the StorSimple device in question are down at the Windows 2012 R2 host (about a dozen volumes in this case). If you create a new volume and present it to the Windows host, attempting to partition it (GPT) fails with the error message ‘disk not ready’.
Two additional observations were made:
- All cloud snapshots failed about a month before this incident.
- Software update 3.0 was applied to this device roughly the same time the incident occurred. The accompanying firmware update was not applied.
The timeline is as follows:
- Storage account was deleted prior to 6/28/2016 (not showing in Operation Logs which are kept for 90 days)
- 60+ days later (8/28/2016) all cloud snapshots started to fail. The error message suggests failure to access the storage account
- 82+ days later around 9/20/2016, users started to report volumes not available
- 9/24/2016, software update 3.0 was applied from the classic portal
- About a dozen volumes were provisioned from this device to one Windows 2012 R2 host. Volume Containers were associated with 3 Storage Accounts in the same subscription
- One of the 3 Storage Accounts (the one on the top of the above image) was missing. Apparently it was inadvertently deleted.
- Get-HCSSystem showed normal device condition
- iSCSI connectivity including iSCSI initiator and MPIO configuration were reviewed, tested and showed no issues.
- Ping (Test-Connection) and tracert.exe (Trace-HCSRoute) from each of the host iSCSI interfaces to each of the device iSCSI interfaces and back came back OK.
- Test-HCSMConnection showed no problems.
- Test-HcsStorageAccountCredential against the 2 existing Storage Accounts showed no problem.
Troubleshooting and Root Cause Analysis
I initially suspected that the Storage Account keys were changed without getting synchronized with the StorSimple Manager service, cutting off the device from its Storage Accounts. That would explain volume failure of all volumes and cloud snapshot failure. Both of which need to read and write to Storage Accounts.
- However, Operation Logs showed no events or records related to change of Storage Account keys associated with a StorSimple volume.
- Operation Logs showed no event/record of Storage Account deletion.
- Synchronizing the Storage Account keys of the 2 existing Storage Accounts did not solve the problem.
After I opened a ticket with Microsoft, they obtained a device Support Package and recognized that the device appeared to be constantly trying and failing to reach the volume whose Storage Account was deleted, which caused it to fail to serve the remaining unaffected volumes.
Steps to reproduce the problem
- Create 3 Storage Accounts
- Create 3 Volume Containers, each using a separate Storage Account
- Create 3 volumes, 1 in each volume container
- Present all volumes to a Windows 2012 R2 host, online, partition, format, copy test data
- Delete 1 Storage Account
- All 3 volumes will fail (inaccessible) after some time (see questions and answers section below about how much time)
Create a Storage Account with the same name as the one that was accidentally deleted, and synchronize the keys with the StorSimple Manager service
Although this solution frees the device to serve the volumes whose Storage Accounts have not been deleted, it does not restore the volume(s) whose data was lost when their Storage Account was deleted. Such volumes’ data needs to be restored from snapshot.
Questions and answers:
- If the Storage Account was deleted 60+ days before cloud snapshots started to fail, what prompted the cloud snapshot failure if it was caused by the Storage Account deletion?
- If the Storage Account was deleted 82+ days before volumes started to fail, what prompted the volume failure if it was caused by the Storage Account deletion?
What happened here is that, eventually, all the failed authentication attempts to the deleted storage account filled up the barrier queue (the queue to the cloud). Once it is filled beyond a certain point, it becomes completely stuck, and anything in line behind it is unable to get through. Once the barrier queue was completely overrun with these connection failures to the deleted storage account, all other cloud traffic was affected. This has the same effect as losing your cloud connection, and because this is a hybrid appliance, it can cause many different issues, such as the unavailable volumes and incomplete backups we saw here.
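To make the delayed failure easier to picture, here is a toy model of a single bounded queue shared by all cloud traffic. It is only a sketch of the behavior described above, not StorSimple's actual implementation: the queue capacity, the request rates, and the function name are all made up for illustration.

```python
from collections import deque


def simulate_shared_queue(capacity, ticks, stuck_per_tick=1, healthy_per_tick=3):
    """Toy model of one bounded queue shared by all cloud traffic.

    Requests to a deleted account never complete and stay queued.
    Healthy requests complete within the tick in which they are admitted.
    Returns the first tick at which a healthy request is rejected, or None.
    """
    queue = deque()
    for tick in range(1, ticks + 1):
        # Stuck requests are admitted while there is room and never leave.
        for _ in range(stuck_per_tick):
            if len(queue) < capacity:
                queue.append("stuck")
        # Healthy requests need a free slot to be admitted.
        for _ in range(healthy_per_tick):
            if len(queue) >= capacity:
                return tick
            queue.append("healthy")
        # Healthy requests complete and free their slots; stuck ones remain.
        queue = deque(item for item in queue if item == "stuck")
    return None


if __name__ == "__main__":
    first_failure = simulate_shared_queue(capacity=10, ticks=100)
    print(f"First healthy request rejected at tick {first_failure}")
```

In this model healthy traffic is unaffected for a while, then fails abruptly once the stuck entries have used up the shared capacity, which matches the long gap between the deletion and the outage. A separate queue per volume container, as recommended below, would keep the stuck entries from starving the other volumes.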
Recommendations to Microsoft
- Log events of Storage Account key changes in Operation Logs
- Currently 90 days’ worth of events show up in Operation Logs. It would be helpful if that retention period were configurable by the client on each subscription
- Make the device Support Package available to the client without the need for a key from Microsoft. In this case, information available only in the Support Package held the key to the workaround/solution.
- Update the device software so that loss of a Storage Account affects only its associated volumes not all volumes (Perhaps a separate queue per volume container instead of a queue per device)
- Update the StorSimple Manager service and/or Storage Account so that a Storage Account cannot be deleted if there’s an associated StorSimple Volume Container
This post lists StorSimple software versions, their release dates, and major new features for reference. Microsoft does not publish release dates for StorSimple updates. The release dates below are from published documentation and/or first hand experience. They may be off by up to 15 days.
- Version 4.0 (17820) – released 12 February 2017 – see release notes, and this post.
- Major new features: Invoke-HCSDiagnostics new cmdlet, and heatmap based restores
- Version 3.0 (17759) – released 6 September 2016 – see release notes, and this post.
- Major new features: The use of a StorSimple as a backup target (9/9/2016 it’s unclear what that means)
- Version 2.2 (17708) – see release notes
- Version 2.1 (17705) – see release notes
- Version 2.0 (17673) – released January 2016 – see release notes, this post, and this post
- Major new features: Locally pinned volumes, new virtual device 8020 (64TB SSD), ‘proactive support’, OVA (preview)
- Version 1.2 (17584) – released November 2015 – see release notes, this post, and this post
- Major new features: (Azure-side) Migration from legacy 5k/7k devices to 8k devices, support for Azure US GOV, support for cloud storage from other public clouds such as AWS/HP/OpenStack, update to latest API (this should allow us to manage the device in the new portal, yet this has not happened as of 9/9/2016)
- Version 1.1 (17521) – released October 2015 – see release notes
- Version 1.0 (17491) – released 15 September 2015 – see release notes and this post
- Version 0.3 (remains 17361) – released February 2015 – see release notes
- Version 0.2 (17361) – released January 2015 – see release notes and this post
- Version 0.1 (17312) – released October 2014 – see release notes
- Version GA (General Availability – 0.0 – Kernel 6.3.9600.17215) – released July 2014 – see release notes – This is the first Windows OS based StorSimple software after Microsoft’s acquisition of StorSimple company.
- As Microsoft acquired StorSimple company, StorSimple 5k/7k series ran Linux OS based StorSimple software version 184.108.40.206 – August 2012
This post describes one experience of updating a StorSimple 8100 series device from version 0.2 (17361) to the current (8 September 2016) version 3.0 (17759). It’s worth noting that:
- StorSimple 8k series devices that shipped in mid 2015 came with software version 0.2
- Typically, the device checks periodically for updates and when updates are found a note similar to this image is shown in the device/maintenance page:
- The device admin then picks the time when to deploy the updates, by clicking INSTALL UPDATES link. This kicks off an update job, which may take several hours
- This update method is known as updating StorSimple device using the classic Azure portal, as opposed to updating the StorSimple device using the serial interface by deploying the update as a hotfix.
- Released updates may not show up, in spite of scanning for updates manually several times:
The image above was taken on 9 September 2016 (update 3.0 is the latest at this time). It shows that no updates are available even after scanning for updates several times. The reason is that Microsoft deploys updates in a ‘phased rollout’, so they’re not available in all regions at all times.
- Updates are cumulative. This means that a device running version 0.2, for example, is upgraded directly to 3.0 without the need to manually update to any intermediate version first.
- An update may include one or both of the following 2 types:
- Software updates: This is an update of the core 2012 R2 server OS that’s running on the device. Microsoft identifies this type as a non intrusive update. It can be deployed while the device is in production, and should not affect mounted iSCSI volumes. Under the covers, the device controller0 and controller1 are 2 nodes in a traditional Microsoft failover cluster. The device uses the traditional Cluster Aware Update to update the 2 controllers. It updates and reboots the passive controller first, fails over the device (iSCSI target and other clustered roles) from one controller to the other, then updates and reboots the second controller. Again this should be a no-down-time process.
- Maintenance mode updates: These are updates to shared components in the device that require down time. Typically we see LSI SAS controller updates and disk firmware updates in this category. Maintenance mode updates must be done from the serial interface console (not Azure web interface or PowerShell interface). The typical down time for a maintenance mode update is about 30 minutes, although I would schedule a 2 hour window to be safe. The maintenance mode update steps are:
- On the file servers, offline all iSCSI volumes provisioned from this device.
- Log in to the device serial interface with full access
- Put the device in Maintenance mode: Enter-HcsMaintenanceMode, wait for the device to reboot
- Identify available updates: Get-HcsUpdateAvailability, this should show available Maintenance mode updates (TRUE)
- Start the update: Start-HcsUpdate
- Monitor the update: Get-HcsUpdateStatus
- When finished, exit maintenance mode: Exit-HcsMaintenanceMode, and wait for the device to reboot.
From the IT perspective, a WordPress web site requires:
- A web server like Microsoft IIS or Apache
- mySQL database
Migrating a WordPress website includes copying all its files/folder structure, and its mySQL database, and changing the wp-config.php file to point to the new mySQL database. These tasks could be complicated for a large site and may require specific skills related to web site configuration and mySQL database administration. This post goes over a very simple way to migrate a WordPress web site to Azure, using the Duplicator WordPress plugin.
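For comparison, the manual wp-config.php change that Duplicator spares you could be scripted roughly as follows. This is a minimal sketch, not part of the Duplicator workflow: the function name is hypothetical, and it only handles simple define() lines for the standard WordPress database constants (DB_NAME, DB_USER, DB_PASSWORD, DB_HOST) whose current values contain no quote characters.

```python
import re


def set_wp_db_settings(config_text, settings):
    """Return wp-config.php text with the given define() values replaced.

    `settings` maps constant names (for example DB_NAME) to new values.
    Only simple single-quoted or double-quoted define() lines are handled.
    """
    for name, value in settings.items():
        pattern = re.compile(
            r"define\(\s*(['\"])" + re.escape(name) + r"\1\s*,\s*(['\"]).*?\2\s*\)"
        )
        # Escape backslashes and single quotes for a PHP single-quoted string.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        replacement = "define( '" + name + "', '" + escaped + "' )"
        # A callable replacement avoids re-interpreting backslashes.
        config_text, count = pattern.subn(lambda _m, r=replacement: r, config_text)
        if count == 0:
            raise ValueError(f"{name} not found in config")
    return config_text


if __name__ == "__main__":
    sample = "define( 'DB_NAME', 'old_db' );\ndefine( 'DB_HOST', 'localhost' );\n"
    # The values below are placeholders, not real connection details.
    print(set_wp_db_settings(sample, {"DB_NAME": "MyWebAppDB",
                                      "DB_HOST": "new-host.example.com"}))
```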
- Add Duplicator WordPress Plugin
- Create New Package
- Create new Azure WebApp
- Add mySQL database
- Upload the Duplicator package to the new Azure WebApp
- Run the Duplicator Package Installer
If you don’t have it already, add the WordPress Duplicator Plugin. On the Plugins page click Add New
Search for Duplicator, click Install Now
Create New Package
Click on Duplicator link on the left, then click Create New
Accept the defaults and click Next to scan your WordPress site
Duplicator scans your current WordPress site
and displays the result like:
In this example, I have a couple of warnings about large site size, and some large files. I check the box and click Build.
Duplicator builds the package:
The package consists of an Installer (installer.php file) and Archive (.zip file). I download both to my desktop. The zip file contains all the WordPress site files and folder structure + a scripted copy of the associated mySQL database
Create new Azure WebApp
In the Azure Portal, click New/Web+Mobile/Web App
In the Web App blade I type in the new Web App name ‘MyWebApp407’ which must be unique under .azurewebsites.net. I pick the Azure subscription from the Subscription drop down menu. I choose to create a new Resource Group. I give it a name; ‘MyWebApp-RG’. I click the arrow to create a new Service Plan
In the App Service Plan blade (middle) I click Create New, type in MyWebApp-SP as the name, select East US, and accept the default Pricing tier of S1 Standard.
Finally, I click OK and Create
In a few minutes Azure completes the MyWebApp407 deployment
Add mySQL database
I browse under Resource Groups/MyWebApp-RG, and click Add
I search for mySQL, and select MySQL Database by ClearDB
and click Create
I give it a name ‘MyWebAppDB’ (avoid using other than alphanumeric characters in DB name), pick East US for the location, click the arrow and OK to accept the terms, and finally click Create
Click Refresh and note the new blank mySQL database:
Upload the Duplicator package to the new Azure WebApp
If you browse to the new web site now you may see a temporary page like:
First zip the 2 files downloaded from the Duplicator Package above into 1 file:
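If you prefer to script this step, a small Python sketch like the one below bundles the two downloaded files into one zip. The file names in the usage example are placeholders; use the names Duplicator generated for your package. The archive is stored without recompression on the assumption that the Duplicator archive is already compressed.

```python
import zipfile
from pathlib import Path


def bundle_files(paths, output_zip):
    """Write the given files into a single zip archive, flat at the root."""
    with zipfile.ZipFile(output_zip, "w", compression=zipfile.ZIP_STORED) as bundle:
        for path in paths:
            path = Path(path)
            # arcname drops the directory part so the files unzip side by side.
            bundle.write(path, arcname=path.name)
    return Path(output_zip)


if __name__ == "__main__":
    # Hypothetical file names for the installer and the archive.
    downloads = Path.home() / "Desktop"
    bundle_files(
        [downloads / "installer.php", downloads / "my_site_archive.zip"],
        downloads / "upload_bundle.zip",
    )
```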
Next browse to the KUDU page http://MyWebApp407.scm.azurewebsites.net
Click CMD under the Debug Console menu
Browse to d:\site\wwwroot
Drag the zip file from the prior step and drop it on the right side as shown below:
Azure will upload
and unzip the file
Run the Duplicator Package Installer
Browse to the installer.php file as in http://mywebapp407.azurewebsites.net/installer.php
You will see a page similar to
Back in the Azure Portal, click MyWebAppDB/Properties
Note the Database Name, Hostname, username, and password
Back in the installer.php screen, enter the required information as shown below:
Click ‘Connect and Remove All Data’, click ‘Test Connection’, check the box to acknowledge the notices, and click ‘Run Deployment’
Click OK to continue.
The installer extracts the Duplicator Package zip file restoring the file system and rebuilds the mySQL database from the script contained in the zip file
Accept the defaults and click Run Update
The Installer makes the selected changes to the WebApp config files
Follow the installer instructions to do final testing:
Step number 2 above is actually important. Clicking on the link next to ‘2.’ above will take you to the site admin login page:
Use the same credentials from the original site.
Adjust your permalinks setting to match the original site:
As a last step, once the site users have tested that everything looks OK, add a custom domain to the site and switch the domain DNS records to point to your new Azure site.
StorSimple Hybrid Cloud Storage array is an on-premise iSCSI SAN that extends seamlessly to the cloud. iSCSI volumes provisioned from a StorSimple device can be expanded but cannot be shrunk. So, a typical recommendation here is to start a volume small and grow it as needed. Growing a volume is a process that does not require down time. This script grows a StorSimple volume automatically based on set conditions of volume free space and a not-to-exceed value.
The input region of this script is the one that should be edited by the script user:
Similar to the script that monitors StorSimple Backups, the values for SubscriptionName, SubscriptionID, and StorSimpleManagerName variables can be found in the classic Azure Management Interface under your StorSimple Manager node Dashboard and Device pages:
and the RegistrationKey:
and the SSDeviceName (StorSimple Device Name)
The value for the SSVolumeName (StorSimple volume name) variable can be found under the device\volume container:
The Notify variable can be either $true or $false. This instructs the script whether or not to send an email notification when an expansion is triggered.
Similarly, the Expand variable can be either $true or $false. This instructs the script whether or not to expand the volume when an expansion is triggered. When Expand is set to $false (and Notify is set to $true) and an expansion is triggered, the script will send an email notification that an expansion is triggered but will not do the actual expansion.
ExpandThresholdGB and ExpandThresholdPercent variables are used by the script to identify the amount of free space on the volume below which a volume expansion is triggered. Only one of these variables is needed. If both are provided the script will use the larger value.
- Example 1: If the volume size is 100 GB, and the ExpandThresholdGB is set to 10 (GB) and the ExpandThresholdPercent is set to 15 (%), the script will trigger a volume expansion if the amount of free space is at or below 15 GB
- Example 2: If the volume size is 100 GB, and the ExpandThresholdGB is set to 10 (GB) and the ExpandThresholdPercent is set to 5 (%), the script will trigger a volume expansion if the amount of free space is at or below 10 GB
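The "larger value wins" rule in the two examples above can be expressed as a short Python sketch. The function names are mine and only mirror the behavior described here; the script itself is written in PowerShell.

```python
def effective_threshold_gb(volume_size_gb, threshold_gb=0, threshold_percent=0):
    """Return the larger of the absolute and the percentage-based threshold."""
    percent_as_gb = volume_size_gb * threshold_percent / 100
    return max(threshold_gb, percent_as_gb)


def expansion_triggered(volume_size_gb, free_gb, threshold_gb=0, threshold_percent=0):
    """Return True when free space is at or below the effective threshold."""
    threshold = effective_threshold_gb(volume_size_gb, threshold_gb, threshold_percent)
    return free_gb <= threshold


if __name__ == "__main__":
    # Example 1: 10 GB versus 15% of 100 GB, so the 15 GB threshold applies.
    print(effective_threshold_gb(100, threshold_gb=10, threshold_percent=15))
    # Example 2: 10 GB versus 5% of 100 GB, so the 10 GB threshold applies.
    print(effective_threshold_gb(100, threshold_gb=10, threshold_percent=5))
```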
Similarly, the ExpandAmountGB and ExpandAmountPercent variables instruct the script on how much to expand the volume once expansion is triggered. Only one of these variables is needed. If both are provided the script will use the larger value.
- Example 1: If the volume size is 100 GB, and the ExpandAmountGB is set to 10 (GB) and the ExpandAmountPercent is set to 15 (%), the script will expand the volume by 15 GB once expansion is triggered.
- Example 2: If the volume size is 100 GB, and the ExpandAmountGB is set to 10 (GB) and the ExpandAmountPercent is set to 5 (%), the script will expand the volume by 10 GB once expansion is triggered.
The value assigned to the NotToExceedGB variable is the maximum volume size that the script must not exceed. For example, if the prior 4 variables instruct the script to expand a 900 GB volume by an additional 200 GB and the NotToExceedGB variable is set to 1024 (1 TB), the script will expand the volume by only 124 GB, reaching the NotToExceedGB amount without exceeding it.
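Putting the expansion amount and the not-to-exceed cap together, the calculation looks roughly like this in Python. Again, this is a sketch of the described behavior with a function name of my own, not a port of the script.

```python
def expansion_amount_gb(volume_size_gb, amount_gb=0, amount_percent=0,
                        not_to_exceed_gb=None):
    """Return how many GB to add, capped so the volume stays within the limit."""
    # The larger of the absolute and the percentage-based amount wins.
    requested = max(amount_gb, volume_size_gb * amount_percent / 100)
    if not_to_exceed_gb is None:
        return requested
    # Never grow past the cap; a volume already at the cap grows by 0.
    headroom = max(0, not_to_exceed_gb - volume_size_gb)
    return min(requested, headroom)


if __name__ == "__main__":
    print(expansion_amount_gb(100, amount_gb=10, amount_percent=15))      # 15 GB
    print(expansion_amount_gb(100, amount_gb=10, amount_percent=5))       # 10 GB
    print(expansion_amount_gb(900, amount_gb=200, not_to_exceed_gb=1024)) # 124 GB
```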
DiskNumber and DriveLetter are values that the script user should obtain from the server’s Disk Management screen of the file server using this iSCSI volume:
As of the time of writing this post and script (1 April 2016), there’s no way to correlate a volume on a file server to a volume on a StorSimple device. For example, if you create 3 volumes of the same size on a StorSimple device and call them data1, data2, and data3, and present them to the same file server and format them with the same file system and block size, and use volume labels data1, data2, data3, there’s no way to tell if data1 on the StorSimple device is the volume labeled data1 on the file server. This is why it’s recommended to provision and format StorSimple volumes one at a time and use the same volume label when formatting the volume as the volume name on StorSimple. Long story short, it’s the user’s responsibility to:
- Make sure the DriveLetter and DiskNumber correspond to the SSVolumeName, and
- Update the DriveLetter and DiskNumber values if they change on the file server due to adding or removing volumes.
One last point here; if this iSCSI volume is presented to a Windows Failover cluster, this script must be run on the owner node.
LogFile is the path to where the script will log its actions – each log line will be time stamped. This could be on a network share.
EmailSender is the name and email address you wish to have the email notification appear to come from. For example: StorSimple Volume Size Monitor <DoNotReply@YourDomain.com>
$EmailRecipients = @(
    'Sam Boutros <email@example.com>'
    'Your Name <YourName@YourDomain.com>'
)
is an array that takes one or more email addresses in the format shown above.
SMTPServer is your SMTP relay server. You need to make necessary configuration/white-listing changes to allow your SMTP server to accept and relay SMTP email from the server running the script.
Sample script output:
and example of email notification:
Possible future enhancements to this script include:
- Rewrite the script as a function so that it can handle several volumes
- Rewrite the script to use PowerShell remoting, so that it does not have to run on the file server.
- Add functionality to detect if the target file server is a member of a failover cluster, and to automatically target the owner node.
By design, StorSimple hybrid cloud storage automatically tiers the oldest blocks from the local SSD tier down to the SAS tier as the SSD tier fills up (reaches ~80% capacity). In turn, it also tiers the oldest blocks from the SAS tier down to the Azure tier as the SAS tier fills up (reaches ~80% capacity).
This has the great benefits of:
- Automated tiering: This negates the need for data classification and the entirety of the efforts associated with that.
- Granular tiering: Tiering happens at the block level not at the file level. That’s 64KB for tiered volumes. So, a file can have some hot blocks in SSD, some older blocks in SAS, and some cold blocks that have been displaced all the way down to the Azure tier by warmer blocks (of the same or other files)
As of the time of writing this post (28 March 2016), tiering is fully automated and not configurable. The exception is ‘Locally Pinned Volume’ feature that comes with StorSimple software update 2.0 (17673) and above. A locally pinned volume loses the deduplication and compression features of a ‘Tiered Volume’, and always resides on the physical device. Currently no visibility is provided as to what tier a Locally Pinned Volume resides (SSD or SAS).
In the following scenario, take the example of an 8100 StorSimple device that has 15.8 TB local usable capacity (prior to deduplication and compression):
- Customer creates a handful of volumes (about 30 TB provisioned out of the 200 TB max allowed on the device) and migrates some 25 TB of data:
The above ‘Primary’ capacity graph shows about 25 TB of data as it appears to the SMB file servers that consume the iSCSI volumes, while the below ‘Device’ capacity graph shows that about 10 TB of that 25 TB resides on the same device for the same time period.
- Customer does an archive data dump, such as 2 TB of old backup or archive files. Any new data comes in as hot and in a ‘full’ device, it will displace older blocks to Azure. In this case, we have several TB of active production data that got inadvertently displaced to Azure. The following access pattern is observed:
- End user attempts to retrieve files. If the file blocks are in Azure, they will be retrieved, but to make room for them in the SSD tier, other blocks have to be tiered down to the full SAS tier, which in turn has to tier off blocks back to Azure to make room for the blocks coming down from SSD. So, a read operation has caused 2 tiering operations, including a write operation to Azure. This is described as a high latency IO operation.
- If this takes several minutes, and other users request files that RESIDE ENTIRELY LOCALLY on the device (described as low latency IO operations) while the device is handling the high latency IOs described above, it has been observed that those read requests are slowed down to a crawl as well. That is, high latency IOs appear to block low latency IOs.
- So in this scenario, a 2 TB archive data dump on an 8100 device with 10 TB on the device results in the entire 10 TB being shuffled out to Azure and back in, a few blocks at a time, until the 2 TB of archive data ends up in Azure, returning the device to its pre-incident status.
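The shuffling effect can be reproduced with a toy model. The sketch below collapses the SSD and SAS tiers into a single local tier with least-recently-used eviction to a cloud tier, which is my simplifying assumption and not how the device is actually implemented. The class name, block names, and sizes are all hypothetical.

```python
from collections import OrderedDict


class ToyTieredDevice:
    """Single local tier with least-recently-used eviction to a cloud tier."""

    def __init__(self, local_capacity_blocks):
        self.capacity = local_capacity_blocks
        self.local = OrderedDict()   # coldest block first
        self.cloud = set()
        self.cloud_reads = 0
        self.cloud_writes = 0

    def _touch(self, block):
        # Mark the block as hottest, then evict the coldest if over capacity.
        self.local[block] = True
        self.local.move_to_end(block)
        while len(self.local) > self.capacity:
            coldest, _ = self.local.popitem(last=False)
            self.cloud.add(coldest)
            self.cloud_writes += 1

    def write(self, block):
        self._touch(block)

    def read(self, block):
        if block in self.cloud:
            # Cold block: fetch it from the cloud before serving it.
            self.cloud.remove(block)
            self.cloud_reads += 1
        elif block not in self.local:
            raise KeyError(block)
        self._touch(block)


def run_archive_dump_demo():
    """Fill the device with a working set, dump 2 archive blocks, re-read all."""
    device = ToyTieredDevice(local_capacity_blocks=10)
    working_set = [f"w{i}" for i in range(10)]
    for block in working_set:
        device.write(block)
    for block in ("a0", "a1"):       # the archive data dump
        device.write(block)
    writes_after_dump = device.cloud_writes
    for block in working_set:        # users read their files again
        device.read(block)
    return writes_after_dump, device.cloud_reads, device.cloud_writes


if __name__ == "__main__":
    print(run_archive_dump_demo())
```

In this model the dump itself pushes only 2 blocks to the cloud, but the next pass over the 10-block working set misses on every read, and each miss evicts another block that is about to be needed. Only after the whole working set has cycled through the cloud do the 2 archive blocks finally end up there.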
In my opinion, this is a situation to be avoided at all costs. Once it occurs, the device may exhibit very slow performance that may last for weeks until the archive data dump has made its way through the rest of the data on the device to Azure.
Best practices recommended to avoid this scenario:
- Adhere to the recommended device use cases, particularly unstructured data/file shares. StorSimple is not meant for multi-terabyte high performance SQL databases, for example. Another example that is not recommended as a workload on StorSimple is large PST files. They’re essentially database files that are accessed frequently, and get scanned, indexed and accessed in their entirety.
- Do not run any workload or process that scans the active data set in its entirety. Anti-virus and anti-malware scans must be configured for incremental use or quick scans only, never for a full scan of all files on a volume. This applies to any process that may try to index, categorize, classify, or read all files on a volume. The exception is a process or application that reads only file metadata and properties, without opening the files and reading their contents. Reading metadata is OK because metadata always resides locally on the device.
- Carefully plan your data migration to StorSimple, putting emphasis on migrating the oldest data first. Robocopy can be a very helpful tool in the process.
I’m adding the following enhancements to my wishlist that I hope to see implemented by Microsoft in the next StorSimple software release:
- Resolving the core issue of high latency IO’s seeming to block/impede low latency IO’s
- More visibility into the device tiering metrics. Simply put, a storage admin needs to know when a StorSimple device is ‘full’ and is tiering off blocks from the primary data set to Azure. This knowledge is critical to avoid the situation described above. A metric of the amount of space available before the device is full, is even better to help provide predictability before reaching that point.
- ‘Cloud Pinned Volume’ feature would be very helpful. This should allow the StorSimple storage admin to provision an iSCSI volume that resides always in Azure and does not affect the device heat map.
StorSimple update 2.0 brings in a number of new exciting features such as Locally Pinned Volumes, OVA (On-premise Virtual Array), and enhanced SVA (StorSimple Virtual Array) model 8020 with 64TB capacity as opposed to 30 TB capacity of the prior model 1100 (now renamed 8010).
Update 2.0 is another intrusive update that requires down time. It includes LSI firmware update (KB 3121900), and SSD disk firmware update (KB 3121899).
Prior to the update, we can see the device running Software version 1.2 (17584)
This can also be seen from the serial or PowerShell interfaces by using the Get-HcsSystem cmdlet:
Ensure that both controllers have routable IPs
As suggested by the update instructions, we ensure that both controllers 0 and 1 have routable IPs before starting. To do so, I ping an external Internet host such as bing.com from each of the controllers’ fixed IPs:
From Controller 0 (the prompt must say ‘Controller0>’):
Test-HcsConnection -Source 10.1.2.86 -Destination bing.com
A positive response looks like:
From Controller 1 (the prompt must say ‘Controller1>’):
Test-HcsConnection -Source 10.1.2.87 -Destination bing.com
Phase I – Software update – start the update from the Azure Management Interface
In the classic portal, under the device Maintenance page, click Install Updates at the bottom:
Check the box and click the check mark:
Pre-upgrade checks are started:
And a Software Update Job is created:
Unlike prior updates, the 2.0 update starts on the passive controller:
Under the StorSimple Manager/Jobs page, we can see an update job in progress:
The controller being updated will reboot several times. During the update we’ll see unusual controller health and state information in the portal:
This is normal while the update is in progress.
A few hours later, we can see that the passive controller has been patched to version 2.0
and that a controller failover has occurred, where controller 1 is now active, and controller 0 (now passive) is being patched:
About 4.5 hours later, the first phase of the update is finished:
We can see the device in normal state and health under the Maintenance page:
Phase II – Maintenance Mode LSI firmware update
Unfortunately this is an intrusive update that requires down time, similar to phase 2 of StorSimple version 1.2 update posted here.
To summarize the steps of maintenance mode updates:
- Schedule a down-time window
- Offline all StorSimple iSCSI volumes on the file servers
- Run a manual cloud snapshot of all volumes
- On the Device serial (not PowerShell) interface, put the device in Maintenance mode:
Both controllers will reboot
- Patch controller 0:
Check update progress:
- After controller 0 is patched, repeat the last step on controller 1 to patch it
- Finally exit Maintenance mode:
Both controllers will reboot
The device is now back in normal operating condition, and we can online the volumes back on the file servers.