Troubleshooting StorSimple: high-latency IOs blocking low-latency IOs


By design, StorSimple hybrid cloud storage automatically tiers the oldest blocks from the local SSD tier down to the SAS tier as the SSD tier fills up (reaches ~80% capacity). In turn, it tiers the oldest blocks from the SAS tier down to the Azure tier as that tier fills up (reaches ~80% capacity).
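
To make the cascade concrete, here is a minimal PowerShell sketch of the behavior described above. The tier sizes, starting fill levels, and the 80% threshold are illustrative assumptions; the actual tiering algorithm is internal to the device.

```powershell
# Conceptual sketch of the tiering cascade described above. Tier sizes, starting
# fill levels, and the 80% threshold are illustrative assumptions only; the real
# tiering algorithm is internal to the device.

$tiers = [ordered]@{
    SSD   = @{ CapacityGB = 800;   UsedGB = 640 }     # assumed sizes; already at ~80%
    SAS   = @{ CapacityGB = 15000; UsedGB = 12000 }   # already at ~80% ('full' device)
    Azure = @{ CapacityGB = $null; UsedGB = 0 }       # effectively unlimited
}

function Add-NewBlocks {
    param([double]$SizeGB)

    # New (hot) blocks always land in the SSD tier first
    $tiers['SSD'].UsedGB += $SizeGB

    # Cascade: a tier above ~80% of capacity pushes its oldest blocks down,
    # which may push the next tier over its own threshold in turn.
    $order = @('SSD', 'SAS')
    for ($i = 0; $i -lt $order.Count; $i++) {
        $tier      = $tiers[$order[$i]]
        $threshold = 0.8 * $tier.CapacityGB
        if ($tier.UsedGB -gt $threshold) {
            $overflow     = $tier.UsedGB - $threshold
            $tier.UsedGB -= $overflow
            $nextTier     = if ($i -lt $order.Count - 1) { $order[$i + 1] } else { 'Azure' }
            $tiers[$nextTier].UsedGB += $overflow
        }
    }
}

# Example: a 2 TB archive dump arriving on an already 'full' device
Add-NewBlocks -SizeGB 2048
$tiers.GetEnumerator() | ForEach-Object {
    '{0,-5} {1,8:N0} GB used' -f $_.Key, $_.Value.UsedGB
}
```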

This has two great benefits:

  1. Automated tiering: This removes the need for data classification and all the effort associated with it.
  2. Granular tiering: Tiering happens at the block level (64 KB blocks for tiered volumes), not at the file level. A file can therefore have some hot blocks in SSD, some older blocks in SAS, and some cold blocks that have been displaced all the way down to the Azure tier by warmer blocks of the same or other files.
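
For a sense of the granularity, here is a quick back-of-envelope calculation of how many 64 KB blocks a few example data sets break into (the file sizes are arbitrary examples):

```powershell
# Back-of-envelope: how many 64 KB blocks various data sets break into.
# File sizes are arbitrary examples.
$blockSizeKB = 64

$examples = @(
    @{ Name = '10 MB document';    SizeKB = 10MB / 1KB }
    @{ Name = '5 GB PST file';     SizeKB = 5GB  / 1KB }
    @{ Name = '2 TB archive dump'; SizeKB = 2TB  / 1KB }
)

foreach ($e in $examples) {
    '{0,-18} ~{1:N0} blocks' -f $e.Name, [math]::Ceiling($e.SizeKB / $blockSizeKB)
}
```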

As of the time of writing this post (28 March 2016), tiering is fully automated and not configurable. The exception is the ‘Locally Pinned Volume’ feature that comes with StorSimple software update 2.0 (17673) and above. A locally pinned volume loses the deduplication and compression features of a ‘Tiered Volume’ and always resides on the physical device. Currently, no visibility is provided into which tier (SSD or SAS) a locally pinned volume resides on.

Consider the following scenario, using the example of an 8100 StorSimple device that has 15.8 TB of usable local capacity (prior to deduplication and compression):

  1. The customer creates a handful of volumes, provisioning about 30 TB out of the 200 TB maximum allowed on the device, and migrates some 25 TB of data:
    [Figure: ‘Primary’ capacity graph]
    The ‘Primary’ capacity graph above shows about 25 TB of data as it appears to the SMB file servers that consume the iSCSI volumes, while the ‘Device’ capacity graph below shows that only about 10 TB of that 25 TB resides on the device itself for the same time period.
    [Figure: ‘Device’ capacity graph]
  2. The customer then does an archive data dump, such as 2 TB of old backup or archive files. Any new data comes in as hot, and on a ‘full’ device it displaces older blocks to Azure. In this case, several TB of active production data were inadvertently displaced to Azure. The following access pattern is then observed:
    1. An end user attempts to retrieve files. If the file blocks are in Azure, they will be retrieved, but to make room for them in the SSD tier, other blocks have to be tiered down to the already full SAS tier, which in turn has to tier blocks off to Azure to make room for the blocks coming down from SSD. A single read operation has therefore caused two additional tiering operations, including a write to Azure. I describe this as a high-latency IO operation (see the sketch after this list).
    2. While the device is handling the high-latency IOs described above, which can take several minutes, it has been observed that other users requesting files that reside entirely locally on the device (low-latency IO operations) see their read requests slow to a crawl as well. That is, high-latency IOs appear to block low-latency IOs.
    3. So in this scenario, a 2 TB archive data dump onto an 8100 device holding 10 TB locally results in the entire 10 TB being shuffled out to Azure and back in, a few blocks at a time, until the 2 TB of archive data ends up in Azure and the device returns to its pre-incident state.
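
The read path in step 1 above can be sketched as follows. This is only a conceptual model of the cascade described in this post, not the device's actual code path: when both local tiers sit at their thresholds, servicing a single cold read causes two additional tiering operations, one of which is a write to Azure.

```powershell
# Conceptual model of the read path in step 1: on a device whose local tiers
# are both at their thresholds, one cold read from Azure triggers two more
# tiering operations, one of which is a write back to Azure.

function Invoke-ColdRead {
    param(
        [bool]$SsdAtThreshold = $true,
        [bool]$SasAtThreshold = $true
    )

    $operations = @('Fetch requested block from Azure into SSD')

    if ($SsdAtThreshold) {
        # SSD must shed its oldest block to make room for the fetched block
        $operations += 'Tier oldest SSD block down to SAS'

        if ($SasAtThreshold) {
            # SAS is also at its threshold, so it sheds a block to the cloud
            $operations += 'Tier oldest SAS block down to Azure (a cloud write)'
        }
    }

    return $operations
}

# One user read of a block that was displaced to Azure on a 'full' device:
Invoke-ColdRead | ForEach-Object { "  -> $_" }
```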

In my opinion, this is a situation to be avoided at all costs. Once it occurs, the device may exhibit very slow performance for weeks, until the archive data dump has made its way through the rest of the data on the device and out to Azure.
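
A rough back-of-envelope calculation shows why this can drag on for weeks. The 100 Mbps sustained device-to-Azure throughput below is purely an assumption; substitute your own rate:

```powershell
# Rough estimate of how long the post-incident shuffle can take: roughly the
# whole local data set moves out to Azure and back in, a few blocks at a time.
# The 100 Mbps sustained device-to-Azure throughput is an assumption; use your own.

$localDataTB = 10     # active data resident on the device in the scenario above
$linkMbps    = 100    # assumed sustained throughput between the device and Azure

$bytesMoved  = $localDataTB * 1TB * 2          # out to Azure and back in
$seconds     = ($bytesMoved * 8) / ($linkMbps * 1e6)

'{0:N0} days to shuffle {1} TB out and back at {2} Mbps (ignoring dedup, compression and competing IO)' -f ($seconds / 86400), $localDataTB, $linkMbps
```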

Best practices recommended to avoid this scenario:

  1. Adhere to the recommended device use cases, particularly unstructured data/file shares. StorSimple is not meant for multi-terabyte, high-performance SQL databases, for example. Another workload that is not recommended on StorSimple is large PST files: they are essentially database files that are accessed frequently and get scanned, indexed, and read in their entirety.
  2. Do not run any workload or process that scans the active data set in its entirety. Anti-virus and anti-malware scans must be configured for incremental or quick scans only, never for a full scan of all files on a volume. The same applies to any process that may try to index, categorize, classify, or read all files on a volume. The exception is a process or application that reads only file metadata and properties, without opening the files and reading their contents. Reading metadata is fine because metadata always resides locally on the device.
  3. Carefully plan your data migration to StorSimple, putting emphasis on migrating the oldest data first. Robocopy can be a very helpful tool in this process (see the example after this list).
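
As a concrete example for point 3, robocopy's /MINAGE and /MAXAGE age filters can be used to migrate data in passes, oldest first, so that the oldest blocks land on the device early and are the first to tier down to Azure. The paths, log locations, and pass boundaries below are placeholders to adapt to your environment:

```powershell
# Example staged migration with robocopy, oldest data first. Source/target paths,
# log paths and age boundaries are placeholders; adjust them to your environment.

$source = '\\oldfiler\share'
$target = 'S:\share'    # folder on an iSCSI volume presented by the StorSimple device

# Pass 1: files not modified in roughly the last two years
robocopy $source $target /E /COPY:DATSO /DCOPY:T /MINAGE:730 /R:2 /W:5 /LOG:C:\Logs\pass1.log

# Pass 2: files between 30 days and two years old
robocopy $source $target /E /COPY:DATSO /DCOPY:T /MINAGE:30 /MAXAGE:730 /R:2 /W:5 /LOG:C:\Logs\pass2.log

# Final pass at cutover: everything remaining, including the newest files
robocopy $source $target /E /COPY:DATSO /DCOPY:T /R:2 /W:5 /LOG:C:\Logs\pass3.log
```

Running the older passes well ahead of cutover gives the device time to tier that colder data down before the newer, hotter data arrives.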

I’m adding the following enhancements to my wishlist that I hope to see implemented by Microsoft in the next StorSimple software release:

  • Resolving the core issue of high-latency IOs appearing to block or impede low-latency IOs
  • More visibility into the device tiering metrics. Simply put, a storage admin needs to know when a StorSimple device is ‘full’ and is tiering blocks from the primary data set off to Azure. This knowledge is critical to avoiding the situation described above. A metric showing how much space remains before the device is full would be even better, providing predictability before that point is reached.
  • A ‘Cloud Pinned Volume’ feature would be very helpful. It would allow the StorSimple storage admin to provision an iSCSI volume that always resides in Azure and does not affect the device heat map.

5 responses

  1. Pingback: StorSimple 8k wish list (updated 3/28/2015) | Sam's Corner

  2. Gregory Beckowski

    Hi Sam

    When it comes to migrating data into SS, I am not sure there is a way to get around displacing data on the device to Azure when the device capacity is full. From what I can see, all data copied to SS is ‘new’ regardless of its age on the source. I believe the age is based on how long a block has been within a tier. So if the device capacity is fully utilized, data migrating into the SS device will displace SSD data first and then work its way up (or down, depending on your preference) through the tiers, displacing data as it goes.

    I believe we have seen the kind of blocking you describe above even when we have migrated data to the SS over a period of weeks. Once we cut over the users/apps and pointed them to a share backed by the SS, we see what looks like a “settling out” period where performance can be unpredictable. In our last case it seemed to take about a week to settle out. The difficulty, as you also point out, is that the metrics are not granular enough to correlate back to the user's experience. Knowing, for each file, what part is in what tier, or even what is in the device and what is in Azure, would be helpful.

    Greg

    April 12, 2016 at 4:14 pm

    • Hi,
      I have been working with StorSimple for several months now and have quickly seen the great benefits that it brings. I come from a NetApp storage background; sadly, NetApp does not take advantage of ‘auto-tiering’, which is why I am keen to gain a greater understanding of StorSimple.

      I am currently in the process of migrating huge amounts of data from our NetApp storage environment onto our StorSimple 8600. Most recently, after a considerably large CIFS share migration, users began to complain of poor performance. Having checked the appliance using the very rudimentary tools available (the portal…), I can see that the physical performance is good, but latency on ‘some’ of the volumes, both read and write (in particular read), is very high. It is for this reason that I have been reading these posts with great interest. It seems that I am now stuck between a ‘rock and a hard place’ in terms of what to do. I can either continue with the migrations, copying data across and ultimately flushing the SSDs/HDDs, and take the performance hit, or slow the migrations down, making them take much longer. I am following Microsoft's recommendation of including the ‘/MINAGE:30’ switch in my robocopy scripts. This is great as it copies all data older than 30 days, which is the bulk of what could be termed ‘archive’ data. It's just a shame that StorSimple can't recognize this data as ‘archive’ and automatically copy it directly to either the HDDs (is it SAS or SATA?) or to the cloud storage account.

      I have raised a ticket with Microsoft regarding the current performance and latency of our shares. I suspect they will come to a similar conclusion as I have: the more I copy, the more I ‘flush’, so auto-tiering is currently working hard… If, however, they do come up with a solution, I will post it here.

      Thank you for your very useful blog posts regarding StorSimple. Please keep writing them, as I will be following with great interest.

      Regards

      January 9, 2017 at 5:49 am

      • Justin, my suggestion at this point is to promptly stop data migration. You're on the way to this condition, if you haven't arrived at it already. Stopping the migration will allow the device time to properly tier the data based on the daily file access pattern.
        Trying to identify when it is safe to resume is not straightforward, and it will depend on many variables in your environment, including:
        – Your available egress bandwidth from the device to Azure
        – Daily data change rate in terms of number of 64KB blocks
        If users are currently using shares/files on the device, the changes they make count as new blocks (they land in SSD, which is accompanied by a corresponding block tiering down to SAS, which, if at 80% capacity, will also require a block tiering down to Azure => high latency).
        Newly migrated data is new to the device and is treated exactly like any other new blocks: it goes to SSD, knocks a block down to SAS, and possibly another block down to Azure.
        The recommendation is to finish all data migration, oldest data first, before using any files off the device.
        I can provide additional help if you'd like to engage us (VertitechIT) on a Time and Materials basis (hourly rate). This is typically done over Skype shared desktop session(s). The GMT time zone is not a problem. I can be reached at sboutros@vertitechit.com

        January 9, 2017 at 2:08 pm

  3. Pingback: StorSimple 8k series as a backup target? | Sam's Corner
