StorSimple 8k series – troubleshooting ‘Controller Off’ issue
This post was originally published on 4 November 2015.
You may encounter a situation with StorSimple where one of the controllers shows as ‘off’ in the Azure Management Interface.
By default, all Service Administrators and co-administrators of the Azure subscription that manages the device receive email alerts for this and similar error conditions.
Attempting to bring the ‘off’ controller back online by changing the controller settings in Azure does not work. Azure StorSimple Manager issues the ‘on’ command, but the controller remains ‘off’.
The reason is that the ‘off’ controller is ‘unreachable’. This can also be seen by running the Get-HcsSystem PowerShell cmdlet on the device, either via the serial interface or via PowerShell remoting.
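As a rough sketch of the PowerShell remoting route: the IP address below is a hypothetical placeholder, and the SSAdmin account name and SSAdminConsole session configuration are assumptions based on how StorSimple 8000 series remoting is normally set up. It also assumes that remote management is enabled on the device and that the client trusts it. The commands are meant to be typed interactively, not run as a script.

```powershell
# Hypothetical value: replace with the management IP of your own device.
$deviceIp = "10.0.0.10"

# Assumption: SSAdmin is the device administrator account.
# The prompt asks for the device administrator password.
$cred = Get-Credential -UserName "$deviceIp\SSAdmin" -Message "Device administrator password"

# Assumption: SSAdminConsole is the restricted session configuration exposed by the device.
Enter-PSSession -ComputerName $deviceIp -Credential $cred -ConfigurationName SSAdminConsole

# Once inside the remote session, query the controller state.
Get-HcsSystem
```

The serial interface reaches the same restricted console without depending on the network path to the device.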
The two controllers in the StorSimple 8k device are two computers running Windows Server 2012 R2 ‘Storage Edition’, which seems to be a Core-only version stripped of unneeded Windows components and fitted with the HCS (Hybrid Cloud Storage) PowerShell module. It appears to be locked down via JEA (Just Enough Administration), exposing only a very small subset of PowerShell cmdlets. The two controllers (computers) are set up in a failover cluster. In this case, the controller is actually ‘on’, and you can see lights on its ports on the back of the device. However, it is not reachable by the cluster. The repair hinges on bringing the cluster back in sync.
I also tried rebooting the active controller (Restart-HCSController -Force) to initiate a failover to the passive controller, hoping that might bring it back online, but that did not work. The active controller rebooted, but the ‘off’ controller remained ‘off’. In normal operation, rebooting the active controller initiates a controller failover: the passive controller becomes active while the previously active controller reboots. This should be a seamless process with no downtime.
Connecting to Controller 1 via the serial interface showed its status as ‘Recovery’.
Get-HCSSystem on Controller 1 shows:
Since we don’t have access to the PowerShell cmdlets needed to bring the cluster back in sync, we have to open a support ticket with Microsoft via this link. You’ll get a confirmation message similar to:
In 2015, Microsoft made significant improvements to StorSimple support. I typically get a call back from a knowledgeable engineer within 30 minutes.
We will need access to the serial interface of Controller 1 to run the PowerShell cmdlets that bring it back online. This is exactly why it’s a recommended best practice to have remote access to the serial interfaces of BOTH controllers at all times. A device like the Lantronix EDS2100 2-Port Secure Device Server (CY1733) would do the job.
To fix this issue, you simply need to run the Exit-HCSRecoveryMode cmdlet on Controller 1, which is in Recovery mode. After the controller reboots, all should be back in order.
Unfortunately, the Exit-HCSRecoveryMode cmdlet is only made available to Microsoft support. To access it, run the Enable-HCSSupport cmdlet and provide the encryption key to Microsoft support over the phone or in the remote session. They decrypt it and give you a password. Then run the Enter-HCSSupport cmdlet and enter that password to gain access to all available PowerShell cmdlets, including the Exit-HCSRecoveryMode cmdlet that we need in this case.
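Put together, the sequence on Controller 1 looks roughly like the sketch below. The cmdlet names are the ones used in this post. Any parameters or prompts they take are not shown, since those depend on what the support engineer asks for, and the final status check is my own suggested addition.

```powershell
# Run on Controller 1 (the controller in Recovery mode) via its serial interface.
# Cmdlet names are as written in this post; parameters and prompts are omitted
# because they are not covered here.

Enable-HCSSupport      # step 1: produces the encryption key to pass to Microsoft support
Enter-HCSSupport       # step 2: enter the password that support gives you
Exit-HCSRecoveryMode   # step 3: takes the controller out of Recovery mode; it then reboots

# Suggested check (not part of the original steps): after the reboot,
# confirm that the controller is no longer reported as unreachable.
Get-HcsSystem
```

Steps 1 and 2 only work with a support engineer on the line, so have the ticket open before you start.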