We recently helped a customer troubleshoot some SQL 2014 timeout issues in an Azure hosted environment. In this instance, the Azure VM was a well powered G-Series machine with a properly configured Storage Space stripe set of drives. The behavior seen was that SQL transactions would complete, but become no longer valid and then rollback. The apparent cause was I/O related, as SQL would process a request… but then fail to commit the change as the drive with the log files was timing out and apparently being rest as a device.
Investigating into the Event Logs we noticed a steady rash of Event 129 errors – with this description:
Reset to device, \Device\RaidPort1, was issued.
Digging deeper, we enabled diagnostics on the storage accounts hosting this VM’s virtual disks. The monitoring showed high latency (> 100ms) and an error rate of about 1% with no IOP throttling. After working we Microsoft support, we were directed to the following hotfix.
This hotfix is not a standard Windows Update patch and is designed to address Event 129 errors in an Azure hosted VMs only. The issue is specific to how Azure exposes storage to the VM and is not related to Hyper-V. Apparently, the hot fix increases / improves the tolerances that trigger this issue. Timeout can still happen in the underlying storage, but the VM’s I/O will no longer hard-stop (unless the timeouts are prolonged) and the error is no longer raised.