Storage Spaces Performance Troubleshooting

Ever since Storage Spaces came out, I have had to investigate performance issues from time to time. The goal of this post is to show you what you would typically look at while troubleshooting Storage Spaces performance in a Scale-Out File Server setup. I’ll refine this post over time as there is quite a bit to look at, but it should already give you a good start.

Let’s use a top-down approach to troubleshoot an issue by looking at the following statistics in order:

  1. SMB Share
  2. Cluster Shared Volume
  3. If you are using an SSD write back cache, Storage Spaces Write Cache
  4. If you are using a tiered volume, Storage Spaces Tier
  5. Individual SSD/HDD

So the logic is to find which share/CSV is not performing, then dive into its components: the write cache, the tiers and finally the physical disks that make up the problematic volume.

Let’s now dive into each of these high-level categories.

SMB Share
  • Under the SMB Share Perfmon category:
    • Avg. Bytes/Read and Avg. Bytes/Write
      • What is the typical IO size on that share?
    • Avg. Sec/Read and Avg. Sec/Write
      • Is that share showing signs of high latency during IO?
    • Read Bytes/sec and Write Bytes/sec
      • How much data is flowing back and forth on that share?
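
These counters can also be sampled from PowerShell with Get-Counter, which is handy for grabbing a quick trace during an incident. Here is a minimal sketch rather than a definitive collector: the function names, the 5-second interval and the 12 samples are arbitrary choices of mine, and it assumes the share counter set on your server matches the SMB*Share* wildcard (run Get-Counter -ListSet 'SMB*' to check the exact name if nothing comes back).

```powershell
# Keep only the IO size, latency and throughput counters listed above.
# Pure string filtering, so it can be checked without touching Perfmon.
function Select-ShareCounterPath {
    param([string[]]$Path)
    $pattern = '\\(Avg\. Bytes/(Read|Write)|Avg\. sec/(Read|Write)|(Read|Write) Bytes/sec)$'
    $Path | Where-Object { $_ -match $pattern }
}

# Hypothetical helper: samples the share counters on the local node.
function Get-ShareIoSample {
    param([int]$IntervalSec = 5, [int]$Samples = 12)
    # Assumption: the server-side set matches 'SMB*Share*'; client-side sets are skipped.
    $set = Get-Counter -ListSet 'SMB*Share*' |
        Where-Object { $_.CounterSetName -notlike '*Client*' } |
        Select-Object -First 1
    $paths = Select-ShareCounterPath -Path $set.Paths
    Get-Counter -Counter $paths -SampleInterval $IntervalSec -MaxSamples $Samples |
        ForEach-Object { $_.CounterSamples } |
        Select-Object Timestamp, InstanceName, Path, CookedValue
}

# Usage: Get-ShareIoSample | Sort-Object CookedValue -Descending | Select-Object -First 20
```

Sorting by CookedValue is only a quick way to surface the busiest or slowest share instances; for anything longer than a few minutes a Perfmon data collector set is probably the better tool.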

Cluster Shared Volume

To capture the physical disk statistics for the particular CSV you are after, you will need to look up the Disk Number of the CSV in the Failover Clustering console. Once you have that, you can look at the following:

  • Physical Disk category
    • Avg. Disk Bytes/Read and Avg. Disk Bytes/Write
      • What is the typical IO size on that disk?
    • Avg. Disk Sec/Read and Avg. Disk Sec/Write
      • Is that disk showing signs of high latency during IO?
    • Disk Read Bytes/sec and Disk Write Bytes/sec
      • How much is being read/written per second on that disk? Is the volume performing in line with the baseline you already established?
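
If you prefer to stay in PowerShell, the disk number can usually be resolved from the virtual disk itself. The sketch below assumes it runs on the node that currently owns the CSV (the disk may not be visible elsewhere) and that the friendly name you pass in is the space behind that CSV. Test-DiskInstance and Get-CsvDiskSample are hypothetical helper names, and the interval and sample count are arbitrary.

```powershell
# PhysicalDisk instances are named "<number>" or "<number> <letters>", e.g. "3" or "0 C:".
# Match on the leading number only so disk 1 does not also match disks 10, 11, ...
function Test-DiskInstance {
    param([string]$InstanceName, [int]$Number)
    $InstanceName -match "^$Number(\s|$)"
}

# Hypothetical helper: samples the Physical Disk counters for one virtual disk.
function Get-CsvDiskSample {
    param([string]$FriendlyName, [int]$IntervalSec = 5, [int]$Samples = 12)
    # Resolve the virtual disk to the disk number seen by this node.
    $number = (Get-VirtualDisk -FriendlyName $FriendlyName | Get-Disk).Number
    $counters = '\PhysicalDisk(*)\Avg. Disk Bytes/Read', '\PhysicalDisk(*)\Avg. Disk Bytes/Write',
                '\PhysicalDisk(*)\Avg. Disk sec/Read',   '\PhysicalDisk(*)\Avg. Disk sec/Write',
                '\PhysicalDisk(*)\Disk Read Bytes/sec',  '\PhysicalDisk(*)\Disk Write Bytes/sec'
    Get-Counter -Counter $counters -SampleInterval $IntervalSec -MaxSamples $Samples |
        ForEach-Object { $_.CounterSamples } |
        Where-Object { Test-DiskInstance -InstanceName $_.InstanceName -Number $number } |
        Select-Object Timestamp, Path, CookedValue
}

# Usage: Get-CsvDiskSample -FriendlyName VirtualDisk01
```

Filtering on the instance name after collection, instead of building a wildcard such as (1*) into the counter path, avoids accidentally pulling in disks 10 to 19 when you are after disk 1.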

You can also look at CSV specific counters such as:

  • Cluster CSV Volume Manager
    • IO Read Bytes/sec – Redirected and IO Write Bytes/sec – Redirected
      • Is there a lot of redirected IO happening?
    • IO Read – Bytes/sec and IO Write – Bytes/sec
      • How much data is being read/written per second on that CSV?
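
To check for redirected IO without typing out every counter name, you can let PowerShell discover them. This is a rough sketch: it assumes the Cluster CSV Volume Manager set is registered on the node you run it on and that the redirected counters all carry "Redirected" in their name, as the ones above do. The function names are hypothetical.

```powershell
# Keep only the counters that report redirected IO.
function Select-RedirectedPath {
    param([string[]]$Path)
    $Path | Where-Object { $_ -like '*Redirected*' }
}

# Hypothetical helper: returns only the samples where redirected IO was actually observed.
function Get-CsvRedirectedSample {
    param([int]$IntervalSec = 5, [int]$Samples = 12)
    $set = Get-Counter -ListSet 'Cluster CSV Volume Manager'
    $paths = Select-RedirectedPath -Path $set.Paths
    Get-Counter -Counter $paths -SampleInterval $IntervalSec -MaxSamples $Samples |
        ForEach-Object { $_.CounterSamples } |
        Where-Object { $_.CookedValue -gt 0 } |
        Select-Object Timestamp, InstanceName, Path, CookedValue
}

# Usage: Get-CsvRedirectedSample
```

An empty result means no redirected IO was seen during the sampling window on that node. If you do see sustained values, Get-ClusterSharedVolumeState should show per node whether the volume is in direct or redirected mode.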

Microsoft also published a good article about CSV performance monitoring that is worth a look: Cluster Shared Volume Performance Counter


Write Back Cache
  • Storage Spaces Write Cache category
    • Cache Write Bytes/sec
      • How much data is being written to the cache?
    • Cache Overwrite Bytes/sec
      • Is data being overwritten prior to being flushed to the final HDD destination?
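
One way to put a number on the overwrite question is to compare the two counters side by side. The sketch below is only an illustration: the ratio is my own derived indicator, not something Storage Spaces reports, and I am not assuming the overwrite bytes are strictly a subset of the write bytes, so treat it as a relative signal. The function names are hypothetical.

```powershell
# Overwrite bytes relative to write bytes; returns 0 when nothing is being written.
function Get-OverwriteRatio {
    param([double]$WriteBytesPerSec, [double]$OverwriteBytesPerSec)
    if ($WriteBytesPerSec -le 0) { return 0.0 }
    [math]::Round($OverwriteBytesPerSec / $WriteBytesPerSec, 3)
}

# Hypothetical helper: one row per write cache instance per sample.
function Get-WriteCacheSample {
    param([int]$IntervalSec = 5, [int]$Samples = 12)
    $counters = '\Storage Spaces Write Cache(*)\Cache Write Bytes/sec',
                '\Storage Spaces Write Cache(*)\Cache Overwrite Bytes/sec'
    Get-Counter -Counter $counters -SampleInterval $IntervalSec -MaxSamples $Samples | ForEach-Object {
        $ts = $_.Timestamp
        $_.CounterSamples | Group-Object InstanceName | ForEach-Object {
            $w = ($_.Group | Where-Object { $_.Path -like '*cache write bytes/sec' }).CookedValue
            $o = ($_.Group | Where-Object { $_.Path -like '*cache overwrite bytes/sec' }).CookedValue
            [pscustomobject]@{
                Timestamp            = $ts
                Instance             = $_.Name
                WriteBytesPerSec     = $w
                OverwriteBytesPerSec = $o
                OverwriteRatio       = Get-OverwriteRatio -WriteBytesPerSec $w -OverwriteBytesPerSec $o
            }
        }
    }
}

# Usage: Get-WriteCacheSample | Format-Table -AutoSize
```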

Storage Spaces Tier
  • Storage Spaces Tier category
    • Avg. Tier Bytes/Read and Avg. Tier Bytes/Write
      • Is the IO size on each tier aligned with the physical sector size of the drives composing that tier?
    • Avg. Tier Sec/Read and Avg. Tier Sec/Write
      • Are your HDDs thrashing? Are your SSDs experiencing high latency spikes because of background operations on the drive (e.g. garbage collection)?
    • Tier Read Bytes/sec and Tier Write Bytes/sec
      • Is each tier providing the expected throughput?
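
Latency spikes are easier to spot when you compare the average against the maximum over a sampling window. Below is a minimal sketch that does this for the tier latency counters; the 1-second interval, the 60 samples and the function names are assumptions of mine, and the counter paths are simply the names listed above.

```powershell
# Summarise latency samples (in seconds) as average and maximum in milliseconds.
function Measure-LatencyMs {
    param([double[]]$Seconds)
    $m = $Seconds | Measure-Object -Average -Maximum
    [pscustomobject]@{
        AvgMs = [math]::Round($m.Average * 1000, 2)
        MaxMs = [math]::Round($m.Maximum * 1000, 2)
    }
}

# Hypothetical helper: one summary row per tier instance and counter.
function Get-TierLatency {
    param([int]$IntervalSec = 1, [int]$Samples = 60)
    $counters = '\Storage Spaces Tier(*)\Avg. Tier Sec/Read',
                '\Storage Spaces Tier(*)\Avg. Tier Sec/Write'
    Get-Counter -Counter $counters -SampleInterval $IntervalSec -MaxSamples $Samples |
        ForEach-Object { $_.CounterSamples } |
        Group-Object Path |
        ForEach-Object {
            $s = Measure-LatencyMs -Seconds $_.Group.CookedValue
            [pscustomobject]@{ Counter = $_.Name; AvgMs = $s.AvgMs; MaxMs = $s.MaxMs }
        }
}

# Usage: Get-TierLatency | Sort-Object MaxMs -Descending
```

A tier whose MaxMs sits far above its AvgMs is seeing occasional spikes rather than uniformly slow IO, which is the pattern you would expect from background operations on an SSD.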

Individual SSD/HDD

To see some basic latency and reliability metrics for your physical disks, you can use the following PowerShell command:

Get-VirtualDisk -FriendlyName 'VirtualDisk01' | Get-PhysicalDisk | Get-StorageReliabilityCounter | Select-Object DeviceId,FlushLatencyMax,ReadLatencyMax,WriteLatencyMax,ReadErrorsCorrected,WriteErrorsCorrected,ReadErrorsTotal,WriteErrorsTotal

Just by looking at the output of the command above, you can get a feel for whether a drive is misbehaving, either from the error and corrected-error counts or simply from the read/write latency metrics. Note that the latency metrics are the maximums since the last reset or reboot. If you want to get a feel for how the drives perform over time or under a specific load, I would recommend using the Physical Disk counters in Perfmon. You can use the same Physical Disk counters we used for the CSV, namely:

  • Physical Disk category
    • Avg. Disk Bytes/Read and Avg. Disk Bytes/Write
    • Avg. Disk Sec/Read and Avg. Disk Sec/Write
    • Disk Read Bytes/sec and Disk Write Bytes/sec

To match the disk instance found in Perfmon with the storage reliability counters from the PowerShell command above, simply use the number in the DeviceId column of the output. Make sure you connect Perfmon to the same cluster node where you ran the PowerShell command, as the disk numbers/DeviceId values sometimes do not match exactly between nodes.
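
With many drives behind a virtual disk, it helps to narrow down which DeviceId values to chart first. The sketch below flags drives whose maximum latency stands out from their peers. The comparison against the median and the factor of 3 are arbitrary starting points of mine, not documented thresholds, and Find-LatencyOutlier is a hypothetical helper name.

```powershell
# Flag drives whose maximum latency is well above the median of their peers.
# Uses the lower median; the factor of 3 is an arbitrary starting point.
function Find-LatencyOutlier {
    param(
        [object[]]$Counter,
        [string]$Property = 'WriteLatencyMax',
        [double]$Factor = 3
    )
    $sorted = @($Counter | Sort-Object -Property $Property)
    if ($sorted.Count -eq 0) { return }
    $median = $sorted[[int][math]::Floor(($sorted.Count - 1) / 2)].$Property
    $Counter | Where-Object { $_.$Property -gt $median * $Factor }
}

# Usage (on a cluster node):
# $rc = Get-VirtualDisk -FriendlyName VirtualDisk01 | Get-PhysicalDisk | Get-StorageReliabilityCounter
# Find-LatencyOutlier -Counter $rc -Property ReadLatencyMax | Select-Object DeviceId, ReadLatencyMax
```

Since these values are maximums since the last reset, an outlier here only tells you where to look; the Perfmon counters for that DeviceId tell you whether the drive is still misbehaving now.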

With those Perfmon counters, you can tell whether the latency is high simply because the Bytes/Read or Bytes/Write are quite large, because your drives are overwhelmed while still performing as per specifications, or because there’s another underlying issue with those drives.
