Distributed Universal Memory for Windows

*Disclaimer* This is only an idea I’ve been toying with. It doesn’t represent in any way, shape or form future Microsoft plans in regards to memory/storage management. This page will evolve over time as the idea is being refined and fleshed out.

**Last Updated 2017-03-23**

The general idea behind Distributed Universal Memory is to have a common memory management API that would achieve the following:
  • Abstract the application from the memory medium required to maintain application state, whether it’s volatile or permanent
  • Allow the application to express memory behavior requirements and not worry about the storage medium to achieve this
  • Support legacy constructs for backward compatibility
  • Enable new capabilities for legacy applications without code change
  • Give modern applications a simple surface to persist data
  • Enable scale-out applications to potentially use a single address space
  • Could potentially move towards a more microservice based approach instead of the current monolithic code base
  • Could easily leverage advances in hardware development such as disaggregation of compute and memory, or the use of specialized hardware such as FPGAs or GPUs to accelerate certain memory handling operations
  • Could be ported/backported to further increase the reach/integration capabilities. This memory management subsystem could be cleanly decoupled from the underlying operating system.
  • Allow the data to be optimally placed for performance, availability and ultimately cost

Availability Management

  • Process memory can be replicated either systematically or on demand
  • This would allow existing process memory to be migrated from one operating system instance to another transparently.
  • This could offer higher resiliency to process execution in the event of a host failure
  • This could also allow some OS components to be updated while higher level processes keep going (e.g. through redirected memory IO)

Performance Management
  • Required medium to achieve performance could be selected automatically using different mechanisms (MRU/LRU, machine learning, etc.)
  • Memory performance can be expressed explicitly by the application
    • By expressing its need, it would be easier to characterize/model/size the required system to support the application
    • Modern applications could easily specify how each piece of data it interacts with should be performing
  • Could provide multiple copies of the same data element for compute locality purposes, e.g. a distributed read cache
    • This distributed read cache could be shared between client and server processes if desired. This would provide a single cache mechanism, independent of the client/process accessing it.

Capacity Management
  • Can adjust capacity management techniques depending on performance and availability requirements
  • For instance, if data is rarely used by the application, several data reduction techniques could be applied such as deduplication, compression and/or erasure coding
  • If data access doesn’t require redundancy or locality and can tolerate the latency of RDMA, the data could be spread evenly across the Distributed Universal Memory Fabric
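
To make the "express behavior, not medium" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `MemoryRequirement` type, the `pick_medium` function and the latency/cost figures in `MEDIA` are illustrative placeholders, not part of any existing API and not measurements.

```python
from dataclasses import dataclass

# Hypothetical media catalogue:
# (name, typical latency in microseconds, persistent?, relative cost per GB).
# The figures are illustrative placeholders, not measurements.
MEDIA = [
    ("DRAM", 0.1, False, 10.0),
    ("NVMe SSD", 100.0, True, 1.0),
    ("HDD", 10_000.0, True, 0.1),
]


@dataclass
class MemoryRequirement:
    max_latency_us: float  # slowest access the application tolerates
    must_persist: bool     # state must survive a power loss


def pick_medium(req: MemoryRequirement) -> str:
    """Return the cheapest medium that satisfies the stated behavior."""
    candidates = [
        (cost, name)
        for name, latency, persistent, cost in MEDIA
        if latency <= req.max_latency_us and (persistent or not req.must_persist)
    ]
    if not candidates:
        raise ValueError("no single medium satisfies this requirement")
    return min(candidates)[1]


print(pick_medium(MemoryRequirement(max_latency_us=1.0, must_persist=False)))
print(pick_medium(MemoryRequirement(max_latency_us=500.0, must_persist=True)))
print(pick_medium(MemoryRequirement(max_latency_us=50_000.0, must_persist=True)))
```

The application only states latency and persistence needs; the placement decision (here, cheapest medium that qualifies) stays with the memory subsystem. A real implementation would also have to handle requirements that no single medium meets, for example by combining RAM with synchronous replication.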

High Level Cluster View


Here’s a high level diagram of what it might look like:

Let’s go over some of the main components.

Data Access Manager

The Data Access Manager is the primary interface layer to access data. The legacy API would sit on top of this layer in order to properly abstract the underlying subsystems in play.

  • Transport Manager
    • This subsystem is responsible for pushing/pulling data to and from remote hosts. All inter-node data transfers would occur over RDMA to minimize the overhead of copying data back and forth between nodes.
  • Addressing Manager
    • This would be responsible for providing a universal memory address for the data that’s independent of storage medium and cluster nodes.
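
As a rough illustration of what the Addressing Manager could do, the sketch below keeps a table from a stable universal address to the current physical location. The `AddressingManager` and `Location` names, the starting address and the node/medium labels are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    node: str    # cluster node currently holding the data
    medium: str  # e.g. "DRAM" or "NVMe"
    offset: int  # position within that medium


class AddressingManager:
    """Maps stable universal addresses to whatever physical location holds the data."""

    def __init__(self) -> None:
        self._table: dict[int, Location] = {}
        self._next_address = 0x1000  # arbitrary starting point for the sketch

    def allocate(self, location: Location) -> int:
        address = self._next_address
        self._next_address += 1
        self._table[address] = location
        return address

    def resolve(self, address: int) -> Location:
        return self._table[address]

    def relocate(self, address: int, new_location: Location) -> None:
        # The application-visible address does not change when the data moves.
        self._table[address] = new_location


manager = AddressingManager()
addr = manager.allocate(Location("node1", "DRAM", 0))
manager.relocate(addr, Location("node2", "NVMe", 4096))
print(hex(addr), manager.resolve(addr))
```

The point of the indirection is that the data can move between nodes or media while the address held by the application stays the same.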

Data Availability Manager

This component would be responsible for ensuring that the proper level of data availability and resiliency is enforced as per the policies defined in the system. It would be made of the following subsystems:

  • Availability Service Level Manager
    • The Availability Service Level Manager’s responsibility is to ensure the overall availability of data. For instance, it would act as the orchestrator that triggers the Replication Manager to ensure the data is meeting its availability objective.
  • Replication Manager
    • The Replication Manager is responsible for enforcing the right level of data redundancy across local and remote memory/storage devices. For instance, if 3 copies of the data must be maintained for a particular process/service/file/etc. across 3 different failure domains, the Replication Manager is responsible for ensuring this is the case as per the policy defined for the application/data.
  • Data History Manager
    • This subsystem ensures that the appropriate point-in-time copies of the data are maintained. Those data copies could be maintained in the system itself by using the appropriate storage medium, or they could be handed off to a third-party process if necessary (e.g. a standard backup solution). The API would provide a standard way to perform data recovery operations.
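
The "3 copies across 3 failure domains" example can be sketched as a simple placement function. This is only an illustration under assumed names (`place_replicas`, the node and rack labels); a real Replication Manager would also weigh capacity, load and latency.

```python
def place_replicas(nodes: dict[str, str], copies: int) -> list[str]:
    """Pick `copies` nodes, each from a different failure domain.

    `nodes` maps a node name to its failure domain (rack, site, etc.).
    """
    chosen: list[str] = []
    used_domains: set[str] = set()
    for node in sorted(nodes):  # sorted only to keep the result deterministic
        domain = nodes[node]
        if domain not in used_domains:
            chosen.append(node)
            used_domains.add(domain)
        if len(chosen) == copies:
            return chosen
    raise RuntimeError(
        f"policy needs {copies} failure domains, only {len(used_domains)} available"
    )


cluster = {"node1": "rackA", "node2": "rackA", "node3": "rackB", "node4": "rackC"}
print(place_replicas(cluster, 3))
```

If the cluster cannot provide enough distinct failure domains, the function fails loudly rather than silently placing two copies in the same domain, which is the behavior the policy is meant to rule out.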

Data Capacity Manager

The Data Capacity Manager is responsible for ensuring that enough capacity of the appropriate memory/storage type is available for applications, and for applying the right capacity optimization techniques to make the most of the physical storage capacity available. The following methods could be used:

  • Compression
  • Deduplication
  • Erasure Coding
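
To show how two of these techniques combine, here is a small Python sketch of a chunk store that deduplicates by content hash and compresses each unique chunk. It uses fixed-size chunks for simplicity, which is an assumption of this example and not necessarily how a production engine would chunk data; the `ChunkStore` class is hypothetical, and erasure coding is left out.

```python
import hashlib
import zlib


class ChunkStore:
    """Fixed-size chunking with hash-based deduplication and zlib compression."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}  # SHA-256 digest -> compressed chunk

    def write(self, data: bytes) -> list[str]:
        """Store data and return the list of chunk digests needed to rebuild it."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # deduplication: store each chunk once
                self.chunks[digest] = zlib.compress(chunk)  # compression
            recipe.append(digest)
        return recipe

    def read(self, recipe: list[str]) -> bytes:
        """Rebuild the original data from its chunk digests."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in recipe)


store = ChunkStore(chunk_size=4)
recipe = store.write(b"abcdabcdabcdxyz")
print(len(recipe), "chunks referenced,", len(store.chunks), "chunks stored")
```

The data is described by a list of digests (the recipe), so repeated chunks cost only one stored copy plus a reference.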

Data Performance Manager

The Data Performance Manager is responsible for ensuring that each application can access each piece of data at the appropriate performance level. This is accomplished using the following subsystems:

  • Latency Manager
    • This is responsible for placing the data on the right medium to ensure that each data element can be accessed at the right latency level. This can be determined either by pre-defined policy or by heuristics/machine learning that detect data access patterns beyond LRU/MRU methods.
    • The Latency Manager could also monitor whether a local process tends to access data that’s mostly remote. If that’s the case, instead of repeatedly incurring the network access penalty, the process could simply be moved to the remote host for better performance through data locality.
  • Service Level Manager
    • The Service Level Manager is responsible for managing the various applications’ expectations in regards to performance.
    • The Service Level Manager could optimize data persistence in order to meet its objective. For example, if the local non-volatile storage response time is unacceptable, it could choose to persist the data remotely and then trigger the Replication Manager to bring a copy of the data back locally.
  • Data Variation Manager
    • A subsystem could be conceived to persist a transformed state of the data. For example, if there’s an aggregation on a dataset, it could be persisted and linked to the original data. If the original data changes, the dependent aggregation variations could either be invalidated or updated as needed.
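
The "move the process to the data" decision made by the Latency Manager could start from something as simple as the sketch below. The `should_move_process` function and the 80% threshold are assumptions picked for illustration; a real decision would also account for the cost of moving the process itself.

```python
def should_move_process(local_bytes: int, remote_bytes_by_host: dict[str, int],
                        threshold: float = 0.8):
    """Return the host to move the process to, or None to leave it in place.

    The process moves only if a single remote host served at least
    `threshold` of all bytes it accessed during the observation window.
    """
    total = local_bytes + sum(remote_bytes_by_host.values())
    if total == 0:
        return None
    host, remote_bytes = max(remote_bytes_by_host.items(),
                             key=lambda item: item[1],
                             default=(None, 0))
    return host if remote_bytes / total >= threshold else None


print(should_move_process(10, {"hostB": 90}))               # mostly remote, one host
print(should_move_process(50, {"hostB": 30, "hostC": 20}))  # spread out, stay put
```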

Data Security Manager

  • Access Control Manager
    • This would create hard security boundary between processes and ensure only authorized access is being granted, independently of the storage mechanism/medium.
  • Encryption Manager
    • This would be responsible for the encryption of the data if required as per a defined security policy.
  • Auditing Manager
    • This would audit data access as per a specific security policy. The events could be forwarded to a centralized logging solution for further analysis and event correlation.
    • Data accesses could be logged in a highly optimized graph database in order to:
      • Build a map of what data is accessed by processes
      • Build a temporal map of how the processes access data
  • Malware Prevention Manager
    • Data access patterns can be detected in-line by this subsystem. For instance, it could notice that a process is trying to access credit card number data by matching the content against regular expressions. Third-party anti-virus solutions would also be able to extend the functionality at that layer.
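
As an example of the kind of in-line pattern detection mentioned for the Malware Prevention Manager, the sketch below looks for candidate card numbers with a regular expression and then filters them with the Luhn checksum to cut down false positives. The function names and the exact pattern are choices made for this illustration; a real detector would be considerably more careful.

```python
import re

# 13 to 16 digits, optionally separated by single spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, which payment card numbers are designed to satisfy."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(buffer: str) -> list[str]:
    """Return digit strings in `buffer` that look like valid card numbers."""
    hits = []
    for match in CARD_CANDIDATE.finditer(buffer):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# 4111 1111 1111 1111 is a widely used test number, not a real card.
print(find_card_numbers("card=4111 1111 1111 1111; order 1234567890123456"))
```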

Legacy Construct Emulator

The goal of the Legacy Construct Emulator is to provide legacy/existing applications with the same storage constructs they are using today to ensure backward compatibility. Here are a few examples of constructs that would be emulated under the Distributed Universal Memory model:

  • Block Emulator
    • Emulates the simplest construct, on which the higher level disk emulator is built
  • Disk Emulator
    • Based on the block emulator, simulates the communication interface of a disk device
  • File Emulator
    • For the file emulator, it could work in a couple of ways.
      • If the application only needs to have a file handle to perform IO and is fairly agnostic of the underlying file system, the application could simply get a file handle it can perform IO on.
      • Otherwise, it could get that through the file system that’s layered on top of a volume that makes use of the disk emulator.
  • Volatile Memory Emulator
    • The goal would be to provide the necessary construct to the OS/application to store its state data that would typically be stored in RAM.

One of the key things to note here is that even though all those legacy constructs are provided, the Distributed Universal Memory model has the flexibility to persist the data as it sees fit. For instance, even though the application might think it’s persisting data to volatile memory, the data might be persisted to an NVMe device in practice. The same principle would apply to file data; a file block might actually be persisted to RAM (similar to a block cache) that’s then replicated to multiple nodes synchronously to ensure availability, all of this potentially without the application being aware of it.
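
A block emulator in this model is little more than a fixed-size block interface placed in front of whatever store actually holds the data. The sketch below is hypothetical (the `BlockEmulator` class and its methods are invented for illustration) and uses a plain dictionary to stand in for the universal memory fabric.

```python
class BlockEmulator:
    """Presents fixed-size numbered blocks on top of an arbitrary backing store."""

    def __init__(self, block_count: int, block_size: int = 512, backing=None) -> None:
        self.block_count = block_count
        self.block_size = block_size
        # Any mapping works here; a dict stands in for the universal memory fabric.
        self._backing = {} if backing is None else backing

    def write_block(self, lba: int, data: bytes) -> None:
        self._check(lba)
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self._backing[lba] = bytes(data)

    def read_block(self, lba: int) -> bytes:
        self._check(lba)
        # Blocks that were never written read back as zeros, like a fresh disk.
        return self._backing.get(lba, bytes(self.block_size))

    def _check(self, lba: int) -> None:
        if not 0 <= lba < self.block_count:
            raise IndexError(f"block {lba} is out of range")


disk = BlockEmulator(block_count=8, block_size=4)
disk.write_block(2, b"DATA")
print(disk.read_block(2), disk.read_block(3))
```

The caller sees ordinary block semantics, while the backing object is free to place the bytes in RAM, on NVMe or on a remote node.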

Metrics Manager

The Metrics Manager captures, logs and forwards all data points in the system. Here’s an idea of what it could track:

  • Availability Metrics
    • Replication latency for synchronous replication
    • Asynchronous backlog size
  • Capacity Metrics
    • Capacity used/free
    • Deduplication and compression ratios
    • Capacity optimization strategy overhead
  • Performance Metrics
    • Latency
    • Throughput (IOPS, Bytes/second, etc.)
    • Bandwidth consumed
    • IO Type Ratio (Read/Write)
    • Latency penalty due to SLA per application/process
  • Reliability Metrics
    • Device error rate
    • Operation error rate
  • Security Metrics
    • Encryption overhead

High Level Memory Allocation Process

More details coming soon.

Potential Applications

  • Application high availability
    • You could decide to synchronously replicate a process’s memory to another host and simply start the application binary on the failover host in the event the primary host fails
  • Bring server cached data closer to the client
    • One could maintain a distributed coherent cache between servers and client computers
  • Move processes closer to data
    • Instead of having a process try to access data across the network, why not move the process to where the data is?
  • User State Mobility
    • User State Migration
      • A user state could move freely between a laptop, a desktop and a server (VDI or session host) depending on what the user requires.
    • Remote Desktop Service Session Live Migration
      • As the user session state memory is essentially virtualized from the host executing the session, it can be freely moved from one host to another to allow zero impact RDS Session Host maintenance.
  • Decouple OS maintenance/upgrades from the application
    • For instance, when the OS needs to be patched, one could simply move the process memory and execution to another host. This would avoid penalties such as buffer cache rebuilds in SQL Server, which can trigger a high number of IOPS on a disk subsystem in order to repopulate the cache based on popular data. For systems with a large amount of memory, this can be fairly problematic.
  • Have memory/storage that spans to the cloud transparently
    • Under this model it would be fairly straightforward to implement a cloud tier for cold data
  • Option to preserve application state on application upgrades/patches
    • One could swap the binaries to run the process while maintaining process state in memory
  • Provide object storage
    • One could layer object storage service on top of this to support Amazon S3/Azure Storage semantics. This could be implemented on top of the native API if desired.
  • Provide distributed cache
    • One could layer distributed cache mechanisms such as Redis using the native Distributed Universal Memory API to facilitate porting of applications to this new mechanism
  • Facilitate application scale out
    • For instance, one could envision a SQL Server instance to be scaled out using this mechanism by spreading worker threads across multiple hosts that share a common coordinated address space.
  • More to come…

Storage Spaces Performance Troubleshooting

Over the last few years, well, since Storage Spaces came to light, I have had to investigate performance issues from time to time. The goal of this post is to show you what you might typically look at while troubleshooting performance with Storage Spaces in a scale-out file server setup. I’ll refine this post over time as there is quite a bit to look at, but it should already give you a good start.

Let’s use a top down approach to troubleshoot an issue by looking at the following statistics in order:

  1. SMB Share
  2. Cluster Shared Volume
  3. If you are using an SSD write back cache, Storage Spaces Write Cache
  4. If you are using a tiered volume, Storage Spaces Tier
  5. Individual SSD/HDD

So the logic is, find which share/CSV is not performing and then start diving into the components of those by looking at the write cache, the tiers and finally the physical disks composing the problematic volume.

Let’s now dive into each of the high level categories.

SMB Share
  • Under the SMB Share Perfmon category:
    • Avg. Bytes/Read and Avg. Bytes/Write
      • What is the typical IO size on that share?
    • Avg. Sec/Read and Avg. Sec/Write
      • Is that share showing signs of high latency during IO?
    • Read Bytes/sec and Write Bytes/sec
      • How much data is flowing back and forth on that share?
Cluster Shared Volume

In order to capture the physical disk statistics for the particular CSV you are after, you will need to look up the Disk Number of the CSV in the Failover Clustering console. Once you have that, you can look at the following:

  • Physical Disk category
    • Avg. Disk Bytes/Read and Avg. Disk Bytes/Write
      • What is the typical IO size on that disk?
    • Avg. Disk Sec/Read and Avg. Disk Sec/Write
      • Is that disk showing signs of high latency during IO?
    • Disk Read Bytes/sec and Disk Write Bytes/sec
      • How much is being read/written per second on that disk? Is the volume performing as per the baseline already established?

You can also look at CSV specific counters such as:

  • Cluster CSV Volume Manager
    • IO Read Bytes/sec – Redirected/IO Write Bytes/sec – Redirected
      • Is there a lot of redirected IO happening?
    • IO Read – Bytes/sec/IO Write – Bytes/sec
      • How much data is being read/written per second on that CSV?

Microsoft also published a good article about CSV performance monitoring that you should look at: Cluster Shared Volume Performance Counter

Write Back Cache
  • Storage Spaces Write Cache category
    • Cache Write Bytes/sec
      • How much data is being written to the cache?
    • Cache Overwrite Bytes/sec
      • Is data being overwritten prior to being flushed to the final HDD destination?
Storage Spaces Tier
  • Storage Spaces Tier category
    • Avg. Tier Bytes/Read and Avg. Tier Bytes/Write
      • Is the size of the IO on each tier aligned with the physical sector size of the drives composing it?
    • Avg. Tier Sec/Read and Avg. Tier Sec/Write
      • Are your HDDs thrashing? Are your SSDs experiencing high latency spikes because of background operations on the drive (e.g. garbage collection)?
    • Tier Read Bytes/sec and Tier Write Bytes/sec
      • Is each tier providing the expected throughput?
Individual SSD/HDD

To see some basic latency and reliability metrics for your physical disks, you can use the following PowerShell command:

Get-VirtualDisk -FriendlyName VirtualDisk01 | Get-PhysicalDisk | Get-StorageReliabilityCounter | Select DeviceId,FlushLatencyMax,ReadLatencyMax,WriteLatencyMax,ReadErrorsCorrected,WriteErrorsCorrected,ReadErrorsTotal,WriteErrorsTotal

Just by looking at the latency statistics from the command above, you can get a feeling if a drive is misbehaving either by looking at the errors/errors corrected or simply by looking at the read/write latency metrics. For the latency metrics, note that those are the maximums since the last reset or reboot. If you want to get a feel of how they perform over time or under a specific load, I would recommend you use the Physical Disk metrics in Perfmon. You can use the same Physical Disk counters as what we used for the CSV, namely:

  • Physical Disk category
    • Avg. Disk Bytes/Read and Avg. Disk Bytes/Write
    • Avg. Disk Sec/Read and Avg. Disk Sec/Write
    • Disk Read Bytes/sec and Disk Write Bytes/sec

In order to match the counter disk instance found in Perfmon with the StorageReliability counters from the PowerShell above, simply use the number in the DeviceId column of the output. Make sure you connect Perfmon to the same cluster node as where you ran the PowerShell command as sometimes the disk numbers/deviceId do not match exactly between nodes.

By using those Perfmon counters, you can see whether the latency is high simply because the Bytes/Read or Bytes/Write are quite large, because your drives are overwhelmed while still performing as per specifications, or because there’s another underlying issue with those drives.
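
One quick way to put latency and IO size in perspective is to divide the average bytes per IO by the average seconds per IO. The Python sketch below does that; the sample values are made up for illustration and are not taken from a real system.

```python
def per_io_throughput_mb_s(avg_bytes_per_io: float, avg_sec_per_io: float) -> float:
    """Throughput implied by a single IO: bytes moved divided by the time it took."""
    return avg_bytes_per_io / avg_sec_per_io / 1_000_000


# Same 20 ms latency, very different IO sizes (illustrative values only).
small = per_io_throughput_mb_s(8 * 1024, 0.020)      # Avg. Disk Bytes/Read of 8 KiB
large = per_io_throughput_mb_s(1024 * 1024, 0.020)   # Avg. Disk Bytes/Read of 1 MiB

print(f"8 KiB IO at 20 ms: {small:.2f} MB/s per outstanding IO")
print(f"1 MiB IO at 20 ms: {large:.2f} MB/s per outstanding IO")
```

A 20 ms latency on 1 MiB IOs may simply reflect the transfer size, while the same latency on 8 KiB IOs suggests the drive is overwhelmed or has another problem worth investigating.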

Windows Server Deduplication Job Execution

While working with deduplication with volumes of around 30TB, we noticed the various job types were not executing as we were expecting. As a Microsoft MVP, I’m very fortunate to have direct access to the people with deep knowledge of the technology. A lot of the credit for this post goes to Will Gries and Ran Kalach from Microsoft who were kind enough to answer my questions as to what was going on under the hood. Here’s a summary of the things I learned in the process of understanding what was going on.

Before we dive in any further, it’s important to understand the various deduplication job types as they have different resource requirements.

Job Types (source)

  • Optimization
    • This job performs both deduplication and compression of files according to the data deduplication policy for the volume. After initial optimization of a file, if that file is then modified and again meets the data deduplication policy threshold for optimization, the file will be optimized again.
  • Scrubbing
    • This job processes data corruptions found during data integrity validation, performs possible corruption repair, and generates a scrubbing report.
  • GarbageCollection
    • This job processes previously deleted or logically overwritten optimized content to create usable volume free space. When an optimized file is deleted or overwritten by new data, the old data in the chunk store is not deleted right away. By default, garbage collection is scheduled to run weekly. We recommend to run garbage collection only after large deletions have occurred.
  • Unoptimization
    • This job undoes deduplication on all of the optimized files on the volume. At the end of a successful unoptimization job, all of the data deduplication metadata is deleted from the volume.

Another operation that happens in the background that you need to be aware of is the reconciliation process.  This happens when the hash index doesn’t fit entirely in memory. I don’t have the details at this point in time as to what exactly this is doing but I suspect it tries to restore index coherency across multiple index partitions that were processed in memory successively during the optimization/deduplication process.

Server Memory Sizing vs Job Memory Requirements

To understand the memory requirements of the deduplication jobs running on your system, I recommend you have a look at the event with ID 10240 in the Windows Event log Data Deduplication/Diagnostic. Here’s what it looks like for an Optimization job:

Optimization job memory requirements.

Volume C:\ClusterStorage\POOL-003-DAT-001 (\\?\Volume{<volume GUID>})
Minimum memory: 6738MB
Maximum memory: 112064MB
Minimum disk: 1024MB

Here are a few key things to consider about the job memory requirement and the host RAM sizing:

  • Memory requirements scale almost linearly with the total size of the data to dedup
    • The more data to dedup, the more entries in the hash index to keep track of
    • You need to meet at least the minimum memory requirement for the job to run for a volume
  • The more memory on the host running deduplication, the better the performance because:
    • You can run more jobs in parallel
    • The job will run faster because
      • You can fit more of the hash index in memory
        • The more of the index you fit in memory, the less reconciliation will have to be performed
        • If the job fits completely in memory, the reconciliation process is not required
  • If you use throughput scheduling (which is usually recommended)
    • The deduplication engine will allocate by default 50% of the host’s memory but this is configurable
    • If you have multiple volumes to optimize, it will try to run them all in parallel
      • It will try to allocate as much memory as possible for each job to accelerate them
      • If not enough memory is available, other optimization jobs will be queued
  • If you start optimization jobs manually
    • The job is not aware of other jobs that might get started. It will try to allocate as much memory as possible to run, potentially leaving future jobs on hold because not enough memory is available to run them

Job Parallelism

I touched on memory sizing in the previous section, but here’s a recap with additional information:

  • You can run multiple jobs in parallel
  • The dedup throughput scheduling engine can manage the complexity around the memory allocation for each volume for you
  • You need to have enough memory to at least meet the minimum memory requirement of each volume that you want to run in parallel
    • If all the memory has been allocated and you try to start a new job, it will be queued until resources become available
    • The deduplication engine tries to stick to the memory quota determined when the job was started
  • Each job is currently single threaded in Windows Server 2012 R2
    • Windows Server 2016 (currently in TP4) supports multiple threads per job, meaning multiple threads/cores can process a single volume
      • This greatly improves the throughput of optimization jobs
  • If you have multiple volumes residing on the same physical disks, it would be best to run only one job at a time for those specific disks to minimize disk thrashing

To put things into perspective, let’s look at some real world data:

| Volume | Min RAM (MB) | Max RAM (MB) | Min Disk (MB) | Volume Size (TB) | Unoptimized Data Size (TB) |
|---|---|---|---|---|---|
| POOL-003-DAT-001 | 6738 | 112064 | 1024 | 30 | 32.81 |
| POOL-003-DAT-002 | 7137 | 118900 | 1024 | 30 | 35.63 |
| POOL-003-DAT-004 | 7273 | 121210 | 1024 | 30 | 35.28 |
| POOL-003-DAT-006 | 4089 | 67628 | 1024 | 2 | 18.53 |

  • To run optimization in parallel on all volumes I need at least 25.2GB of RAM
  • To avoid reconciliation while running those jobs in parallel, I would need a whopping 419.8GB of RAM
    • This might not be too bad if you have multiple nodes in your cluster with each of them running a job
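
The two totals come straight from summing the minimum and maximum memory columns of the table. Here is the arithmetic as a short Python snippet; it divides by 1000 rather than 1024 to match the figures quoted in the post.

```python
# (volume, minimum memory in MB, maximum memory in MB) from event 10240 for each volume
jobs = [
    ("POOL-003-DAT-001", 6738, 112064),
    ("POOL-003-DAT-002", 7137, 118900),
    ("POOL-003-DAT-004", 7273, 121210),
    ("POOL-003-DAT-006", 4089, 67628),
]

total_min_mb = sum(min_mb for _, min_mb, _ in jobs)
total_max_mb = sum(max_mb for _, _, max_mb in jobs)

print(f"Minimum to run all jobs in parallel: {total_min_mb / 1000:.1f} GB")
print(f"Needed to avoid reconciliation:      {total_max_mb / 1000:.1f} GB")
```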

Monitoring Job

To keep an eye on the deduplication jobs, here are the methods I have found so far:

  • Get-DedupJob and Get-DedupStatus will give you the state of the jobs as they are running
  • Perfmon will give you information about the current throughput of the jobs currently running
    • Look at the typical physical disk counters
      • Disk Read Bytes/sec
      • Disk Write Bytes/sec
      • Avg. Disk sec/Transfer
    • You can get an idea of the saving ratio by looking at how much data is being read and how much is being written per interval
  • Event Log Data Deduplication/Operational
    • Event ID 6153 which will give you the following pieces of information once the job has completed:
      • Job Elapsed Time
      • Job Throughput (MB/second)
  • Windows Resource Monitor (if not on Server Core or Nano)
    • Filter the list of processes on fsdmhost.exe and look at the IO it’s doing on the files under the Disk tab
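
For the saving ratio estimate based on the disk counters, the idea is simply to compare bytes read with bytes written over the same interval. The helper below is a rough sketch with a hypothetical name and made-up sample values; it ignores metadata writes and any other IO on the same disks, so treat the result as a ballpark figure only.

```python
def estimated_savings(read_bytes_per_sec: float, write_bytes_per_sec: float) -> float:
    """Fraction of the data read that did not need to be written back."""
    if read_bytes_per_sec <= 0:
        raise ValueError("no read activity in this interval")
    return 1 - write_bytes_per_sec / read_bytes_per_sec


# Illustrative sample: 200 MB/s read, 50 MB/s written during the interval.
print(f"{estimated_savings(200e6, 50e6):.0%} estimated savings")
```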

Important Updates

I recommend that you have the following updates installed if you are running deduplication on your system as of 2016-04-18:

  • November 2014 Update Rollup
  • KB 3094197 (Will update dedup.sys to version 6.3.9600.18049)
  • KB 3138865 (Will update dedup.sys to version 6.3.9600.18221)
  • If you are running dedup in a cluster, you should install the patches listed in KB 2920151
    • That will simplify your life with Microsoft Support 😉

Final Thoughts

Deduplication is definitely a feature that can save you quite a bit of money. While it might not fit every workload, it has its use and benefits. In one particular cluster we use to store DPM backup data, we were able to save more than 27TB (and still counting as the jobs are still running). Windows Server 2016 will bring much improved performance and who knows what the future will bring, dedup support for ReFS? Who knows!

I will try to keep this post updated as I find out more information about deduplication operationally.


GEM Automation Feature – Run-DiskSpd

With the recent release of GEM Automation, I thought I would go through some of the latest additions. The first one is the inclusion of Run-DiskSpd in InfrastructureTesting\libStorageTesting.psm1.

Run-DiskSpd aims to make it easier to go through various IO tests using Microsoft’s favorite and latest IO generating tool, Diskspd.

Let’s go first through some of the Run-DiskSpd function features:

  • Pre-configured test suites
    • Stored in an XML config file called DiskSpd_TestCases.config
    • Provides pre-built test suites for various workloads. Currently the following are included:
      • SQL Server
        • Generic SQL Server workload
        • Resource Governor IOPS sizing
      • Exchange
      • General Purpose File Server
      • Quick
        • Small number of IO test patterns
      • Exhaustive
        • Wide range of tests; this takes a long time to run due to the large number of tests
    • The goal is to add test cases for other workloads based on community feedback
    • The test case configuration includes other settings but they are used by Test-VirtualDiskPerformance in libStorageSpacesTesting.psm1
  • Dynamic IO warmup before tests (Run-DiskSpdVolumeWarmup)
    • As suggested by Dan Lovinger, the test data file needs to be warmed up before running IO tests to ensure media stability. This is done as follows:
      • A first run is done to estimate the volume sequential throughput rate
      • Once the rate has been determined, the function calculates how long diskspd needs to run to perform a pass of large sequential IO on the test file based on its size.
      • Two passes are used to warm up the test data file
  • The function will then generate and run the following diskspd tests based on the test case configuration
    • Variations in IO block size
    • Variations in outstanding IO counts
    • Variations in threads per file counts
    • Random/Sequential
    • Read/Write
  • Once the tests have been run, Parse-DiskSpdXmlOut is run to extract the info from the XML report output of diskspd to convert it to a PSObject
  • Additional information can be persisted alongside the test results
    • For Storage Spaces
      • Disk resiliency type
      • Interleave size
      • Number of columns
      • Number of physical disks
      • Physical disk type
    • For file systems
      • File system name (e.g. NTFS or ReFS)
      • File system allocation unit size
  • The results of the test case are then written to the standard output which you can then pipe to a csv file
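
The warmup duration described above boils down to one division. The Python sketch below shows the arithmetic only; it is not the actual Run-DiskSpdVolumeWarmup code, and the `warmup_seconds` name and the sample numbers are assumptions for illustration.

```python
import math


def warmup_seconds(file_size_mb: float, sequential_mb_per_sec: float,
                   passes: int = 2) -> int:
    """Seconds of large sequential IO needed to cover the test file `passes` times."""
    if sequential_mb_per_sec <= 0:
        raise ValueError("throughput estimate must be positive")
    return math.ceil(passes * file_size_mb / sequential_mb_per_sec)


# Example: a 100 GB test file on a volume estimated at 500 MB/s sequential.
print(warmup_seconds(file_size_mb=102_400, sequential_mb_per_sec=500))
```

Rounding up keeps the second pass from being cut short when the division does not come out even.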

Here’s an example of how you would use Run-DiskSpd to perform test cases pertinent to SQL Server (I used ridiculously small values for this particular test on purpose):

Run-DiskSpd -testFileSizeInMB 100 -testDirectoryPath c:\temp -testDurationInSeconds 5 -testDescription "SQL Server test on local SSD" -testCaseName "SQLServerVM" | Export-Csv -NoTypeInformation .\diskspd_output.csv -Force

While the tests are executed, you can easily see the overall progress and the actual test being performed:


Once the tests have run and the results are saved in the CSV file, you can then use Storage Testing Analysis.xlsx to analyze the results of the tests. Here’s what it looks like at this point in time:


As you can see in the above screenshots, you can use the Excel slicers to filter out which test case you are after when you are reviewing the results.

While Run-DiskSpd is very much a work in progress, I hope it can already be useful to the community! I use this function to qualify new storage and baseline IO throughput. I also wrote a couple of additional functions that leverage this (Test-VirtualDiskPerformance and Set-SQLResourcePoolIOPS) which will be topics of future posts.

If you have feedback on how it can be improved, I invite you to comment on this post or submit items directly here on Codeplex. For those who are more inclined to use GUIs (that’s quite alright if you do), I invite you to check out DiskSpeed from my MVP peer Darryl van der Peijl, which also helps with running diskspd.

Intel 3D XPoint Non-Volatile Storage Memory – What just happened?

Just when you think you have a handle on things, some new technology comes up and throws you a curve ball. Let’s look at some of the early claims of 3D XPoint (pronounced 3D Crosspoint).


  • 8x to 10x greater density than DRAM
    • Think about DIMM (?) modules with 320GB of storage
    • Current technology is using 2 layers, more could be possible, further increasing this number
  • 100x faster than NAND latency wise
    • If I take a Samsung XS1715 NVMe drive, that would put it at 900ns read/250ns write
  • 10x more performance than NAND over PCIe/NVMe
    • With NVMe drives reaching about 3GB/s, that would mean 30GB/s per drive. Let that sink in for a moment.
    • Xeon SkyLake CPUs are supposed to have up to 48 PCIe 3.0 lanes at roughly 0.98GB/s each, yielding about 47.2GB/s… Hmm we might have a problem here!
      • For a 2-socket system, that would mean about 3 drives and you’re not getting that data out on the network at all as there are no PCIe lanes left
    • For each drive, you would need 240Gbps of network bandwidth
      • In practice you would have systems where 1 socket is dedicated to the 3D XPoint drive and another for the NIC (PCIe lane wise)
    • Compared to SAS SSDs (e.g. HGST 1600MM), those drives are about 30x faster
  • Writes have almost no impact on durability
    • This has always been a concern for NAND users
  • 3D XPoint is supposed to be affordable, what that means in practice remains to be seen
  • It will be available in 2016, which is pretty darn soon

As some of you may know, I’m currently looking at 100Gb networking for our next generation Storage Spaces Direct network. Keeping 3D XPoint in mind, HDR Infiniband (200Gbps) is almost a minimum, but that only comes out in 2017 or so. Looking at the Ethernet roadmap we see 200Gb in 2019-2020 or 400Gb in 2017 (estimated). 800Gb is targeted for 2020. Looks like we’re going to live with a network bottleneck for quite some time.

If you thought like me that NAND with NVMe really added a lot of pressure on the networking side of things, 3D XPoint brought that to a whole other level!

More info about the technology can be found here.

First Impressions of Storage Spaces Direct on a Nano Server Cluster

I finally found some time to experiment with two new things in Windows Server 2016 (TP2): Storage Spaces Direct and Nano Server. Overall the experience of getting that up and running was straightforward and as advertised. The performance was as good as expected even though I’m testing on a virtualized setup with modest hardware. The resources consumed in order to deliver that service are pretty darn minimal: between 300 and 550MB of RAM and between 1.5GB and 2GB of storage for the OS drives. Looking at the list of running processes gives you a good sense of how minimal that setup is and how secure and manageable it will be. One thing that takes some getting used to is how stripped down the OS is. For example, right now you have the cmdlet to change the IP configuration but not DNS. I’m sure that will be added in builds following TP2.


Here are a few things I’ve noticed while running diskspd against the Storage Spaces Direct cluster:

Memory Consumption
Memory usage jumps significantly on only one of the nodes, most likely because it’s the one targeted by the SMB connection. When running on Nano Server, going from 300MB to 900MB is a big deal. 😉
SMB Connection
For some reason, my SMB client was not being redirected to the proper node (at least according to my current understanding). To test this, I moved all of the cluster resources (disk, pool, SOFS resource) to the “first” node of the cluster. My client still had a connection established with the “second” node.
Share Continuous Availability
Another thing I noticed and didn’t expect: when a server that is part of the cluster is stopped/restarted, IO pauses for a few seconds. The first time I noticed this, I thought it was because I had restarted the server hosting some of the cluster resources, which causes a failover, which can take some time before everything comes back to normal. I then made sure every resource was running on the “first” cluster node and went ahead and restarted/stopped the “last” node. Every time I did this, an IO pause occurred. I suspect it’s because the node serving the share has some backend connections for block redirection to that specific node, and those need to be re-established/renegotiated with another node to serve those blocks. As those blocks are mirrored to other nodes in the cluster, I would have expected that process to be absolutely transparent.
When restarting
When stopping
This led me to an idea, perhaps an absurd one, but then again maybe not. In the case of a continuously available file share on an SOFS cluster running on top of Storage Spaces Direct, it might be beneficial to exchange a file allocation map when a file lock is acquired (perhaps a map limited to the intended file regions would work too). This way the client performing IO against that file would be able to recover on its own, and perhaps more quickly, from IO errors on a particular node. Another benefit would be the ability to perform IO from all the Storage Spaces Direct nodes in parallel for a single file. Right now, you basically have to spread the VHDX files of a VM across multiple SOFS shares, which are then owned by different cluster nodes, in order to achieve this effect, a cumbersome solution in my opinion.
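To make the allocation map idea a bit more concrete, here is a purely hypothetical sketch of what a client could do with such a map. Nothing here reflects how SMB, SOFS or Storage Spaces Direct actually work: the `Extent` type, the `plan_reads` function, the node names and the extent sizes are all made up for illustration. The map lists, for each file extent, the nodes holding a mirrored copy; the client picks the least-loaded healthy node per extent, which both spreads reads across nodes and lets it route around a node that went away.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Extent:
    """Hypothetical allocation map entry for one region of a file."""
    offset: int              # byte offset of the extent within the file
    length: int              # extent length in bytes
    nodes: Tuple[str, ...]   # nodes holding a mirrored copy of this extent


def plan_reads(allocation_map: Iterable[Extent],
               healthy_nodes: Set[str]) -> List[Tuple[int, str]]:
    """Pick one healthy node per extent, spreading bytes across nodes."""
    load = {node: 0 for node in healthy_nodes}
    plan = []
    for extent in allocation_map:
        candidates = [n for n in extent.nodes if n in load]
        if not candidates:
            raise RuntimeError(
                f"no healthy copy for extent at offset {extent.offset}")
        # Least-loaded candidate wins; ties go to the first listed node.
        target = min(candidates, key=lambda n: load[n])
        load[target] += extent.length
        plan.append((extent.offset, target))
    return plan


# Made-up map: three extents, each mirrored on two of three nodes.
allocation_map = [
    Extent(0, 256, ("node1", "node2")),
    Extent(256, 256, ("node2", "node3")),
    Extent(512, 256, ("node1", "node3")),
]

all_up = plan_reads(allocation_map, {"node1", "node2", "node3"})
node3_down = plan_reads(allocation_map, {"node1", "node2"})
print(all_up)
print(node3_down)
```

With all three nodes up, each extent is read from a different node. With `node3` marked as down, the client can recompute the plan locally from the map it already holds, without waiting for the share to fail over.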
I’ll keep digging on those new things that are Nano Server and Storage Spaces Direct, so far I’m pretty pleased with what I’m seeing!

Storage Spaces Direct Practical Scalability

When I was attending Ignite, I was pretty anxious to hear about what was called Storage Spaces Shared Nothing which is now called Storage Spaces Direct.

One of the main features of Storage Spaces Direct is the ability to leverage NVMe flash drives. Let’s look at some of the practical sizing calculations for this use case.

Maximum Storage Spaces Direct Cluster Nodes: 12 at this point in time, possibly more in the future
Maximum Throughput per NVMe drive: 3GB/s for a Samsung XS1715
Maximum NVMe drives per cluster node: 8 (AIC SB122 or Dell R920)
Maximum Throughput/Node: 24GB/s
Total Cluster Throughput: 288GB/s
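
The sizing figures above are simple multiplications; a short Python sketch makes the chain explicit, including the conversion from GB/s to Gbps used when sizing the network. The variable names are just illustrative.

```python
# Inputs taken from the sizing list above.
NODES = 12                  # maximum cluster nodes at this point in time
DRIVES_PER_NODE = 8         # NVMe drives per node
DRIVE_THROUGHPUT_GB_S = 3   # per-drive throughput, in GB/s

# Storage throughput, in gigabytes per second
node_throughput_gb_s = DRIVES_PER_NODE * DRIVE_THROUGHPUT_GB_S   # 24 GB/s
cluster_throughput_gb_s = NODES * node_throughput_gb_s           # 288 GB/s

# Same figures expressed in gigabits per second (8 bits per byte)
node_throughput_gbit_s = node_throughput_gb_s * 8                # 192 Gbps
cluster_throughput_gbit_s = NODES * node_throughput_gbit_s       # 2304 Gbps

print(node_throughput_gb_s, cluster_throughput_gb_s)
print(node_throughput_gbit_s, cluster_throughput_gbit_s)
```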

Those sure are impressive numbers, but if you want a non-blocking architecture for this, your network design needs to follow. Here’s what would be needed.

With each server pushing 192Gbps (24GB/s * 8), a cluster could potentially push 12 * 192Gbps, so about 2.3Tbps. In order to get to those numbers, from what I understand so far of Storage Spaces Direct, you would basically need to create one SOFS share per cluster node and spread your data across those 12 shares. Not too bad so far. Why the separate shares? From what I gathered, CSV block IO redirection is still around, which means that if your data slabs/chunks are distributed across the cluster, you have a problem (more on this later).

Let’s say you want to do something crazy like running SQL Server and leveraging that nice throughput. One does not simply partition a SQL database across 12 VHDX files. Why? That would mean all the IO of the cluster would need to converge back to the node running your SQL Server instance. 2.3Tbps incoming! Take cover!

So what kind of hardware would you need to support this, you ask? If you take 100GbE, which to my knowledge is the fastest interconnect generally available, you would need 23 ports per server running your workload or, more tangibly, 12 Mellanox ConnectX-4 dual-port cards. You only have 10 PCIe slots in your R920? How unfortunate! Forget about the AIC 1U machine! I would also like to see the guy doing the wiring job on those servers! Another small detail: at 24 ports per machine, you could only hook up 3 servers to a set of 2 switches with 36 ports each if you’re trying to do dual pathing for link redundancy. Hmm... don’t we need to connect 12 of those servers? Looks like an incoming sizing fail to me. Perhaps I’m wrong and I missed something… If that’s the case, let me know! Really, what’s needed TODAY is a 1Tbps link interconnect in order to make this setup practical, fully leverage your investment and future-proof your infrastructure a bit.
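
The port and switch arithmetic above can be checked the same way. This sketch assumes, as the text does, that the full 2.3Tbps converges on a single server over 100GbE ports. Rounding 23.04 up gives 24 ports, which is why 12 dual-port cards are needed. The constant names are illustrative only.

```python
import math

CLUSTER_GBIT_S = 12 * 192   # aggregate cluster throughput, about 2.3 Tbps
PORT_GBIT_S = 100           # one 100GbE port
PORTS_PER_CARD = 2          # dual-port NIC
PCIE_SLOTS = 10             # PCIe slots available in the server
SWITCHES = 2                # pair of switches for dual pathing
PORTS_PER_SWITCH = 36

# Ports needed on the server where all the IO converges (rounded up)
ports_needed = math.ceil(CLUSTER_GBIT_S / PORT_GBIT_S)      # 24
cards_needed = math.ceil(ports_needed / PORTS_PER_CARD)     # 12
fits_in_server = cards_needed <= PCIE_SLOTS                 # False

# How many such servers a pair of switches could connect
ports_per_server = cards_needed * PORTS_PER_CARD            # 24
servers_per_switch_pair = (SWITCHES * PORTS_PER_SWITCH) // ports_per_server

print(ports_needed, cards_needed, fits_in_server)
print(servers_per_switch_pair)
```

Under those assumptions the cards do not fit in 10 PCIe slots, and a pair of 36-port switches connects only 3 of the 12 servers.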

So the only viable option to use that kind of throughput would be to shard your database across multiple SQL Server instances. You could use AlwaysOn and create read-only replicas (up to 8 in SQL Server 2014, which is getting close to 12 but not quite there; I’m not sure if that’s bumped up in 2016), but then you are potentially wasting quite a bit of storage by duplicating your data around. Azure SQL on-premises running on Azure Stack/Azure Service Fabric could start to look very nice. Perhaps a NoSQL database or Hadoop would be more suitable, but for the vast majority of enterprises that’s a radical change for sure!

Food for thought! I’ll still let that one stir for a bit…