Measuring GPU Utilization in Remote Desktop Services

I recently spent some time experimenting with GPU Discrete Device Assignment in Azure using the NV* series of VMs. Since we had noticed that Internet Explorer was consuming quite a bit of CPU resources on our Remote Desktop Services session hosts, I wondered how much of an impact a GPU would have on CPU usage by offloading graphics rendering to specialized hardware. We ran experiments with Windows Server 2012 R2 and Windows Server 2016. While Windows Server 2012 R2 does deliver some level of hardware acceleration for graphics, Windows Server 2016 provides a more complete experience through better support for GPUs in an RDP session.

In order to enable hardware acceleration for RDP, you must do the following in your Azure NV* series VM:

  1. Download and install the latest driver recommended by Microsoft/NVidia from here
  2. Enable the Group Policy setting Administrative Templates\Windows Components\Remote Desktop Services\Remote Desktop Session Host\Remote Session Environment\Use the hardware default graphics adapter for all Remote Desktop Services sessions as shown below:
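
If you would rather script that setting (handy when building many session hosts), it can also be pushed through the registry. This is a minimal sketch; bEnumerateHWBeforeSW is, to the best of my knowledge, the value backing that policy, so double-check it against a machine where the policy was set through gpedit:

# Policy: Use the hardware default graphics adapter for all RDS sessions
# Assumed backing value: bEnumerateHWBeforeSW = 1 (verify on your build)
$key = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services'
New-Item -Path $key -Force | Out-Null
New-ItemProperty -Path $key -Name 'bEnumerateHWBeforeSW' -Value 1 -PropertyType DWord -Force | Out-Null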

To validate the acceleration, I used a couple of tools to generate and measure the GPU load. For load generation I used the following:

  • Island demo from Nvidia which is available for download here.
    • This scenario worked fine in both Windows Server 2012 R2 and Windows Server 2016
    • Here’s what it looks like when you run this demo (don’t mind the GPU information displayed, that was from my workstation, not from the Azure NV* VM):
  • Microsoft Fish Tank page, which leverages WebGL in the browser, which is in turn accelerated by the GPU when possible
    • This proved to be the scenario that differentiated Windows Server 2016 from Windows Server 2012 R2. Only under Windows Server 2016 could a high frame rate and low CPU utilization be achieved. When this demo runs using only the software renderer, I observed CPU utilization close to 100% on a fairly beefy NV6 VM with 6 cores, and that was just from running a single instance of the test.
    • Here’s what FishGL looks like:

To measure the GPU utilization, I ended up using the following tools: Windows Performance Recorder/Windows Performance Analyzer and Process Explorer.

In order to do a capture with Windows Performance Recorder, make sure that GPU activity is selected under the profiles to be recorded:
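
For reference, the capture can also be driven from the command line. This is a minimal sketch that assumes WPR's built-in GPU profile is available under that name on your build (check with wpr -profiles):

# Start a trace with the built-in GPU profile, logging to a file
wpr -start GPU -filemode
# ... run the workload (FishGL, the Island demo, etc.) ...
# Stop the trace and save it for analysis in Windows Performance Analyzer
wpr -stop C:\Traces\gpu-activity.etl "GPU utilization trace"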

Here’s a recorded trace of the GPU utilization from the Azure VM while running FishGL in Internet Explorer that’s being visualized in Windows Performance Analyzer:

As you can see in the WPA screenshot above, quite a few processes can take advantage of the GPU acceleration.

Here’s what it looks like in Process Explorer when you’re doing live monitoring. As you can see below, you can identify which process is consuming GPU resources. In this particular screenshot, you can see what Internet Explorer consumes while running FishGL on my workstation.
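
If you prefer a scriptable live view, more recent builds also expose per-process GPU performance counters; this is a rough sketch and assumes the GPU Engine counter set exists on your build:

# Sample the per-process GPU engine counters (if present on this build)
# and show the ten busiest engine instances.
Get-Counter '\GPU Engine(*)\Utilization Percentage' |
    Select-Object -ExpandProperty CounterSamples |
    Sort-Object CookedValue -Descending |
    Select-Object -First 10 InstanceName, CookedValue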

Windows Server 2016 takes great advantage of an assigned GPU to offload compute intensive rendering tasks. Hopefully this article helped you get things started!

Distributed Universal Memory for Windows

*Disclaimer* This is only an idea I’ve been toying with. It doesn’t represent in any way, shape or form future Microsoft plans in regards to memory/storage management. This page will evolve over time as the idea is being refined and fleshed out.

**Last Updated 2017-03-23**

The general idea behind Distributed Universal Memory is to have a common memory management API that would achieve the following:

General

  • Abstract the application from the memory medium required to maintain application state, whether it’s volatile or permanent
  • Allow the application to express memory behavior requirements without worrying about the storage medium used to meet them (a purely hypothetical sketch of this follows the list below)
  • Support legacy constructs for backward compatibility
  • Enable new capabilities for legacy applications without code changes
  • Give modern applications a simple surface to persist data
  • Enable scale-out applications to potentially use a single address space
  • Could potentially move towards a more microservice-based approach instead of the current monolithic code base
  • Could easily leverage advances in hardware development, such as the disaggregation of compute and memory, or the use of specialized hardware such as FPGAs or GPUs to accelerate certain memory handling operations
  • Could be ported/backported to further increase the reach/integration capabilities, as this memory management subsystem could be cleanly decoupled from the underlying operating system
  • Allow the data to be optimally placed for performance, availability and ultimately cost
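
To make the "express memory behavior requirements" idea a bit more concrete, here is a purely hypothetical sketch of what such a declaration could look like. None of these cmdlets or parameter names exist today; they only illustrate the kind of surface the API could expose:

# Hypothetical only: a declarative policy the application hands to the
# Distributed Universal Memory fabric instead of picking a storage medium itself.
$memoryPolicy = @{
    Name              = 'OrderCacheRegion'
    Durability        = 'Persistent'   # or 'Volatile'
    Copies            = 3              # replicas across distinct failure domains
    TargetReadLatency = '50us'
    PointInTimeCopies = 'Hourly'
}
# Imaginary cmdlet, shown only to illustrate the shape of the call:
# New-DumRegion @memoryPolicy -SizeGB 64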

Availability Management

  • Process memory can be replicated either systematically or on demand
  • This would allow existing process memory to be migrated from one operating system instance to another transparently
  • This could offer higher resiliency for process execution in the event of a host failure
  • This could also allow some OS components to be updated while higher level processes keep going (i.e. redirected memory IO)

Performance Management

  • The medium required to achieve the desired performance could be selected automatically using different mechanisms (MRU/LRU, machine learning, etc.)
  • Memory performance can be expressed explicitly by the application
    • By expressing its needs, it would be easier to characterize/model/size the required system to support the application
    • Modern applications could easily specify how each piece of data they interact with should perform
  • Could provide multiple copies of the same data element for compute locality purposes, i.e. a distributed read cache
    • This distributed read cache could be shared between client and server processes if desired. This would enable a single cache mechanism independent of the client/process accessing it.

Capacity Management

  • Capacity management techniques can be adjusted depending on performance and availability requirements
  • For instance, if data is rarely used by the application, several data reduction techniques could be applied, such as deduplication, compression and/or erasure coding
  • If data access doesn’t require redundancy/locality and tolerates the time for RDMA, the data could be spread evenly across the Distributed Universal Memory Fabric

High Level Cluster View

Components

Here’s a high-level diagram of what it might look like:

Let’s go over some of the main components.

Data Access Manager

The Data Access Manager is the primary interface layer for accessing data. The legacy API would sit on top of this layer in order to properly abstract the underlying subsystems in play.

  • Transport Manager
    • This subsystem is responsible for pushing/pulling data to and from remote hosts. All inter-node data transfers would occur over RDMA to minimize the overhead of copying data back and forth between nodes.
  • Addressing Manager
    • This would be responsible for giving data a universal memory address that’s independent of the storage medium and cluster nodes.

Data Availability Manager

This component would be responsible for ensuring the proper level of data availability and resiliency is enforced as per the policies defined in the system. It would be made up of the following subsystems:

  • Availability Service Level Manager
    • The Availability Service Level Manager’s responsibility is to ensure the overall availability of data. For instance, it would act as the orchestrator that triggers the Replication Manager to ensure the data is meeting its availability objective.
  • Replication Manager
    • The Replication Manager is responsible for enforcing the right level of data redundancy across local and remote memory/storage devices. For instance, if 3 copies of the data must be maintained for a particular process/service/file/etc. across 3 different failure domains, the Replication Manager is responsible for ensuring this is the case as per the policy defined for the application/data.
  • Data History Manager
    • This subsystem ensures that the appropriate point-in-time copies of the data are maintained. Those data copies could be maintained in the system itself by using the appropriate storage medium, or they could be handed off to a third-party process if necessary (i.e. a standard backup solution). The API would provide a standard way to perform data recovery operations.

Data Capacity Manager

The Data Capacity Manager is responsible for ensuring enough capacity of the appropriate memory/storage type is available for applications, and also for applying the right capacity optimization techniques to make the best use of the physical storage capacity available. The following methods could be used:

  • Compression
  • Deduplication
  • Erasure Coding

Data Performance Manager

The Data Performance Manager is responsible for ensuring that each application can access each piece of data at the appropriate performance level. This is accomplished using the following subsystems:

  • Latency Manager
    • This is responsible for placing the data on the right medium to ensure that each data element can be accessed at the right latency level. This can be determined either by a pre-defined policy or by heuristics/machine learning to detect data access patterns beyond LRU/MRU methods.
    • The Latency Manager could also monitor whether a local process tends to access data that’s mostly remote. If that’s the case, instead of repeatedly incurring the network access penalty, the process could simply be moved to the remote host for better performance through data locality.
  • Service Level Manager
    • The Service Level Manager is responsible for managing the various applications’ expectations in regards to performance.
    • The Service Level Manager could optimize data persistence in order to meet its objective. For example, if the local non-volatile storage response time is unacceptable, it could choose to persist the data remotely and then trigger the Replication Manager to bring a copy of the data back locally.
  • Data Variation Manager
    • A subsystem could be conceived to persist a transformed state of the data. For example, if there’s an aggregation on a dataset, it could be persisted and linked to the original data. If the original data changes, the dependent aggregation variations could either be invalidated or updated as needed.

Data Security Manager

  • Access Control Manager
    • This would create a hard security boundary between processes and ensure only authorized access is granted, independently of the storage mechanism/medium.
  • Encryption Manager
    • This would be responsible for encrypting the data if required as per a defined security policy.
  • Auditing Manager
    • This would audit data access as per a specific security policy. The events could be forwarded to a centralized logging solution for further analysis and event correlation.
    • Data accesses could be logged in a highly optimized graph database to allow:
      • Building a map of what data is accessed by which processes
      • Building a temporal map of how the processes access data
  • Malware Prevention Manager
    • Data access patterns could be detected in-line by this subsystem. For instance, it could notice that a process is trying to access credit card number data based on things like regular expressions. Third-party anti-virus solutions would also be able to extend the functionality at that layer.

Legacy Construct Emulator

The goal of the Legacy Construct Emulator is to provide legacy/existing applications with the same storage constructs they are using at this point in time, to ensure backward compatibility. Here are a few examples of constructs that would be emulated under the Distributed Universal Memory model:

  • Block Emulator
    • Emulates the simplest construct, on which the higher-level disk emulator is built
  • Disk Emulator
    • Based on the block emulator, simulates the communication interface of a disk device
  • File Emulator
    • The file emulator could work in a couple of ways:
      • If the application only needs a file handle to perform IO and is fairly agnostic of the underlying file system, the application could simply get a file handle it can perform IO on.
      • Otherwise, it could get that through a file system that’s layered on top of a volume that makes use of the disk emulator.
  • Volatile Memory Emulator
    • The goal would be to provide the necessary construct for the OS/application to store state data that might typically be stored in RAM.

One of the key things to note here is that even though all those legacy constructs are provided, the Distributed Universal Memory model has the flexibility to persist the data as it sees fit. For instance, even though the application might think it’s persisting data to volatile memory, the data might be persisted to an NVMe device in practice. The same principle would apply to file data; a file block might actually be persisted to RAM (similar to a block cache) and then replicated to multiple nodes synchronously to ensure availability, all of this potentially without the application being aware of it.

Metrics Manager

The Metrics Manager’s role is to capture/log/forward all the data points in the system. Here’s an idea of the metrics that could be collected:

  • Availability Metrics
    • Replication latency for synchronous replication
    • Asynchronous backlog size
  • Capacity Metrics
    • Capacity used/free
    • Deduplication and compression ratios
    • Capacity optimization strategy overhead
  • Performance Metrics
    • Latency
    • Throughput (IOPS, Bytes/second, etc.)
    • Bandwidth consumed
    • IO Type Ratio (Read/Write)
    • Latency penalty due to SLA per application/process
  • Reliability Metrics
    • Device error rate
    • Operation error rate
  • Security Metrics
    • Encryption overhead

High Level Memory Allocation Process

More details coming soon.

Potential Applications

  • Application high availability
    • You could decide to synchronously replicate a process’s memory to another host and simply start the application binary on the failover host in the event that the primary host fails
  • Bring server cached data closer to the client
    • One could maintain a distributed coherent cache between servers and client computers
  • Move processes closer to data
    • Instead of having a process try to access data across the network, why not move the process to where the data is?
  • User State Mobility
    • User State Migration
      • A user state could move freely between a laptop, a desktop and a server (VDI or session host) depending on what the user requires.
    • Remote Desktop Service Session Live Migration
      • As the user session state memory is essentially virtualized from the host executing the session, it can be freely moved from one host to another to allow zero impact RDS Session Host maintenance.
  • Decouple OS maintenance/upgrades from the application
    • For instance, when the OS needs to be patched, one could simply move the process memory and execution to another host. This would avoid penalties such as buffer cache rebuilds in SQL Server, which can trigger a high number of IOPS on a disk subsystem in order to repopulate the cache with popular data. For systems with a large amount of memory, this can be fairly problematic.
  • Have memory/storage that spans to the cloud transparently
    • Under this model it would be fairly straightforward to implement a cloud tier for cold data
  • Option to preserve application state on application upgrades/patches
    • One could swap the binaries to run the process while maintaining process state in memory
  • Provide object storage
    • One could layer object storage service on top of this to support Amazon S3/Azure Storage semantics. This could be implemented on top of the native API if desired.
  • Provide distributed cache
    • One could layer distributed cache mechanisms such as Redis using the native Distributed Universal Memory API to facilitate porting of applications to this new mechanism
  • Facilitate application scale out
    • For instance, one could envision a SQL Server instance being scaled out using this mechanism by spreading worker threads across multiple hosts that share a common coordinated address space.
  • More to come…

Getting .NET 4.x Versions Using PowerShell

Not too long ago, we had to figure out which versions of .NET Framework Runtime were installed in our environment. It turns out it’s not difficult but not exactly straightforward either. I was expecting to find something out of the box in SCCM to cover this but I couldn’t find what I was looking for.

I then decided to turn to my old friend, PowerShell, and create the script Get-NETFrameworkVersions.ps1, which is available in the GEM Automation CodePlex source code repository.

Here’s what the script does at a high level:

  1. Get a list of computers from the Active Directory (clients and servers, you might have to modify the filtering a bit for your environment)
  2. For each computer discovered from the Active Directory, start a PowerShell job that performs the following:
    1. Load the PSRemoteRegistry module which is required to query the registry keys containing the .NET Framework versions on the remote computers
    2. Check if the remote computer is online/reachable
    3. If the computer is online, check if the Remote Registry service is started, if not start it!
    4. Fetch the registry keys from SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full (a minimal local sketch of this step follows the list below)
    5. Export the result in a CSV file for the computer
  3. Once the discovery is complete, merge all of the computers CSV files into one file
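
For a single computer, step 4 essentially boils down to reading one registry key. Here is a minimal local sketch; the actual script performs the equivalent query remotely through PSRemoteRegistry:

# Read the .NET 4.x installation details from the local registry.
$ndp = Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full'
[pscustomobject]@{
    ComputerName = $env:COMPUTERNAME
    Version      = $ndp.Version
    Release      = $ndp.Release   # e.g. 379893 = 4.5.2, 394802 = 4.6.2
}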

A few things to note about the discovery process:

  • Versions prior to 4.x are not covered in this version of the script
  • Discovery job launches are throttled to avoid overwhelming the computer running the script; you may need to adjust the number of concurrent jobs for your computer

With the CSV file in hand you can then perform analysis on the captured data using Excel.

Should you have questions/comments about this, feel free to drop a few words below!

Windows Server Deduplication Job Execution

While working with deduplication on volumes of around 30TB, we noticed the various job types were not executing as we were expecting. As a Microsoft MVP, I’m very fortunate to have direct access to people with deep knowledge of the technology. A lot of the credit for this post goes to Will Gries and Ran Kalach from Microsoft, who were kind enough to answer my questions as to what was going on under the hood. Here’s a summary of the things I learned along the way.

Before we dive in any further, it’s important to understand the various deduplication job types as they have different resource requirements.

Job Types (source)

  • Optimization
    • This job performs both deduplication and compression of files according to the data deduplication policy for the volume. After initial optimization of a file, if that file is then modified and again meets the data deduplication policy threshold for optimization, the file will be optimized again.
  • Scrubbing
    • This job processes data corruptions found during data integrity validation, performs possible corruption repair, and generates a scrubbing report.
  • GarbageCollection
    • This job processes previously deleted or logically overwritten optimized content to create usable volume free space. When an optimized file is deleted or overwritten by new data, the old data in the chunk store is not deleted right away. By default, garbage collection is scheduled to run weekly. We recommend running garbage collection only after large deletions have occurred.
  • Unoptimization
    • This job undoes deduplication on all of the optimized files on the volume. At the end of a successful unoptimization job, all of the data deduplication metadata is deleted from the volume.

Another operation that happens in the background that you need to be aware of is the reconciliation process. This happens when the hash index doesn’t fit entirely in memory. I don’t have the details at this point as to what exactly it does, but I suspect it tries to restore index coherency across the multiple index partitions that were processed in memory successively during the optimization/deduplication process.

Server Memory Sizing vs Job Memory Requirements

To understand the memory requirements of the deduplication jobs running on your system, I recommend you have a look at events with ID 10240 in the Data Deduplication/Diagnostic Windows event log. Here’s what it looks like for an Optimization job:

Optimization job memory requirements.

Volume C:\ClusterStorage\POOL-003-DAT-001 (\\?\Volume{<volume GUID>})
Minimum memory: 6738MB
Maximum memory: 112064MB
Minimum disk: 1024MB
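
If you want to pull those requirements for all volumes at once instead of browsing Event Viewer, something along these lines should do it; the channel name below is my assumption based on the Event Viewer path, so verify it with Get-WinEvent -ListLog *Dedup*:

# Retrieve the most recent memory requirement events (ID 10240) logged by
# the deduplication engine. Adjust the log name if it differs on your build.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Deduplication/Diagnostic'
    Id      = 10240
} -MaxEvents 20 | Select-Object TimeCreated, Message | Format-List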

Here are a few key things to consider about the job memory requirement and the host RAM sizing:

  • Memory requirements scale almost linearly with the total size of the data to dedup
    • The more data to dedup, the more entries in the hash index to keep track of
    • You need to meet at least the minimum memory requirement for the job to run for a volume
  • The more memory on the host running deduplication, the better the performance, because:
    • You can run more jobs in parallel
    • The jobs will run faster because:
      • You can fit more of the hash index in memory
        • The more of the index you fit in memory, the less reconciliation work will have to be performed
        • If the index fits completely in memory, the reconciliation process is not required
  • If you use throughput scheduling (which is usually recommended)
    • The deduplication engine will allocate by default 50% of the host’s memory but this is configurable
    • If you have multiple volumes to optimize, it will try to run them all in parallel
      • It will try to allocate as much memory as possible for each job to accelerate them
      • If not enough memory is available, other optimization jobs will be queued
  • If you start optimization jobs manually
    • The job is not aware of other jobs that might get started; it will try to allocate as much memory as possible to run, potentially leaving future jobs on hold because not enough memory is available to run them (see the sketch after this list for a way to cap a manual job’s memory)
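
When starting jobs manually, you can at least cap how much of the host’s memory a single job grabs. This sketch assumes the -Memory parameter of Start-DedupJob (a percentage of host RAM) behaves as documented:

# Start an optimization job by hand but cap it at roughly 25% of host memory
# so that jobs for the other volumes can still be started afterwards.
Start-DedupJob -Volume 'C:\ClusterStorage\POOL-003-DAT-001' -Type Optimization -Memory 25

# See what is currently running or queued.
Get-DedupJob | Format-Table Volume, Type, State, Progress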

Job Parallelism

I’ve touched a bit on this in the previous point about memory sizing, but here’s a recap with additional information:

  • You can run multiple jobs in parallel
  • The dedup throughput scheduling engine can manage the complexity around the memory allocation for each of the volumes for you
  • You need to have enough memory to at least meet the minimum memory requirement of each volume that you want to run in parallel
    • If all the memory has been allocated and you try to start a new job, it will be queued until resources become available
    • The deduplication engine tries to stick to the memory quota determined when the job was started
  • Each job is currently single threaded in Windows Server 2012 R2
    • Windows Server 2016 (currently in TP4) supports multiple threads per job, meaning multiple threads/cores can process a single volume
      • This greatly improves the throughput of optimization jobs
  • If you have multiple volumes residing on the same physical disks, it would be best to run only one job at a time for those specific disks to minimize disk thrashing

To put things into perspective, let’s look at some real world data:

Volume            Min RAM (MB)  Max RAM (MB)  Min Disk (MB)  Volume Size (TB)  Unoptimized Data Size (TB)
POOL-003-DAT-001         6 738       112 064          1 024                30                       32.81
POOL-003-DAT-002         7 137       118 900          1 024                30                       35.63
POOL-003-DAT-004         7 273       121 210          1 024                30                       35.28
POOL-003-DAT-006         4 089        67 628          1 024                 2                       18.53

  • To run optimization in parallel on all volumes, I need at least 25.2GB of RAM
  • To avoid reconciliation while running those jobs in parallel, I would need a whopping 419.8GB of RAM (a quick check of this math follows below)
    • This might not be too bad if you have multiple nodes in your cluster, with each of them running a job
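
Here’s a quick sanity check of that math using the per-volume figures from the table above:

# Sum the per-volume requirements reported by event 10240 (values in MB).
$minMB = 6738, 7137, 7273, 4089
$maxMB = 112064, 118900, 121210, 67628
'Minimum to run all four jobs in parallel : {0:N1} GB' -f (($minMB | Measure-Object -Sum).Sum / 1000)
'Needed to avoid reconciliation entirely  : {0:N1} GB' -f (($maxMB | Measure-Object -Sum).Sum / 1000)
# Outputs roughly 25.2 GB and 419.8 GB, matching the numbers above.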

Monitoring Jobs

To keep an eye on the deduplication jobs, here are the methods I have found so far:

  • Get-DedupJob and Get-DedupStatus will give you the state of the jobs as they are running (see the sketch after this list)
  • Perfmon will give you information about the throughput of the jobs currently running
    • Look at the typical physical disk counters
      • Disk Read Bytes/sec
      • Disk Write Bytes/sec
      • Avg. Disk sec/Transfer
    • You can get an idea of the savings ratio by looking at how much data is being read and how much is being written per interval
  • Event Log Data Deduplication/Operational
    • Event ID 6153 which will give you the following pieces of information once the job has completed:
      • Job Elapsed Time
      • Job Throughput (MB/second)
  • Windows Resource Monitor (if not on Server Core or Nano)
    • Filter the list of processes on fsdmhost.exe and look at the IO it’s doing on the files under the Disk tab
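
Here is a small sketch that pulls a few of those together; the operational log name is my assumption based on the Event Viewer path, so adjust it if needed:

# Live view of running/queued jobs and per-volume savings.
Get-DedupJob    | Format-Table Volume, Type, State, Progress
Get-DedupStatus | Format-Table Volume, SavedSpace, SavingsRate, OptimizedFilesCount

# Completion summaries (elapsed time, throughput) from event ID 6153.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Deduplication/Operational'
    Id      = 6153
} -MaxEvents 10 | Select-Object TimeCreated, Message | Format-List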

Important Updates

I recommend that you have the following updates installed if you are running deduplication on your system as of 2016-04-18:

  • November 2014 Update Rollup
  • KB 3094197 (Will update dedup.sys to version 6.3.9600.18049)
  • KB 3138865 (Will update dedup.sys to version 6.3.9600.18221)
  • If you are running dedup in a cluster, you should install the patches listed in KB 2920151
    • That will simplify your life with Microsoft Support 😉

Final Thoughts

Deduplication is definitely a feature that can save you quite a bit of money. While it might not fit every workload, it has its uses and benefits. In one particular cluster we use to store DPM backup data, we were able to save more than 27TB (and still counting, as the jobs are still running). Windows Server 2016 will bring much improved performance, and who knows what the future holds; dedup support for ReFS, perhaps?

I will try to keep this post updated as I find out more information about deduplication operationally.

Other Resources

Validate Microsoft Recommended Updates with PowerShell

Sometimes you need to validate multiple computers to ensure that a specific patch has been installed. That can happen in the course of a support case with Microsoft, where certain updates are recommended as per a specific Knowledge Base article. In order to do this, I’ve built a simple function in PowerShell to gather that information and output a report. You can find this function in the GEM Automation CodePlex project here:

http://gemautomation.codeplex.com/SourceControl/latest#Windows/libWindowsUpdate.psm1

In order to use the function, you would do something like the following:

@("HOST01","HOST02","HOST03") | Validate-RecommendedUpdateStatus 

The function will then return something like the following:

ComputerName HotfixId InstallStatus KBURL                                          AffectedFeaturesOrRoles                  
------------ -------- ------------- -----                                          -----------------------                  
HOST001     2883200  Missing       https://support.microsoft.com/en-us/kb/2883200 Hyper-V                                  
HOST001     2887595  Missing       https://support.microsoft.com/en-us/kb/2887595 Hyper-V                                                    
HOST001     2903939  Installed     https://support.microsoft.com/en-us/kb/2903939 Hyper-V                                   
HOST001     2919442  Installed     https://support.microsoft.com/en-us/kb/2919442 Hyper-V   

While running, the function does the following:

  • Gets the list of features installed on the host
  • Checks the recommended updates for the installed features against the RecommendedUpdates.csv file (a simplified sketch of this check follows the list below)
    • I try to keep this file as up to date as possible as the Microsoft KB articles get updated
    • I updated the file on March 18th 2016
  • Lists whether the recommended update was installed or is missing
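
For reference, the heart of that check can be reduced to a few lines. This is a simplified local sketch, with the CSV column names (HotfixId, KBURL) assumed for illustration:

# Simplified version of the check: compare the recommended updates from the
# CSV against the hotfixes actually installed on a given computer.
$computer    = 'HOST01'
$recommended = Import-Csv '.\RecommendedUpdates.csv'        # assumed columns: HotfixId, KBURL
$installed   = (Get-HotFix -ComputerName $computer).HotFixID

foreach ($update in $recommended) {
    $status = if ("KB$($update.HotfixId)" -in $installed) { 'Installed' } else { 'Missing' }
    [pscustomobject]@{
        ComputerName  = $computer
        HotfixId      = $update.HotfixId
        InstallStatus = $status
        KBURL         = $update.KBURL
    }
}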

If you have any questions or issues regarding this, let me know!

GEM Automation 3.2.0.0

Version 3.2.0.0 of GEM Automation has been published to CodePlex. Here are the highlights of this new release.

Hyper-V
– Initial release of New-NanoServerCluster.ps1
– Removed explicit type casting in Set-VMNetworkConfiguration in libVM.psm1
– Modified Microsoft supplied script new-nanoserverimage.ps1 to include the containers package

Networking
– Initial release of Test-TCPPortResponseTime.ps1

Windows
– AnalyzeCrashDumps.ps1 initial release

AnalyzeCrashDumps.ps1 is simply a master script that wraps the execution of 3 main scripts that collect, analyze and summarize crash dumps in your environment. This simplifies scheduling the process in Task Scheduler.

– Get-MUILanguagesReport.ps1 initial release
– Get-UserPreferredLanguageReport.ps1 initial release

Those scripts are for our friends at the Office québécois de la langue française, to help assess the deployment of language packs and users’ language preferences across all the computers in Active Directory. This is achieved by looking at the MUILanguages property of the WMI class Win32_OperatingSystem and at a registry value called MachinePreferredUILanguages under HKU\<user SID>\Control Panel\Desktop\MuiCached.
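
Here is a rough local sketch of those two lookups; the per-user portion simply walks the SIDs loaded under HKEY_USERS:

# Installed OS display languages (MUI packs).
(Get-CimInstance -ClassName Win32_OperatingSystem).MUILanguages

# Preferred UI language per loaded user profile (one key per user SID).
Get-ChildItem 'Registry::HKEY_USERS' |
    Where-Object { $_.PSChildName -notmatch '_Classes$' } |
    ForEach-Object {
        $path = "Registry::HKEY_USERS\$($_.PSChildName)\Control Panel\Desktop\MuiCached"
        if (Test-Path $path) {
            [pscustomobject]@{
                UserSid            = $_.PSChildName
                PreferredLanguages = (Get-ItemProperty -Path $path).MachinePreferredUILanguages
            }
        }
    }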

If you encounter any issues with this release, don’t hesitate to log issues on CodePlex here.

Storage Spaces Direct Practical Scalability

When I was attending Ignite, I was pretty anxious to hear about what was then called Storage Spaces Shared Nothing, which is now called Storage Spaces Direct.

One of the main features of Storage Spaces Direct is the ability to leverage NVMe flash drives. Let’s look at some practical sizing calculations for this use case.

Maximum Storage Spaces Direct Cluster Nodes: 12 at this point in time, possibly more in the future
Maximum Throughput per NVMe drive: 3GB/s for a Samsung XS1715
Maximum NVMe drives per cluster node: 8 (AIC SB122 or Dell R920)
Maximum Throughput/Node: 24GB/s
Total Cluster Throughput: 288GB/s

Those are certainly impressive numbers, but if you want a non-blocking architecture for this, your network design needs to follow. Here’s what would be needed.

With each server pushing 192Gbps, a cluster could potentially push 12 * 192Gbps, so about 2.3Tbps. In order to get to those numbers, from what I understand so far of Storage Spaces Direct, you would need to create basically one SOFS share per cluster node and spread your data across those 12 shares. Not too bad so far. Why? CSV block IO redirection is still around from what I gathered, which means that if your data slabs/chunks are distributed across the cluster, you have a problem; more on this later.

Let’s say you want to do something crazy like running SQL Server and leverage that nice throughput. One does not simply partition his SQL database across 12 VHDX. Why? That would mean all the IO of the cluster would need to converge back to the node running your SQL Server instance. 2.3Tbps incoming! Take cover! So what kind of hardware would you need to support this, you ask? If you take 100GbE, which to my knowledge is the fastest interconnect generally available, you would need 23 ports per server running your workload, or more tangibly 12 Mellanox ConnectX-4 dual port cards. You only have 10 PCIe slots in your R920? How unfortunate! Forget about the AIC 1U machine! I would also like to see the guy doing the wiring job on those servers! Another small detail: at 24 ports per machine, you could only hook up 3 servers to a set of 2 switches with 36 ports each if you’re trying to do dual pathing for link redundancy. Hmm... don’t we need to connect 12 of those servers? Looks like an incoming sizing fail to me. Perhaps I’m wrong and I missed something; if that’s the case, let me know! What’s really needed TODAY to make this setup practical is a 1Tbps interconnect, so you can fully leverage your investment and future proof your infrastructure a bit.
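
For reference, here’s the back-of-the-envelope math from above as a quick script:

# Back-of-the-envelope throughput math used in this post.
$nodes            = 12
$nvmePerNode      = 8
$gbPerSecPerDrive = 3                                         # Samsung XS1715

$perNodeGBps = $nvmePerNode * $gbPerSecPerDrive               # 24 GB/s per node
$perNodeGbps = $perNodeGBps * 8                               # 192 Gbps per node
$clusterGBps = $perNodeGBps * $nodes                          # 288 GB/s for the cluster
$clusterTbps = $perNodeGbps * $nodes / 1000                   # ~2.3 Tbps converging on one node
$ports100GbE = [math]::Ceiling($perNodeGbps * $nodes / 100)   # 24 ports, i.e. 12 dual-port cards

"$perNodeGBps GB/s per node, $clusterGBps GB/s per cluster, $clusterTbps Tbps total, $ports100GbE x 100GbE ports"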

So the only viable option to use that kind of throughput would be to shard your database across multiple SQL Server instances. You could use AlwaysOn and create read-only replicas (up to 8 in 2014, getting close to 12 but not quite there; not sure if that’s bumped up in 2016), but then you are potentially wasting quite a bit of storage by duplicating your data around. Azure SQL on-premises running on Azure Stack/Azure Service Fabric could start to look very nice. Perhaps another NoSQL database or Hadoop would be more suitable, but for the vast majority of enterprises, it’s a radical change for sure!

Food for thought! I’ll let that one stir for a bit…