Measuring GPU Utilization in Remote Desktop Services

I recently spent some time experimenting with GPU Discrete Device Assignment in Azure using the NV* series of VMs. Since we had noticed that Internet Explorer was consuming quite a bit of CPU on our Remote Desktop Services session hosts, I wondered how much CPU load could be offloaded by accelerating graphics on a GPU. We ran experiments with Windows Server 2012 R2 and Windows Server 2016. While Windows Server 2012 R2 does deliver some level of hardware acceleration for graphics, Windows Server 2016 provides a more complete experience through better support for GPUs in an RDP session.

In order to enable hardware acceleration for RDP, you must do the following in your Azure NV* series VM:

  1. Download and install the latest driver recommended by Microsoft/NVidia from here
  2. Enable the Group Policy Setting  Administrative Templates\Windows Components\Remote Desktop Services\Remote Desktop Session Host\Remote Session Environment\Use the hardware default graphics adapter for all Remote Desktop Services sessions as shown below:
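
The state of this policy can also be checked from PowerShell. The following is only a minimal sketch: it assumes the policy is stored as the registry value bEnumerateHWBeforeSW under the Terminal Services policy key, which is not stated in this post, so confirm the value name against the policy's ADMX definition before relying on it.

```powershell
# Assumption: the "Use the hardware default graphics adapter for all Remote Desktop
# Services sessions" policy maps to this registry value (1 = enabled).
$policyKey = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services'
$valueName = 'bEnumerateHWBeforeSW'

# Read the value without failing when the key or value does not exist.
$policy = Get-ItemProperty -Path $policyKey -Name $valueName -ErrorAction SilentlyContinue

if ($policy -and $policy.$valueName -eq 1) {
    'Hardware graphics adapter policy is enabled.'
} else {
    'Hardware graphics adapter policy is not enabled (or not configured).'
}
```

The sketch only reads the value; setting it through Group Policy, as described above, remains the supported route.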

To validate the acceleration, I used a couple of tools to generate and measure the GPU load. For load generation I used the following:

  • Island demo from Nvidia which is available for download here.
    • This scenario worked fine in both Windows Server 2012 R2 and Windows Server 2016
    • Here’s what it looks like when you run this demo (don’t mind the GPU information displayed, that was from my workstation, not from the Azure NV* VM):
  • Microsoft Fish Tank page which leverages WebGL in the browser which is in turn accelerated by the GPU when possible
    • This proved to be the scenario that differentiated Windows Server 2016 from Windows Server 2012 R2. Only under Windows Server 2016 could a high frame rate and low CPU utilization be achieved. When this demo runs using only the software renderer, I observed CPU utilization close to 100% on a fairly beefy NV6 VM with 6 cores, and that was with just a single instance of the test running.
    • Here’s what FishGL looks like:

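To measure the GPU utilization, I ended up using the following tools:

  • Windows Performance Recorder, with the resulting trace visualized in Windows Performance Analyzer
  • Process Explorer, for live monitoring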

In order to do a capture with Windows Performance Recorder, make sure that GPU activity is selected under the profiles to be recorded:

Here’s a recorded trace of the GPU utilization from the Azure VM while running FishGL in Internet Explorer that’s being visualized in Windows Performance Analyzer:

As you can see in the WPA screenshot above, quite a few processes can take advantage of the GPU acceleration.

Here’s what it looks like in Process Explorer when you’re doing live monitoring. As shown below, you can see which process is consuming GPU resources. This particular screenshot shows what Internet Explorer consumes while running FishGL on my workstation.

Windows Server 2016 takes great advantage of an assigned GPU to offload compute intensive rendering tasks. Hopefully this article helped you get things started!


Distributed Universal Memory for Windows

*Disclaimer* This is only an idea I’ve been toying with. It doesn’t represent in any way, shape or form future Microsoft plans in regards to memory/storage management. This page will evolve over time as the idea is being refined and fleshed out.

**Last Updated 2017-03-23**

The general idea behind Distributed Universal Memory is to have a common memory management API that would achieve the following:
  • Abstract the application from the memory medium required to maintain application state, whether it’s volatile or permanent
  • Allow the application to express memory behavior requirements and not worry about the storage medium to achieve this
  • Support legacy constructs for backward compatibility
  • Enable new capabilities for legacy applications without code change
  • Give modern applications a simple surface to persist data
  • Enable scale-out applications to potentially use a single address space
  • Could potentially move towards a more microservice-based approach instead of the current monolithic code base
  • Could easily leverage advances in hardware development such as disaggregation of compute and memory, or usage of specialized hardware such as FPGAs or GPUs to accelerate certain memory handling operations
  • Could be ported/backported to further increase the reach/integration capabilities. This memory management subsystem could be cleanly decoupled from the underlying operating system.
  • Allow the data to be optimally placed for performance, availability and ultimately cost

Availability Management

  • Process memory can be replicated either systematically or on demand
  • This would allow existing process memory to be migrated from one operating system instance to another transparently.
  • This could offer higher resiliency to process execution in the event of a host failure
  • This could also allow some OS components to be updated while higher level processes keep going. (i.e. redirected memory IO)
Performance Management
  • Required medium to achieve performance could be selected automatically using different mechanisms (MRU/LRU, machine learning, etc.)
  • Memory performance can be expressed explicitly by the application
    • By expressing its need, it would be easier to characterize/model/size the required system to support the application
    • Modern applications could easily specify how each piece of data they interact with should perform
  • Could provide multiple copies of the same data element for compute locality purposes, i.e. a distributed read cache
    • This distributed read cache could be shared between client and server processes if desired. This would allow a single cache mechanism to be used, independently of the client/process accessing it.
Capacity Management
  • Can adjust capacity management techniques depending on performance and availability requirements
  • For instance, if data is rarely used by the application, several data reduction techniques could be applied such as deduplication, compression and/or erasure coding
  • If data access time doesn’t require redundancy/locality/tolerates time for RDMA, it could be spread evenly across the Distributed Universal Memory Fabric

High Level Cluster View


Here’s a high-level diagram of what it might look like:

Let’s go over some of the main components.

Data Access Manager

The Data Access Manager is the primary interface layer to access data. The legacy API would sit on top of this layer in order to properly abstract the underlying subsystems in play.

  • Transport Manager
    • This subsystem is responsible for pushing/pulling the data to and from the remote host. All inter-node data transfers would occur over RDMA to minimize the overhead of copying data back and forth between nodes.
  • Addressing Manager
    • This would be responsible for giving the data a universal memory address that’s independent of storage medium and cluster nodes.

Data Availability Manager

This component would be responsible for ensuring that the proper level of data availability and resiliency is enforced as per the policies defined in the system. It would be made of the following subsystems:

  • Availability Service Level Manager
    • The Availability Service Level Manager’s responsibility is to ensure the overall availability of data. For instance, it would act as the orchestrator responsible for triggering the Replication Manager to ensure the data is meeting its availability objective.
  • Replication Manager
    • The Replication Manager is responsible for enforcing the right level of data redundancy across local and remote memory/storage devices. For instance, if 3 copies of the data must be maintained for a particular process/service/file/etc. across 3 different failure domains, the Replication Manager is responsible for ensuring this is the case as per the policy defined for the application/data.
  • Data History Manager
    • This subsystem ensures that the appropriate point-in-time copies of the data are maintained. Those data copies could be maintained in the system itself by using the appropriate storage medium, or they could be handed off to a third-party process if necessary (e.g. a standard backup solution). The API would provide a standard way to perform data recovery operations.
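
To make the Replication Manager example more concrete, here is a minimal sketch of the kind of check it might perform. Everything in it is hypothetical: the function name Test-ReplicaPolicy and the Node and FailureDomain properties are made up for illustration, since no such API exists yet.

```powershell
# Hypothetical sketch: checks whether a set of replicas satisfies a
# "N copies across M failure domains" policy. Names are illustrative only.
function Test-ReplicaPolicy {
    param(
        # Objects with (hypothetical) Node and FailureDomain properties.
        [Parameter(Mandatory)] [object[]] $Replicas,
        [int] $RequiredCopies = 3,
        [int] $RequiredFailureDomains = 3
    )

    # Count the distinct failure domains that currently hold a copy.
    $domains = @($Replicas | ForEach-Object { $_.FailureDomain } | Sort-Object -Unique)

    [pscustomobject]@{
        CopyCount          = $Replicas.Count
        FailureDomainCount = $domains.Count
        Compliant          = ($Replicas.Count -ge $RequiredCopies) -and
                             ($domains.Count -ge $RequiredFailureDomains)
    }
}
```

With three copies placed in only two failure domains, the result would be reported as non-compliant, which is the signal the Availability Service Level Manager could use to trigger additional replication.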

Data Capacity Manager

The Data Capacity Manager is responsible for ensuring that enough capacity of the appropriate memory/storage type is available for applications, and also for applying the right capacity optimization techniques to make the most of the physical storage capacity available. The following methods could be used:

  • Compression
  • Deduplication
  • Erasure Coding

Data Performance Manager

The Data Performance Manager is responsible to ensure that each application can access each piece of data at the appropriate performance level. This is accomplished using the following subsystems:

  • Latency Manager
    • This is responsible for placing the data on the right medium to ensure that each data element can be accessed at the right latency level. This can be determined either by pre-defined policy or by heuristics/machine learning that detect data access patterns beyond what LRU/MRU methods capture.
    • The Latency Manager could also monitor if a local process tends to access data that’s mostly remote. If that’s the case, instead of repeatedly incurring the network access penalty, the process could simply be moved to the remote host for better performance through data locality.
  • Service Level Manager
    • The Service Level Manager is responsible for managing the performance expectations of the various applications.
    • The Service Level Manager could optimize data persistence in order to meet its objective. For example, if the local non-volatile storage response time is unacceptable, it could choose to persist the data remotely and then trigger the Replication Manager to bring a copy of the data back locally.
  • Data Variation Manager
    • A subsystem could be conceived to persist a transformed state of the data. For example, if there’s an aggregation on a dataset, it could be persisted and linked to the original data. If the original data changes, the dependent aggregation variations could either be invalidated or updated as needed.
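
As a rough illustration of policy-driven placement by the Latency Manager, the sketch below picks the cheapest medium that still meets a latency target. The function name, the property names and especially the latency and cost figures are placeholders invented for this example, not measurements.

```powershell
# Hypothetical sketch: choose the cheapest medium that meets a latency target.
function Select-StorageMedium {
    param(
        [Parameter(Mandatory)] [double] $MaxLatencyMicroseconds,
        [object[]] $Media
    )

    if (-not $Media) {
        # Placeholder catalogue: latency and relative cost values are illustrative only.
        $Media = @(
            [pscustomobject]@{ Name = 'RAM';  LatencyMicroseconds = 0.1;   RelativeCost = 100 }
            [pscustomobject]@{ Name = 'NVMe'; LatencyMicroseconds = 100;   RelativeCost = 10 }
            [pscustomobject]@{ Name = 'HDD';  LatencyMicroseconds = 10000; RelativeCost = 1 }
        )
    }

    # Keep only media that are fast enough, then take the least expensive one.
    $Media |
        Where-Object { $_.LatencyMicroseconds -le $MaxLatencyMicroseconds } |
        Sort-Object -Property RelativeCost |
        Select-Object -First 1
}
```

A heuristic or machine learning based Latency Manager would replace the fixed catalogue with observed access patterns, but the decision it feeds would look similar: a latency requirement in, a medium out. When no medium is fast enough the function returns nothing, which a real system would have to treat as an unmet service level.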

Data Security Manager

  • Access Control Manager
    • This would create hard security boundary between processes and ensure only authorized access is being granted, independently of the storage mechanism/medium.
  • Encryption Manager
    • This would be responsible for the encryption of the data if required as per a defined security policy.
  • Auditing Manager
    • This would audit data access as per a specific security policy. The events could be forwarded to a centralized logging solution for further analysis and event correlation.
    • Data accesses could be logged in a highly optimized graph database to:
      • Build a map of what data is accessed by processes
      • Build a temporal map of how the processes access data
  • Malware Prevention Manager
    • Data access patterns can be detected in-line by this subsystem. For instance, it could notice that a process is trying to access credit card number data by matching the data against regular expressions. Third-party anti-virus solutions would also be able to extend the functionality at that layer.
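
The credit card example above can be sketched as a regular expression followed by a Luhn checksum to cut down on false positives. This is only an illustration of the idea: the function names are made up, and the Luhn step is an addition of this sketch, not something the model prescribes.

```powershell
# Returns $true when a string of digits passes the Luhn checksum.
function Test-LuhnChecksum {
    param([Parameter(Mandatory)] [string] $Digits)

    $sum = 0
    $double = $false
    # Walk the digits from right to left, doubling every second one.
    for ($i = $Digits.Length - 1; $i -ge 0; $i--) {
        $d = [int]::Parse($Digits.Substring($i, 1))
        if ($double) {
            $d *= 2
            if ($d -gt 9) { $d -= 9 }
        }
        $sum += $d
        $double = -not $double
    }
    return (($sum % 10) -eq 0)
}

# Hypothetical detector: emits digit strings that look like card numbers.
function Find-CardNumberCandidate {
    param([Parameter(Mandatory)] [AllowEmptyString()] [string] $Text)

    # 13 to 19 digits, optionally separated by single spaces or dashes.
    $pattern = '\b(?:\d[ -]?){12,18}\d\b'
    foreach ($match in [regex]::Matches($Text, $pattern)) {
        $digits = $match.Value -replace '[ -]', ''
        if (Test-LuhnChecksum -Digits $digits) { $digits }
    }
}
```

An in-line implementation would run this kind of check against buffers as they pass through the Data Security Manager rather than against strings, and would raise an audit event instead of returning the number.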

Legacy Construct Emulator

The goal of the Legacy Construct Emulator is to provide legacy/existing applications with the same storage constructs they are using at this point in time, to ensure backward compatibility. Here are a few examples of constructs that would be emulated under the Distributed Universal Memory model:

  • Block Emulator
    • Emulates the simplest construct, on which the higher-level disk emulator is built
  • Disk Emulator
    • Based on the block emulator, simulates the communication interface of a disk device
  • File Emulator
    • For the file emulator, it could work in a couple of ways.
      • If the application only needs to have a file handle to perform IO and is fairly agnostic of the underlying file system, the application could simply get a file handle it can perform IO on.
      • Otherwise, it could get that through the file system that’s layered on top of a volume that makes use of the disk emulator.
  • Volatile Memory Emulator
    • The goal would be to provide the necessary construct for the OS/application to store its state data that would typically be stored in RAM.

One of the key things to note here is that even though all those legacy constructs are provided, the Distributed Universal Memory model has the flexibility to persist the data as it sees fit. For instance, even though the application might think it’s persisting data to volatile memory, the data might be persisted to an NVMe device in practice. The same principle would apply to file data; a file block might actually be persisted to RAM (similar to a block cache) that’s then replicated to multiple nodes synchronously to ensure availability, all of this potentially without the application being aware of it.

Metrics Manager

The Metrics Manager captures, logs and forwards all data points in the system. Here’s an idea of what it could track:

  • Availability Metrics
    • Replication latency for synchronous replication
    • Asynchronous backlog size
  • Capacity Metrics
    • Capacity used/free
    • Deduplication and compression ratios
    • Capacity optimization strategy overhead
  • Performance Metrics
    • Latency
    • Throughput (IOPS, Bytes/second, etc.)
    • Bandwidth consumed
    • IO Type Ratio (Read/Write)
    • Latency penalty due to SLA per application/process
  • Reliability Metrics
    • Device error rate
    • Operation error rate
  • Security Metrics
    • Encryption overhead

High Level Memory Allocation Process

More details coming soon.

Potential Applications

  • Application high availability
    • You could decide to synchronously replicate a process’s memory to another host and simply start the application binary on the failover host in the event that the primary host fails
  • Bring server cached data closer to the client
    • One could maintain a distributed coherent cache between servers and client computers
  • Move processes closer to data
    • Instead of having a process try to access data across the network, why not move the process to where the data is?
  • User State Mobility
    • User State Migration
      • A user state could move freely between a laptop, a desktop and a server (VDI or session host) depending on what the user requires.
    • Remote Desktop Service Session Live Migration
      • As the user session state memory is essentially virtualized from the host executing the session, it can be freely moved from one host to another to allow zero impact RDS Session Host maintenance.
  • Decouple OS maintenance/upgrades from the application
    • For instance, when the OS needs to be patched, one could simply move the process memory and execution to another host. This would avoid penalties such as buffer cache rebuilds in SQL Server, which can trigger a high number of IOPS on a disk subsystem in order to repopulate the cache with popular data. For systems with a large amount of memory, this can be fairly problematic.
  • Have memory/storage that spans to the cloud transparently
    • Under this model it would be fairly straightforward to implement a cloud tier for cold data
  • Option to preserve application state on application upgrades/patches
    • One could swap the binaries to run the process while maintaining process state in memory
  • Provide object storage
    • One could layer object storage service on top of this to support Amazon S3/Azure Storage semantics. This could be implemented on top of the native API if desired.
  • Provide distributed cache
    • One could layer distributed cache mechanisms such as Redis using the native Distributed Universal Memory API to facilitate porting of applications to this new mechanism
  • Facilitate application scale out
    • For instance, one could envision a SQL Server instance to be scaled out using this mechanism by spreading worker threads across multiple hosts that share a common coordinated address space.
  • More to come…

Validating Service Principal Name Entries Using PowerShell

In this blog post, I’ll be going over how to use a function I wrote earlier this year to help you validate the Service Principal Names in your environment.

The function in question is Validate-ComputerSPN which is located in libActiveDirectory.psm1 in the GEM Automation CodePlex project.

Here’s a brief overview of the capabilities of the function at this point in time:

  • Enumerates services that are using an Active Directory account
  • If the account is used for a SQL Server instance:
    • Checks if there are alternative names (CNAME) for the instance based on DNS records
    • Checks if there are alternative names coming from failover clustering or Availability Groups
    • For each of those alternative names, validates if the required SPN is present in Active Directory
  • If IIS is installed on the computer:
    • Enumerates all sites, applications and application pools
      • The function will capture host headers and ports used by the sites
    • For each combination of application/host header/port, validates if the required SPN is present in the Active Directory
  • In all of the cases above, the following information is captured and can be exported in a CSV as shown in the example below.
    • Name of the computer
    • Name of the service (MSSQLSERVER or the name of the IIS web site)
    • The Active Directory service account
    • The DNS entry used by the service
    • The SPN entry that was found
    • The SPN entry that was expected
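
To illustrate the "expected versus found" comparison for SQL Server, here is a minimal, self-contained sketch. It is not the code of Validate-ComputerSPN; the function name Get-MissingSqlSpn and its parameters are hypothetical, and it only covers the MSSQLSvc/name:port form shown later in this post.

```powershell
# Hypothetical helper: returns the expected SQL Server SPNs that are not registered.
function Get-MissingSqlSpn {
    param(
        # Host name plus any alternative names (CNAME, cluster or listener names).
        [Parameter(Mandatory)] [string[]] $HostNames,
        [int] $Port = 1433,
        # SPNs currently registered on the service account.
        [string[]] $RegisteredSpns = @()
    )

    foreach ($name in $HostNames) {
        # ${name} is required so that "name:" is not parsed as a scope qualifier.
        $expected = "MSSQLSvc/${name}:$Port"
        if ($RegisteredSpns -notcontains $expected) { $expected }
    }
}
```

The list of registered SPNs could, for example, be taken from the output of setspn -L for the service account.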

Here’s an example of how you would call this function:

$computers=@("SERVER01", "SERVER02")
$computers | Validate-ComputerSPN -dnsServerName "DNS01" -domainName "" -serviceAccountSearchBase "OU=Service Accounts,OU=Generic Accounts,dc=contoso,dc=com" | Export-Csv -NoTypeInformation -Delimiter "^" -Path ComputerSPNValidationReport.csv -Force 

The generated CSV will have the following information:

  • Computer Name
  • Service Name
  • Service Account Name
  • DNS Entry
  • SPN Found
  • SPN Expected

For instance, you can use an Excel PivotTable to show you information in a similar way:

      • CONTOSO\dbs_001_svc
        • MSSQLSvc/SERVERNAME:1433
        • MSSQLSvc/production-databases:1433
        • MSSQLSvc/

I’ve also used a formula in the Excel spreadsheet to generate the required setspn command to run in case a missing SPN was found. This is accomplished by simply concatenating the various fields with the proper setspn.exe switches. For instance:

  • =CONCATENATE("setspn -A ",F2," ",C2)
  • Where F2 is the Expected SPN and C2 the Service Account Name
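
The same concatenation can be done directly in PowerShell if you would rather skip the spreadsheet. This is a small sketch with a hypothetical function name; it keeps the -A switch used in the formula above.

```powershell
# Hypothetical helper: builds the setspn command line for a missing SPN.
function New-SetSpnCommand {
    param(
        [Parameter(Mandatory)] [string] $ExpectedSpn,
        [Parameter(Mandatory)] [string] $ServiceAccount
    )

    # Same layout as the Excel formula: setspn -A <expected SPN> <service account>
    "setspn -A $ExpectedSpn $ServiceAccount"
}

New-SetSpnCommand -ExpectedSpn 'MSSQLSvc/production-databases:1433' -ServiceAccount 'CONTOSO\dbs_001_svc'
```

The function only produces the command text so that it can be reviewed before anything is run. Note that setspn also offers a -S switch, which checks for duplicate SPNs before adding one.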

Should you have any questions or comments about this, feel free to let me know!

Capturing Logged In Remote Desktop Sessions on Servers Using PowerShell

While it’s a best practice to avoid logging on to servers using Remote Desktop for management tasks, some things are just easier when you do and some things are almost impossible to do otherwise. That will change significantly with Windows Server 2016 but in the meantime, we have to manage this.

I’m sure it has happened to you or to your colleagues: we sometimes disconnect our RDP sessions from our beloved pet servers and forget we ever logged onto them. The problem is that we end up wasting precious server resources in our environment for no valid reason. So how can we be aware of those lingering RDP sessions?

In our case, I built the following script to help us assess the situation.

For the latest version: Get-RDPSession.ps1

Here’s what the script is doing at a high level:

  1. Get a list of servers from Active Directory
  2. Load the awesome PSTerminalServices.psm1 module (it hasn’t been updated in a little while though)
  3. For each server:
    1. Capture the Remote Desktop Services sessions core information:
      1. Computer Name
      2. Domain Name
      3. User Name
      4. Connection state
      5. Time of connection
      6. Time of disconnection
      7. Time when last input was received from the user
      8. Login time
      9. User idle time
    2. For each session discovered, measure the amount of memory consumed by the user
    3. Export this data to a CSV file for further analysis
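
Once the session data has been captured, picking out the forgotten sessions is a simple filter. The sketch below is not part of Get-RDPSession.ps1; the function name is hypothetical and it assumes each session object exposes ConnectionState and IdleTime properties, similar to what PSTerminalServices returns.

```powershell
# Hypothetical filter: keeps sessions that are disconnected and idle for a long time.
function Select-StaleSession {
    param(
        # Session object; ConnectionState and IdleTime property names are assumptions.
        [Parameter(Mandatory, ValueFromPipeline)] [object] $Session,
        [timespan] $MinimumIdleTime = [timespan]::FromHours(8)
    )

    process {
        if ($Session.ConnectionState -eq 'Disconnected' -and
            $Session.IdleTime -ge $MinimumIdleTime) {
            $Session
        }
    }
}
```

With PSTerminalServices loaded, something like Get-TSSession -ComputerName SERVER01 | Select-StaleSession | Export-Csv -NoTypeInformation -Path StaleSessions.csv would be one way to use it; the 8 hour default threshold is arbitrary.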

You can then easily use Excel or Power BI to perform additional analysis on your open sessions. Here’s a screenshot from our environment where I’ve protected the names of the innocents:


Now go clean up your RDP sessions and THEN go bug your colleagues about theirs! 😉

If you have any questions about this, feel free to ask via the comments!

Remote Desktop Services VDI User Experience Monitoring

With the release of GEM Automation came a new set of capabilities to monitor core network statistics that directly affect the user experience in Remote Desktop Services Virtual Desktop Infrastructure (VDI). As we have quite a few users who are either working from home or from another country (on another continent), it became obvious we needed some extra information to assess and diagnose the experience of our users.

The core of this new functionality resides in the Windows\libWindowsRDS.psm1 module. The function that performs the bulk of the work is called Get-RDSPerfmonStatistics. An easy way to monitor your RDS infrastructure is to put the names of your brokers in RDSServers.txt and then simply call Monitor-RDSPerformance.ps1 without any parameters, assuming you have configured your environment properly in the configuration database using ConfigurationDatabase\libConfigurationDatabase.psm1. Only the credentials of the user performing the monitoring need to be added as configuration settings. For example, you could do this by calling the New-ConfigurationSetting function:

New-ConfigurationSetting -scope Global -name AutomationCredentialUserName -value <user name>
New-ConfigurationSetting -scope Global -name AutomationCredentialPassword -value <user password>

Doing it this way allows you to easily schedule the Monitor-RDSPerformance.ps1 script in Task Scheduler.

Once the script has been properly configured and is running, it will collect the following metrics:

  • \RemoteFX Network(*)\Current TCP RTT
  • \RemoteFX Network(*)\Current UDP RTT
  • \RemoteFX Network(*)\Total Sent Rate
  • \RemoteFX Network(*)\Total Received Rate

The script will also collect the following user information to facilitate the reconciliation of the performance data with actual users:

  • Broker computer name
  • Collection name
  • Host server name (i.e. Hyper-V host running the VM)
  • User company
  • User title

A caveat of this script to note is that session information is only refreshed at a specific interval. The refresh process lists the currently active sessions in order to set up the perfmon monitoring required (i.e. target the right VDI computer). By default this is done every 30 minutes. This means that if a user connects and has a session shorter than 30 minutes, there’s a chance it won’t show up in the statistics.

Here’s an idea of what the data looks like in the Excel spreadsheet:


As you can see, I can slice and dice the user statistics in multiple ways in order to get to what I’m looking for. One graph that’s particularly interesting to help you assess the situation is the one in the bottom left. It shows you how many samples were within certain RTT buckets. It gives you a nice and easy way to see if the user experience is generally good or bad. It’s also important to keep an eye on the Average TCP RTT (if you are not using UDP) and the TCP RTT Jitter. They show you the average connection latency and the amount of latency variation happening.
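
For those who want the same numbers without the spreadsheet, here is a minimal sketch that computes the average RTT, a jitter figure and the RTT buckets from a list of samples. It is not taken from libWindowsRDS.psm1: the function name, the bucket boundaries and the jitter definition (mean absolute difference between consecutive samples) are all assumptions of this example.

```powershell
# Hypothetical helper: summarizes RTT samples (in milliseconds).
function Get-RttSummary {
    param(
        [Parameter(Mandatory)] [double[]] $RttMilliseconds,
        # Upper bounds of the buckets; these values are arbitrary examples.
        [double[]] $BucketUpperBounds = @(50, 100, 150, 250)
    )

    $bounds = @($BucketUpperBounds | Sort-Object)
    $average = ($RttMilliseconds | Measure-Object -Average).Average

    # Assumed jitter definition: mean absolute difference between consecutive samples.
    $jitter = 0.0
    if ($RttMilliseconds.Count -gt 1) {
        $diffs = for ($i = 1; $i -lt $RttMilliseconds.Count; $i++) {
            [math]::Abs($RttMilliseconds[$i] - $RttMilliseconds[$i - 1])
        }
        $jitter = ($diffs | Measure-Object -Average).Average
    }

    # Prepare one counter per bucket plus an overflow bucket.
    $buckets = [ordered]@{}
    foreach ($bound in $bounds) { $buckets["<= $bound ms"] = 0 }
    $overflowLabel = "> $($bounds[-1]) ms"
    $buckets[$overflowLabel] = 0

    # Place each sample in the first bucket whose upper bound it fits under.
    foreach ($rtt in $RttMilliseconds) {
        $label = $overflowLabel
        foreach ($bound in $bounds) {
            if ($rtt -le $bound) { $label = "<= $bound ms"; break }
        }
        $buckets[$label] = $buckets[$label] + 1
    }

    [pscustomobject]@{
        AverageMs = $average
        JitterMs  = $jitter
        Buckets   = $buckets
    }
}
```

The samples themselves could come from the CookedValue of the \RemoteFX Network(*)\Current TCP RTT counter as returned by Get-Counter, grouped per user session before being passed to the function.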

Right now the Excel spreadsheet is not available. Once I clean our confidential data from it, I’ll publish it to CodePlex.

If you have any questions about this, feel free to leave me a comment!


Validate Microsoft Recommended Updates with PowerShell

Sometimes you need to validate multiple computers to ensure that a specific patch has been installed. That can happen in the course of a support case with Microsoft, which recommends certain updates to be installed as per a specific knowledge base article. In order to do this, I’ve built a simple function in PowerShell to gather that information and output a report. You can find this function on the GEM Automation CodePlex project here:

In order to use the function, you would do something like the following:

@("HOST01","HOST02","HOST03") | Validate-RecommendedUpdateStatus 

The function will then return something like the following:

ComputerName HotfixId InstallStatus KBURL AffectedFeaturesOrRoles
------------ -------- ------------- ----- -----------------------
HOST001      2883200  Missing             Hyper-V
HOST001      2887595  Missing             Hyper-V
HOST001      2903939  Installed           Hyper-V
HOST001      2919442  Installed           Hyper-V

While running, the function does the following:

  • Gets the list of features installed on the host
  • Checks the recommended updates for the installed feature against the RecommendedUpdates.csv file
    • I try to keep this file as up to date as possible as the Microsoft KB articles get updated
    • I updated the file on March 18th 2016
  • Lists whether the recommended update was installed or is missing
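
The comparison step above can be sketched in a few lines. This is not the code of Validate-RecommendedUpdateStatus; the function name is hypothetical, and the HotfixId and AffectedFeaturesOrRoles property names are assumptions that mirror the report columns rather than the actual layout of RecommendedUpdates.csv.

```powershell
# Hypothetical helper: flags each recommended update as Installed or Missing.
function Compare-RecommendedUpdate {
    param(
        [Parameter(Mandatory)] [string] $ComputerName,
        # Objects with (assumed) HotfixId and AffectedFeaturesOrRoles properties.
        [Parameter(Mandatory)] [object[]] $RecommendedUpdates,
        # IDs of the hotfixes installed on the computer, with or without the KB prefix.
        [string[]] $InstalledHotfixIds = @()
    )

    # Normalize IDs so that "KB2883200" and "2883200" compare as equal.
    $installedIds = @($InstalledHotfixIds | ForEach-Object { $_ -replace '^KB', '' })

    foreach ($update in $RecommendedUpdates) {
        $id = "$($update.HotfixId)" -replace '^KB', ''
        [pscustomobject]@{
            ComputerName            = $ComputerName
            HotfixId                = $id
            InstallStatus           = if ($installedIds -contains $id) { 'Installed' } else { 'Missing' }
            AffectedFeaturesOrRoles = $update.AffectedFeaturesOrRoles
        }
    }
}
```

On a Windows host, the installed IDs could be obtained with (Get-HotFix -ComputerName HOST01).HotFixID and the recommended list with Import-Csv, adjusting the property names to the real file.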

If you have any questions or issues regarding this, let me know!

Accessing SMB File Share from Windows Mobile 10

Just a quick post to potentially save my Windows Mobile 10 compatriots some time and frustration. In order to access an SMB share from your Windows Mobile 10 device, I recommend Metro File Manager Pro, which works great to manage files locally, in OneDrive and also in SMB shares. Oddly enough, the built-in File Manager doesn’t support network shares.

If you get an error while trying to browse your share, check for the following event in the System event log for the Server service (srv):

The server’s configuration parameter “irpstacksize” is too small for the server to use a local device.  Please increase the value of this parameter.

I was hitting that particular error every time the application tried to access a share that was residing on a 4TB thin Storage Spaces virtual disk.

You can apply the solution mentioned in this KB article and specify a value of 50. Once the registry value was created with the proper data and the Server service restarted, I was able to access and browse the share successfully from my phone.
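
For reference, here is a minimal sketch of that change in PowerShell, to be run from an elevated session on the server hosting the share. It assumes the KB article refers to the IRPStackSize DWORD value under the LanmanServer parameters key, which is where this setting is commonly documented; double check against the article before applying it.

```powershell
# Assumption: the setting lives under the Server service (LanmanServer) parameters key.
$path = 'HKLM:\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters'

# Create (or overwrite) the IRPStackSize DWORD value with a decimal value of 50.
New-ItemProperty -Path $path -Name 'IRPStackSize' -PropertyType DWord -Value 50 -Force | Out-Null

# Restart the Server service so that the new value is picked up.
Restart-Service -Name LanmanServer -Force
```

Restarting the Server service briefly interrupts every share hosted on the machine, so it is best done during a quiet period.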