Introduction to Azure Privileged Identity Management

As a general security best practice, IT infrastructure should be operated and managed under the least privilege principle. Doing this on premises has often been problematic, as it involved either a manual escalation process (Run As) or a custom automated process. The Run As approach is typically not ideal, as even those secondary accounts generally have far more privileges than required to perform administrative tasks on systems. PowerShell Just Enough Administration definitely helps in that regard, but today I will look at Azure’s take on this problem by covering the basics of Azure Privileged Identity Management (PIM).

With Azure PIM, you get better visibility into the privileges required to manage your environment. It’s fairly easy to get started with and to use, so I highly encourage you to adopt this security practice in your environment, especially if you are just getting started with Azure in general.

Initially, Azure Privileged Identity Management (PIM) only covered privilege escalation for Azure Active Directory roles. This changed when Microsoft announced that it now covers Azure Resource Manager resources as well, which means you can do just-in-time escalation of privileges to manage things like subscriptions, networking, VMs, etc. In this post, I’ll cover the Azure AD roles portion of Azure PIM.

To quickly get started with Azure PIM for Azure AD roles, you can simply log in to the Azure Portal and start assigning users as eligible for specific Azure AD roles. To achieve this, go to the Azure AD Directory Roles section.

Once there, you can go to the Roles section and start making users eligible for specific Azure AD roles by clicking the Add user button. One thing to note is that you can only assign roles to individual users, not to groups.
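If you prefer to script this rather than click through the portal, here’s a minimal sketch using the Microsoft Graph PowerShell SDK (treat it as an assumption-laden example: the user, role name and 180-day eligibility window are placeholders, and cmdlet/permission names should be checked against your installed Microsoft.Graph module version):

# Minimal sketch: make a user eligible for an Azure AD role through Microsoft Graph PowerShell.
# The UPN, role name and duration below are placeholders; adjust them to your tenant.
Connect-MgGraph -Scopes "RoleManagement.ReadWrite.Directory"

$user = Get-MgUser -UserId "alice@contoso.com"
$role = Get-MgRoleManagementDirectoryRoleDefinition -Filter "displayName eq 'Exchange Administrator'"

New-MgRoleManagementDirectoryRoleEligibilityScheduleRequest -Action "adminAssign" `
    -PrincipalId $user.Id `
    -RoleDefinitionId $role.Id `
    -DirectoryScopeId "/" `
    -Justification "Eligible assignment for day-to-day Exchange administration" `
    -ScheduleInfo @{
        startDateTime = (Get-Date).ToUniversalTime()
        expiration    = @{ type = "afterDuration"; duration = "P180D" }
    }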

Once you have made a user eligible for a role, that user can activate it. To do this, they simply go to the Azure PIM section of the Azure Portal and pick My Roles. The user can then select the appropriate role to activate in order to perform the desired administrative task.

When you activate a role, you will be prompted to enter a reason why you need to elevate your privileges. This is generally good practice, as it allows the people reviewing escalations to understand why certain high privileges had to be used to perform a task.
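For completeness, here’s what a scripted self-activation could look like with the same Microsoft Graph PowerShell SDK (again a hedged sketch; the role, justification and four-hour window are placeholders):

# Minimal sketch: self-activate an eligible Azure AD role for a few hours.
# Role name, justification and duration are placeholders.
Connect-MgGraph -Scopes "RoleManagement.ReadWrite.Directory"

$me   = Get-MgUser -UserId (Get-MgContext).Account
$role = Get-MgRoleManagementDirectoryRoleDefinition -Filter "displayName eq 'Exchange Administrator'"

New-MgRoleManagementDirectoryRoleAssignmentScheduleRequest -Action "selfActivate" `
    -PrincipalId $me.Id `
    -RoleDefinitionId $role.Id `
    -DirectoryScopeId "/" `
    -Justification "Investigating a mail flow issue" `
    -ScheduleInfo @{
        startDateTime = (Get-Date).ToUniversalTime()
        expiration    = @{ type = "afterDuration"; duration = "PT4H" }
    }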

Now that we have covered the basics to get you quickly started with PIM, we can dive a bit into how that experience can be customized. Here are the configuration options for an Azure AD role:

  • Maximum Activation duration: When the user activates a role, how long should it remain activated? A shorter duration is desirable for security reasons.
  • Notifications: Should an email be sent to an administrator when a role is activated? This can also give the admin a sense of whether an admin role is being abused, e.g. why use Global Admin when it’s not necessary to perform task X?
  • Incident/Request Ticket: You could require a support ticket number to be entered with each activation. This can be useful if you really need to close the loop on why elevation is required, e.g. a setting needs to be changed to apply a change request or resolve incident #####.
  • Multi-Factor Authentication: A user will need to be enrolled in Azure MFA in order to activate a role.
  • Require approval: When this is enabled, an admin will need to approve the activation for a user. This might be useful for high privilege roles such as Global Admin where you don’t want privileges to be abused. It also documents the full process better, e.g. user X asked for elevation and admin Y approved the request.

From an operational standpoint, you can also get alerts for the following things:

Out of those alerts, you can tune the thresholds to match your organization’s requirements:

  • For the “There are too many global administrators” alert, you can define the number of allowed global admins and the percentage of global admins relative to the total number of administrators configured.
  • For the “Roles are being activated too frequently” alert, you can specify the time window between activations and the number of acceptable activations during that period. This can be useful to flag users who simply activate all roles for no good reason, just to make sure they have the required privileges to perform a task.

You can also configure the Access review functionality, which specifies how you want to review the user activation history in order to keep a tight ship security-wise. You can configure the access review with the following settings:

  • Mail notifications: Send an email to advise an administrator to perform the access review
  • Reminders: Send an email to advise an administrator to complete an access review
  • Require reason for approval: Make sure the reviewer documents why an activation was approved/makes sense.
  • Access review duration: The number of days between each access review exercise (default is 30 days).

Once all this is configured, you can monitor the activation/usage of roles using the Directory Roles Audit History section:

I hope this quick introduction to Azure Privileged Identity Management was helpful. Should you have any questions about this, let me know!

Doing an Email Loop For 3rd Party DLP Inspection With O365 Exchange Online

While Exchange Online provides Data Loss Prevention (DLP) capabilities, it’s still possible to integrate it with a third-party DLP solution. The goal was to achieve this while still providing a solution that’s highly available and not dependent on on-premises resources. Here’s the configuration we picked to experiment with this.

The first thing that needed to be set up was a pair of VM appliances hosted in Azure. Those appliances receive emails from Exchange Online, inspect them, and send them back to Exchange Online. We could have opted for a configuration where the appliances send the emails out directly without involving Exchange again, but we wanted to maintain the IP/service reputation and message tracking capabilities provided by Exchange. I will not go into the details of creating those VMs as this is vendor-dependent. In our particular case, we uploaded a VM VHD to Azure Storage and then created an Azure Image from it. It was then fairly straightforward to deploy the VMs using an Azure Resource Manager template. The VMs are part of an Azure Availability Set and an Azure Network Security Group for traffic filtering.

Once the VM appliances had been deployed in Azure IaaS, an Azure Load Balancer was configured to provide high availability. This was achieved by first configuring a load balancing rule for SMTP (port 25).

Load Balancing Rule Configuration

Once that was completed, a health probe was created to monitor the availability of the backend VMs delivering the DLP service, again on port 25.

Health Probe Configuration
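For reference, here’s roughly what those two pieces look like when scripted with the Az PowerShell module (a sketch only; the load balancer name, resource group, probe timings and rule names are placeholders, not the values we actually used):

# Sketch: add a TCP/25 health probe and an SMTP load balancing rule to an existing
# Azure Load Balancer. All names and timings below are placeholders.
$lb = Get-AzLoadBalancer -Name "lb-dlp" -ResourceGroupName "rg-dlp"

$lb | Add-AzLoadBalancerProbeConfig -Name "probe-smtp" -Protocol Tcp -Port 25 `
        -IntervalInSeconds 15 -ProbeCount 2 | Out-Null

$frontend = Get-AzLoadBalancerFrontendIpConfig -LoadBalancer $lb
$backend  = Get-AzLoadBalancerBackendAddressPoolConfig -LoadBalancer $lb

$lb | Add-AzLoadBalancerRuleConfig -Name "rule-smtp" -Protocol Tcp `
        -FrontendPort 25 -BackendPort 25 `
        -FrontendIpConfiguration $frontend -BackendAddressPool $backend `
        -Probe (Get-AzLoadBalancerProbeConfig -LoadBalancer $lb -Name "probe-smtp") | Out-Null

# Push the updated configuration back to Azure
$lb | Set-AzLoadBalancer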

With the Azure portion of the setup completed, we move on to the Exchange Online configuration. First, we configured two connectors: one to send emails from Exchange Online to the DLP solution, and another to ensure that Exchange Online accepts the emails coming back from the DLP solution and then sends them to the Internet.

From Connector Configuration
To Connector Configuration
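Here’s approximately what creating those two connectors looks like in Exchange Online PowerShell (a hedged sketch: the connector names, the smart host FQDN and the sender IP of the appliance load balancer are placeholders, and your appliance vendor may require different TLS settings):

# Sketch: outbound connector that hands mail to the DLP appliances, and inbound
# connector that accepts it back. Names, FQDN and IP address are placeholders.
New-OutboundConnector -Name "Outbound to DLP" `
    -ConnectorType Partner `
    -UseMXRecord $false `
    -SmartHosts "dlp-smtp.contoso.com" `
    -TlsSettings EncryptionOnly

New-InboundConnector -Name "Inbound from DLP" `
    -ConnectorType OnPremises `
    -SenderDomains * `
    -SenderIPAddresses "203.0.113.25" `
    -RequireTls $true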

Once the connectors had been created, we needed a mail flow (transport) rule that sends all emails to the DLP solution while avoiding a mail loop when those emails come back from it. To achieve this, the rule’s action routes all emails to the connector responsible for sending them to the DLP solution, and an exception on the sender IP address was configured. In this particular case, we want all emails coming from the public IP of the load balancer in front of the DLP solution to be excluded from the rule, which prevents the mail loop.

Mail Flow Rule Configuration
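And the matching mail flow rule, again with placeholder names and the load balancer’s public IP standing in as the excluded sender address:

# Sketch: route all outbound mail through the DLP connector, except mail that is
# already coming back from the DLP appliances' public IP (placeholder values).
New-TransportRule -Name "Route outbound mail via DLP" `
    -SentToScope NotInOrganization `
    -RouteMessageOutboundConnector "Outbound to DLP" `
    -ExceptIfSenderIpRanges "203.0.113.25"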

With that configuration in place, we were able to successfully send the emails through the DLP servers and then back to Exchange Online to be sent on the Internet. We can confirm this by looking at the message trace in Exchange Online:

If you have any questions about this, let me know!

Windows 10 Fall Creators Update – Hyper-V VM Sharing – Because Sharing is Caring

With the latest Windows 10 Insider build, 16226, Microsoft introduced a new feature in Hyper-V to allow easy sharing of VMs amongst users. To share a VM, connect to its console in Hyper-V Manager and click the Share button as seen below:

You will then be prompted to select a location to save the compressed VM export/import file with the extension vmcz (VM Compressed Zip perhaps?). Depending on the VM size, that might take a little while. If you want to check what’s in that export file, you can simply append .zip to its file name and open it either with Explorer or your favorite archive handling application. As you can see below, the structure is fairly familiar to anyone using Hyper-V:
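If you’d rather peek at it from PowerShell, something along these lines does the trick (Demo.vmcz is a made-up file name):

# Copy the export with a .zip extension and look inside (Demo.vmcz is a placeholder name)
Copy-Item .\Demo.vmcz .\Demo.zip
Expand-Archive .\Demo.zip -DestinationPath .\Demo-contents
Get-ChildItem .\Demo-contents -Recurse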

You can find the VM hard disk drives (.vhd or .vhdx), its configuration file (.vmcx) and the run state file (.vmrs). So, there’s really no magic there! It creates a nice clean package of all the VM’s artifacts so you can easily send it around.

One thing I would like to see in future builds is the ability to trigger this process in other ways in Hyper-V Manager, as it’s oddly missing from the VM action pane on the right and from the VM’s right-click contextual menu. Maybe that’ll come in future builds. I also couldn’t find a way to trigger this in PowerShell yet.

Once your friend has the vmcz file in hand, they can simply double click on it to trigger the import. In the background, the utility C:\Program Files\Hyper-V\vmimport.exe is called. Unfortunately on my test laptop, the import process bombs out as seen below:

I suspect one only has to type a name for the VM that will be imported and click Import Virtual Machine. Those kinds of issues are to be expected when you’re in the Fast ring for the Insider builds! I’m sure this will turn out to be a useful feature for casual Hyper-V users.

Hardware Performance Monitoring Deep Dive using Intel Performance Counter Monitor

A little while ago, I had to take a deep dive into hardware statistics in order to troubleshoot a performance bottleneck. To do this, I ended up using Intel Performance Counter Monitor. As one cannot simply download pre-compiled binaries of those tools, I had to dust off my mad C++ compiler skills. You can find the binaries I compiled here as part of the latest GEM Automation release to save you some trouble. You’re welcome! 🙂

In order to use those tools, simply extract the GEM Automation archive to a local path on the machine you want to monitor, then change the current working directory to:

<extraction path>\InfrastructureTesting\IntelPerformanceCounterMonitor\x64\

Here’s an overview of each executable in the directory along with sample output. Do note that you can export the data to a CSV file for easier analysis; the output also seems to include more metrics that way.
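A typical capture session looks something like this (the extraction path is a placeholder, and the exact CSV switch varies between PCM builds, so check the tool’s own help output):

# Placeholder extraction path; adjust to wherever you extracted GEM Automation
Set-Location 'C:\Tools\GEMAutomation\InfrastructureTesting\IntelPerformanceCounterMonitor\x64'

# Run from an elevated prompt; the numeric argument is the sampling interval in seconds.
# Press Ctrl+C to stop. Tee-Object keeps a copy of the console output for later analysis;
# see pcm.exe's help for the CSV output switch supported by your build.
.\pcm.exe 1 | Tee-Object -FilePath .\pcm-capture.txt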

  • pcm.exe
    • Provides CPU statistics for both sockets and cores

 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)
 L3MISS: L3 cache misses 
 L2MISS: L2 cache misses (including other core's L2 cache *hits*) 
 L3HIT : L3 cache hit ratio (0.00-1.00)
 L2HIT : L2 cache hit ratio (0.00-1.00)
 L3MPI : number of L3 cache misses per instruction
 L2MPI : number of L2 cache misses per instruction
 READ  : bytes read from memory controller (in GBytes)
 WRITE : bytes written to memory controller (in GBytes)
 TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
 energy: Energy in Joules


 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP

   0    0     0.01   0.32   0.02    1.16      28 K     44 K    0.36    0.81    0.00    0.00     65
   1    0     0.00   0.23   0.01    1.16    3270       18 K    0.82    0.81    0.00    0.00     65
   2    0     0.00   0.20   0.01    1.16    5487       19 K    0.73    0.81    0.00    0.00     61
   3    0     0.00   0.22   0.01    1.16    4425       16 K    0.73    0.84    0.00    0.00     61
   4    0     0.01   0.51   0.01    1.16      47 K     82 K    0.42    0.69    0.00    0.00     69
   5    0     0.00   0.22   0.02    1.16      32 K     48 K    0.34    0.76    0.00    0.01     69
   6    0     0.00   0.23   0.01    1.16    5810       20 K    0.71    0.81    0.00    0.00     67
   7    0     0.00   0.26   0.01    1.16    5952       35 K    0.83    0.73    0.00    0.00     67
   8    0     0.00   0.24   0.01    1.16    9282       26 K    0.64    0.77    0.00    0.00     63
   9    0     0.00   0.20   0.01    1.16    2845       12 K    0.78    0.87    0.00    0.00     63
  10    0     0.01   0.53   0.02    1.16    8552       55 K    0.85    0.66    0.00    0.00     65
  11    0     0.01   0.82   0.01    1.16    7612       28 K    0.73    0.78    0.00    0.00     65
  12    0     0.01   0.39   0.02    1.16      13 K    112 K    0.88    0.59    0.00    0.01     62
  13    0     0.00   0.21   0.01    1.16    3111       17 K    0.82    0.83    0.00    0.00     62
  14    0     0.00   0.31   0.01    1.16      20 K     61 K    0.66    0.65    0.00    0.01     62
  15    0     0.00   0.25   0.01    1.16    2127       14 K    0.85    0.86    0.00    0.00     62
  16    0     0.00   0.22   0.01    1.16    3462       17 K    0.80    0.85    0.00    0.00     61
  17    0     0.00   0.33   0.01    1.16      32 K     65 K    0.50    0.64    0.00    0.01     61
  18    0     0.00   0.21   0.01    1.16    3476       13 K    0.74    0.88    0.00    0.00     62
  19    0     0.00   0.23   0.01    1.16    2169       11 K    0.81    0.89    0.00    0.00     63
  20    1     0.04   0.60   0.06    1.16     123 K    515 K    0.76    0.62    0.00    0.01     60
  21    1     0.00   0.21   0.01    1.16    3878       39 K    0.90    0.73    0.00    0.01     60
  22    1     0.01   0.39   0.03    1.16      41 K    259 K    0.84    0.61    0.00    0.01     58
  23    1     0.00   0.18   0.01    1.16    4880       33 K    0.85    0.75    0.00    0.01     58
  24    1     0.02   1.07   0.02    1.16      24 K    207 K    0.88    0.79    0.00    0.00     67
  25    1     0.00   0.20   0.01    1.16    4392       30 K    0.86    0.76    0.00    0.01     67
  26    1     0.01   0.46   0.02    1.16      25 K    133 K    0.81    0.58    0.00    0.01     61
  27    1     0.00   0.30   0.01    1.16      42 K    134 K    0.68    0.51    0.00    0.01     61
  28    1     0.01   0.35   0.02    1.16      13 K    106 K    0.87    0.61    0.00    0.01     63
  29    1     0.00   0.21   0.01    1.16    9944       39 K    0.75    0.73    0.00    0.01     63
  30    1     0.00   0.24   0.01    1.16    5716       59 K    0.90    0.67    0.00    0.01     61
  31    1     0.01   0.30   0.02    1.16      16 K    106 K    0.84    0.59    0.00    0.01     61
  32    1     0.00   0.28   0.01    1.16    9956       74 K    0.87    0.64    0.00    0.01     64
  33    1     0.00   0.28   0.01    1.16      38 K     78 K    0.51    0.58    0.01    0.01     64
  34    1     0.00   0.30   0.01    1.16    9211       85 K    0.89    0.62    0.00    0.01     65
  35    1     0.01   0.39   0.01    1.16      10 K     81 K    0.87    0.64    0.00    0.01     65
  36    1     0.00   0.30   0.01    1.16    7509       83 K    0.91    0.63    0.00    0.01     59
  37    1     0.00   0.20   0.01    1.16    5518       22 K    0.75    0.82    0.00    0.01     59
  38    1     0.00   0.27   0.01    1.16    9772       74 K    0.87    0.64    0.00    0.01     63
  39    1     0.00   0.29   0.01    1.16      10 K     58 K    0.82    0.68    0.00    0.01     63
---------------------------------------------------------------------------------------------------------------
 SKT    0     0.00   0.33   0.01    1.16     243 K    724 K    0.66    0.75    0.00    0.00     60
 SKT    1     0.01   0.41   0.02    1.16     417 K   2225 K    0.81    0.66    0.00    0.01     59
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.01   0.38   0.01    1.16     661 K   2949 K    0.78    0.69    0.00    0.01     N/A

 Instructions retired:  523 M ; Active cycles: 1382 M ; Time (TSC): 2508 Mticks ; C0 (active,non-halted) core residency: 1.19 %

 C1 core residency: 98.81 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %;
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : 0.76 => corresponds to 18.93 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.01 => corresponds to 0.26 % core utilization over time interval

Intel(r) QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):

              | 
---------------------------------------------------------------------------------------------------------------
 SKT    0     |  
 SKT    1     |  
---------------------------------------------------------------------------------------------------------------
Total QPI incoming data traffic:    0       QPI data traffic/Memory controller traffic: 0.00

Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

              | 
---------------------------------------------------------------------------------------------------------------
 SKT    0     |  
 SKT    1     |  
---------------------------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic:    0  

          |  READ |  WRITE | CPU energy | DIMM energy
---------------------------------------------------------------------------------------------------------------
 SKT   0     0.09     0.06      37.51      16.17
 SKT   1     0.07     0.05      38.45      13.03
---------------------------------------------------------------------------------------------------------------
       *     0.16     0.11      75.97      29.20

  • pcm-core.exe
    • Provides detailed core level information
Time elapsed: 1004 ms
txn_rate: 1

Core | IPC | Instructions  |  Cycles  | Event0  | Event1  | Event2  | Event3 
   0   0.44         102 M      232 M     301 K     768 K      91 K     830 K
   1   1.04         137 M      131 M     140 K     336 K      12 K     918 K
   2   0.85         194 M      228 M     247 K     569 K      82 K     613 K
   3   0.25        7377 K       29 M      17 K      31 K    4364        93 K
   4   0.66          99 M      149 M     148 K     373 K      49 K     407 K
   5   0.61         169 M      275 M     163 K     770 K      94 K    1105 K
   6   0.89         186 M      209 M     258 K     399 K      55 K     635 K
   7   0.48         101 M      211 M     200 K     641 K      64 K     670 K
   8   0.50          88 M      176 M     177 K     547 K      73 K     510 K
   9   0.19        4422 K       22 M    4572        20 K    3379        83 K
  10   0.71         124 M      175 M     167 K     389 K      49 K     388 K
  11   0.24        5738 K       24 M    6407        24 K    4258        90 K
  12   0.67          58 M       87 M      73 K     184 K      23 K     249 K
  13   0.90         161 M      180 M     160 K     308 K      80 K     603 K
  14   0.71          49 M       69 M      70 K     100 K      16 K     193 K
  15   0.29          16 M       56 M      37 K      51 K      37 K     241 K
  16   0.73          46 M       63 M      40 K      80 K      25 K     300 K
  17   0.28        6441 K       23 M    6106        22 K    4619       104 K
  18   0.27        9346 K       34 M      28 K      52 K    8449       120 K
  19   0.46         130 M      285 M     358 K     914 K      95 K     874 K
  20   0.65         807 M     1240 M     502 K    4783 K     785 K    5832 K
  21   0.16        4350 K       26 M    4635        74 K    3481        84 K
  22   0.53         123 M      232 M     207 K     710 K     131 K     738 K
  23   0.17        4402 K       25 M    5703        32 K    4500        93 K
  24   0.50          87 M      175 M     188 K     617 K      37 K     524 K
  25   0.18        4483 K       24 M    5430        24 K    4040        90 K
  26   0.56         200 M      360 M     250 K    1192 K      84 K    3315 K
  27   1.45         958 M      661 M     434 K     920 K      50 K      13 M
  28   0.31          17 M       56 M      57 K     173 K      17 K     178 K
  29   1.43         888 M      622 M     457 K     622 K      38 K    2603 K
  30   0.41          29 M       72 M      68 K     228 K      25 K     233 K
  31   0.56          68 M      122 M     159 K     287 K      20 K     544 K
  32   0.39          23 M       62 M      59 K     164 K      19 K     222 K
  33   0.31        8809 K       28 M      26 K      49 K    6731       119 K
  34   0.61         156 M      255 M     146 K     923 K      70 K     740 K
  35   0.43          22 M       51 M      58 K     114 K      12 K     180 K
  36   0.74         737 M     1001 M     177 K    3782 K     730 K    3088 K
  37   0.35          29 M       86 M      30 K     157 K      13 K    2449 K
  38   0.39          16 M       42 M      16 K     112 K      17 K     133 K
  39   0.69         664 M      961 M     115 K    3848 K     722 K    2978 K
-------------------------------------------------------------------------------------------------------------------
   *   0.75        6556 M     8780 M    5584 K      25 M    3673 K      46 M

  • pcm-memory.exe
    • Provides socket and channel level read/write throughput information
Time elapsed: 1000 ms
Called sleep function for 1000 ms
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):    49.91 --||-- Mem Ch  0: Reads (MB/s):     3.42 --|
|--            Writes(MB/s):    43.65 --||--            Writes(MB/s):     1.13 --|
|-- Mem Ch  1: Reads (MB/s):    13.95 --||-- Mem Ch  1: Reads (MB/s):     3.37 --|
|--            Writes(MB/s):     5.32 --||--            Writes(MB/s):     1.15 --|
|-- Mem Ch  2: Reads (MB/s):    10.08 --||-- Mem Ch  2: Reads (MB/s):    46.07 --|
|--            Writes(MB/s):     3.59 --||--            Writes(MB/s):    42.18 --|
|-- Mem Ch  3: Reads (MB/s):    13.52 --||-- Mem Ch  3: Reads (MB/s):     3.31 --|
|--            Writes(MB/s):     4.43 --||--            Writes(MB/s):     1.10 --|
|-- NODE 0 Mem Read (MB/s) :    87.47 --||-- NODE 1 Mem Read (MB/s) :    56.17 --|
|-- NODE 0 Mem Write(MB/s) :    56.98 --||-- NODE 1 Mem Write(MB/s) :    45.56 --|
|-- NODE 0 P. Write (T/s):     624374 --||-- NODE 1 P. Write (T/s):     622531 --|
|-- NODE 0 Memory (MB/s):      144.45 --||-- NODE 1 Memory (MB/s):      101.74 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):    143.64                  --|
|--                  System Write Throughput(MB/s):    102.54                  --|
|--                 System Memory Throughput(MB/s):    246.19                  --|
|---------------------------------------||---------------------------------------|
  • pcm-msr.exe
    • Not entirely sure what this does…
  • pcm-numa.exe
    • Provides NUMA memory access information
Time elapsed: 1014 ms
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses 
   0   0.33         15 M       47 M        22 K              3620                
   1   0.23       4114 K       17 M      4843                1060                
   2   0.20       5205 K       25 M      6682                4486                
   3   0.23       6016 K       26 M      1369                1070                
   4   0.80         22 M       28 M      4045                1435                
   5   0.23       9756 K       42 M        11 K              6362                
   6   0.22       5305 K       24 M      4357                1152                
   7   0.56         25 M       44 M        57 K                10 K              
   8   0.24       5380 K       22 M      3655                1807                
   9   0.21       4525 K       21 M      2075                1219                
  10   0.53         20 M       38 M      6579                2557                
  11   0.22       4857 K       22 M      4607                2460                
  12   0.38         16 M       44 M        25 K              2940                
  13   1.42         70 M       49 M      5793                2280                
  14   0.24       5952 K       24 M      2233                1007                
  15   0.25       5551 K       22 M      2150                 835                
  16   0.31       8273 K       26 M        22 K              1730                
  17   0.23       3939 K       17 M      1309                 592                
  18   0.20       4401 K       21 M      3583                1833                
  19   0.27       5272 K       19 M        10 K              1558                
  20   0.55        102 M      188 M        76 K                69 K              
  21   0.20       4772 K       24 M      1801                1430                
  22   0.50         68 M      137 M        89 K                46 K              
  23   0.25       7923 K       31 M      8629                  17 K              
  24   0.35         17 M       51 M        38 K              7632                
  25   0.19       5416 K       27 M      3670                1265                
  26   0.34         16 M       48 M        24 K              9108                
  27   0.31         12 M       40 M        21 K                34 K              
  28   0.34         14 M       43 M      7770                3473                
  29   0.24       7116 K       30 M      6161                1686                
  30   0.33         13 M       41 M      9403                3111                
  31   0.32         12 M       40 M        13 K              2672                
  32   0.30         11 M       37 M        12 K              1773                
  33   0.32         10 M       31 M        77 K              2129                
  34   0.32         11 M       36 M      5342                2449                
  35   0.24       6862 K       28 M      4013                5977                
  36   0.35         12 M       36 M      7212                1994                
  37   0.23       5039 K       22 M      1721                1333                
  38   0.25       7346 K       29 M      5205                1658                
  39   0.26       7379 K       28 M      8195                4296                
-------------------------------------------------------------------------------------------------------------------
   *   0.39        606 M     1542 M       625 K               270 K              

  • pcm-pcie.exe
    • Provides PCIe link usage information (useful to determine if you hit a PCIe bottleneck)
Skt | PCIeRdCur | PCIeNSRd  | PCIeWiLF | PCIeItoM | PCIeNSWr | PCIeNSWrF
 0       759 K         0           0        612 K        0          0  
 1         0           0           0          0          0          0  
-----------------------------------------------------------------------------------
 *        759 K         0           0        612 K        0          0  
  • pcm-power.exe
    • Provides memory power consumption statistics

----------------------------------------------------------------------------------------------
Time elapsed: 1000 ms
Called sleep function for 1000 ms
S0CH0; DRAMClocks: 933924607; Rank0 CKE Off Residency: 0.02%; Rank0 CKE Off Average Cycles: 159520; Rank0 Cycles per transition: 933924607
S0CH0; DRAMClocks: 933924607; Rank1 CKE Off Residency: 0.02%; Rank1 CKE Off Average Cycles: 157305; Rank1 Cycles per transition: 933924607
S0CH1; DRAMClocks: 933925096; Rank0 CKE Off Residency: 0.02%; Rank0 CKE Off Average Cycles: 153645; Rank0 Cycles per transition: 933925096
S0CH1; DRAMClocks: 933925096; Rank1 CKE Off Residency: 0.02%; Rank1 CKE Off Average Cycles: 151533; Rank1 Cycles per transition: 933925096
S0CH2; DRAMClocks: 933925354; Rank0 CKE Off Residency: 0.02%; Rank0 CKE Off Average Cycles: 149329; Rank0 Cycles per transition: 933925354
S0CH2; DRAMClocks: 933925354; Rank1 CKE Off Residency: 0.02%; Rank1 CKE Off Average Cycles: 148905; Rank1 Cycles per transition: 933925354
S0CH3; DRAMClocks: 933924943; Rank0 CKE Off Residency: 0.02%; Rank0 CKE Off Average Cycles: 147401; Rank0 Cycles per transition: 933924943
S0CH3; DRAMClocks: 933924943; Rank1 CKE Off Residency: 0.02%; Rank1 CKE Off Average Cycles: 145298; Rank1 Cycles per transition: 933924943
S0; PCUClocks: 800627536; Freq band 0/1/2 cycles: 99.84%; 99.84%; 0.00%
S0; Consumed energy units: 2457737; Consumed Joules: 37.50; Watts: 37.50; Thermal headroom below TjMax: 60
S0; Consumed DRAM energy units: 1061128; Consumed DRAM Joules: 16.19; DRAM Watts: 16.19
S1CH0; DRAMClocks: 933902607; Rank0 CKE Off Residency: 0.02%; Rank0 CKE Off Average Cycles: 164508; Rank0 Cycles per transition: 933902607
S1CH0; DRAMClocks: 933902607; Rank1 CKE Off Residency: 0.02%; Rank1 CKE Off Average Cycles: 164626; Rank1 Cycles per transition: 933902607
S1CH1; DRAMClocks: 933901094; Rank0 CKE Off Residency: 0.02%; Rank0 CKE Off Average Cycles: 166178; Rank0 Cycles per transition: 933901094
S1CH1; DRAMClocks: 933901094; Rank1 CKE Off Residency: 0.02%; Rank1 CKE Off Average Cycles: 166269; Rank1 Cycles per transition: 933901094
S1CH2; DRAMClocks: 933900756; Rank0 CKE Off Residency: 0.02%; Rank0 CKE Off Average Cycles: 166668; Rank0 Cycles per transition: 933900756
S1CH2; DRAMClocks: 933900756; Rank1 CKE Off Residency: 0.02%; Rank1 CKE Off Average Cycles: 166654; Rank1 Cycles per transition: 933900756
S1CH3; DRAMClocks: 933900898; Rank0 CKE Off Residency: 0.02%; Rank0 CKE Off Average Cycles: 166572; Rank0 Cycles per transition: 933900898
S1CH3; DRAMClocks: 933900898; Rank1 CKE Off Residency: 0.02%; Rank1 CKE Off Average Cycles: 166625; Rank1 Cycles per transition: 933900898
S1; PCUClocks: 800628916; Freq band 0/1/2 cycles: 100.00%; 100.00%; 100.00%
S1; Consumed energy units: 2521661; Consumed Joules: 38.48; Watts: 38.48; Thermal headroom below TjMax: 56
S1; Consumed DRAM energy units: 854553; Consumed DRAM Joules: 13.04; DRAM Watts: 13.04

Measuring GPU Utilization in Remote Desktop Services

I recently spent some time experimenting with GPU Discrete Device Assignment in Azure using the NV* series of VMs. As we noticed that Internet Explorer was consuming quite a bit of CPU on our Remote Desktop Services session hosts, I wondered how much of an impact offloading graphics to a GPU would have on the CPU. We did experiments with Windows Server 2012 R2 and Windows Server 2016. While Windows Server 2012 R2 does deliver some level of hardware acceleration for graphics, Windows Server 2016 provides a more complete experience through better support for GPUs in an RDP session.

In order to enable hardware acceleration for RDP, you must do the following in your Azure NV* series VM:

  1. Download and install the latest driver recommended by Microsoft/NVidia from here
  2. Enable the Group Policy setting Administrative Templates\Windows Components\Remote Desktop Services\Remote Desktop Session Host\Remote Session Environment\Use the hardware default graphics adapter for all Remote Desktop Services sessions as shown below:
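If you’d rather script that setting, for example when building session hosts through automation, the registry value generally associated with this policy can be set directly. Treat the value name as an assumption and verify it against the policy’s ADMX template on your build:

# Assumed registry equivalent of the RDS hardware graphics adapter policy;
# verify against the ADMX template for your Windows build before relying on it.
$path = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows NT\Terminal Services'
New-Item -Path $path -Force | Out-Null
New-ItemProperty -Path $path -Name 'bEnumerateHWBeforeSW' -PropertyType DWord -Value 1 -Force | Out-Null
gpupdate /force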

To validate the acceleration, I used a couple of tools to generate and measure the GPU load. For load generation I used the following:

  • Island demo from Nvidia which is available for download here.
    • This scenario worked fine in both Windows Server 2012 R2 and Windows Server 2016
    • Here’s what it looks like when you run this demo (don’t mind the GPU information displayed, that was from my workstation, not from the Azure NV* VM):
  • Microsoft Fish Tank page which leverages WebGL in the browser, which is in turn accelerated by the GPU when possible
    • This proved to be the scenario that differentiated Windows Server 2016 from Windows Server 2012 R2. Only under Windows Server 2016 could a high frame rate and low CPU utilization be achieved. When this demo runs using only the software renderer, I observed CPU utilization close to 100% on a fairly beefy NV6 VM that has 6 cores, and that was just by running a single instance of the test.
    • Here’s what FishGL looks like:

To measure the GPU utilization, I ended up using the following tools: Windows Performance Recorder/Analyzer and Process Explorer.

In order to do a capture with Windows Performance Recorder, make sure that GPU activity is selected under the profiles to be recorded:

Here’s a recorded trace of the GPU utilization from the Azure VM while running FishGL in Internet Explorer that’s being visualized in Windows Performance Analyzer:

As you can see in the WPA screenshot above, quite a few processes can take advantage of the GPU acceleration.

Here’s what it looks like in Process Explorer when you’re doing live monitoring. As shown below, you can see which process is consuming GPU resources. In this particular screenshot, you can see what Internet Explorer consumes while running FishGL on my workstation.

Windows Server 2016 takes great advantage of an assigned GPU to offload compute intensive rendering tasks. Hopefully this article helped you get things started!

Distributed Universal Memory for Windows

*Disclaimer* This is only an idea I’ve been toying with. It doesn’t represent in any way, shape or form future Microsoft plans with regard to memory/storage management. This page will evolve over time as the idea is refined and fleshed out.

**Last Updated 2017-03-23**

The general idea behind Distributed Universal Memory is to have a common memory management API that would achieve the following:
General
  • Abstract the application from the memory medium required to maintain application state, whether it’s volatile or permanent
  • Allow the application to express memory behavior requirements and not worry about the storage medium used to meet them (an illustrative sketch follows this list)
  • Support legacy constructs for backward compatibility
  • Enable new capabilities for legacy applications without code change
  • Give modern applications a simple surface to persist data
  • Enable scale-out applications to potentially use a single address space
  • Could potentially move towards a more microservice-based approach instead of the current monolithic code base
  • Could easily leverage advances in hardware development such as disaggregation of compute and memory, and the use of specialized hardware such as FPGAs or GPUs to accelerate certain memory handling operations
  • Could be ported/backported to further increase the reach/integration capabilities. This memory management subsystem could be cleanly decoupled from the underlying operating system.
  • Allow the data to be optimally placed for performance, availability and ultimately cost
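Since this is only a concept, there is no real API to show. Purely as an illustration, here’s the kind of behavior declaration an application might hand to such a memory manager instead of choosing a medium itself:

# Purely illustrative (no such API exists today): a declarative way an application
# could express memory behavior requirements and let the fabric pick the medium.
$memoryRequirements = @{
    Durability  = 'Persistent'    # state must survive host restarts
    Replicas    = 3               # copies spread across distinct failure domains
    MaxLatency  = '50us'          # target access latency
    Consistency = 'Synchronous'   # replicate before acknowledging writes
    Tiering     = 'Auto'          # let the fabric choose RAM/NVMe/remote/cloud
}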

Availability Management

  • Process memory can be replicated either systematically or on demand
  • This would allow existing process memory to be migrated from one operating system instance to another transparently.
  • This could offer higher resiliency for process execution in the event of a host failure
  • This could also allow some OS components to be updated while higher level processes keep going. (i.e. redirected memory IO)
Performance Management
  • Required medium to achieve performance could be selected automatically using different mechanisms (MRU/LRU, machine learning, etc.)
  • Memory performance can be expressed explicitly by the application
    • By expressing its need, it would be easier to characterize/model/size the required system to support the application
    • Modern applications could easily specify how each piece of data it interacts with should be performing
  • Could provide multiple copies of the same data element for compute locality purposes. i.e. Distributed read cache
    • This distributed read cache could be shared between client and server processes if desired. This would enable a single cache mechanism independent of the client/process accessing it.
Capacity Management
  • Can adjust capacity management techniques depending on performance and availability requirements
  • For instance, if data is rarely used by the application, several data reduction techniques could be applied such as deduplication, compression and/or erasure coding
  • If data access doesn’t require redundancy or locality and can tolerate RDMA latency, the data could be spread evenly across the Distributed Universal Memory fabric

High Level Cluster View

Components

Here’s a high-level diagram of what it might look like:

Let’s go over some of the main components.

Data Access Manager

The Data Access Manager is the primary interface layer to access data. The legacy API would sit on top of this layer in order to properly abstract the underlying subsystems in play.

  • Transport Manager
    • This subsystem is responsible for pushing/pulling data to and from remote hosts. All inter-node data transfers would occur over RDMA to minimize the overhead of copying data back and forth between nodes.
  • Addressing Manager
    • This would be responsible for providing a universal memory address for the data that’s independent of the storage medium and cluster nodes.

Data Availability Manager

This component would be responsible for ensuring the proper levels of data availability and resiliency are enforced as per the policies defined in the system. It would be made of the following subsystems:

  • Availability Service Level Manager
    • The Availability Service Level Manager’s responsibility is to ensure the overall availability of data. For instance, it would act as the orchestrator that triggers the Replication Manager to ensure the data is meeting its availability objective.
  • Replication Manager
    • The Replication Manager is responsible for enforcing the right level of data redundancy across local and remote memory/storage devices. For instance, if 3 copies of the data of a particular process/service/file/etc. must be maintained across 3 different failure domains, the Replication Manager is responsible for ensuring this is the case as per the policy defined for the application/data.
  • Data History Manager
    • This subsystem ensures that the appropriate point-in-time copies of the data are maintained. Those data copies could be maintained in the system itself by using the appropriate storage medium, or they could be handed off to a third-party process if necessary (i.e. a standard backup solution). The API would provide a standard way to perform data recovery operations.

Data Capacity Manager

The Data Capacity Manager is responsible for ensuring enough capacity of the appropriate memory/storage type is available for applications, and also for applying the right capacity optimization techniques to make the most of the physical storage capacity available. The following methods could be used:

  • Compression
  • Deduplication
  • Erasure Coding

Data Performance Manager

The Data Performance Manager is responsible for ensuring that each application can access each piece of data at the appropriate performance level. This is accomplished using the following subsystems:

  • Latency Manager
    • This is responsible for placing the data on the right medium to ensure that each data element can be accessed at the right latency level. This can be determined either by pre-defined policy or by heuristics/machine learning that detect data access patterns beyond LRU/MRU methods.
    • The Latency Manager could also monitor whether a local process tends to access data that’s mostly remote. If that’s the case, instead of repeatedly incurring the network access penalty, the process could simply be moved to the remote host for better performance through data locality.
  • Service Level Manager
    • The Service Level Manager is responsible for managing the various applications’ expectations in regards to performance.
    • The Service Level Manager could optimize data persistence in order to meet its objective. For example, if the local non-volatile storage response time is unacceptable, it could choose to persist the data remotely and then trigger the Replication Manager to bring a copy of the data back locally.
  • Data Variation Manager
    • A subsystem could be conceived to persist a transformed state of the data. For example, if there’s an aggregation on a dataset, it could be persisted and linked to the original data. If the original data changes, the dependent aggregation variations could either be invalidated or updated as needed.

Data Security Manager

  • Access Control Manager
    • This would create a hard security boundary between processes and ensure only authorized access is granted, independently of the storage mechanism/medium.
  • Encryption Manager
    • This would be responsible for the encryption of the data if required as per a defined security policy.
  • Auditing Manager
    • This would audit data access as per a specific security policy. The events could be forwarded to a centralized logging solution for further analysis and event correlation.
    • Data accesses could be logged in a highly optimized graph database to:
      • Build a map of what data is accessed by processes
      • Build a temporal map of how the processes access data
  • Malware Prevention Manager
    • Data access patterns could be detected in-line by this subsystem. For instance, it could notice that a process is trying to access credit card number data based on patterns such as regexes. Third-party anti-virus solutions would also be able to extend the functionality at that layer.

Legacy Construct Emulator

The goal of the Legacy Construct Emulator is to provide legacy/existing applications the same storage constructs they are using at this point in time, to ensure backward compatibility. Here are a few examples of constructs that would be emulated under the Distributed Universal Memory model:

  • Block Emulator
    • Emulates the simplest construct, used to support the higher-level construct of the disk emulator
  • Disk Emulator
    • Based on the block emulator, simulates the communication interface of a disk device
  • File Emulator
    • For the file emulator, it could work in a couple of ways.
      • If the application only needs to have a file handle to perform IO and is fairly agnostic of the underlying file system, the application could simply get a file handle it can perform IO on.
      • Otherwise, it could get that through the file system that’s layered on top of a volume that makes use of the disk emulator.
  • Volatile Memory Emulator
    • The goal would be to provide the necessary construct for the OS/application to store state data that might typically be stored in RAM.

One of the key things to note here is that even though all those legacy constructs are provided, the Distributed Universal Memory model has the flexibility to persist the data as it sees fit. For instance, even though the application might think it’s persisting data to volatile memory, the data might be persisted to an NVMe device in practice. The same principle would apply to file data; a file block might actually be persisted to RAM (similar to a block cache) that’s then replicated to multiple nodes synchronously to ensure availability, all of this potentially without the application being aware of it.

Metrics Manager

The Metrics Manager’s role is to capture/log/forward all data points in the system. Here’s an idea of what could be collected:

  • Availability Metrics
    • Replication latency for synchronous replication
    • Asynchronous backlog size
  • Capacity Metrics
    • Capacity used/free
    • Deduplication and compression ratios
    • Capacity optimization strategy overhead
  • Performance Metrics
    • Latency
    • Throughput (IOPS, Bytes/second, etc.)
    • Bandwidth consumed
    • IO Type Ratio (Read/Write)
    • Latency penalty due to SLA per application/process
  • Reliability Metrics
    • Device error rate
    • Operation error rate
  • Security Metrics
    • Encryption overhead

High Level Memory Allocation Process

More details coming soon.

Potential Applications

  • Application high availability
    • You could decide to synchronously replicate a process’s memory to another host and simply start the application binary on the failover host in the event that the primary host fails
  • Bring server cached data closer to the client
    • One could maintain a distributed coherent cache between servers and client computers
  • Move processes closer to data
    • Instead of having a process try to access data across the network, why not move the process to where the data is?
  • User State Mobility
    • User State Migration
      • A user state could move freely between a laptop, a desktop and a server (VDI or session host) depending on what the user requires.
    • Remote Desktop Service Session Live Migration
      • As the user session state memory is essentially virtualized from the host executing the session, it can be freely moved from one host to another to allow zero impact RDS Session Host maintenance.
  • Decouple OS maintenance/upgrades from the application
    • For instance, when the OS needs to be patched, one could simply move the process memory and execution to another host. This would avoid penalties such as buffer cache rebuilds in SQL Server, which can trigger a high number of IOPS on a disk subsystem in order to repopulate the cache based on popular data. For systems with a large amount of memory, this can be fairly problematic.
  • Have memory/storage that spans to the cloud transparently
    • Under this model it would be fairly straightforward to implement a cloud tier for cold data
  • Option to preserve application state on application upgrades/patches
    • One could swap the binaries to run the process while maintaining process state in memory
  • Provide object storage
    • One could layer an object storage service on top of this to support Amazon S3/Azure Storage semantics. This could be implemented on top of the native API if desired.
  • Provide distributed cache
    • One could layer distributed cache mechanisms such as Redis using the native Distributed Universal Memory API to facilitate porting of applications to this new mechanism
  • Facilitate application scale out
    • For instance, one could envision a SQL Server instance being scaled out using this mechanism by spreading worker threads across multiple hosts that share a common coordinated address space.
  • More to come…

AMD Naples – More than an Intel challenger for Storage Spaces Direct?

With the recent announcement of the new AMD “Naples” processor, a few things have changed in regards to options for Storage Spaces Direct. Let’s have a look at what this new CPU is about.

A few key points:

  • Between 16 cores/32 threads and 32 cores/64 threads per socket, or up to 64 cores/128 threads in a 2-socket server
    • Intel Skylake is “only” expected to have 28 cores per socket (** Update 2017-03-19 ** There are now rumors of 32-core Skylake E5 v5 CPUs)
  • 2TB of RAM per socket
  • 8 channel DDR4
    • Bandwidth is expected to be in the 170GB/s range
    • Intel Skylake is expected to only have 6 channel memory
  • 128 PCIe 3.0 lanes PER socket
    • In a 2-socket configuration, “only” 64 lanes per socket will be available, as the other 64 are used for socket-to-socket transport
    • In other words for S2D, this means a single socket can properly support 2 x 100GbE ports AND 24 NVMe drives without any sorcery like PCIe switches in between
    • That’s roughly 126GB/s of PCIe bandwidth, not too shabby
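Here’s the quick math behind that figure (PCIe 3.0 at 8 GT/s per lane with 128b/130b encoding, per direction):

# Back-of-the-envelope check of the ~126GB/s figure for 128 PCIe 3.0 lanes (per direction)
$gtPerSec  = 8                             # PCIe 3.0 transfer rate per lane (GT/s)
$encoding  = 128 / 130                     # 128b/130b encoding efficiency
$gbPerLane = $gtPerSec * $encoding / 8     # ~0.985 GB/s per lane
'{0:N1} GB/s' -f ($gbPerLane * 128)        # => 126.0 GB/s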

Here’s an example of what it looks like in the flesh:

AMD Naples “Speedway” platform internals

With that kind of horsepower, you might be able to start thinking about having a few million IOPS per S2D node if Microsoft can manage to scale up to that level. Scale that out to the supported 16 nodes in a cluster and now we have a party going! Personally, I think going with a single-socket configuration with 32 cores would be a fine sizing/configuration for S2D. It would also give you a server failure domain that’s reasonable. Furthermore, from a licensing standpoint, a 64-core Datacenter Edition server is rather pricey to say the least… You might want to go with a variant with fewer cores if your workload allows it. The IO balance being provided by this new AMD CPU is much better than what’s being provided by Intel at this point in time. That may change if Intel decides to go with PCIe 4.0, but it doesn’t look like we’ll see that any time soon.

If VDI/RDS session hosts are your thing, perhaps taking advantage of those extra PCIe lanes for GPUs will be a nice advantage. Top that with a crazy core/thread count and you would be able to drive some pretty demanding user workloads without overcommitting your CPUs too much, while also having access to tons of memory.

I’ll definitely take a look at AMD systems when Naples is coming out later this year. A little competition in the server CPU market is long overdue! Hopefully AMD will price this one right and reliability will be what we expect for a server. Since it’s a new CPU architecture, it might take a little while before software manufacturers support and optimize for this chip. With the right demand from customer, that might accelerate the process!