Generate Azure Firewall Rules for Office 365 Traffic

While working with Azure Firewall, I wanted to take advantage of its FQDN filtering capabilities to control traffic to Office 365. As the list of FQDNs required to allow that traffic can be quite large, especially for the “Common” service area’s endpoints, I wrote a little PowerShell function to generate the appropriate ARM template code for the application rule. Here’s what the function looks like:

Function Get-Office365AzureFirewallApplicationRule($serviceArea,$ruleName)
{
    #Get the latest endpoints information from Microsoft; the endpoints web service expects a GUID as the client request id, so generate a fresh one per call
    $office365IPs=Invoke-RestMethod -Uri "https://endpoints.office.com/endpoints/worldwide?clientrequestid=$([guid]::NewGuid().Guid)"

    #Capture the required FQDNs for the service area (loose wildcard match on endpoints whose TCP ports include 80 and/or 443)
    $serviceAreaUrls=($office365IPs | Where-Object {$_.ServiceArea -eq $serviceArea -and ($_.tcpPorts -ilike "*80*" -or $_.tcpPorts -ilike "*443*") }).urls | Sort-Object -Unique

    #Generate the Azure Firewall application rule
    $azFwRule='{
                            "name": "'+ $ruleName +'",
                            "protocols": [
                                {
                                    "port": "80",
                                    "protocolType": "http"
                                },
                                {
                                    "port": "443",
                                    "protocolType": "https"
                                }
                            ],
                            "TargetFqdns": ' + ($serviceAreaUrls | ConvertTo-Json)
    $azFwRule+='}'

    #Output the generated rule to a JSON file
    $azFwRule | Set-Content -Path ".\ARM\Networking\$ruleName.json" -Force
}

Here are a couple of examples of calling this function, along with their outputs:

Exchange Online Rule

Get-Office365AzureFirewallApplicationRule -serviceArea Exchange -ruleName "net-azfw-rul-application-allow-http-ExchangeOnline"

Output

{
    "name": "net-azfw-rul-application-allow-http-ExchangeOnline",
    "protocols": [
        {
            "port": "80",
            "protocolType": "http"
        },
        {
            "port": "443",
            "protocolType": "https"
        }
    ],
    "TargetFqdns": [
        "*.outlook.com",
        "*.outlook.office.com",
        "*.protection.outlook.com",
        "*.store.core.windows.net",
        "asl.configure.office.com",
        "attachments.office.net",
        "domains.live.com",
        "mshrcstorageprod.blob.core.windows.net",
        "outlook.office.com",
        "outlook.office365.com",
        "r1.res.office365.com",
        "r3.res.office365.com",
        "r4.res.office365.com",
        "tds.configure.office.com"
    ]
}

SharePoint Online Rule

Get-Office365AzureFirewallApplicationRule -serviceArea SharePoint -ruleName "net-azfw-rul-application-allow-http-SharePointOnline"

Output

{
    "name": "net-azfw-rul-application-allow-http-SharePointOnline",
    "protocols": [
        {
            "port": "80",
            "protocolType": "http"
        },
        {
            "port": "443",
            "protocolType": "https"
        }
    ],
    "TargetFqdns": [
        "*.log.optimizely.com",
        "*.search.production.apac.trafficmanager.net",
        "*.search.production.emea.trafficmanager.net",
        "*.search.production.us.trafficmanager.net",
        "*.sharepoint.com",
        "*.sharepointonline.com",
        "*.svc.ms",
        "*-files.sharepoint.com",
        "*-myfiles.sharepoint.com",
        "admin.onedrive.com",
        "cdn.sharepointonline.com",
        "click.email.microsoftonline.com",
        "g.live.com",
        "officeclient.microsoft.com",
        "oneclient.sfx.ms",
        "privatecdn.sharepointonline.com",
        "prod.msocdn.com",
        "publiccdn.sharepointonline.com",
        "skydrive.wns.windows.com",
        "spoprod-a.akamaihd.net",
        "ssw.live.com",
        "static.sharepointonline.com",
        "storage.live.com",
        "watson.telemetry.microsoft.com"
    ]
}

Now that you have generated the desired rules, you can simply copy and paste them into your Azure Firewall ARM template/parameter file under the applicationRuleCollections property.

In the ARM template I’m using to deploy Azure Firewall, I’m using a technique similar to the one I described here to manage Network Security Group rules (Azure VNET Subnet Network Security Group Rules Management in ARM Template), where I can have core rules that apply to all my Azure Firewall deployments as well as rules that only apply to a specific deployment. At deployment time, I simply concatenate the custom rules and the core rules together like this in the ARM template:

"applicationRuleCollections": "[concat(variables('net-azfw-rul-basic-application-rules'),parameters('net-azfw-rul-custom-application-rules'))]"

In the example above, the variable named net-azfw-rul-basic-application-rules contains the rules that will be present in all deployments, while the parameter named net-azfw-rul-custom-application-rules holds the rules that are specific to a particular deployment in an environment/region.
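
To give an idea of the shape being concatenated, here’s a minimal sketch of what such a variable could contain. The collection name and priority below are hypothetical, and the rules array is where the output of the function above gets pasted (trimmed here to a single FQDN for brevity):

"net-azfw-rul-basic-application-rules": [
    {
        "name": "net-azfw-rulcol-application-allow-core",
        "properties": {
            "priority": 200,
            "action": {
                "type": "Allow"
            },
            "rules": [
                {
                    "name": "net-azfw-rul-application-allow-http-ExchangeOnline",
                    "protocols": [
                        { "port": "80", "protocolType": "http" },
                        { "port": "443", "protocolType": "https" }
                    ],
                    "TargetFqdns": [ "*.outlook.com" ]
                }
            ]
        }
    }
]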

Once deployed, you would see something similar for your Azure Firewall deployment:

Should you have questions or comments regarding this post, feel free to leave a comment below!


Azure Firewall DNAT and Network Security Group Rules

While troubleshooting a particular DNAT rule implemented with Azure Firewall, we noticed the outside traffic was not reaching the targeted VM as intended.

In order to troubleshoot this, we first inspected the Azure Firewall logs in Azure Log Analytics to confirm the traffic was indeed hitting the proper NAT rule. Here’s the query that was used in Log Analytics to determine this:

AzureDiagnostics
| where Category == "AzureFirewallNetworkRule" and msg_s contains "DNAT"

Here’s what the result of the query looked like:

Once we were able to confirm the traffic was hitting the right rule, we then proceeded to loosen the NSG rules applied to the subnet for the VM and then used a network capture tool such as tcpdump to determine what source IP was showing up for the traffic that was DNATed by Azure Firewall. For example, here’s what the tcpdump looked like:

sudo tcpdump 'port <DNATed port number> and !host <my ip>'

Note that my internal IP is excluded so that I can see only the traffic coming from outside Azure for a particular DNATed port.

By using this, we were able to determine that the source IP seen was from the AzureFirewallSubnet range, but not exactly the internal IP shown in the portal. That is most likely due to the HA/load-balanced setup of Azure Firewall, which requires multiple instances to operate. Once we added the appropriate address to the NSG rule, the traffic flowed properly to the VM and the NSG was re-tightened, much to the pleasure of our security folks!
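
For reference, here’s a minimal sketch of adding such a rule with the AzureRM cmdlets. The NSG name, resource group, priority, destination port and the firewall subnet prefix are hypothetical placeholders for your own values:

#Allow DNATed traffic from the whole Azure Firewall subnet range rather than a single instance IP
$nsg = Get-AzureRmNetworkSecurityGroup -Name "net-nsg-snet-prd-app" -ResourceGroupName "rg-prd-net-generic"
$nsg | Add-AzureRmNetworkSecurityRuleConfig -Name "In_AzureFirewall_DNAT" `
    -Description "Allow DNATed traffic from the Azure Firewall subnet" `
    -Access Allow -Protocol Tcp -Direction Inbound -Priority 1300 `
    -SourceAddressPrefix "10.0.1.0/26" -SourcePortRange * `
    -DestinationAddressPrefix * -DestinationPortRange 3389 |
    Set-AzureRmNetworkSecurityGroup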

Newly Increased Azure Application Gateway Limits Gotcha

Microsoft recently greatly increased its limits for Azure Application Gateway. Those latest changes allow you to do more with your App Gateway deployment. Refer to the following article for all the details: Azure Application Gateway limits

I’ve hit a particular limit that was not clearly documented in the link above. In one deployment I was working on, I was at the limit of backend address pools, listeners and rules, but for some reason the deployment seemed never-ending, and the App Gateway instance ended up in a failed provisioning state after sitting in a deploying or updating state.

After discussing with support, we found out that an internal part of Azure called the infrastructure agent couldn’t handle the size of the configuration file to be deployed (limited to 128KB per setting), which in turn caused the Azure Network Regional Manager’s operation to time out. When that happens, the Application Gateway deployment appears to never end but eventually fails, and the instance ends up in a failed provisioning state as mentioned earlier. In my case, the ARM template parameter file alone was over 395KB, but that size is not exactly representative of the issue; the problem seemed to be related to the size of the requestRoutingRules setting, which was sitting at around 220KB by itself.

What I did at this point to lower the size of the ARM template configuration was to combine the backend address pools containing the same servers into one address pool, instead of having one per web application hosted behind Application Gateway. The original idea was to be able to modify the backend address pool of an application to add/remove servers as needed without touching other applications. While combining backend address pools moves away from that principle, it does have the advantage of simplifying the configuration a bit, which in turn reduces the size of the configuration file passed to Azure. By applying this change and updating all the request routing rules, I was able to reduce the template parameter file from ~395KB to ~340KB. Of particular interest, the requestRoutingRules section was now under the 128KB mark at ~120KB, simply by shortening the names of the backend address pools specified in the rules.
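
For illustration, a shared backend pool along those lines might look like this in the template; the pool name and IP addresses are hypothetical, and multiple requestRoutingRules would then reference this single pool:

{
    "name": "pool-web-shared",
    "properties": {
        "backendAddresses": [
            { "ipAddress": "192.168.9.10" },
            { "ipAddress": "192.168.9.11" }
        ]
    }
}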

With that rationalization applied to the ARM template parameter file, the deployment went through successfully. With that limit in mind, all the little things matter, such as keeping names concise, as that helps keep the size under the limit. I’m still in touch with Microsoft Support to make sure that information makes it into the service limits documentation.

Azure Networking – Identifying Flows Blocked by NSGs using Traffic Analytics data in Log Analytics

When working on identifying flows that should be allowed by your Network Security Groups in Azure, a great tool you can leverage is Azure Traffic Analytics data stored in Log Analytics. In supported regions, you can send NSG flow logs into Azure Log Analytics where you can run queries to help you identify legitimate flows you might be blocking in your network.

Here are the high level steps to get this going in your environment:

  1. Enable Network Watcher
  2. Enable flow logging and Traffic Analytics for your Network Security Groups
    1. Store flow logs in a storage account AND Azure Log Analytics
  3. Query Flow Logs in Azure Log Analytics (…and complement with flow logs stored in Azure blob storage)

Enable Network Watcher

I highly recommend you enable Network Watcher in each region. Like everything in Azure, there are multiple ways of achieving this: via an Azure Resource Manager template, PowerShell, the Azure Portal, etc. I personally do this in the ARM template responsible for deploying core network resources (VNET, subnet NSGs); that way I’m making sure it’s there in every new region I’m setting up. Here’s what the ARM template piece to do this looks like:

{
    "type": "Microsoft.Network/networkWatchers",
    "name": "[concat('net-watcher-',parameters('environmentCode'),'-',variables('regionFullCode'))]",
    "apiVersion": "2018-07-01",
    "location": "[parameters('vnetLocation')]",
    "tags": {
        "environmentCode": "[parameters('environmentCode')]",
        "serviceCode": "net",
        "deploymentARMTemplateVersion": "[parameters('deploymentARMTemplateVersion')]",
        "deploymentARMTemplateParameterFileVersion": "[parameters('deploymentARMTemplateParameterFileVersion')]",
        "deploymentDateTime": "[parameters('deploymentDateTime')]"
    },
    "properties": {},
    "dependsOn": []
}

If you remove the resource tags portion, you can see it’s pretty straightforward to enable Network Watcher!
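
If you prefer PowerShell, a one-liner along these lines does the same thing; the name, resource group and location are hypothetical:

#Enable Network Watcher in a region by creating the networkWatchers resource
New-AzureRmNetworkWatcher -Name "net-watcher-prd-ca-ce1" -ResourceGroupName "rg-prd-net-generic" -Location "canadacentral"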

Enable flow logging for your Network Security Groups

I initially tried to enable Flow Logging and Traffic Analytics directly via Azure Resource Manager, but it didn’t work. I engaged with the Azure Networking folks, who kindly explained that enabling flow logging and traffic analytics is done as a post-deployment step, something that’s not supported by ARM templates as it would make the template non-idempotent.

For the time being, I’m using PowerShell to do this. Here’s what the code looks like to enable flow logging and Traffic Analytics on the NSGs in a specific region (Inspired by this post from Alexandre Verkinderen and adapted/enhanced for my environment):

Function Set-GEMAzureNSGFlowLogging([string]$companyCode="ge",
                                    [string]$businessUnitCode,
                                    [string]$tenantEnvironmentCode,
                                    [string]$countryCode,
                                    [string]$regionCode,
                                    [string]$regionId,
                                    [string]$environmentCode,
                                    [string]$logAnalyticsCountryCode,
                                    [string]$logAnalyticsRegionCode,
                                    [string]$logAnalyticsRegionId)
{
    Import-Module AzureRM.Resources
    Import-Module -Force .\libAzureResourceManager.psm1
    
    #Log in to the proper subscription
    Set-GEMAzureRMLogin -companyCode $companyCode `
    -businessUnitCode $businessUnitCode `
    -tenantEnvironmentCode $tenantEnvironmentCode `
    -environmentCode $environmentCode `
    -countryCode $countryCode `
    -regionCode $regionCode `
    -regionId $regionId
    

    #Initialize variables for storage accounts and log analytics workspace name to use
    $storageAccountResourceGroup = "rg-$environmentCode`-mon-diag"
    $StorageAccountLogs = "$companyCode$businessUnitCode$tenantEnvironmentCode$environmentCode`mondiag$countryCode$regionCode$regionId"
    $retentionperiod = 7
    $logAnalyticsWorkspaceName="$companyCode$businessUnitCode$tenantEnvironmentCode`-log-law-$environmentCode`-$logAnalyticsCountryCode`-$logAnalyticsRegionCode$logAnalyticsRegionId`-1"
    $logAnalyticsWorkspaceResourceGroup="rg-$environmentCode`-log-law"
    $logAnalyticsWorkspace = Get-AzureRmOperationalInsightsWorkspace -Name $logAnalyticsWorkspaceName -ResourceGroupName $logAnalyticsWorkspaceResourceGroup

    #Make sure the Microsoft.Insights resource provider is registered
    Register-AzureRmResourceProvider -ProviderNamespace Microsoft.Insights

    #Get proper storage account and Network Watcher references
    $storageAccount = Get-AzureRmStorageAccount -ResourceGroupName $storageAccountResourceGroup -Name $StorageAccountLogs
    $NWs = Get-AzurermNetworkWatcher -ResourceGroupName "rg-$environmentCode`-net-generic" -Name "net-watcher-$environmentCode`-$countryCode`-$regionCode$regionId"

    Foreach($NW in $NWs){

        $NWlocation = $NW.location
        write-host "Looping through $NWlocation" -ForegroundColor Yellow

        #region Enable NSG Flow Logs

        $nsgs = Get-AzureRmNetworkSecurityGroup | Where-Object {$_.Location -eq $NWlocation}

        Foreach($nsg in $nsgs)
        {
            #Get-AzureRmNetworkWatcherFlowLogStatus -NetworkWatcher $NW -TargetResourceId $nsg.Id
            #Flow analytics not supported in Canada East
            if($countryCode -eq "ca" -and $regionCode -eq "ea" -and $regionId -eq "1")
            {
                Write-Host "Enabling Flow Logging only for"$nsg.Name
                Set-AzureRmNetworkWatcherConfigFlowLog -NetworkWatcher $NW -TargetResourceId $nsg.Id -StorageAccountId $storageAccount.Id -EnableFlowLog $true -EnableRetention $true -RetentionInDays $retentionperiod
            }
            else {
                Write-Host "Enabling Flow Logging and Traffic Analytics for"$nsg.Name
                Set-AzureRmNetworkWatcherConfigFlowLog -NetworkWatcher $NW -TargetResourceId $nsg.Id -StorageAccountId $storageAccount.Id -EnableFlowLog $true -EnableRetention $true -RetentionInDays $retentionperiod -EnableTrafficAnalytics -Workspace $logAnalyticsWorkspace
            }
            write-host "Diagnostics enabled for $($nsg.Name)" -BackgroundColor Green
        }

        #endregion
    }
    
}

You can verify that flow logging and Traffic Analytics is correctly enabled in the Azure Portal by going to the Flow Logs section in Network Watcher.
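
You can also check this from PowerShell; here’s a minimal sketch reusing the Network Watcher and NSG references from the function above:

#Confirm flow logging and Traffic Analytics are enabled for a given NSG
Get-AzureRmNetworkWatcherFlowLogStatus -NetworkWatcher $NW -TargetResourceId $nsg.Id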

Query Flow Logs in Azure Log Analytics

It may take a little while before the flow logs start showing up in the specified Azure Log Analytics workspace, but once the data is there, you can issue a query like the following to help you identify at a high level which flows are getting blocked.

AzureNetworkAnalytics_CL
| extend NSGRuleAction = split(NSGRules_s, '|', 3)[0]
| extend NSGRuleName = tostring(split(NSGRules_s, '|', 1)[0])
| where NSGRuleAction == "D" and VMIP_s contains "192.168"
| summarize count() by VM_s, VMIP_s, SrcIP_s, DestIP_s, DestPort_d

In the results above, a couple of computers required port 135 to communicate, which was blocked by a core deny-all NSG rule.

You may discover some chatty computers that shouldn’t be communicating with each other due to misconfiguration or actual legitimate flows that should not be blocked and are probably causing issues in your environment.

One thing that’s worth mentioning: I noticed in our environment that some of the flow records don’t always have the destination IP included (I’m talking with the Azure Networking folks about this as well). In that situation, I have to resort to looking at the flow log files stored in Azure Storage to determine the actual target IP being blocked. At the moment, I’ve only noticed this for outbound flows; source IPs of inbound flows seem to show up correctly.

Should you have any questions about this blog post, feel free to ask via the comments section.

Thanks for reading!

Azure VNET Subnet Network Security Group Rules Management in ARM Template

When building Azure Resource Manager templates, it’s often a challenge to keep your template generic enough so that it can be reused.

A VNET ARM template which leverages subnet Network Security Groups (NSGs) can be especially challenging in that regard, as you often need to specify IPs in your rules that are specific to a particular deployment. If you hardcode rules, you lose the reusability of your template, which somewhat offsets the benefit of having an ARM template in the first place.

In order to work around this, I opted to use the following strategy:

  • VNET ARM template
    • Include parameters that allow you to pass the custom rules you want
    • Ruleset construction logic
      • If you have rules that apply to a lot of subnet NSGs, create a variable to contain those
      • If you have basic rules that apply to a specific subnet NSG independently of the deployment, create a variable to contain those
      • When you specify the securityRules property of your subnet NSG, you can then combine any permutations of the custom, basic and subnet specific rules
  • Use parameter files
    • Build a parameter file per specific deployment
    • Supply the custom rules you want by passing them as an array of NSG rules

Using the strategy above allows you to have the following:

  • A lighter template
    • You can define common rules only once in your template instead of repeating those for each NSG; this removes quite a few lines out of your JSON file
  • Flexibility
    • To specify rules specific to a particular deployment
    • That pattern can be extended/simplified as needed depending on the scenario
      • i.e. if you only need generic rules, you don’t need to define subnet common rules and/or deployment-specific subnet rules; keep it simple!

Here’s an example of what that looks like in practice.

Parameter File Sample

As you can see below, the parameter named net-nsg-snet-edg-custom-rules includes a single NSG rule that will be passed to the ARM template. It includes sample IP addresses for source and destination.

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "net-nsg-snet-edg-custom-rules": {
            "value": [
                {
                    "name": "In_HTTP_HTTPS_AppGateway",
                    "properties": {
                        "description": "Allow HTTP/s traffic to Azure App Gateway",
                        "protocol": "*",
                        "sourcePortRange": "*",
                        "destinationAddressPrefix": "192.168.100.4",
                        "access": "Allow",
                        "priority": 1215,
                        "direction": "Inbound",
                        "destinationPortRanges": [
                            "80",
                            "443"
                        ],
                        "sourceAddressPrefixes": [
                            "192.168.1.0/24"
                        ],
                        "destinationAddressPrefixes": [],
                        "destinationPortRange": ""
                    }
                }
                
            ]
        }
    }
}

VNET ARM Template (simplified for the purpose of the discussion)

In the ARM template, you can see the following:

  • A parameter net-nsg-snet-edg-custom-rules defined as type array to allow the passing of one or more custom rules object
  • A variable named net-nsg-snet-basic-rules that defines NSG rules that apply to several subnets
  • A variable named net-nsg-snet-edg-basic-rules that would apply only to the subnet with “edg”
  • When the securityRules property is defined for the “edg” NSG, the concat function is used to combine the rules from the variables net-nsg-snet-basic-rules, net-nsg-snet-edg-basic-rules and the parameter net-nsg-snet-edg-custom-rules

You may have noticed a few other tricks that keep the template generic:

  • I’m passing network ranges that are reused throughout the template as parameters (i.e. onPremiseManagementRangePrefixes)
  • The naming conventions for the specific deployment are also built from parameters/variables. I didn’t include all the intricacies of this in order to simplify the ARM template example below.
  • When defining the subnet addressPrefix property, I’m using a couple of parameters and concatenate them to build the IP range
    • I take the baseAddressPrefix (i.e. 192.168)
    • I then calculate the particular segment using the vnetStartingPoint and then add a number to identify that range, i.e. if the starting point is 0 and that subnet is the 9th one, I add 9 to the starting point
    • I then add .0/24 (or whatever you want as a mask)
    • Combine all that together and you get 192.168.9.0/24
    • If you want to deploy another VNET using the same layout but with a different range, you can simply change the baseAddressPrefix and/or the vnetStartingPoint

{
    "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "baseAddressPrefix": {
            "type": "string"
        },
        "vnetStartingPoint": {
            "type": "int"
        },
        "vnetNetmask": {
            "type": "int"
        },
        "onPremiseManagementRangePrefixes": {
            "type": "array",
        },
        "dnsServerIP1": {
            "type": "string"
        },
        "dnsServerIP2": {
            "type": "string"
        },
        "deploySiteToSiteVPN": {
            "type": "bool",
            "defaultValue": "false"
        },
        "net-nsg-snet-edg-custom-rules": {
            "type": "array",
            "defaultValue": []
        }
    },
    "variables": {
        "net-nsg-snet-basic-rules": [
            {
                "name": "In_RDP_3389_CAN_She_IT_Management",
                "properties": {
                    "protocol": "*",
                    "sourcePortRange": "*",
                    "destinationPortRange": "3389",
                    "sourceAddressPrefixes": "[parameters('onPremiseManagementRangePrefixes')]",
                    "destinationAddressPrefix": "*",
                    "access": "Allow",
                    "priority": 1000,
                    "direction": "Inbound",
                    "destinationAddressPrefixes": []
                }
            },
            {
                "name": "In_WinRM_5986_CAN_She_IT_Management",
                "properties": {
                    "protocol": "Tcp",
                    "sourcePortRange": "*",
                    "destinationPortRange": "5986",
                    "sourceAddressPrefixes": "[parameters('onPremiseManagementRangePrefixes')]",
                    "destinationAddressPrefix": "*",
                    "access": "Allow",
                    "priority": 1100,
                    "direction": "Inbound",
                    "destinationAddressPrefixes": []
                }
            },
            {
                "name": "In_Deny_TCP",
                "properties": {
                    "protocol": "Tcp",
                    "sourcePortRange": "*",
                    "destinationPortRange": "*",
                    "sourceAddressPrefix": "*",
                    "destinationAddressPrefix": "*",
                    "access": "Deny",
                    "priority": 3000,
                    "direction": "Inbound",
                    "destinationAddressPrefixes": []
                }
            },
            {
                "name": "In_Deny_UDP",
                "properties": {
                    "protocol": "Udp",
                    "sourcePortRange": "*",
                    "destinationPortRange": "*",
                    "sourceAddressPrefix": "*",
                    "destinationAddressPrefix": "*",
                    "access": "Deny",
                    "priority": 3001,
                    "direction": "Inbound",
                    "destinationAddressPrefixes": []
                }
            },
            {
                "name": "In_Allow_ICMP_CAN_She_IT_Management",
                "properties": {
                    "protocol": "*",
                    "sourcePortRange": "*",
                    "destinationPortRange": "*",
                    "sourceAddressPrefixes": "[parameters('onPremiseManagementRangePrefixes')]",
                    "destinationAddressPrefix": "*",
                    "access": "Allow",
                    "priority": 3002,
                    "direction": "Inbound",
                    "destinationAddressPrefixes": []
                }
            },
            {
                "name": "[concat('In_Deny_VNET')]",
                "properties": {
                    "protocol": "*",
                    "sourcePortRange": "*",
                    "destinationPortRange": "*",
                    "destinationAddressPrefix": "*",
                    "access": "Deny",
                    "priority": 3003,
                    "direction": "Inbound",
                    "sourceAddressPrefix": "VirtualNetwork",
                    "destinationAddressPrefixes": []
                }
            }
        ],
        "net-nsg-snet-edg-basic-rules":[
            {
                "name": "In_AzureLBS_Any",
                "properties": {
                    "protocol": "*",
                    "sourcePortRange": "*",
                    "destinationPortRange": "*",
                    "sourceAddressPrefix": "AzureLoadBalancer",
                    "destinationAddressPrefix": "*",
                    "access": "Allow",
                    "priority": 1200,
                    "direction": "Inbound",
                    "destinationAddressPrefixes": []
                }
            },
            {
                "name": "In_Azure_HeathCheck",
                "properties": {
                    "protocol": "*",
                    "sourcePortRange": "*",
                    "destinationPortRange": "65503-65534",
                    "sourceAddressPrefix": "*",
                    "destinationAddressPrefix": "*",
                    "access": "Allow",
                    "priority": 1210,
                    "direction": "Inbound",
                    "destinationAddressPrefixes": []
                }
            }
        ]
    },
    "resources": [
        {
            "type": "Microsoft.Network/networkSecurityGroups",
            "name": "[concat('net-nsg-snet-',parameters('environmentCode'),'-edg-',variables('regionFullCode'))]",
            "apiVersion": "2017-06-01",
            "location": "[resourceGroup().location]",
            "scale": null,
            "tags": {
                "environmentCode": "[parameters('environmentCode')]",
                "serviceCode": "edg"
            },
            "properties": {
                "securityRules": "[concat(variables('net-nsg-snet-basic-rules'),variables('net-nsg-snet-edg-basic-rules'),parameters('net-nsg-snet-edg-custom-rules'))]",
                "defaultSecurityRules": []
            },
            "dependsOn": []
        },
        {
            "apiVersion": "2017-10-01",
            "name": "[variables('vnetName')]",
            "type": "Microsoft.Network/virtualNetworks",
            "location": "[resourceGroup().location]",
            "tags": {
                "environmentCode": "[parameters('environmentCode')]",
                "serviceCode": "net"
            },
            "properties": {
                "addressSpace": {
                    "addressPrefixes": [
                        "[concat(parameters('baseAddressPrefix'),'.',parameters('vnetStartingPoint'),'.0/',parameters('vnetNetmask'))]"
                    ]
                },
                "dhcpOptions": {
                    "dnsServers": [
                        "[concat(parameters('dnsServerIP1'))]",
                        "[concat(parameters('dnsServerIP2'))]"
                    ]
                },
                "subnets": [
                    {
                        "name": "[concat('net-snet-',parameters('environmentCode'),'-edg-1-',variables('regionFullCode'))]",
                        "properties": {
                            "addressPrefix": "[concat(parameters('baseAddressPrefix'),'.', add(parameters('vnetStartingPoint'),8),'.0/24')]",
                            "networkSecurityGroup": {
                                "id": "[resourceId('Microsoft.Network/networkSecurityGroups', concat('net-nsg-snet-',parameters('environmentCode'),'-edg-',variables('regionFullCode')))]"
                            }
                        }
                    }
                ],
                "enableDdosProtection": "[variables('enableDdosProtection')]"
            },
            "dependsOn": [
                "[resourceId('Microsoft.Network/networkSecurityGroups', concat('net-nsg-snet-',parameters('environmentCode'),'-edg-',variables('regionFullCode')))]"
            ]
        }
    ]
}

Dynamic VM Data Disks Creation in ARM Templates

I recently had to create an ARM template to build VMs in Azure and wanted to create the data disks dynamically. I first started with a simple copy operation within the ARM template, but that didn’t provide the flexibility I wanted.

If you use the copy capability in ARM templates, all your data disks will have the same attributes (size, name, etc.).

When you’re building VMs for workloads like SQL Server, you often want disks of various sizes for different purposes, for instance a disk for each database.

I then came up with a solution that allows you to specify the following per disk:

  • Size
  • Name with a custom tag

In the PowerShell wrapper function I use to create VMs from ARM templates, I added the following bit of code:

$allDataDisks=@()

#Build one array of disk definitions per VM (an array of arrays)
for ($j = 0; $j -lt $virtualMachineCount; $j++) {
    $vmDataDisks=@()
    for ($i = 0; $i -lt $dataDisks.Count; $i++) {
        $vmDataDisks+=@{caching="ReadWrite";
                        diskSizeGB=$dataDisks[$i].SizeInGB;
                        lun=$i;
                        name="SRV" + ($virtualMachineStartingId+$j).ToString() + "-dat-" + $dataDisks[$i].Tag;
                        managedDisk=@{storageAccountType=$dataDiskStorageAccountType};
                        createOption="Empty";
                        }
    }
    #Unary comma so $vmDataDisks is appended as a single nested element
    $allDataDisks+=,$vmDataDisks;
}

$templateParameters["dataDisks"]=$allDataDisks

The loop above creates an array within an array to support multiple VMs with each multiple disks. When $allDataDisks is passed to the ARM template, it gets properly serialized as JSON so it can be used by ARM.

One trick I learned while doing this was to use the “,” unary operator to properly create an array of arrays. If you simply use $allDataDisks+=$vmDataDisks, it will simply merge both arrays together, which is not what is needed to properly pass the data disks over to the ARM template. By using “+=,”, it will properly add the $vmDataDisks array object to the $allDataDisks array.
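
Here’s a quick way to see the difference in action:

$a = @()
$b = @(1,2)
$a += $b     # $a.Count is 2: the arrays were merged
$a = @()
$a += ,$b    # $a.Count is 1: $b was added as a single nested array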

In the ARM template, you can specify the dataDisks parameter as follows:

"dataDisks": {
            "type": "array"
        }

In the section where the data disks are actually specified in the ARM template, you would use the following:

 "dataDisks": "[parameters('dataDisks')[copyIndex()]]"

Note that in the example above there’s a reference to copyIndex, as that particular ARM template can create multiple VMs with the same configuration. As each VM has its own set of disks, you can simply reference the proper VM by specifying the corresponding index of the array being passed to the template.
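
To make that concrete, here’s a sketch of what the serialized parameter value could look like for a single VM with a single disk; the VM id (100) and the Premium_LRS storage account type are assumptions for the example:

"dataDisks": {
    "value": [
        [
            {
                "caching": "ReadWrite",
                "diskSizeGB": 128,
                "lun": 0,
                "name": "SRV100-dat-master",
                "managedDisk": { "storageAccountType": "Premium_LRS" },
                "createOption": "Empty"
            }
        ]
    ]
}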

You can then pass your disks to your PowerShell wrapper function as follows:

Deploy-VMSuperWrapper -dataDisks @(@{SizeInGB=128;Tag="master";},@{SizeInGB=256;Tag="tempdb"})

More flexibility could be provided by adding/passing additional disk properties.

I found a couple of caveats using this method:

  • You cannot add tags to the managed disks that get created
  • If you need to create multiple disks of the same size, i.e. if you want to create a Storage Spaces Direct cluster, you would need to specify each disk.
    • I’m thinking I would change my wrapper function to dynamically generate the $dataDisks parameter to simplify the call to something like -dataDiskCount x -dataDiskSizeInGB 1024; a rough sketch of that idea follows
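
Here’s a minimal sketch of what that generation could look like; the helper function and its parameter names are hypothetical:

#Generate identical data disk definitions from a count and a size
Function New-DataDiskSet([int]$dataDiskCount,[int]$dataDiskSizeInGB)
{
    $generatedDisks=@()
    for ($i = 1; $i -le $dataDiskCount; $i++) {
        $generatedDisks+=@{SizeInGB=$dataDiskSizeInGB;Tag="data$i"}
    }
    #Unary comma so the array survives the function boundary intact
    return ,$generatedDisks
}

Deploy-VMSuperWrapper -dataDisks (New-DataDiskSet -dataDiskCount 4 -dataDiskSizeInGB 1024)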

I might tackle those in another blog post! 😉


Azure VM Sizing – An Automated Approach

In the following post, I will try to explain the approach I’ve used to estimate the costs of running a few hundred VMs in Azure IaaS. As manually sizing each individual VM would take quite some time, I preferred to go with an automated approach.

At a high level, here are the steps I’ve taken to achieve this.

  1. Collect performance metrics for the VM candidates over a significant period of time
  2. Capture Azure VMs pricing information
  3. Capture Azure VMs characteristics
  4. Select appropriate VM size
  5. Calculate VM operating hours
  6. Determine VM pricing strategy
  7. Generate VM storage configuration

Collect Performance Metrics

For this step, I’ve opted to use the Hyper-V metering data I was already collecting for all of the VMs running on premises. Alternatively, one could also use data coming from perfmon, but that would take some extra data preparation steps to be usable in the VM sizing exercise. I’ve covered the basics of the scripts in this other blog post if you’re interested: Hyper-V Resource Metering and Cloud Costing Benchmarking

In this data set, I’m collecting a few data elements that are critical for roughly determining the VM size:

  • CPU Utilization (actual MHz consumed)
  • Memory Utilization
  • Total Disk IOPS
  • Total Disk Capacity

I could have opted to include network utilization but decided to keep it aside for the time being, as my workload is not network IO bound. Here’s what the raw data used for the sizing exercise looks like:

Capture Azure Pricing Information

For this part, I’m using a couple of price lists as the basis for the automated analysis.

Based on that data, I’ve extracted pricing information for all the VM sizes in the particular region I was interested in. I took the CSV files and loaded them into a SQL Server table in the same database containing all my Hyper-V metering data.

Capture Azure VMs characteristics

Once you have the basic pricing information loaded and processed, the next step is to capture the actual sizing information for the Azure VM sizes. To do so, I used the page Sizes for Windows virtual machines in Azure to capture the following key pieces of information about each VM configuration:

  • vCPU count
  • RAM
  • Maximum IOPS
  • Maximum number of disks

Here’s what the data looks like in the table, to give you an idea.

Select Appropriate VM Size

Now here comes the tricky bit of the process. At a high level, here’s how the Azure VM sizing logic works.

  1. Find a VM with enough vCPU to support the number of MHz the VM is currently consuming.
    1. This logic is crude at the moment as I’m strictly doing a conversion from MHz to number of cores independently of the actual CPU in Azure. I will work on tweaking this aspect in the future.
  2. Find a VM size with as much RAM as what’s being used on premises.
    1. In this particular case, I put an artificial cap of ~448GB as this is the largest VM size I can get in my targeted region
  3. Find a VM size that can accommodate the maximum number of IOPS
    1. In this particular case, I put an artificial cap of 80000 IOPS
  4. As there’s a mix of both Dev/Test and Production VMs, I’m filtering to get either Dev/Test subscription pricing or production pricing
  5. I also make sure I’m getting VM SKUs that don’t include Windows licenses in the price
  6. Of all the options that match what’s needed, sort them in ascending order of price and pick the cheapest one.
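
To illustrate the selection step, a query along those lines might look like this; the table and column names are hypothetical placeholders rather than my actual schema:

-- Pick the cheapest VM size that satisfies the compute, RAM and IOPS requirements
SELECT TOP (1) s.VMSize, p.HourlyPrice
FROM AzureVMSizes s
INNER JOIN AzureVMPrices p ON p.VMSize = s.VMSize
WHERE s.vCPUCount >= @RequiredCores
  AND s.RAMGB >= @RequiredRAMGB
  AND s.MaxIOPS >= @RequiredIOPS
  AND p.IncludesWindowsLicense = 0
ORDER BY p.HourlyPrice ASC;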

Calculate VM Operating Hours

An important step in cost sizing and optimizing your workload in Azure involves determining the VM operating hours, i.e. some VMs don’t need to be running 24/7, so why pay for those extra hours? In my case, I applied the following general assumptions in my logic. This can definitely be refined, but it gives a good idea.

  1. If it’s a Dev/Test VM, set the operating hours to 10 hours per day, 5 days a week
  2. If it’s a production VM, set it to 24/7, unless it’s an RDS session host, in which case I adjust the operating hours based on our actual user demand (i.e. more hosts at peak during the day, fewer during the night)
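
To put rough numbers on that, a Dev/Test VM under the first assumption runs about 10 × 5 × 52 = 2,600 hours per year instead of the 8,760 hours of a 24/7 VM, roughly a 70% reduction in compute hours.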

Determine VM Pricing Strategy

Now that you have the operating hours and the actual VM size in hand, you can determine whether it’s better to pay hourly or to get a reserved instance. If your VM is running 24/7, you definitely want to leverage the pricing of Azure Reserved Instances (RI). For the other cases, you have to evaluate whether the VM’s operating hours versus the discount level you get with an RI make it worthwhile.

Generate VM Storage Configuration

Another fun part of sizing your VM is determining what type of disks you will use and how many of them are required to support your workload. As there’s a wide range of options, finding the most cost-effective one can be tricky and time-consuming. As you can see below, there are quite a few permutations to consider!

Load the pricing of all the storage options (managed/unmanaged, standard/premium) in a table. In my case, here’s what I ended up doing; it’s not perfect and still needs some work, but it gives an idea!

  1. For each disk option
    1. Determine the number of disks required to reach capacity
    2. Determine the number of disks required to reach the required IOPS
  2. As you iterate through the disk options, keep the lowest priced option
  3. Discard options where the disk count required is higher than the VM size selected in the previous steps
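
In other words, for each disk option the number of disks needed is driven by whichever constraint is larger. A sketch of that calculation, with hypothetical table and column names:

-- Disks needed for a given option is the larger of the capacity and IOPS requirements
SELECT DiskOption,
       CASE WHEN CEILING(@RequiredCapacityGB / CAST(SizeGB AS FLOAT)) >= CEILING(@RequiredIOPS / CAST(MaxIOPS AS FLOAT))
            THEN CEILING(@RequiredCapacityGB / CAST(SizeGB AS FLOAT))
            ELSE CEILING(@RequiredIOPS / CAST(MaxIOPS AS FLOAT))
       END AS DisksNeeded,
       MonthlyPricePerDisk
FROM AzureDiskOptions;

Multiplying DisksNeeded by the per-disk price and keeping the cheapest row then gives the storage configuration.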

The Result

Now for the moment of truth/“did it blend” moment! Here’s a sample output of that process, straight from the SQL query:

Now that I can get that output from SQL Server, I can load it up in Excel to do all sorts of fancy charts and tables and PivotTable the heck out of that data for further analysis.

Bonus

À la Apple, there’s one more thing I omitted to mention. In that exercise, I wanted to compare costs with AWS out of due diligence. To achieve this, I went through the list of Azure VM sizes and manually found the closest size equivalent in Amazon EC2. So when I’m picking the size of an Azure VM, I also pick an equivalent size at Amazon for that VM, along with its pricing information. The same type of pricing logic is applied to keep things as fair as possible between the two options. I have yet to tackle the VM storage sizing piece on the AWS side; that’s one of my next steps.

I’m also attempting to compare costs with our on-premises infrastructure, but that involves a whole separate set of calculations that I will not cover in this version of the article. Just be aware that it’s feasible if you roll up your sleeves a bit. In the end, you can have a nice-looking chart comparing On-Premises/Azure/AWS/etc.!

Caveats/Disclaimer

Needless to say, this is by no means a perfect sizing methodology. It’s still very rough around the edges but should give a pricing ballpark. The goal is to have an iterative approach in order to appropriately size your workload for execution in Azure. You may find that some workloads are just not good candidates at all depending on your requirements. There are a LOT of variables to consider when sizing a VM, and not all of them were considered in the current iteration of the process. I’ll keep adding those to improve/optimize the costing model.

Right now my process works for VMs for which I have Hyper-V metering statistics, but it wouldn’t be too difficult to extend it to include future/hypothetical VMs as well. One would simply have to throw the simulation data in another table and process it using the same logic, which in my case is a T-SQL table-valued function. Here’s what the actual query I’m using in Excel looks like, to give you a feel for it:

select vmnhs.ClusterName,
       vmnhs.VMName,
       vmnhs.PrimaryEnvironment,
       vmnhs.PrimarySystemName,
       cvi.*,
       cvi.VMYearlyOperatingHours*AzureVMHourlyPrice AS AzureYearlyCost,
       cvi.VMYearlyOperatingHours*AzureVMHourlyPrice + cvi.AzureVMYearlyStoragePrice AS AzureTotalYearlyCost,
       cvi.VMYearlyOperatingHours*AWSVMHourlyPrice AS AWSYearlyCost,
       vmnhs.AverageIOPS,
       vmnhs.MaximumIOPS,
       vmnhs.MaximumEstimatedCores,
       vmnhs.MaximumEstimatedRAMGB,
       vmnhs.MaximumTotalDiskAllocation
from (select vmnhs.ClusterName,
             vm.VMName,
             vm.PrimaryEnvironment,
             vm.PrimarySystemName,
             MAX(vmnhs.MaximumMemoryUsage) AS MaximumMemoryUsage,
             MAX(vmnhs.MaximumProcessorUsage) AS MaximumProcessorUsage,
             MAX(vmnhs.MaximumAggregatedAverageNormalizedIOPS) AS MaximumIOPS,
             AVG(vmnhs.AverageAggregatedAverageNormalizedIOPS) AS AverageIOPS,
             MAX(vmnhs.MaximumTotalDiskAllocation) AS MaximumTotalDiskAllocation,
             CEILING(((MAX(CAST(vmnhs.MaximumProcessorUsage AS FLOAT)))/2600)) AS MaximumEstimatedCores,
             CEILING(((MAX(CAST(vmnhs.MaximumMemoryUsage AS FLOAT)))/1024)) AS MaximumEstimatedRAMGB,
             MAX(SampleTime) AS SampleTime
      from [dbo].[VirtualMachineNormalizedHourlyStatistics] vmnhs
           INNER JOIN VirtualMachines vm ON vmnhs.VMName=vm.vmname AND vmnhs.ClusterName = vm.ClusterName
      where SampleTime > '2017-11-01'
            AND vm.PrimarySystemName NOT IN ('Microsoft Remote Desktop Virtual Desktop Infrastructure')
      GROUP BY vmnhs.ClusterName,vm.VMName,vm.PrimaryEnvironment,vm.PrimarySystemName) AS vmnhs
CROSS APPLY dbo.getCloudVMSizingInformation(vmnhs.VMName,'Microsoft',vmnhs.ClusterName,vmnhs.ClusterName,vmnhs.MaximumMemoryUsage,vmnhs.MaximumProcessorUsage,vmnhs.SampleTime,vmnhs.PrimaryEnvironment,vmnhs.PrimarySystemName,vmnhs.AverageIOPS*1.25,vmnhs.MaximumTotalDiskAllocation) cvi

I’d like to package this better so that I can share the sizer with the rest of the community. When things stabilize a bit with the sizer, I’ll definitely work on that.

If you have questions/comments about this blog post, feel free to comment below!