Visualize SentryOne Data Using Elasticsearch and Kibana

I’ve recently worked on consolidating a lot of log data into Elasticsearch and visualizing it using Kibana. One of the sources I wanted to see in Kibana was SQL Sentry (now called SentryOne), to view SQL query performance issues and failures.

In order to achieve this, I built a few PowerShell functions in a new module called libElasticsearch.psm1 in the GEM Automation project. Here’s the process that’s involved at a high level:

  1. Run query against the SQL Sentry database to collect SQL query events incrementally
  2. Create bulk indexing calls to elasticsearch by batching the SQL result set using the bulk API format
  3. Persist the last time the data was collected to disk for use in the next iteration

Before sending data to elasticsearch, it’s typically better to define the index/object type. This will allow elasticsearch to properly identify the content of each field and index it accordingly. For the SQL Sentry data that I’m capturing, the index definition is the following:

{
      "mappings": {
         "sqlsentryevent": {
            "properties": {
              "eventype": {
                "type": "keyword"
              },
              "normalizedtextdata": {
                "type": "keyword"
              },
              "database": {
                "type": "keyword"
              },
              "clientcomputer": {
                "type": "keyword"
              },
              "application": {
                "type": "keyword"
              },
              "username": {
                "type": "keyword"
              },
              "duration": {
                "type": "integer"
              },
              "cpu": {
                "type": "integer"
              },
              "read": {
                "type": "integer"
              },
              "writes": {
                "type": "integer"
              },
              "runstatus": {
                "type": "integer"
              },
              "eventtime": {
                "type": "date",
                "format": "basic_date_time",
                "null_value": "19500101T000000.000Z"
              },
              "querytype": {
                "type": "keyword"
              },
              "servername": {
                "type": "keyword"
              }
            }
         }
      }
}

You can find the definition file in SQLSentryEvents.json

To load this definition in elasticsearch, you can use the following PowerShell code:

Import-Module -Force .\libElasticsearch.psm1

Add-ElasticsearchIndex -serverName <elasticsearch Host Name> -indexName sqlsentryevents -indexDefinition (Get-Content -Path .\IndexDefinitions\SQLSentryEvents.json)
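
Under the hood, creating the index is essentially a single REST call, so if you’d rather not use the module for this step, here’s a minimal sketch of the equivalent call with Invoke-RestMethod. The host name is a placeholder and it assumes Elasticsearch is listening on the default port 9200 without authentication:

# Create the index and its mapping directly through the Elasticsearch REST API.
$elasticsearchServer = "elasticsearch01"   # placeholder host name
$mapping = Get-Content -Path .\IndexDefinitions\SQLSentryEvents.json -Raw

Invoke-RestMethod -Method Put -Uri "http://$($elasticsearchServer):9200/sqlsentryevents" -ContentType "application/json" -Body $mapping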

To capture the SQL Sentry SQL query event data, I wrote the following query:

SELECT tl.RemoteObjectID AS EventType,
th.NormalizedTextData,
tl.[Database],
tl.Computer AS ClientComputer,
tl.Application,
tl.Operator AS UserName,
(tl.Duration/10000000) AS Duration,
tl.CPU,
tl.Reads,
tl.Writes,
tl.RunStatus,
FORMAT(DATEADD(mi, DATEDIFF(mi,GETDATE() ,GETUTCDATE()), tl.StartTime),'yyyyMMdd')+'T'+FORMAT(DATEADD(mi, DATEDIFF(mi, GETDATE(),GETUTCDATE() ), tl.StartTime),'HHmmss')+'.000Z' AS EventTime,
CASE
WHEN th.NormalizedTextData LIKE '%trace_getdata%' THEN 'Monitoring'
WHEN th.NormalizedTextData LIKE '%BACKUP%' THEN 'Maintenance'
WHEN th.NormalizedTextData LIKE '%STATISTICS%' THEN 'Maintenance'
WHEN th.NormalizedTextData LIKE '%INDEX%' THEN 'Maintenance'
WHEN th.NormalizedTextData LIKE '%updatestats%' THEN 'Maintenance'
WHEN th.NormalizedTextData LIKE '%sys.%' THEN 'Monitoring'
WHEN th.NormalizedTextData LIKE '%repl%' THEN 'Replication'
WHEN th.NormalizedTextData LIKE '%sp_server_diagnostics%' THEN 'Monitoring'
WHEN th.NormalizedTextData LIKE '%sp_readrequest%' THEN 'Replication'
WHEN th.NormalizedTextData LIKE '%sp_MSdistribution%' THEN 'Replication'
WHEN th.NormalizedTextData LIKE '%syncobj_%' THEN 'Replication'
WHEN th.NormalizedTextData LIKE '%waitfor delay @waittime%' THEN 'CDC'
ELSE 'Application Query'
END AS QueryType,
esc.ObjectName AS ServerName
FROM dbo.vwMetaHistorySqlServerTraceLog (nolock) tl
INNER JOIN dbo.PerformanceAnalysisTraceHash (nolock) th ON tl.NormalizedTextMD5 = th.NormalizedTextMD5
INNER JOIN EventSource (nolock) es ON tl.EventSourceId = es.ObjectId
INNER JOIN EventSourceConnection (nolock) esc ON es.EventSourceConnectionID = esc.ObjectId
WHERE (esc.ObjectName LIKE 'SRV%') AND
tl.StartTime >= @EventStartTime AND
tl.StartTime < @EventEndTime

I can now call the function that executes the query and converts the results to the bulk API format. Note that you can use this function to send pretty much any SQL Server query result set to Elasticsearch. At this point in time the function is very crude but functional.

Import-Module -Force .\libElasticsearch.psm1

Add-SQLQueryResultAsDocumentBatch -sqlServerName <SQL Server instance with SQLSentry DB> -databaseName SQLSentry -query (Get-Content .\SQLQueries\SQLSentryEvents.sql) -queryParameters @{EventStartTime=(Get-Content .\Checkpoints\<SQL Server instance with SQLSentry DB>_sqlsentryevents_checkpoint.txt);EventEndTime=Get-Date -Format "yyyy-MM-dd HH:mm:ss"} -batchSize 1000 -elasticsearchServerName <elasticsearch Host Name> -indexName sqlsentryevents -documentType sqlsentryevent -trackCheckPoint $true -checkPointParameterName "EventEndTime"

Here are the parameters for the function:
– sqlServerName: The name of the SQL Server instance you wish to connect to in order to execute the query
– databaseName: The name of the database against which the query will be run
– query: The text of the query you wish to run. As this can sometimes be quite a long string, I’m putting it in a file (SQLSentryEvents.sql in this case) and loading it using Get-Content
– queryParameters: A hashtable of the parameters required to execute the query. For instance, the above query contains two T-SQL parameters, @EventStartTime and @EventEndTime. Those are used to apply the moving window that loads the events incrementally. Another trick here is that I get the last time the job was run by reading a checkpoint text file that contains this information.
– batchSize: The maximum number of items that will be sent in one bulk API call to elasticsearch
– indexName: The name of the index to which the documents/records will be sent
– documentType: The type of the documents that will be sent to the index. This is the type that was defined in JSON earlier.
– trackCheckPoint: A flag to make sure the last loading point is kept so that incremental loads are possible.
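
For reference, here’s roughly what one batch generated by the function looks like before it’s sent to Elasticsearch: newline-delimited JSON where each document line is preceded by an action line naming the target index and document type (the field values below are made up for illustration):

{ "index" : { "_index" : "sqlsentryevents", "_type" : "sqlsentryevent" } }
{ "database" : "SalesDB", "application" : "MyApp", "duration" : 12, "cpu" : 8, "writes" : 3, "eventtime" : "20170101T120000.000Z", "querytype" : "Application Query", "servername" : "SRV01" }
{ "index" : { "_index" : "sqlsentryevents", "_type" : "sqlsentryevent" } }
{ "database" : "msdb", "application" : "SQLAgent", "duration" : 3, "cpu" : 1, "writes" : 0, "eventtime" : "20170101T120005.000Z", "querytype" : "Maintenance", "servername" : "SRV02" }

Each batch of up to batchSize rows is then POSTed to http://<elasticsearch Host Name>:9200/_bulk.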

You can then schedule that call, using Windows Task Scheduler for instance, and have it run at regular intervals.
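
As a quick, hedged example (task name, schedule and script path are all placeholders), the recurring task could be registered with the ScheduledTasks module like this:

# Register a task that runs a wrapper script around the collection call every 15 minutes.
$action  = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -File C:\Scripts\Send-SQLSentryEvents.ps1"   # hypothetical wrapper script
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 15)
Register-ScheduledTask -TaskName "SQLSentryToElasticsearch" -Action $action -Trigger $trigger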

With your data now loaded, you can start visualizing it using Kibana.

In order to do this, you do the following:

– Create saved searches using the Discover section using the filtering options you want.
– Build visualizations (these are the widgets that will compose your dashboard) based on the saved searches, or just use the whole index as the target
– Build your dashboard using the visualizations you just built

Here’s an example of a dashboard I did:

[Image: SQL Sentry Kibana dashboard]

Also note that you can still view the raw data if you need to while you explore your data:

[Image: SQL Sentry event details in Kibana]

That’s it! I hope this will be useful to you. Should you have any questions regarding this, feel free to drop me a comment below.

PowerShell Performance Tip – Concatenating Strings Using StringBuilder

While doing performance tuning on a script that did a lot of string concatenation, I switched from traditional string concatenation using += to the .NET class System.Text.StringBuilder.

Here’s a comparison of the speed of both methods:

measure-command {$body=New-Object -TypeName System.Text.StringBuilder;for($i=0;$i -lt 100000;$i++){$null=$body.Append("test")}}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 162
Ticks             : 1620291
TotalDays         : 1.87533680555556E-06
TotalHours        : 4.50080833333333E-05
TotalMinutes      : 0.002700485
TotalSeconds      : 0.1620291
TotalMilliseconds : 162.0291

measure-command {$body="";for($i=0;$i -lt 100000;$i++){$body+="test"}}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 48
Milliseconds      : 584
Ticks             : 485840018
TotalDays         : 0.000562314835648148
TotalHours        : 0.0134955560555556
TotalMinutes      : 0.809733363333333
TotalSeconds      : 48.5840018
TotalMilliseconds : 48584.0018
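
One small detail worth noting if you switch to StringBuilder: the text is accumulated inside the object, so once the loop is done you get the final string back by calling ToString():

$finalString = $body.ToString()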

PowerShell Performance Tip – Sending to Null

While performance tuning a PowerShell function, I noticed that piping to Out-Null is much slower than assigning to the $null variable. Here’s a comparison of both options:

measure-command {for($i=0;$i -lt 100000;$i++){"Hello Null!" | Out-Null}}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 3
Milliseconds      : 931
Ticks             : 39314263
TotalDays         : 4.5502619212963E-05
TotalHours        : 0.00109206286111111
TotalMinutes      : 0.0655237716666667
TotalSeconds      : 3.9314263
TotalMilliseconds : 3931.4263

measure-command {for($i=0;$i -lt 100000;$i++){$null="Hello Null!"}}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 113
Ticks             : 1139364
TotalDays         : 1.31870833333333E-06
TotalHours        : 3.1649E-05
TotalMinutes      : 0.00189894
TotalSeconds      : 0.1139364
TotalMilliseconds : 113.9364
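
Another discard pattern that performs in the same ballpark as the $null assignment, since it also avoids the pipeline, is casting the expression to [void], for example:

[void]("Hello Null!")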

Identify Assemblies and Applications .NET Runtime and Framework Versions

As .NET Framework versions go out of support, you may need to identify which applications need to be recompiled/retargeted against a newer version of the .NET Framework. One method I found was to use Mono.Cecil to gather .NET assembly information, such as the required runtime as well as the .NET Framework version targeted. Note that the Framework version is only available for assemblies using the 4.0 runtime and up.

For example, you can then do the following to capture information for multiple assemblies in IIS sites running .NET code:

(ls \\contoso001\c$\inetpub\wwwroot\*\*\bin\*.dll).FullName | .\Get-FrameworkVersion.ps1 | Export-Csv AssembliesVersions.csv -NoTypeInformation -Append

You can find the script here:

http://gemautomation.codeplex.com/SourceControl/latest#Utilities/Get-AssemblyFrameworkVersion.ps1
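
The heart of that script boils down to loading each assembly with Mono.Cecil and reading its runtime and, when present, its TargetFrameworkAttribute. Here’s a rough sketch of that part (the Mono.Cecil.dll and assembly paths are placeholders, not the actual script):

# Load the Mono.Cecil assembly shipped with the release (path is a placeholder).
Add-Type -Path .\Mono.Cecil.dll

# Read the assembly metadata without loading it into the current AppDomain.
$module = [Mono.Cecil.ModuleDefinition]::ReadModule("\\contoso001\c$\inetpub\wwwroot\Site\bin\App.dll")

# CLR runtime targeted by the assembly (e.g. Net_2_0, Net_4_0).
$module.Runtime

# .NET Framework version, only present on assemblies targeting the 4.0 runtime and up.
$module.Assembly.CustomAttributes |
    Where-Object { $_.AttributeType.Name -eq "TargetFrameworkAttribute" } |
    ForEach-Object { $_.ConstructorArguments[0].Value }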

The new script and the required assembly are part of the new GEM Automation 3.10.0.0 release.

Troubleshooting PowerShell Memory Leaks with Visual Studio

I was recently investigating a couple of memory leak issues with long-running scripts that capture and process monitoring metrics.

Since those leaks built up over a fairly long period of time, it was a bit difficult to troubleshoot by simply running the script in PowerShell ISE. Another thing that made the diagnostic process difficult is that PowerShell jobs were used to collect and process data, which makes it a bit hard to set breakpoints in the code to see what gets stuck in memory; PowerShell jobs essentially spawn another PowerShell process under which your code runs.

I started to see what I could do using Visual Studio. What I ended up doing is the following:

1) Let your PowerShell script run for a little while in order to get into a “bad enough” state

2) Use Process Explorer to create a full dump of the PowerShell process:

[Image: Process Explorer - Create Full Dump]

You can also generate that dump file directly from Visual Studio by attaching the debugger to the running process. If the process is running on a remote host, you will need to run the Remote Debugger that’s appropriate for your platform. It’s usually located in C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\Remote Debugger on the machine that has Visual Studio installed. For example, for an x64 remote host, you would simply run the following as Administrator:

\\<vs computer>\c$\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\Remote Debugger\x64\msvsmon.exe

Once Remote Debugger is running, you can simply attach to the desired process in Visual Studio:

[Image: Visual Studio - Attach to Process]

and then pick the PID of the PowerShell instance you want to debug:

[Image: Visual Studio - Attach to Process - select process PID]

Once Visual Studio is attached, break the execution of the process:

[Image: Visual Studio - Break All]

Once the process execution has been suspended, you can then create the dump file using Save Dump As (make sure you select to save the heap):

[Image: Visual Studio - Save Dump As]

3) Open the resulting dump file in Visual Studio and select Debug Managed Memory

[Image: Visual Studio - Debug Managed Memory]

4) You can then sort by Size (bytes) to figure out which object type ends up using the most memory

[Image: Visual Studio - sort by Size (Bytes)]

5) You can then view the instances of that object type by selecting View Instances

[Image: Visual Studio - View Instances]

6) Once you are in the instances view, you can select one of the instances, right click on it and select Quick Watch

[Image: Visual Studio - Quick Watch]

7) You can now see the content of the object that’s causing the issue, giving you a better reference point in your code to fix the issue

[Image: Visual Studio - Quick Watch view]

In my case, it turned out that the output of Write-Output at the end of a particular function used in my pipeline stayed in memory indefinitely, even though the object was done being processed by the pipeline. For instance:

Write-Output -InputObject $body"sent"

After piping the output to Out-Null in the high-level script that runs as a non-stop scheduled task, the problem was resolved.
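
In other words, the fix was simply discarding the output at the top level. Here’s a minimal sketch of the pattern, with hypothetical function names standing in for the real collection pipeline:

# Top-level loop of the scheduled-task script; discarding the pipeline output
# keeps it from accumulating in memory for the lifetime of the process.
while ($true)
{
    Get-MonitoringMetric | Send-MonitoringMetric | Out-Null   # hypothetical functions
    Start-Sleep -Seconds 60
}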

Another thing worth mentioning: it’s a bit hard to dive into the dump file of a PowerShell process, as it can be difficult to trace execution back to a particular function because the PowerShell script is instantiated as dynamic code in .NET.

Hopefully this post will help someone in the same situation as I was in!

Storage Spaces Performance Troubleshooting

Over the last few years, well, since Storage Spaces came to light, I have had to investigate performance issues from time to time. The goal of this post is to show you what you might typically look at while troubleshooting performance with Storage Spaces in a scale-out file server setup. I’ll refine this post over time as there is quite a bit to look at, but it should already give you a good start.

Let’s use a top down approach to troubleshoot an issue by looking at the following statistics in order:

  1. SMB Share
  2. Cluster Shared Volume
  3. If you are using an SSD write back cache, Storage Spaces Write Cache
  4. If you are using a tiered volume, Storage Spaces Tier
  5. Individual SSD/HDD

So the logic is: find which share/CSV is not performing, then start diving into its components by looking at the write cache, the tiers and finally the physical disks composing the problematic volume.

Let’s now dive in each of the high level categories.

SMB Share
  • Under the SMB Share Perfmon category:
    • Avg. Bytes/Read and Avg. Bytes/Write
      • What is the typical IO size on that share?
    • Avg. Sec/Read and Avg. Sec/Write
      • Is that share showing signs of high latency during IO?
    • Read Bytes/sec and Write Bytes/sec
      • How much data is flowing back and forth on that share?
Cluster Shared Volume

In order to capture the physical disk statistics for the particular CSV you are after, you will need to look up the Disk Number of the CSV in the Failover Clustering console. Once you have that, you can look at the following:

  • Physical Disk category
    • Avg. Disk Bytes/Read and Avg. Disk Bytes/Write
      • What is the typical IO size on that disk?
    • Avg. Disk Sec/Read and Avg. Disk Sec/Write
      • Is that disk showing signs of high latency during IO?
    • Disk Read Bytes/sec and Disk Write Bytes/sec
      • How much is being read/written per second on that disk? Is the volume performing as per the baseline already established?

You can also look at CSV specific counters such as:

  • Cluster CSV Volume Manager
    • IO Read Bytes/sec – Redirected/IO Write Bytes/sec – Redirected
      • Is there a lot of redirected IO happening?
    • IO Read – Bytes/sec/IO Write – Bytes/sec
      • How much data is being read/written per second on that CSV?

Microsoft also published a good article about CSV performance monitoring that you should look at: Cluster Shared Volume Performance Counter
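
If you prefer collecting these counters from PowerShell rather than the Perfmon UI, Get-Counter works as well. Here’s a small sketch, assuming the owner node is named SOFS-NODE01 (a placeholder); use Get-Counter -ListSet to confirm the exact counter paths available on your nodes:

# Sample the physical disk counters every 5 seconds for one minute on the owner node.
$counters = "\PhysicalDisk(*)\Avg. Disk sec/Read",
            "\PhysicalDisk(*)\Avg. Disk sec/Write",
            "\PhysicalDisk(*)\Disk Read Bytes/sec",
            "\PhysicalDisk(*)\Disk Write Bytes/sec"

Get-Counter -ComputerName SOFS-NODE01 -Counter $counters -SampleInterval 5 -MaxSamples 12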

Write Back Cache
  • Storage Spaces Write Cache category
    • Cache Write Bytes/sec
      • How much data is being written to the cache?
    • Cache Overwrite Bytes/sec
      • Is data being overwritten prior to being flushed to its final HDD destination?
Storage Spaces Tier
  • Storage Spaces Tier category
    • Avg. Tier Bytes/Read and Avg. Tier Bytes/Write
      • Is the size of the IO on each tier aligned with the physical sector size of the drives composing it?
    • Avg. Tier Sec/Read and Avg. Tier Sec/Write
      • Are your HDDs thrashing? Are your SSDs experiencing high latency spikes because of background operations on the drive (e.g. garbage collection)?
    • Tier Read Bytes/sec and Tier Write Bytes/sec
      • Is each tier providing the expected throughput?
Individual SSD/HDD

To see some basic latency and reliability metrics for your physical disks, you can use the following PowerShell command:

Get-VirtualDisk -FriendlyName VirtualDisk01 | Get-PhysicalDisk | Get-StorageReliabilityCounter | Select DeviceId,FlushLatencyMax,ReadLatencyMax,WriteLatencyMax,ReadErrorsCorrected,WriteErrorsCorrected,ReadErrorsTotal,WriteErrorsTotal

Just by looking at the statistics from the command above, you can get a feel for whether a drive is misbehaving, either from the errors/errors corrected or simply from the read/write latency metrics. For the latency metrics, note that those are the maximums since the last reset or reboot. If you want to get a feel for how they perform over time or under a specific load, I would recommend you use the Physical Disk metrics in Perfmon. You can use the same Physical Disk counters as what we used for the CSV, namely:

  • Physical Disk category
    • Avg. Disk Bytes/Read and Avg. Disk Bytes/Write
    • Avg. Disk Sec/Read and Avg. Disk Sec/Write
    • Disk Read Bytes/sec and Disk Write Bytes/sec

In order to match the disk counter instance found in Perfmon with the StorageReliability counters from the PowerShell command above, simply use the number in the DeviceId column of the output. Make sure you connect Perfmon to the same cluster node as where you ran the PowerShell command, as the disk numbers/DeviceIds sometimes don’t match exactly between nodes.
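
For example (DeviceId 5 and the node name are hypothetical here), once you know the DeviceId you can restrict the collection to that single disk instance on the node where you ran the command:

# Watch only the PhysicalDisk instance whose number matches DeviceId 5.
Get-Counter -ComputerName SOFS-NODE01 -Counter "\PhysicalDisk(5*)\Avg. Disk sec/Read","\PhysicalDisk(5*)\Avg. Disk sec/Write" -SampleInterval 5 -MaxSamples 12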

By using those Perfmon counters, you can see whether the latency is high simply because the Bytes/Read or Bytes/Write are quite large, whether your drives are simply overwhelmed and performing as per their specifications, or whether there’s another underlying issue with those drives.

Madness! Testing SMB Direct network throughput with diskspd

An important process before deploying a new piece of core infrastructure in production is to make sure its performance meets your expectations. As explained in the post Storage Performance Baselining with Diskspd on Altaro’s blog, it’s crucial to make sure your network performance is top-notch, especially if you’re using SMB 3.x or iSCSI as the foundation to access or provide your storage infrastructure.

Historically, it’s been pretty straightforward to test the network: you whipped out iperf or ntttcp and away you went. With protocols that take advantage of RDMA, such as SMB 3.x with SMB Direct, we have to revise our testing methodology a bit to make sure we reach the maximum performance provided by our equipment. The problem with iperf or ntttcp is that they don’t take advantage of RDMA. That means the performance numbers you get out of those tests are not representative of how the system will be used in real life, with Hyper-V hosts accessing storage hosted on a scale-out file server through RDMA-capable NICs. The other problem you hit at a certain throughput is that you simply can’t saturate the network, because you start hitting a CPU bottleneck. And if you’re like me, the storage in my lab is far from exotic enough to match the performance of the production environment.

With 100Gb Ethernet equipment in my hands to test, I had to find a better way to make sure I got my money’s worth out of the shiny new networking gear. After discussing with my go-to guys when it comes to performance testing at Microsoft, Ned Pyle (follow him on Twitter, funny and tons of good info) and Dan Lovinger (couldn’t find Dan on the Twittosphere), they mentioned I could use diskspd to perform network tests…

My first reaction:

[Image: reaction GIF]

But after letting that one sink in a bit, I rolled up my sleeves and gave it a shot. In the first iteration of this, I was using the RAM disk from Starwind, and that did the job very well up to a certain degree, but I wasn’t squeezing as much performance out of it as I was expecting, most likely because of the overhead generated by running the RAM drive software. In typical “Microsoft has all the goodies” fashion, Dan mentioned they had an internal version of diskspd with an interesting trick that simulates a lot of the results obtained with a typical RAM drive. It essentially works by disabling the client side cache while leaving the server side cache enabled. It also uses a trick with the FILE_ATTRIBUTE_TEMPORARY flag so that the IO is not flushed to the backing media. Well, I wanted THAT version!

Not too long after our discussions, diskspd 2.0.17 was released and it contained the tricks I just explained. Since Ned brought it to my attention, I always referred to this mode as the hel.ya mode because of the example he sent me about how to use it:

diskspd.exe -b64K -c2M -t2 -r -o2 -d60 -Sr -ft -L -w0 \\srv\d$\hel.ya

I’m sure the folks at Mellanox are scratching their heads as to what is that file name! 😉

So here diskspd will create a very small 2MB file and then issue 64K random reads (-w0 means 0% writes) from 2 threads with 2 outstanding IOs each, for a duration of 60 seconds, while disabling the client side cache, leaving the server side cache enabled, and measuring latency while the test is running.

So the next logical question is, does it actually work? Well, hel.ya!

[Image: 100GbE throughput test results]

To give a bit of background on that test:

  • Using 2 ConnectX-4 single port per server (Firmware 12.16.10.20 and Driver 1.40)
    • 1 NIC in a x16 slot
    • 1 NIC in an x8 slot (I don’t have 2 x16 slots in those particular servers)
    • An x16 slot can do around 126Gbps (no point for us to use dual ports)
    • An x8 slot can do around 63Gbps
  • Servers are connected back to back
    • I currently have a case open with engineering at Mellanox regarding an issue with the Mellanox SN2700 switches. When performing this particular test, it looks like SMB Direct/RDMA dies in one direction for one of the NICs.
  • One server has an E5-2670 v1 and the other an E5-2670 v2
  • diskspd was run like so:

So ~30GB/s is not too shabby! All of this while only using 18% CPU, very impressive work done by the SMB Direct folks!

[Image: 100GbE "Ludicrous Speed" test results]

A couple of things worth mentioning while testing with the -Sr switch in diskspd:

  • Increasing the test file size will yield worse results. For example, I had the following in some specific tests:
    • 2MB: 17558MB/s
    • 4MB: 15003MB/s
    • 16MB: 10055MB/s
    • 32MB: 7682.99MB/s
  • The location/path of the file is irrelevant
  • SMB Multichannel will kick in by itself as usual; if you want to test a single NIC, an easy way is to disable the other NICs or set up SMB Multichannel constraints (see the sketch below).
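
As a hedged sketch of the constraint option (the server and interface names are placeholders), on the client you would run something like:

# Restrict SMB connections to \\SOFS01 to the NIC named "RDMA1" on this client.
New-SmbMultichannelConstraint -ServerName "SOFS01" -InterfaceAlias "RDMA1"

The constraint can be removed with Remove-SmbMultichannelConstraint once the single-NIC test is done.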

On another note, doing packet captures at that speed starts to get problematic. For instance, a 3 second diskspd test gave me a 700MB pcap capture file. I’m also working with Mellanox on their pcap files, which seem to be corrupted when captured using Mlx5cmd -sniffer and then opened in Wireshark or Microsoft Message Analyzer.

If you have any questions about this, I’ll be glad to give it a shot at answering them!